Instrumental Conditioning

Operant Conditioning

In instrumental conditioning, animals learn to modify their behavior in order to obtain a reward or to avoid a punishment. The difference from classical conditioning is therefore that the animal does not receive the reward unless it performs the desired action. As mentioned above, Thorndike provided early evidence for this kind of learning. In some of his experiments, cats were put in puzzle boxes from which they had to escape in order to receive a reward (such as food). He noted that the cats initially tried actions that appeared random, but that unsuccessful behavior was gradually "stamped out" and rewarding behavior "stamped in". As a result, the cats escaped faster over repeated trials. This showed that the cats were learning by trial and error, and Thorndike called this principle the “law of effect”. The law of effect corresponds to learning algorithms that select among alternative actions and associate the actions taken in specific states with the reward that follows or, more generally, with progress toward an expected future reward.

Influenced by Thorndike’s research, Hull and Skinner argued that behavior is selected on the basis of the consequences it produces, and Skinner coined the term operant conditioning. For his experiments, Skinner invented what is now called the Skinner box, in which he placed pigeons that could press a lever in order to get a reward. Skinner further popularized a process he called shaping. Shaping occurs when the trainer rewards the agent for any action that bears even a slight resemblance to the desired behavior; applied to pigeons, this process converged to the desired behavior [21]. Shaping maps directly to reward shaping in reinforcement learning.
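The trial-and-error learning that Thorndike observed in the puzzle box can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not Thorndike's own model: the function `puzzle_box`, the multiplicative "stamp out" rule, the additive "stamp in" rule, and all parameter values are hypothetical choices made for illustration. The agent samples actions in proportion to their current strength until it picks the one that opens the box.

```python
import random


def puzzle_box(n_actions=5, correct=3, n_trials=200, step=0.2, floor=0.05, seed=0):
    """Simulate law-of-effect learning in a hypothetical puzzle box.

    Returns the number of attempts needed to escape on each trial and the
    final action strengths. All names and values are illustrative assumptions.
    """
    rng = random.Random(seed)
    strength = [1.0] * n_actions  # every action starts out equally likely
    attempts_per_trial = []
    for _ in range(n_trials):
        attempts = 0
        while True:
            attempts += 1
            # Sample an action with probability proportional to its strength.
            action = rng.choices(range(n_actions), weights=strength)[0]
            if action == correct:
                # Rewarded behavior is "stamped in".
                strength[action] += step
                break
            # Unrewarded behavior is "stamped out", but never below a small floor.
            strength[action] = max(floor, strength[action] * (1.0 - step))
        attempts_per_trial.append(attempts)
    return attempts_per_trial, strength


if __name__ == "__main__":
    attempts, strength = puzzle_box()
    print("attempts in first 10 trials:", attempts[:10])
    print("attempts in last 10 trials: ", attempts[-10:])
    print("final strengths:", [round(s, 2) for s in strength])
```

In this sketch the number of attempts per trial tends to fall as training proceeds, which mirrors the observation that the cats escaped faster over time. The strength of the rewarded action grows with every successful trial, while the strengths of the other actions can only shrink.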

Neural View

Neuroscience is the field concerned with studying the structure and function of the central nervous system, including the brain. Neurons are the basic building blocks of the brain and, unlike most other cells, are densely interconnected. On average, each neuron has about 7000 synaptic connections, and the cerebral cortex alone (the folded outer layer of the brain) is estimated to have $1.5 \times 10^{14}$ synapses [5]. Synaptic connections can be chemical or electrical. We concentrate on the former because they are a basis for synaptic plasticity, which is correlated with learning [7]. According to Hebbian theory, repeated stimulation of a postsynaptic neuron by a presynaptic neuron changes the efficacy of the synapse between them. Chemical communication takes place at the synapse: the presynaptic cell releases neurotransmitters into the synaptic cleft, where they bind to receptors on the postsynaptic cell. Fig. 2 illustrates such a chemical synapse. The effect of these neurotransmitters on the postsynaptic neuron can be excitatory or inhibitory. Dopamine is perhaps the most famous neurotransmitter. It plays a role in multiple brain areas and is associated with different brain functions, including learning; it is discussed further in the subsections below. A key feature that makes dopamine a promising candidate for involvement in learning is that it acts as a neuromodulator. Neuromodulators are not as restricted as purely excitatory or inhibitory neurotransmitters: they can reach distant regions of the CNS and affect large numbers of neurons simultaneously.

Reward Prediction Error Hypothesis

Work by Schultz et al. and others has shown a strong similarity between the phasic activation of midbrain dopamine neurons and the prediction error $\delta$ [20]. They showed that when an animal receives an unpredicted reward, dopamine neuron activity increases substantially. After the conditioning phase, this activity shifts to the moment when the $\mathrm{CS}$ is presented and no longer occurs at the reward itself. If the $\mathrm{CS}$ is presented but the reward is then omitted, activity drops below baseline at approximately the moment when the reward was delivered during conditioning. These observations are consistent with the concept of a prediction error. Findings from functional magnetic resonance imaging (fMRI) have shown activation correlated with prediction errors in the striatum and the orbitofrontal cortex [2]. The presence or absence of prediction-error-related activity in the striatum distinguishes participants who learn to perform optimally from those who do not [18].
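The three observations above (a response to an unpredicted reward, a shift of the response to the CS after conditioning, and a dip when the reward is omitted) can be reproduced qualitatively with a temporal-difference error $\delta_t = r_{t+1} + \gamma V(t+1) - V(t)$. The sketch below is an illustration under stated assumptions rather than the model used in the cited studies: each time step since CS onset is treated as its own state, the value of the time steps before the CS is held at zero because the CS arrives unpredictably, and the names `run_trial` and `simulate` as well as all parameter values are hypothetical.

```python
def run_trial(values, cs_step, reward_step, alpha, gamma, reward=1.0, learn=True):
    """Run one trial and return the TD error for each transition t -> t + 1.

    values[t] is the predicted future reward at time step t; the last entry
    is the terminal state and always stays at 0.
    """
    deltas = []
    for t in range(len(values) - 1):
        # The reward is delivered on arrival at reward_step.
        r = reward if t + 1 == reward_step else 0.0
        delta = r + gamma * values[t + 1] - values[t]
        deltas.append(delta)
        # Assumption: only time steps from CS onset onward carry a stimulus
        # trace, so only their values are learned; pre-CS values stay at 0.
        if learn and t >= cs_step:
            values[t] += alpha * delta
    return deltas


def simulate(n_trials=300, n_steps=10, cs_step=2, reward_step=7, alpha=0.1, gamma=1.0):
    """Return TD errors for the first trial, the last training trial,
    and a probe trial in which the reward is omitted."""
    values = [0.0] * (n_steps + 1)
    first = run_trial(values, cs_step, reward_step, alpha, gamma)
    trained = first
    for _ in range(n_trials - 1):
        trained = run_trial(values, cs_step, reward_step, alpha, gamma)
    omitted = run_trial(values, cs_step, reward_step, alpha, gamma,
                        reward=0.0, learn=False)
    return first, trained, omitted


if __name__ == "__main__":
    first, trained, omitted = simulate()
    cs, rew = 2 - 1, 7 - 1  # index of the transition that arrives at CS / reward
    for name, d in [("first", first), ("trained", trained), ("omitted", omitted)]:
        print(f"{name:8s} delta at CS: {d[cs]:+.3f}   delta at reward time: {d[rew]:+.3f}")
```

With these settings, the error on the first trial appears only at the reward. After training it appears at CS onset and is close to zero at the reward. On the omission trial it is close to $-1$ at the time the reward would have arrived, matching the below-baseline dip described above.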

