0:00

Welcome back. In the previous lecture, we learned about supervised learning. But humans and animals in general do not get exact supervisory signals when they're learning to find food in a maze, or learning to ride a bike, or play the piano. We learn by trial and error, and we might get rewards and punishments along the way. For example, we might find food at the end of the maze, or get praise and criticism from the piano teacher, or we might even get an amazing reward at the end of a course: a certificate of accomplishment. This leads us to the last type of learning that we'll consider in this course: reinforcement learning.

0:36

In reinforcement learning, we have an agent, such as a rat, interacting with an environment, such as this barn. The agent at any point in time t can be in a state denoted by u(t), where u(t) is a vector that could denote, for example, the location of the rat in the barn. And the agent may get a reward at any point in time t, and this reward is denoted by r(t). r(t) is a scalar value that can be positive or negative, and the reward might denote, for example, the amount of food that the rat gets at a particular location in the barn, or it could represent a particularly nasty encounter with a cat at some location in the barn.

1:20

Now, the problem facing the agent, or the rat in this case, is selecting the best actions that will maximize the total expected future reward. This is the problem of reinforcement learning. Perhaps the earliest results in reinforcement learning were obtained by Pavlov in his experiments with dogs. These are the classical conditioning, or Pavlovian conditioning, experiments. So what did Pavlov do? Well, he rang a bell and followed the ringing of the bell with some food reward for the dog. And he repeated this association of bell followed by food reward many, many times. Here's what he observed: every time he rang the bell, the dog began to salivate, as depicted by this animation here. So what do you conclude from this? You can conclude that the conditioned stimulus, which in this case is the bell, predicts the future reward, which is the food.

2:24

So the problem faced by Pavlov's dog's brain is this: how do we predict rewards that are delivered some time after a stimulus, such as the bell, is presented? Well, let's see if we can formalize this particular problem. What we are given are many, many trials, each of length, let's say, capital T time steps. Let's denote the time within one particular trial by little t, and let's denote the stimulus that we might get at any particular time step by u(t); for example, u(t) might indicate the ringing of a bell or not. The reward r(t) is what the animal might get at each time step t. This could mean that the animal gets a food reward at some particular time step t, or maybe it doesn't get any reward at all, so r(t) can be 0 for some time steps. And here's what we would like: a neuron whose output, v(t), predicts the expected total future reward. That is, we would like the output v(t) to be approximately equal to the average, over all the trials, of the sum of all the rewards from time step t onwards until the end of the trial, denoted by capital T.
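In symbols, this desired prediction (a direct transcription of the verbal description above, with angle brackets denoting the average over trials) is:

```latex
v(t) \;\approx\; \left\langle \sum_{\tau = 0}^{T - t} r(t + \tau) \right\rangle
```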

3:52

Here's how you can get a neuron to predict the expected total future reward. We can use a set of synaptic weights, w(τ), and predict based on all past stimuli u(t). Here is a network that can perform this operation: we use what is called a tapped delay line to feed all the past inputs into the network. And here is the output of the network: it's simply a weighted sum of the past inputs, v(t) equals the sum over τ of w(τ) times u(t − τ). You'll notice that this is nothing but the equation for a discrete linear filter, so linear filtering strikes again. And here is our standard trick for learning the weights: we can minimize an error function. The error function is just the squared difference between the total future reward and the prediction of the total future reward. So how do we minimize this error function? Can we, for example, use gradient descent and the delta rule, as in the previous lecture?
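As a concrete sketch, here is the tapped delay line computation in code (the particular values of w and u are toy numbers of my own, not from the lecture):

```python
import numpy as np

def predict(w, u):
    """Tapped delay line output: v(t) = sum over tau of w(tau) * u(t - tau).

    This is exactly a discrete linear filter applied to all past stimuli."""
    T = len(u)
    v = np.zeros(T)
    for t in range(T):
        for tau in range(t + 1):       # only past (and current) inputs reach the neuron
            v[t] += w[tau] * u[t - tau]
    return v

# Hypothetical toy example: a single stimulus pulse at t = 1.
w = np.array([2.0, 1.0, 0.5, 0.0])     # synaptic weights, one per delay tau
u = np.array([0.0, 1.0, 0.0, 0.0])     # stimulus trace
v = predict(w, u)                      # -> [0.0, 2.0, 1.0, 0.5]
```

The output is just the weight sequence replayed from the time of the pulse, which is what a linear filter does with an impulse input.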

5:23

Well, the key idea goes back to Richard Bellman and his optimization method known as dynamic programming. The idea is to rewrite the error function recursively, to get rid of the future terms that are not available at the current time. So how does this apply to our problem? Well, here's the problematic summation of future rewards. We can rewrite it as r(t) plus the sum of all the rewards from time step t + 1 onwards. And here is the key jump: we replace that remaining summation of future rewards with our network's prediction of the expected future reward, v(t + 1). Now we have an error function in which all the quantities are available to us between time steps t and t + 1, and so we can minimize this error function using our old friend, gradient descent. When we do that, we get what's called the temporal difference rule, or TD learning rule, which was originally proposed by Sutton and Barto in the 1980s. Here's what the learning rule looks like: the weight for each delay τ is updated according to three terms. There's the learning rate, as before; there's a prediction error term, delta; and there's the input u(t − τ). Now, why is this learning rule called temporal difference learning? Well, as you can see in this term, delta contains a temporal difference between the prediction at time t + 1 and the prediction at time t.
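Summarizing the steps above in symbols (same notation as before, with ε for the learning rate): the recursive rewrite replaces the unavailable future sum with the network's own next-step prediction, which gives the TD error and the weight update:

```latex
\sum_{\tau=0}^{T-t} r(t+\tau) \;=\; r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) \;\approx\; r(t) + v(t+1)

\delta(t) \;=\; r(t) + v(t+1) - v(t), \qquad
w(\tau) \;\rightarrow\; w(\tau) + \epsilon\,\delta(t)\,u(t-\tau)
```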

7:00

Well, if you're skeptical about the temporal difference learning rule, I wouldn't blame you. It's unclear that replacing the sum of future rewards with a prediction can actually work in practice. Well, hopefully this example will convince you. Suppose we take the example of Pavlov's dog, and suppose that the bell, the stimulus, is given at time step 100, and the reward, the food, is given at time step 200 within any given trial. Now let's look at the situation before and after learning. In the case of the stimulus and the reward, there is no difference before and after learning, because the stimulus and the reward are presented at time step 100 and around time step 200 in both cases. But look at what happens to the prediction. The prediction of the network is all zeros initially, but after learning, there is a prediction of two starting at the time of the stimulus, at time step 100. And why is it two? Well, two is the total reward that is delivered around time step 200. So you can see that the network has learned to correctly predict the total reward it expects, starting from the time of the stimulus. It's also interesting to note what happens to delta, the prediction error. You can see that before learning, delta is high around the time of the reward. That's because the network is predicting all zeros, whereas the reward is delivered at time step 200, so the prediction error is essentially just the reward. But look at what happens after learning. After learning, around the time of the reward, delta is 0, so there is no error in prediction. The prediction error has instead shifted to the time of the stimulus. That's because it now reflects the value of the prediction at time step 100 minus the value of the prediction at the previous time step; that is, it reflects the prediction error v(100) − v(99).
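This behavior can be checked numerically. Here is a small simulation sketch of the model from this lecture (the trial length, learning rate, and number of trials are arbitrary choices of mine): a stimulus pulse at time step 100, a reward of 2 at time step 200, a tapped-delay-line prediction, and the TD weight update applied over many trials.

```python
import numpy as np

T = 250                          # time steps per trial (arbitrary choice)
stim_t, rew_t = 100, 200         # stimulus (bell) and reward times from the example
eps = 0.5                        # learning rate (arbitrary choice)

u = np.zeros(T); u[stim_t] = 1.0     # stimulus trace
r = np.zeros(T); r[rew_t] = 2.0      # total reward of 2
w = np.zeros(T)                      # weights w(tau) on the delay line

def predict(w, u):
    # v(t) = sum over tau of w(tau) * u(t - tau): a discrete linear filter
    return np.convolve(u, w)[:len(u)]

for trial in range(500):
    v = predict(w, u)
    v_next = np.append(v[1:], 0.0)   # v(t + 1), taken as 0 past the trial end
    delta = r + v_next - v           # TD error delta(t)
    for tau in range(T):             # w(tau) += eps * delta(t) * u(t - tau), summed over t
        w[tau] += eps * np.dot(delta[tau:], u[:T - tau])

v = predict(w, u)
delta = r + np.append(v[1:], 0.0) - v
```

After training, v(t) is close to 2 from the stimulus until the reward, delta is near 0 at the reward time, and a delta of about 2 remains at the time of the stimulus (reflecting v(100) − v(99)), matching the plots described above.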

Â Now this plot shows how delta changes as a function of trials so at the very of

Â trial number 1 we have a bump. Around the time of the reward around 200

Â and that's identical to this plot here. But as the network is exposed to several

Â trials, this bump moves backwards in time until it reaches a value of 2 at the time

Â of the stimulus. And that is exactly the situation we have

Â here. So that is where the network now has

Â learned to predict a value of 2 for the total reward expects from the time of the

Â stimulus from time step 100. Now here are some intriguing results from

Â Wolfun Scholtz and colleagues. They recorded from the ventral tegmental

Â area of the midbrain of a monkey. And the neurons in the ventral tegmental

Â area or VTA are dopaminergic, which means that they transmit the neurotransmitter

Â dopamine to different parts of the brain. Now dopamine has been implicated in

Â reward based learning and it is also involved in various addictive behaviors

Â such as addiction to drugs like cocains. In the experiments, the monkey was

Â presented with a stimulus, for example a sound and then the monkey had to press a

Â key. A short while later the monkey was

Â rewarded and here's what the neurons in the ventral tegmental area did in this

Â experimental paradigm. Before training, the neurons in the

Â ventral tegmental area had a very high firing rate around the time of the

Â reward. But after training, the neurons no longer

Â responded near the time of the reward. They started responding around the time

Â of the stimulus. Now what does this remind you of?

11:03

That's right: these two plots look very similar to the plots for delta, the prediction error, from the previous slide. What this suggests is that the neurons in the ventral tegmental area may be encoding reward prediction error. That would explain why there is a big response before training around the time of the reward, whereas after training that response is very small: the reward prediction error is now very small, because the animal has learned to predict the reward. Instead, the prediction error is large around the time of the stimulus, because that response encodes the prediction error v(t) − v(t − 1), which is similar to the error we saw in the previous slide, v(100) − v(99). Now here's an interesting question: what do you think will happen if we don't give the monkey any reward at the time that it expects to get the reward? Well, the monkey is probably going to think it's a cruel joke, but what do you think is going to happen to the responses of neurons in the ventral tegmental area?

12:10

Well, that's right: you would expect to see a negative error, because the prediction was not fulfilled, and that is indeed what Wolfram Schultz and colleagues observed in the ventral tegmental area neurons. When there was a reward, you see this response, which is similar to what we had in the previous slide. But when the reward is omitted, you see a dip in the firing rate of the neurons, and that corresponds to a negative error in the temporal difference learning model of the dopaminergic cells in the VTA.
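In the TD model, this dip falls out directly. A small sketch, assuming the learned prediction from the earlier example (v(t) = 2 from the stimulus at t = 100 up to the reward time t = 200, and 0 elsewhere): with the reward omitted, the TD error is still +2 at the stimulus but becomes −2 at the expected reward time.

```python
import numpy as np

T = 250
stim_t, rew_t = 100, 200

# Learned prediction from the earlier example: expects a total reward of 2
v = np.zeros(T)
v[stim_t:rew_t + 1] = 2.0

r = np.zeros(T)                  # reward omitted on this trial
v_next = np.append(v[1:], 0.0)   # v(t + 1), taken as 0 past the trial end
delta = r + v_next - v           # TD error with no reward delivered

# delta[99]  = v(100) - v(99)  = +2  (positive error at the stimulus)
# delta[200] = 0 + v(201) - v(200) = -2  (negative error at the omitted reward)
```

The negative delta at the expected reward time is the model's counterpart of the dip in dopaminergic firing.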

Now that you know how the brain might learn to predict rewards, you might be asking: how does the brain learn to select actions that maximize future rewards? That will be the topic of our next lecture. Until then, zai jian, and goodbye.
