0:17

Well, let's try to formalize this problem. Here's the diagram for the reinforcement learning framework. As you recall, we had an agent interacting with an environment, and at any point in time t, the agent measures a state u_t of the environment and also potentially gets a reward r_t from the environment. The agent can then execute an action a_t, which in turn changes the environment in a specific way, and the agent then gets to measure a new state u_t plus 1 and potentially get a reward r_t plus 1.

Now, the task faced by the agent is figuring out which action to take given any particular state. This can be formalized as the problem of learning a state-to-action mapping, or a policy, as it is called in reinforcement learning circles. The policy is typically denoted by the function pi, so pi is a function that maps states to actions. Now, what kind of a policy do we want?

We would ideally like to have a policy which maximizes the expected total future reward. Here's the mathematical expression for that, and you'll notice it's exactly the same expression we had when we were discussing temporal difference learning: the expectation of the sum of all the rewards we're going to get, from time step t all the way up to the end of the trial at time step capital T.

Let's try to understand the problem in the context of an example. Here's our friendly rat, and let's suppose that the rat is in a barn, given by this maze here, and different locations in the barn contain different amounts of food reward.

So let's suppose that this location here contains a food reward of five, this location contains a food reward of two, and these two locations contain no food reward at all. Now, we could be mean to our rat and make these two numbers negative, which would mean that these two locations contain predators, such as cats, but let's be nice to our rat, so no predators in this barn.

Now, we have marked three locations in this barn: A, B, and C. And the rat has to decide whether to go to the left or to the right at each of these locations.

So the reinforcement learning problem for the rat, then, is to choose an action at each of these locations so as to maximize the expected future reward. Now, in the context of the formal reinforcement learning framework, can you tell me what the states and actions are in this problem? That's right: the states are the locations, and the actions are the two actions, go left or go right.

Here's another question that I would like to ask you. If the rat chooses to go left or right uniformly at random, in other words executing a random policy, then can you tell me what the expected reward would be for each state? The expected reward for each state is also called the value of that state. So can you tell me what the value for each of the states A, B, and C would be if the rat was executing a random policy of going uniformly at random left or right at each of these locations?

Â 3:35

Well, here's the answer. Let's first look at the value for B. If you're in location B and you uniformly go at random left or right, then you're going to be half the time in the location with the 0 and half the time in the location with the 5, and therefore the expected reward is 2.5. Similarly, if you're in location C, then you have an expected reward, or value, of 1. And interestingly, if you are in location A, you can use the values you just computed for B and C, because those are the next states following A if you go left or if you go right. And so if you go left or right uniformly at random, then you have the expression one half of the value of B plus one half of the value of C. Adding them together gives you the expected reward, or the value, for location A.

Well, that was easy, but how can you learn these values in an online manner? Remember that the rat is experiencing these locations sequentially, and therefore the rat has to learn these values from experience as it is going through each of these locations.

How can we do that? The answer, as you might have guessed, is to use our old friend, Temporal Difference Learning, or TD Learning. So let's represent the value of each state u, v(u), by a weight w(u). Now we can update w(u) using the temporal difference learning rule. Epsilon, as you recall, is the learning rate, and here is the prediction error from the temporal difference learning rule. You can see that the prediction error term contains both the reward that you get at the state u, as well as the prediction of the expected reward, the value, for the state u prime. So u prime here is the next state that you get after taking an action at the location u. And v(u), of course, is the prediction of expected reward: it's the value for the state u.

Now let's look at what this u prime is. If you are in a state u and you take an action a, then you're going to end up in a state u prime. So, specifically, in our example here, if we are in the state A and we take the action go left, then u prime, the next state, is going to be the state B.
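The TD rule just described can be simulated for the barn example. This is a small sketch under the lecture's setup (random policy, food rewards of 5 and 2, learning rate 0.5); the state names, the dictionary layout, and the convention that a reward is delivered on arriving at the next state are choices made for this illustration.

```python
import random

# state -> {action: (next state, reward received on arriving there)}
transitions = {
    "A": {"L": ("B", 0), "R": ("C", 0)},
    "B": {"L": ("end0", 0), "R": ("end5", 5)},
    "C": {"L": ("end2", 2), "R": ("end0", 0)},
}

w = {s: 0.0 for s in ("A", "B", "C", "end0", "end5", "end2")}
eps = 0.5                          # the high learning rate from the lecture
avg = {s: 0.0 for s in ("A", "B", "C")}
n_trials, burn_in = 20000, 1000

random.seed(0)
for trial in range(n_trials):
    u = "A"
    while u in transitions:              # run one trial from A to a leaf
        a = random.choice(["L", "R"])    # random policy
        u_next, r = transitions[u][a]
        w[u] += eps * (r + w[u_next] - w[u])   # TD update: delta = r + v(u') - v(u)
        u = u_next
    if trial >= burn_in:                 # running average (the "dark line")
        for s in avg:
            avg[s] += w[s] / (n_trials - burn_in)

# The individual weights jump around because eps is large, but the
# running averages settle near the analytic values for A, B, and C.
```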

Â 6:03

Here are the results of using TD Learning to learn values for our problem of the rat in the barn using a random policy. Each of these plots shows the value for the states A, B, and C, as represented by the weights wA, wB, and wC. The jagged lines show the values as a function of trial number. You can see that the value for each of these locations A, B, and C jumps around a bit, but the running average, as represented by the dark line, converges to the right answer. So 1.75, 2.5, and 1 were the values that we calculated for A, B, and C on the previous slide. So, indeed, the temporal difference learning rule appears to be learning the correct values for each of these locations.
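Those converged values can be checked directly against the analytic expectations under the random policy. A minimal sketch, using the rewards of 5 and 2 from the lecture's barn:

```python
# Expected value of each location under the uniform random policy
v_B = 0.5 * 0 + 0.5 * 5      # left leads to 0, right leads to 5
v_C = 0.5 * 2 + 0.5 * 0      # left leads to 2, right leads to 0
v_A = 0.5 * v_B + 0.5 * v_C  # A's next states are B and C

# (v_A, v_B, v_C) -> (1.75, 2.5, 1.0)
```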

Now, why are these values jumping around so much? Well, it's because we've set the learning rate epsilon to a high value of 0.5, and that speeds up the learning process, but it will also cause your estimates of the value to jump around a bit.

Now, why did we go through the trouble of finding the value of each of the states? Well, here is the answer, as observed by our astute friend, the friendly rat: once you know the values for the states, you can solve the action selection problem. Here's why. If you're given the choice between two different actions that lead to two different states, then all you have to do is pick the action that leads you to the higher-valued state in the next time step.

Let's see if this works in our example. Well, as you might have guessed, it does. Let's consider the action that we should take in the location A. We have two possible actions, go left or go right, and all we have to do now is look at the values associated with the next states if we take each of these respective actions. So if we take a left, we end up in the state B, and we can look at the value, which is the expected reward we get in this state B, which, as we computed on the previous slide, is 2.5. Similarly, if we take the action right, then we end up in the state C, which, as we computed earlier, has a value of 1, so that's the expected reward we might get if we move to the state C. So given these two possible states that we could move to if we take the action left or the action right, the obvious choice here is to choose the action going left, and that will make us go to the state B, which has the higher expected reward or value.

The important point here is that we're using values as surrogate immediate rewards. So what do we mean by that? Well, consider the fact that in locations B and C we do not get any immediate rewards, but we can compute the value, which is the expected reward, at B and at C, and we can use that value as a surrogate for the immediate reward, and so we can use the value to guide our selection of action at location A. This leads us to the important result that a locally optimal choice here leads to a globally optimal policy, as long as we have a Markov environment. And by Markov we mean that the next state only depends on the current state and the current action. This important result, which we can rigorously prove, is closely related to the concept of dynamic programming, first proposed by Richard Bellman.
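The locally optimal choice just described amounts to a one-step greedy lookup on the values. A small sketch, using the barn's values from earlier (the state names and dictionary layout are just for illustration):

```python
# Values learned earlier under the random policy
v = {"B": 2.5, "C": 1.0}

# Next state reached from A for each action
next_state = {"L": "B", "R": "C"}

# Pick the action whose next state has the higher value
best = max(next_state, key=lambda a: v[next_state[a]])
# best -> "L": going left leads to B, the higher-valued state
```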

Â 9:42

Okay, let's put it all together and see if we can come up with an algorithm for learning optimal policies in Markov environments. The algorithm that we're going to look at is called actor-critic learning, and it's called actor-critic learning because there are two components. The first one is the actor, and the actor component selects actions and maintains the policy. The critic component maintains the value of each state.

Let's first look at the critic component and see how it learns. The critic component can also be looked upon as performing policy evaluation, because it's evaluating the current policy by finding the value of each state under the current policy.

Now, how do we find the value of a state u? Well, we can first represent it using a weight w(u), as we did before, and then we can apply the temporal difference learning rule, as we did earlier in this lecture, to find the value of the state u. So here is the prediction error term, as before. The prediction error term is then used to update the weight w(u) for each state u, and that allows the critic component to compute the value of each state u.

Let's now look at the actor component.

The actor component, as you recall, selects actions and maintains the policy. So how does it select an action? It selects probabilistically, using this function, also known as the softmax function. The softmax function is basically an exponential of a Q function, which is basically the value of a state-action pair. We'll come to that in just a minute, but think of it as a function very similar to the value of a state, except that now we're computing the value of a state-action pair. So the softmax function takes the value of any given state-action pair and runs it through this exponential. Beta here is some fixed parameter, and when you divide by the sum of all the exponentials, you're normalizing, so that the probabilities of the actions for any given state sum to 1. So given this set of probabilities for any given action in any given state, we can now select the action according to these probabilities, and that gives us the action that we execute.
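The softmax selection rule can be sketched as follows. This is a minimal illustration; the Q values here are made up for the example, and beta is the fixed parameter from the slide.

```python
import math, random

def softmax_policy(q_values, beta):
    """P(a; u) = exp(beta * Q(u, a)) / sum over a' of exp(beta * Q(u, a'))."""
    exps = {a: math.exp(beta * q) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

# Hypothetical Q values for the two actions at some state
q = {"L": 2.5, "R": 1.0}
probs = softmax_policy(q, beta=1.0)

# Sample an action according to these probabilities
actions, weights = zip(*probs.items())
a = random.choices(actions, weights=weights)[0]
```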

Â 12:17

You might be wondering why we have to use a probabilistic method for selecting actions. Well, it lets us address what is known as the exploration versus exploitation dilemma in reinforcement learning. And what is that? Well, consider the fact that early on, at the very beginning of learning, these Q values might not be very accurate, because you have not experienced the environment very much. So what you would like to do at that stage is explore the environment, and having such a softmax selection method lets you explore the environment if the beta values are small. And as you learn better and better values of Q through exploration, you can then increase the value of beta, and that will tend to pick the actions that have the higher Q values. So you then get more and more towards a deterministic action selection process that favors, more and more, the actions that have the higher Q values.

Â 13:17

Once you've selected the action a in the state u, you can use this learning rule to update the Q values for all your actions a prime. This learning rule is quite similar to the learning rule for w, and it also uses the reward prediction error, except that now we multiply the reward prediction error by this term here, which uses the delta function that we all know and love. As you recall, the delta function is equal to a value of 1 when a prime is equal to a, and a value of 0 when a prime is not equal to a, so after subtracting the probability of a prime, the effect of this term is to multiply the reward prediction error by a positive number when a prime is equal to a, and a negative number when a prime is not equal to a. The overall effect, then, is that the Q value for a is increased if the action a leads to a greater than expected reward, and the Q value is decreased if the action leads to a less than expected reward. Now, the actor-critic learning algorithm proceeds by repeating the steps one and two above.
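The actor update just described can be sketched as follows. This is a sketch under the assumption that the multiplying term is the delta function minus the softmax probability of a prime (positive for the chosen action, negative for the others); the function and variable names are just for illustration.

```python
import math

def softmax(q, beta):
    exps = {a: math.exp(beta * v) for a, v in q.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_update(q, chosen, delta, eps, beta):
    """Update Q(u, a') for all actions a' after taking `chosen` in state u.

    delta is the reward prediction error supplied by the critic.
    """
    probs = softmax(q, beta)
    for a_prime in q:
        indicator = 1.0 if a_prime == chosen else 0.0  # the delta function
        q[a_prime] += eps * delta * (indicator - probs[a_prime])
    return q

q = {"L": 0.0, "R": 0.0}
q = actor_update(q, chosen="L", delta=2.5, eps=0.5, beta=1.0)
# A better-than-expected reward (delta > 0) raises Q for "L" and lowers it for "R"
```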

And under reasonable assumptions, it can be shown to converge to the optimal policy. Now, let's see if it finds the optimal policy in the example of our friendly rat in the barn. Yes, it indeed finds the optimal policy, as it turns out, for our barn example. Here is the probability of going left at each location in the barn, and as you can see, the probability of going left is initially 0.5 for each of the locations A, B, and C, as is the probability of going right. But after several trials, the algorithm assigns a probability almost close to 1 for going left at location A. So that means that the action that's favored is going left at location A. Now, at location B, the algorithm assigns a probability that's quite close to zero, if not equal to zero, and that means that going right is highly favored, and so that is the preferred action at location B. And if you look at the location C, you can see that there is a gradual convergence towards assigning a high probability, or a probability close to one, for going left at C. Now, can you tell me why it takes the algorithm a longer time to find that the best action at location C is left?

Â 15:48

Well, that's right, it's because of the exploration-exploitation trade-off that we observed on the previous slide. The algorithm tends to go left more often at location A, and so it very rarely tends to go to the right, and so it does not get to experience the state, or the location, C very often. That is why it takes a longer time to learn the value associated with C, and therefore it takes a longer time to realize that the best action to take at location C is, in fact, going left.

Â 16:24

Now, researchers such as Andrew Barto have suggested a mapping between the components of the actor-critic model and the components of an important structure in the brain known as the basal ganglia. The dashed box on the left represents the components of the basal ganglia, and the dashed box on the right represents the components of the actor-critic model. Note that we're using a hidden layer in this case to implement a multi-layered network for mapping state estimates to values, and state estimates to actions. We can see that there's a rough similarity between the components on the left side and the right side. For example, the DA here on the left side stands for the dopamine signal, and we already saw in the previous lecture that these dopamine signals are very similar to the temporal difference prediction error signal that we see in the temporal difference learning model. Now, this is ultimately a very abstract and high-level model of basal ganglia function, but perhaps it could serve as a starting point for more detailed models.

Â 17:32

For more details on these actor-critic type models of the basal ganglia, please see the supplementary materials on the course website. I would like to end the lecture by noting that reinforcement learning has been applied to many real-world problems, and as a grand finale for this lecture, I would like to show a video of autonomous helicopter flight, based on some work by Andrew Ng, Pieter Abbeel, and others at Stanford University. In this case, the reinforcement learning algorithm learned a policy based on a dynamics model for the helicopter and a reward function learned from human demonstrations. If you're interested in more details, I would encourage you to go to the website shown at the bottom of the slide.

Okay, so are you ready to see what reinforcement learning can do? Here we go. [NOISE] Wow, that was an exciting way to end the lecture. That, in fact, ends the last lecture of this course. Thank you all again for joining us on this first online journey through the land of Computational Neuroscience. Until we meet again, happy adventures and goodbye.
