0:00

[SOUND] Now, of course, the algorithm would have to use the second, Monte Carlo version of our formula. This is because in any particular environment, you only get access to sessions. You can only sample trajectories and record your states, actions, rewards, next states, next actions, and so on, until the session ends.

0:22

So the algorithm we arrive at looks very close to what we had before. We start by defining our policy. This will be your table, or your neural network, or maybe a random forest if you feel itchy for something tree-based. Now, we initialize it at random, or use any domain-specific initialization you can think of. And then we sample sessions by following this particular policy. So in our example, you would initialize the weights of your convolutional network with random numbers, then play a few games.

1:05

Now, once you have this full set of states, actions and rewards, you can actually use the rewards to compute the discounted cumulative reward, the so-called return. If you take your immediate rewards, r0, r1 and so on until the session ends, you just add them up from the last one back to the first one, with coefficients defined by your discount factor, gamma. You will obtain an estimate of your G. And this is some kind of estimate of the Q function.
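The backward summation just described can be sketched in a few lines. This is a minimal illustration, not the lecture's own code; the function name is made up.

```python
# Sketch: compute the discounted returns G_t from a session's recorded
# immediate rewards r_0, r_1, ..., working backwards from the last step.
# gamma is the discount factor.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g        # G_t = r_t + gamma * G_{t+1}
        returns.append(g)
    returns.reverse()            # restore forward (t = 0, 1, ...) order
    return returns
```

For example, with rewards [1, 0, 2] and gamma = 0.5, this yields returns [1.5, 1.0, 2.0].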

Now, what you do next is use the policy gradient formula we obtained a few sections before to compute an estimate of the gradient of your expected reward with respect to the policy, given just this one single session.

1:48

Now, this is a simple step, after which you can perform gradient ascent with respect to the policy parameters. For example, if you use some kind of neural network, you would not only have to compute the gradient with respect to the policy, but also with respect to the weights of your neural network.

2:03

So here is how the algorithm is formulated: first initialize, then sample sessions. Then use the sessions you obtained to compute both the returns and the gradient estimate, and then update the parameters of the policy. You can then repeat the steps, sampling more sessions and getting more updates to improve your policy even further.
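The loop just formulated (initialize, sample sessions, compute returns and gradients, update, repeat) can be sketched on a toy one-step problem. This is a minimal illustration assuming a softmax policy over two actions; the bandit environment, names, and hyperparameters are all illustrative, not from the lecture.

```python
import math
import random

def softmax(logits):
    """Turn raw policy parameters into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(rewards=(0.0, 1.0), lr=0.1, n_iters=500, seed=0):
    """REINFORCE on a two-armed bandit (a one-step session)."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]                            # step 1: initialize policy
    for _ in range(n_iters):
        probs = softmax(logits)
        a = rng.choices([0, 1], weights=probs)[0]  # step 2: sample a session
        g = rewards[a]                             # step 3: the session's return
        # step 4: policy gradient estimate; for a softmax policy,
        # d log pi(a) / d logit_k = 1{k == a} - pi(k)
        for k in range(len(logits)):
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            logits[k] += lr * grad_log * g         # step 5: gradient ascent
    return softmax(logits)
```

After a few hundred iterations, the policy concentrates most of its probability on the higher-reward arm.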

Now, we have just learned the definition of the REINFORCE algorithm. Let's see what kind of properties it shows and how it compares to other algorithms we already know.

2:28

To start, let's consider on-policy versus off-policy. Remember, we had on-policy algorithms like SARSA, and off-policy ones like Q-learning. Now I want you to answer: what kind of algorithm is REINFORCE? Is it on-policy or off-policy? As you have probably guessed, REINFORCE is on-policy, and it's this very first step of the loop that betrays it. In this step, REINFORCE is required to sample the trajectories for training from its current policy. So it would not be legal if you just replaced this policy with something else, like a human expert or experience replay samples. Now, another very important point about REINFORCE is that there is actually a way to improve it using what you've learned about Q-functions, V-functions, and the like.

To begin with, let's consider this example. You're training your agent to perform well in the Breakout game, in the Atari environment, for example.

3:19

In this case, you have easy states, where the ball is on the opposite side of the field and you're earning a lot of points really quickly. And you'll also have more complicated states, where you're very close to missing the ball, and there is just one action that saves you, while the rest is more or less a guaranteed defeat.

3:37

In this case, the Q values you're going to get for different states are going to be significantly different from one another. The problem here is that if you use those Q values, if you actually multiply your gradients by those Q values in the formula, the easy states, where your agent already gets all the points but doesn't actually do anything right now to increase them, will get upweighted; they'll have large weights. The difficult states will have low weights, because your agent gets small rewards there. This is also true of pretty much any practical problem with any complexity to it.

Let's say you are teaching your agent to translate sentences, to translate natural language from English to French. In this case, there will be many kinds of sentences. A simple example with large rewards would be a sentence like, "How do I get to the library?", or "Jon Snow", or whatever. Those sentences can be translated very efficiently, and your agent will almost certainly get a perfect score, say 100 out of 100, for translating them, even if it is not strictly the optimal translator for those sentences.

4:39

Now, the other kind of sentence is a super complicated one. Let's say you have a transcript of this lecture, or some kind of excerpt from a constitution of some sort. In this case, the sentence will be overloaded with a lot of adjectives and a lot of clauses, and it's really hard to translate with any known translation system.

4:59

Now, the problem is that on this difficult sentence, any improvement the agent makes is actually going to affect the score much more than any improvement in the Jon Snow example. But the REINFORCE algorithm, the policy gradient formula we've just derived, kind of says the opposite. In this case, you would multiply the gradient of the simple sentences by the reward you get, which is plus 100, and the gradient of the more complicated sentences by whatever the agent gets there, say 20. This is not the kind of behavior you want it to exhibit. On the contrary, you want to encourage your agent for doing things that are not just good by themselves, not just good because there happens to be a simple task to perform this time. You want to reward the things that are good in comparison to how your agent usually performs here. So if on average you perform very poorly on these sentences, say you get a reward of 10 out of 100, and now you've just gotten a reward of, say, 30, this is a very good improvement. You have to capitalize on it. You have to actually make sure that the agent learns this and learns to repeat it more often.

And if it translates "Jon Snow" perfectly, just like during the previous 100 iterations, it's not a big deal, even though it gets a perfect score. This basically translates into having a baseline in the REINFORCE algorithm. The idea here is that you want to reward not the Q function, as is written in this formula, but something called the advantage. The advantage is: how well does your algorithm perform compared to what it usually does? It's the advantage over the usual performance. And this leads us to a bit more math.

6:35

Now, let's define the advantage as the difference between the Q function of a particular action and the V function. The intuition here is that the advantage of an average, okay-ish action is going to be near zero. The advantage of something remarkable, which has gotten much more utility, much more cumulative reward, much more Q than you expected, will be a large positive number. And in the case of a very simple situation, where your agent routinely gets large rewards, even a small detrimental change, even if you shift from plus 100 to plus 90, will give you a negative advantage, because in this case your current Q value is smaller than the expected Q value.
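The definition above, A(s, a) = Q(s, a) - V(s), is tiny in code. A sketch, with the numbers echoing the lecture's example of a state where the agent routinely scores around 100 (the V value of 99 is illustrative):

```python
# Sketch: the advantage of an action is how much better it is than the
# agent's usual performance in that state, A(s, a) = Q(s, a) - V(s).
def advantage(q_values, v):
    """Advantage of each action: its Q value minus the state's V value."""
    return [q - v for q in q_values]
```

With Q values of [100, 90] and an expected performance V of 99, the advantages come out as [1, -9]: the routine action is barely above average, and the slightly worse one is clearly negative.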

7:17

Now, what you actually want to do is replace the Q value here with the advantage. You want to encourage the actions that are better with respect to the average. It turns out we can do that. From a mathematical point of view, what we do is simply take the original formulation and, from every Q function here, subtract something called the baseline. The baseline is just some function which depends only on the state, so it does not depend on the action. Mathematically, we can actually do so without changing the optimal policy.

The main intuitive explanation is this: let's say you have two actions. The first gives you a reward of plus 100. The second gives you a reward of plus 90. If you subtract 90 from both of those actions, you'll get Q values of, well, 10 and 0. If you subtract 110, you'd instead get -10 and -20. In both of those cases, the optimal action is going to be the first one, because it has the highest Q value.
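The two-action example above can be checked directly. A small sketch (the helper name is made up):

```python
# Sketch: subtracting any state-dependent baseline shifts every Q value
# by the same amount, so the best action (the argmax) never changes.
def best_action(q_values):
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [100.0, 90.0]
shifted_by_90 = [x - 90 for x in q]    # [10.0, 0.0]
shifted_by_110 = [x - 110 for x in q]  # [-10.0, -20.0]
```

All three versions of the Q values pick the first action.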

The same applies to situations with many states. For each state, you can subtract some particular function, the baseline, from all the Q values you find for this state, and you won't change the optimal policy, because every option is being adjusted by the same amount.

[SOUND]

[MUSIC]