0:00

[SOUND] Now of course, the algorithm would have to use the second, Monte Carlo version of our formula.

This is because in any particular environment,

you'd only get access to sessions.

You could only sample trajectories and record your states, actions, rewards,

next states, next actions, and so on, until the session ends.

0:22

So the algorithm we arrive at looks very close to what we had in the previous method.

We start by defining our policy.

This will be your table or your neural network or

maybe a random forest if you feel itchy for something tree-based.

Now, we initialize it at random or use any domain-specific initialization you can think of.

And then we sample sessions by following this particular policy.

So in our example, you would initialize the weights of your convolutional network with random numbers, then you'd play a few games.

1:05

Now, if you have this full set of states, actions and rewards, you can actually use the rewards to compute the discounted cumulative reward, the so-called return.

If you take your immediate rewards, r0, r1 and so on until the session ends, then you just add them up from the last one to the first one.

You add them up with the coefficients defined by your discount factor, gamma.

You will obtain the estimate of your G.

And this is some kind of estimate of the Q function.
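To make this step concrete, here is a minimal sketch in Python of how you might compute those returns, walking the recorded rewards from the last step back to the first (the function and variable names here are illustrative, not from the lecture):

    def get_cumulative_returns(rewards, gamma=0.99):
        # G_t = r_t + gamma * G_{t+1}, accumulated from the last reward backwards
        returns = []
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()  # back to chronological order: G_0, G_1, ...
        return returns

    # Example: rewards [1, 0, 2] with gamma = 0.5 give returns [1.5, 1.0, 2.0]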

Now, what you do next is use the policy gradient formula you have obtained a few chapters before to compute the estimate of the gradient of your expected reward with respect to the policy, given just this one single session.
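For reference, that Monte Carlo policy gradient estimate, written for a single session with policy parameters theta, is

\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t,

where G_t is the discounted return computed above.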

1:48

Now, this is a simple step, after which you can perform gradient ascent with respect to the policy parameters.

Say, if you use some kind of neural network, you would not only have to compute the gradient with respect to the policy, but also with respect to the weights of your neural network.

2:03

So here is how the algorithm is formulated: first initialize, then sample sessions.

Then use the sessions you obtained to compute both the returns and the policy gradient, and then update the parameters of the policy.

You can then repeat these steps, sampling more sessions and getting more updates, to improve your policy even further.
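Putting these steps together, here is a minimal sketch of the whole loop in Python, with a simple linear softmax policy on CartPole standing in for the convolutional network mentioned above (it assumes the classic gym reset/step API; with gymnasium the reset and step return values differ slightly):

    import numpy as np
    import gym

    env = gym.make("CartPole-v1")
    n_actions = env.action_space.n
    n_obs = env.observation_space.shape[0]

    # 1. Initialize the policy, here a linear softmax policy with random weights.
    W = np.random.randn(n_actions, n_obs) * 0.01
    gamma, lr = 0.99, 0.01

    def policy(s):
        logits = W @ s
        logits -= logits.max()          # for numerical stability
        p = np.exp(logits)
        return p / p.sum()

    for iteration in range(200):
        # 2. Sample a session by following the current policy (on-policy).
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            p = policy(s)
            a = np.random.choice(n_actions, p=p)
            next_s, r, done, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = next_s

        # 3. Compute the discounted returns G_t for every step of the session.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()

        # 4. Policy gradient: for a linear softmax policy,
        #    grad log pi(a|s) = outer(onehot(a) - pi(.|s), s), weighted by G_t.
        grad = np.zeros_like(W)
        for s_t, a_t, g_t in zip(states, actions, returns):
            p = policy(s_t)
            onehot = np.zeros(n_actions)
            onehot[a_t] = 1.0
            grad += np.outer(onehot - p, s_t) * g_t

        # 5. Gradient ascent on the expected reward.
        W += lr * grad

The gradient here is written out by hand for this linear policy; with a neural network you would instead feed sum_t log pi(a_t|s_t) * G_t to an autodiff framework and ascend its gradient with respect to the network weights.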

Now we have just learned the definition of the REINFORCE algorithm.

Now, let's see what kind of properties it shows and how it compares to other algorithms we already know.

2:28

To start, let's consider on-policy versus off-policy.

Remember we had the on-policy algorithms like SARSA, and

the off-policy ones like Q-learning.

Now I want you to answer: what kind of algorithm is REINFORCE? Is it on-policy or off-policy?

As you have probably guessed, REINFORCE is on-policy, and it's the very first step of the loop that betrays it.

In this step, REINFORCE is required to sample the trajectories for training from its current policy.

So it would be kind of illegal if you just replaced this policy with something else, like a human expert or samples from experience replay.

Now, another very important point about REINFORCE is that there is actually a way to improve it using what you've learned about Q-functions, V-functions, and other functions like that.

To begin with, let's consider this example.

You're training your agent to perform well in the Breakout game in the Atari environment, for example.

3:19

In this case, you have easier states, where the ball is on the opposite side of the field and you're earning a lot of points really quickly.

And you'll also have more complicated states, where you're very close to missing the ball and there is just one action that saves you, while the rest are more or less a guaranteed defeat.

3:37

In this case, the Q-values you're going to get for different states are going to be significantly different from one another.

The problem here is that if you use those Q-functions, if you actually multiply your gradients by those Q-values in the formula, then the easy states, where your agent already gets all the points but doesn't actually do anything right now to increase the amount of points, will get upweighted; they'll have large weights.

The difficult states will have low weights,

because your agent kind of gets small rewards in this case.

This is also true about pretty much any practical problem with any

complexity to it.

Let's say you are teaching your agent to translate sentences, to translate natural language from English to French.

In this case, there'd be many kinds of sentences.

A simple example with large rewards would be a sentence like, "How do I get to the library?", or "Jon Snow", or whatever.

Those sentences can be translated very efficiently, and your agent will almost certainly get a perfect score, say 100 out of 100, for translating them, even if it is not strictly the optimal translator for these sentences.

4:39

Now, the other kind of sentence is a super complicated one.

Let's say you have maybe a transcript of this lecture or

some kind of excerpt from maybe a constitution of some sort.

In this case, the sentence will be overloaded with a lot of adjectives,

a lot of clauses.

And it's really hard to translate it by any known translation system.

4:59

Now, the problem is that for this difficult sentence, any improvement the agent makes is actually going to affect the score much more than any improvement in the Jon Snow example.

But the REINFORCE algorithm, the policy gradient formula we've just derived, kind of says the opposite.

In this case, you would multiply the gradient of your simple sentences by the score they get, which is plus 100, and the gradient of your more complicated sentences by whatever the agent gets there, say 20.

This is not the kind of behavior you want your agent to exhibit.

On the contrary, you want to encourage your agent for doing things that are not just good by themselves, not just good because there happens to be a simple task to perform this time.

You want to reward the things that are good in comparison to how your agent usually performs.

So say on average you perform very poorly on these sentences, getting a reward of 10 out of 100.

Now, you've just gotten a reward of, say, 30.

This is a very good improvement.

You have to capitalize on it.

You have to actually make sure that the agent learns this and learns to repeat this thing more often.

And if it translates Jon Snow perfectly, just like during the previous 100 iterations, it's not a big deal, even though it gets a perfect score.

Now, this basically translates into having a baseline in the REINFORCE algorithm.

Now, the idea here is that you want to weight the gradient not by the Q function, as is written in this formula, but by something called the advantage.

So the advantage is how well your algorithm performs compared to what it usually does; it's like the advantage over the usual performance.

And this leads us to a bit more math here.

6:35

Now, let's define advantage as the difference between

the Q function of a particular action and the V function.

The idea here is that the advantage of a kind of average, okay action is going to be near zero.

The advantage of something remarkable,

which has accidentally gotten much more utility, much more cumulative gain,

much more Q than you expected, would be a high positive number.

In the case of a very simple situation, where your agent routinely gets large rewards, even a small detrimental change, say a shift from plus 100 to plus 90, would give you a negative advantage.

Because this is a case where your current Q-value is smaller than the expected Q-value.
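In symbols, the advantage is

A(s, a) = Q(s, a) - V(s),

so an average action has an advantage near zero, a better-than-expected action has a positive advantage, and the drop from plus 100 to plus 90 above gives a negative one.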

7:17

Now, what you actually want to do is replace the Q-value here with the advantage.

You want to encourage the actions that are better with respect to the average.

I think we can do that.

From a mathematical point of view, what we do is simply take the original formulation, and from every Q function here we subtract something called the baseline.

Now, the baseline is just some function which depends only on the state, so it does not depend on the action.

Mathematically we can actually do so without changing the optimal policy.
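Written out, the weight in the policy gradient formula becomes Q(s, a) - b(s), that is

\nabla_\theta J(\theta) = E[ \nabla_\theta \log \pi_\theta(a \mid s) \, (Q(s, a) - b(s)) ],

and choosing the baseline b(s) = V(s) turns this weight into exactly the advantage A(s, a).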

The main intuitive explanation is this: let's say you have two actions.

The first gives you a reward of plus 100.

The second gives you the reward of plus 90.

And if you subtract 90 from both of those actions, you'll get the Q values of,

well, 10 and 0.

If you subtract 110, you'd actually get Q values of -10 and -20.

In both those cases, the optimal action is going to be the first one,

because it has the highest Q value.

The same applies for situations with many states.

Just for each state, you can subtract some particular function value from all the Q functions that you find for this state, and you won't change your optimal policy here.

Because every option is being adjusted by the same amount.
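For completeness, the formal reason this subtraction is harmless is that the baseline term has zero expectation under the policy:

\sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) \, b(s) = b(s) \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \nabla_\theta 1 = 0,

so the gradient estimate stays unbiased, and only its variance changes.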

[SOUND]

[MUSIC]