0:00

[MUSIC]

Now our main protagonist for this and a couple of the next sections will be the so-called policy gradient. The idea behind this is that instead of trying some cunning heavy-wizardry methods, like picking the top sessions, calling them elites for some reason, and just training on those sessions, maybe you'd be better off if you just tried to write down what you want to optimize, compute the gradient of this thing, and then follow through with gradient descent or backpropagation in neural networks. Let's try to see how it works one step at a time.

Let's begin by introducing the notion of the so-called expected reward. You've probably used this thing on more than one occasion, actually. But this time we're going to have to explicitly write it down and make some mathematical derivations with it. So the idea of expected reward is that it is, well, what it seems to be. It is an expectation over the rewards that you're going to get if you play a lot of sessions with your agent.

Now, during our first week, we had an expected reward without gammas, without discounts. This was just the expected sum of rewards over a session. There's also the expected discounted reward. It follows the notion of discounted reward from weeks two and three, from Q-learning, value iteration and so on. It is simply a sum of your rewards, each multiplied by gamma to the power of how much time has passed until that moment. The idea here is that you can try to maximize the average discounted reward over states, or you can try to maximize the discounted reward you get from the first state on.
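Written out (using $r_t$ for the reward at step $t$ and $\gamma$ for the discount, a sketch of the formulas being described rather than a copy of the slide), the two objectives look roughly like this:

```latex
% Undiscounted expected reward over sessions played with the policy:
J = \mathbb{E}\left[\sum_{t=0}^{T} r_t\right]
% Discounted version: each reward is weighted by gamma to the power of
% how much time has passed until that moment.
J_\gamma = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]
```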

So let's start with the undiscounted version of our rewards, because it produces simpler math. Even further, let's simplify our problem by pretending that there is only one step per session. So your agent gets born in some random state, it takes one action, it observes the reward, and it may train on this reward. Then a new session begins. So there is no continuation within a session.

So if we write down what's actually hiding behind this neat expectation, you'll see the nasty double integral here. There is an integral over all possible states, weighted by the probability of visiting each state. Then there is an integral over actions, with the probability of taking each action, and then you multiply all of this by the reward. This is how a mathematical expectation works in the general case, where you have an infinite, continuous amount of options.

Â 2:34

The first thing here is the state visitation frequency: the probability of visiting state s. In this case, so far, it does not depend on the agent. However, if you use the more complicated notion of a multi-step process, then visiting a state is something the agent can influence, so this would also depend on the agent's policy. The second thing is the pi factor: the probability of the agent taking action a in state s. This is what your agent actually influences here. This is your neural network's prediction, or maybe an entry of the table of all action probabilities. The final term is the reward. In this case you can use any notion of reward; it's still just one time step.

Now remember, this is the thing we're trying to maximize. But so far it's hard to even compute, let alone optimize. If you try to explicitly estimate the value of this J, you'll have to sum over all possible Breakout frames and over all possible actions; if there's a continuous amount of actions, you're screwed. And then you'd have to compute the rewards for the whole session starting from each of them. Surely there is a simpler way to estimate this thing, albeit approximately. How would you do that?

To even compute this, remember, it is still a mathematical expectation. And instead of trying to compute the expectation exactly, you can simply approximate it with sampling. So you play, let's say, 1,000 games, and you average over all the states and actions you have encountered in those games. Finally, some good news has arrived: you can at least approximately estimate the value of this J by using Monte Carlo sampling like this.
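As a sketch of this Monte Carlo estimate (the environment, rewards and softmax policy here are made-up toy objects for illustration, not the lecture's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step environment: 3 states, 2 actions.
# R[s, a] is the reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])

def policy(s, theta):
    """Softmax policy pi(a|s) parameterized by a table theta."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def estimate_J(theta, n_sessions=10_000):
    """Monte Carlo estimate of J: play many one-step sessions and
    average the rewards over the states and actions encountered."""
    total = 0.0
    for _ in range(n_sessions):
        s = rng.integers(len(R))               # state visitation: uniform here
        a = rng.choice(2, p=policy(s, theta))  # action sampled from the policy
        total += R[s, a]
    return total / n_sessions

theta = np.zeros((3, 2))  # uniform policy: the exact J is the mean of R, 2/3
print(estimate_J(theta))
```

With enough sessions the sampled average lands close to the true expectation, which is the whole point of the approximation.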

Now let's go to the second stage. We want to optimize J, so we want to compute the gradient of J with respect to the parameters of your policy. To do so, we'd like to compute the gradient of the sampled, approximate version, because the original version is too hard to compute. The question is: remember we had all those cool frameworks for neural networks, like TensorFlow. Can we use them now? Can we maybe use TensorFlow to explicitly compute the gradients of this formulation of J with respect to theta, the parameters of the policy? Or is there something that prevents us from doing so?

Â 4:44

Well right, so the problem with TensorFlow here is that it can only compute

Â the derivatives with respect to something that is in the formula.

Â And if you take a closer look at this particular version of Monte Carlo sample

Â J, you'll notice that it has no place for parameters, for thetas, here.

Â The problem is that this sample definition, it does depend on parameters,

Â on policy, but it depends on them in the summation.

Â So you have sampled the sessions from your policy.

Â But you don't remember the values,

Â you don't remember the probabilities any longer.

Â So no, TensorFlow won't be able to compute this thing for us.

Â And in fact, not only TensorFlow, but any mathematician, if you give him the second

Â formula only, will probably call you a jerk if you ask him to differentiate.
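For contrast, the sampled version over $N$ one-step sessions is just an average of observed rewards:

```latex
% Monte Carlo estimate of J; theta enters only through how the actions
% a_i were sampled, not through the formula itself, so there is
% nothing here to differentiate with respect to theta.
\hat{J} = \frac{1}{N} \sum_{i=1}^{N} R(s_i, a_i),
\qquad s_i \sim p(s), \quad a_i \sim \pi_\theta(a \mid s_i)
```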

Okay, so we have to do something else: we have to estimate the gradient of J some other way.

Now, some simple, so-called duct-tape approaches you could try to employ are, for example, the so-called finite differences. Instead of trying to compute the derivative, you just pretend that the infinitely small value in the definition of the derivative is equal to, say, 10 to the power of minus 5, this epsilon here. You can compute something that looks like a derivative, but it's not a derivative, because the value is not infinitely small. This will technically work. It will require you to, well, compute the J by sampling, then change the policy ever so slightly by this small value of epsilon, and then estimate the new value of J by sampling more sessions.
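A minimal sketch of this finite-differences trick (the toy environment, softmax policy, and all names here are illustrative assumptions, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative one-step environment: R[s, a] = reward for action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def estimate_J(theta, n_sessions=5_000):
    """Sampled estimate of the expected reward under a softmax policy."""
    total = 0.0
    for _ in range(n_sessions):
        s = rng.integers(len(R))
        logits = theta[s]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(len(p), p=p)
        total += R[s, a]
    return total / n_sessions

def finite_diff_grad(theta, eps=1e-5):
    """'Derivative' via finite differences: nudge each parameter by eps
    and re-estimate J from fresh sessions.  With a tiny eps the sampling
    noise dwarfs the true difference between the two policies; with a
    large eps the estimate is biased.  This is the problem the lecture
    discusses below."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        shifted = theta.copy()
        shifted[idx] += eps
        grad[idx] = (estimate_J(shifted) - estimate_J(theta)) / eps
    return grad

g = finite_diff_grad(np.zeros((2, 2)))
print(g.shape)
```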

Another way that you are probably more familiar with by now is to use the cross-entropy method. What it does is it tries to somewhat maximize something that looks like J, by sampling a lot of sessions and then taking those in which the reward of a particular session was higher than that of the other sessions, so that the expected reward gets larger.
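The cross-entropy idea can be sketched on a toy one-state bandit (the reward values, the elite percentile, and the smoothing coefficient here are made-up illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bandit: one state, 3 actions with these (hypothetical) expected rewards.
n_actions = 3
true_reward = np.array([0.0, 1.0, 0.2])
probs = np.ones(n_actions) / n_actions   # start from a uniform policy

for _ in range(30):
    # Play a batch of one-step sessions under the current policy.
    actions = rng.choice(n_actions, size=200, p=probs)
    rewards = true_reward[actions] + rng.normal(0, 0.1, size=200)
    # Keep the "elite" sessions whose reward beats the 70th percentile...
    threshold = np.percentile(rewards, 70)
    elite = actions[rewards >= threshold]
    # ...and shift the policy toward the actions the elites took.
    counts = np.bincount(elite, minlength=n_actions)
    probs = 0.5 * probs + 0.5 * counts / counts.sum()

print(probs.argmax())
```

After a few iterations the policy concentrates on the action with the highest expected reward.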

Both of those methods will technically work, although they have some problems. This time I would ask you to criticize those methods. While those two methods do work in theory, in practice they are very hard to estimate efficiently in any practical situation.

For example, if you are trying to solve Breakout via this finite differences method, it would actually take you, say, 100 games to estimate the first term, the J at theta plus epsilon. And then it would take you another 100 games to estimate the second J. And even then, the amount of noise you introduce by sampling would still be much larger than the difference between the two policies, especially if epsilon is sufficiently small. And if you use large values of epsilon, your gradient would be useless for anything more complicated than a linear model, or maybe a table in this case.

Stochastic optimization, like the cross-entropy method, is in this case much more practical in terms of how it uses samples. But it still has some problems. For example, remember what happens if you have some innate randomness: say, there is a slot machine in your environment, or some physically random process. In this case you'll have to use the cross-entropy method with some tweaks to prevent it from favoring the lucky outcomes, from believing that the elite sessions are elite because they have some way of tricking the slot machine. The method we are going to study next will mitigate both of these problems by the way it computes the derivative of J. It won't use any high wizardry. Instead, it will try to find an analytical derivative of J which is easily approximated with sampling.

[SOUND]