0:02

Let's now focus on how to use these ideas to actually learn the Q-function.

Now you only have access to trajectories, and not the whole set of possible states and action transitions.

Now, the most obvious, but not the only, approach you can try is to sample trajectories and then average over all the trajectories containing a particular state. So, let's sample games of Breakout.

In this case, let's say you've managed to sample a million sessions. Out of those million sessions, you get 10 sessions where you are in a state where your paddle is on the right, the ball comes from the left, that particular part of the brick wall is destroyed, and all the other features of the state match.

In one of those sessions you might be lucky: you'd be able to bounce the ball off, and it hits two bricks before it bounces back. In another one you might be less lucky and miss the ball. Sometimes you do, because your agent might not be perfect. You've lost a life, and your average return for this session was lower.

You might even get a session where you are super lucky and you're able to bounce the ball behind the brick wall, which gets you a lot of points in Breakout.

So if you're not familiar with the rules of Breakout, just believe that there are many possible outcomes.

The idea is that for this particular state you have to average over those outcomes to get an estimate of the expectation over possible outcomes that appears in the dynamic programming formula.
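As a rough sketch, this averaging idea might look like the following in Python. Here `sample_session` is a hypothetical helper, not something from the lecture, that plays one game and returns a list of (state, action, reward) triples:

```python
from collections import defaultdict


def mc_q_estimate(sample_session, n_sessions=1000, gamma=0.99):
    """Monte Carlo estimate of Q(s, a): average the discounted return
    observed after every visit to each (state, action) pair."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_sessions):
        session = sample_session()
        g = 0.0
        # walk backwards so g accumulates the discounted return from each step
        for state, action, reward in reversed(session):
            g = reward + gamma * g
            returns_sum[(state, action)] += g
            returns_count[(state, action)] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```

Note that a rarely visited state may get only a handful of samples even from a huge number of sessions, which is exactly the problem discussed next.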

Now, the thing here is that it does work, technically. Statistically it isn't the best estimate, but it will converge provided you give it a lot of samples; this amount of samples, however, might be unrealistic in practice.

As I said in this example, you take a million samples, and out of those million samples there might be just 10 where you've ended up in this particular state, with this particular arrangement of bricks, your position, and the ball's position. All the other states differ by just a little bit, but they are different states nevertheless.

So again, if you have a simple problem with not that many states, or some special exotic case you can exploit, and more or less, basically, if you know what you're doing, you can apply this idea: simply sample a lot of times and average. But we're actually going to find out that there is a better way, a better way in terms of how it converges on more practical problems.

So here it goes. This second idea is called temporal difference. It exploits the recurrent formula for Q-values.

You have the Q-value: the left-hand side is what you currently know about the Q-value, while the right-hand side, with the expectation over all the possible next states, is the kind of refined version of the Q-value that you would have used in value iteration, in your dynamic-programming-based methods.

What you can do is use this idea: take your current estimate of the Q-values on the right-hand side, the maximum over those Q-values, to refine the Q-value at a different state. And to do so, you only need a state, an action, a reward, and the next state. That's how this goes.
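Written out in the usual notation, the recurrent formula being described here is:

```latex
Q^*(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r(s, a) \;+\; \gamma \max_{a'} Q^*(s', a') \,\right]
```

and a single observed transition gives one sample, $r + \gamma \max_{a'} Q(s', a')$, of the right-hand side.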

If the idea is clear, let's actually highlight some of the details that are hidden in the formula.

So you have this maximization term here. You've probably noticed the notation for Q-star, the action value of the optimal policy. Since the optimal policy would take the action with the highest Q-value, you can replace the value of the next state with a maximization over all possible actions, and then you obtain the recurrent formula for your Q-star. That's more or less copy-pasted from the previous week.

Another problem with this formula: you may have noticed this expectation here. Basically, it requires you to take an expectation over the outcomes, and so on. In this case, again, we cannot compute it explicitly, so we have to use some other trick instead.

Which one? As usual, the only way we can so far get away with something we cannot compute explicitly, if it's an expectation, is to approximate it. We approximate it by sampling.

In this case we could try to take, say, five samples, but since these are trajectories, the only option we have is to take one sample, because there's no going back. We take this one sample from our environment by plugging in the action and seeing what happens, to somehow estimate this expectation. But we cannot use it as the actual Q-value, because it's way too noisy.

The other important trick we can use here is the so-called exponentially weighted moving average. The basic formula is like this: instead of replacing the Q-function with this noisy but unbiased value entirely, and forgetting everything there was before the update, you can use a smoother version of the update.

Take your updated, noisier version (this is the left part of the right-hand side; sorry for the clumsy description) and multiply it by, say, 0.1: this is the value of alpha. So you take 0.1 of your new, noisier estimate, plus 0.9, that is one minus 0.1, times the previous estimate you had. Initially it will be 0.1 times your new estimate plus 0.9 times zero, because you start with zeros.

Once you've got some new estimate of the Q-function, this moving average will prevent it from being noisy. It will allow you to gradually converge to the average, because every time you take one sample, you trust it only a little bit, and you kind of average over time. So, this is how you basically solve the problem of having only one choice, only one trajectory.
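As a minimal sketch of this moving average (the 0.1 is the alpha from the lecture; the sample values are made up for illustration):

```python
def ewma_update(old_estimate, new_sample, alpha=0.1):
    """Exponentially weighted moving average: trust the noisy new
    sample a little (alpha) and keep most of the old estimate."""
    return alpha * new_sample + (1 - alpha) * old_estimate


q = 0.0  # start from zero, as in the lecture
for sample in [10.0, 10.0, 10.0]:
    q = ewma_update(q, sample)
# q creeps toward 10: 1.0, then 1.9, then 2.71
```

Each individual sample only nudges the estimate, so the noise of any single sample is averaged out over time.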

Now let's walk through this algorithm step by step, to give you a better idea of how to implement it. And spoiler alert: the implementation is part of your homework, so please pay attention here.

What you require here, as usual, is this trajectory of state, action, reward, next state, next action and so on. To train via Q-learning, to perform one elementary update, you only need a small section of the trajectory.

You need a state, an action, a reward for taking this action at this state, the immediate reward, and the next state sampled from the decision process. These are the s, a, r, s prime, and to utilize them you simply plug them into the formula we've just obtained.

So, first we need to obtain the value of the next state: we want to find the "Q-value of the next state" part of the formula. To do so, we take this next state. We know the set of all possible actions available, because the action set is usually a constant that we know in advance.

What you do is take all the known actions and compute the Q-values of those actions. Since we're trying to compute Q-star, we take the maximum here. So basically we take Q of s prime and a1, s prime and a2, s prime and a0, for all the available actions, and then we maximize. This is basically the V-star of s prime, if you wish, the mathematical way of saying it.

Now you multiply this by gamma, because that's how discounted rewards work, and you add the immediate reward. This way you obtain one sample from the expectation that was a part of our original formula.

Now you can update your Q-function by getting a little nearer to this new sample, via the exponentially weighted moving average, plugging into the formula as usual. So you have this reward plus gamma times the maximum over Q-values for the next state; you multiply it by, say, 0.1, and after that you take your previous guess, multiply it by one minus 0.1, that is 0.9, add those things up, and that's your new estimate of the Q-function.
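Putting the whole elementary update together, here is a hedged sketch with a dict-based tabular Q-function; the names and interface are illustrative, not the exact ones from the homework:

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One elementary Q-learning step from a single (s, a, r, s') tuple.

    Q maps (state, action) -> value; `actions` is the constant,
    known-in-advance set of available actions."""
    # V*(s') is approximated by the maximum over Q-values of the next state
    v_next = max(Q[(s_next, a2)] for a2 in actions)
    # one sample of the expectation: immediate reward + discounted next value
    target = r + gamma * v_next
    # exponentially weighted moving average toward the noisy target
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```

Called once per observed transition, this is all you need; no full episode is required before learning starts, which is the point made next.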

This algorithm has a number of super interesting consequences. First, since it learns from partial experience, from only this tuple of s, a, r, s prime, you can train your algorithm even before you've ended the first session. Say you're training your robot to walk forward. What you should do is reward it for simply moving forward, even before it reaches this five-minute threshold of a session.

What's going to happen then? Imagine that your robot starts out randomly, and regardless of what it does, it's probably going to end up in some kind of cycle, because that's how walking usually happens. So, even before it finishes walking forward, it's probably going to find itself in the same state at least a few times, maybe once per loop.

So once per step, you're going to find yourself in a position where, say, your right leg is up and you're about to make a step, or something similar. So even before you finish your very first trajectory, you'll probably end up with Q-values that are better than random.

And this is super neat, because sometimes you have infinite processes: sometimes you want to walk forward indefinitely, or you want to optimize some process with no apparent border between sessions. You can just feed it into Q-learning, and Q-learning is going to train towards, to converge to, the optimal policy before the session finishes, or even without it finishing at all. How cool is that?
