0:02

This time, the trick is going to be of a different kind, because it's not technically a heuristic. It's rather a better way to define the reward function adaptively as your agent progresses.

Imagine you are trying to train a bipedal robot to walk forward as fast as possible. So, you have a robot with a bunch of motors in each joint that it can move independently, and you give it a reward for the amount of distance it managed to cover in, say, 10 seconds. The distance is measured in meters, so you get a plus, say, 10 for 10 meters.

So, in this case, if you start training from a random guess, you'll probably find out that for the first several iterations, the reward is going to be somewhere between, say, zero and one, because the initial sessions are going to consist of the robot falling down on its face, and maybe crawling forward by sporadic, chaotic movements of its limbs. So, this is how you're going to begin.

But eventually, as you hopefully converge to some kind of optimal policy, the robot walks upright, and the reward is on the order of plus 100, maybe even more if your robot finds a way to run fast.

This basically, it doesn't technically break anything. But it does make you adjust the learning parameters of your algorithm. For example, if you find an optimal learning rate for the beginning of training, that same rate will probably overshoot at the later iterations, where the rewards are much larger. So, instead of trying to adjust the learning rate on the fly, you can try a different thing.

You can try to redefine your rewards based on this simple equation. If you subtract from each reward some kind of constant value which doesn't depend on the particular session, or multiply it by some positive constant, then maximizing the original reward is going to give you exactly the same solution as maximizing this new reward. If your algorithm converges to the optimum, of course.
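As a quick sanity check of that invariance, here is a minimal sketch with NumPy. The rewards, the shift of 42, and the scale of 7 are made-up numbers for illustration; the point is only that an affine transform with session-independent constants cannot change which session is best.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 100.0, size=10)  # total rewards of 10 hypothetical sessions

# Subtract a constant and divide by a positive constant.
# Neither constant depends on a particular session.
shifted = (rewards - 42.0) / 7.0

# The best session is still the best session: the argmax does not change.
assert rewards.argmax() == shifted.argmax()
```

Any optimizer that only cares about which parameters yield a higher reward therefore ends up at the same solution for both reward definitions.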

This is the same situation that happens if we modify the loss function for supervised learning. This time, we just normalize the rewards.

And to normalize, we can compute the mean and standard deviation based on, say, several sessions within one iteration. So, if we sampled, say, a hundred sessions, we get the mean reward and the standard deviation via maybe your favorite numpy-like package, and then you subtract the first and divide by the second.

But the session which is the best of that 100-session batch will still get the highest reward, because that's how arithmetic works. So, this new reward, which is the reward minus the mean, divided by the standard deviation, is usually referred to as the advantage.
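In code, that normalization is one line of NumPy. The reward values below are made up; in practice they would be the total rewards of the sessions sampled within one iteration.

```python
import numpy as np

# Made-up total rewards of a batch of sessions from one iteration.
rewards = np.array([0.3, 0.9, 0.1, 2.5, 0.7])

# Advantage: how each session did relative to the batch,
# in units of the batch's standard deviation.
advantage = (rewards - rewards.mean()) / rewards.std()
```

After this transform the batch has zero mean and unit standard deviation, but the ordering of the sessions, and in particular the identity of the best session, is unchanged.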

It measures how your session performs compared to the mean performance of the other attempts within this iteration, or, well, other attempts more broadly, depending on how you define the advantage function here, how you calculate the mean. And basically, it gives the same benefits as when you train a linear model and normalize the features at its input.

So, here it goes. You have a modified version. Just plug in this advantage instead of every occurrence of the reward in the previous slide, like this, and then you'll get an algorithm which hopefully converges slightly faster.
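To make the "plug in the advantage" step concrete, here is a simplified evolution-strategies loop on a toy objective. The quadratic objective, the noise scale `sigma`, the learning rate `lr`, and the population size `n` are all assumptions for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def total_reward(theta):
    # Stand-in for playing a session with parameters theta;
    # a toy objective whose maximum is at theta = [5, 5, 5].
    return -np.sum((theta - 5.0) ** 2)

theta = np.zeros(3)            # current policy parameters
sigma, lr, n = 0.1, 0.02, 100  # noise scale, learning rate, population size

for iteration in range(200):
    eps = rng.standard_normal((n, theta.size))  # random perturbations
    rewards = np.array([total_reward(theta + sigma * e) for e in eps])
    # Plug in the advantage instead of the raw reward:
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # ES gradient estimate, weighting each perturbation by its advantage.
    theta += lr / (n * sigma) * adv @ eps
```

With the normalized rewards, the update has roughly the same magnitude whether the raw rewards sit near 1 or near 100, which is exactly the point: you do not have to retune the learning rate as the agent improves.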

We'll take a much closer look at this advantage function, and its benefits and drawbacks, in the later weeks of our course, when we cover the policy-based methods.

So, here it goes. Now, we'll see how the evolution strategies algorithm compares to the other algorithms we've already covered and the ones we are yet to cover. It's, again, yet another black-box optimization trick.

The huge upside of the evolution strategies algorithm is that it's super easy to implement. And since a single iteration of the algorithm is very simple, it's also kind of trivial to parallelize.

Imagine you live in the modern era and for your research you have many CPUs, say 1,000 CPU cores that can compute things independently, and you want to use this whole bunch of cores to estimate this formula.

Let's say that we take, well, a hundred samples of theta, and for each sample we play 10 games with this particular theta, to account for the expectation in the second sum in the formula. How can we compute this formula without losing a lot of time to its sequential nature?

You can sample, say, a hundred values of theta, and then send each theta to 10 cores in your 1,000-core cluster, so that each core computes one trajectory. Then, with each core computing just one trajectory, you get a number of trajectories equal to your number of cores. And this gives you a linear speedup by a factor of 1,000.

This is super great, because more complicated algorithms usually don't scale that well, because they have a lot of sequential parts that cannot be easily parallelized. So, that's the upside.

Now, let's see some results of how this algorithm works in practical environments. This is a report by OpenAI. There's also a very well written blog post that you can read, which we recommend you do right now. There should be a pop-up with the text right here.

So, here are the training curves, the rewards per the number of iterations of the algorithm, for each of the three games they have tried the algorithm on. There are, of course, many more experiments in the paper that you can find, which yield slightly different, although generally similar, results.

And what you're going to find here is that evolution strategies, the orange line here, is usually almost as efficient as this other method, TRPO. Now, TRPO takes almost an entire lecture to explain, and in some cases it's even less efficient, as we can see in the very first plot here. But if we can parallelize our method, if we can scale it to multiple cores, it'll be much more efficient than TRPO because of all the, well, computational power you now have. And you cannot do the same parallelization trick with TRPO as efficiently.
