
This is the kind of question you probably asked much more frequently in the final years of your school education. You have the logarithm of some arbitrary function f of x, and you want to get the gradient, the derivative, of this function. And I want you to simplify this derivative for me by applying some of the properties of derivatives you've studied previously, in calculus or at the end of your school education.

Yes, if you first apply the chain rule and then replace the derivative of the logarithm with the appropriate derivative from the table, you will see that this thing is equal to one over f of x, times the derivative of f of x itself. If you multiply both parts by f of x, if you just move the one over f of x to the left, you see that this is equal to the following equation: you have the derivative of f of x on the right-hand side, which is equal to the function itself, f of x, times the derivative of its logarithm.

This is a very simple trick; it's called the log-derivative trick. It's actually a very powerful one. Let's see how we can use it to simplify our lives.
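In code, the trick is a one-line identity you can check numerically; here is a minimal sketch (the function f and the test point are my own arbitrary choices, not from the lecture):

```python
import math

# Pick an arbitrary positive function f(x) and a point x.
f = lambda x: x**2 + 1.0
log_f = lambda x: math.log(f(x))
x, h = 1.5, 1e-6

# Numerical derivatives via central differences.
df = (f(x + h) - f(x - h)) / (2 * h)              # f'(x)
dlog_f = (log_f(x + h) - log_f(x - h)) / (2 * h)  # (log f(x))'

# Log-derivative trick: f'(x) == f(x) * (log f(x))'
print(df, f(x) * dlog_f)  # the two numbers agree up to numerical error
```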

So, you have this nabla J from, like, three or six slides ago. And we're now going to plug the log-derivative trick into this equation directly, in place of the nabla of the probability under the normal distribution, this gradient of N of theta given mu and sigma squared. So, if we replace it with the right-hand side of the formula below, you'll see that this nabla J becomes equal to a formula which is even longer but, as it turns out, much simpler.

So, this is an integral over the probability of thetas under the normal distribution, times the gradient of the logarithm of this probability, times the expected trajectory return defined by the second expectation. Now, can we tune this particular formula? Can we adapt it to estimate nabla J in a sampled manner, to speed things up, to make the computations tractable in any practical environment? What do we do?

Here's my guess. The easy part here is that once we have the integral times the probability density function, we can pretend this is an expectation again. And this gradient of the logarithm is whatever is expected under this expectation, and the whole thing simplifies into this expression here at the top of the slide. So, if you now replace the expectation with samples, you'll get this guy here below.

And again, it's more of the same thing, but multiplied by the gradient of the logarithm of the normal distribution's probability density function, which you can compute explicitly; well, you can even do this on a sheet of paper in like ten minutes, or you can try to google it, or look it up wherever. And if you remember TensorFlow, or Theano, or any other symbolic graph framework from the deep learning course, you also know that you can get the value of this gradient explicitly by simply using symbolic gradients.

So, here you go: you have a sampled estimate of the derivative of the expected return with respect to your mu and sigma squared.
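As an illustration of that sampled estimate, here is a minimal one-dimensional sketch in NumPy (the return function R and all constants are stand-ins of my own, not from the slides; in a real setting R would come from playing episodes):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.0, 1.0

# Stand-in for the expected trajectory return; in RL this would come
# from actually playing games with parameters theta.
R = lambda theta: -(theta - 2.0) ** 2

thetas = rng.normal(mu, np.sqrt(sigma2), size=1000)

# grad_mu log N(theta | mu, sigma^2) = (theta - mu) / sigma^2
grad_log = (thetas - mu) / sigma2

# Sampled estimate of grad_mu J: average of R(theta) * grad log-prob.
grad_J_mu = np.mean(R(thetas) * grad_log)
print(grad_J_mu)  # positive here: it pushes mu towards the optimum at theta = 2
```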

So, once you have finally obtained a way to estimate nabla J in a sample-based manner, like this formula on the slide, it's time to use this formula to devise an actual algorithm which is capable of solving the reinforcement learning task at hand, kind of from scratch, from zero to convergence.

The algorithm is going to be very similar to what we had before. You start with some initial guess of mu and sigma squared. Let's say that this initial guess provided us with the sample of green dots at the bottom center of the slide. And then, again, you intuitively adjust it using the feedback, the reward function of a trajectory, shown by this heatmap with the blue isoclines on the right. Basically, this is a simple toy problem to illustrate how the algorithm performs.

So, you have a mu and a sigma, and you compute the gradient of the expected return, which is slightly better in the upper-right part of the samples. If you compute the gradient of this thing with respect to mu and sigma squared, you'll see that by following this gradient, you'll move the mu upwards and to the right by a small amount, depending on the learning rate you use for the gradient ascent. Slowly, it will crawl upwards by a small step.

Then, if you repeat the whole process again, you get an even better estimate, and maybe an even better one, and so on and so forth. Each time, you draw new samples from the normal distribution, say, 50 samples per iteration, which would be more or less okay. For each sampled theta, you play one or several games and average their performance.

Now, eventually, this thing is going to crawl uphill and find itself at the maximum, maybe a local one here. And since it's a probability distribution, this is not yet the final stage, because there's one more thing to change: you can still take the derivative with respect to sigma squared and adjust sigma squared as well, shrinking the variance. And this leads to this process of crawling uphill in a sample-based manner, which will eventually concentrate into a more or less delta function at the optimal point.

And by "eventually", of course, I mean if you have an infinite number of iterations, an infinite number of samples, and so on, because in a practical case you can only guarantee that it is, kind of, more likely to be in some vicinity of the optimum as you progress.

The formal, step-by-step definition of the algorithm looks very similar to what you had before in the cross-entropy method. Basically, it says that we first have to guess the initial values for mu and sigma squared. These are both, maybe, large vectors. You can try either using some generic defaults, a mu of zero and a sigma of some small number for every possible parameter in the entire array.

Or you could use prior knowledge: for example, pre-train your neural network on some, well, supervised data, or on some similar environment, and use that as an initial guess.

So, once the initial mu and sigma are obtained, you then get new sample trajectories: you sample theta from mu and sigma, and you sample trajectories with this theta. And you use those trajectories to estimate the gradient of the expected return using the formula we've just derived.

Now, once we get the formula here, we use it to perform gradient ascent. So, since we want to maximize J, we say that our new mu is the previous mu plus alpha, some learning rate, say 0.01, times the derivative of J with respect to this mu.

And the same works for sigma. If you replace those derivatives with the uglier explicit stuff, this actually means that once you've sampled the thetas, and for each theta your sample trajectories and returns, then you can just write this value down in an analytical manner: find the formula and implement it in NumPy or any other favorite library of yours. And basically, you repeat the process until you're satisfied with the result.
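Putting the whole loop together, a sketch of the algorithm on a one-dimensional toy problem might look like this (the reward function, learning rate, and sample count are my own stand-ins; in a real environment the reward would come from actually playing games with theta):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in reward: in a real task this would be the return of one or
# several games played with parameters theta, averaged.
reward = lambda theta: -(theta - 3.0) ** 2

mu, sigma2 = 0.0, 4.0
alpha = 0.05      # learning rate for gradient ascent
n_samples = 50    # sampled thetas per iteration

for _ in range(300):
    thetas = rng.normal(mu, np.sqrt(sigma2), size=n_samples)
    R = reward(thetas)

    # grad log N(theta | mu, sigma^2) with respect to mu and sigma^2
    g_mu = (thetas - mu) / sigma2
    g_s2 = -0.5 / sigma2 + (thetas - mu) ** 2 / (2 * sigma2 ** 2)

    # Sampled gradient ascent step on the expected return J.
    mu += alpha * np.mean(R * g_mu)
    sigma2 = max(1e-3, sigma2 + alpha * np.mean(R * g_s2))

print(mu, sigma2)  # mu crawls towards 3 while sigma^2 shrinks
```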

I've written the formula for a one-dimensional mu here, but you can easily extend it to anything. And for sigma squared, it's more or less the same thing: you have the derivative of the normal distribution. And since the normal distribution is something times the exponent of the squared difference between mu and the particular sample, the logarithm of this normal distribution is even easier to compute.
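Concretely, since log N(theta | mu, sigma^2) = -(theta - mu)^2 / (2 sigma^2) - 1/2 log(2 pi sigma^2), both derivatives are easy to write down; here is a quick numerical check of the closed forms (the particular numbers are my own):

```python
import math

mu, sigma2, theta = 0.5, 2.0, 1.7

log_N = lambda m, s2: (-(theta - m) ** 2 / (2 * s2)
                       - 0.5 * math.log(2 * math.pi * s2))

# Closed forms:
#   d/dmu     log N = (theta - mu) / sigma^2
#   d/dsigma2 log N = -1/(2 sigma^2) + (theta - mu)^2 / (2 sigma^4)
d_mu = (theta - mu) / sigma2
d_s2 = -0.5 / sigma2 + (theta - mu) ** 2 / (2 * sigma2 ** 2)

# Central-difference derivatives for comparison.
h = 1e-6
num_mu = (log_N(mu + h, sigma2) - log_N(mu - h, sigma2)) / (2 * h)
num_s2 = (log_N(mu, sigma2 + h) - log_N(mu, sigma2 - h)) / (2 * h)
print(d_mu, num_mu)  # analytic and numerical values agree
print(d_s2, num_s2)
```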

As you can see right now, the new definition of the evolution strategies algorithm is not that complicated: it fits on a single slide now, with a large font size. As I promised you, this definition is also quite distant from our usual notion of evolution and, well, natural selection, yada yada yada. It's rather a trick from the domain of stochastic optimization methods.

And, as is true for any stochastic optimization method, or any reinforcement learning algorithm for that matter, we can slightly improve it by applying duct tape: the usual dose of hacks, heuristics, and tricks.