0:02

Now, what happens if, instead of having, say, five or six discrete actions, we now deal with a continuous action space?

Previously you had two options: you steer the bike to the left or to the right. Now you have to provide a particular value between, say, minus pi over two and plus pi over two, which is the exact amount of steering you want to apply.

This is especially important in domains like, say, robot control. In this case, you have all the joints, all the limbs of your robot, and you can move them via motors, and motors are controlled via voltage. So you can apply any voltage you want within some particular range.

Now, this actually means that you can no longer solve this problem by simply having one neuron per possible action and using a softmax nonlinearity. So, how do you use, say, a neural network to predict not a discrete variable, but a real-valued variable?

One way you can do this, probably the most obvious one, is to simply solve a regression problem. For example, default scikit-learn or similar regression models usually minimize the mean squared error.

If you remember from a Bayesian methods course, minimizing the squared error is actually the same thing as maximizing the logarithm of the likelihood, in the case where you predict a normal distribution over the action space. So basically, it means that the probability of taking an action is a normal distribution with mean value given by your neural network and a standard deviation of one.

Again, in this case, you can simply fit your model using an existing classification or, in this case, regression algorithm, which for a neural network is just another application of backpropagation. And you can do this repeatedly until your model converges to an optimal result.
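As a concrete illustration, here is a minimal sketch of that idea (not the course's actual assignment code; the observation size, network shape, and toy data below are made up): fit an ordinary regressor on elite (state, action) pairs by minimizing MSE, then treat its prediction as the mean of a Gaussian policy with standard deviation one.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical elite (state, action) pairs gathered from the best sessions.
rng = np.random.default_rng(0)
elite_states = rng.normal(size=(200, 4))  # 4-dimensional toy observations
elite_actions = elite_states @ np.array([0.5, -0.2, 0.1, 0.3])  # toy steering angles

# Minimizing MSE here is equivalent to maximizing the log-likelihood of a
# Normal(mean=net(state), std=1) policy over the continuous action.
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(elite_states, elite_actions)

def sample_action(state, std=1.0):
    """Sample a continuous action from the implied Gaussian policy."""
    mean = net.predict(state.reshape(1, -1))[0]
    return rng.normal(mean, std)

action = sample_action(elite_states[0])
```

Any regression model would do here; the point is only that the "one neuron per action plus softmax" head is replaced by a single real-valued output trained with squared error.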

In our practical assignments, if you modify just two lines of code in your assignment, you'll be able to solve a different problem which requires a real value as its output. We'll discuss the details of these particular changes later in the practical assignment itself.

Of course, it's not all theory. Sometimes, to get algorithms to work really well, you'll have to employ some kind of dirty hacks and practical heuristics. For the cross-entropy method, there are several kinds of those heuristics. One family of heuristics is aimed at reducing the number of samples it takes to train. In the cross-entropy method, this problem is especially dire.

You have to play, say, 100 sessions, and you only use some fraction, say 25 percent, of them, and the fraction you keep gets even smaller if you use larger sample sizes. This is of course terribly bad, and this is probably the worst case of sample inefficiency we have.
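That "keep only the top 25 percent" step can be sketched in a few lines (the session rewards and the 75th-percentile threshold below are purely illustrative):

```python
import numpy as np

# One total reward per played session (toy values).
rewards = np.array([1.0, 5.0, 2.0, 8.0, 3.0, 9.0, 4.0, 7.0])

# Keep only the top 25% of sessions: those at or above the 75th percentile.
threshold = np.percentile(rewards, 75)
elite_mask = rewards >= threshold

# Everything below the threshold is thrown away, which is the sample
# inefficiency being discussed: most of the collected experience is unused.
```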

So, the cross-entropy method relies on you being able to feed it all the samples it wants. This is true for virtual reality, for games, for computer-simulated robots, but it's not true if you want an actual robotic car to steer on actual streets.

So instead, you can try some hacks to get it to run more smoothly. For example, you can reuse the samples from several past iterations. Then you don't have to sample 1,000 or 100 sessions: you can sample, say, 20 sessions and use 80 sessions left over from the previous iterations. This of course makes the training slightly less theoretically sound, but it tends to work reasonably well in practice.
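One simple way to implement that reuse trick (a sketch; the pool size and per-iteration counts are made up) is a bounded pool of sessions that is topped up with a few fresh rollouts each iteration:

```python
from collections import deque

# Keep at most 100 sessions; the oldest ones fall out automatically.
session_pool = deque(maxlen=100)

def cem_iteration(play_session, n_new=20):
    """Sample a few fresh sessions, then train on fresh + leftover ones."""
    for _ in range(n_new):
        session_pool.append(play_session())
    # e.g. 20 fresh sessions plus up to 80 reused from past iterations
    return list(session_pool)

# Toy stand-in for an environment rollout (just yields increasing numbers).
counter = iter(range(10_000))
batch = cem_iteration(lambda: next(counter))
```

In practice you would also want to drop sessions generated by a policy that is now too stale, which is exactly why this makes training less theoretically clean.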

Now, another problem with the cross-entropy method is that it sometimes tends to fall into local optima. Say you have a neural network with a weird structure, so that the gradients sometimes explode. And once they explode, there is a small chance that the network will end up in a situation where some action has a probability of almost zero.

Now, in the usual supervised learning setup, this is not so bad. In the worst case you will get not-a-number everywhere, but usually the training itself will fix this error.

In reinforcement learning, the problem is much worse, because if you have a probability of zero, it means that your agent explicitly avoids taking some particular action in some particular state. This is bad, especially if this action was the optimal one that you have not yet discovered. Since you never take this action, you'll never get the samples where this action happens in an elite session. So, you're stuck with this suboptimal policy.

How can you improve this? Well, of course there are many ways, but one way to do so is to simply regularize your network. You can try to not only minimize the cross-entropy on elite sessions, but also, as a regularizer, slightly increase the entropy of the output distribution.

As we all know, entropy is smallest when the agent is absolutely certain about one action and takes this action all the time: a probability of one for this one action, and zero for all the other actions. The highest value of entropy is achieved for the uniform distribution. Now, this means that if you regularize this way, the higher your entropy gets, the better.

It means that your agent will be biased against completely giving up on actions. So, if some action gets a near-zero probability, that probability will eventually get slightly higher, owing to the entropy term. We'll cover this in more detail in the reading section.
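Concretely, the regularized objective can be sketched like this (the coefficient `beta` and the probabilities below are illustrative; in a real implementation this loss would be differentiated through the network):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution; maximal for uniform p."""
    p = np.clip(p, 1e-10, 1.0)  # avoid log(0) when an action is nearly given up on
    return -np.sum(p * np.log(p))

def regularized_loss(log_p_elite_actions, action_probs, beta=0.01):
    """Cross-entropy on elite actions, minus a bonus for high output entropy."""
    cross_entropy = -np.mean(log_p_elite_actions)
    return cross_entropy - beta * entropy(action_probs)

uniform = np.full(4, 0.25)              # maximal entropy: log(4)
certain = np.array([1.0, 0.0, 0.0, 0.0])  # (near-)zero entropy
```

Subtracting `beta * entropy` means the optimizer is paid a little for keeping the action distribution spread out, which is exactly the bias against zero-probability actions described above.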

Now finally, since we live in the modern world and even your smartphone has more than one compute core, this basically means that whenever you have a parallelizable algorithm, you can get it to run 100 to 1,000 times as fast, depending on how many servers you have.

For the cross-entropy method, it's very simple. You have this phase where you sample 1,000 sessions; you can sample them all in parallel. Of course, sometimes this requires that you buy 1,000 separate environment emulators if it's something physical, but for video games, for example, it's very easy to parallelize.
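A minimal sketch of that parallel sampling phase, using a thread pool (a real setup would more likely use separate processes or machines, and `play_session` here is a made-up stand-in for an actual environment rollout):

```python
from multiprocessing.pool import ThreadPool
import random

def play_session(seed):
    """Stand-in for one environment rollout; returns a fake total reward."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))

# Sample many sessions concurrently instead of one after another.
with ThreadPool(processes=8) as pool:
    rewards = pool.map(play_session, range(1000))
```

Because each session is independent, the rollout phase parallelizes trivially; only the elite-selection and fitting steps need the results gathered in one place.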

Finally, there is one more neat trick here. Sometimes you want to experiment with the neural network architecture for the cross-entropy method. In some cases, if you don't want to rely only on your current observation, you can use a recurrent neural network architecture to give your agents a kind of memory that stores whatever useful information they have seen in previous observations. This is of course slightly more complicated than it sounds in this particular sentence, so we'll cover it in much more detail near the end of the course.

So, I hope the cross-entropy method is slightly less opaque to you now. Now let's get to practice.
