[SOUND] The algorithms we study this week have one common property.

This property is the fact that they treat the decision process, be it a Markov decision process or something else, as a black box, or well, almost always a black box.

You do account for the fact that you have to take states, produce probabilities

of actions and so on, so there's this iterative structure to the process.

But otherwise, the Markov assumption, for example,

is not used as heavily as we will use it later in this course.

Basically, you can think of it as, again, a black-box family of algorithms.

So, you have this decision process here, or any process.

And you have all those things, actions and

rewards, from the whole trajectory in this process.

Now the way you think of it is as some kind of box to which you feed

the parameters of your policy.

Maybe the weights of a neural network which constitutes your agent's

action probability distribution, or a table of probabilities for

every possible state, if there is a finite amount of them.

Anything you can think of, and then this box spits out the expected reward.

Or just the reward averaged over one or several trajectories.
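As a rough sketch of this "box" view, assuming a toy 5-state chain environment and a softmax table policy, both invented here purely for illustration:

```python
import numpy as np

def toy_step(state, action):
    # Hypothetical 5-state chain: action 1 moves right, action 0 moves left.
    # Being in the rightmost state yields reward 1.
    next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    return next_state, (1.0 if next_state == 4 else 0.0)

def softmax(logits):
    # Numerically stable softmax over a vector of action scores.
    z = np.exp(logits - logits.max())
    return z / z.sum()

def black_box_reward(theta, n_trajectories=20, horizon=30, seed=0):
    """Feed in policy parameters theta (a states-by-actions table of scores),
    get back the reward averaged over several trajectories."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trajectories):
        state = 0
        for _ in range(horizon):
            probs = softmax(theta[state])
            action = rng.choice(len(probs), p=probs)
            state, reward = toy_step(state, action)
            total += reward
    return total / n_trajectories
```

The caller never needs to look inside: parameters go in, one averaged reward number comes out.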

Now since we don't actually require that much from this process,

you can make this next step and assume it's a black box.

So you have a black box which takes a vector of weights.

You can just draw a few inputs here, one for each respective weight.

It spits out one number, and you want to tune these inputs to

get the output number as large as possible in expectation.
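In the abstract, then, all we assume is a function from a weight vector to one noisy number, whose expectation we want to make as large as possible. The objective below is made up for illustration; in RL it would be one rollout's total reward:

```python
import numpy as np

rng = np.random.default_rng(1)

def black_box(w):
    # Hypothetical noisy objective with its peak at w = (1, -2).
    target = np.array([1.0, -2.0])
    return -np.sum((w - target) ** 2) + rng.normal(0.0, 0.1)

def expected_reward(w, n_samples=100):
    # Average several noisy evaluations to estimate the expectation.
    return float(np.mean([black_box(w) for _ in range(n_samples)]))
```

Tuning the inputs simply means pushing `expected_reward(w)` up, without ever opening the box.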

And again, the method basically does this very thing.

Maybe not exactly a black box, but it is almost so.

And the method we're going to start with right now is, to be more accurate,

a family of methods, the so-called evolution strategies.

Now, counterintuitively, they only have a little bit to do with actual biological

evolution, despite the name.

Now the idea behind them is this: the first thing you have to do is define

a probability distribution over the inputs to your black box,

that is, over the parameters that produce the reward.

So if you use a table of probabilities, remember,

you have to feed it one number per particular action in a particular state.

So it's the number of states times the number of actions,

minus one per state if you are being purely mathematical, since the probabilities in each state sum to one.

And in case you are using a neural network, say 100 neurons followed by yet

another 100 neurons, then you have to store, in this case at least,

100 squared numbers, which are the weights of this neural network.
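Counting parameters for both cases (the concrete sizes here are just example numbers):

```python
# Tabular policy: one score per (state, action) pair.
n_states, n_actions = 10, 4
tabular_params = n_states * n_actions      # 40 numbers in the table
free_params = n_states * (n_actions - 1)   # 30, since each state's probabilities sum to 1

# Neural network: 100 neurons feeding another 100 neurons.
hidden = 100
nn_weights = hidden * hidden               # 10,000 weights in that single layer
```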

So what you do is you define them via some kind of distribution.

For example, the fully factorized normal distribution.

So you have 10,000 weights.

What you do is you have 10,000 means of those respective weights.

And you have 10,000 weight-wise variances, the sigma squared values.
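A minimal sketch of such a fully factorized (diagonal) normal distribution over the weights, with made-up initial values for the means and standard deviations:

```python
import numpy as np

dim = 10_000                 # one entry per network weight
mu = np.zeros(dim)           # 10,000 means, one per weight
sigma = np.full(dim, 0.1)    # 10,000 per-weight standard deviations (sigma**2 are the variances)

def sample_weights(mu, sigma, rng):
    # Fully factorized: every weight is drawn independently from N(mu_i, sigma_i^2).
    return mu + sigma * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
w = sample_weights(mu, sigma, rng)  # one candidate weight vector for the black box
```

Each call gives one candidate parameter vector to feed into the black box.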