Now, if you recall the definition of

mathematical expectation from any probability theory course you took,

the expectation is basically a sum or integral over all possible outcomes available,

weighted or multiplied by their respective probabilities.

In the first expectation we had on the previous slide,

the expectation over a normal distribution

turns into an integral, because there is a continuum of possible outcomes:

they are all the vectors of real numbers.

And the second part of

the formula is weighted by the probability density function of the normal distribution.

So, you have this N of theta given mu, sigma squared,

which is exactly the PDF of the normal distribution,

where you plug in the mu and sigma squared from

your parameterization from the previous slide.
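To make that first expectation concrete, here is a minimal numerical sketch. The toy function f and all parameter values below are made up for illustration: the point is just that the integral over thetas, weighted by the Gaussian PDF, can be approximated by averaging f over samples drawn from that Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

def f(theta):
    # toy stand-in for the inner expected trajectory return
    return theta ** 2

# E_{theta ~ N(mu, sigma^2)}[f(theta)] as a Monte Carlo average
thetas = rng.normal(mu, sigma, size=200_000)
estimate = f(thetas).mean()

# for f(theta) = theta^2 the exact value is mu^2 + sigma^2 = 5
print(estimate)
```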

The second part, the second expectation, is a little bit harder to think about;

even computing the probabilities might be a challenge in any practical environment.

This is an expectation over trajectories.

And a trajectory is not just a number,

but a complicated structure sampled from a process.

So, you have a first state sampled from the distribution of initial states,

or maybe it's a fixed first state, depending on the environment.

And you generally don't know the distribution of initial states until you

explicitly sample them and maybe fit some probability distribution to model them.

And then, your agent has to pick an action.

So, take the thetas from the left part of our formula,

the ones sampled from the normal distribution,

and plug those thetas into whatever policy you're using, say, a neural network.

You evaluate this policy on

the initial state and you get

the probabilities of actions from the last layer; you can use a softmax layer there.

So, you have the probabilities of actions.

Now you have to pick an action,

which is also something that should be carried out at random.

And then you feed this action back into the environment to get the next state.

The next state is again sampled from

the probability distribution over next states given the current state and action.

Then, you have to take the next action after this next state,

then the next next state, the next

next action, and so on. And on every other iteration,

the one where you sample the next state,

you have to use this unknown probability distribution over

next states, which is not given to you by the black-box environment.
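The sampling loop just described can be sketched in code. Everything here is a made-up stand-in: `ToyEnv` is a hypothetical black-box environment (its transition and reward logic are arbitrary), and the linear softmax policy and all sizes are assumptions for illustration only.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

class ToyEnv:
    """Stand-in black-box environment (assumed: 3 states, 2 actions)."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # first state ~ unknown initial-state distribution
        self.state = self.rng.integers(3)
        return self.state

    def step(self, action):
        # next state ~ unknown p(s' | s, a); here just arbitrary randomness
        self.state = self.rng.integers(3)
        reward = float(action == self.state % 2)
        done = self.rng.random() < 0.2
        return self.state, reward, done

def policy_probs(theta, state, n_actions=2):
    # plug the sampled theta into the policy: here a linear map per state
    logits = theta[state * n_actions:(state + 1) * n_actions]
    return softmax(logits)

def sample_trajectory(theta, env, max_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    s = env.reset()
    total_return = 0.0
    for _ in range(max_steps):
        p = rng.choice  # (unused alias removed below)
        probs = policy_probs(theta, s)
        a = rng.choice(len(probs), p=probs)   # the action is also drawn at random
        s, r, done = env.step(a)              # feed the action back into the env
        total_return += r
        if done:
            break
    return total_return

ret = sample_trajectory(np.zeros(6), ToyEnv())
print(ret)
```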

So, this is how you could technically,

analytically compute this thing.

But we, of course,

won't be able to do so in any real circumstance.

What we need this integral for

is to get a clue about how to maximize this expectation.

And to maximize it,

the simple way is to apply a gradient-based approach, for which you have

to compute the gradient of this J with respect to something that you can optimize,

something that you can influence this quantity with.

What variables do you compute the gradient with respect to?

Yep. Those are the parameters of the probability distribution:

the vector of mus and sigma squareds, or

maybe some other vectors, in case you define the probability distribution differently.
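Written out (with R(tau) for the trajectory return and F(theta) for the inner expectation; these symbol names are my shorthand, not the slide's), the objective is:

```latex
J(\mu, \sigma^2)
  = \mathbb{E}_{\theta \sim \mathcal{N}(\mu, \sigma^2)}
    \Big[ \mathbb{E}_{\tau \sim p(\tau \mid \theta)} R(\tau) \Big]
  = \int \mathcal{N}(\theta \mid \mu, \sigma^2)
    \underbrace{\int p(\tau \mid \theta)\, R(\tau)\, d\tau}_{F(\theta)}\, d\theta .
```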

Now, once we try to compute this gradient,

we'll find ourselves in the following situation.

If you just put the derivative sign before this double integral here,

you'll find that, luckily for us,

a large part of this integral doesn't explicitly depend on

mu and sigma squared once you sample a particular theta.

So, the second part, where you compute

the expected return from a trajectory given a particular theta,

does not depend on mu and sigma squared once you give

it a value of theta sampled from the normal distribution defined by those mu and sigma squared.

This allows you to push

the derivative sign a little bit inside and move the second integral outside of the derivative.

So, now, you get the integral over

all possible thetas of

the derivative of the normal distribution's probability density function,

times the second integral, which is just the

expected trajectory return given this particular theta.
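In symbols (again using F(theta) for the inner integral, a name I'm introducing for brevity), pushing the derivative inside the outer integral gives:

```latex
\nabla_{\mu,\sigma^2} J
  = \nabla_{\mu,\sigma^2} \int \mathcal{N}(\theta \mid \mu, \sigma^2)\, F(\theta)\, d\theta
  = \int \big( \nabla_{\mu,\sigma^2}\, \mathcal{N}(\theta \mid \mu, \sigma^2) \big)\, F(\theta)\, d\theta ,
```

which is valid because F(theta) does not depend on mu or sigma squared once theta is fixed.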

So, this is still a double integral, and we

have to devise some way to estimate it in practice.

Previously, we used Monte Carlo sampling of trajectories

for each theta, with thetas sampled from the normal distribution several times.

And right now, we have to devise something

similar, unless we want to take integrals of distributions we don't know.

So, is it possible to devise a scheme that draws samples from this one,

or is there something that prevents us from doing so?

As it turns out, we are kind of screwed.

Because previously, we had the integral weighted by the normal distribution,

which is a valid probability density function.

But the gradient of this distribution is no longer

a probability density function, or at least in our case, it isn't.

The simplest explanation here is that if you take

the derivative of a probability density function, the derivative might sometimes be,

say, negative, wherever the probability density function decreases.

And a probability density function itself cannot be negative, so we're screwed.

Of course, the other properties are broken as well, because

the gradient might be larger than one for some steep curves.
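We can check this numerically. For a Gaussian, the derivative of the PDF with respect to mu is N(theta | mu, sigma^2) * (theta - mu) / sigma^2, which takes negative values, so it cannot serve as a density to sample from. A small sketch (grid and parameter values are arbitrary):

```python
import numpy as np

mu, sigma = 0.0, 1.0
theta = np.linspace(-10.0, 10.0, 200_001)
dx = theta[1] - theta[0]

pdf = np.exp(-((theta - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
# d/dmu N(theta | mu, sigma^2) = N(theta | mu, sigma^2) * (theta - mu) / sigma^2
grad_mu = pdf * (theta - mu) / sigma ** 2

print(pdf.sum() * dx)       # ~1: the PDF is a valid density
print(grad_mu.min())        # negative somewhere: the gradient is not a density
print(grad_mu.sum() * dx)   # ~0: and it doesn't integrate to one either
```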

But this should be reason enough for us to

give up on this sampling approach and think about some other way,

and this makes things a little bit more complicated.

We'll need some trick from the math domain that helps us resolve this issue.

And this trick is, as it turns out,

a very popular one, which we're going to use several times further in the course.