To begin with, I want you to answer this simple mathematical quiz.
Let's say that you have a task to compute a derivative,
and you want to compute the derivative of the logarithm of f of x, or in this case,
the logarithm of p of z.
How do you do that?
How do you simplify the derivative?
Well, the way we've probably been taught at school
is to simply find a table of derivatives and use the chain rule.
So in this case, you can say that the derivative of the logarithm of
f of x is a product of the derivative of the logarithm
and the derivative of the function itself.
Now, the derivative of the logarithm is one over whatever is under the logarithm,
and then you multiply by the derivative of p,
or f in your abstract case.
Now, if you take the one over p of z and move it
from the right to the left, inverting it of course,
then you arrive at this other formula,
the equation which always holds:
the derivative of some function is equal to
the function itself times the derivative of its logarithm.
It's kind of a universal truth that comes from basic properties of derivatives.
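Written out, with the p of z from the quiz (the same identity holds for any positive, differentiable function f), the two steps are:
\[
\nabla \log p(z) = \frac{\nabla p(z)}{p(z)}
\qquad\Longrightarrow\qquad
\nabla p(z) = p(z)\,\nabla \log p(z).
\]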
And now we are going to apply this equation to
our formula to make it more convenient for approximation.
Now remember, we want to compute the derivative of our expected reward,
the so-called nabla J here.
The problem reduces to computing the derivative inside the inner integral,
because the outer integration does not depend on the policy in our case.
To get this into a form we can approximate,
let's first plug in our formula:
let's replace the nabla pi here with the pi times nabla log pi formula we've just derived.
The result is going to be pretty much what you'd expect.
So you'll have the same integration,
but now in the inner integral,
you'll have not a derivative,
but the policy pi itself, times the nabla logarithm of pi, times your reward.
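In symbols, writing pi(a|s) for the policy, p(s) for the state distribution and R(s, a) for the reward (the concrete letters are my choice; the transcript only names these quantities in words), the substitution looks like this:
\[
\nabla J
= \int p(s) \int \nabla \pi(a \mid s)\, R(s, a)\, da\, ds
= \int p(s) \int \pi(a \mid s)\, \nabla \log \pi(a \mid s)\, R(s, a)\, da\, ds .
\]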
The key thing about this formula
is that, unlike your original formula,
which requires you to explicitly compute the integrals
(nabla pi is not a probability density,
so that expression is not a mathematical expectation),
the second formula allows you to approximate it via sampling.
In this case, you can sample over states and sample over actions.
And then, over those samples,
you'll have to compute the thing
that is left in this formula: the part sitting under the integrals.
Now, how would this formula change if you substituted the integrals with an expectation,
as it was originally written?
Yes, right.
It would be just the expectation over
states and actions of the nabla logarithm of the policy times the reward.
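With the same assumed notation, that expectation form reads:
\[
\nabla J \;=\; \mathbb{E}_{s \sim p(s),\; a \sim \pi(a \mid s)}\bigl[\nabla \log \pi(a \mid s)\, R(s, a)\bigr],
\]
and this is exactly the quantity you estimate by averaging nabla log pi times the reward over sampled state-action pairs.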
The final thing we'll have to do
is find out how this translates to a multi-step decision process.
It's not cool if you can only solve a one-step process.
The exact derivation of
the final formula is going to be a little more complicated than the one we just did,
so we'll skip the derivation this time.
The final result is going to be unsurprising again:
instead of having a single reward here,
you use the discounted reward.
So if you want to maximize the expected discounted reward,
what you can do is sample states and actions,
and this way
you can approximate the expectation of the derivative of
the logarithm of the policy times the discounted reward,
your G, or the Q-value,
whatever you'd prefer to name it.
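As a rough illustration of that sampling-based estimate, here is a minimal sketch assuming a tabular softmax policy; the names (softmax_policy, grad_log_pi, discounted_returns) and the toy trajectory are made up for this example and do not come from the lecture.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.99
theta = np.zeros((n_states, n_actions))       # policy parameters (logits)

def softmax_policy(theta, s):
    """pi(a | s) for a tabular softmax policy (hypothetical example)."""
    logits = theta[s]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a | s) with respect to theta (same shape as theta)."""
    grad = np.zeros_like(theta)
    pi = softmax_policy(theta, s)
    grad[s] = -pi
    grad[s, a] += 1.0
    return grad

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# One sampled trajectory; toy data standing in for real environment interaction.
states  = [0, 1, 3, 2]
actions = [1, 0, 1, 1]
rewards = [0.0, 1.0, 0.0, 2.0]

# Monte Carlo estimate of nabla J: sum over t of grad log pi(a_t | s_t) * G_t.
returns = discounted_returns(rewards, gamma)
grad_J = sum(grad_log_pi(theta, s, a) * G
             for s, a, G in zip(states, actions, returns))

theta += 0.1 * grad_J   # one gradient ascent step on the expected discounted reward
```

In practice you would average this estimate over many sampled trajectories, but the structure, sample, compute nabla log pi times the discounted return, and step along the gradient, is the whole idea.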
And this is how you apply the policy gradient to Breakout,
to remote control, to any complicated process.
In the next section, we'll see how this idea, the policy gradient,
is turned into a practical algorithm called REINFORCE,
and how this algorithm can be used to solve practical problems.