
Now, we're ready to put everything together in

one particular scheme to compute

the Optimal Policy from data using Reinforcement Learning.

The first thing we do is update the optimal policy.

We first do this in terms of the increments Delta at.

The posterior probability of at is shown in equation 50.

Here, pi zero of Delta at is the prior distribution pi zero re-expressed in terms of Delta at instead of at.

Now, because pi zero of Delta at is Gaussian,

and the expression in the exponent is quadratic,

the result is another Gaussian distribution as shown in the second line of equation 50.

The new mean and covariance of this distribution are shown in equations 51.

We can view these equations as Bayesian updates of the prior values of these quantities.
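To make this kind of Bayesian update concrete, here is a small numerical sketch. This is a generic conjugate Gaussian update, not the paper's exact expressions in equations 50 and 51: multiplying a Gaussian prior by a likelihood factor with a quadratic exponent gives another Gaussian, whose precision is the sum of the two precisions.

```python
import numpy as np

def gaussian_update(mu0, Sigma0, mu1, Sigma1):
    """Combine a Gaussian prior N(mu0, Sigma0) with a Gaussian
    likelihood factor N(mu1, Sigma1).  Because the product of the
    two exponents is still quadratic, the posterior is Gaussian:
    precisions add, and the new mean is a precision-weighted average."""
    P0 = np.linalg.inv(Sigma0)           # prior precision
    P1 = np.linalg.inv(Sigma1)           # likelihood precision
    Sigma_new = np.linalg.inv(P0 + P1)   # posterior covariance
    mu_new = Sigma_new @ (P0 @ mu0 + P1 @ mu1)  # posterior mean
    return mu_new, Sigma_new

# Example: equal-precision prior and likelihood; the posterior mean
# lands halfway between, and the variance is halved.
mu, Sigma = gaussian_update(np.array([0.0]), np.eye(1),
                            np.array([2.0]), np.eye(1))
```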

Now, we are almost there, but we need one more element here, namely,

a way to find the trajectory that we used before in our scheme.

Such a trajectory is found in the model self-consistently,

by iterating several times between a forward pass and a backward pass.

The method used in this calculation is called the Iterative Linear Quadratic Gaussian regulator, or iLQG.

And it was suggested in 2005 by Todorov and Li.

Here is how the method works.

We start with an initial trajectory that is

built using the mean of the prior policy distribution.

At each step, we take the current mean of the prior as a deterministic action,

and then compute a new state using the original state equation without the noise.

Then we go backwards,

and compute the G-function by its equation, as we did above.

And this is the backward pass of the algorithm.

This produces an update for the mean of the policy.

And after that, in the forward pass,

we make a new trajectory around this mean,

and the whole thing begins again.

This is continued until convergence,

and produces a self-consistent solution for the dynamics.

The end result can be expressed as an optimal data-implied policy,

that has the same functional form as

our prior policy pi zero, but with updated parameters.

You can find the explicit expressions for the coefficients A zero and A one,

that are re-computed here at each step of

the trajectory optimization procedure in the original publications.

The updated covariance matrix,

Sigma M, is shown here, and the change is clearly seen.

Sigma M is a regularized covariance matrix of the prior,

where the regularization term is equal to beta times the matrix Gaa.
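As a small illustration of what such a regularization does (a sketch under my own assumed convention of adding the term in precision space, not necessarily the paper's exact formula), a larger beta or a larger Gaa shrinks the resulting covariance:

```python
import numpy as np

def regularized_covariance(Sigma_p, G_aa, beta):
    """Sketch of a covariance regularized by beta * G_aa (assumed
    convention): the term is added to the inverse prior covariance,
    so increasing beta or G_aa shrinks the resulting matrix."""
    return np.linalg.inv(np.linalg.inv(Sigma_p) + beta * G_aa)

# Example: identity prior covariance and identity G_aa with beta = 1
# halve the covariance.
Sigma_M = regularized_covariance(np.eye(2), np.eye(2), 1.0)
```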

So, now, let's summarize and put the whole scheme together.

This is an algorithm that applies to

our model in the conventional Reinforcement Learning setting,

when rewards are observable.

The algorithm starts with an initial trajectory as a sequence of

pairs x bar and u bar for all values of t from zero to capital T,

and this is the first forward pass.

Then we start iterating between the backward and forward passes until convergence.

First, we perform a backward pass that goes back

in time from capital T minus one to zero.

And in this pass, for each value of t: in step one,

we compute the expected value of the F-function from the next period as we showed above.

Next in step two,

we use this computed value and observed rewards,

and compute the g-function at this step.

In step three, we compute the F-function at the same time step.

And after that in step four,

we recompute the policy for this time step by updating its mean and variance.

We repeat step one to four for all time steps,

until we complete the backward pass.

Then we do a forward pass by constructing

a new trajectory using the new mean and variance,

and the whole procedure repeats.
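The forward-backward structure of this loop can be sketched in code as follows. This is only a schematic with made-up linear dynamics and a placeholder mean update, not the model's actual G-function recursion; the point is the alternation of a forward pass around the current means and a backward pass that updates them from T minus one down to zero.

```python
import numpy as np

def forward_pass(x0, means, dynamics):
    """Roll the state equation forward deterministically,
    using the current policy means as the actions."""
    xs, x = [x0], x0
    for u in means:
        x = dynamics(x, u)
        xs.append(x)
    return xs

def iterate_to_convergence(x0, means, dynamics, update_mean, n_iter=20):
    """Alternate forward passes (build a trajectory around the current
    means) and backward passes (update each mean from T-1 down to 0)."""
    for _ in range(n_iter):
        xs = forward_pass(x0, means, dynamics)
        for t in reversed(range(len(means))):   # backward pass
            means[t] = update_mean(t, xs[t], means[t])
    return means, forward_pass(x0, means, dynamics)

# Toy example: dynamics x' = x + u, and a placeholder "policy update"
# that simply steers the state to zero at the next step.
means, traj = iterate_to_convergence(
    1.0, [0.0, 0.0],
    dynamics=lambda x, u: x + u,
    update_mean=lambda t, x, u: -x)
```

After a few iterations the means and the trajectory stop changing, which is the self-consistency the lecture refers to.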

This summarizes the procedure for the Regular Reinforcement Learning with our model,

when rewards are observable.

But it turns out that even more interesting problems can

be addressed by this model in a setting when rewards are not observable.

I would like to remind you that

this setting is called Inverse Reinforcement Learning or IRL.

In this setting, we observe states and actions but not rewards.

In general, this setting is both very

interesting and more difficult than the direct Reinforcement Learning case.

We will talk more about IRL in general in a little bit,

but first, let's see how our model can work in such a setting.

It turns out that this case is easy for our model, because

Inverse Reinforcement Learning in this model amounts simply to

conventional Maximum Likelihood Estimation.

The reason for this is that the policy we obtain from this model is a Gaussian policy,

whose mean and variance depend on the parameters of the reward function.

Therefore, we can use the observed state-action data,

and apply the regular Maximum Likelihood method to such a Gaussian distribution.

This gives us the negative log-likelihood of data,

shown here in the equation 54.

Because the parameters of this Gaussian likelihood

depend on the original parameters of the reward function,

we can use Stochastic Gradient Descent or Gradient Descent methods

to compute these parameters by minimization of the negative log-likelihood.

This should be done at each iteration of the above forward backward scheme.

So, for each iteration K,

we have to substitute here values of A zero and A one,

and Sigma M computed at this iteration.
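The maximum likelihood step can be illustrated with a one-parameter toy version. This is a sketch under my own simplified setup, not equation 54 itself: observed actions follow a Gaussian policy whose mean is theta times the state, and we recover theta by gradient descent on the Gaussian negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                      # known policy standard deviation
theta_true = 0.7                 # parameter we want to recover
x = rng.normal(size=500)                              # observed states
u = theta_true * x + sigma * rng.normal(size=500)     # observed actions

def neg_log_lik(theta):
    """Gaussian negative log-likelihood (up to an additive constant)."""
    return 0.5 * np.mean((u - theta * x) ** 2) / sigma**2

# Gradient descent on the negative log-likelihood, using the
# analytic gradient of the Gaussian quadratic form.
theta, lr = 0.0, 0.005
for _ in range(300):
    grad = -np.mean((u - theta * x) * x) / sigma**2
    theta -= lr * grad
```

In the actual scheme, the same minimization would run over the reward parameters at each iteration of the forward-backward loop, with Stochastic Gradient Descent replacing plain gradient descent for large data sets.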

This procedure replaces the observed rewards of

the Reinforcement Learning setting with

the estimated rewards of the Inverse Reinforcement Learning setting.

Now, we bring in this additional element of Inverse Reinforcement Learning,

and present the full scheme of the model for the Inverse Reinforcement Learning setting.

The only new element compared to the Reinforcement Learning case is a new step two,

where we use maximum likelihood to estimate the reward parameters,

and the other parameters that enter the model,

including, in particular, the inverse temperature parameter beta.

The next step, step three,

uses these estimated rewards instead of observed rewards to compute

the G-function, and the rest of the steps

are the same as in the Reinforcement Learning setting.

This shows that computationally,

Inverse Reinforcement Learning in this model is not much

harder than the direct Reinforcement Learning case.

Now, how could we use this model to do interesting Inverse Reinforcement Learning things?

Recall that when we introduced the model,

we said that it can be used either as a model of a particular trader,

or as a model for the market portfolios such as S&P 500 portfolio.

In the first case, we need

proprietary trading data from that particular trader to learn the reward function,

and optimal policy for that trader.

If we have such data,

we can for example,

compute the implied risk aversion parameter lambda for that trader.

But, more often than not,

we do not have such proprietary data because

only brokers or traders have it and that's why it's called proprietary data.

So, what else we can do with the model is use it with open data,

namely with market data.

We can apply it to the market S&P 500 portfolio in

a similar way to the Black-Litterman model that I mentioned earlier.

The whole scheme of the model

can then be viewed as

a dynamic, probabilistic, and data-driven extension of the Black-Litterman model,

and of its inverse optimization interpretation by [inaudible] and co-workers.

In this case, the optimal policy that we computed before,

which we repeat in the formula here,

would be a market-implied optimal policy.

Given a set of commonly used predictors zt,

we can directly estimate this market implied

optimal policy using the method we just described.

On the other hand, if we have some private signals zt also called

private views in the Black-Litterman model

that are not available to the rest of the market,

we can use them to try to beat the market

by our improved optimal policy that takes those signals into account.

Such an improved policy would be obtained by

adding these signals to the list of common predictors.

You can find more details on this model in the original publications, which you will

need to consult, because you will be dealing with

this model in your final course project.

And at this point, we are ready to wrap up

our week and our whole course on Reinforcement Learning.

We will summarize what we have learned in our next lecture.