0:02

So, how do we compare two policies with each other?

One natural measure of policy performance is how much reward the policy can get out of the environment.

In fact, we will say that policy pi is greater than or equal to pi prime if, and only if, the value function of policy pi is greater than or equal to the value function of policy pi prime for each and every state.

This comparison criterion naturally yields the definition of the optimal policy.

We will say that policy pi star is the optimal policy if, and only if, it is greater than or equal to any other possible policy.
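As a minimal sketch, this ordering check is a state-by-state comparison of value functions. The state names and value functions below are hypothetical, just to illustrate the definition:

```python
def policy_geq(v_pi, v_pi_prime):
    """Check pi >= pi' under the value-function ordering:
    pi >= pi' iff v_pi(s) >= v_pi'(s) for every state s."""
    return all(v_pi[s] >= v_pi_prime[s] for s in v_pi)

# Hypothetical value functions over three states.
v_a = {"s0": 1.0, "s1": 2.0, "s2": 0.5}
v_b = {"s0": 0.9, "s1": 2.0, "s2": 0.4}

print(policy_geq(v_a, v_b))  # True: v_a dominates v_b in every state
print(policy_geq(v_b, v_a))  # False
```

Note that this is only a partial order: two policies can be incomparable when each is better in some states, which is exactly the worry addressed next.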

But what if for some environment the optimal policy does not exist?

What if for some states one policy is better, but for other states the other policy is on top?

Actually, for a Markov decision process this cannot be the case.

There is a beautiful theorem stating that in any finite Markov decision process there exists at least one optimal policy.

In addition, at least one of the optimal policies is deterministic.

The definition of the optimal policy also relies on the definitions of the optimal value and optimal action value functions.

In fact, the optimal value function, denoted v star of s, is defined as the value of the best possible policy.

That is, v star of s is equal to the value of the best possible policy that starts making its decisions from state s. The same definition applies to the action value function.

However, the interpretation of the optimal action value function is a bit different from that of the optimal value function.

The optimal action value corresponds to the expected return gained by committing action a in state s, and then, and only then, switching to the optimal policy for the rest of the episode.

Optimal value and action value functions also admit a recursive computational form.

This form is known as the Bellman optimality equation.

In fact, the Bellman optimality equation is a slight modification of the Bellman expectation equation, and now we are going to discover the differences between them.
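For reference, the two Bellman optimality equations discussed here can be written out in standard notation, with p(s', r | s, a) the environment transition probabilities and gamma the discount factor:

```latex
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]
```

The backup-diagram discussion that follows is a graphical reading of these two formulas.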

On the slide there are two copies of the backup diagram corresponding to the Bellman expectation equation for the value function.

Now we are going to modify the right one to make it match the Bellman optimality equation.

So, what do we need to change in the right diagram to represent the optimal value function computation?

Well, assume that we have already computed the optimal values for the leaves of the tree, that is, we know v star of s prime for each next state s prime.

Now recall that the arcs on the bottom level correspond to environment probabilities over which we have no control.

That is, we have to propagate the value from the bottom of the tree to the action nodes in the same way we have done it before.

In other words, we have to account for the average environment influence on the agent.

However, over the action selection on the top level of the diagram we do have control, so we can compare the values in the action nodes and propagate back to the root only the highest value.

Paraphrasing, we don't need to compute an expectation on the top level of the tree; we should just propagate up the highest value among the action roots of the sub-trees.

This procedure of taking a max at the top of the backup diagram exactly corresponds to the Bellman optimality equation.
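The max-at-the-top backup can be sketched for a small tabular MDP. The transition table, states, and actions below are hypothetical; the structure mirrors the diagram: expectation over environment outcomes at the bottom, max over actions at the top:

```python
GAMMA = 0.9

# Hypothetical transition table: P[s][a] -> list of (prob, next_state, reward).
P = {
    "s0": {"left":  [(1.0, "s1", 0.0)],
           "right": [(0.5, "s1", 1.0), (0.5, "s2", 2.0)]},
    "s1": {"left":  [(1.0, "s1", 0.0)], "right": [(1.0, "s1", 0.0)]},
    "s2": {"left":  [(1.0, "s2", 0.0)], "right": [(1.0, "s2", 0.0)]},
}

def optimality_backup(v, s):
    """One Bellman optimality backup:
    v*(s) = max_a sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))."""
    return max(
        sum(p * (r + GAMMA * v[s2]) for p, s2, r in outcomes)
        for outcomes in P[s].values()
    )

# With leaf values all zero, the "right" action wins: 0.5*1 + 0.5*2 = 1.5.
v = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
print(optimality_backup(v, "s0"))  # 1.5
```

Replacing the max with a policy-weighted average recovers the Bellman expectation backup, which is exactly the one-line difference between the two diagrams.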

There exists another point of view on the Bellman optimality equation.

Note that the outer max operator in the last formula has replaced the outer expectation, computed with respect to the policy, in the Bellman expectation equation.

However, we can unify both approaches if we treat the max operator as an expectation.

That is, an expectation with respect to the policy which is greedy with respect to the inner expectation.

I mean, the expectation over a distribution that is equal to one for the action corresponding to the highest inner expectation and zero for all other actions.

Also note that the inner expectation, that is, everything that stays to the right of the max operator, is nothing but the optimal action value function.

So, to wrap up, the optimality equation is in fact the expectation equation with the policy pi replaced by the greedy policy, greedy with respect to the optimal action value function.
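In code, this degenerate "expectation" is just an argmax over the optimal action values. The Q-table and action names below are hypothetical:

```python
# A hypothetical optimal action value table q*(s, a).
q_star = {
    ("s0", "left"): 0.5, ("s0", "right"): 1.5,
    ("s1", "left"): 0.0, ("s1", "right"): 0.2,
}

def greedy_policy(q, s, actions=("left", "right")):
    """Return a distribution over actions: probability 1 on the argmax
    of q(s, .), zero elsewhere -- the greedy policy from the lecture."""
    best = max(actions, key=lambda a: q[(s, a)])
    return {a: 1.0 if a == best else 0.0 for a in actions}

print(greedy_policy(q_star, "s0"))  # {'left': 0.0, 'right': 1.0}
```

Taking an expectation under this one-hot distribution picks out exactly the max term, which is why the two readings of the equation agree.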

The same logic applies to the Bellman optimality equation for the action value function.

The top level arcs of the backup diagram for the q function consist of environment reactions, so we have no control over the top level arcs, because we have no control over environment probabilities.

So, we have to compute the expectation on the top level of the tree.

However, we do have control over action probabilities on the bottom level.

And just like for the value function, here we also replace the expectation with a max operator.

And again, this max operator could be seen as an expectation under a policy which is greedy with respect to the optimal action values of the leaves.
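A matching sketch of the q-function backup, with the order of the operators flipped relative to the value-function case: expectation over environment outcomes at the top, max over next actions at the bottom. The transition table is hypothetical, in the same format as before:

```python
GAMMA = 0.9

# Hypothetical transitions: P[s][a] -> list of (prob, next_state, reward).
P = {
    "s0": {"left":  [(1.0, "s1", 0.0)],
           "right": [(0.5, "s1", 1.0), (0.5, "s2", 2.0)]},
    "s1": {"left":  [(1.0, "s1", 0.0)], "right": [(1.0, "s1", 0.0)]},
    "s2": {"left":  [(1.0, "s2", 0.0)], "right": [(1.0, "s2", 0.0)]},
}

def q_optimality_backup(q, s, a):
    """One Bellman optimality backup for q*:
    q*(s,a) = sum_{s',r} p(s',r|s,a) * (r + gamma * max_a' q(s',a'))."""
    return sum(
        p * (r + GAMMA * max(q[(s2, a2)] for a2 in P[s2]))
        for p, s2, r in P[s][a]
    )

# With all leaf action values zero: 0.5*1 + 0.5*2 = 1.5.
q = {(s, a): 0.0 for s in P for a in P[s]}
print(q_optimality_backup(q, "s0", "right"))  # 1.5
```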

Well, as you might have noticed, during this week a lot was about expressing the solution of the value computation problem in a recursive fashion.

That was basically dynamic programming in action.

However, there is a lot more to be discussed.

So, we now know what the Bellman expectation and optimality equations are, and that knowledge is fundamental for the rest of the course and reinforcement learning in general.

In the next videos, we are going to use these equations to actually find an optimal policy. So, stay tuned.
