0:02

So let's get to the second part of our material for this week. This time we're going to study, in more depth, the ramifications of training by policy-based methods, in comparison to the value-based ones we already know. Now, instead of giving you yet another list of many things, I want you to analyze, to draw some of the conclusions yourselves. Remember, there are some key differences in terms of what value-based methods learn and what policy-based methods learn. And I want you to guess what the possible consequences of this difference are.

Yes, the actions themselves. And yes, there is definitely more than one thing in which they differ. The most important advantage of policy-based methods is that they learn the simpler problem: the policy directly, rather than values. We'll see how this difference in approach gives you better average rewards later on, when we cover particular implementations of policy-based algorithms.

Now, another huge point here is that value-based and policy-based algorithms have different ideas of how they explore. In value-based methods, if you remember, you have to specify an explicit exploration strategy, such as the epsilon-greedy strategy or the Boltzmann softmax strategy. Basically, you have your Q-values, and you determine the probabilities of actions given those Q-values and any other parameters you want.
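As a minimal sketch of the two exploration strategies mentioned above (the function names and toy Q-values are my own, not from the lecture), here is how each one turns Q-values into action probabilities:

```python
import math


def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the greedy action with prob. 1 - epsilon, a uniformly random one otherwise."""
    n = len(q_values)
    probs = [epsilon / n] * n
    best = max(range(n), key=lambda a: q_values[a])
    probs[best] += 1.0 - epsilon
    return probs


def boltzmann(q_values, temperature=1.0):
    """Softmax over Q-values; lower temperature makes the choice greedier."""
    exps = [math.exp(q / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


q = [1.0, 2.0, 0.5]  # toy Q-values for three actions
print(epsilon_greedy(q, epsilon=0.3))  # best action gets 0.8, others 0.1 each
print(boltzmann(q, temperature=1.0))   # smooth preference ordered by Q-value
```

Note that epsilon is an explicit knob you have to set by hand, which is exactly the point the lecture makes next.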

Now, in policy-based methods, you don't have this. Instead, you have to sample from your policy, and this is both a boon and a quirk, basically. What you get in policy-based methods is that you cannot explicitly tell the algorithm that it should over- or under-explore. Instead, the algorithm decides for itself whether it wants to explore more at this stage, because it's not yet sure what to do, or whether it wants the opposite, because the best action is obvious from the outset.
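To illustrate the point above with a minimal sketch (the probability vectors are invented for illustration): exploration simply falls out of sampling the policy, so an uncertain policy explores a lot and a confident one barely at all, with no epsilon in sight.

```python
import random


def sample_action(policy_probs, rng=random):
    """Sample an action index from the policy's probability vector."""
    r = rng.random()
    cumulative = 0.0
    for action, p in enumerate(policy_probs):
        cumulative += p
        if r < cumulative:
            return action
    return len(policy_probs) - 1  # guard against floating-point rounding


# An uncertain policy explores on its own; a confident one almost never does.
uncertain = [0.34, 0.33, 0.33]
confident = [0.98, 0.01, 0.01]
```

There is no exploration parameter here: the amount of exploration is whatever the learned probabilities happen to be.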

You can, of course, still affect how policy-based algorithms explore; we'll work out how in just a few slides.

Now, finally, you can point out some areas where current scientific progress is better developed for value-based methods than for policy-based ones, and vice versa. For value-based methods, their main strength is that, since they learn values, they give you a free estimate of how good a particular state is. You can use it for diagnostics and for other algorithms that rely on this value estimate. Value-based methods also have more mechanisms designed to train off-policy. For example, both Q-learning and expected value SARSA, simple algorithms, may be trained on sessions sampled cheaply from experience just as well as on their own sessions.

The main advantage here is that, since you can train off-policy, you improve sample efficiency. The idea is that your algorithm requires less training data, less actual playing, to converge to the optimal strategy.

Now, of course, there are similar ways you can reuse past sessions for policy-based methods. But they are slightly harder to grasp and even harder to implement.

Speaking of the advantages of policy-based methods: first, you have an innate ability to work with any kind of probability distribution. For example, if your actions are not discrete but continuous, you can specify a multi-dimensional normal distribution, or a Laplacian distribution, or anything you want for your particular task. And you can just plug it into the policy gradient formula and it will work like a blaze.
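As a minimal sketch of the continuous-action case (a one-dimensional Gaussian policy; the function names are my own), the only thing the policy gradient formula needs from the distribution is the ability to sample actions and to compute log pi(a|s):

```python
import math
import random


def gaussian_sample(mean, std, rng=random):
    """Sample a continuous action from N(mean, std^2)."""
    return rng.gauss(mean, std)


def gaussian_log_prob(action, mean, std):
    """log pi(a|s) for a 1-D Gaussian policy -- this is the term the policy
    gradient differentiates, exactly where a discrete softmax log-prob
    would otherwise go."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (action - mean) ** 2 / (2 * std ** 2)


# A network would output (mean, std) as a function of the state; here they are fixed.
a = gaussian_sample(mean=0.0, std=1.0)
lp = gaussian_log_prob(a, mean=0.0, std=1.0)
```

Swapping the Gaussian for a Laplacian, or any other density, changes only these two functions, not the surrounding algorithm.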

Now, basically, this allows you to train not-actually-terribly on continuous action spaces, and of course you can do better with special-case algorithms that we're going to cover in the reading section. Still, all things considered, it is a strong argument for using policy-based methods.

Finally, since policy-based methods learn the policy, the probability of taking an action in a state, they have one super neat property: they train exactly the same thing you train with supervised learning methods. This basically means that you can transfer between policy-based reinforcement learning and supervised learning without changing anything in your model. So you have a neural network, and you can train it as a classifier and then reuse it as the policy of an agent trained by REINFORCE or actor-critic. You might have to train another head for actor-critic, but that is not as hard as retraining the whole set of Q-values.
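A minimal sketch of the idea above (a toy linear model; the class and attribute names are invented for illustration): the same softmax head serves both as a classifier during supervised pre-training and as the agent's policy, and the extra head for actor-critic is just one more scalar output on the shared features.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


class PolicyNet:
    """Toy linear model with two heads on shared input features:
    a softmax head (classifier and policy at once) and a scalar
    value head added for the actor-critic baseline."""

    def __init__(self, n_features, n_actions):
        self.w_policy = [[0.0] * n_features for _ in range(n_actions)]
        self.w_value = [0.0] * n_features  # the extra actor-critic head

    def policy(self, state):
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.w_policy]
        return softmax(logits)

    def value(self, state):
        return sum(w * x for w, x in zip(self.w_value, state))
```

Training `policy` with cross-entropy on labeled examples and then with a policy gradient on rewards uses the identical forward pass, which is the transfer the lecture describes.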
