0:02

Welcome back to the course. So by this time, you've probably done some great things with value-based methods. You've studied how to apply Q-learning, maybe value iteration before that, and the honor track has some bonus algorithms as usual.

Now, the idea behind value-based methods, which is easy to explain intuitively, is that to find an optimal action, you want your algorithm to learn how much reward, how much discounted reward, you think it will get if it starts from this state, maybe takes this action, and then follows its policy, or the optimal policy in the case of Q-learning. This approach is very useful, provided that you know this function. But today, we're going to study another family of methods that explicitly avoid learning the Q or V or any other value-related function.

To understand why, let's use a simple example. The next slide contains a simple question. It doesn't have any math to it; it's a question a toddler could answer, so I want you to answer as quickly as possible. You have this Breakout game here, and the question is: what action do you take, left or right? Quick. Now, obviously right in this case, because it comes intuitively. But now let's answer another question. There will be another question on the next slide, and this one will be slightly more mathematical, but it's what your reinforcement learning algorithm has to learn. Quickly, answer now: what's the Q-function of this state and the action right? You have 10 seconds.

Â 1:38

It's a bit harder than the previous one, isn't it? The idea here is that it's really easy for you to understand what to do, but Q-learning doesn't explicitly learn what to do. Instead, it tries to learn what value, what Q-function, you'll get if you do this, and that's hard, especially if you consider it applied to everyday problems.

Say you have a very simple problem: whether or not to go for coffee. You can make yourself a coffee in another room, so you can either go there, drink a coffee, and then proceed, or you can stay here and avoid drinking coffee. What you actually do is, you either feel like you want to do this or you feel like you don't, and this is very simple. What Q-learning has to do is try to learn the value of your entire life from this moment to the end: it tries to add up all the rewards you're going to get this day, the next day, the day after that, with some gamma-related coefficients, and this is rather impractical, especially because it requires predicting what's going to happen in the future.
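What "adding up all the rewards with gamma-related coefficients" means, as a tiny sketch (the function name is hypothetical, not from the course materials):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards weighted by gamma**t -- the quantity Q-learning must predict."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# e.g. three days of reward 1.0 each, heavily discounted:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```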

So when I say difficult, I actually mean it's not only difficult for you to add up rewards with gammas; it's also difficult for a neural network, or for that matter any other algorithm, to approximate. When your DQN, or something related to DQN, is trying to learn the game of Breakout, or whether it wants to drink a cup of coffee, you actually have a squared-error minimization problem under the hood. It tries to minimize the squared error between the predicted Q-function and the temporal-difference-corrected Q-function, which is on the right here: basically, reward plus gamma times whatever. And if you remember last week, we considered this last term, reward plus gamma times whatever, to be constant, not dependent on the parameters of the Q-function.
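As a rough sketch of that objective (NumPy, hypothetical names; in a real DQN the target term would additionally come from a frozen target network), the loss being minimized looks like:

```python
import numpy as np

def td_loss(q_values, next_q_values, actions, rewards, gamma=0.99):
    """Mean squared TD error for a batch of transitions.

    q_values:      [batch, n_actions] predicted Q(s, a) for the current states
    next_q_values: [batch, n_actions] Q-values for the next states, treated as
                   a constant: no gradient flows through this target term
    """
    batch = np.arange(len(actions))
    q_sa = q_values[batch, actions]                       # Q(s, a) actually taken
    target = rewards + gamma * next_q_values.max(axis=1)  # r + gamma * max_a' Q(s', a')
    return np.mean((q_sa - target) ** 2)
```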

When it comes to real-world applications, your neural network's size is usually insufficient, because otherwise it would train for ages. What that actually means is that your neural network will never be able to approximate all the Q-values with zero error; it will have some approximation error, and what it actually does is find the approximation that minimizes the loss function, a mean squared error in this case.

So now, imagine you have two possible Q-functions, two possible outcomes of your learning, and you compare them on two different states, s0 and s1. In those two states you have two actions, a0 and a1. Let's imagine you only have two actions, and that on all other states your networks are identical; this is just for simplification.

The first column here holds the true Q-values: the Q-values your agent would actually get if it took this particular action and then followed its policy, or the optimal one. The first two rows correspond to s0: there, the first action brings you a true Q-value of 1, and the second brings you 2. Then you have the second state, s1, where the first action brings you 3 and the second brings you 100. This is not very realistic, but it serves the point of explaining the idea.

Now, you have two networks, two possible approximations. The first approximation, A, is exact on the first state: it captures the Q-values 1 and 2 exactly. On the second state it gets the first action right, but it fails to grasp the actual Q-value of the second action. Then you have the second possible outcome of learning, B. In this case, the second state is approximated ideally, but the first state, s0, has its Q-values off: it has an error of plus one and minus one there. The question is: which of those two Q-functions would you prefer? Which of them would get a better average reward per session, or which of them will take the optimal action more frequently?

Â 5:50

So, you've probably noticed that option A, the network under letter A, has this neat property: it does give you some error, but if you take the optimal action, the argmax, that argmax is the same for this Q-network and for the optimal Q-function. Despite being slightly off at (s1, a1), it will in fact yield the optimal policy here, unless of course there are some other states we have not considered. The second network, network B, doesn't have this property: although its error is very small, in state s0 it will take the suboptimal action, so its average reward is sure to be slightly worse. To conclude, network A should be better.

Now let's pose another question. You have this squared-error minimization problem. Which of those two options, A or B, will your network prefer when minimizing the squared error? Which of them has the smaller squared error?

Â 6:49

Yes, right. It's kind of the other way around. Option A has a squared error of 50 squared, which is 2500, quite a lot, and option B only has a squared error of 2: plus one squared and minus one squared added up. Now, this is the problem. The problem is that your deep Q-network, while trying to minimize the squared error, will avoid the optimal policy and converge to something different, which, although more accurate in the sense of the loss function, is not what you actually want: you want to play optimally.
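To make this concrete, here is a small NumPy sketch using the lecture's toy numbers: it computes the summed squared error the loss sees, and checks whether each approximation preserves the greedy (argmax) policy, which is what we actually care about.

```python
import numpy as np

# True Q-values from the toy example: rows are states s0, s1; columns a0, a1.
q_true = np.array([[1.0,   2.0],
                   [3.0, 100.0]])

# Network A: exact on s0, off by 50 on (s1, a1) -- but the argmax is preserved.
q_a = np.array([[1.0,  2.0],
                [3.0, 50.0]])

# Network B: exact on s1, off by +1/-1 on s0 -- which flips the argmax there.
q_b = np.array([[2.0,   1.0],
                [3.0, 100.0]])

for name, q in [("A", q_a), ("B", q_b)]:
    sse = np.sum((q - q_true) ** 2)                     # what the loss sees
    greedy_ok = np.array_equal(q.argmax(axis=1),
                               q_true.argmax(axis=1))   # what we care about
    print(f"{name}: squared error {sse}, optimal policy preserved: {greedy_ok}")
```

Minimizing the squared error picks B (error 2 versus 2500), even though only A keeps the optimal policy.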

Besides, approximating the Q-function is really hard, because, again, remember the coffee example: you have to predict what happens next. What if you could avoid learning this Q-function or V-function, or ignore it, and try to approximate something else instead? In this case, why don't we just learn the policy? We want the optimal policy; we want to take optimal actions; so why not directly learn the optimal policy distribution? During the earlier weeks of our course, we did have one method, or a few methods if you followed the honor track, that fits this description of not learning a Q or V function. Instead, it explicitly learns the probability of taking each action in a state, either with a table or with some approximation.

Now, what kind of algorithm works this way? Yes, it's the cross-entropy method. What it does is simply play some games with its current policy, then minimize the cross-entropy between the policy and the games that scored better than average, or better than some percentile. We'll cover this in more detail in just a few slides.
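As a rough sketch of that update (a tabular variant with hypothetical helper names, not the course's exact implementation): play sessions, keep the ones above a reward percentile, and refit the policy to the (state, action) pairs those elite sessions took.

```python
import numpy as np

def cem_update(policy, sessions, percentile=70):
    """One cross-entropy method step on a tabular policy.

    policy:   [n_states, n_actions] array of action probabilities
    sessions: list of (states, actions, total_reward) tuples
    """
    rewards = [r for _, _, r in sessions]
    threshold = np.percentile(rewards, percentile)
    # Keep only the "elite" sessions: those at or above the reward threshold.
    elites = [(s, a) for s, a, r in sessions if r >= threshold]

    counts = np.zeros_like(policy)
    for states, actions in elites:
        for s, a in zip(states, actions):
            counts[s, a] += 1                 # count elite (state, action) pairs

    new_policy = np.zeros_like(policy)
    for s in range(policy.shape[0]):
        total = counts[s].sum()
        # Fit to elite frequencies; states never seen in elites stay uniform.
        new_policy[s] = counts[s] / total if total > 0 else 1.0 / policy.shape[1]
    return new_policy
```

Note that everything below the percentile threshold is simply thrown away, which is exactly the sample inefficiency discussed next.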

Now, this algorithm had a lot of drawbacks. It required a lot of samples, it discarded a lot of samples, and it had issues you would solve with temporal-difference rewards and with Q-learning, but it also had this neat efficiency to it: it learned rather complicated environments really quickly, provided of course that you can sample a lot of sessions there. The reason the cross-entropy method is so efficient, given enough samples, is that it solves a simpler problem: it doesn't try to approximate the Q-function, it simply tries to learn what to do.

So imagine you're in front of a tiger and you have to decide whether you want to pet the tiger, run from the tiger, provoke the tiger, or ignore the tiger. What Q-learning does is try to estimate the quality of your life under each of those choices by doing some calculations, and then simply pick the action that maximizes the expected quality of life. But in reality, before you finish doing that, the tiger is going to eat you. What the cross-entropy method did instead is simply learn the probabilities: this is the kind of thing you should do if you don't want to get eaten. Thankfully, we humans don't need a Q-function every time we want to make a decision.
