0:02

Now, one popular solution to this problem of optimism is the so-called double Q-learning.

What it says is basically that if you cannot trust one Q-function,

double Q-learning lets you learn two of them that train one another.

You have Q1 and Q2.

Those are two independent estimates of the actual value function.

So in double Q-learning,

those are just two independent tables,

and in the deep Q-network case,

those are two neural networks with separate sets of weights.

What happens is, you update them one after the other,

and you use one Q-function to train the other one, and vice versa.

Let's take a closer look at the update rule.

We have Q1, and we update it by the following rule:

you take the reward plus gamma times the maximum of Q over the next state's actions,

but this time you do this maximization in a very cunning way.

You take the action which Q1 deems optimal,

and you take that action's value from Q2.

Or you can interchange those two roles:

you can take the action which is deemed optimal by Q2

and take the action value of that action from Q1.
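The update just described can be sketched in code. This is a minimal tabular sketch; the table sizes, learning rate `alpha`, and discount `gamma` are illustrative assumptions, not values from the lecture:

```python
import random

import numpy as np

# Illustrative sizes and hyperparameters (assumptions, not from the lecture)
n_states, n_actions = 5, 3
alpha, gamma = 0.1, 0.99

# Two independent tabular estimates of the action-value function
Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_next):
    """One double Q-learning step: flip a coin to pick which table to update.
    One table chooses the greedy action, the other evaluates it."""
    if random.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))      # Q1 chooses the action
        target = r + gamma * Q2[s_next, a_star]  # Q2 evaluates it
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))      # roles interchanged
        target = r + gamma * Q1[s_next, a_star]
        Q2[s, a] += alpha * (target - Q2[s, a])
```

The coin flip is one common way to alternate the two updates; updating them strictly one after the other works as well.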

Now, this defeats the over-optimism, because here is what happens:

your Q-functions may be over-optimistic,

or pessimistic, or whatever,

just because of the noise and how they are trained.

Maybe one of the Q-functions happens to get

several updates with lucky next states, and therefore

it's too optimistic due to the errors that accumulate.

Now, the idea here is that

if you take an action which is deemed optimal by the first Q-function,

where, say, action one is optimal

just because of random noise,

then there is no connection between this over-optimism for action one in Q1

and the same over-optimism in Q2.

In fact, the value of the same action in Q2

is more or less independent.

It can be over-optimistic or over-pessimistic, or it can be exactly the true value.

The idea here is that the noise in Q2 is independent of the noise in Q1.

And if you update them that way,

then the maximization takes the sampling error into account.

If all true action values are equal, for example,

then the argmax of, say,

Q2 is going to be basically a random action, because of how the noise works.

If you then take the expectation of Q1's value at that random action,

you'll get exactly the maximum of expectations rather than the expectation of the maximum,

in the limit of course, if you take enough samples.
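This argument can be checked numerically. A small simulation sketch, assuming all true action values are zero and the two estimates carry independent Gaussian noise (the sizes and noise scale are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# True action values are all zero; estimates are independently noisy.
n_actions, n_samples = 4, 100_000
q1 = rng.normal(0.0, 1.0, size=(n_samples, n_actions))  # noisy estimate 1
q2 = rng.normal(0.0, 1.0, size=(n_samples, n_actions))  # independent estimate 2

# Single estimator: expectation of the maximum, biased upward.
single = q1.max(axis=1).mean()

# Double estimator: argmax by q1, value by q2, roughly unbiased.
a_star = q1.argmax(axis=1)
double = q2[np.arange(n_samples), a_star].mean()

print(single)  # noticeably above zero
print(double)  # close to zero
```

The single estimator's average lands well above the true value of zero, while the double estimator's average stays near zero, which is exactly the point of the construction.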

You do the same thing with Q2.

Basically, you take Q2 and you use Q1 to help it update itself.

So you maximize with one Q-network

and take the action value of this maximal action from the other one.

And here's how it works:

basically, you're training two networks,

and since they are more or less decorrelated,

they have different realizations of the noise rather than one shared noise,

so the over-optimism disappears.

Now let's see how we can apply this to a DQN more efficiently.

Just as a reminder, DQN is again just a convolutional neural network

trained with experience replay

and target networks to stabilize training.

Now by default, you could of course train two Q-networks,

Q1 and Q2, but this would effectively double the convergence time.

So if it previously converged in one week on a GPU,

then now it's going to take two weeks,

and that's kind of unacceptable in the scope of our course.

Instead, I want you to think of some smarter way:

you could use the current state of DQN to get the same effect.

In fact, what we need is

some way to maximize over one network and take the value from the other network,

and this basically requires that you have two networks that are kind of decorrelated.

They are not, statistically speaking, absolutely independent,

but they are expected to have different realizations of the local noise.

Now, can we find some pair of networks to do this trick in

DQN without having to retrain another network from scratch? How do we do that?

Well, yes.

One way you could try to solve this problem, and the way the actual article that

introduced this method suggested,

is to use the target network,

the older snapshot of your network, as

the source of independent randomness in place of the other Q-network.

So you only train your Q1,

but instead of using a second trained Q-network to do this smart

maximizing and taking of the action value,

you just take the action value from your old Q-network,

the target network, corresponding to an action that is optimal under the current Q-network.

Let's walk through this step by step.

In your usual DQN,

you have this update rule:

the first rule here just takes the reward plus gamma

times the maximum over the target network's action values.

You can rewrite this mathematically by simply replacing

the maximization over action values with taking the value of the maximal action.

So basically, substituting max with argmax here.
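The rewrite leans on a simple identity: the maximum over action values equals the value taken at the argmax action. A one-line numpy check, with made-up action values:

```python
import numpy as np

q = np.array([1.0, 3.0, 2.0])   # hypothetical action values in the next state

# max over actions equals the value of the argmax action
assert q.max() == q[np.argmax(q)]
```

The trick is then to let one network supply the argmax and the other supply the value.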

What we're going to do next

is we're going to assign

those two Q-functions on the right-hand side to different networks.

So the first Q-function, which we use to take the action value,

is the target Q-network, because it provides stability and so on.

The other Q-network, which is used to pick

the maximal action, is our own trainable Q-network. The main one.

Therefore, we take the old Q-network's Q-values,

corresponding to actions that are optimal under our current Q-network,

which are going to be more or less independent,

provided that we update our target network rarely enough.

And of course, in our usual DQN,

the target network updates happen every, say, 100,000 iterations,

so the dependencies there are more or less negligible.
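The resulting target computation can be sketched as follows. A minimal numpy sketch for a batch of transitions, ignoring terminal states; the function and argument names are illustrative, not the actual course code:

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target for a batch of transitions.

    next_q_online / next_q_target: [batch, n_actions] action values
    predicted by the trainable network and by the frozen target network.
    """
    # The trainable (online) network picks the greedy action...
    a_star = np.argmax(next_q_online, axis=1)
    # ...and the target-network snapshot evaluates that action.
    evaluated = next_q_target[np.arange(len(a_star)), a_star]
    return rewards + gamma * evaluated
```

Compare with vanilla DQN, which would use `next_q_target.max(axis=1)` here, taking both the argmax and the value from the same (target) network.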

Strictly speaking, it's more or less a heuristic which doesn't

guarantee anything, but it's very unlikely to fail.

So it's a practical algorithm which uses some duct tape and some black magic to

get efficient results without training another set of parameters from scratch.
