0:00

In this video we're going to look at the momentum method for improving the learning speed when doing gradient descent in a neural network. The momentum method can be applied to full-batch learning, but it also works for mini-batch learning. It's very widely used, and probably the commonest recipe for learning big neural nets is to use stochastic gradient descent with mini-batches combined with momentum. I'm going to start with the intuition behind the momentum method.

0:42

The ball starts off stationary, and so initially it will follow the direction of steepest descent: it will follow the gradient. But as soon as it's got some velocity, it'll no longer go in the same direction as the gradient. Its momentum will make it keep going in the previous direction. Obviously we want it eventually to get to a low point on the surface, so we want it to lose energy. So we need to introduce a bit of viscosity. That is, we make its velocity die off gently on each update.

1:15

What the momentum method does is damp oscillations in directions of high curvature. So if you look at the red starting point, and then look at the green point we get to after two steps, they have gradients that are pretty much equal and opposite. As a result, the gradient across the ravine has cancelled out. But the gradient along the ravine has not cancelled out. Along the ravine, we're going to keep building up speed, and so, after the momentum method has settled down, it'll tend to go along the bottom of the ravine, accumulating velocity as it goes. If you're lucky, that'll make you go a whole lot faster than if you just used steepest descent.

The equations of the momentum method are fairly simple. We say that the velocity vector at time t is just the velocity vector at time t minus one, attenuated a bit. Time here counts the updates of the weights, so it's the velocity vector we got after mini-batch t minus one. We multiply by some number like 0.9, which is really viscosity, or it's related to viscosity, but unfortunately I called it momentum. So we now call alpha momentum. And then we add in the effect of the current gradient, which is to make us go downhill by some learning rate times the gradient that we have at time t. That'll be our new velocity at time t. We then make our weight change at time t equal to that velocity. The velocity can actually be expressed in terms of previous weight changes, as shown on the slide, and I'll leave it to you to follow the math.
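The update just described can be sketched in a few lines of Python. This is a toy illustration, not code from the course; the quadratic "ravine" error surface and all the constants are made up for the example.

```python
import numpy as np

def momentum_step(w, v, grad_fn, eps=0.01, alpha=0.9):
    """One update of the standard momentum method:
    v(t) = alpha * v(t-1) - eps * dE/dw(t)   (alpha is the 'momentum')
    delta_w(t) = v(t), i.e. the weight change equals the velocity."""
    v = alpha * v - eps * grad_fn(w)
    w = w + v
    return w, v

# Toy quadratic ravine E(w) = 0.5 * (10*w0^2 + w1^2): steep across, shallow along.
grad = lambda w: np.array([10.0 * w[0], w[1]])

w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad)
# w ends up close to the minimum at the origin.
```

Note that the gradient across the ravine (the steep direction) oscillates in sign and largely cancels in the velocity, while the component along the ravine keeps accumulating, just as described above.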

Â 3:07

The behavior of the momentum method is very intuitive. On an error surface that's just a plane, the ball will reach some terminal velocity, at which the gain in velocity that comes from the gradient is balanced by the multiplicative attenuation of velocity due to the momentum term, which is really viscosity. If that momentum term is close to one, then it'll be going down much faster than a simple gradient descent method would.
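That balance is easy to check numerically. On a plane the gradient g is constant, and the velocity update v &lt;- alpha*v - eps*g has a fixed point at v* = -eps*g / (1 - alpha). The constants below are illustrative, not from the lecture.

```python
def terminal_velocity(g, eps=0.001, alpha=0.99, steps=2000):
    """Iterate the velocity update on a plane (constant gradient g).
    At the fixed point, v* = alpha*v* - eps*g,
    so v* = -eps*g / (1 - alpha): 1/(1-alpha) times the plain step."""
    v = 0.0
    for _ in range(steps):
        v = alpha * v - eps * g
    return v

v_inf = terminal_velocity(g=1.0)
# With alpha = 0.99 this converges to -0.1, which is 100 times the size
# of the plain gradient step -eps*g = -0.001.
```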

3:40

So the terminal velocity, the velocity you get at time infinity, is the gradient times the learning rate, multiplied by a factor of one over one minus alpha. So if alpha is 0.99, you'll go 100 times as fast as you would with the learning rate alone.

You have to be careful in setting momentum. At the very beginning of learning, if you make the initial random weights quite big, there may be very large gradients. You have a bunch of weights that's completely no good for the task you're doing, and it may be very obvious how to change those weights to make things a lot better. You don't want a big momentum there, because you're going to quickly change them to make things better, and then you're going to start on the hard problem of finding just the right relative values of different weights, so you have sensible feature detectors. So it pays at the beginning of learning to have a small momentum. It is probably better to have 0.5 than zero, because 0.5 will average out some of the sloshing in obvious ravines. Once the large gradients have disappeared and you've reached the sort of normal phase of learning, where you're stuck in a ravine and need to go along the bottom of it without sloshing to and fro sideways, you can smoothly raise the momentum to its final value. You could raise it in one step, but that might start an oscillation.

You might wonder why we don't just use a bigger learning rate. What you'll discover is that using a small learning rate and a big momentum allows you to get away with an overall learning rate that's much bigger than you could have had with a learning rate alone and no momentum. If you use a big learning rate by itself, you'll get big divergent oscillations across the ravine.
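The advice above, start near 0.5 and then raise the momentum smoothly once the large initial gradients are gone, can be written as a simple schedule. The linear ramp and the step counts here are illustrative assumptions, not values from the lecture.

```python
def momentum_schedule(t, warmup=500, ramp=2000, start=0.5, final=0.99):
    """Momentum as a function of the update count t: hold at `start`
    during early learning, then raise it linearly to `final` between
    steps `warmup` and `warmup + ramp`. All constants are illustrative."""
    if t < warmup:
        return start
    frac = min(1.0, (t - warmup) / ramp)
    return start + frac * (final - start)
```

In practice you would pass momentum_schedule(t) as alpha into the velocity update at each weight update t.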

5:34

Very recently Ilya Sutskever has discovered that there's a better type of momentum. The standard momentum method works by first computing the gradient at the current location. It combines that with its stored memory of previous gradients, which is in the velocity of the ball, and then it takes a big jump in the direction of the current gradient combined with the previous gradients. So that's its accumulated gradient direction.

6:06

Ilya Sutskever has found that it works better in many cases to use a form of momentum suggested by Nesterov, who was trying to optimize convex functions: we first make a big jump in the direction of the previously accumulated gradient, and then we measure the gradient where we end up and make a correction. It's very, very similar, and you need a picture to really understand the difference. One way of thinking about what's going on is that in the standard momentum method, you add in the current gradient and then you gamble on this big jump. In the Nesterov method, you use your previously accumulated gradient, you make the big jump, and then you correct yourself at the place you've got to. So here's the picture, where we first make the jump and then make a correction. Here is a step in the direction of the accumulated gradient. This depends on the gradient we accumulated in our previous iterations. We take that step. We then measure the gradient and go downhill in the direction of the gradient, like that. We then combine that little correction step with the big jump we made to get our new accumulated gradient.

7:33

We then take that accumulated gradient and attenuate it by some number, like 0.9 or 0.99, and we take our next big jump in the direction of that accumulated gradient, like that. Then again, at the place where we end up, we measure the gradient and go downhill. That corrects any errors we made, and gives us our new accumulated gradient. Now if you compare that with the standard momentum method: the standard momentum method starts with an accumulated gradient, like that initial brown vector, but then it measures the gradient where it is, at its current location, and adds that to the brown vector, so that it makes a jump like this big blue vector, which is just the brown vector plus the current gradient. It turns out, if you're going to gamble, it's much better to gamble and then make a correction, than to make a correction and then gamble.
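The gamble-then-correct order can be sketched as follows; the only difference from standard momentum is where the gradient is measured. Again, this is a toy illustration with made-up constants, not code from the course.

```python
import numpy as np

def nesterov_step(w, v, grad_fn, eps=0.01, alpha=0.9):
    """Nesterov momentum: first make the big jump along the previously
    accumulated gradient (the velocity), then measure the gradient where
    we end up and use it to correct the velocity."""
    lookahead = w + alpha * v                  # the big jump (the gamble)
    v = alpha * v - eps * grad_fn(lookahead)   # the correction, measured there
    w = w + v
    return w, v

# Toy quadratic ravine E(w) = 0.5 * (10*w0^2 + w1^2).
grad = lambda w: np.array([10.0 * w[0], w[1]])

w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(300):
    w, v = nesterov_step(w, v, grad)
# w ends up close to the minimum at the origin.
```

Compared with the standard method, the only change is that grad_fn is evaluated at w + alpha*v (after the jump) rather than at w (before it).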
