0:00

In this video, I'm first going to introduce a method called rprop, which is used for full-batch learning. It's like Robbie Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method we end up with, called rmsprop, is currently my favorite basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of dealing with the fact that gradients vary widely in their magnitudes.

1:13

For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big. Rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change a weight, you don't look at the magnitude of the gradient; you just look at its sign. But you do look at the step size you decided on for that weight, and that step size adapts over time, again without looking at the magnitude of the gradient.

So we increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree. This is like Robbie Jacobs' adaptive weights method, except that here we're going to do a multiplicative increase. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we'll make that more powerful than the increase, so that step sizes can die down faster than they grow. We need to limit the step sizes. Mike Schuster's advice was to limit them between 50 and a millionth. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea.
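The update just described can be sketched in a few lines of NumPy. The 1.2 increase factor and the 50-to-a-millionth limits come from the lecture; the 0.5 decrease factor and the function signature are illustrative assumptions:

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step, inc=1.2, dec=0.5,
                 step_min=1e-6, step_max=50.0):
    # Grow the step multiplicatively where the last two gradients agree,
    # and shrink it (more aggressively) where they disagree.
    same_sign = grad * prev_grad > 0
    step = np.where(same_sign, step * inc, step * dec)
    step = np.clip(step, step_min, step_max)
    # The update uses only the sign of the gradient, scaled by the
    # per-weight step size -- never the gradient's magnitude.
    w = w - np.sign(grad) * step
    return w, step
```

Note that every weight keeps its own `step`, so a weight on a plateau with a tiny but consistently-signed gradient still takes large steps.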

So one question is: why doesn't rprop work with mini-batches? People have tried it, and found it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult.

The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so that the weight stays where it is. Rprop won't give us that. Rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the time scale of these mini-batches.
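The arithmetic behind that example can be checked directly. The learning rate and the fixed rprop step size below are illustrative (the step is held constant to reflect the assumption that it adapts slowly):

```python
import numpy as np

# Nine mini-batches with gradient +0.01, then one with -0.09:
# the gradients sum to zero.
grads = [0.01] * 9 + [-0.09]

lr = 0.1          # illustrative small learning rate
w_sgd = 0.0
for g in grads:
    w_sgd -= lr * g               # magnitude-aware: updates average out

step = 0.1        # illustrative fixed rprop step size
w_rprop = 0.0
for g in grads:
    w_rprop -= np.sign(g) * step  # sign-only: nine steps one way, one back

print(w_sgd)    # stays near 0.0
print(w_rprop)  # has drifted 0.8 away from where it started
```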

So the question is: can we combine the robustness you get from rprop by just using the sign of the gradient, the efficiency you get from mini-batches, and the averaging of gradients over mini-batches that allows gradients to be combined in the right way? That leads to a method I'm calling rmsprop, which you can consider to be a mini-batch version of rprop. Rprop is equivalent to using the gradient but also dividing by the magnitude of the gradient, and the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight.

So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates: time increments by one each time we update the weights. The numbers I put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the value of the squared gradient for that weight at time t, times 0.1. We then take that mean square and take its square root, which is why it has the name RMS. And then we divide the gradient by that RMS, and make an update proportional to that.

Â 5:57

That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method where we simply, for each connection, keep a running average of the root mean square gradient and divide by that.
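A minimal sketch of that update, per weight: the 0.9/0.1 decay follows the lecture's example numbers, while the learning rate and the small epsilon (added for numerical safety when the moving average is near zero) are assumptions, not part of the lecture's description:

```python
import numpy as np

def rmsprop_update(w, grad, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    # MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad(t)**2
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    # Divide the gradient by the root of that moving average (the "RMS")
    # and make an update proportional to the result.
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```

Because the moving average changes slowly, nearby mini-batches divide their gradients by nearly the same number, which is exactly the property rprop was missing.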

There are many further developments one could make to rmsprop. You could combine it with standard momentum. My experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation.

You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction, and Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term, rather than the large jump you make in the direction of the accumulated corrections.
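A hedged sketch of that combination, under the interpretation just given (jump first, then divide only the correction by the RMS). Here `grad_fn` is a hypothetical stand-in for whatever computes the mini-batch gradient, and all hyperparameter values are illustrative, not taken from Sutskever's experiments:

```python
import numpy as np

def nesterov_rmsprop_update(w, velocity, mean_square, grad_fn,
                            lr=0.001, momentum=0.9, decay=0.9, eps=1e-8):
    # First make the big jump in the direction of the accumulated velocity.
    w_jump = w + momentum * velocity
    # Measure the gradient where we landed.
    grad = grad_fn(w_jump)
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    # Divide only the correction term by the RMS of recent gradients,
    # not the jump itself.
    correction = lr * grad / (np.sqrt(mean_square) + eps)
    velocity = momentum * velocity - correction
    return w + velocity, velocity, mean_square
```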

Obviously, you could combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful it will be.

And then there's a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called "No More Pesky Learning Rates" that came out this year, and some of the terms in it look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage of this complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that.

So, a summary of the learning methods for neural networks goes like this. If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. These are full-batch methods adapted from the optimization literature, like non-linear conjugate gradient, L-BFGS, or Levenberg-Marquardt. One advantage of using those methods is that they typically come as a package, and when you report the results in your paper you just have to say, "I used this package and here's what it did." You don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods, but methods that were developed for neural networks.

If you have a big, redundant data set, it's essential to use mini-batches. It's a huge waste not to do that.

The first thing to try is just standard gradient descent with momentum. You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign. But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights.
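That starting point might look something like this. The momentum value and the increase/decrease factors for the global rate are illustrative assumptions; the lecture only says to adapt the single global rate when the gradient changes sign:

```python
import numpy as np

def sgd_momentum_step(w, velocity, grad, lr, momentum=0.9):
    # Plain mini-batch gradient descent with momentum.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def adapt_global_lr(lr, grad, prev_grad, inc=1.05, dec=0.7):
    # The "little loop": one global rate for all weights, shrunk when the
    # overall gradient direction flips, grown gently when it persists.
    if np.dot(grad, prev_grad) < 0:   # gradient changed sign overall
        return lr * dec
    return lr * inc
```

This adapts a single global rate; per-weight adaptive rates, which the lecture warns against as a first step, would keep one such rate per connection instead.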

The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far, it seems to work as well as gradient descent with momentum, maybe better.

Â 9:11

You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out whatever Yann LeCun's latest recipe is and try that. He's probably the person who's tried the most different ways of getting stochastic gradient descent to work well, so it's worth keeping up with whatever he's doing.

One question you might ask is why there is no simple recipe. We've been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think we'd have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe.

Â 9:58

First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard things to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case: they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They can often be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate.

The other consideration is that tasks differ a lot. Some tasks require very accurate weights; some tasks don't require weights to be very accurate at all.