0:00

One of the problems of training neural networks, especially very deep neural networks, is vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult.

In this video you'll see what this problem of exploding and vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem.

Suppose you're training a very deep neural network like this. To save space on the slide, I've drawn it as if you have only two hidden units per layer, but it could be more as well. This neural network will have parameters W1, W2, W3 and so on up to WL.

For the sake of simplicity, let's say we're using the activation function G of Z equals Z, so a linear activation function. And let's ignore B; let's say B is equal to zero in every layer.

So in that case you can show that the output Y-hat will be WL times WL minus 1 times WL minus 2, dot, dot, dot, down to W3 times W2 times W1 times X.

If you want to just check my math: W1 times X is going to be Z1, because B is equal to zero. So Z1 is equal to W1 times X, plus B, which is zero. Then A1 is equal to G of Z1, but because we're using a linear activation function, this is just equal to Z1. So this first term, W1 times X, is equal to A1.

Then by the same reasoning you can figure out that W2 times W1 times X is equal to A2, because that's going to be G of Z2, which is G of W2 times A1, and you can plug A1 in here. So this term is equal to A2, the next one is equal to A3, and so on, until the product of all these matrices gives you Y-hat, not Y.
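
The claim above, that a deep network with linear activations and zero biases collapses to a single matrix product, is easy to check numerically. Here is a minimal sketch, assuming NumPy; the shapes and the choice of 5 layers are just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 5   # number of layers (illustrative)
n = 2   # two hidden units per layer, as drawn on the slide
x = rng.standard_normal((n, 1))

# Random weight matrices W1..WL; biases are zero and g(z) = z.
Ws = [rng.standard_normal((n, n)) for _ in range(L)]

# Layer-by-layer forward pass: a = g(W a + b) = W a
a = x
for W in Ws:
    a = W @ a

# Direct product WL ... W2 W1 x, computed in one shot
y_hat = np.linalg.multi_dot(Ws[::-1] + [x])

print(np.allclose(a, y_hat))   # the two computations agree
```

Running the layers one at a time and multiplying all the matrices up front give the same Y-hat, which is exactly the identity used in the lecture.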

Now, let's say that each of your weight matrices W[l] is just a little bit larger than one times the identity, so each is [[1.5, 0], [0, 1.5]]. Technically, the last one has different dimensions, so let's say this is just the value of the rest of these weight matrices. Then Y-hat will be, ignoring this last one with its different dimensions, this [[1.5, 0], [0, 1.5]] matrix to the power of L minus 1, times X, because we assume each one of these matrices is equal to this thing, which is really 1.5 times the identity matrix. So you end up with this calculation.

And so Y-hat will be essentially 1.5 to the power of L minus 1, times X, and if L is large, for a very deep neural network, Y-hat will be very large. In fact, it grows exponentially, like 1.5 to the number of layers. So if you have a very deep neural network, the value of Y-hat will explode.

Now, conversely, if we replace this 1.5 with 0.5, so something less than 1, then this becomes 0.5 to the power of L minus 1: the matrix product becomes [[0.5, 0], [0, 0.5]] to the L minus 1, times X, again ignoring WL.

And so if each of your matrices is less than 1, then, say X1 and X2 were both one, the activations will be one-half, one-half, then one-fourth, one-fourth, one-eighth, one-eighth, and so on, until this becomes one over 2 to the L. So the activation values will decrease exponentially as a function of the depth, the number of layers L of the network. So in a very deep network, the activations end up decreasing exponentially.
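
Both regimes can be reproduced numerically. A small sketch, assuming NumPy, using the two weight matrices from the lecture (1.5 times the identity versus 0.5 times the identity) and an illustrative depth of 50 layers:

```python
import numpy as np

def deep_linear_forward(scale, L, x):
    """Forward pass through L layers with W = scale * I, b = 0, g(z) = z."""
    W = scale * np.eye(2)
    a = x
    for _ in range(L):
        a = W @ a
    return a

x = np.array([[1.0], [1.0]])   # x1 = x2 = 1, as in the lecture

exploding = deep_linear_forward(1.5, 50, x)
vanishing = deep_linear_forward(0.5, 50, x)

print(exploding[0, 0])   # 1.5**50, about 6.4e8: the activations blow up
print(vanishing[0, 0])   # 0.5**50, about 8.9e-16: the activations shrink toward zero
```

Fifty layers is enough for a factor of roughly 24 orders of magnitude between the two cases, which is why depth makes this problem so severe.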

So the intuition I hope you take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if W is just a little bit less than the identity, so maybe 0.9 times the identity, then with a very deep network the activations will decrease exponentially.

And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives, the gradients that gradient descent computes, will also increase exponentially or decrease exponentially as a function of the number of layers.
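
The gradient version of the argument can be sketched with a tiny hand-rolled backward pass. This is a simplified scalar network, ŷ = wL · ... · w1 · x with every weight equal to the same value w, which is an assumption made here purely for illustration; the gradient of ŷ with respect to the first weight is the product of the other L − 1 weights times x, so it scales like w**(L-1):

```python
def forward_and_grads(w, L, x=1.0):
    """Scalar deep linear net with all weights equal to w.
    Returns y_hat and dy_hat/dw_l for every layer, via explicit backprop."""
    # Forward pass, caching activations a_0 .. a_L
    a = [x]
    for _ in range(L):
        a.append(w * a[-1])

    # Backward pass: dy/dw_l = upstream gradient times the layer's input a_{l-1}
    grads = []
    upstream = 1.0                    # d y_hat / d a_L
    for l in range(L, 0, -1):
        grads.append(upstream * a[l - 1])   # d y_hat / d w_l
        upstream *= w                       # propagate through layer l
    return a[-1], grads[::-1]

y_big, g_big = forward_and_grads(1.5, 50)
y_small, g_small = forward_and_grads(0.5, 50)

print(g_big[0])     # gradient w.r.t. the first weight: 1.5**49, enormous
print(g_small[0])   # 0.5**49, vanishingly small
```

So with weights slightly above one, the gradient for the earliest layers explodes, and with weights slightly below one it vanishes, mirroring what happened to the activations.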

With some of the modern neural networks, L can be around 150; Microsoft recently got great results with a 152-layer neural network. But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values can get really big or really small, and this makes training difficult. Especially if your gradients are exponentially small as a function of L, gradient descent will take tiny little steps, and it will take a long time for gradient descent to learn anything.

To summarize, you've seen how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve this problem but helps a lot, which is careful choice of how you initialize the weights. To see that, let's go to the next video.
