When training a neural network, one of the techniques that will speed up your training is normalizing your inputs. Let's see what that means.

Suppose you have a training set with two input features, so the input features x are two-dimensional, and here's a scatter plot of your training set.

Normalizing your inputs corresponds to two steps. The first is to subtract out, or to zero out, the mean. So you set mu = 1/m times the sum over i of x(i). So this is a vector, and then x gets set as x - mu for every training example, so this means you just move the training set until it has zero mean.

And then the second step is to normalize the variances. So notice here that the feature x1 has a much larger variance than the feature x2. So what we do is set sigma squared = 1/m times the sum over i of x(i) squared, where this is an element-wise squaring. And so now sigma squared is a vector with the variances of each of the features. And notice we've already subtracted out the mean, so x(i), element-wise squared, is just the variance. And you take each example and divide it by sigma, the element-wise square root of this vector of variances. And so in pictures, you end up with this, where now the variance of x1 and x2 are both equal to one.
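The two steps above can be sketched in NumPy as follows (the data here is a small hypothetical training set, not one from the lecture):

```python
import numpy as np

# Hypothetical training set: m = 4 examples, 2 features per row.
# The second feature has a much larger scale than the first.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Step 1: zero out the mean.
mu = X.mean(axis=0)             # mu = (1/m) * sum_i x(i), one entry per feature
X = X - mu

# Step 2: normalize the variances.
sigma2 = (X ** 2).mean(axis=0)  # element-wise squares, averaged over examples
X = X / np.sqrt(sigma2)         # divide by the standard deviation

print(X.mean(axis=0))           # each feature now has mean 0
print(X.var(axis=0))            # and variance 1
```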

And one tip: if you use this to scale your training data, then use the same mu and sigma squared to normalize your test set. In particular, you don't want to normalize the training set and the test set differently. Whatever this value is and whatever this value is, use them in these two formulas so that you scale your test set in exactly the same way, rather than estimating mu and sigma squared separately on your training set and test set. Because you want your data, both training and test examples, to go through the same transformation defined by the same mu and sigma squared calculated on your training data.
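One way to sketch this tip in code (the function names and data here are illustrative, not from the lecture) is to compute the statistics once on the training set and reuse them everywhere:

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute mu and sigma squared on the training data only."""
    mu = X_train.mean(axis=0)
    sigma2 = ((X_train - mu) ** 2).mean(axis=0)
    return mu, sigma2

def apply_normalizer(X, mu, sigma2):
    """Scale any dataset with the training-set statistics."""
    return (X - mu) / np.sqrt(sigma2)

X_train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
X_test = np.array([[1.0, 20.0], [3.0, 40.0]])

mu, sigma2 = fit_normalizer(X_train)
# Both sets go through the same transformation, defined by training statistics;
# we never call fit_normalizer on the test set.
X_train_n = apply_normalizer(X_train, mu, sigma2)
X_test_n = apply_normalizer(X_test, mu, sigma2)
```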

So, why do we do this? Why do we want to normalize the input features? Recall that the cost function is defined as written on the top right. It turns out that if you use unnormalized input features, it's more likely that your cost function will look like this: a very squished-out bowl, a very elongated cost function, where the minimum you're trying to find is maybe over there. If your features are on very different scales, say the feature x1 ranges from 1 to 1,000 and the feature x2 ranges from 0 to 1, then it turns out that the range of values for the parameters w1 and w2 will end up being very different. And so maybe these axes should be w1 and w2, but I'll plot w and b; then your cost function can be a very elongated bowl like that. So if you plot the contours of this function, you get a very elongated shape like that. Whereas if you normalize the features, then your cost function will on average look more symmetric.

And if you're running gradient descent on a cost function like the one on the left, then you might have to use a very small learning rate, because if you start here, gradient descent might need a lot of steps to oscillate back and forth before it finally finds its way to the minimum. Whereas if you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum. You can take much larger steps with gradient descent, rather than needing to oscillate around like in the picture on the left.

Of course, in practice w is a high-dimensional vector, so trying to plot this in 2D doesn't convey all the intuitions correctly. But the rough intuition is that your cost function will be more round and easier to optimize when your features are all on similar scales: not from 1 to 1,000 and from 0 to 1, but mostly from minus one to one, or with roughly similar variances. That just makes your cost function J easier and faster to optimize.
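As a rough numerical illustration of this effect (using a hypothetical quadratic bowl, not the course's actual cost function), gradient descent on an elongated bowl needs far more steps than on a round one, because the learning rate must stay small enough for the steepest direction:

```python
import numpy as np

def gradient_descent_steps(scales, lr, tol=1e-6, max_steps=100000):
    """Minimize J(w) = sum(scales * w**2) from w = (1, 1);
    return the number of steps until |w| is within tol of the minimum at 0."""
    w = np.ones_like(scales, dtype=float)
    for step in range(max_steps):
        grad = 2 * scales * w          # gradient of the quadratic bowl
        w = w - lr * grad
        if np.max(np.abs(w)) < tol:
            return step + 1
    return max_steps

# Elongated bowl: curvatures differ by 100x, so the learning rate is capped
# by the steep direction and progress along the shallow one is slow.
elongated = gradient_descent_steps(np.array([1.0, 100.0]), lr=0.009)

# Round bowl: equal curvature in every direction allows a much larger rate.
round_bowl = gradient_descent_steps(np.array([1.0, 1.0]), lr=0.45)

print(elongated, round_bowl)  # the elongated bowl takes many more steps
```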

In practice, if one feature, say x1, ranges from zero to one, x2 ranges from minus one to one, and x3 ranges from one to two, these are fairly similar ranges, so this will work just fine. It's when they're on dramatically different ranges, like one from 1 to 1,000 and another from 0 to 1, that it really hurts your optimization algorithm. But by just setting all of them to zero mean and, say, variance one, like we did on the last slide, you guarantee that all your features are on a similar scale, which will usually help your learning algorithm run faster.

So, if your input features come from very different scales, maybe some features from 0 to 1 and some from 1 to 1,000, then it's important to normalize your features. If your features come in on similar scales, then this step is less important. Although performing this type of normalization pretty much never does any harm, so I'll often do it anyway even if I'm not sure whether or not it will help speed up the training of your algorithm.
