0:00

If you suspect your neural network is overfitting your data, that is, you have a high variance problem, one of the first things you should try is probably regularization. The other way to address high variance is to get more training data; that's also quite reliable. But you can't always get more training data, or it could be expensive to get more data. Adding regularization will often help to prevent overfitting, or to reduce the errors in your network. So let's see how regularization works.

Let's develop these ideas using logistic regression. Recall that for logistic regression, you try to minimize the cost function J, which is defined as the sum, over your m training examples, of the losses on the individual predictions, where you recall that w and b are the parameters of the logistic regression. So w is an nx-dimensional parameter vector, and b is a real number.

To add regularization to the logistic regression, what you do is add to it this thing, lambda, which is called the regularization parameter. I'll say more about that in a second. But you add lambda/2m times the norm of w squared. Here, the norm of w squared is just equal to the sum from j = 1 to nx of wj squared, or this can also be written w transpose w; it's just the squared Euclidean norm of the parameter vector w. And this is called L2 regularization.
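As a rough sketch of this cost in NumPy (the function name, variable names, and shapes here are my own choices for illustration, following the course's column-vector convention; `lambd` stands in for lambda):

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lambd):
    """Logistic regression cost with an L2 penalty.

    X: (nx, m) inputs, Y: (1, m) labels, w: (nx, 1), b: scalar.
    """
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(w.T @ X + b)))  # sigmoid activations, shape (1, m)
    # Average cross-entropy loss over the m examples
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    # lambda/2m times the squared Euclidean norm of w
    l2_penalty = (lambd / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + l2_penalty
```

Note that the penalty depends only on w, not on b, for the reasons discussed next.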

Â 1:33

Because here, you're using the Euclidean norm, also called the L2 norm, of the parameter vector w.

Now, why do you regularize just the parameter w? Why don't we add something here about b as well? In practice, you could do this, but I usually just omit it. Because if you look at your parameters, w is usually a pretty high-dimensional parameter vector, especially with a high variance problem. Maybe w just has a lot of parameters, so you aren't fitting all the parameters well, whereas b is just a single number. So almost all the parameters are in w rather than b. And if you add this last term, in practice it won't make much of a difference, because b is just one parameter among a very large number of parameters. In practice, I usually just don't bother to include it. But you can if you want.

So L2 regularization is the most common type of regularization. You might have also heard some people talk about L1 regularization. That's when, instead of this L2 norm, you add a term that is lambda/m times the sum of the absolute values of the components of w. And this is also called the L1 norm of the parameter vector w, hence the little subscript 1 down there, right? And I guess whether you put m or 2m in the denominator is just a scaling constant.

If you use L1 regularization, then w will end up being sparse. And what that means is that the w vector will have a lot of zeros in it. And some people say that this can help with compressing the model, because if a set of the parameters are zero, then you need less memory to store the model.
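To make the L1 term and the sparsity idea concrete, here's a small sketch (the function name and the example vector are my own, chosen only to illustrate):

```python
import numpy as np

def l1_penalty(w, m, lambd):
    # lambda/m times the sum of |w_j| -- the L1 regularization term
    # (using m rather than 2m in the denominator is just a scaling choice)
    return (lambd / m) * np.sum(np.abs(w))

# A sparse parameter vector, of the kind L1 regularization tends to produce:
w = np.array([[0.0], [2.0], [0.0], [0.0], [-1.0]])
nonzeros = np.count_nonzero(w)  # only 2 of the 5 entries need storing
```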

Although I find that, in practice, using L1 regularization to make your model sparse helps only a little bit. So I don't think it's used that much, at least not for the purpose of compressing your model. And when people train neural networks, L2 regularization is just used much, much more often.

Sorry, just fixing up some of the notation here. So one last detail: lambda here is called the regularization parameter.

Â 3:45

And usually, you set this using your development set, or using [INAUDIBLE] cross validation, where you try a variety of values and see what does the best, in terms of trading off between doing well on your training set versus also keeping the two-norm of your parameters small, which helps prevent overfitting. So lambda is another hyperparameter that you might have to tune.
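A minimal sketch of that tuning loop might look like the following; `dev_error` here is a made-up stand-in for training a model with each lambda and measuring its dev-set error, not a real evaluation:

```python
# Hypothetical dev-set error as a function of lambda, just to
# illustrate selecting the best value from a candidate list.
def dev_error(lambd):
    return (lambd - 0.3) ** 2 + 0.05  # invented error curve

candidates = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
best_lambd = min(candidates, key=dev_error)
```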

And by the way, for the programming exercises, lambda is a reserved keyword in the Python programming language. So in the programming exercises, we'll use lambd, without the "a", so as not to clash with the reserved keyword.

Â 4:29

So this is how you implement L2 regularization for logistic regression. How about a neural network? In a neural network, you have a cost function that's a function of all of your parameters, w[1], b[1] through w[L], b[L], where capital L is the number of layers in your neural network. And so the cost function is this: the sum of the losses, summed over your m training examples.

And to add regularization, you add lambda over 2m times the sum, over all of your parameter matrices w[l], of what's called the squared norm. This norm of a matrix, meaning the squared norm, is defined as the sum over i, sum over j, of each of the elements of that matrix, squared. And if you want the indices of this summation, this is the sum from i = 1 through n[l-1] and the sum from j = 1 through n[l], because w is an n[l-1] by n[l]-dimensional matrix, where these are the numbers of units in layer l-1 and layer l.

So this matrix norm, it turns out, is called the Frobenius norm of the matrix, denoted with an F in the subscript. For arcane linear algebra technical reasons, this is not called the L2 norm of a matrix; instead, it's called the Frobenius norm. I know it sounds like it would be more natural to just call it the L2 norm of the matrix, but for really arcane reasons that you don't need to know, by convention this is called the Frobenius norm. It just means the sum of squares of the elements of a matrix.
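In NumPy this could be sketched as follows (the example matrix is arbitrary; `np.linalg.norm` with `ord='fro'` computes the same quantity before squaring):

```python
import numpy as np

# Squared Frobenius norm: the sum over i and j of each element squared.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
frob_sq = np.sum(W ** 2)                   # 1 + 4 + 9 + 16 = 30
same = np.linalg.norm(W, ord='fro') ** 2   # identical, up to rounding
```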

So how do you implement gradient descent with this? Previously, we would compute dw using backprop, where backprop would give us the partial derivative of J with respect to w, or really w[l] for any given layer l. And then you would update w[l] as w[l] minus the learning rate times dw[l]. So this is before we added this extra regularization term to the objective. Now that we've added this regularization term to the objective, what you do is take dw[l] and add to it lambda/m times w[l]. And then you just compute this update, same as before. And it turns out that with this new definition of dw[l], this new dw[l] is still a correct definition of the derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end.
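A sketch of that modified update step (function and argument names are my own; `dW_backprop` stands for the gradient of the unregularized cost that backprop returns):

```python
import numpy as np

def update_with_l2(W, dW_backprop, alpha, lambd, m):
    """One gradient-descent step with the L2 term folded into dW.

    Adding (lambd/m) * W to the backprop gradient makes it the
    gradient of the regularized cost; the update itself is unchanged.
    """
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW
```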

Â 7:54

So w[l] gets updated as w[l] minus alpha times the thing you got from backprop, plus lambda/m times w[l]. Carrying the minus sign through, this is equal to w[l] minus alpha lambda/m times w[l], minus alpha times the thing you got from backprop. And so this term shows that whatever the matrix w[l] is, you're going to make it a little bit smaller, right? This is actually as if you're taking the matrix w and multiplying it by 1 minus alpha lambda/m. You're really taking the matrix w and subtracting alpha lambda/m times itself. It's like you're multiplying the matrix w by this number, which is going to be a little bit less than 1.

So this is why L2 norm regularization is also called weight decay. Because it's just like ordinary gradient descent, where you update w by subtracting alpha times the original gradient you got from backprop, but now you're also multiplying w by this thing, which is a little bit less than 1. So the alternative name for L2 regularization is weight decay. I'm not really going to use that name, but the intuition for why it's called weight decay is that this first term here is equal to this: you're just multiplying the weight matrix by a number slightly less than 1.
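A quick numerical check of this equivalence (the matrix values and hyperparameters here are arbitrary, chosen only to demonstrate the algebra):

```python
import numpy as np

# The regularized update equals shrinking W by (1 - alpha*lambda/m)
# and then taking the usual backprop step.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW_backprop = np.array([[0.1, 0.2], [-0.3, 0.4]])
alpha, lambd, m = 0.1, 2.0, 4.0

regularized = W - alpha * (dW_backprop + (lambd / m) * W)
weight_decay = (1 - alpha * lambd / m) * W - alpha * dW_backprop
```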

So that's how you implement L2 regularization in a neural network.