So this is how you implement L2 regularization for logistic regression.

How about a neural network?

In a neural network, you have a cost function that's a function of

all of your parameters, w[1], b[1] through w[L], b[L],

where capital L is the number of layers in your neural network.

And so the cost function is this: one over m times the sum of the losses,

summed over your m training examples.

And to add regularization, you add lambda over

2m times the sum, over all of your parameter matrices w[l],

of their squared norm.

Where this squared norm of a matrix

is defined as the sum over i, sum over j,

of each of the elements of that matrix, squared.

And if you want the indices of this summation,

this is sum from i=1 through n[l],

sum from j=1 through n[l-1],

because w[l] is an n[l] by n[l-1] dimensional matrix,

where n[l] and n[l-1] are the numbers of units in layer l and layer l-1.
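As a rough sketch of this regularized cost (the names `parameters` and `lambd` are my own, not fixed by the lecture; `lambd` because `lambda` is reserved in Python):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m):
    """Add the L2 penalty to an already-computed unregularized cost.

    parameters: dict holding weight matrices under keys 'W1', ..., 'WL'
                (biases under 'b1', ... are not penalized).
    lambd:      regularization strength lambda.
    m:          number of training examples.
    """
    # Sum of squared entries of every weight matrix W[l]
    penalty = sum(np.sum(np.square(W))
                  for key, W in parameters.items() if key.startswith('W'))
    # J_regularized = J + (lambda / 2m) * sum_l ||W[l]||^2
    return cross_entropy_cost + (lambd / (2 * m)) * penalty
```

Note the biases b[l] are left out of the penalty, as in the lecture only the w parameters are summed over.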

So this matrix norm, it turns out, is called the Frobenius

norm of the matrix, denoted with an F in the subscript.

So for arcane linear algebra technical reasons,

this is not called the l2 norm of a matrix.

Instead, it's called the Frobenius norm of a matrix.

I know it sounds like it would be more natural to just call it the l2 norm of

the matrix, but for really arcane reasons that you don't need to know,

by convention, this is called the Frobenius norm.

It just means the sum of squares of the elements of a matrix.
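For instance, with NumPy you can check that the squared Frobenius norm is exactly the sum of squared elements:

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [3.0,  0.5]])

# Squared Frobenius norm: sum over i, sum over j, of W[i, j] squared
sum_of_squares = np.sum(np.square(W))

# NumPy's built-in Frobenius norm agrees once squared
fro_squared = np.linalg.norm(W, 'fro') ** 2

print(sum_of_squares)  # 1 + 4 + 9 + 0.25 = 14.25
```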

So how do you implement gradient descent with this?

Previously, we would compute dw using backprop,

where backprop would give us the partial derivative

of J with respect to w, or really w[l] for any given layer l.

And then you update w[l] as w[l] minus the learning rate times dw[l].

So this is before we added this extra regularization term to the objective.

Now that we've added this regularization term to the objective,

what you do is you take dw[l] and you add to it, lambda/m times w[l].

And then you just compute this update, same as before.

And it turns out that with this new definition of dw[l],

this new dw[l] is still a correct definition of the derivative

of your cost function, with respect to your parameters,

now that you've added the extra regularization term at the end.
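Put together, one gradient-descent step with the regularization term folded into dw[l] might look like this sketch for a single layer (the variable names are illustrative, not from the lecture):

```python
import numpy as np

def update_with_l2(W, dW_backprop, lambd, m, learning_rate):
    """One regularized gradient-descent step for a single layer's W[l].

    dW_backprop: dJ/dW[l] from backprop on the unregularized cost.
    """
    # Regularization adds (lambda / m) * W[l] to the backprop gradient
    dW = dW_backprop + (lambd / m) * W
    # The update rule itself is unchanged: W[l] := W[l] - alpha * dW[l]
    return W - learning_rate * dW
```

With this dW, the update is the correct gradient step for the cost function that includes the lambda over 2m penalty, since differentiating (lambda / 2m) * ||W||^2 with respect to W gives (lambda / m) * W.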