0:00

In this video, I'd like to convey to you the main intuitions behind how regularization works. And we'll also write down the cost function that we'll use when we use regularization. With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is if you implement it and see it work for yourself. And if you do the appropriate exercises after this, you'll get the chance to see regularization in action for yourself. So, here is the intuition. In the previous video, we saw that if we were to fit a quadratic function to this data, it gives us a pretty good fit to the data. Whereas if we were to fit an overly high-order polynomial, we end up with a curve that may fit the training set very well, but overfit the data and not generalize well.
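As a small illustration of that contrast (a sketch with made-up data, not the lecture's figure), fitting a degree-4 polynomial to five noisy points from a quadratic drives the training error to essentially zero, because the extra parameters let the curve chase the noise:

```python
import numpy as np

# Made-up data: five points sampled from a quadratic, plus a little noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 5)
y_train = 3.0 * x_train**2 + rng.normal(0.0, 0.1, size=5)

# A degree-4 polynomial has 5 parameters -- enough freedom to pass
# through all 5 training points, noise included.
quartic = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)
quadratic = np.polynomial.Polynomial.fit(x_train, y_train, deg=2)

err_quartic = float(np.sum((quartic(x_train) - y_train) ** 2))
err_quadratic = float(np.sum((quadratic(x_train) - y_train) ** 2))

print(err_quartic)    # essentially zero: the curve memorizes the noise
print(err_quadratic)  # larger: the fit keeps the overall quadratic shape
```

A near-zero training error here is exactly the symptom of overfitting: the high-order curve has fit the noise, not just the trend.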

Â 0:57

Consider the following: suppose we were to penalize, and make the parameters theta 3 and theta 4 really small. Here's what I mean. Here is our optimization objective, or here is our optimization problem, where we minimize our usual squared error cost function. Let's say I take this objective and modify it, and add to it plus 1000 times theta 3 squared, plus 1000 times theta 4 squared. 1000 I am just writing down as some huge number. Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and theta 4 are small, right? Because otherwise, if you have a thousand times theta 3 squared, this new cost function is going to be big. So when we minimize this new function, we are going to end up with theta 3 close to 0 and theta 4 close to 0, and it's as if we're getting rid of these two terms over there.
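To make that concrete, here is a sketch of the modified objective (with made-up numbers, not the lecture's dataset). With a 1000-times penalty on theta 3 and theta 4, any parameter vector with sizable values there is swamped by the penalty term:

```python
import numpy as np

# Made-up data roughly following a quadratic trend (not the lecture's dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.9, 9.1, 16.2, 24.8])

def penalized_cost(theta, penalty=1000.0):
    """Usual squared-error cost plus huge penalties on theta_3 and theta_4."""
    X = np.vander(x, N=5, increasing=True)   # features x^0 .. x^4
    m = len(y)
    sq_err = np.sum((X @ theta - y) ** 2) / (2 * m)
    return sq_err + penalty * theta[3] ** 2 + penalty * theta[4] ** 2

theta_quartic = np.array([0.1, 0.2, 0.8, 0.5, 0.3])     # sizable theta_3, theta_4
theta_quadratic = np.array([0.1, 0.2, 0.95, 0.0, 0.0])  # theta_3 = theta_4 = 0

# The 1000x penalty swamps everything unless theta_3 and theta_4 are near zero.
print(penalized_cost(theta_quartic) > penalized_cost(theta_quadratic))  # True
```

So any minimizer of this modified cost is forced to keep theta 3 and theta 4 close to zero, which is the point of the penalty.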

Â 2:03

And if we do that, well then, if theta 3 and theta 4 are close to 0, we are left with a quadratic function. And so we end up with a fit to the data that's, you know, a quadratic function plus maybe tiny contributions from the small terms theta 3 and theta 4, which may be very close to 0. And so we end up with essentially a quadratic function, which is good, because this is a much better hypothesis. In this particular example, we looked at the effect of penalizing two of the parameter values for being large. More generally, here is the idea behind regularization. The idea is that if we have small values for the parameters, then having small values for the parameters will usually correspond to having a simpler hypothesis. So, for our last example, we penalized just theta 3 and theta 4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because when these parameters are close to zero, as in this example, that gave us a quadratic function. But more generally, it is possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions as well, which are therefore also less prone to overfitting. I realize that the reasoning for why having all the parameters be small corresponds to a simpler hypothesis may not be entirely clear to you right now. And it is kind of hard to explain unless you implement it yourself and see it for yourself. But I hope that the example of having theta 3 and theta 4 be small, and how that gave us a simpler hypothesis, helps explain why, or at least gives some intuition as to why, this might be true. Let's look at a specific example.

Â 4:12

For housing price prediction, we may have our hundred features that we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And we may have a hundred features. And unlike the polynomial example, we don't know, right, we don't know that theta 3 and theta 4 are the high-order polynomial terms. So if we have just a bag, if we have just a set of a hundred features, it's hard to pick in advance which are the ones that are less likely to be relevant. So we have a hundred, or a hundred and one, parameters. And we don't know which ones to pick; we don't know which parameters to try to pick, to try to shrink. So, in regularization, what we're going to do is take our cost function, here's my cost function for linear regression, and what I'm going to do is modify this cost function to shrink all of my parameters, because, you know, I don't know which one or two to try to shrink. So I am going to modify my cost function to add a term at the end.

Â 5:36

By the way, by convention the summation here starts from one, so I am not actually going to penalize theta zero being large. That's sort of the convention, that the sum goes from i equals one through n, rather than i equals zero through n. But in practice it makes very little difference: whether or not you include theta zero makes very little difference to the results. But by convention, usually, we regularize only theta 1 through theta 100. Writing down our regularized optimization objective, our regularized cost function, again: here it is. Here's J of theta, where this term on the right is the regularization term, and lambda here is called the regularization parameter. And what lambda does is control a trade-off between two different goals. The first goal, captured by the first term in the objective, is that we would like to fit the training data well. We would like to fit the training set well. And the second goal is that we want to keep the parameters small, and that's captured by the second term, by the regularization term. And what lambda, the regularization parameter, does is control the trade-off between these two goals: between the goal of fitting the training set well and the goal of keeping the parameters small and therefore keeping the hypothesis relatively simple, to avoid overfitting. For our housing price prediction example, whereas previously, if we had fit a very high-order polynomial, we may have wound up with a very, sort of, wiggly or curvy function like this. If you still fit a high-order polynomial with all the polynomial features in there, but instead you just make sure to use this sort of regularized objective, then what you can get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler, and maybe a curve like the magenta line that, you know, gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement regularization yourself, you will be able to see this effect firsthand.
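The regularized cost function just described can be written out in a few lines. This is a sketch, assuming the usual linear-regression setup with a design matrix X whose first column is all ones; note the regularization sum deliberately skips theta 0:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [ sum of squared errors
                             + lam * sum_{j>=1} theta_j^2 ].
    theta_0 (the intercept) is not penalized, by convention."""
    m = len(y)
    h = X @ theta                       # hypothesis: linear in the features
    sq_err = np.sum((h - y) ** 2)       # fit-the-training-set term
    reg = lam * np.sum(theta[1:] ** 2)  # keep-the-parameters-small term
    return (sq_err + reg) / (2 * m)
```

Setting lam to 0 recovers the plain squared-error cost, and a larger lam pushes the minimizer toward smaller theta 1 through theta n.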

Â 8:00

In regularized linear regression, if the regularization parameter lambda is set to be very large, then what will happen is we will end up penalizing the parameters theta 1, theta 2, theta 3, theta 4 very highly. That is, if our hypothesis is this one down at the bottom, and if we end up penalizing theta 1, theta 2, theta 3, theta 4 very heavily, then we end up with all of these parameters close to zero, right? Theta 1 will be close to zero; theta 2 will be close to zero; theta 3 and theta 4 will end up being close to zero. And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis that says that, well, housing prices are equal to theta zero. And that is akin to fitting a flat, horizontal straight line to the data. And this is an example of underfitting, and in particular this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line; it doesn't go anywhere near most of the training examples.
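That underfitting effect can be sketched numerically. This uses the closed-form solution of the regularized least-squares problem (a standard identity, not derived in this video) on made-up data with a clear upward trend; with a huge lambda, the fitted slope collapses toward zero and the hypothesis degenerates to roughly h(x) = theta 0, a flat line:

```python
import numpy as np

# Made-up housing-style data with a clear upward trend.
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def fit_regularized(X, y, lam):
    """Closed-form minimizer of the regularized squared-error cost:
    theta = (X'X + lam * M)^-1 X'y, where M is the identity matrix
    with a zero in the top-left corner so theta_0 goes unpenalized."""
    n = X.shape[1]
    M = np.eye(n)
    M[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

theta_small_lam = fit_regularized(X, y, 0.01)
theta_huge_lam = fit_regularized(X, y, 1e6)

# With a huge lambda, the slope theta_1 is driven toward zero, so the
# hypothesis degenerates to roughly h(x) = theta_0 -- a flat line.
print(abs(theta_huge_lam[1]) < abs(theta_small_lam[1]))  # True
```

Despite the data clearly sloping upward, the heavily regularized fit ignores it, which is exactly the too-high-bias behavior described here.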

And another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta zero, and despite the clear data to the contrary, you know, chooses to fit a sort of flat, horizontal line. I didn't draw that very well; this is just a horizontal flat line fit to the data. So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well.

And when we talk about model selection later in this course, we'll talk about a variety of ways for automatically choosing the regularization parameter lambda as well. So, that's the idea behind regularization, and the cost function we use in order to apply regularization. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can then get them to avoid overfitting.
