0:31

fits the data, and there's a cost function, and that was our optimization objective. [sound] In this video, in order to better visualize the cost function J, I'm going to work with a simplified hypothesis function, like that shown on the right. So I'm going to use a simplified hypothesis, which is just theta one times x. We can, if you want, think of this as setting the parameter theta zero equal to zero. So I have only one parameter, theta one, and my cost function is similar to before, except that h of x is now equal to just theta one times x. Because I have only one parameter, theta one, my optimization objective is to minimize J of theta one. In pictures, what this means is that setting theta zero equal to zero corresponds to choosing only hypothesis functions that pass through the origin, that pass through the point (0, 0). Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better. It turns out that there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function. So notice that the hypothesis, h of x, for a fixed value of theta one, is a function of x. That is, the hypothesis is a function of the size of the house x. In contrast, the cost function J is a function of the parameter theta one, which controls the slope of the straight line. Let's plot these functions and try to understand them both better. Let's start with the hypothesis. On the left, let's say here's my training set with three points at (1, 1), (2, 2), and (3, 3). Let's pick a value for theta one. When theta one equals one, my hypothesis is going to look like this straight line over here. And I'll point out that when I'm plotting my hypothesis function, my horizontal axis is labeled x, the size of the house.
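As a quick illustration, the simplified hypothesis and this three-point training set can be written out in a few lines of Python (a minimal sketch; the function and variable names are my own, not from the lecture):

```python
# Training set from the lecture: (1, 1), (2, 2), (3, 3)
x_train = [1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0]

def h(theta1, x):
    """Simplified hypothesis with theta_0 fixed at 0: h(x) = theta_1 * x."""
    return theta1 * x

# With theta_1 = 1, the line passes through every training point exactly.
predictions = [h(1.0, x) for x in x_train]
print(predictions)  # [1.0, 2.0, 3.0]
```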

Now, temporarily setting theta one equal to one, what I want to do is figure out what J of theta one is when theta one equals one. So let's go ahead and compute what the cost function evaluates to at the value one. Well, as usual, my cost function is defined as follows, right? One over 2m times a sum, over my m training examples, of this usual squared error term. And this is therefore equal to one over 2m times the sum

Â 3:14

of theta one times x i minus y i, squared. If you simplify, this turns out to be zero squared plus zero squared plus zero squared, which is of course just equal to zero. Now, inside the cost function, it turns out each of these terms is equal to zero, because for the specific training set I have, my three training examples are (1, 1), (2, 2), and (3, 3), and if theta one is equal to one, then h of x i is equal to y i exactly. And so h of x i minus y i, each of these terms, is equal to zero, which is why I find that J of one is equal to zero. So we now know that J of one is equal to zero. Let's plot that. What I'm going to do on the right is plot my cost function J. And notice that because my cost function is a function of my parameter theta one, when I plot my cost function, the horizontal axis is now labeled with theta one. So J of one equals zero; let's go ahead and plot that. I end up with an x over there. Now let's look at some other examples. Theta one can take on a range of different values, right? So theta one can take on negative values, zero, and positive values. So what if theta one is equal to 0.5? What happens then? Let's go ahead and plot that. I'm now going to set theta one equal to 0.5, and in that case my hypothesis now looks like this: a line with slope equal to 0.5. And let's compute J of 0.5. So that is going to be one over 2m times my usual sum of squared error terms.

It turns out that the cost function works out to be the square of the height of this line, plus the square of the height of that line, plus the square of the height of that line, right? Because each of these vertical distances is just the difference between y i and the predicted value h of x i. So the first term is going to be 0.5 minus one, squared.

Because my hypothesis predicted 0.5, whereas the actual value was one. For my second example, I get one minus two, squared, because my hypothesis predicted one, but the actual housing price was two. And then finally, plus 1.5 minus three, squared. And so that's equal to one over two times three, because m, the training set size, is three; I have three training examples. Simplifying, the terms in the parentheses sum to 3.5, so that's 3.5 over six, which is about 0.58. So now we know that J of 0.5 is about 0.58. Let's go and plot that, which is maybe about over there. Okay? Now let's do one more. What if theta one is equal to zero? What is J of zero equal to? It turns out that if theta one is equal to zero, then h of x is just equal to this flat line that goes horizontally like this. And so, measuring the errors, we have that J of zero is equal to one over 2m times one squared plus two squared plus three squared, which is one sixth times fourteen, which is about 2.3. So let's go ahead and plot that as well. It ends up with a value around 2.3, and of course we can keep on doing this for other values of theta one. It turns out that you can have negative values of theta one as well. So if theta one is negative, then h of x would be equal to, say, minus 0.5 times x, and that corresponds to a hypothesis with a slope of negative 0.5. And you can actually keep on computing these errors. For theta one equals minus 0.5, it turns out to have a really high error; it works out to be something like 5.25. And so on, and for the different values of theta one, you can compute these things, right?
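The by-hand values above can be checked with a short Python sweep over theta one (a minimal sketch; the function and variable names are my own, not from the lecture):

```python
# Training set from the lecture: (1, 1), (2, 2), (3, 3)
x_train = [1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0]

def cost(theta1):
    """J(theta_1) = (1 / 2m) * sum_i (theta_1 * x_i - y_i)^2."""
    m = len(x_train)
    return sum((theta1 * x - y) ** 2 for x, y in zip(x_train, y_train)) / (2 * m)

# Reproduce the values computed in the lecture.
for theta1 in [1.0, 0.5, 0.0, -0.5]:
    print(theta1, round(cost(theta1), 2))
# 1.0 -> 0.0, 0.5 -> 0.58, 0.0 -> 2.33, -0.5 -> 5.25
```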

And it turns out that as you compute J for a range of values of theta one, you get a curve something like that. By computing J for a range of values, you can slowly trace out what the function J of theta one looks like. To recap: each value of theta one corresponds to a different hypothesis, that is, to a different straight-line fit on the left. And for each value of theta one, we can derive a different value of J of theta one. For example, theta one equals one corresponded to this straight line passing through the data, whereas theta one equals 0.5, the point shown in magenta, corresponded to maybe that line, and theta one equals zero, shown in blue, corresponds to this horizontal line. So for each value of theta one we wound up with a different value of J of theta one, and we could then use this to trace out the plot on the right. Now, you remember, the optimization objective for our learning algorithm is that we want to choose the value of theta one that minimizes J of theta one.
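On a one-parameter problem, this minimization can be sketched by simply evaluating J on a grid of theta one values and picking the smallest (a minimal sketch of the idea, not the gradient descent method used later in the course; the names are my own):

```python
# Training set from the lecture: (1, 1), (2, 2), (3, 3)
x_train = [1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0]

def cost(theta1):
    """J(theta_1) for the simplified, one-parameter hypothesis h(x) = theta_1 * x."""
    m = len(x_train)
    return sum((theta1 * x - y) ** 2 for x, y in zip(x_train, y_train)) / (2 * m)

# Evaluate J on a coarse grid of theta_1 values and pick the minimizer.
grid = [i / 10 for i in range(-10, 21)]   # theta_1 from -1.0 to 2.0 in steps of 0.1
best = min(grid, key=cost)
print(best, cost(best))  # 1.0 0.0
```

As the lecture argues, the minimizer on this grid is theta one equals one, the line that fits this training set perfectly.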

Right? This was our objective function for linear regression. Well, looking at this curve, the value that minimizes J of theta one is theta one equals one. And lo and behold, that is indeed the best possible straight-line fit through our data: by setting theta one equals one, for this particular training set, we actually end up fitting it perfectly. And that's why minimizing J of theta one corresponds to finding a straight line that fits the data well. So, to wrap up: in this video, we looked at some plots in order to understand the cost function. To do so, we simplified the algorithm so that it only had one parameter, theta one, by setting the parameter theta zero to zero. In the next video, we'll go back to the original problem formulation and look at some visualizations involving both theta zero and theta one, that is, without setting theta zero to zero. And hopefully that will give you an even better sense of what the cost function J is doing in the original linear regression formulation.
