0:05

The first thing we would want to do with this kind of data is create a scatter plot of the child's heights by the parent's heights. Here, I use ggplot. There are numerous failings in this plot. One of the primary failings is that the points are over-plotted: there are lots of different paired parent-child x, y values at each of these specific points. Here I give a better plot, where the size of the point represents the number of parent-child combinations at that particular x, y location.

0:44

Here the color matters as well: a very light color represents a frequency near 40, and a darker, more bluish color represents a frequency toward ten or down to one at specific locations. So the size of the point and the color of the point represent the frequency of parent-child combinations at that specific x, y pair. This is a much better plot, because it doesn't lose that information.

Â 1:18

I would say there is another failing of this plot: I don't put the units, inches, on either of the two axes, and I think it's good practice to have the units on the axes. So that's a little bit of a failing of this plot. If you want to see the code, it's in the R markdown file.

Â 1:38

Now let's suppose we want to explain the children's heights using the parents' heights, and let's assume we want to do it with a line. Well, to make things easy for right now, let's force the line to go through the origin.

2:14

In order to find the best line, all we have to find is the slope. Here's how we could potentially do that: we want to find the slope beta that minimizes the sum of the squared distances between the observed data points, the Yi, and the fitted points on the line, Xi beta. We square each distance and add them up. This is directly analogous to finding the least squares mean that we did just a couple of slides ago. So this is exactly using the origin as a pivot point and picking the line that minimizes the sum of the squared vertical distances between the points and the line.
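As a concrete check on this criterion, here is a minimal sketch in Python (the lecture's actual code is R, in the R markdown file), using made-up heights rather than the real Galton data. Setting the derivative of the sum of squares to zero gives the closed-form slope beta = sum(x_i * y_i) / sum(x_i^2):

```python
import numpy as np

# Made-up heights standing in for the Galton data (not the real values).
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])  # parents
y = np.array([65.5, 66.9, 67.8, 69.0, 70.2])  # children

def sse(beta):
    # Sum of squared vertical distances between the points and the line y = beta * x.
    return np.sum((y - beta * x) ** 2)

# Closed-form minimizer of sse over beta.
beta_hat = np.sum(x * y) / np.sum(x ** 2)

# The closed-form slope does at least as well as any nearby slope.
assert sse(beta_hat) <= sse(beta_hat + 0.01)
assert sse(beta_hat) <= sse(beta_hat - 0.01)
```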

We're going to use RStudio's manipulate function to experiment with this and see if we can find that line. Now, regression through the origin is useful for explaining things, because we only have one parameter, the slope, rather than two parameters, the slope and the intercept. But it's generally bad practice to force regression lines through the point zero, zero.

So an easy way around this is to subtract the mean from the parents' heights and the mean from the children's heights, so that the zero, zero point is right in the middle of the data; that makes this solution a little more palatable. We'll discuss later how this relates to full regression, where you fit both the slope and an intercept. Let me show a picture to illustrate some of these concepts. Here I have a scatter plot, with some data on the y-axis and some data on the x-axis, and I want to use my x variable to predict my y variable. So this is my x-axis and this is my y-axis.

4:08

My red crosshairs are my data points. Regression through the origin, the way we're doing it, takes the point zero, zero and treats it as a pivot point. It then tries to find the best line going through the points, where the best line is as follows. Take this line here: for each observed point, it calculates the vertical distance, so that height is Yi. This length is Xi, and that point on the line is Xi beta. So this distance right here is Yi minus Xi beta, and if we square it, we get the squared distance. What regression through the origin is trying to do is take all of these vertical distances between the fitted line and the observed heights, square them, and add them all up, so that each one contributes to the error it's calculating, and then find the best slope. Remember, this line is defined by a simple equation, y equals x beta; when you have a line going through the origin, you only need one parameter, the slope.

5:48

That line's not going to be very good. If you look, the line really should hit somewhere along here, so regression through the origin kind of doesn't make sense here. What we're doing by centering the data is setting the origin to be right in the middle of the data, so that the point in the middle is now zero, zero after subtracting off the means. It basically reorients the axes so that the zero, zero point lies right in the middle of the data. Now it seems a little more reasonable to find the regression line through the data where we just consider a slope. So regression through the origin makes a little more sense if you subtract the means, and we'll find out later that this yields a solution equivalent to the one we'd get if we fit both the intercept and the slope.
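That equivalence can be verified numerically. Here is a small sketch in Python with made-up heights (the lecture itself works in R with the galton data): the through-origin slope on mean-centered data matches the slope from an ordinary fit with both intercept and slope.

```python
import numpy as np

# Made-up heights standing in for the Galton data.
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0, 65.0, 71.0])  # parents
y = np.array([65.5, 66.9, 67.8, 69.0, 70.2, 66.0, 69.5])  # children

# Center both variables so the point (0, 0) sits in the middle of the data.
xc = x - x.mean()
yc = y - y.mean()

# Regression through the origin on the centered data: one parameter, the slope.
beta_centered = np.sum(xc * yc) / np.sum(xc ** 2)

# Ordinary least squares with both slope and intercept on the raw data.
slope_full = np.polyfit(x, y, 1)[0]

# The two slopes agree.
assert abs(beta_centered - slope_full) < 1e-8
```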

6:47

So let's try to do this with RStudio's manipulate function. Because this is one of the central themes of the next several lectures, we're going to go over these points over and over again; they're very fundamental.

7:01

I won't show running the code, because you can grab it from the R markdown file, and I'm hoping at this point in the specialization you'll be familiar with running R code. But here's my plot, and over it I have a specific value of the regression line. Even if the axis labels get down-sampled a little in the video compression, you can see that the center of the child's heights here is at zero, because I have subtracted the mean, and the center of the parent's heights is at zero. So I have a pivot point right in the middle at zero, zero. I also want to point out that up here I give the slope, beta equals 0.6, and the mean squared error, where the mean squared error is again based on the y data points, the child's heights, minus the x data points, the parent's heights, multiplied by this factor, 0.6. So it's taking each one of these points, say this one right here that's more centered, calculating the vertical distance between the child's height and the parent's height multiplied by the slope, squaring it, and adding them up. But here again, because there are multiple points at any x, y combination, you can think of multiplying each squared distance by the number of points at that specific combination, because of the overplotting.

8:41

At the point zero, zero, our line is going to pivot around that point, because we're only fitting lines that have to go through the origin; we're only considering the slope.

Okay, so let's do it and see how our mean squared error changes as a function of our slope. 0.6 doesn't look so bad; let's move it to one that's not so good. Okay, the mean squared error has gotten a lot worse, right? It went from 5.004, say, at 0.68, all the way up to 5.8. Now, moving it back, you can see the mean squared error getting lower.

9:20

Right? As you get down to a slope of 0.6-something, it's doing pretty well. 0.64 has 5.0, and then 0.62 has 5.002, so it's gone up. So it looks like about 0.64. What's interesting is that a slope of one is not good. There's the slope of one: that's saying, if you want to predict the child's height, just use the parent's height. Apparently, we have to multiply by a factor to get a better prediction of the child's height than the parent's height by itself. We have to multiply it by a factor of about 0.6 in this case; that's what the slope is doing for us.
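What the manipulate slider does amounts to a grid search over candidate slopes. Here is a rough sketch of that search in Python with made-up centered heights (hypothetical values, not the galton data): sweep the slopes, compute the mean squared error for each, and the best grid slope lands next to the closed-form least squares answer.

```python
import numpy as np

# Made-up centered heights standing in for the centered Galton data.
xc = np.array([-4.0, -2.0, 0.0, 2.0, 4.0, -3.0, 3.0])  # centered parents
yc = np.array([-2.3, -0.9, 0.0, 1.2, 2.4, -1.8, 1.7])  # centered children

# Sweep candidate slopes, as moving the manipulate slider does.
slopes = np.arange(0.0, 1.21, 0.02)
mse = np.array([np.mean((yc - b * xc) ** 2) for b in slopes])
best = slopes[np.argmin(mse)]

# The grid winner sits next to the closed-form least squares slope.
beta_hat = np.sum(xc * yc) / np.sum(xc ** 2)
assert abs(best - beta_hat) <= 0.02
```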

10:09

Here, I give the manipulate code. Again, all of this code is in the R markdown file, so don't retype it from the slides; grab the actual text from the R markdown file.

10:26

Now that we've used manipulate to find the slope, I'm going to show how you can do this very quickly in R. The function lm fits the linear model.

10:44

Here I have the code where I use lm. My outcome, my y value, is the centered child's heights, so I've subtracted off the mean. And here are the centered parent's heights, where I've subtracted off the mean for the parents. This minus 1 says to get rid of the intercept, because we're talking about regression through the origin.

11:06

We forget about a y intercept, I tell lm that the data set I want to fit is galton, and it gives me the slope, 0.646. We're going to talk about this a lot; we'll go through it in great detail throughout the rest of the class.
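For comparison, here is a hypothetical numpy analogue of the lm call just described, with made-up centered heights standing in for the centered galton columns: leaving the intercept column out of the design matrix plays the role of R's `- 1`.

```python
import numpy as np

# Made-up centered heights, not the real galton data.
xc = np.array([-4.0, -2.0, 0.0, 2.0, 4.0, -3.0, 3.0])  # centered parents
yc = np.array([-2.3, -0.9, 0.0, 1.2, 2.4, -1.8, 1.7])  # centered children

# A one-column design matrix with no intercept column is the numpy analogue
# of R's lm(yc ~ xc - 1): only the slope is estimated.
X = xc.reshape(-1, 1)
slope = np.linalg.lstsq(X, yc, rcond=None)[0][0]

# Matches the closed-form through-origin slope.
assert abs(slope - np.sum(xc * yc) / np.sum(xc ** 2)) < 1e-10
```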

11:24

Here, in the final slide of this lecture, I just give the fitted line. Because of how we've forced the model, it has to go through the point zero, zero for the centered child's heights and the centered parent's heights; equivalently, it has to go through the point given by the mean of the child's heights and the mean of the parent's heights. So I've reshifted the plot so that zero, zero is back at the original point.

11:53

And so here is our line, with a slope of 0.646. In subsequent lectures, we'll talk about how we get these estimates, the motivation behind them, and all the things we can do with this fitted line. We're going to spend maybe the next several lectures talking about this.