0:24

Remember, if we include an intercept, the residuals have to sum to zero, which means their mean is zero. So if we want to take the variance of the residuals, it's just the average of the squares. So the sum of the squared residuals, times one over n, is an estimate of sigma squared: the variation around the regression line, the true population variation around the regression line.

0:48

Now, most people use n minus two instead of n. So it's not exactly the average squared residual; it's kind of like the average squared residual. For large n, the difference between one over n minus two and one over n is irrelevant, but for small n it can make a difference. The way to think about that is: remember, if we include the intercept, the residuals have to sum to zero, and that puts a constraint on them. If you know n minus one of them, then you know the nth. Well, if you have a slope term in there, if you have a covariate in there, that puts a second constraint on the residuals. So you lose two degrees of freedom. If you put another regression variable in there, you have another constraint, and you lose three degrees of freedom. So in that sense it's sort of like saying you really don't have n residuals, you have n minus two of them, because if you knew n minus two of them, you could figure out the last two. And that's why it's one over n minus two.
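A quick way to see those two constraints is to fit a line by least squares and check the residuals directly. This is a numpy sketch on simulated data (the lecture's diamond data isn't reproduced here):

```python
import numpy as np

# Simulated data standing in for the lecture's example.
rng = np.random.default_rng(1)
n = 20
x = rng.uniform(0, 1, n)
y = 3 + 2 * x + rng.normal(0, 0.5, n)

# Least-squares fit with an intercept.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
e = y - (beta0 + beta1 * x)  # residuals

# The two constraints that cost two degrees of freedom:
print(np.isclose(e.sum(), 0))        # residuals sum to zero (intercept)
print(np.isclose(np.sum(x * e), 0))  # residuals orthogonal to x (slope)

# So the variance estimate divides by n - 2, not n.
sigma_hat = np.sqrt(np.sum(e ** 2) / (n - 2))
```

With those two linear constraints in force, only n minus two of the residuals are free to vary, which is exactly the degrees-of-freedom argument above.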

So let me show you how you can grab the residual variation out of your lm fit and assign it to a variable. This way, if you need to work with it in an R program, you can actually grab the number, not just see it on the printout. So here I've defined my y and my x, and I've defined my fit as the regression model with y as the outcome and x as the predictor. Well, if you just do summary of fit and don't do anything else, you just hit return, it'll print out the summary of the regression model: intercepts, slopes, estimated values, and so on, and you'll see the residual standard deviation estimate among the elements in the printout. However, if you want to grab it as an object that you can assign to something, just put dollar sign sigma after summary of fit. Then you can assign sigma to any other variable, if you're using it in a program in some other way. This works out in this particular example to be 31.84 dollars.

2:34

So here, let's just confirm that I'm not lying to you and that the formula works. If I do resid of fit, that grabs the residuals. If I square it, it squares them. If I sum it, it adds up the squared values. If I divide by n minus two, it takes the average of the unique residuals. And then if I square root it, we get 31.84, so I wasn't lying.
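That same chain of steps translates directly outside of R. Here's a numpy sketch on simulated data, checking the hand computation against a standard deviation with two degrees of freedom removed:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 2, n)
y = 1 + 0.5 * x + rng.normal(0, 0.3, n)

# Fit the line and grab the residuals (the analogue of resid(fit) in R).
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
e = y - (beta0 + beta1 * x)

# Square, sum, divide by n - 2, square root.
sigma_hat = np.sqrt(np.sum(e ** 2) / (n - 2))

# Since the residuals average to zero, this matches a standard deviation
# computed with ddof = 2.
print(np.isclose(sigma_hat, np.std(e, ddof=2)))
```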

2:58

Now let's go back to this plot where we look at the total variability in diamond prices, and then compare what happens to the variability when we explain some of that variability with a regression line. The total variability is just the average squared deviation of my data around its mean, around the center. And just to make things easy, let's forget about the denominator and just talk about the sum of the squared deviations.

3:40

Then there's the variability in the response that's explained by the regression, by the regression line: take the fitted values and look at how much they deviate around the average. That's the regression variability. Everything that's left over is the variation around the regression line, and that's the residual variability. And the interesting identity, and it kind of makes sense that this would be the case, is that the total variability, the variability in diamond prices disregarding everything except where they're centered, is equal to the regression variability, the variability explained by the model, plus the residual variability.

4:23

Because the residual variation and the regression model variation add up to the total variation, we can define a quantity that represents the percentage of the total variation that's represented by the model: simply take the regression variation and divide it by the total variation. That quantity is called R squared. So R squared, for our diamond example, is the percentage of the variation in diamond price that is explained by the regression relationship with mass.
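The decomposition and the R squared definition are easy to verify numerically. A numpy sketch on simulated data (not the diamond data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 0.5, n)

# Least-squares fit and fitted values.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
yhat = beta0 + beta1 * x

ss_total = np.sum((y - y.mean()) ** 2)    # total variability
ss_reg = np.sum((yhat - y.mean()) ** 2)   # variability explained by the line
ss_res = np.sum((y - yhat) ** 2)          # leftover residual variability

print(np.isclose(ss_total, ss_reg + ss_res))  # the identity holds
r_squared = ss_reg / ss_total                 # fraction of variation explained
```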

4:57

Just to remind you, R squared is the percentage of the variation in the response explained by the linear relationship with the predictor. R squared has to be between zero and one, because the regression sum of squares and the error sum of squares add up to the total sum of squares, and they're all positive. So that forces R squared to be between zero and one. If we define R as the sample correlation between the predictor and the outcome, then R squared is literally that sample correlation, R, squared.
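That last fact, that R squared is the squared sample correlation, can also be checked directly. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 80)
y = 1 + 2 * x + rng.normal(0, 0.4, 80)

# R squared via the sums of squares...
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
yhat = beta0 + beta1 * x
r2_from_ss = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# ...equals the sample correlation, squared.
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r2_from_ss, r ** 2))
```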

5:31

R squared can be a misleading summary of model fit. For example, if you have somewhat noisy data and delete a lot of the points in the middle, you can get a much higher R squared. Or if you just add arbitrary regression variables into a linear model fit, you increase R squared and decrease the mean squared error, the average squared residual variation. So these things have to be kept in mind if you're using them to assess model fit.
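The first of those effects is easy to demonstrate: thin out the middle of the x range and R squared jumps, even though the underlying relationship hasn't changed. A numpy sketch:

```python
import numpy as np

def r_squared(x, y):
    # Squared sample correlation = R squared for simple linear regression.
    return np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 100)
y = x + rng.normal(0, 0.2, 100)

r2_full = r_squared(x, y)

# Keep only the points at the two extremes of x.
keep = (x < 0.2) | (x > 0.8)
r2_extremes = r_squared(x[keep], y[keep])

# Same relationship, same noise level, but dropping the middle inflates R squared.
print(r2_extremes > r2_full)
```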

5:58

Anscombe created a particularly stark example: a bunch of data sets with equivalent R squared, equivalent means and variances in the x's and the y's, and identical regression relationships. But when you look at the scatter clouds, you can see that the fit has a very different meaning in each of the cases. So let's look at the output from Anscombe's example and see what it shows. And here it is, the four data sets. The first is a nice regression line, exactly along the lines of what we think of when we think of a slightly noisy x-y relationship. In the second one, there's clearly a missing term needed to address the curvature in the data.
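Anscombe's numbers are easy to check by hand. Here are the first two of his four data sets (values from Anscombe's 1973 paper): a slightly noisy linear cloud and a clearly curved one, with essentially identical R squared:

```python
import numpy as np

# Anscombe's data sets I and II (1973). Same x's; very different shapes.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

r2_1 = np.corrcoef(x, y1)[0, 1] ** 2
r2_2 = np.corrcoef(x, y2)[0, 1] ** 2

# Both come out to about 0.67, even though y2 follows a smooth curve
# that a straight line clearly misses.
print(round(r2_1, 2), round(r2_2, 2))
```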