
Let's go through calculating residuals and we're going to use the diamond dataset.

Let's see. So, our data is the diamond dataset.

So let's load that.

I'm going to redefine price as y, carat as x, and n as the number of pairs, just so I don't have to type so much.

Now, I'm going to assign to a variable named fit the linear regression object that gets created from lm.

So let me do that.

Now, the easiest way to get the residuals is just resid of fit, so I'm going to define those as e.
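The setup just described might look like the following sketch. Since the course's diamond dataset (from the UsingR package) may not be at hand, this example simulates a stand-in with the same shape, price in Singapore dollars against mass in carats; the coefficients and noise level are made up for illustration:

```r
# Stand-in for the UsingR diamond data: price in Singapore dollars,
# mass in carats.  The generating numbers are illustrative only.
set.seed(1)
diamond <- data.frame(carat = runif(48, 0.12, 0.35))
diamond$price <- -260 + 3720 * diamond$carat + rnorm(48, sd = 32)

y <- diamond$price          # outcome: price
x <- diamond$carat          # predictor: mass
n <- length(y)              # number of (x, y) pairs

fit <- lm(y ~ x)            # linear regression object from lm
e <- resid(fit)             # the easiest way to get the residuals
```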

Let me show you another way to get the residuals that, of course, has to do the same thing.

If I get my predicted fitted values, remember, if I don't give the predict function new data, if I just give it the output from lm, then it will predict at the observed x values.

So yhat now is a vector of predictions at the observed carat values.

Now, I just want to show you that the residuals calculated via the resid function are the same as the residuals that I'm calculating manually, which is just subtracting yhat from y.

So the way to do that is just to take the absolute differences and find the largest one.

And I see that the largest one is on the scale of 10 to the minus 13th.

So, up to numerical precision, it's the same thing.

Then lastly, I just want to show that if I manually calculate the fitted values, coef of fit, element 1, plus coef of fit, element 2, times x, I of course get exactly the same numbers.

So, up to numerical precision, exactly the same.
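The three equivalent ways of getting the residuals can be sketched like this, again on simulated stand-in data (the generating numbers are made up for illustration):

```r
# Simulated stand-in data (illustrative numbers only).
set.seed(1)
x <- runif(48, 0.12, 0.35)                 # mass in carats
y <- -260 + 3720 * x + rnorm(48, sd = 32)  # price
fit <- lm(y ~ x)

e <- resid(fit)                            # residuals via resid

# predict with no new data predicts at the observed x values.
yhat <- predict(fit)
max(abs(e - (y - yhat)))                   # effectively zero

# Manual fitted values from the coefficients agree as well.
yhat2 <- coef(fit)[1] + coef(fit)[2] * x
max(abs(e - (y - yhat2)))                  # effectively zero
```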

So the way you want to get the residuals is resid, but hopefully showing you this other code illustrates what's going on in the background, the actual calculation that resid is doing.

Finally, let me show you that the sum of my residuals is zero.

Well, it's 10 to the minus 14th, which is close enough to zero for me.

And then also the sum of my residuals times the predictor x, that also has to be zero.

Well, 10 to the minus 15th.

So, up to numerical precision, it's zero in both cases.
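Both facts can be checked directly; this sketch again uses simulated stand-in data:

```r
# Simulated stand-in data (illustrative numbers only).
set.seed(1)
x <- runif(48, 0.12, 0.35)
y <- -260 + 3720 * x + rnorm(48, sd = 32)
e <- resid(lm(y ~ x))

sum(e)      # effectively zero: residuals sum to zero with an intercept
sum(e * x)  # effectively zero: residuals are orthogonal to the predictor
```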

So the residuals are the signed lengths of the red lines shown in the following plot.

And I'm going to do this using base R graphics, just so I mix a little base R with some ggplot graphics.

So, I'm going to create my plot here; there's my plot.

I'm going to add the fitted line, and in base R, if you've fit a regression line, you can just call abline with the object you assigned from lm as an argument, and it will add the regression line.

Here I want the line width to be two, so it shows up a little bit better.

And then I'm just going to for loop over the data values to add in the red lines.
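A base R sketch of the plot being described, again on simulated stand-in data; the red lines connect each observed point to its fitted value:

```r
# Simulated stand-in data (illustrative numbers only).
set.seed(1)
x <- runif(48, 0.12, 0.35)
y <- -260 + 3720 * x + rnorm(48, sd = 32)
n <- length(y)
fit <- lm(y ~ x)

plot(x, y, xlab = "Mass (carats)", ylab = "Price (SIN $)",
     bg = "lightblue", col = "black", cex = 1.1, pch = 21)
abline(fit, lwd = 2)        # the lm object as the argument adds the line
yhat <- predict(fit)
for (i in 1:n)              # red lines from each point down to the fit
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red", lwd = 2)
```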

Let me zoom in and show you that plot.

There's my plot.

Now my residuals are these red lines.

These are the distances where, if the point is above the line, the residual will be positive.

And if the point is below the line, the residual will be negative.

This scatter plot isn't particularly useful for assessing residual variation.

Notice all of the blank space in this part and this part of the graph, making the plot kind of useless for that purpose.

So, instead, why don't we plot the residuals on the vertical axis versus mass on the horizontal axis?

Let's go ahead and run the code, and here's the plot.

Now we can see the residual variation much more clearly.
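The lecture draws this with ggplot2; here is a base-graphics sketch of the same residual plot on simulated stand-in data, so no extra packages are needed:

```r
# Simulated stand-in data (illustrative numbers only).
set.seed(1)
x <- runif(48, 0.12, 0.35)
y <- -260 + 3720 * x + rnorm(48, sd = 32)
e <- resid(lm(y ~ x))

# Residuals on the vertical axis, mass on the horizontal axis.
plot(x, e, xlab = "Mass (carats)", ylab = "Residual price (SIN $)",
     bg = "lightblue", col = "black", cex = 1.1, pch = 21)
abline(h = 0, lwd = 2)      # reference line: residuals sum to zero
```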

When you look at a residual plot, you're looking for any form of pattern.

The residuals should be mostly patternless.

Also, remember that if we've included an intercept, the residuals have to sum to zero.

So they have to lie above and below this horizontal line at zero, and you'd like them to be distributed in a nice, random-looking fashion both above and below zero.

We can see some interesting patterns by honing in on the residual plot here.

For example, we can see that there were lots of diamonds of exactly the same measured mass; that feature sort of gets lost in the scatter plot, but by zooming in this way, we notice it.

Next, we're going to go through some pathological residual plots, just to highlight what residual plots can do for you.

So, I've concocted some examples that will help us understand how residuals can home in on model fit.

So let's look in the R markdown file, and I'm going to show it again at the console rather than going through the slides, so you can actually watch me doing it.

X here is just going to be uniform from minus 3 to plus 3.

So, I've created an x variable that's just a kind of random smattering of points between the values minus 3 and 3.

My y is equal to x, so it's an identity line, but then I'm going to add another term that's sin x.

So, it should look like an identity line, but oscillating around it a little bit, and then I'm adding some normal noise on top of it.

So let me add my y, and I'm going to switch back to ggplot rather than base graphics now, because I like it better.

So, I've created my ggplot.

I'm going to go ahead and add the smooth first, because I want it as the bottommost layer, and then I'm going to add my two sets of points, and there's my scatter plot.
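The concocted example can be sketched like this. The lecture builds the plot with ggplot2's smooth layer; this base-graphics version approximates that layer with abline, drawing the fitted line first so it sits under the points (the noise level, sd = 0.2, is an assumption for illustration):

```r
set.seed(7)
x <- runif(100, -3, 3)                   # random smattering on (-3, 3)
y <- x + sin(x) + rnorm(100, sd = 0.2)   # identity line + oscillation + noise

# Fitted line drawn first so it sits under the points.
plot(x, y, type = "n", xlab = "X", ylab = "Y")
abline(lm(y ~ x), lwd = 2)
points(x, y, bg = "lightblue", pch = 21, cex = 1.1)
```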

And so let me zoom in; it's a little difficult to see the non-linearity.

That sin x term is a little apparent, but it's kind of hard to see.

I think if I was looking at this, I would immediately notice some pattern around the fit here.

But nonetheless, it's maybe a little bit hard to see.

Before I move on to the residual plot, let me make a comment.

This model is actually not the correct model for this data, and this might happen in practice.

This doesn't mean that the model is unimportant, right?

There is a linear trend and the model is accounting for it; it's just not accounting for the secondary variation in the sin term.

So, I just want to emphasize that in regression modeling, just because you aren't fitting the actually correct model, that doesn't mean the model is itself useless.

You have, you know, an average identity line here that represents the relationship between y and x, and it explains a lot of the variation.

So, I just want to remind you that in regression, having the exact right model is not always the primary goal.

You can get meaningful information about trends from incorrect models.

So, I just want to get that statement out of the way, but now let's hone in on the residuals.

So let's plot the residuals versus x and see if it makes this component of the fit, the sin x term, more apparent.

Okay.

So let me plot the residuals versus the x variable.

So, just to describe what I have, I'm going to assign g as my ggplot.

My x in this case is x; I have defined the x variable as the variable named x.

But now, my y is not the y variable; it's going to be the residual from the linear model fit.

Here, I just grab it in that R command right there; then the aesthetic for my ggplot just has x and y as the names of the variables for the horizontal and vertical axes.

So let me run that command; then I want to put a horizontal reference line at zero, then I want to add my points and set the axes how I'd like, and then let's see the plot, and there's the plot.
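A sketch of the residual plot for this example, using base graphics in place of the lecture's ggplot2 code:

```r
# Same concocted data as before (noise level assumed for illustration).
set.seed(7)
x <- runif(100, -3, 3)
y <- x + sin(x) + rnorm(100, sd = 0.2)

# The vertical axis is the residual from the linear fit, not y itself.
e <- resid(lm(y ~ x))
plot(x, e, xlab = "X", ylab = "Residual",
     bg = "lightblue", pch = 21, cex = 1.1)
abline(h = 0, lwd = 2)      # horizontal reference line at zero
```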

Let me zoom in.

And here is the plot, and I think what you can see is that this sin term is now extremely apparent.

What the residual plot has done is zoom in on this part of the model inadequacy and really highlight it, and that's one thing that residual plots are especially good at.

I'm going to show you another one where, by all appearances, the points fall perfectly on a line.

But when you zoom in on the residuals, it looks quite different.

So let me run the commands and then I'll show you.

So there's the plot; look at that, and it seems like the points fall exactly on an identity line.

Now let me zoom in on the residuals, and you see this trend toward greater variability as you head along the x variable.

That property, where the variability increases with the x variable, is called heteroscedasticity.

Heteroscedasticity is one of those things that residual plots are quite good at diagnosing, and you couldn't see it before.

If I go back to this earlier plot, you can't see it at all here.

Zoom in on the residuals, and there you see it.

Let me just zoom back here to how I generated the data, just to illustrate it for you.

My x variable is a bunch of uniform random variables.

My y variable is my x variable, so an identity line.

But then when I added the errors, the standard deviation of the errors, look right here, has the x term involved in it, and that's how I generated data with heteroscedasticity.
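The whole heteroscedasticity example can be sketched like this; the specific constants (x on 0 to 6, error sd of 0.001 times x) are assumptions chosen so the scatter plot looks like a clean line while the residuals fan out:

```r
set.seed(7)
x <- runif(100, 0, 6)
# Error standard deviation grows with x: heteroscedasticity by construction.
y <- x + rnorm(100, sd = 0.001 * x)

plot(x, y)                      # looks like a perfect identity line
e <- resid(lm(y ~ x))
plot(x, e, ylab = "Residual")   # residuals fan out as x increases
abline(h = 0, lwd = 2)
```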

So let's run the residual plot for the diamond data.

Here, I'm just going to add a column to the diamond data that is the residuals from the regression model fit where price is the outcome and carat is the predictor.

So, I run that, and now my data frame has carat, price, and now the residuals.

So, I'm going to create my ggplot.

My x label is going to be Mass in carats, and my y label is going to be Residual price, and I just want to emphasize that the residuals have the same units as the ys.

So the residual price is in Singapore dollars.

I'm going to add my horizontal line, I'm going to add my points, and then there's the plot.

So there doesn't appear to be a lot of pattern in the plot, so this is good; it seems like it's a pretty good fit.
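A sketch of adding the residual column and drawing the residual plot, with simulated stand-in data and base graphics in place of ggplot2:

```r
# Simulated stand-in for the diamond data frame (illustrative numbers).
set.seed(1)
diamond <- data.frame(carat = runif(48, 0.12, 0.35))
diamond$price <- -260 + 3720 * diamond$carat + rnorm(48, sd = 32)

# Add the residuals from price ~ carat as a new column.
diamond$e <- resid(lm(price ~ carat, data = diamond))

plot(diamond$carat, diamond$e,
     xlab = "Mass (carats)", ylab = "Residual price (SIN $)",
     bg = "lightblue", pch = 21, cex = 1.1)
abline(h = 0, lwd = 2)
```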

Let me illustrate something about variability in the diamond dataset that will help us set the stage for defining some new properties of our regression model fit.

So I'm going to create now two residual vectors.

The first residual vector is the one where I just fit an intercept, so the residuals are just the deviations around the average price; then I'm going to find the residuals where I add in the variation around the regression line.

So the first one is variation around the average price.

The second is the variation around the regression line with carat as the explanatory variable and price as the outcome.

So, I'm going to run that, and then I want to create a factor variable that labels the two sets of residuals.

The first set is going to be labeled as intercept-only model residuals, and the second set is going to be labeled as intercept-and-slope residuals; then I want to create a ggplot.

With this data frame, my y variable is going to be the residuals.

My x variable is going to be the fit, which labels the two fits: the linear model with just an intercept, or the linear model with carat, the mass, as the predictor.

And I want to fill in my points with color based on which fit it was.

So, I'm going to do that, and then the kind of plot I want is a dot plot, and then I want to set my axis labels the way I'd like.

Now let's see the plot.

There it is.
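A sketch of the comparison on simulated stand-in data, with base graphics' stripchart standing in for the lecture's ggplot2 dot plot (the group labels are assumptions):

```r
# Simulated stand-in data (illustrative numbers only).
set.seed(1)
x <- runif(48, 0.12, 0.35)
y <- -260 + 3720 * x + rnorm(48, sd = 32)

e1 <- resid(lm(y ~ 1))   # intercept only: deviations around the mean price
e2 <- resid(lm(y ~ x))   # intercept and slope: variation around the line

e <- c(e1, e2)
fit <- factor(rep(c("Itc", "Itc, slope"), each = length(y)))

# stripchart stands in for the lecture's ggplot2 dot plot.
stripchart(e ~ fit, vertical = TRUE, method = "jitter", pch = 21,
           bg = c("salmon", "lightblue"), ylab = "Residual")
```

The spread of the slope-model residuals should be visibly smaller, since the relationship with mass explains most of the price variation.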

So what we see in the left-hand plot, with just the intercept, is the variation in diamond prices around the average diamond price.

So basically, just the variation in diamond prices from the sample.

What we're seeing in the rightmost plot is the variation around the regression line.

So what we've done is explain a lot of this variation with the relationship with mass.

And what we're going to talk about next is R squared, which basically says we can decompose this total variation, the variation in the y's by themselves leaving you with just the mean, into the variation explained by the regression model and the variation that's left over after accounting for the regression model.

So this is the variation that's left over after accounting for the regression, but also the subtraction of these two would be the variation that was explained by the regression model.

But there's a formula for that, and we're going to dive into that next.
