Now, the residuals are useful as well,

because they allow me to assess the quality of fit of the regression model.

Ideally, all our residuals would be zero;

that would mean that the line went through all the points.

In practice, that is simply not going to happen but

we will often examine the residuals from a regression.

Because by examining the residuals,

we can potentially gain insight into that regression.

And typically, when I run regressions, one of the very first

things I'm going to do is take all the residuals out of the regression.

I'm going to sort that list of the residuals and

I'm going to look at the most extreme residuals.

The points with the biggest residuals are by definition those points

that are not well fit by the current regression.

If I'm able to look at those points and explain why they're not well fit,

then I have typically learned something that I can incorporate in a subsequent

iteration of the regression model.
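As a minimal sketch of that workflow (fit, take the residuals out, sort them, look at the extremes), the following uses plain Python. The numbers and the helper name `fit_line` are made up for illustration and are not from the lecture.

```python
# Made-up illustrative numbers, not data from the lecture.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 14.0, 12.1]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(x, y)

# Residual = observed value minus fitted value.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Indices of the points, ordered from largest to smallest absolute residual.
order = sorted(range(len(x)), key=lambda i: abs(residuals[i]), reverse=True)

# Show the three most extreme residuals.
for i in order[:3]:
    print(f"point {i}: x={x[i]}, y={y[i]}, residual={residuals[i]:+.2f}")
```

Sorting on the absolute value keeps both large positive and large negative residuals at the top of the list, since either kind marks a point that the line does not fit well.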

Now that all sounded a little bit abstract,

I've got an example to show you right now.

So here's another data set that lends itself to a regression analysis.

And in this data set, I've got two variables.

The outcome variable, or the Y variable, is the fuel economy of a car.

And to be more precise,

it's the fuel economy as measured by gallons per 1,000 miles in the city.

So let's say you live in the city and

you only drive in the city.

How many gallons are you going to have to put in the tank to be able to drive

your car a thousand miles over some course of time?

That's the outcome variable.

Clearly, the more gallons you have to put in the tank,

the less fuel efficient the vehicle is, that's the idea.

Now we might want to create a predictive model for

fuel economy as a function of the weight of the car.

And so here, I've got an X variable as weight and I'm going to look for

the relationship between the weight of a car and its fuel economy.

We collect a set of data, and that's what you can see in this scatter plot,

the bottom left-hand graph on this slide. Each point is a car, and

for each car we've found its weight, we've found its fuel economy, and

we've plotted the variables against one another.

And we have run a regression through those points, through the method of least squares.

And that regression gives us a way of predicting the fuel economy of a vehicle

of any given weight.

Now why might you want to do that?

Well, one of the things that many vehicle manufacturers are thinking about these

days is creating more fuel efficient vehicles.

And one approach to doing that is actually to change

the materials that vehicles are manufactured from.

So for example, they might be moving from steel to aluminum.

Well, that will reduce the weight of the vehicle.

Well, if the vehicle's weight is reduced,

I wonder how it will impact the fuel economy?

And so, that's the sort of question that we'll be able to start addressing through

such a model.
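To make that concrete, here is a small sketch of using a fitted line to predict gallons per 1,000 miles from weight, and to compare the prediction before and after a weight reduction. The weights (assumed to be in thousands of pounds), the fuel figures, and the function name `predict` are all invented for illustration; they are not the data on the slide.

```python
# Made-up numbers for illustration only.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]        # assumed unit: thousands of pounds
gp1000m = [38.0, 43.0, 50.0, 55.0, 62.0]  # gallons per 1,000 miles, city

n = len(weight)
w_bar = sum(weight) / n
g_bar = sum(gp1000m) / n

# Least-squares slope and intercept.
slope = sum((w - w_bar) * (g - g_bar) for w, g in zip(weight, gp1000m)) / sum(
    (w - w_bar) ** 2 for w in weight
)
intercept = g_bar - slope * w_bar


def predict(w):
    """Predicted gallons per 1,000 miles for a car of weight w (hypothetical helper)."""
    return intercept + slope * w


before = predict(3.0)  # prediction at the current weight
after = predict(2.7)   # prediction if the weight were 0.3 units lower
print(f"slope: {slope:.2f} gallons per 1,000 miles per unit of weight")
print(f"predicted change from the lighter design: {after - before:+.2f}")
```

The slope is the quantity doing the work here: it is the model's estimate of how many fewer gallons per 1,000 miles go with each unit of weight removed.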

So that's the setup for this problem, but I want to show you why looking

at the residuals can be such a useful thing.

So when I look at the residuals from this particular regression, I note

one of the residuals; actually, I've found the biggest residual in the whole data set.

And that's the point that I have identified in red in the scatter plot and

it is the biggest residual.

It's a big positive residual which means that

the reality is that this particular vehicle needs a lot more gas

going in the tank than the regression model would predict.

The regression model would predict the value on the line.

The red data point is the actual observed value; it's above the line, so

it's less fuel efficient than the model predicts.

It needs more gas to go in the tank than the model predicts. So,

is there anything special about that vehicle?

Well, at that point, I go back to the underlying data set and I drill down, so

when I see bigger residuals, I'm going to drill down on those residuals.

And drilling down on these residuals actually identifies

the vehicle, and the vehicle turns out to be something called a Mazda RX-7.

And this particular vehicle is somewhat unusual, because it had

what's termed a rotary engine,

which is a different sort of engine from every other single

vehicle in this data set.

Every other vehicle had a standard engine but the Mazda RX-7 had a rotary engine and

that actually explains why its fuel economy is bad in the city.
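The drill-down step can be sketched as follows: attach each residual to its full record, pick the record with the largest residual, and read off the fields that were not in the model. Every record below is hypothetical (the labels, weights, fuel figures and the `engine` field are invented), so this shows only the mechanics, not the lecture's data.

```python
# Hypothetical records, invented for illustration; not the lecture's data set.
cars = [
    {"label": "car A", "weight": 2.0, "gp1000m": 38.0, "engine": "standard"},
    {"label": "car B", "weight": 2.5, "gp1000m": 43.0, "engine": "standard"},
    {"label": "car C", "weight": 2.8, "gp1000m": 60.0, "engine": "rotary"},
    {"label": "car D", "weight": 3.0, "gp1000m": 50.0, "engine": "standard"},
    {"label": "car E", "weight": 3.5, "gp1000m": 55.0, "engine": "standard"},
    {"label": "car F", "weight": 4.0, "gp1000m": 62.0, "engine": "standard"},
]

# Least-squares fit of gallons per 1,000 miles on weight.
n = len(cars)
w_bar = sum(c["weight"] for c in cars) / n
g_bar = sum(c["gp1000m"] for c in cars) / n
sxy = sum((c["weight"] - w_bar) * (c["gp1000m"] - g_bar) for c in cars)
sxx = sum((c["weight"] - w_bar) ** 2 for c in cars)
slope = sxy / sxx
intercept = g_bar - slope * w_bar

# Store the residual (observed minus predicted) on each record.
for c in cars:
    c["residual"] = c["gp1000m"] - (intercept + slope * c["weight"])

# Drill down: the record with the largest absolute residual.
worst = max(cars, key=lambda c: abs(c["residual"]))
print(worst["label"], worst["engine"], f"{worst['residual']:+.2f}")
```

Keeping the residual on the same record as the other fields is what makes the drill-down easy: the unexplained point and its unmodeled attributes can be read side by side.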

And so, by drilling down on the point, by looking at the residuals,

I've identified a feature that I hadn't originally incorporated into the model.

And that would be the type of engine.

And so, the residual and the exploration of the residual has

generated a new question for me that I didn't have prior to the analysis.

And that question is: how does the type of engine

impact the fuel economy as well?

So that's one of the outcomes of regression that can be very, very useful.

It's not the regression model directly talking to you.

It's the deviations from the underlying model that can sometimes be the most

insightful part of the model itself or the modeling process.

I remember in one of the other modules I talked about

the benefits of modeling.

And one of them is serendipitous outcomes, things that you find

that you hadn't expected to at the beginning.

And I would put this out there as an example of that. By

exploring the residuals carefully,

I've learned something new,

something that I hadn't anticipated and I might be able to subsequently improve my

model by incorporating this idea of type of engine into the model itself.

So the residuals are an important part of a regression model.