0:01

The concept of an outlier should not be foreign to you at this point. We've talked about outliers numerous times throughout the course. However, in this video, we're going to focus on outliers within the context of linear regression. And we're going to talk about how to identify various types of outliers, as well as touch on how to handle them.

In this plot, we can see a cloud of points that are clustered together, as well as one single point that is far away from the rest of them. The question is, how does this outlier influence the least squares line? To answer this question, we want to think about where the line would go if this particular outlier were not there. In that case, there would be absolutely no relationship between the two variables, because the points are completely randomly scattered, so the line would be horizontal.

0:53

Therefore, without the outlier, there is no relationship between x and y, and this one single outlier makes it appear as though there is.
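To make this concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration and are not the data in the plot, and the helper name `least_squares` is our own. The cloud is constructed to have no linear trend, and then one far-away point is added.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical cloud: the y values are symmetric about the middle of x,
# so the cloud on its own has no linear trend.
cloud_x = [1, 2, 3, 4, 5, 6, 7, 8]
cloud_y = [2, 4, 2, 4, 4, 2, 4, 2]

# One hypothetical point far away from the cloud.
outlier_x, outlier_y = 30, 20

slope_without, _ = least_squares(cloud_x, cloud_y)
slope_with, _ = least_squares(cloud_x + [outlier_x], cloud_y + [outlier_y])

print("slope without the outlier:", round(slope_without, 3))
print("slope with the outlier:   ", round(slope_with, 3))
```

Without the far-away point the fitted line is flat, and with it the line tilts clearly upward, even though the cloud itself has not changed.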

There are various types of outliers, and how we handle them depends on the type. In general, outliers are points that fall away from the cloud of points.

1:16

Outliers that fall horizontally away from the center of the cloud but don't influence the slope of the regression line are called leverage points. Outliers that actually influence the slope of the regression line are called influential points.

1:34

Usually, influential points are also high leverage points. To determine whether a point is influential, we want to visualize the regression line with and without the point and ask: does the slope of the line change considerably?

So what type of an outlier is this? To answer this question, we first want to ask: does this point fall away from the rest of the data in the horizontal direction? The answer is yes, it does. This makes it a potential leverage point. But another question we want to ask is, is it also influential? Let's think about where the line would go with and without the point. It appears that the line would stay in exactly the same place. So the outlying point is actually on the trajectory of the regression line, and therefore it does not influence it. This makes this point a leverage point.

2:31

And what about this one? Just like the previous point, this outlying point also falls away from the rest of the data in the horizontal direction. So it could simply be a leverage point. However, it also appears to be influencing the slope of the line. If we were to remove this point, the line would look considerably different. In fact, it would look horizontal, since otherwise there's absolutely no relationship between x and y. Therefore, we would identify this as an influential point.
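The two cases we just looked at can be sketched numerically. This is only an illustration under our own assumptions: the data are made up, the function names are hypothetical, and the leverage formula used here (the hat value for simple linear regression, 1/n plus the squared distance of x from its mean divided by the total squared distance) is a standard quantity that the plots only show visually.

```python
def fit_line(xs, ys):
    """Least squares fit; returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def leverage(xs, index):
    """Hat value of one point in simple linear regression:
    h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2.
    It depends only on x, i.e. on horizontal distance from the cloud."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return 1 / n + (xs[index] - x_bar) ** 2 / sxx


def slope_with_and_without(xs, ys, index):
    """Slope using all points, and slope after dropping the point at index."""
    slope_all, _ = fit_line(xs, ys)
    rest_x = xs[:index] + xs[index + 1:]
    rest_y = ys[:index] + ys[index + 1:]
    slope_dropped, _ = fit_line(rest_x, rest_y)
    return slope_all, slope_dropped


# Made-up cloud scattered around the line y = 2x + 1.
cloud_x = [1, 2, 3, 4, 5, 6, 7, 8]
cloud_y = [2, 6, 6, 10, 12, 12, 16, 16]

# Case 1: a far-away point on the trajectory of the line (2 * 30 + 1 = 61).
on_x, on_y = cloud_x + [30], cloud_y + [61]
# Case 2: a far-away point well off the trajectory of the line.
off_x, off_y = cloud_x + [30], cloud_y + [5]

far = len(cloud_x)  # index of the appended point in both cases
h_far = leverage(on_x, far)  # same in both cases, since x is the same
on_all, on_dropped = slope_with_and_without(on_x, on_y, far)
off_all, off_dropped = slope_with_and_without(off_x, off_y, far)

print("leverage of the far point:", round(h_far, 3))
print("on-trend point:  slope with =", round(on_all, 3), " without =", round(on_dropped, 3))
print("off-trend point: slope with =", round(off_all, 3), " without =", round(off_dropped, 3))
```

In both cases the far point has the same high leverage, because leverage only looks at x. What separates the two cases is whether the slope moves when the point is dropped: it stays put for the on-trend point and changes considerably for the off-trend one.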

When we are trying to decide whether to leave a data point in the analysis or take it out, if it's an influential point, we want to be very careful about leaving it in, because it's definitely going to affect our estimates and all of the decisions that we're going to make based on the results of the analysis.

Here's another example of influential points. Here we have the light intensity and surface temperature, both on a log scale, of 47 stars in a star cluster. We can see that there are two different types of stars, ones that have a lower temperature and ones that have a higher temperature. The solid blue line shows us how the regression model would look if we were to ignore the outliers. And the red dashed line shows us how the regression model would look if we were to include the outliers. The outliers are the four stars with the lower temperatures. Obviously, the red dashed line is not a good fit for these data. So in this case, what we might want to do is actually split our data into two, those stars that have a lower temperature and those stars that have a higher temperature, and model the two groups separately.
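Here is a rough sketch of what splitting the data and modeling the groups separately could look like. The values below are invented for illustration and are not the actual star measurements. The cutoff of 4.0 on the log temperature scale, the variable names, and the choice of temperature as the explanatory variable are all assumptions of this sketch.

```python
def fit_line(xs, ys):
    """Least squares fit; returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Invented (log temperature, log light intensity) pairs, NOT the real data.
stars = [
    (4.2, 4.4), (4.3, 4.7), (4.4, 5.0), (4.5, 5.3), (4.6, 5.6),
    (4.3, 4.8), (4.5, 5.2),                              # hotter stars
    (3.45, 5.9), (3.48, 6.2), (3.50, 6.0), (3.52, 6.3),  # four cooler stars
]

CUTOFF = 4.0  # hypothetical log temperature separating the two groups

hot = [(t, l) for t, l in stars if t >= CUTOFF]
cool = [(t, l) for t, l in stars if t < CUTOFF]


def slope_of(group):
    temps = [t for t, _ in group]
    lights = [l for _, l in group]
    return fit_line(temps, lights)[0]


pooled_slope = slope_of(stars)  # everything lumped together
hot_slope = slope_of(hot)       # hotter stars only
cool_slope = slope_of(cool)     # cooler stars only

print("pooled slope:", round(pooled_slope, 2))
print("hot group slope:", round(hot_slope, 2))
print("cool group slope:", round(cool_slope, 2))
```

With these made-up numbers, the pooled fit even gets the direction of the relationship wrong for the hotter stars, which is the kind of poor fit the red dashed line illustrates. Fitting each group on its own keeps the cooler stars in the analysis without letting them distort the model for the rest.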

4:20

Remember, we don't want to just blindly get rid of outlying points, because those actually might be the most interesting cases. Perhaps these stars that are much colder than the other ones are indeed more interesting to look at. But what we don't want to do is lump them together with the stars that have a higher temperature and try to model all of them together.

One last remark on influential points. Let's take a look at this statement and evaluate whether it's true or false: influential points always reduce R squared. It is true that influential points tend to make life more difficult. But is it true that they always reduce R squared?

Let's take a look at these two graphs, one in which we have an influential point and one in which we don't. The first plot does not have an influential point. And we can see that the regression line looks fairly horizontal, indicating that there's little to no relationship between x and y. In the second plot, we have an influential point that is far away from the trajectory of the original regression line, and hence pulls the regression line toward itself. In the first plot, the correlation coefficient is very low, just 0.08, and hence R squared is pretty low as well, at 0.0064. In the second plot, however, all of a sudden, we're seeing an increase in our correlation coefficient, as well as an associated increase in our R squared. So, even though we would never want to fit a linear model in the second plot, we are actually seeing a much higher correlation and a much higher R squared.
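We can check this claim with a small sketch. The data here are made up and are not the points in the two plots, and the function name `correlation` is our own. The only number taken from the plots is the correlation of 0.08, whose square is the R squared of 0.0064.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# In simple linear regression, R squared is the square of the correlation.
r_first_plot = 0.08
print("R squared from r = 0.08:", round(r_first_plot ** 2, 4))

# Made-up cloud with little to no linear relationship.
cloud_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cloud_y = [5, 3, 6, 4, 5, 5, 4, 6, 3, 6]

# The same cloud plus one hypothetical influential point.
with_x = cloud_x + [40]
with_y = cloud_y + [30]

r2_without = correlation(cloud_x, cloud_y) ** 2
r2_with = correlation(with_x, with_y) ** 2

print("R squared without the influential point:", round(r2_without, 3))
print("R squared with the influential point:   ", round(r2_with, 3))
```

In this sketch R squared goes up, not down, once the influential point is added, so the statement that influential points always reduce R squared is false.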

This is a good lesson in always viewing a scatter plot before fitting a model. If we were deciding whether the model is a good fit simply by looking at the correlation coefficient and R squared, we would never catch the anomaly in the data: that there is only one influential point driving the entire relationship.
