0:32

Now R squared is the number that measures the proportion of variability in Y explained by the regression model. It turns out simply to be the square of the correlation between Y and X, but it has a nicer interpretation than just a straightforward correlation: it is interpreted as the proportion of variability in Y explained by X. And so, all other things being equal, one typically prefers a higher R squared over a lower one, because you're explaining more variability.

RMSE is a different one-number summary from a regression, and what RMSE is doing for you is measuring the standard deviation of the residuals. The residuals, remember, are the vertical distances from the points to the least squares, or fitted, line, and the standard deviation is a measure of spread. So it's telling you how much spread there is about the line in the vertical direction. I would often informally call that the noise in the system, and so RMSE is a measure of the noise in the system.
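These two summaries are straightforward to compute by hand. Here is a minimal sketch, using a small made-up dataset rather than any of the course datasets; note that the divisor n is used for RMSE here, whereas some software divides by n - 2.

```python
# Illustrative data (not course data): y is roughly linear in x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: vertical distances from each point to the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# R squared: proportion of variability in y explained by the model.
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

# RMSE: the standard deviation of the residuals (divisor n here).
rmse = (ss_res / n) ** 0.5

print(round(r_squared, 3), round(rmse, 3))
```

For this near-linear toy data, R squared comes out close to 1 and RMSE is small, which matches the interpretation above: almost all the variability in y is explained, and the points sit close to the line.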

What I've shown you in the table at the bottom of this slide are the calculations of R squared and RMSE for the three datasets that we've had a look at: the diamonds dataset, the fuel economy dataset, and the production time dataset. If you look at R squared, it's frequently reported on a percentage basis. We've got a 98% R squared for the diamonds dataset; that's because there was a very strong linear association going on there. In the fuel economy dataset, it was 77%. And in the production time dataset, it was only sitting at 26%.

Now, one of the things you have to be careful about with R squared is that there is no magic value for R squared. It's not as if R squared has to be above a certain number for the regression model to be useful. You want to think of R squared much more as a comparative benchmark as opposed to an absolute one. So if I'm comparing two regression models for the same data, all other things being equal, I'm going to typically prefer the one with the higher R squared. But just because I've got a model with an R squared of, say, 5% or 10% doesn't necessarily mean that that model isn't going to be useful in practice. It is, though, a useful comparison metric.

Now the other number, Root Mean Squared Error: I've calculated it for the three examples here, and it's 32, 4.23, and, somewhat coincidentally, 32 again for the production time dataset.

Now, one key difference between R squared and RMSE is the units of measurement. R squared, because it's a proportion, actually has no units associated with it at all, so it's easier to compare R squared values in that sense. RMSE, on the other hand, is the standard deviation of the residuals, and the residuals are the distances from point to line in the vertical direction; the vertical direction is the Y variable direction, so RMSE has the units of Y associated with it. So for the diamonds dataset, that RMSE of roughly 32 is $32. For the fuel economy dataset, the RMSE of 4.23 is 4.23 gallons per thousand miles in the city, to be formal about it. And the 32 for the production time dataset is an RMSE of 32 minutes.

So these two one-number summaries are frequently reported with a regression. Most software will calculate them automatically as soon as you run your regression model, for example within a spreadsheet environment. And all other things being equal, we like higher values of R squared, because we're explaining more variability, and we like lower values of Root Mean Squared Error. If there's a low standard deviation of the residuals around the regression line, that's tantamount to saying that the residuals are small, and the points are therefore close to the regression line, which is what we like. So those are the two one-number summaries that accompany most regression models.

Now perhaps the most useful thing you can do with the Root Mean Squared Error is to use it as an input into what we call a prediction interval. Remember that when you have uncertainty in a process, you don't just want to give a forecast; you want to give some range of uncertainty about that forecast. That's just so much more useful in practice. And with suitable assumptions, we can bring in the Root Mean Squared Error to come up with a prediction interval for a new observation.

So here's our assumption: we're going to assume that at a fixed value of X, the distribution of points about the true regression line follows a normal distribution. Another module has discussed the normal distribution. This is one of the places where normality assumptions are very common in a regression context, and what we're assuming is that the distribution of the points about the true regression line has a normal distribution. We'll talk about checking that in just a minute, but let's work with that as an assumption. Furthermore, that normal distribution is centered on the regression line. You can see the assumption being shown to you in the graphic on the page. Note that there's no data here, because we're positing a true model, so to speak.

So there's a true regression line there, at any particular value of X. Let's take the left-hand normal distribution. Say we took lots and lots of diamonds that weighed 0.15 of a carat: what do we expect their distribution to look like around the regression line? We expect the distribution of the prices to be normally distributed, with the center of the normal distribution sitting on top of the regression line, and we believe that's true for any value of X. That's one of the standard assumptions that a regression model involves.

Furthermore, we're going to assume that all of these normal distributions around the true line have the same standard deviation. That's often termed the constant variance assumption, and with that assumption we can estimate that common standard deviation, the spread of the points about the line in the vertical direction, with RMSE. So RMSE will be our estimate of the noise in the system, and with this assumption of normality, it's estimating the standard deviation associated with the normal distribution that captures the spread of the points around the true regression line. So on this slide, I've introduced an important assumption behind regression: that of the normality of the points about the regression line.
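To make the assumption concrete, here is a minimal simulation sketch, not course code: it draws many diamonds at a fixed weight from a hypothetical "true" model with normal noise of constant standard deviation, using illustrative coefficients borrowed from the diamonds example.

```python
import random

random.seed(1)

# Hypothetical "true" line and noise level, chosen for illustration.
intercept, slope, sigma = -260.0, 3721.0, 32.0

def simulate_price(weight):
    """One draw from the assumed model at a fixed x = weight:
    the line's value plus normal noise with constant SD sigma."""
    return intercept + slope * weight + random.gauss(0.0, sigma)

# Many diamonds all weighing 0.15 of a carat: their prices should be
# normally distributed, centered on the line's value at x = 0.15.
prices = [simulate_price(0.15) for _ in range(10000)]
mean_price = sum(prices) / len(prices)
sd_price = (sum((p - mean_price) ** 2 for p in prices) / len(prices)) ** 0.5

print(round(mean_price, 1), round(sd_price, 1))
# mean should be close to -260 + 3721 * 0.15 = 298.15, sd close to sigma
```

The simulated mean sits on top of the regression line and the simulated spread matches sigma, which is exactly the structure the normality and constant variance assumptions posit, with RMSE playing the role of an estimate of sigma.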

Â 7:11

Now, we know about Root Mean Squared Error as an estimate of the spread of the points about the regression line. And furthermore, we believe, or at least we assume, that that spread is normally distributed. What can we do with that information? Well, here's what we do with it: we can put that information together to come up with what we term an approximate 95% prediction interval for a new observation.

So I'm going to present to you a rule of thumb that comes out of a regression, but you've got to be careful with this rule of thumb: you can only use it within the range of the data. If you are extrapolating forecasts outside of the range of the data, don't use this rule of thumb. But at least within the range of the data, it's extremely useful. With the normality assumption, and overlaying the Empirical Rule, which was discussed in a separate module, we get, within the range of the data, an approximate 95% prediction interval for a new observation.

So the idea is that somebody comes to me with a new diamond, a diamond that wasn't used in the calculation of the regression line. They've got a new diamond, they give it to me, and they say: it weighs 0.25 of a carat. What do you think it's going to go for? What do you think the price is going to be? I could use the prediction interval to give a range of feasible values. The 95% prediction interval is the forecast, which means go up to the regression line and read off the value, plus or minus twice the Root Mean Squared Error. And that plus or minus twice the Root Mean Squared Error is coming straight out of the Empirical Rule: the 2 is there because we want a 95% prediction interval, and the RMSE is our estimate of the standard deviation of the underlying normal distribution.

So this interval really captures one of the key goals of a regression, which is to provide uncertainty with our forecast: not just a forecast, but an uncertainty range associated with that forecast. So with the normality assumption and Root Mean Squared Error, you're in a position, at least within the range of the data, to get a sense of the precision of forecasts coming out of the model.

So let's have a look at that idea for the diamonds dataset. For the diamonds dataset, the RMSE was equal to 32, and with the normality assumption that says that, at least within the range of the collected data, for diamonds similar to the set that was used in the regression analysis, the width of an approximate 95% prediction interval for a new observation is plus or minus twice the Root Mean Squared Error. 2 times 32 is 64, so this model is able to price diamonds, using a 95% prediction interval, to within about plus or minus $64. That's the calculation that is done at the bottom of the slide. Working it out: if a diamond weighs 0.25 of a carat, I put 0.25 of a carat into the regression equation, that's -260 + 3721 x 0.25. That's my forecast, or prediction, and then I do plus or minus twice the Root Mean Squared Error, which here is plus or minus 64, and I get a range of feasible values: somewhere between $606 and $734.

And I'd say that really captures the essence of what these probabilistic models are able to do for you that you couldn't do with a deterministic model: provide a range of uncertainty. So there's the 95% prediction interval.
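The diamond calculation above can be sketched in a few lines, using the fitted equation and RMSE quoted in this example:

```python
# Fitted diamonds regression from the example: price = -260 + 3721 * weight,
# with RMSE = 32 (dollars). Valid only within the range of the data.
intercept, slope = -260.0, 3721.0
rmse = 32.0

weight = 0.25  # carats; a new diamond, inside the range of the data
forecast = intercept + slope * weight      # read the value off the line
lower = forecast - 2 * rmse                # Empirical Rule: +/- 2 SDs
upper = forecast + 2 * rmse

print(forecast, lower, upper)  # 670.25 606.25 734.25
```

That reproduces the range of feasible values quoted on the slide: a point forecast of about $670, and a 95% prediction interval running from roughly $606 to $734.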

So we've now seen a 95% prediction interval. Remember that it relied on a normality assumption for the noise in the system, for the spread of the points about the regression line; we're assuming that was normally distributed. Now, when you make assumptions, part of the modeling process should be to think carefully about those assumptions and make a call on whether or not they seem reasonable. So always check your assumptions, if you can.

Now, one way that I could check this normality assumption is to take the residuals from the regression.
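One informal version of that check, sketched below with made-up residuals rather than course data, is to standardize the residuals and see whether roughly 95% of them fall within plus or minus 2 standard deviations, as the Empirical Rule predicts for a normal distribution. (In practice you would also look at a histogram or normal quantile plot of the residuals.)

```python
# Illustrative residuals from some regression (hypothetical values).
residuals = [-41.0, 12.5, 30.2, -8.7, 55.1, -23.4, 3.9, -60.2, 18.8, 27.0]

n = len(residuals)
mean = sum(residuals) / n
sd = (sum((e - mean) ** 2 for e in residuals) / n) ** 0.5

# Standardize, then count the share within 2 standard deviations.
standardized = [(e - mean) / sd for e in residuals]
within_2sd = sum(1 for z in standardized if abs(z) <= 2) / n

print(f"{within_2sd:.0%} of residuals lie within 2 SDs of their mean")
```

If the share is far from what the Empirical Rule predicts, that's a warning sign that the normality assumption behind the prediction interval may not be reasonable.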