Now R squared is the number that measures the proportion of

variability in Y explained by the regression model.

It turns out simply to be the square of the correlation between Y and X, but it has a nicer interpretation than just a straightforward correlation: it is interpreted as the proportion of variability in Y explained by X.

And so, all other things being equal, one typically prefers a higher R squared over a lower one, because the model is explaining more variability.
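To make that concrete, here is a minimal sketch in Python with numpy (using made-up numbers, not the course datasets) showing that the squared correlation agrees with the proportion of variability in Y explained by the fitted line:

```python
import numpy as np

# Illustrative numbers only (not the course datasets)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Fit the least squares line y = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# R squared as the squared correlation between Y and X ...
r = np.corrcoef(x, y)[0, 1]
r_squared_corr = r ** 2

# ... and as the proportion of variability in Y explained:
# 1 - (residual sum of squares / total sum of squares)
sse = np.sum(residuals ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared_ss = 1 - sse / sst

print(r_squared_corr, r_squared_ss)  # the two values agree
```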

RMSE is a different one-number summary from a regression, and what RMSE is doing for you is measuring the standard deviation of the residuals.

The residuals, remember, are the vertical distances from the points to the least squares, or fitted, line, and the standard deviation is a measure of spread.

So, it's telling you how much spread there is about the line in the vertical direction. I would often informally call that the noise in the system, and so RMSE is a measure of the noise in the system.
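As a sketch of that calculation (again with made-up numbers, not the course datasets), RMSE is just the root of the mean squared residual. Note that some regression software divides by n - 2 rather than n, so reported values can differ slightly:

```python
import numpy as np

# Illustrative numbers only (not the course datasets)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# RMSE: the spread of the residuals about the fitted line,
# computed here with an n denominator
rmse = np.sqrt(np.mean(residuals ** 2))
print(rmse)
```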

What I've shown you in the table at the bottom of this slide are the calculations

of R squared and RMSE for the three datasets that we've had a look at.

There was the diamonds dataset, the fuel economy and the production time dataset.

And if you look at R squared, it's frequently reported on a percentage basis.

We've got a 98% R squared for the diamonds dataset,

that's because there was a very strong linear association going on there.

In the fuel economy dataset, it was 77%.

And in the production time dataset, it was only sitting there at 26%.

Now, one of the things you have to be careful about with R squared is that there

is no magic value for R squared.

It's not as if R squared has to be above a certain number for the regression model to be useful.

You want to think much more about R squared as a comparative benchmark as opposed to an absolute one.

So if I'm comparing two regression models for the same data, all other things being equal, I'm going to typically prefer the one with the higher R squared. But just because I've got a model with an R squared of, say, 5% or 10%, that doesn't necessarily mean that that model isn't going to be useful in practice. It is, though, a useful comparison metric.

Now the other number, Root Mean Squared Error, I've calculated it for

the three examples here.

And it's 32, 4.23 and 32; the production time dataset somewhat coincidentally matches the diamonds dataset.

Now, one key difference between R squared and RMSE are the units of measurement.

So R squared, because it's a proportion,

actually has no units associated with it at all.

So it's easier to compare R squared in that sense, whereas RMSE certainly does have units, because it's the standard deviation of the residuals, and the residuals are the distances from point to line in the vertical direction.

Vertical direction is the Y variable direction.

So RMSE has the units of Y associated with it.

So for the diamonds dataset, that RMSE of roughly 32, that's $32.

And for the fuel economy, RMSE is 4.23.

It's 4.23 gallons per thousand miles in the city to be formal about it.

And the 32 for the production time dataset, that's an RMSE of 32 minutes.

So these two one number summaries are frequently reported with a regression.

Most software will calculate them automatically as soon as you run your

regression model, for example, within a spreadsheet environment.

And all other things being equal, we like higher values of R squared.

We're explaining more variability and

we like lower values of Root Mean Squared Error.

If there's a low standard deviation of the residuals around the regression line, that's tantamount to saying that the residuals are small, and the points are therefore close to the regression line, which is what we like.

So those are the two one number summaries that accompany most regression models.

Now perhaps, the most useful thing you can do with a Root Mean Squared Error

is to use it as an input into what we call a prediction interval.

So remember that when you have uncertainty in a process, you don't just want to

give a forecast, you want to give some range of uncertainty about that forecast.

That's just so much more useful in practice.

And with suitable assumptions, we can tie in Root Mean Squared Error

to come up with a prediction interval for a new observation.
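One common rough rule, under the normality assumption discussed next, is to take the fitted value plus or minus two RMSEs as an approximate 95% prediction interval. Here is a hedged sketch of that idea in Python (made-up numbers; this approximation ignores the uncertainty in the fitted line itself, which proper software accounts for):

```python
import numpy as np

# Illustrative numbers only (not the course datasets)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
rmse = np.sqrt(np.mean(residuals ** 2))

# Rough 95% prediction interval for a new observation at x_new:
# fitted value plus/minus 2 * RMSE (an approximation that ignores
# the estimation error in the line itself)
x_new = 3.5
y_hat = b0 + b1 * x_new
lower, upper = y_hat - 2 * rmse, y_hat + 2 * rmse
print(lower, y_hat, upper)
```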

So here's our assumption: we're going to assume that, at a fixed value of X, the distribution of points about the true regression line follows a normal distribution.

So, another module has discussed the normal distribution.

This is one of the places where normality assumptions are very common in

a regression context and what we're assuming is that the distribution of

the points about the true regression line has a normal distribution.

We'll talk about checking that in just a minute, but

let's work with that as an assumption.

Furthermore, that normal distribution is centered on the regression line.

So you can see in the graphic on the page, the assumption being shown to you.

Note that there's no data here, because we're positing a true model, so to speak.

So there's a true regression line there, and we can look at any particular value of X. Let's take the left-hand normal distribution.

Let's say, we took lots and lots of diamonds that weighed 0.15 of a carat.

What do we expect their distribution to look like around the regression line?

We expect the distribution of the prices to be normally distributed with the center

of the normal distribution sitting on top of the regression line and

we believe that's true for any value of X.

That's what one of the standard assumptions of a regression model involves.

Furthermore, we're going to assume that all of these normal distributions

around the true line have the same standard deviation.

That's often termed the constant variance assumption, and with that assumption, we can estimate that common standard deviation, the spread of the points about the line in the vertical direction, with RMSE.

So RMSE will be our estimate of the noise in the system and

with this assumption of normality,

it's estimating the standard deviation associated with the normal distribution

that captures the spread of the points around the true regression line.
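A small simulation can illustrate that claim. This sketch (my own made-up true line and noise level, not anything from the course) generates points around a true regression line with constant normal noise and checks that RMSE recovers that noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the assumed model: points scattered about a true line
# with normal noise of the SAME standard deviation at every x
# (the constant variance assumption)
sigma = 2.0
x = rng.uniform(0, 10, size=2000)
y = 5.0 + 3.0 * x + rng.normal(0.0, sigma, size=x.size)

# Fit the least squares line and compute RMSE
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
rmse = np.sqrt(np.mean(residuals ** 2))
print(rmse)  # should come out close to sigma = 2.0
```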

So on this slide, I've introduced an important assumption behind regression.

That of the normality of the points about the regression line.