0:01

In addition to modeling and prediction, we can also use linear regression models to do inference. In this video we're going to talk about hypothesis testing for the significance of a predictor, and a confidence interval for the slope estimate. We're also going to talk a little more about conditions for regression, with respect to what additional conditions may need to be satisfied if we want to be able to do inference based on these data.

0:50

Later on, this study actually got a lot of criticism saying that the data may have been either non-random, non-representative, or entirely falsified. But regardless, for this example we're going to be working with the original data set from the paper that was published back in 1966. In the scatter plot we can see the relationship between the foster twin's IQ and the biological twin's IQ, and we can see that as one goes up, the other one goes up as well. We have a positive and relatively strong relationship, with a correlation coefficient of 0.882.

1:31

The results of this study can be summarized using a regression output that looks something like this. So we have the estimate for the intercept as well as the slope here. Based on this, we can write our linear model as: the predicted IQ score of the foster twin is 9.2076 plus 0.9014 times the biological twin's IQ. The 9.2 value is the intercept, and the 0.9 value is the slope here.

2:21

Within the framework of inference for regression, we're going to be doing a hypothesis test on the slope. The overall question we want to answer is: is the explanatory variable a significant predictor of the response variable? The null hypothesis, as usual, says there's nothing going on. In other words, the explanatory variable is not a significant predictor of the response variable, i.e., there's no relationship, or the slope of the relationship is 0. The alternative hypothesis says that there is something going on: the explanatory variable is a significant predictor of the response variable. In other words, there is a relationship between these two variables, and the slope of the relationship is different from zero. So in notation, our null hypothesis says that beta one is equal to 0 (remember, beta one was the population parameter for the slope), and the alternative hypothesis says that beta one is not equal to 0.

And how do we go about actually going through this test? In linear regression we always use a t-statistic for inference, and remember that a t-statistic looks like this: a point estimate minus a null value, divided by a standard error. In this case, our point estimate is simply our slope estimate, b1, and our standard error is the standard error of this estimate. So the t-statistic for the slope can be summarized as b1 minus 0 (remember, in the null hypothesis we had set beta one equal to 0, which means no relationship, or a horizontal line), divided by the standard error of b1. And whenever we have a t-score, we also have a degrees of freedom associated with it. In this case, the degrees of freedom is n minus 2.
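As a quick check of the arithmetic, the t-score and degrees of freedom take only a few lines. This is a Python sketch; the standard error value 0.0963 is an assumption, back-calculated from the reported slope and t-score rather than read directly off the output shown in the video:

```python
b1 = 0.9014      # slope estimate from the regression output
se_b1 = 0.0963   # SE of the slope (assumed value, consistent with the reported t-score)
null_value = 0   # slope under the null hypothesis: no relationship
n = 27           # number of twin pairs in the study

t = (b1 - null_value) / se_b1  # point estimate minus null value, over standard error
df = n - 2                     # lose one df each for the intercept and the slope
print(round(t, 2), df)         # 9.36 25
```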

Â Let's

Â 4:17

pause for a moment and think about why is the degrees of freedom n minus 2.

Â We haven't really seen that before.

Â In the past we've seen for a t-statistic the degrees of freedom equaling n minus 1.

Remember that with degrees of freedom, we always lose a degree of freedom for each parameter that we estimate. And when we fit a linear regression, even if you're only interested in the slope, you always end up estimating an intercept as well. Since we're estimating both an intercept and a slope, we're losing two degrees of freedom, and that's why in linear regression the degrees of freedom associated with a t-score is n minus 2.

For calculating the test statistic, we are actually going to make use of the regression output, and show you that we didn't have to do any hand calculations at all. So the t-statistic, we said, is our point estimate, 0.9014 (the point estimate for the slope), minus 0 (the null value), divided by the standard error of the point estimate, which we can simply grab from the regression output as well. We're not going to be asking you to calculate any of this by hand. You should know how the regression output works, and that's why we're going through the calculation of the t-score. But you're never going to be asked to calculate the standard error of the slope by hand.

5:42

It is simply a tedious task that can be error prone, and we usually use computation for it. But it is important to understand what that standard error means and how the mechanics of the regression output work. If we do the math here, we're actually going to get 9.36 for our t-score, and that's simply the value that's already given on the regression output anyway. The degrees of freedom is 27 twin pairs minus 2, which is 25, and the p-value is going to be the area under the t curve that's greater than 9.36 or less than negative 9.36. Remember, we had a two-sided alternative hypothesis. This comes out to be a pretty low value, as you can imagine: 9.36 standard errors from the null value is a really unusual outcome, and therefore the p-value is approximately 0. We can see that the p-value is given as exactly 0 on the regression output, but note that that's simply rounded: even to four digits, there is very little probability there. The p-value is probably never exactly equal to 0, but it's a very, very small number.
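That tail area is easy to approximate numerically if you want to see just how small it is. The sketch below is plain Python with no statistics library: it integrates the Student's t density beyond the observed t-score with Simpson's rule and doubles it for the two-sided test. The upper limit of 60 is an arbitrary cutoff past which the density is negligible.

```python
import math

def t_density(x, df):
    """Student's t probability density with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule approximation of the integral of f on [a, b]."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

df = 25       # 27 twin pairs minus the 2 estimated parameters
t_obs = 9.36  # t-score reported on the regression output

# Two-sided p-value: the area in both tails beyond |t_obs|. The t distribution
# is symmetric, so integrate the upper tail and double it.
p_value = 2 * simpson(lambda x: t_density(x, df), t_obs, 60)
print(p_value < 0.0001)  # True -- "approximately 0", just as the output reports
```

In practice you would of course let software report this for you; the point is only that "0.0000" on the output means "rounds to zero", not "equals zero".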

6:58

Just like we can do a hypothesis test for the slope, we can also do a confidence interval. Remember, the confidence interval is always of the form point estimate plus or minus a margin of error. In this case, our point estimate is b1, and our margin of error can be calculated as usual, as a critical value times a standard error. We said that in linear regression we always use a t-score, so we're going to use a t-star for our critical value, and the standard error of the slope, we said, comes from the regression output.

7:39

The degrees of freedom, we had said, was 25, and what we want to do first is to find the critical t-score associated with this degrees of freedom and the given confidence level. To find the critical t-score, let's draw our curve and mark the middle 95%, and note that each tail is now left with 2.5%, or 0.025. So the cutoff value, or the critical t-score, can be calculated using R and the qt function, as qt of 0.025 with degrees of freedom of 25. This is going to yield a negative value, roughly negative 2.06. But note that for confidence intervals, the critical value that we use always needs to be positive. So the t-star is going to be simply 2.06.
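If you don't have R's qt function at hand, the same critical value can be approximated with a stdlib-only Python sketch: numerically integrate the t density and bisect for the t-score that puts 47.5% of the area between 0 and t (so that the middle 95% lies between negative t-star and t-star). The Simpson's-rule integrator and the [0, 50] search interval are implementation choices here, not anything from the lecture:

```python
import math

def t_density(x, df):
    """Student's t probability density with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule approximation of the integral of f on [a, b]."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def t_star(conf_level, df):
    """Bisect for the positive t-score with conf_level/2 of the area between 0 and t."""
    target = conf_level / 2  # 0.475 for a 95% confidence level
    lo, hi = 0.0, 50.0       # assumed search interval, wide enough for any common df
    for _ in range(60):
        mid = (lo + hi) / 2
        if simpson(lambda x: t_density(x, df), 0, mid) < target:
            lo = mid  # not enough area yet: move the cutoff up
        else:
            hi = mid  # too much area: move the cutoff down
    return (lo + hi) / 2

print(round(t_star(0.95, 25), 2))  # 2.06 -- matches qt(0.025, df = 25) up to sign
```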

Taking our slope estimate, 0.9014, plus or minus 2.06 (the critical value) times the standard error, which also comes from the regression output, gives us 0.7 to 1.1 as our confidence interval.
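Putting the pieces together, the interval itself is one line of arithmetic. In this Python sketch the standard error 0.0963 is again an assumed value, standing in for the number read off the regression output:

```python
b1 = 0.9014      # slope estimate from the regression output
se_b1 = 0.0963   # SE of the slope (assumed, as read off the regression output)
t_star = 2.06    # critical t-score for 95% confidence with 25 degrees of freedom

moe = t_star * se_b1       # margin of error = critical value times standard error
ci = (b1 - moe, b1 + moe)  # point estimate plus or minus the margin of error
print(round(ci[0], 1), round(ci[1], 1))  # 0.7 1.1
```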

And what do these numbers mean? How do we interpret this confidence interval? Basically, what this means is that we are 95% confident that for each additional point on the biological twin's IQ, the foster twin's IQ is expected, on average, to be higher by 0.7 to 1.1 points.

So, to recap: we said that we could do a hypothesis test for the slope using a t-statistic, where our point estimate is b1, we subtract from that the null value, and we divide by the standard error; the degrees of freedom associated with this test statistic is n minus 2. To construct a confidence interval for the slope, we simply take our slope estimate b1 and add and subtract the margin of error, which is composed of a critical t-score times a standard error.

9:49

Also note that the regression output gives us b1, the estimate for the slope; the standard error for that estimate; and the two-tailed p-value for the t-test for the slope, where the null value is 0. So if this is the standard test that you are trying to do, you shouldn't have to do any hand calculations, and should simply be able to make your decision based on the p-value that is given to you on the regression output.

10:16

We didn't really talk about inference for the intercept here. We've been focusing on the slope because inference on the intercept is rarely done. Earlier we said that in some cases the intercept is actually not very informative. Usually when we fit a model, we want to evaluate the relationship between the variables involved in the model, and the parameter that tells us about that relationship is the slope, not the intercept. So we're going to focus our inference for regression on the slope and not really worry about the intercept.

10:50

Before we wrap up, a few points of caution. Always be aware of the type of data you're working with. Is it a random sample, a non-random sample, or population data? Statistical inference and the resulting p-value are completely meaningless if you already have population data. So we usually use statistical inference when we have a sample and we want to say something about the unknown population.

11:16

If you have a sample that is non-random, so it's biased in some way, note that the results that arise from that sample are going to be unreliable as well. And lastly, remember that the ultimate goal is to have independent observations, to be able to do statistical inference. By now in the course, you should know how to check for independence of observations. Remember, we like random samples, and we do like large samples, but we don't want them to be too large: we have the 10% rule that we check if we're sampling without replacement.