0:48

And then what I'm going to do is simulate data where the outcome y only depends on x1 plus an error term, and doesn't depend on the other two variables.

And then I'm going to look at the standard error for the x1 variable for the model that just includes x1, that's this one; the model that includes x1 and x2, that's this one; and the model that includes x1, x2, and x3.

And I'm going to do that over and over again, and look at the standard deviation of the simulated estimated coefficients.

Now the reason I'm doing this by simulation is because the variance inflation occurs in the actual standard errors, not the estimated standard errors.

It's actually a little complicated, in that if I put another variable into a regression model, it doesn't necessarily inflate the observed standard error.

And the reason is because the variance is estimated rather than being the actual variance, and that has a kind of conflicting property.

So you don't necessarily see the variance go up when you include an unnecessary regressor in a regression model.

You have to look at it through simulation.

So let's go ahead and do this.
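The simulation just described can be sketched in Python (the lecture itself uses R; the sample size, number of simulations, and variable names here are my own choices):

```python
import numpy as np

# Sketch of the simulation described above: y depends only on x1 plus
# error, while x2 and x3 are independent, unnecessary regressors.
rng = np.random.default_rng(0)
n, nsim = 100, 1000

def beta1_sd(n_extra):
    """SD of the fitted x1 coefficient across nsim simulated datasets,
    with n_extra unnecessary regressors included in the model."""
    estimates = []
    for _ in range(nsim):
        x1 = rng.standard_normal(n)
        extras = rng.standard_normal((n, n_extra))
        y = x1 + rng.standard_normal(n)            # y depends only on x1
        X = np.column_stack([np.ones(n), x1, extras])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates.append(beta[1])                  # coefficient on x1
    return float(np.std(estimates))

sd_1 = beta1_sd(0)    # model with x1 only
sd_12 = beta1_sd(1)   # model with x1 and x2
sd_123 = beta1_sd(2)  # model with x1, x2 and x3
print(sd_1, sd_12, sd_123)   # all close to 1/sqrt(n) = 0.1
```

Because the extra regressors are independent of x1, all three standard deviations come out roughly the same, up to Monte Carlo error.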

Â 2:13

This is sort of the ideal setting, where the three regressors don't have anything to do with one another.

Now, what you see is, in fact: for the first model, this is the standard error, the standard deviation of the beta coefficients I got across all the simulations.

Here's the second one; it actually went down a little bit with the second one.

Now we know theoretically that can't happen, but there is some Monte Carlo error when I do my simulation, so let's just say these two are about the same.

And the third one is also about the same, 0.032 as opposed to 0.031.

So nothing that bad; the variance inflation from including the extra variables was negligible.

And the reason that is the case is because these other two variables, x2 and x3, have nothing to do with x1; I simulated them independently.

Let's look at a case now where I've simulated them in such a way that they depend heavily on x1.

So I'm going to generate my x1 as I did the last time, just as a random normal vector, but my x2 depends on x1 plus some noise.

Â 3:20

And my x3 is very heavily dependent on x1, okay?

So now let's repeat the simulation and see what happens to the variability of the estimated coefficients.
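A Python sketch of the correlated case follows; the particular dependence structure for x2 and x3 is my assumption, chosen so that x3 is nearly a copy of x1, mirroring the idea in the lecture rather than its exact R code:

```python
import numpy as np

# Same simulation as before, but now x2 and x3 are built from x1 plus
# noise (assumed dependence structure; x3 is almost a copy of x1).
rng = np.random.default_rng(1)
n, nsim = 100, 1000

def beta1_sd(include_x2, include_x3):
    """SD of the fitted x1 coefficient across nsim simulated datasets."""
    estimates = []
    for _ in range(nsim):
        x1 = rng.standard_normal(n)
        x2 = x1 / np.sqrt(2) + rng.standard_normal(n) / np.sqrt(2)
        x3 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.standard_normal(n)
        y = x1 + rng.standard_normal(n)            # y still depends only on x1
        cols = [np.ones(n), x1]
        if include_x2:
            cols.append(x2)
        if include_x3:
            cols.append(x3)
        X = np.column_stack(cols)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates.append(beta[1])
    return float(np.std(estimates))

sd_1 = beta1_sd(False, False)    # x1 only
sd_12 = beta1_sd(True, False)    # x1 and x2
sd_123 = beta1_sd(True, True)    # x1, x2 and x3
print(sd_1, sd_12, sd_123)   # the full model's SD is several times sd_1
```

With these correlations, the standard deviation for the full model comes out several times larger than the baseline, which is the variance inflation the lecture demonstrates.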

Okay, now we see huge amounts of variance inflation, especially for this third model where we include x2 and x3.

We see the standard error has gone up by a factor of five, okay?

And that goes to the general rule: if the variable that you include is highly correlated with the thing that you're interested in, you are going to inflate the variance more.

And so you need to concern yourself with unnecessarily putting highly correlated variables into your regression model.

So if I have diastolic blood pressure in my model, and I put systolic blood pressure in my model, which is relatively the same thing, they're very correlated.

If I did that unnecessarily, that's going to increase my diastolic blood pressure standard error.

Â 7:26

This looks at the case that you're in, versus the ideal case where what you're interested in is uncorrelated with all the other variables.

That's the so-called variance inflation factor.

And here I give a little simulation, which I'm not going to show, where I just show you that those ratios can be estimated, and estimated pretty well.

So let's actually look at the variance inflation with this Swiss data.

What I'm going to do is fit the model with all the variables included; when you do a ~ ., it includes all the variables.

And then I'm going to do vif(fit), which gives me my variance inflation factors.

Oh, but you have to remember to load the car library, okay?

Let's load that, and now let's try it again.

Rather than the variance inflation factors, I would look at the inflation factors for the standard deviations, which are their square roots.

So let's look down here.

On the first line I have the variance inflation factors by themselves, and then in the line below it I have their square roots, which I tend to prefer.

So let's look at the first line, just because it's the more normal thing, the variance inflation: for agriculture, it is 2.

What does that mean?

It means the variance for the agriculture effect is double what it would be if agriculture were orthogonal to all the other regressors, okay?

And we see that some of them are quite high; examination is quite high, 3.67, so the variance is almost 4 times what it would be if it were orthogonal to all the other regressors.

And we know that examination and education are highly related, so that's why these two have very high variance inflation factors.
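A numpy sketch of what vif() from R's car package computes may help: for each regressor, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing that column on all the other columns. Simulated data with the earlier dependence structure stands in for the swiss dataset here.

```python
import numpy as np

# VIF by hand: regress each column on the others, convert R^2 to
# 1/(1 - R^2). Simulated data, not the swiss dataset.
rng = np.random.default_rng(2)
n = 1000
x1 = rng.standard_normal(n)
x2 = x1 / np.sqrt(2) + rng.standard_normal(n) / np.sqrt(2)
x3 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])

def vif(X):
    vifs = []
    for j in range(X.shape[1]):
        # regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        r2 = 1 - np.var(X[:, j] - fitted) / np.var(X[:, j])
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

vifs = vif(X)
print(np.round(vifs, 2))           # x1's VIF is large: x3 nearly duplicates it
print(np.round(np.sqrt(vifs), 2))  # the SE inflation factors (square roots)
```

The square roots on the second line are the standard-deviation inflation factors the lecture prefers to read.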

Â 11:53

So let's talk about model selection in general, automated model selection.

At one point, I think, it was really a statistical topic, but it has mostly moved into the realm of machine learning.

And I would say, though, that even for relatively simple linear regression models, the space of model terms that you have to search among explodes really quickly when you start including interactions, polynomial terms like the square of a regressor, and so on.

If you have a lot of regressors and you're interested in reducing that space, then there are a lot of factor analytic techniques, things like principal components, available to you to reduce your covariate space down to size.

However, those come with consequences: the principal components or factors that you obtain might be less interpretable than the original data that you're interested in.

Again, this is probably better served in a multivariate class or a class on machine learning.
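The principal-components idea mentioned above can be sketched in a few lines; the data here is simulated with two latent factors, and in practice you might reach for sklearn.decomposition.PCA rather than a hand-rolled SVD.

```python
import numpy as np

# Minimal PCA sketch: collapse a correlated covariate matrix onto a
# few components before regressing on them. Simulated data.
rng = np.random.default_rng(3)
n, p, k = 200, 10, 2
latent = rng.standard_normal((n, k))                 # true underlying factors
X = latent @ rng.standard_normal((k, p)) + 0.1 * rng.standard_normal((n, p))

Xc = X - X.mean(axis=0)                              # center before the SVD
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                      # variance share per component
scores = Xc @ Vt[:k].T                               # first k principal components

print(np.round(explained[:4], 3))  # nearly all variance sits in 2 components
print(scores.shape)                # these k columns replace the p regressors
```

The interpretability caveat shows up immediately: each score column is a weighted mix of all ten original covariates, so its regression coefficient no longer refers to any one measured variable.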

But for us, we're mostly going to consider the case where we have a relatively small number of regressors, and we're going to pick through them with a highly interactive process between the analyst, the data, and the scientific context.

Â 13:23

Another thing I would mention is that good design can often eliminate the need for a lot of this model discussion.

We've talked a lot about how randomization can prevent many of the problems we're discussing, by making our variable of interest unrelated to nuisance variables that we're not interested in, or to nuisance variables that we don't even know about; however, there are other aspects of design that can serve the same purpose.

For example, we can stratify and randomize within strata.

The classic example of this, when it was developed, was R.A. Fisher working in field crop experiments: if you were trying a different kind of seed, you might block on different areas of the field that you were going to plant in, and randomize the different seeds to those areas.

So you might have two different kinds of seeds, but they will have been distributed in a systematic way that is fair across the field, and within that design there will also be some randomization.

This topic of experimental design is a pretty broad topic.

Another great example is in biostatistics, the field that I work in the most, where a very common design is a crossover design, in which you try to use every subject as their own control.

So let's say, for example, you're interested in looking at two different kinds of aspirin, and you give one kind of aspirin to one group of people and the other kind to a different group of people.

Let's say they have different gels or whatever that determine how the aspirin gets absorbed in your stomach.

So if those two groups aren't the same, either the randomization wasn't very good and there was some imbalance that you just got unlucky about, or, if the study was just observational, then the comparison of those two groups might be biased by whatever differentiates the groups rather than by group one receiving one kind of aspirin and group two receiving a different kind.

Â 15:34

On the other hand, if you can give a person one kind of aspirin and then later on give them a different kind of aspirin when they have another headache, that would compare each person to themselves, right?

You control, or block, on the person, so to speak.

So that's a design strategy.

Now, there's some nuance with this design strategy as well: what happens if there's some residual effect of the first aspirin when you give the second one?

Maybe you could handle that with some sort of washout period, a long washout period, or something like that.

But anyway, the point of that design is to make it so that you're comparing people with themselves, to control for everything that's intrinsic to the person.

The comparisons across time periods control for that by giving both aspirins to each person.

Maybe you would randomize the order in which they received them; that's called a crossover design.

At any rate, the broader point that I'm trying to make is that it's often the case that good, thoughtful experimental design can eliminate the need for some of the main considerations that you would have to go through in model building if you were to just collect data in an observational fashion.

The last thing I would say is that there's one automated model search technique that I like quite a bit and find very useful, and it's the idea of looking at nested models.

So I'm often interested in a particular variable, and I'm very interested in how the other variables that I've collected will impact it.

So I'm interested in a treatment or something like that, some important variable, but I'm worried that my treatment groups are imbalanced with respect to some of these other variables.

So what I'd like to look at is the model that just includes the treatment by itself; then the model that includes the treatment and, let's say, age, if the ages weren't really balanced between the two treatment groups; then one that adds gender, if maybe the genders between the two groups weren't really balanced; and so on.

And this idea of creating models that are nested, where every successive model contains all the terms of the previous model, leads to a very easy way of testing each successive model.

And these nested model examples are very easy to do, so I'm just going to show you some code right here on how you do nested model testing in R.

So I fit three linear models to our swiss dataset; the first one just includes Agriculture.

Let's pretend that that's the variable that we're interested in, and then the next one includes Agriculture plus Examination and Education.

I put both of those in because I'm thinking they're kind of measuring the same thing.

But now, after this lecture, I'm concerned over the possibility that they're measuring too much of the same thing, but let's put that aside for the time being.

And then the third model adds Catholic + Infant.Mortality, so all the terms.

So now I have three nested models, and I'm interested in seeing what happens to my effect as I go through those three models.

The point being, in this case, you can test whether or not the inclusion of each additional set of extra terms is necessary with the anova function.

So I do anova(fit1, fit3, fit5); that's what I named them: one, three, five.

And then you see down here, what you get is a listing of the models: model 1, model 2, model 3, and then it gives you the residual degrees of freedom, that's the number of data points minus the number of parameters it had to fit, and the residual sums of squares.

Df is the excess degrees of freedom going from model 1 to model 2, and then from model 2 to model 3.

So we added two parameters going from model 1 to model 2, that's why that Df is 2, and then we added two additional parameters going from model 2 to model 3.

The two parameters we added going from model 1 to model 2 are Examination and Education; they're two regression coefficients.

Going from model 2 to model 3, we added Catholic and Infant.Mortality; those are two regression coefficients.

With these residual sums of squares and degrees of freedom, you can calculate the so-called F statistic, and thus get a p-value.

This gives you the F statistic and the p-value associated with each comparison.

Here it shows that, yes, the inclusion of Examination and Education appears to be necessary beyond just looking at Agriculture by itself.

Then I look at the next one, and it says, yes, the inclusion of Catholic and Infant.Mortality appears to be necessary beyond just including Examination, Education, and Agriculture.
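The F test behind anova() for nested linear models can be sketched directly: compare the drop in residual sum of squares to the residual variance of the bigger model. Simulated data stands in for the swiss dataset here, and which columns actually matter is my own invention.

```python
import numpy as np
from scipy import stats

# Nested-model F test by hand, mimicking what R's anova() reports for
# a sequence of nested lm fits. Simulated data, not the swiss dataset.
rng = np.random.default_rng(4)
n = 47                                    # swiss happens to have 47 rows
Z = rng.standard_normal((n, 5))
y = 1 + Z[:, 0] + Z[:, 1] + rng.standard_normal(n)   # only columns 0, 1 matter

def rss_df(cols):
    """Residual sum of squares and residual df for the model on cols."""
    X = np.column_stack([np.ones(n), Z[:, cols]])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(resid @ resid), n - X.shape[1]

def nested_f(cols_small, cols_big):
    rss0, df0 = rss_df(cols_small)
    rss1, df1 = rss_df(cols_big)
    f = ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
    return f, float(stats.f.sf(f, df0 - df1, df1))    # F statistic, p-value

f_12, p_12 = nested_f([0], [0, 1, 2])                 # model 1 vs model 2
f_23, p_23 = nested_f([0, 1, 2], [0, 1, 2, 3, 4])     # model 2 vs model 3
print(f_12, p_12)   # column 1 matters, so this p-value should be tiny
print(f_23, p_23)   # columns 3, 4 are noise, so this one need not be small
```

Each comparison adds two parameters, so each F test has 2 numerator degrees of freedom, matching the Df column the lecture describes.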

So if the way in which you're interested in looking at your data naturally falls into a nested model search, as it often does when you're interested in one variable specifically, as in this case, then I think some kind of nested model search is a reasonable thing to do.

It doesn't work if the models that you're looking at aren't nested.

For example, if model 2 had Examination but not Education, and the third model had Education but not Examination, this wouldn't apply; you'd have to do something else.

And there, I think, you get into the harder world of automated model selection, with things like information criteria.

So I would put all that stuff off to our prediction class, and just leave you with this one technique, which is useful in the specific instance where you've decided to look along a series of models, each increasingly more complicated but including the previous one.

So I hope in this lecture you've gotten a couple of model selection techniques that you can use.

I hope you've also learned that there are some basic consequences that occur if you include variables that you shouldn't have, or exclude variables that you should have.

This has consequences for the coefficients that you're interested in, and it has consequences for your residual variance estimate.

We didn't even touch on some other aspects of [INAUDIBLE] model that could occur, such as absence of linearity, non-normality, and so on.

So again, it's generally necessary to take your model with a grain of salt, because more than likely some aspect of your model is wrong.

And I'll leave you then with this famous quote from George Box, who very famously said: all models are wrong, some models are useful.

And I think that's a very good credo to go along with: yes, for sure your model is wrong, but it might be useful in the sense of being a lens to teach you something useful and true about your data set.
