0:00

This lecture is about two things. First, it's about predicting with regression using multiple covariates, but more importantly it's about exploring a data set and trying to identify which predictors are the most important to include in our prediction model.

0:13

So we're again going to be using the wages data to try to predict the wages of a group of men from the Mid-Atlantic region. This data set is available in the ISLR package, which you can find here, and it comes from the book Introduction to Statistical Learning.

0:28

So the first thing that we do is load the data set in, and here is the ISLR data set. We're also loading the ggplot2 package for some exploratory analysis, and we're loading the caret package for prediction. For exploration purposes, we're going to subset the data set to just the part that isn't the variable we're trying to predict. So we remove the log wage variable, which is the variable we're going to be predicting in this analysis. Then we can look at a summary of the data set, and we can already see some features: for example, as we saw before, this data set includes only males, and the region variable is entirely Mid-Atlantic.
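A sketch of the setup just described (logwage is the column name in ISLR's Wage data):

```r
library(ISLR); library(ggplot2); library(caret)
data(Wage)
# drop the log-wage outcome before exploring the predictors
Wage <- subset(Wage, select = -c(logwage))
summary(Wage)
```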

1:09

So to do a little bit more exploration, we first need to split the data into training and test sets. We do that with the createDataPartition function, and we subset into a training and a test set. We're going to do all of our exploration on the training set, because when we're building models we're not going to use any of the test set; we're only going to apply the model to it once.
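A minimal sketch of the split, with an illustrative 70% training fraction and seed:

```r
set.seed(12345)  # illustrative seed for reproducibility
inTrain <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing  <- Wage[-inTrain, ]
dim(training); dim(testing)
```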

1:30

So the first thing that you can do is a feature plot. A feature plot shows a little bit about how the variables are related to each other. Sometimes this plot is useful and sometimes it isn't; in this particular plot it's a little hard to see because everything is so squished together, but if you make it on your own using this function, you'll find it a little easier to read. So, for example, for job class you can see that one group appears to be a little bit higher than the other group in terms of the outcome. And you can see that, at least for the age variable, there is a relationship to the outcome. Again there seems to be some outlier group, two separate groups here, which gives us some indication that we might be able to use that variable to predict.
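The feature plot the lecture refers to is caret's featurePlot; a sketch, assuming the three predictors named in the discussion:

```r
# pairs plot of the candidate predictors against wage
featurePlot(x = training[, c("age", "education", "jobclass")],
            y = training$wage,
            plot = "pairs")
```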

So the next thing that we can do is plot the wage variable versus the age variable. You can see that there appears to be some kind of trend, which we saw in the previous lectures, but there is also this set of points outlined up here. The idea is that that set of points might be something we can predict, but we would have to figure out which variable is representing that chunk.

2:46

So one way that you can do that is to plot one predictor, age, versus the outcome, wage, and then color the points by another variable, in this case job class. You can see, for example, that most of these points up here are blue instead of pink, which means they come from the Information job class. So this gives us some indication that the job class variable might be able to explain at least some fraction of the variability in that cluster at the top of the plot.
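In ggplot2's qplot shorthand, that colored scatterplot is one line:

```r
# age vs. wage, colored by job class
qplot(age, wage, colour = jobclass, data = training)
```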

3:15

You can also color it by education. So here I'm just doing a qplot again: I'm plotting age versus wage, and I'm telling it to color the plot by education. And again, I'm only using the training set, because we only want to look at the training set while developing our model. You can see that the advanced degree level also explains a lot of the variation up here in the top group. So some combination of job class and education seems to explain that top cluster.
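The education version is the same plot with a different colour aesthetic:

```r
# age vs. wage, colored by education level
qplot(age, wage, colour = education, data = training)
```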

3:56

So the next thing that we can do is fit a linear model with multiple variables in it. The idea here is, again, we're just fitting lines, but now we're fitting more than one term. So we have an intercept term, which is just the baseline level of wage, and then we might have a relationship with the age of the person, and a relationship with what job class you're in. The way we typically handle that is by fitting an indicator variable. An indicator variable is a variable denoted like this in mathematical notation: if the job class for the ith person is equal to Information, the variable is equal to one; if the job class for the ith person is not equal to Information, the variable is equal to zero. So its coefficient represents the difference in wages between people with job class equal to Information and people with job class not equal to Information, when you fix all the other variables in the regression model. You can also do this for education; it's a little more complicated because there are multiple education levels. So we create an indicator variable for each of the different education levels, and here this is a sum of four indicator variables. The variable is equal to one if the education for person i is equal to level k, and zero otherwise.
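Putting the pieces together, the model being described can be written as follows (a sketch; the coefficient symbols are my notation, not necessarily the slide's):

```latex
wage_i = \beta_0 + \beta_1\, age_i
       + \beta_2\, \mathbb{1}(jobclass_i = \text{Information})
       + \sum_{k=1}^{4} \gamma_k\, \mathbb{1}(education_i = \text{level}_k)
       + \varepsilon_i
```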

5:24

So then we can fit the model just like we did before, using the train function in the caret package. We again use wage as the outcome, and the tilde separates it from the formula on the right, which is used to predict the variable on the left. So we use age, job class, and education. Job class and education are both factor variables in R, so by default train creates these indicator variables like I've shown here, and when it fits the model it takes that into account automatically. Again, we're fitting the model on the training set. We can look at the final model and see that it now has ten predictors, even though we only put three variable names into the formula. The reason is that each factor variable actually contributes more than one predictor, because of the way the indicator variables have to be created.
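A sketch of the fit, assuming a plain linear model via method = "lm" (the method name is an assumption; the transcript doesn't state it explicitly):

```r
modFit <- train(wage ~ age + jobclass + education,
                method = "lm", data = training)
finMod <- modFit$finalModel   # the underlying lm fit
print(modFit)
```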

6:19

So then we can look at some diagnostic plots. This is very typical when you're building these regression models. The idea is that you can plot the fitted values, the predictions from our model on the training set, versus the residuals, the amount of variation that's left over after you fit your model. What you'd like to see is that this line is centered at zero on this axis, because the residuals are the difference between our model predictions and the actual values we're trying to predict. And here you can see there are still a couple of outliers up here, which have been labeled for you in this plot.
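The residuals-versus-fitted diagnostic can be drawn from the underlying lm object (finMod here is assumed to be modFit$finalModel):

```r
# plot type 1 is residuals vs. fitted; outliers are labeled automatically
plot(finMod, 1, pch = 19, cex = 0.5, col = "#00000010")
```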

6:55

And so those points might correspond to observations that we want to explore a little bit further, to see if we can identify any other predictors in our data set that might be able to explain them.

7:07

The other thing that we can do is color by variables not used in the model. So, for example, here I'm again plotting the fitted values from our model versus the residuals, and again we'd like to see the points lying on the zero line, because the residuals are the difference between our fitted values and our real values. What we can do is color this plot by different variables, in this case, for example, race. You can see that some of these outliers up here may be explained by the race variable in the data set. So this is another exploratory technique: plot the fitted values versus the residuals, then color by different variables to identify potential trends.
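For example, assuming finMod is the fitted lm object:

```r
# fitted values vs. residuals, colored by a variable not in the model
qplot(finMod$fitted.values, finMod$residuals, colour = race, data = training)
```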

7:47

Another thing that can be really useful is plotting the residuals versus the index. And what do I mean by the index? The data set comes as a set of rows in a particular order, and the index is just which row of the data set you're looking at.

8:04

And the y axis here is the residuals. You can see that the high residuals all seem to be happening down here at the right end, at the highest row numbers, and you can also see a trend with respect to row number. Whenever you see a trend or outliers like that with respect to the row numbers, it suggests that there's a variable missing from your model. You're plotting the residuals here, the difference between the true values and the fitted values, and those shouldn't have any relationship to the order in which the observations appear in your data set. Unless, and this is typically what you discover when you see a trend like this, or outliers like this at one end of the plot, there's a relationship with respect to time, or age, or some other continuous variable that the rows are ordered by.
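The index plot is just the residuals in row order:

```r
# residuals vs. row index; a trend here hints at a missing ordered variable
plot(finMod$residuals, pch = 19)
```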

8:56

So the other thing that you can do is plot the wage variable in the test set versus the predicted values on the test set. Ideally these two things would be very close to each other: you'd have essentially a straight line on the 45-degree line, where wage was exactly equal to our predictions. Of course that isn't how it always works out. In the test set you can also explore and try to identify trends that you might have missed. So, for example, here we're coloring by the year the data was collected in the test set, as a way of exploring how our model might have broken down. Now, something to keep in mind is that if you do this sort of exploration in the test set, you can't then go back and update your model on the training set, because that would be using the test set to rebuild your predictors. This is more like a post-mortem on your analysis, a way to determine whether your analysis worked or not.
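A sketch of that test-set check (pred is an illustrative name for the test-set predictions):

```r
pred <- predict(modFit, testing)
# ideally points fall on the 45-degree line; color by year to look for drift
qplot(wage, pred, colour = year, data = testing)
```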

9:53

If you want all of the covariates in your model building, one thing that you can do, again in the train function, is pass it an outcome, then a tilde, and then a dot instead of a set of variables separated by plus signs. The dot says: predict with all of the variables in the data set. So this is the model fit with all of the variables, and here is the wage variable plotted against the predictions. You can actually see that it does a little bit better when you include all of the variables in the data set. This is a reasonable default if you don't want to do some sort of model selection in advance.
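The dot formula looks like this (modFitAll is an illustrative name):

```r
# "." means: use every other column in the data set as a predictor
modFitAll <- train(wage ~ ., data = training, method = "lm")
predAll <- predict(modFitAll, testing)
qplot(wage, predAll, data = testing)
```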

10:30

So linear regression is often useful in combination with other models. It's quite a simple model in the sense that it always fits lines through the data, so it can capture a lot of variability if the relationship between the predictors and the outcome is linear. If it's not linear, then you can often miss things, and it can be better blended with other models; we'll talk about model blending later in the class. Exploratory data analysis can be very useful with regression models: with plots like the ones we made of the residuals, colored by different features, you can try to identify patterns in the data set.