0:00

Recognizing whether you are asking an inferential question or a prediction question is really important, because the type of question you are asking can greatly influence the modeling strategy that you pursue in any data analysis.

0:16

So just to recap, with inferential questions the goal is to estimate the association between an outcome and a key predictor, while potentially adjusting for confounding factors. There is usually a very small number of key predictors, or even a single one, whose relationship to the outcome we're interested in. And then there may be all these other variables. The key goals of the modeling are to estimate this association and to make sure that you appropriately adjust for any other factors. You often do sensitivity analyses to see how the association might change in the presence of other factors.

With a prediction question, the goal is to develop a model that best predicts the outcome using whatever information you have available to you. So typically you don't put any a priori weight on one predictor over another, and there's no notion of, let's say, a key predictor and a confounder. All the predictors might be equally important before you look at the data. Any good prediction algorithm will tell you which variables are more or less important for predicting the outcome, but beforehand we don't necessarily divide them into different groups based on importance. Finally, we usually don't care about the mechanism or the specific details of the relationships between the various predictors. We just want to find a model form that predicts the outcome with high accuracy and low error.

1:51

I'll start first with an inferential question. Suppose I want to know how air pollution in New York City is related to mortality in New York City. The data I'm going to use come from the National Morbidity, Mortality, and Air Pollution Study, and here's a picture of daily mortality in New York from 2001 to 2005. You can see that it has a highly seasonal component: mortality tends to be higher in the winter and lower in the summer, and that pattern repeats across every year.

2:22

Here is what the particulate matter data look like. For the same time period, you can see that it also has a seasonal structure. In particular, particulate matter tends to be high in the summer and low in the winter in New York City, and that pattern is repeated across the years.

2:42

So the first thing I'm going to do is just try the easy solution. The easiest solution is a scatter plot with mortality on the y-axis and PM10 on the x-axis, and here you can see what the relationship looks like. There's quite a bit of noise, because we wouldn't necessarily expect particulate matter to explain all the variability in mortality. So we may need to resort to some formal modeling to see if there's any sort of association here.

3:10

So the first thing we're going to do is take that scatter plot and fit a simple linear regression to the data in it. This is basically regressing mortality as the outcome on PM10 as our key predictor, without any other factors, just as a kind of baseline model.

3:36

And so the estimated coefficient for PM10 is basically close to zero. You can tell by the size of the standard error, which is much bigger than the estimate, that there's a lot of variability around this estimate, and so it's effectively zero: we see no association between the two variables. So that's what our basic first-cut analysis tells us.
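
To make the idea of comparing an estimate with its standard error concrete, here is a minimal sketch in plain Python. It is not the analysis from the lecture: the data are simulated with no true association, and the function name `simple_ols` and every number in it are assumptions made up for illustration.

```python
import math
import random

def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (slope, standard error of slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

# Simulated stand-in data (NOT the study data): the outcome is pure noise,
# so the true slope is zero.
random.seed(1)
pm10 = [random.gauss(0, 10) for _ in range(500)]
mortality = [random.gauss(150, 12) for _ in range(500)]

slope, se = simple_ols(pm10, mortality)
print(f"slope = {slope:.4f}, standard error = {se:.4f}")
print("ratio |slope| / se =", round(abs(slope) / se, 2))
```

When the ratio of the estimate to its standard error is small, as it should be for data simulated this way, the estimate is hard to distinguish from zero.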

Now, one thing we know about pollution and mortality, just from the pictures of the data that we just saw, is that they're highly seasonal. Season plays a big role in explaining variability in both mortality and air pollution. Remember, mortality was high in the winter and low in the summer, and pollution was high in the summer and low in the winter. So season is clearly related to both mortality and pollution, and it might be a reasonable thing to include as a potential confounding factor. So we can fit a secondary model to the data that includes pm10 as our key predictor and season of the year as a potential confounding factor. There are four seasons, so season will be a categorical variable with a category for each season.

4:42

So here are the results of fitting that model to the data. You can see that the coefficient for pm10, which I've highlighted here, is actually quite a bit larger now, at 0.00149, and the standard error is quite a bit smaller relative to the estimate. This suggests that the coefficient for pm10 is quite a bit bigger than zero, and maybe statistically significant. So the addition of this confounding factor dramatically changed the association we estimate between pm10 and mortality. Part of the reason is that season is very strongly related to both, but it's correlated in one direction with mortality and in the opposite direction with pollution. That's part of why we saw no relationship when we didn't include season, but when we do include it, we see quite a bit stronger relationship.
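
The way a confounder like season can mask an association can be sketched with simulated data. This is only an illustration under stated assumptions: the seasonal shifts, the "true" effect of 0.5, and the helper names `slope` and `demean_by_group` are all hypothetical, and the shifts were chosen on purpose so that the seasonal pattern roughly cancels the true effect in the unadjusted fit.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def demean_by_group(values, groups):
    """Subtract each group's mean from its members."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

# Hypothetical seasonal shifts: pollution is high in summer, mortality is high
# in winter. The sizes are tuned so they roughly cancel the true effect.
PM_SHIFT = {"winter": -10.0, "spring": 0.0, "summer": 10.0, "fall": 0.0}
MORT_SHIFT = {"winter": 7.5, "spring": 0.0, "summer": -7.5, "fall": 0.0}
TRUE_EFFECT = 0.5  # simulated increase in mortality per unit of pm10

random.seed(42)
season, pm10, mortality = [], [], []
for s in PM_SHIFT:
    for _ in range(500):
        p = PM_SHIFT[s] + random.gauss(0, 5)
        m = 150 + MORT_SHIFT[s] + TRUE_EFFECT * p + random.gauss(0, 5)
        season.append(s)
        pm10.append(p)
        mortality.append(m)

unadjusted = slope(pm10, mortality)
# Demeaning both variables within season gives the same pm10 coefficient as
# a regression that includes an indicator variable for each season.
adjusted = slope(demean_by_group(pm10, season), demean_by_group(mortality, season))
print(f"unadjusted slope: {unadjusted:.3f}")
print(f"season-adjusted slope: {adjusted:.3f}")
```

In this simulation the unadjusted slope should land near zero, while the season-adjusted slope should land near the simulated effect, which is the same qualitative pattern as in the two models above.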

5:34

Now, there are other potential factors that we might want to consider, things that might be related to both mortality and air pollution. Another one is the weather. Weather is associated with mortality, and it's also highly associated with various air pollutants. We can characterize the weather with something like temperature or dew point temperature, just to capture a piece of what weather is, and include that in our model. So here are the results of including temperature, which is the tmpd variable, and dew point temperature, which is the dptp variable. I've again highlighted the coefficient for pm10: it's even bigger than it was before, and the standard error is similar, so it's actually more statistically significant, in some sense, than in the previous models. Again, temperature and dew point are strong factors that are related to both mortality and pollution.

Finally, another type of factor that may be related to both mortality and particulate matter is other pollutants. There are other pollutants that may affect health, and they may be correlated with particulate matter because they share common sources. One common source in a city like New York is going to be traffic, which can produce particles as well as other pollutants. One of the pollutants that we'll look at is nitrogen dioxide, which tends to be correlated with particles. So if you're interested in the association between particles and mortality, you might wonder whether what we're seeing is actually the association between nitrogen dioxide and mortality, with pm10 getting mixed up in it. So we can include nitrogen dioxide in our model as a potential confounding factor and see how the estimate for particulate matter changes when we do that.

7:20

So here are the results from that model. You can see that, compared to the previous model, the coefficient for pm10 drops a little bit, so the effect weakens somewhat when we include nitrogen dioxide in the model. It's still quite a strong effect relative to the standard error that we estimate, but it's not as strong as it was before we included no2 in the model. So no2 and pm10 might be sharing a lot of the same effect on mortality, and it may be difficult to completely disentangle them. However, even with no2 in the model, we still see a reasonably strong association between pm10 and mortality in New York City.

8:00

So when we put all these results together, from the primary model, which just had pm10 and mortality, and from the various secondary models we've shown, you can see that the primary model had effectively zero association, and the other three models had a relatively strong positive association with the outcome. Here I've plotted the 95% confidence intervals for each of these associations. So this is the kind of analysis that we're interested in. We're looking at pm10 as our key predictor and mortality as our outcome, and we want to see how that association changes under different scenarios, that is, under different models that include different sets of confounding factors.
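
A sensitivity analysis of this kind, tracking one coefficient and its 95% confidence interval across models with different adjustment sets, can be sketched as follows. Everything here is an assumption for illustration: the data are simulated, the coefficients used to generate them are made up, the helpers `invert` and `ols` are hypothetical names, and the interval uses the normal approximation (estimate plus or minus 1.96 standard errors).

```python
import math
import random

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(columns, y):
    """Fit y on an intercept plus the given columns; return (coefs, std errors)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    sigma2 = rss / (n - k)
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    return beta, se

# Simulated data (hypothetical coefficients): temperature raises pm10 but
# lowers mortality, and no2 is correlated with pm10.
random.seed(7)
n = 1000
temp = [random.gauss(0, 10) for _ in range(n)]
pm10 = [0.5 * t + random.gauss(0, 5) for t in temp]
no2 = [0.6 * p + random.gauss(0, 4) for p in pm10]
mortality = [150 - 0.4 * t + 0.3 * p + 0.2 * q + random.gauss(0, 5)
             for t, p, q in zip(temp, pm10, no2)]

models = {
    "pm10 only": [pm10],
    "pm10 + temp": [pm10, temp],
    "pm10 + temp + no2": [pm10, temp, no2],
}
estimates = {}
for name, cols in models.items():
    beta, se = ols(cols, mortality)
    b, s = beta[1], se[1]  # index 1 is pm10 (index 0 is the intercept)
    estimates[name] = (b, b - 1.96 * s, b + 1.96 * s)
    print(f"{name:18s} pm10 = {b:.3f}  95% CI ({b - 1.96 * s:.3f}, {b + 1.96 * s:.3f})")
```

The point of the loop is the comparison across rows: the key predictor stays fixed while the set of adjustment variables changes, so you can see how stable the estimate is.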

Now, what we do with this information will depend on what the goal of the analysis is, who the stakeholders are, and what we might do with the information afterwards. But we won't talk about that now. The point I want to make is that the kind of analysis you do for an associational question is very much along these lines.

8:55

So another question that we could ask is: what best predicts mortality in New York City, using the data that we have available? Now I've changed the question to a prediction type of question, and we want to know what predicts the outcome best.

9:08

So one of the things that we could do is fit a complex prediction algorithm. We don't need to worry about estimating associations or adjusting for confounding factors. We're just going to put in all the data that we have available to us and see how well that predicts the outcome, mortality. So here I'm going to use a random forest algorithm to make predictions of mortality in New York City. One feature of the random forest algorithm is that it provides a summary statistic called variable importance, which gives you a sense of how important a variable is in increasing the skill of the algorithm in predicting the outcome. I've plotted here the variable importance plot, which rank orders all the variables in terms of how important they are to improving the prediction skill of the algorithm.

You can see that at the very top is temperature, which is ranked as the most important variable, followed by dew point temperature, then the date, then no2, ozone (which is o3), and season, and then finally pm10 and dow, which is the day of the week. So you can see that in this prediction model, particulate matter is actually second from the bottom in terms of improving the prediction skill of the algorithm.

So if you were to look at this analysis and then ask the original question, how is pm10 related to mortality, you might think, well, it's not related to mortality, because it's not important for predicting it.

10:43

It's true that pm10 is not important for prediction, but that doesn't necessarily mean that pm10 is not associated with mortality from an associational standpoint. It may have an important association with mortality; it's just that the association is inherently weak. So it's not necessarily going to be good for predicting the outcome, but it may nevertheless have an important association with the outcome.
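
The gap between "associated with the outcome" and "important for prediction" can be illustrated with a small simulation. Note the swap: instead of a random forest, this sketch uses permutation importance on a two-predictor least-squares fit as a simple stand-in, computed on the same data used for fitting. The data, the coefficients (1.0 for temperature, 0.1 for pm10), and the helper names `fit_two` and `mse` are all assumptions for illustration.

```python
import random

def fit_two(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def mse(coefs, x1, x2, y):
    """Mean squared prediction error of the fitted model."""
    b0, b1, b2 = coefs
    return sum((c - b0 - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y)) / len(y)

# Simulated data: pm10 has a real but weak effect; temperature has a strong one.
random.seed(3)
n = 2000
temp = [random.gauss(0, 10) for _ in range(n)]
pm10 = [random.gauss(0, 10) for _ in range(n)]
mortality = [150 + 1.0 * t + 0.1 * p + random.gauss(0, 3) for t, p in zip(temp, pm10)]

coefs = fit_two(temp, pm10, mortality)
base = mse(coefs, temp, pm10, mortality)

# Permutation importance: how much does the error grow when one predictor
# is shuffled, breaking its link to the outcome?
shuffled_temp = random.sample(temp, n)
shuffled_pm10 = random.sample(pm10, n)
importance = {
    "temp": mse(coefs, shuffled_temp, pm10, mortality) - base,
    "pm10": mse(coefs, temp, shuffled_pm10, mortality) - base,
}
print(f"fitted pm10 coefficient: {coefs[2]:.3f}")
print("importance:", {k: round(v, 1) for k, v in importance.items()})
```

In this setup the pm10 coefficient is recovered as clearly non-zero, yet shuffling pm10 barely hurts the predictions compared with shuffling temperature, so pm10 ranks low on importance even though the association is real.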

So separating out these kinds of questions and their goals is very important, because if you were to ask an inferential question but then do an analysis that's really tuned for prediction, you might be led to the wrong conclusion: that pm10 is not important and doesn't have an important association with the outcome. In particular, something like pollution may have an inherently very small association with the outcome, but because everyone in New York City ostensibly breathes and is exposed to this polluted air, the ultimate effect of such an association could be quite large, given the large population of a city like New York. So it's important not to conflate the magnitude of the association with the ultimate effect on the population that you're interested in.
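
The arithmetic behind "small association, large population effect" can be sketched as a back-of-the-envelope calculation. The function name `excess_deaths_per_year` and every input value below are hypothetical and are not taken from the study; the calculation is a simple linear approximation that assumes the percentage effect applies uniformly every day.

```python
def excess_deaths_per_year(baseline_daily_deaths, pct_increase_per_unit, exposure_increase):
    """Approximate yearly excess deaths implied by a small percentage effect."""
    relative_increase = (pct_increase_per_unit / 100) * exposure_increase
    return baseline_daily_deaths * relative_increase * 365

# All inputs are hypothetical, for illustration only: a city with 150 deaths
# per day, a 0.05% rise in mortality per unit of pollution, and a 10-unit
# rise in pollution.
result = excess_deaths_per_year(
    baseline_daily_deaths=150,
    pct_increase_per_unit=0.05,
    exposure_increase=10,
)
print(f"approximate excess deaths per year: {result:.0f}")
```

A percentage effect that looks negligible for any one person still adds up, because it is multiplied by the baseline number of events in a very large exposed population.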

11:47

Using prediction algorithms for prediction questions, and associational analyses for associational questions, is really important so that you can draw the right conclusions from your data and not mistake the results of one question for another. So framing the question correctly is really important for developing an appropriate modeling strategy and for drawing the right conclusions.
