A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods



From the lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings and welcome to Section B. In this section we're going to start with multiple linear regression and look at some examples, making specific comparisons to what we talked about in the general context in the previous section. Hopefully, by the end of this section you'll be able to interpret the intercept and slope estimates from multiple linear regression models in a substantive context, and compare the results from simple (unadjusted) associations via simple linear regression with the results from multiple linear regression models to assess confounding.

So let's first talk about predictors of arm circumference. We'll start with height and go back to what we looked at in lecture 1: the simple linear regression relating arm circumference to height, using a random sample of 150 Nepalese children less than a year old.

The unadjusted association we estimated with the simple regression (it's unadjusted because we're only considering height as the one predictor of arm circumference) was such that we'd estimate the mean arm circumference for a group of children of a given height by taking the intercept, 2.7, and adding 0.16 times the height value in centimeters for the group.
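As a quick sketch, that prediction rule is just a one-line function; the intercept of 2.7 and slope of 0.16 are the estimates quoted above, and the function name is only for illustration:

```python
def mean_arm_circumference(height_cm):
    """Estimated mean arm circumference (cm) from the simple regression:
    intercept 2.7 plus 0.16 cm per additional cm of height."""
    return 2.7 + 0.16 * height_cm

# For a group of children 60 cm tall:
print(mean_arm_circumference(60))  # 2.7 + 0.16 * 60 = 12.3
```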

We saw visual evidence of an association that seemed well described by a line, but we noted that the R squared was "only" 46%. Again, it's hard to put a value judgment on the size of an R squared, but for a physical phenomenon like growth in children, we'd expect relatively strong correlations between the anthropometric measures. Now, 46% is reasonably strong, but because of the way growth works, this implies that other predictors, such as weight or sex, may be able to explain some of the variability in arm circumference that wasn't explained by height. We also showed, in lecture 4, evidence that this crude association was confounded by weight differences across the different height values. So let's tie that all together now.

So first, let's just do a simple regression to estimate the unadjusted association between arm circumference and weight. In lecture 4, we saw a scatter plot showing that this was reasonably well described by a line. The unadjusted association via this regression is such that the slope for weight is 0.8, and the intercept, which estimates the mean arm circumference for children who weigh zero kilograms (not a group in the sample, but a necessary placeholder), is 7.8. The 95% confidence interval for that slope of 0.8 is 0.72 to 0.89, so it does not include zero, and in fact the p-value for testing the null of no association between arm circumference and weight at the population level was predictably rather small. The R squared here is notably high, 70%: we estimate that 70% of the variability in the arm circumference values, in the sample at least, was explained by variability in weight between the children.

So now let's put the two together. Let's see whether we can do a better job of predicting mean arm circumference, and also whether the relationships between arm circumference and both height and weight change when we consider the other in the same model. If I use the computer to estimate this, I get a multiple regression model with one predictor for height and one for weight, such that the estimated average arm circumference equals an intercept of 14, plus negative 0.16 times the height of the group of children we're looking at (that number may sound familiar from lecture 4), plus 1.40 times the weight for the group of children we're looking at.
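The fitted multiple regression just described can be sketched the same way; the coefficients are the lecture's estimates, and the function name is only for illustration:

```python
def mean_arm_circumference(height_cm, weight_kg):
    """Estimated mean arm circumference (cm) from the multiple regression:
    intercept 14, minus 0.16 per cm of height, plus 1.40 per kg of weight."""
    return 14.0 - 0.16 * height_cm + 1.40 * weight_kg

# For a group of children 60 cm tall weighing 5 kg:
print(mean_arm_circumference(60, 5.0))  # 14 - 9.6 + 7.0, about 11.4
```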

We're going to parse this closely in a second and compare to the unadjusted results, but let me just note that both of the adjusted associations, for height and for weight, were statistically significant in this model. So both contributed, from a statistical perspective, information about the outcome of arm circumference above and beyond the other. The R squared here is 0.77, which is higher than either R squared we saw for height or weight alone, although not much higher than what we saw for weight alone. Before we talk about potential confounding, and assess it by comparing the unadjusted and adjusted results (and before we talk about why this R squared wasn't much larger than what we saw with weight alone), let's try to interpret these slopes for height and weight in terms of the comparison being made by each.

The estimated slope for height is beta-1 hat = negative 0.16. This is still the estimated mean difference in arm circumference between two groups of children who differ by one centimeter in height, but it's more specific: it compares groups who differ by one centimeter in height but are of the same weight. This is a different interpretation than in the simple linear regression with height, because that didn't consider weight at all. Here we've adjusted this comparison for weight, such that we're comparing groups who are comparable in terms of weight but differ in height. We presented the same result in lecture 4, but we didn't talk about where it came from per se. This is the weight-adjusted association between arm circumference and height.

Now think about this: does it make sense that the relationship between average arm circumference and height is negative when we're comparing groups of the same weight? Well, if they're the same weight (roughly the same mass, in terms of heft), increasing height is associated with a more lanky, if you will, or thinner, body type, which would translate into narrower arm circumference. So it does make some sense that, after we adjust for weight and are comparing groups comparable in terms of weight, height is negatively associated with arm circumference.

So this result is an adjusted mean difference. Another way to say it: this negative 0.16 estimates that groups of children who differ by one centimeter in height but are the same weight will differ in arm circumference, on average, by negative 0.16 centimeters, taller to shorter. So the taller group will have smaller average arm circumference. The 95% confidence interval suggests this difference is real on average, and could be anywhere from negative 0.21 to negative 0.11 centimeters of difference in arm circumference per additional centimeter of height.

The slope estimate for weight is 1.40. This is still an estimated mean difference in arm circumference between two groups who differ by one kilogram in weight, but here the comparison is more specific than in the simple regression because it is adjusted for height. So this compares children who differ by one kilogram in weight but are of the same height. This is called the height-adjusted association between arm circumference and weight. Another way to state this result is that it estimates that groups of children who differ by one kilogram in weight but are the same height will differ in arm circumference, on average, by 1.4 centimeters, heavier to lighter. So increased weight is associated with increased arm circumference among children of the same height, and if you think about it, this also makes some sense biologically. The 95% confidence interval for the true height-adjusted arm circumference and weight association runs from 1.2 centimeters per kilogram of weight up to 1.6 centimeters per kilogram: a range for the size of the association at the population level, accounting for sampling variability.

So how could we present these findings? In research articles (and we'll look at results from research articles in similar tables in Section D), frequently a single table of unadjusted and adjusted associations will be presented, especially for non-randomized studies where adjustment is necessary to at least assess whether there's any confounding. If we were to present this in an article, it might look like this.

We have a table titled "Linear regression results for predictors of arm circumference", and at this point we've only considered two predictors. One column says "Unadjusted", which implies the results here are slopes from simple regression models: one entry is from the simple regression of arm circumference on height alone, and one is from the simple regression of arm circumference on weight alone. Then there are the adjusted estimates, and the implication is that each has been adjusted for the other, or for all other things in the table, of which there's only one: if we're looking at height, the only other thing is weight; if we're looking at weight, the only other thing is height. These are from the multiple regression model that included height and weight as predictors.

We can immediately see that the association with height changed not only in magnitude but in direction after we adjusted for weight, so we get a very clear numerical demonstration of the confounding in these data. We also see that the association between weight and arm circumference, while positive and statistically significant when unadjusted, is larger in magnitude, still positive, and again statistically significant when adjusted. But if you look at the confidence intervals for the unadjusted and adjusted estimates, they don't overlap. This implies there was some confounding here as well: we were underestimating the association when we didn't adjust for height differences between the weight groups, and the estimate is larger, both in terms of the point estimate and in terms of the uncertainty interval, after we adjust for height.

Sometimes the results from several models will be presented. For example, if we also wanted to consider sex as a predictor, we might do something like this. We have the unadjusted associations: again, the two unadjusted associations between arm circumference and height, and between arm circumference and weight. Here is the unadjusted association with sex, and one way to present it, to indicate that this number compares females to males, is to list both female and male, and next to male put "ref", to indicate that this is the reference group for the sex comparison. Then by female we put negative 0.13, which indicates this is the unadjusted difference in mean arm circumference for females compared to the reference of males. And this is not statistically significant.

Here in Model 2, what's filled out in the table is the height and weight information, but no sex, so this implies the results are from a multiple model that includes height and weight as predictors. It also gives the intercept, so that somebody could read off the whole equation and use it for prediction purposes (which we'll talk about in the next section) if they wanted to. If we only gave the slopes, readers could make comparisons in terms of mean differences, but they couldn't estimate the mean arm circumference for any specific single group, given the weight and height information.

And then in Model 3, the implication is that it has all three predictors, so this is from a multiple regression model with height, weight, and sex all taken together. Notice that the associations for height and weight don't change much when they're additionally adjusted for sex, above and beyond being adjusted for each other: the estimated slopes are of similar magnitude, and the confidence intervals overlap.

Interestingly enough, though, the association for sex, the difference between females and males, goes from negative and not statistically significant when unadjusted (crude) to positive and statistically significant when adjusted for height and weight. So this 0.30 estimates the mean difference in arm circumference between females and males of the same weight and height. Once we control for those dimensions of mass, we see that females have greater average arm circumference than males comparable in terms of weight and height, and the difference is statistically significant.

Let's look at another example. These are data from the National Hospital Ambulatory Medical Care Survey (NHAMCS). The sample was a random sample of people who visited emergency departments in 2010, at a random sample of US hospitals. The potential predictors include (there are more than these) the sex of the individuals; race, categorized as Black, White, or Other; the insurance payer type, whether public, private, or other; and age, which is categorized in the data into four quartiles.

I'll jump straight to the table you might see in an article looking at predictors of emergency department waiting times. What it shows is one set of unadjusted estimates and one set of adjusted estimates from a multiple linear regression.

Â So let's just consider the unadjusted estimates for a minute.

Â So this sometimes when we have binary predictors,

Â they won't specify which is the reference group with it's own row, but

Â they'll just name the predictor for the group that is not the reference group.

Â So male here implies that this is a sex comparison, and we're comparing males.

Â And the hidden group, the one that's not shown, females, is the reference.

Â So this negative 2.5 said that males, on average,

Â had waiting times of 2.5 minutes less than females.

Â Accounting for sampling variability, this reduction average waiting time

Â could be anywhere from 4.4 to 0.7 minutes but it is statistically significant,

Â indicating that at the population level males had

Â shorter waiting times on average because that interval does not include zero.

But this does not take into account any other characteristics of the patients. Here are the racial comparisons, not considering any other factors. White actually appears in the table, to indicate that it's the reference group. This slope of 19.3 for Black compares the average waiting times of Black patients to White patients, no other factors considered: 19.3 minutes longer on average, and statistically significantly so. And this 2.6 compares the mean waiting times of those who identify as Other (not Black or White) to White; this difference is not statistically significant.

See this p-value up here? This is what I was alluding to in the first section. The null here is that the slopes (White is the reference group, so the slopes are the mean differences between Black and White, and between Other and White) are equal to each other and equal to 0. Meaning, if you play this out, not only are the differences between each of these groups and White 0, but the difference between Other and Black, which would be the difference of those two quantities, is also 0. So the null is that there's no difference in waiting times on average between the three racial groups, and the alternative is that at least one group has a different mean waiting time. We can see that this p-value is less than 0.001. This is what an ANOVA (analysis of variance) p-value tests: whether there are any mean differences here. The result is statistically significant, so at least one group was statistically significantly different from the others. We don't have a confidence interval for the Black-to-Other comparison, but we can certainly see that the difference between Black and White is statistically significant.

For age, we can see that older age is associated with longer waiting times relative to the under-20 reference group, until you get to the 55-and-older group, whose average is slightly below the reference, but not statistically significantly so. Overall, though, there is an association between age and waiting times, as the p-value for testing for any differences is less than 0.001.

Similarly, those on Public insurance have longer waiting times, by about 3.5 minutes on average, compared to those on Private, and statistically significantly so. Those on Other payer types, which include self-pay patients who have no insurance, have average waiting times 11 minutes longer than those on Private, which is also statistically significant. As we can already tell from some of these differences, on the whole there is an association.

So let's look at the adjusted estimates. The estimate for males now compares the average waiting times of males to females of the same race, age category (at least), and payer type. And it's very similar: still negative, males have shorter waiting times than comparable females, by 2.1 minutes, and it's statistically significant. So the comparison here, the average difference in waiting times for males versus females of the same race, age, and payer, is such that males have waiting times 2.1 minutes lower on average. From a confounding perspective, then, race, age, and payer type don't seem to confound the overall unadjusted sex association much: it's still negative after adjustment, similar in value, and statistically significant.

With race, we might ask whether some of the original difference by race had to do with sex, age, or payer type differences between the racial groups. So let's compare Black to White after adjustment for these other things. This 18-minute difference is the average difference between Black and White patients after adjusting for, or, another way to say it, among those of the same age, sex, and payer type. The difference remains slightly smaller but substantively equivalent: more than 15 minutes, more than a quarter of an hour, and it's statistically significant. The difference between Other and White stays almost exactly the same. So it doesn't appear that those original differences by race are explained by differences in sex, age, or payer type between the racial categories as they affect waiting time.

And this p-value here is testing the null that there's no association between waiting times and race, after accounting for sex, age, and payer type.

If we go down to age, I'll let you look at the details, but there doesn't appear to be much confounding: the associations are similar in terms of their magnitude and significance after adjustment. This p-value is from testing the null that the slopes for the three age categories are all equal to each other and to 0, from the model that has already accounted for sex, race, and payer type. So, after adjusting for sex, race, and payer type, the null here is that there is no association between waiting times and age, and that null is rejected. This says that age gives us information about waiting times above and beyond those other three predictors. And the story with payer type remains very similar as well.

So if we were to write this model out, just to have some fun and show what it looks like, here is the adjusted model. Our estimated average waiting time, y-hat, equals the intercept, 46.5, plus negative 2.1 times x1, where x1 is 1 for males and 0 for females. Then the race part: plus 18.0 times x2 plus 2.6 times x3, where x2 is 1 if the subject identifies as Black (0 if not), and x3 is 1 if they identify as Other (0 if not). Then the age component, with age in four categories: plus 5.2 times x4, plus 4.9 times x5, plus negative 0.1 times x6, where x4 is 1 if the person is 20-34 years old (0 if not), and x5 and x6 represent the other two non-reference age categories. And we'd have payer type: plus 3.3 times x7 plus 10.0 times x8. Altogether:

y-hat = 46.5 - 2.1(x1) + 18.0(x2) + 2.6(x3) + 5.2(x4) + 4.9(x5) - 0.1(x6) + 3.3(x7) + 10.0(x8)

So there are eight x's in this model but only four predictors (sex, race, age, and payer type), because some of these predictors require multiple x's. That's just to give you a sense that these models can be big, and they can be bigger than this.
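Written as code, the equation above becomes a small prediction function. The coefficients are the lecture's estimates; the category labels for the two unnamed middle age groups (here written "35-54") are assumptions for illustration, since only the 20-34 group is named explicitly:

```python
def mean_wait_minutes(sex, race, age_group, payer):
    """Estimated mean ED waiting time (minutes) from the adjusted model:
    y-hat = 46.5 - 2.1*x1 + 18.0*x2 + 2.6*x3
            + 5.2*x4 + 4.9*x5 - 0.1*x6 + 3.3*x7 + 10.0*x8."""
    y = 46.5                                               # intercept
    y += -2.1 if sex == "male" else 0.0                    # x1 (female = ref)
    y += {"White": 0.0, "Black": 18.0, "Other": 2.6}[race]  # x2, x3 (White = ref)
    # Age in four groups; "<20" is the reference. The "35-54" label is
    # hypothetical -- the lecture names only the 20-34 group.
    y += {"<20": 0.0, "20-34": 5.2, "35-54": 4.9, "55+": -0.1}[age_group]  # x4-x6
    y += {"Private": 0.0, "Public": 3.3, "Other": 10.0}[payer]  # x7, x8
    return y

# A Black male, 20-34, on public insurance:
print(mean_wait_minutes("male", "Black", "20-34", "Public"))
# 46.5 - 2.1 + 18.0 + 5.2 + 3.3, about 70.9 minutes
```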

But nicely, in a paper context, we see a table like this instead of the entire equation written out, and this is a more concise presentation because we get the estimate and the confidence interval. Where do these confidence intervals come from? Well, like I said, it's business as usual. Just to give you an example, here are the slopes we presented for the age categories and their estimated standard errors.

So the confidence interval given in the previous table for the mean difference, adjusted for sex, race, and payer type, between those who are 20-34 years old and the youngest group would be the estimate of 5.2 plus or minus two standard errors, and that gave us the confidence interval from 2.2 to 8.2. This looks slightly different from what I showed on the previous slide, because here I rounded the numbers, whereas I got the previous ones from the computer. The next one is 4.9 plus or minus 2(1.4), which gives a confidence interval of 2.1 to 7.7, very similar to what was on the previous slide. And this is negative 0.1 plus or minus 2(1.5), which gives approximately the confidence interval we showed before, negative 2.9 to 2.8. So it's the same old business as usual for confidence intervals.
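That "business as usual" calculation is just estimate plus or minus two standard errors. A minimal sketch, using the age-category slopes above (the 1.5 standard error for the 20-34 slope is inferred from the quoted interval):

```python
def ci_95(estimate, se):
    """Approximate 95% confidence interval: estimate +/- 2 standard errors."""
    return (estimate - 2 * se, estimate + 2 * se)

# Age 20-34 slope, adjusted: 5.2 with standard error 1.5
print(ci_95(5.2, 1.5))   # about (2.2, 8.2)
# Next age category: 4.9 with standard error 1.4
print(ci_95(4.9, 1.4))   # about (2.1, 7.7)
```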

Okay, so in summary, we've given a couple of examples of multiple regression and shown how to interpret the resulting slopes. What would the estimated intercept describe in these models with multiple predictors? Well, for example, in the model of arm circumference on height, weight, and sex, the intercept estimates the average arm circumference for male children (because male was the reference for sex) with zero height and zero weight. So it's a completely fictional group that doesn't describe anybody in our sample. Intercepts in multiple regression frequently don't have any relevance to the population from which our sample is drawn, but they're necessary to specify the entire equation.

We've seen in this section how to interpret the slopes from multiple regression models in terms of the comparisons being made and the mean differences, and how they're specifically adjusted for the other things in the model. And we've looked at how to assess confounding, through two examples (we'll look at some more in Section D), by comparing the results from simple regressions to the adjusted results from multiple linear regressions. In the next section, Section C, we'll show how to use these models to predict outcome values for different predictor values, and how to compare groups and get mean differences between groups who differ by more than one predictor in a multiple regression model.
