A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

All right, welcome back. In this section we're going to look at three examples of linear regression used in research articles. We're going to show how the authors describe their approach to fitting regression models and then look at the results in tabular format. So hopefully this will give you an opportunity to interpret the results from simple and multiple linear regression models presented in several published journal articles.

So the first one is one we're already familiar with. We've looked at it several times in this course. It's the academic physician salary study published in the Journal of the American Medical Association in 2012. What they essentially did was take a survey of academic physicians, and they wanted to compare the average yearly salary between male and female researchers. But because there could be other differences between the male and female researchers that may also be related to salary, it was important to adjust for those differences to properly estimate the salary differential between men and women.

So we've looked at these initial results several times. They say the mean salary within their cohort was $167,669 U.S. dollars annually for women and $200,433 annually for men, and this difference was on the order of $33,000 per year. But they went on to adjust this for multiple things, including specialty, academic rank, leadership positions, publications, and research time. They talk about doing this in the final model, and we'll see in a minute that what they're referring to is a multiple linear regression model. They found that after adjustment there was still a sizable and statistically significant difference in average salary between males and females. However, it was lower in magnitude than originally estimated when ignoring these other factors that were ultimately adjusted for.
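
As a rough sketch of what such an adjusted comparison looks like in a multiple linear regression (the symbols here are illustrative and are not taken from the article), the mean salary can be written as a linear function of a sex indicator plus the other predictors:

```latex
E[\text{salary}] = \beta_0 + \beta_1 x_{\text{male}} + \beta_2 x_2 + \cdots + \beta_p x_p
```

Here x_male is assumed to be 1 for men and 0 for women, and x_2 through x_p stand for the adjustment variables (specialty, academic rank, and so on). Under that coding, β_1 is the adjusted difference in mean salary between men and women who are the same on the other predictors. In a model with the sex indicator alone, the same slope is the unadjusted difference of roughly $33,000.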

So let's look at the methods section to get a sense of how they used linear regression to do their analysis. I'll just read it to you, even though you're perfectly capable of reading it, but I'll highlight certain things here. They start by saying they limited the analytic sample to individuals with MD degrees who were still affiliated with U.S. academic institutions and reported their salary, after comparing those who reported salaries and those who had not. They initially looked to see if there were any biases in the data on other factors. They used only those who reported salaries: since the study was essentially about salary, they couldn't use the information on people who didn't report.

They go on to say that they described characteristics of this sample by gender, and we've seen some examples of that in previous lectures, where they look at, for example, the distribution of region of the country by male and female, or the distribution of the institution's NIH funding tiers by male and female. And then they go on to say that they constructed multiple variable linear regression models for salary with a set of respondent characteristics, and they tick off a long list of things that they included as additional predictors of salary, essentially to adjust for them.

They go on to say that most characteristics were categorical and modeled as indicators with a reference category. So we've talked about doing that. It's just nice to see people referencing something we've learned about in the text.
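
To make the idea of indicators with a reference category concrete, here is a minimal sketch in Python. The function name `indicator_columns` and the academic ranks are made up for illustration; they are not from the article.

```python
def indicator_columns(values, reference):
    """Turn one categorical variable into 0/1 indicator columns.

    The reference category gets no column of its own: a row in the
    reference category has 0 in every indicator.
    """
    # Keep every non-reference level, sorted so the output is reproducible.
    levels = sorted(set(values) - {reference})
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical academic ranks for five respondents (not data from the article).
ranks = ["assistant", "associate", "full", "assistant", "full"]
columns = indicator_columns(ranks, reference="assistant")
print(columns)
```

With this coding, each indicator's slope in a regression compares that category with the reference category (here, assistant), holding the other predictors fixed.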

They go on to say that they constructed both a full model using all these covariates (that's another name for predictors) and a smaller one. For the full model, they essentially fit the overall regression model for salary that included all the things listed on that previous slide. Then they also fit a smaller, more parsimonious model, where they say they iteratively deleted variables from the model based on improvement in the Akaike information criterion, using both forward stepwise and backward elimination approaches.

We have not talked about this Akaike information criterion, sometimes called the AIC, but it's a very similar process to what I was referencing in the previous section, where the researchers would pare things down iteratively by taking out things that were not statistically significant: one at a time, refitting the model, then removing, in some kind of order, the next variable that is not statistically significant, until they got down to a subset of predictors that were all statistically significant. The forward approach would start with a model for salary and then add in more variables one at a time, see if they were statistically significant, and if they weren't, throw them out of the pool of potential covariates (potential predictors) and try another one. So the goal is to pare the model down to only those predictors that are adding information statistically.
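
Here is a minimal sketch of backward elimination guided by AIC, written in plain Python on simulated data. Everything specific in it is an assumption for illustration: the variable names (`male`, `rank`, `noise`), the simulated effect sizes, the choice to always keep the sex indicator, and the use of n·ln(RSS/n) + 2k as the AIC for a least-squares fit (valid up to an additive constant). It is not the article's actual procedure.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def aic(data, y, predictors):
    """AIC (up to an additive constant) of a least-squares fit of y."""
    n = len(y)
    cols = [[1.0] * n] + [data[name] for name in predictors]  # intercept first
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    rss = sum(
        (y[i] - sum(b * c[i] for b, c in zip(beta, cols))) ** 2 for i in range(n)
    )
    return n * math.log(rss / n) + 2 * len(cols)


def backward_eliminate(data, y, always_keep):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    current = list(data)
    best = aic(data, y, current)
    while True:
        candidates = [
            (aic(data, y, [p for p in current if p != drop]), drop)
            for drop in current
            if drop not in always_keep
        ]
        if not candidates or min(candidates)[0] >= best:
            return current, best
        best, dropped = min(candidates)
        current.remove(dropped)


# Simulated data: salary depends on sex and rank; "noise" is unrelated to it.
random.seed(0)
n = 200
data = {
    "male": [float(random.random() < 0.5) for _ in range(n)],
    "rank": [random.gauss(0, 1) for _ in range(n)],
    "noise": [random.gauss(0, 1) for _ in range(n)],
}
y = [
    150 + 12 * m + 20 * r + random.gauss(0, 10)
    for m, r in zip(data["male"], data["rank"])
]

full_aic = aic(data, y, list(data))
kept, final_aic = backward_eliminate(data, y, always_keep={"male"})
print("kept:", kept, "AIC:", round(final_aic, 1), "full-model AIC:", round(full_aic, 1))
```

The procedure compares models by AIC rather than by p-values, but the flavor is the same as the significance-based paring described above: refit, remove the least useful predictor, and stop when no removal improves the criterion.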

So they actually show the results of both these models. I can only show you a piece of it because it spans two pages in the article. But let's focus our eyes on the prize in terms of the main question of interest, the gender comparison. They give the intercept from both models. The initial model is the one that includes every possible confounder as a predictor. So they have race here, and they give the adjusted differences in salary between the race groups, adjusting for all the other things in the model, but then they give an overall p-value testing whether there are any differences in average salaries by race at the population level after adjustment. And it's not statistically significant. So ultimately they removed this from their final model, where they only included the statistically significant items.

They looked at age. These are just some of the things they looked at. They also looked at whether or not the researcher had children, after adjusting for all these other things. And while those who didn't have children had lower average salaries in the sample after adjusting for the other things, this result was not statistically significant.

So in this model, the one where they adjusted for everything whether it was statistically significant or not, the average difference in salary between males and females was about $12,000 per year. Their final model is the one where they only included the other predictors that were statistically significant. So the reason there are no entries for all of these things here is that none of them were statistically significant when they were put in that overall model. Like I say, I'm not going to show you the whole table, but there's a whole other page of results, and some of these other predictors remained in the model; their adjusted estimates are shown in this column as well. So the average difference in salaries between males and females, upon adjustment only for the things that remained in the final model, was a little bit larger, but it was still statistically significant, and it was on the order of $13,400 per year.
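
To put the three salary differences side by side, here is a short arithmetic sketch. The two mean salaries are the figures quoted from the article; the two adjusted differences are the approximate values quoted in this lecture, so the percentages are approximate as well.

```python
# Mean annual salaries reported in the article (US dollars).
mean_men = 200_433
mean_women = 167_669

unadjusted = mean_men - mean_women  # crude difference, men minus women

# Adjusted differences, as quoted approximately in the lecture.
adjusted_full = 12_000   # model with every candidate predictor
adjusted_final = 13_400  # pared-down final model

for label, adj in [("full", adjusted_full), ("final", adjusted_final)]:
    share = adj / unadjusted
    print(f"{label} model: {adj:,} of {unadjusted:,} remains ({share:.0%})")
```

In other words, roughly 40 percent of the crude difference of $32,764 remains after adjustment, whichever of the two models is used.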

Here's another example of a study that used linear regression techniques, this one on sleep and BMI in subjects with insomnia. So here's the abstract. Essentially the study started by looking at polysomnographically determined sleep monitored overnight, especially the amount of slow-wave sleep (SWS), and body mass index in patients with insomnia.

Now initially they recruited patients with insomnia and people without insomnia for their study, and they measured things like height and body weight at the time of the study. They found no significant correlations between total sleep time and BMI among insomniacs. However, compared with normal volunteers (those without insomnia), insomnia patients exhibited longer sleep latency and shorter total sleep duration. While the two groups had no significant differences in BMI, insomniacs presented with more N1 but less time spent in slow-wave sleep and rapid eye movement sleep. And based on their slow-wave sleep time, they divided the insomniacs (now they're going to focus in on the insomniacs) into three groups, and they found differences in the average BMI between several of these groups.

And then they go on; this is really what they present in the paper, so it's interesting that they hold it back until this point in the abstract, but this is what we're going to focus on. They say, "Further analyses with multiple linear regression showed a significant negative correlation between the amount of slow wave sleep and BMI scores in insomniacs, whereas no such correlation was found in healthy volunteers after controlling for potential confounds like age, sex, and other indices of sleep." They actually looked at the regression separately for insomniacs and healthy volunteers, and they found correlations with BMI among the insomniacs. And so they go on to say, "Our study suggests that lower amounts of slow wave sleep may be associated with higher BMI in patients with insomnia."

So let's just take a look at their data analysis section, just to show the presence of a lot of things we've covered up till now in this two-quarter sequence. They say in the Data Analysis section that descriptive statistics examined were means and standard deviations. Comparisons of the two groups were performed using Student's t-test for continuous variables. Comparisons across trisections of time spent in slow-wave sleep (those were the three groups they characterized) were carried out using ANOVA for continuous variables with normal distributions. Variables that were not normally distributed were log transformed for statistical analyses. That's not necessary, but some researchers cling to this idea that data need to be relatively normally distributed to do mean comparisons. And then additional chi-square analyses were used for categorical data, and they go on to talk about some other adjustments they made.

But then here's where we get to the part about regression. Unadjusted linear regressions were used to assess the relationship between time spent in slow-wave sleep and REM sleep, total sleep time, and BMI. The model was then adjusted for multiple potential confounding variables. These potential confounding variables included age, sex, education level, duration of illness, etc.

So I just want to point out the first table they showed, where they did descriptive statistics and compared the means or percentages, depending on whether the variables were continuous or categorical, between the healthy controls and the insomnia patients. And they report p-values, ostensibly using the two-sample t-test for means; if they had proportions, which they don't in this table, they would use the chi-squared test.

But here's what I want to show you, just to get a sense of how much slow-wave sleep we can expect in one night of sleep. In healthy controls, the average was 83 minutes, but there was a fair amount of person-to-person variability. In insomnia patients, the average was on the order of slightly more than an hour, 62.8 minutes, with a fair amount of patient-to-patient variability. So I just want to get a baseline sense of how much, on average, we would expect in an insomnia patient.

So then they talk about the specific linear regression that was sort of the crux of one of their findings in the article, and we'll show the results next. They say that because there was a significant inverse correlation between time spent in slow-wave sleep and BMI, they further estimated the unadjusted association between time spent in slow-wave sleep and BMI using linear regression. They then added multiple potential confounding variables. In the second multivariate model, the covariates (the additional predictors that were potential confounders) were age, sex, education level, and duration of illness. In the third model they added more potential confounders, and then the final model included more on top of the third model. And we'll see now that they presented the results from each of the models, focusing on the relationship between BMI and slow-wave sleep time. And they say here that the significance level was established at P less than 0.05. So here they're saying their alpha level for their hypothesis tests is five percent.

So here they show the results of several linear regression models. They don't show all estimated slopes. They focus on the slope of slow-wave sleep time as it predicts mean BMI from each of these models. The first model is the unadjusted association: they simply regressed BMI on slow-wave sleep time. And we can see that there's a negative association: every additional minute of slow-wave sleep time is associated with an average difference in BMI of 0.02 units lower. So think about this: if we were looking at groups of patients who differ by half an hour in their average nightly slow-wave sleep, that would translate into an estimated average BMI lower by about 0.6 units for the group that got more slow-wave sleep. So that gives some sense of the magnitude of this association.
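
The half-hour comparison is just the slope multiplied by the difference in the predictor. A minimal sketch, assuming the unadjusted slope of -0.02 BMI units per minute quoted above and a linear relationship (the helper name `expected_difference` is made up for illustration):

```python
def expected_difference(slope_per_unit, units_apart):
    """Estimated difference in the mean outcome between two groups that
    differ by `units_apart` on the predictor, assuming a linear relationship."""
    return slope_per_unit * units_apart


slope_bmi_per_minute = -0.02  # unadjusted slope quoted in the lecture
for minutes in (1, 30, 60):
    diff = expected_difference(slope_bmi_per_minute, minutes)
    print(f"{minutes:>2} more minutes of slow-wave sleep: {diff:+.2f} BMI units")
```

The 60-minute line is an extra illustration beyond what the lecture computes; it is only as trustworthy as the assumption that the line describes the relationship over that whole range.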

They give a confidence interval and a p-value, and it's statistically significant. And then they go on to show that with these subsequent models they start adjusting for things, and then more things, and then more things. They show that while the estimate attenuates a bit (it gets a little bit smaller in absolute magnitude), it's still negative and still statistically significant. One thing to note here, and I'm not sure why (I would have asked about this if I were reviewing the paper): the confidence interval they report for the last adjusted association includes zero, but the p-value is less than 0.05. Without knowing more about what they did I can't comment further, but my first reaction was that there may be some kind of reporting mistake here.
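
The reason that combination looks suspicious is that a 95% confidence interval and a two-sided test at the 0.05 level, when both are built from the same estimate and standard error, always agree about zero. Here is a minimal sketch using a normal approximation. The estimates and standard errors are hypothetical (the lecture does not quote them), and the article's software may well have used a t distribution instead, which leads to the same agreement.

```python
from statistics import NormalDist


def wald_summary(estimate, std_error, level=0.95):
    """Confidence interval and two-sided p-value from a normal approximation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(estimate / std_error)))
    return lower, upper, p_value


# Hypothetical slopes and standard errors, chosen only for illustration.
for est, se in [(-0.02, 0.008), (-0.02, 0.012)]:
    lo, hi, p = wald_summary(est, se)
    excludes_zero = lo > 0 or hi < 0
    print(f"est={est}, CI=({lo:.3f}, {hi:.3f}), p={p:.3f}, "
          f"CI excludes 0: {excludes_zero}, p < 0.05: {p < 0.05}")
```

In both cases the interval excludes zero exactly when p is below 0.05, so a reported interval that includes zero alongside p < 0.05 suggests a typo, rounding, or an interval and p-value that came from different calculations.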

But on the whole, what they're demonstrating is that there is a statistically significant negative association between average BMI and slow-wave sleep, both unadjusted to start and then adjusted, and that it persists even after adjustment for other factors. They layered in the other factors: they start with this set, then they add these additional indices, and then they finally add REM sleep time and more indices. Throughout those layers of adjustment, once they start adjusting for the first layer (age, sex, and education level), there's a little attenuation from the unadjusted estimate, the absolute value gets smaller, and then it stays consistently around the same order of magnitude with more adjustment. So they're demonstrating that the initial association they found wasn't completely explained by the other factors they adjusted for.

One last one we'll look at, and it's another sleep-based study. This is an earlier study, from The Lancet in 1994, on blood pressure and snoring. In the summary in the abstract they say, "The association between snoring and blood pressure is still a matter for debate." And they go on to say that's partly because of the uncertainty about the definition of snoring and partly because confounding factors such as obesity, sleep apnea, and nocturnal hypoxaemia may affect systemic blood pressure.

So what they did is they took a relatively large sample, over 1,400 patients, the majority male, who were referred to a sleep disorders study center. They got a full history on their health, with particular attention to CVD and medications. And they go on to summarize that the patients had nocturnal polysomnography, or a sleep study, including objective measurement of snoring, and blood pressure was measured in the morning. 18 percent of the non-snorers had hypertension, as did 20 percent of the heavy snorers; those proportions were not statistically significantly different. They go on to say that multiple linear regression analysis showed that snoring was not a significant determinant of blood pressure. Only age, male sex, the apnoea/hypopnoea index, and body mass index contributed significantly to the variability in blood pressure.

So I just want to hone in on something that I found interesting here. There are summary statistics they give for the sample; let's look at this snoring index. The average is 323 for all subjects in the sample, with a fair amount of variability: a standard deviation of 321. But this ranges from zero to 1,846. And the authors define the snoring index as the number of snores (I'm taking this verbatim from the text) in an hour of sleep. It was measured as part of the sleep study. There's nothing in the article that shows this, but I'm very curious about what the distribution of these measures looks like across the people in the sample. I would love to have seen a histogram of that.

So then they go on to report the regression results. They call this univariate, which means that these are the unadjusted estimates; for example, this is the regression of systolic blood pressure on sex and only sex. By putting male here, the implication is that this number compares males to the female reference group. Then there's age in years and BMI in kg/m². And then here, down the table, they have the snoring index.

So let's just look at sex, for example, even though snoring index is the main question of interest. They show that males have, on average, systolic blood pressures 7.2 millimeters of mercury higher than females, and the result is statistically significant. They give a standard error of 1.052, so we could create a confidence interval if we wanted. And this is the slope from a simple regression where X is one for males and zero for females.
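
As a quick sketch of that confidence interval, using the estimate and standard error quoted above and assuming the usual large-sample multiplier of 1.96 (reasonable here with over 1,400 patients, though the authors may have computed it differently):

```python
estimate = 7.2       # unadjusted male-female difference in mean SBP (mmHg)
std_error = 1.052    # standard error reported in the table

margin = 1.96 * std_error  # large-sample 95% multiplier
ci_lower = estimate - margin
ci_upper = estimate + margin
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f}) mmHg")
```

The interval runs from roughly 5.1 to 9.3 mmHg and does not include zero, which agrees with the result being statistically significant.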

Then they show the adjusted association, adjusted for the other factors in the model. This is the slope of an X for sex in a model that includes age, BMI, AHI, etc. It shows that after adjustment for these other factors, males still have a higher systolic blood pressure than comparable females, but it's on the order of 5.4, and it is statistically significant. So it's slightly smaller than the difference when not adjusted for anything, but still statistically significantly larger.

Let's go to snoring index now. This unadjusted slope is from a simple model that relates average systolic blood pressure to the snoring index. But snoring index is measured on a continuum, so this slope compares the average systolic blood pressures between two groups who differ by one unit on the snoring index. This assumes that the relationship between systolic blood pressure and snoring index is well described by a line across the entire range in the sample, from zero to over 1,800. And I would love to see a scatterplot of that relationship.

But in any case, under that assumption they estimated that the average difference in systolic blood pressure per one-unit difference in snoring index is on the order of 0.007 millimeters of mercury. That's not a large amount, but remember we're comparing people in the sample whose values range from zero to over 1,800. So if we compared two groups whose snoring indices differed by 100, we would expect the group with more snoring, 100 units more, to have a blood pressure on the order of 0.7 millimeters of mercury higher on average than the group with the lower score.
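
Written out as arithmetic, with the unadjusted slope of 0.007 mmHg per unit and, as an extra illustration that the lecture does not compute, the full observed range of 0 to 1,846:

```latex
\Delta\widehat{\text{SBP}} = 0.007 \times 100 = 0.7\ \text{mmHg},
\qquad
0.007 \times 1846 \approx 12.9\ \text{mmHg}
```

The second figure is only as meaningful as the assumption that a single line describes the relationship across that entire range.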

This result is statistically significant in the unadjusted comparison. But notice that when they adjust, the estimate becomes negative and is no longer statistically significant. So it appears at first that there was a statistically significant association between systolic blood pressure and snoring index. But after adjusting for other factors that are potentially related to both, the association, from a statistical perspective, disappeared.
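
Here is a minimal sketch of how an association can vanish after adjustment. The data are made up: the outcome depends only on a confounder z, and the exposure x simply tracks z. The adjusted slope is computed with a standard identity (often called the Frisch-Waugh-Lovell result): regress the part of y not explained by z on the part of x not explained by z. None of the numbers come from the article.

```python
def mean(v):
    return sum(v) / len(v)


def simple_fit(x, y):
    """Intercept and slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """What is left of y after removing its linear relationship with x."""
    a, b = simple_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up data: the outcome depends only on z, but x tracks z closely.
z = [0, 1, 2, 3, 4, 5]            # confounder (think of BMI)
x = [1, 0, 3, 2, 5, 4]            # exposure (think of snoring index)
y = [3 * zi for zi in z]          # outcome (think of blood pressure)

_, unadjusted = simple_fit(x, y)
# Adjusted slope for x: the part of y not explained by z,
# regressed on the part of x not explained by z.
_, adjusted = simple_fit(residuals(z, x), residuals(z, y))
print(f"unadjusted slope: {unadjusted:.2f}, adjusted slope: {adjusted:.2f}")
```

On its own, x looks strongly related to y, but once z is accounted for there is nothing left for x to explain. In real data the adjusted slope would not land exactly on zero; it would just become small and statistically indistinguishable from zero, as in the article.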

So this is how they describe it. They say the results of the regression analysis are shown in Table 3. Although univariate analysis shows statistically significant contributions for all variables when considered individually (in other words, all of the unadjusted associations were significant), full-model multivariate analysis showed that only male sex, age, BMI, and AHI, which is an apnea index, contribute significantly to the final model. These variables in fact were the only ones selected by a stepwise multiple regression analysis. In other words, they let the computer choose the predictors that were statistically significant to form their ultimate model, which they actually don't show. They only show the results of a model that included all of these predictors, in which R² = 0.18 for diastolic blood pressure and 0.21 for systolic blood pressure. That's what we were looking at.

So they're essentially saying they estimate, based on their data, that together sex, age, BMI, and this apnea index explain an estimated 21 percent of the variation in the systolic blood pressure measurements in their sample.
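
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with hypothetical blood pressures and hypothetical model predictions (they are not from the article, and the predictions are not an actual least-squares fit):

```python
def r_squared(observed, fitted):
    """Proportion of the variation in `observed` explained by `fitted`."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_resid = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return 1 - ss_resid / ss_total


# Hypothetical systolic pressures and model predictions (mmHg).
observed = [118, 125, 131, 140, 152]
fitted = [124, 128, 133, 138, 143]
print(f"R^2 = {r_squared(observed, fitted):.2f}")
```

An R² of 0.21, as in the article, would mean the residual variation is still about 79 percent of the total: the predictors are related to blood pressure, but most of the person-to-person variability is left unexplained.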

So hopefully this has been informative, showing you at least some examples of the use of linear regression in research and the reporting of how the researchers conceptualized modeling their results, how they chose their final multiple regression model, if you will, and then the presentation of the results and how they interpret them substantively.
