A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 3B: More Multiple Regression Methods

This set of lectures extends the techniques debuted in lecture set 3 to allow for multiple predictors of a time-to-event outcome using a single, multivariable regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings and welcome to lecture set nine.

In this lecture set, we'll give a brief overview of handling effect

modification in a multiple regression context, and

also look at another approach above and beyond categorizing continuous

predictors in a regression model to handle non-linear associations.

So in this lecture set, an overview, we will look at testing for effect

modification and estimating different outcome/predictor associations for

different levels of a potential effect modifier via the use of

something called interaction terms in regression.

And we'll also look at conceptualizing non-linearity as a type of

effect modification and showing another way to model it in a regression context,

which will be very similar to the concept of interaction terms, and

we can do this without categorizing the continuous predictor.

So this is just another approach that's sometimes used in the literature.

But let's get started by first looking at the idea of regression with

interaction terms.

So hopefully at the end of this lecture you will be able to describe the interaction term

approach, and appreciate that it exists, as a way of

estimating separate outcome/predictor associations for

different levels of an effect modifier and of testing for effect modification.

I'm going to give some details in here, and

what I really want you to get is the big picture.

But some of you will be interested in using these techniques in

your own analyses and may go on to do your own analyses or

take further courses in regression analyses so I want to show you at

least some of the basic mechanics for those of you who are interested.

So one way to handle effect modification in the regression context is to

present stratified results.

So here is an example of a study where effect modification by the sex of

the person was of interest, and the results were presented for simple and

multiple logistic regression models,

estimated separately using only the data from males and only the data from females.

This was an article that was looking at suicide outcomes,

both having suicidal thoughts and attempting suicide, as a function of

a person's self-reported sexual identity, whether they identify as homosexual or not.

And the authors look both at the unadjusted associations between these

outcomes and homosexuality and the same association adjusted for

other factors like ethnicity, alcohol abuse, et cetera, but

they did so separately for males and females.

So, a closeup of the top of this table shows these

associations estimated for boys only.

And at the bottom of the table the same analysis is presented for females.

One way to summarize this concisely, with regard to the outcome of suicide attempts,

would be to show the unadjusted and adjusted odds ratios comparing the odds

of having attempted suicide for those who identify as homosexual to those who do not,

doing so separately for males and females, and

that's what the authors ultimately were talking about.

And they wanted to investigate whether the relationship between suicidal outcomes and

homosexuality was different for males and females.

So they completely stratified their data by sex and ran these separate analyses.

Another example is an article that came out in the New England Journal of

Medicine looking at coffee drinking and mortality.

And the authors, what they did was they looked at time to event data,

time to mortality data in a follow-up period as a function of how much coffee,

on average, people reported drinking at the start of the follow-up period.

And so what they did, separately for males and

females, to see if there were any differences in the associations,

either directionally or in terms of magnitude, was

estimate the essentially unadjusted association, adjusted only for age,

between mortality and coffee consumption.

They estimated hazard ratios of mortality for

the different levels of coffee consumption relative to those who didn't drink coffee.

And then they re-estimated these adjusted for a multitude of other factors, and

here's a list of those factors here.

Body mass index, race or ethnic group, et cetera.

But they did these analyses, both the age-adjusted only and the

multiply adjusted, totally separately for the data on men and again on women.

So they never combined the data between the sex groups,

they did the analyses completely separately.

Sometimes, however, the researcher may want to estimate separate associations across the levels of

one predictor only,

like sex,

and then use the information across both sexes, or

across all the groups for which they want to present separate associations,

to estimate the associations with the other predictors.

So for example, we may want to estimate the sex specific association between

wages and years of education after adjusting for

other factors, but we want to use the data from both males and

females to estimate the association with these adjustment factors.

So, we don't want to do the analysis completely separately for males only and

females only.

Similarly we might want to estimate age specific associations between

mortality and race in dialysis patients, after adjusting for other factors.

But we want to use the data for all age groups combined in order to

estimate the adjusted associations in one model with these other factors.

Well, this can be done by including what's called an interaction term in

a multiple regression model.

So let's look at an example here. This is a data set on 534

US workers in 1985, and what I want to look at

is the association between hourly wages and years of education.

What I am presenting in this table is the unadjusted and adjusted linear

regression slopes for years of education only, from models that adjusted for

various other characteristics.

So, here is what we have.

Here is the slope of years of education from a simple linear regression model of

wages on years of education.

So the unadjusted association suggests that two groups who differ by one year

in years of education will have hourly wages that differ by 75 cents on average,

and the higher salary is associated with more years of education.

This result doesn't change at all after adjusting for

sex differences in the years of education groups, and it essentially

stays the same after additionally adjusting for union membership.

The estimate attenuates a bit, but

is still statistically significant when we start adjusting for job type.

But in all these adjusted analyses,

the slope for years of education compares the difference

in average hourly wages for two groups who differ by one year of education

but are the same in all other characteristics.

It doesn't matter what those other characteristic values are as long as

the two groups being compared in terms of

the years of education difference are the same on these.

So for example, let's look at the results for model c.

This was the one that estimated the association between hourly wages and

years of education, sex, and union membership.

And in the table I only showed you the resulting slope for

years of education, but

there were also resulting slope estimates for sex and union membership.

So what I'm showing here is a graphic that shows what this model's estimating with

regards to the union membership adjusted relationship between hourly wages and

years of education, and I'm plotting this separately by sex.

So you'll notice that these lines are parallel.

The slope of each of these lines is 0.76,

that adjusted slope we reported, adjusted for sex and union membership.

Additionally, if we fix years of education at any value and

look at the vertical distance between these two lines,

what that's estimating is the difference in average salary between males and

females of the same years of education and union membership status.

And we don't have any visual component to membership status, but

I am telling you that this association is adjusted.

So the slope of each sex-specific regression line for

years of education is the same.

This is the slope of years of education from that multiple linear regression model

with years of education and sex as predictors as well as union membership.

Similarly, the difference between the estimated hourly wages for

males and females is the same at each value of years of education.

This is the difference in hourly wages between males and

females adjusting for years of education and also union membership.

And this difference is $1.89.
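
The parallel-lines picture can be checked with a few lines of arithmetic. This is only a sketch: the slope of 0.76 and the gap of 1.89 are the estimates quoted above, but the intercept is a hypothetical placeholder (it is not quoted here), and union membership is assumed to be held fixed so that its term cancels.

```python
# Estimates quoted in the lecture for the model with sex and union membership.
slope_educ = 0.76   # common slope for years of education (dollars per hour per year)
sex_gap = 1.89      # vertical distance between the two sex-specific lines
intercept = 0.0     # hypothetical placeholder: the intercept is not quoted here

def line_height(educ_years, upper_line):
    """Estimated average hourly wage on the upper or lower of the two parallel lines."""
    return intercept + slope_educ * educ_years + (sex_gap if upper_line else 0.0)

# The gap between the lines is the same wherever we fix years of education.
gaps = [line_height(e, True) - line_height(e, False) for e in (8, 12, 16)]
print([round(g, 2) for g in gaps])  # [1.89, 1.89, 1.89]
```

Without an interaction term the model forces this gap to be constant, which is exactly the restriction the interaction term relaxes.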

Similar graphics could be shown for the other models that adjust for

things beyond union membership.

So in these models where we adjust for sex and other things, once

sex has been adjusted for, the wages/years of education

relationship is the same in each level of sex, for both males and females.

Once years of education has been adjusted for, the relationship between wages and sex is

the same for groups with the same years of education, and other adjustment variables.

So suppose, however, we are interested in investigating whether the

relationship between wages and

years of education is modified by sex, after we've adjusted for other things.

Well, one thing we could do is what I

was talking about at the beginning of the lecture.

We could stratify the sample by sex and look at the results of a regression

analysis of wages on years of education using males only, the data for

males, and do the same thing using the data for females only but

never combine the males and females data for an overall analysis.

If we did this we'd get a slope estimate for years of education for

males of 0.71 and a larger estimate for females of 0.84.

But let's talk about another approach,

that will allow us to use the information for males and females together.

So approach number two is to add what's called an interaction term,

between years of education and

sex to the model, which includes usually multiple other adjustment variables.

Here's how it works.

On the computer,

I actually have to create it myself.

It doesn't come with the data set and there is no automatic command on

the computer to say create an interaction term.

But it's actually elegant in its simplicity. Once we get through

the mechanics here, hopefully you'll appreciate that this is kind of neat.

So the interaction term can be created,

and it sounds kind of strange at first but we'll see why it works,

by taking the product of the two variables we want to interact.

In other words, if we want to estimate whether the relationship between wages and years of

education is modified by sex, we're seeing whether

there's an interaction between years of education and sex.

And so we would create this interaction term,

which I'll generically call x3,

by taking the product of years of education and sex, x1 times x2.

And so let's look at what the new model we're going to estimate is.

It's going to include years of education and sex, just like

the model in which each was adjusted for the other, plus the other adjustment variables.

But we're actually going to also add in this interaction term, x3.

And then any other x's we want to include to either better predict wages or

adjust the relationship between wages and years of education above and beyond sex.
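
Written out in symbols, the model just described looks roughly like the following. The hat and beta notation is assumed here for illustration; the lecture only names the predictors x1, x2 and x3.

```latex
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \cdots,
\qquad x_3 = x_1 \times x_2
```

Here y is hourly wages, x1 is years of education, x2 is sex, and the trailing dots stand for any other adjustment variables.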

So let's see why this works.

So in this coding schema I've called x1 years of education and

x2 is sex which takes on a value of 1 for

females and zero for males and then x3 is the interaction term.

So what is the value of this interaction term for males?

Well for males this equals the years of education measure

times the sex value for males, x2 is equal to 0 for males.

So this interaction term is nothing, it disappears.

It's equal to 0 for males.

What about for females?

Well, for females, x3, our interaction term, is equal to years of education,

x1, times x2, the sex indicator, which is 1 for females.

So for females, x3 is ultimately equal to another copy of years of education.
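
A minimal sketch of creating the interaction term by hand, using the coding described above (x2 is 1 for females and 0 for males). The three workers are made-up values for illustration, and the key names are hypothetical.

```python
# x1 = years of education, x2 = sex (1 = female, 0 = male), as coded in the lecture.
# These three workers are invented purely to show the mechanics.
workers = [
    {"educ": 12, "female": 0},
    {"educ": 12, "female": 1},
    {"educ": 16, "female": 1},
]

# The interaction term x3 is simply the product x1 * x2.
for w in workers:
    w["educ_x_female"] = w["educ"] * w["female"]

interaction_values = [w["educ_x_female"] for w in workers]
print(interaction_values)  # [0, 12, 16]
```

The male worker gets 0, and each female worker gets another copy of her years of education.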

So, let's see how this all plays out.

So, what I did here was estimate this regression that includes years of

education and sex, and other adjustment variables.

In this case, just union membership, but

I just want to make this more general in its conceptualization.

And I also included the interaction term.

And here are the resulting estimates I got from the computer.

The slope of years of education is 0.7, the slope of sex is negative 3.69,

and the slope of the interaction term is 0.14.

So, let's see how this plays out.

Well, let's look first at what this model estimates the relationship between

wages, years of education, and everything else to be for males only.

Males are kind of easy given their coding because their value of x2 is equal to

0 and hence their value of the interaction term is equal to 0.

So when we write out, if we were only looking at males, when we write out what

this estimates for males, we get the intercept of 0.4 plus the slope for

years of education times years of education.

And then both sex and the interaction term disappear because they're both equal to 0.

Plus whatever else we have in this model.

In this case it's just union membership, but

I'm not showing that slope because I want to focus on this years of education piece.

So in males, the slope of years of education here,

the piece that describes the relationship between hourly wages and

years of education is equal to that 0.7.

So for males,

hourly wages increase by 70 cents on average per additional year of education.

For females we're going to have to do a little more accounting to get this story,

but what we're basically going to see is by generating this

interaction we get to put in another copy of years of education, and

when we combine the two parts that we'll get for years of education we get

a different slope estimate of years of education for the females.

So let's do this out.

So for females we get this, we get the intercept that the males got.

We get everything that the males got.

And then we get the slope of sex times 1, so plus -3.69.

Then we get the slope of this interaction term times the years of education variable times 1,

like we said before.

And then plus there's the piece about union membership, but

I'm just leaving that generic.

So if we do a little accounting here we can bring the negative 3.69 over here.

Sorry about this extra negative sign. And

then if we group the 0.7 x1 and the 0.14 x1 together

and do a little factoring, we see these two combine to give,

if you will, the slope of years of education among females.

So among females, the average increase in hourly wages per one-year increase in

years of education is that increase for

males of 0.7 plus this additional piece, plus another 14 cents.

So in total the slope or

association between hourly wages and years of education for females is 0.84.
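
The accounting above can be sketched in a few lines, using the three coefficient estimates quoted in the lecture. The function name is hypothetical, and the union membership term is left out because it does not affect the education slope.

```python
# Coefficient estimates quoted in the lecture (union membership term omitted).
b_educ = 0.70         # slope of years of education (x1)
b_female = -3.69      # slope of sex (x2; 1 = female, 0 = male)
b_interaction = 0.14  # slope of the interaction term (x3 = x1 * x2)

def educ_slope(female):
    """Slope of hourly wages on years of education for one sex (female = 1 or 0)."""
    return b_educ + b_interaction * female

male_slope = educ_slope(0)
female_slope = educ_slope(1)
print(f"males: {male_slope:.2f}, females: {female_slope:.2f}")  # males: 0.70, females: 0.84
```

The interaction coefficient is the difference between the two slopes, which is why it is the quantity of interest for effect modification.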

And this piece, this piece for the interaction term

quantifies the difference in the relationship between hourly wages and

years of education for females compared to males.

So here's what this would look like if we plotted it.

Similar to the plot we did before, but now you'll notice that these lines,

these sex specific associations have different slopes.

The slope for males is 0.7.

And the slope for females is larger, 0.84.

So you can see that these lines are starting to

converge with increased years of education.

The other side of this story, and we could go back and

rewrite the model to estimate this piece as well, but

just conceptually, is that we are estimating an interaction between these two variables.

Not only are we estimating differing relationships between hourly wages and

years of education by sex, but we're estimating different associations between

hourly wages and sex depending on years of education.

So if we look at two groups who have ten years of education,

males compared to females, this vertical distance here is the average difference

in salaries between males and females with ten years of education.

If we did the same thing for those with 15 years of education,

this vertical distance is the average difference between males and females.

So you can see that the average difference between males and

females also depends or changes depending on years of education.
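
To make the two vertical distances concrete, here is a sketch that computes the estimated female minus male difference at 10 and 15 years of education from the coefficients quoted earlier. The resulting gaps are derived here, not quoted in the lecture; the intercept and the union membership term cancel when comparing the sexes at the same values.

```python
# Coefficient estimates quoted in the lecture for the interaction model.
b_female = -3.69      # slope of sex (1 = female, 0 = male)
b_interaction = 0.14  # slope of the education-by-sex interaction term

def female_minus_male(educ_years):
    """Estimated difference in average hourly wages, females minus males,
    at a fixed number of years of education and the same union status."""
    return b_female + b_interaction * educ_years

gap_10 = female_minus_male(10)
gap_15 = female_minus_male(15)
print(f"{gap_10:.2f} at 10 years, {gap_15:.2f} at 15 years")  # -2.29 at 10 years, -1.59 at 15 years
```

The gap shrinks in absolute value as education increases, which is the convergence of the two lines seen in the plot.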

So, now that we've done this we have to

remember that everything we've done is just an estimate.

So this 0.14, which estimates the difference in the slopes

of years of education for females compared to males is just an estimate.

If we want to test formally whether there's evidence of

effect modification based on these data, whether there is

a statistically significant difference in the relationship between hourly wages and

years of education between males and females,

we would test whether this slope is equal to 0. Think about it.

The slope estimate of years of education for males was equal to

just that original piece for years of education.

For the slope for females, we start with the slope for males and add the interaction piece.

So suppose there were no difference in the relationship between

education and hourly wages between males and females.

Then this piece that quantifies the difference would be 0,

because there'd be no difference in the association.

So testing whether or not,

at the population level, the coefficient of this interaction term is equal to 0 is

akin to asking whether there is evidence of a difference in this hourly wages/years of

education relationship between the sexes after accounting for sampling variability.

So this is sometimes called a formal test of interaction.

In this example the p-value for

testing this null of a coefficient of 0 for the interaction term is 0.38.

So there's not a statistically significant interaction between years of education and

sex after adjusting for union membership status.
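
For those interested in the mechanics, a test like this is typically the estimate divided by its standard error, compared to a reference distribution. The sketch below assumes a normal approximation (reasonable with 534 observations; regression software would normally use a t distribution), and the standard error of 0.16 is a hypothetical value chosen for illustration, since the lecture does not report it.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, using a standard normal reference."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# 0.14 is the interaction estimate from the lecture; 0.16 is a HYPOTHETICAL standard error.
p_interaction = wald_p_value(0.14, 0.16)
print(f"p = {p_interaction:.2f}")
```

A large p-value here says the data are consistent with a common education slope for both sexes.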

If the purpose of our investigation is to either confirm or

rule out effect modification after adjustment for union membership status, we

would say we've ruled it out, power considerations notwithstanding.

And we would probably go back and report that common adjusted association between

years of education and hourly wages adjusted for sex and union membership.

Let's look at another example though, and

again I'm just trying to give you the basic idea here.

Being able to handle the mechanics of this is not essential for this course, but for

some of you this may be interesting and you may want to apply this

at some point in your data analysis projects, or

you may go on to do further courses in statistics and this will

give you at least a starting point for the mechanics of interaction in regression.

So let's look at an example of mortality in

patients with primary biliary cirrhosis.

This is the Mayo Clinic data we've looked at so often before.

And this is a randomized trial with patients randomized to

receive the drug DPCA or placebo, and the outcome of interest is death.

And the unadjusted hazard ratio of mortality for

patients receiving DPCA compared to placebo was 1.06,

a slight increase in the mortality in the sample for those who receive the drug.

But this result was not statistically significant.

And as we've seen, if we adjust for something like age, the unadjusted and

adjusted hazard ratios,

DPCA to placebo, are very similar because this was a randomized trial.

However, we still may have a question about age, even knowing it

doesn't confound the overall relationship between DPCA and mortality.

We might want to ask whether maybe

the effect of the drug was modified by the age of the patient.

Maybe it doesn't work,

or is even harmful, for some age groups but works well for others.

Well, at this level of analysis,

all we have is one overall association between DPCA and mortality

that would be used to describe the association for all ages.

So if we want to investigate whether there is effect modification by age,

we have to go a little further.

So, I'm going to look at age categorized into quartiles so

that we can do this interaction approach, and there's four quartiles.

The first quartile is persons less than 42 years.

Second quartile is those 42 to 49.9 years,

just less than 50, etc.

You can see what's going on here.

So to investigate whether age modifies the effect of

the drug we will need to fit a Cox model

that includes drug as a predictor but also the age quartile indicators.

We've got four groups here so we'll need three binary indicators.

And then we need interaction terms between the drug variable and

each of the age quartile indicators.

So this actually looks a little daunting when it all comes down.

And again, for those of you that are not interested in the mechanics, I just want

you to get the basic idea of what the interaction term or terms allow us to do.

But for those of you interested, I'll detail this a little bit.

So this x1 here is an indicator of DPCA or placebo.

And then these indicators here are for

the second through fourth age quartiles.

The reference group is the first age quartile.

And then what I have here are interaction terms between the drug indicator and

each of the indicators for the second through fourth age quartiles.

So literally we just multiplied those things through.
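
Here is a sketch of what "multiplying through" produces for a single patient: the drug indicator, the three age quartile indicators, and their three products. The function name and the ordering of the seven predictors are assumptions for illustration; the quartile is passed in directly because only the first two age cutpoints are stated in the lecture.

```python
def design_row(on_dpca, quartile):
    """Return the seven predictor values for one patient.

    on_dpca:  1 for DPCA, 0 for placebo
    quartile: age quartile 1 to 4 (quartile 1 is the reference group)
    Order: [drug, Q2, Q3, Q4, drug*Q2, drug*Q3, drug*Q4]
    """
    if quartile not in (1, 2, 3, 4):
        raise ValueError("quartile must be 1, 2, 3 or 4")
    age_indicators = [1 if quartile == q else 0 for q in (2, 3, 4)]
    # Each interaction term is the product of the drug indicator and one age indicator.
    interactions = [on_dpca * indicator for indicator in age_indicators]
    return [on_dpca] + age_indicators + interactions

print(design_row(1, 1))  # [1, 0, 0, 0, 0, 0, 0]
print(design_row(1, 3))  # [1, 0, 1, 0, 0, 1, 0]
print(design_row(0, 3))  # [0, 0, 1, 0, 0, 0, 0]
```

An interaction term is switched on only for treated patients in the corresponding age quartile.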

So let me show you just how this shakes down.

If we're looking at age quartile one,

well, all the age indicators are 0 because age quartile 1 is the reference group, and

all the interaction terms are 0 because each is a product

of the drug indicator with one of these age indicators.

And so the relationship we get on the log hazard scale,

between mortality and treatment is this slope of negative 0.07.

So this is our log hazard ratio for the relationship between

mortality and DPCA compared to placebo amongst persons in age quartile 1.

If we look at age quartile 2 we pick up the same piece that we have for

age quartile 1.

We also pick up another piece of information because they're in

age quartile 2, and we pick up

the interaction term between the indicator of being in age quartile 2 and

the drug variable, and the piece for that is the coefficient of 0.28.

Similarly if we look at age quartile 3 we pick up this first piece, negative 0.07

plus a piece that has to do with the age differential and then a piece that has

to do with the interaction between the drug and the age quartile 3 indicator.

Notice what we're getting with each of these interactions is just another

copy of x1, the drug indicator.

And you can see the same sort of thing applies for age quartile 4.

So let's do a little reorganization here to make this a little more cogent.

In age quartile 1 the only number that has to do

with our drug indicator is that initial slope of negative 0.07.

So this is the log hazard ratio of mortality for

those in the DPCA group compared to the placebo group amongst the lowest age quartile.

If we wanted to get the log hazard ratio comparing patients on the drug to placebo

for age quartile 2, we would take that initial slope for the first quartile and

then add the coefficient for the interaction term, that 0.28.

So the log hazard ratio here is, when all the dust settles, 0.21.

And this 0.28 estimates the difference in the association between mortality and

treatment for those in age quartile 2 compared to age quartile 1.
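
The bookkeeping for the first two quartiles can be sketched as follows, using the two coefficients quoted in the lecture. The variable names are hypothetical. The age quartile main effect is left out because it is the same for treated and untreated patients within a quartile and so cancels in the drug comparison.

```python
# Coefficients quoted in the lecture, on the log hazard scale.
b_drug = -0.07       # DPCA vs placebo in age quartile 1 (the reference group)
b_drug_x_q2 = 0.28   # interaction: drug indicator times the quartile 2 indicator

# Log hazard ratio of mortality, DPCA vs placebo, within each quartile.
log_hr_q1 = b_drug
log_hr_q2 = b_drug + b_drug_x_q2

print(f"quartile 1: {log_hr_q1:.2f}, quartile 2: {log_hr_q2:.2f}")  # quartile 1: -0.07, quartile 2: 0.21
```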

Similarly for age quartile 3 we start with the estimated association of log hazard

ratio for those in age quartile 1 and then add the coefficient for the interaction

term between the drug indicator and the indicator for age quartile 3.

And the log hazard ratio for this group would be the sum of those two things,

0.03, and this 0.10 is the estimated difference in

the association between mortality, on the log hazard scale, and treatment for

those in the third quartile compared to the first.

And you could do something similar for the fourth quartile.

If I were presenting these results to somebody else who wasn't as

knowledgeable as we are about regression models I would

use the computer to estimate the hazard ratios,

exponentiating those log hazard ratios, and

then also with the computer I can get the confidence intervals.
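
The exponentiation step is simple enough to sketch here, using the quartile-specific log hazard ratios worked out above for the first three quartiles. The confidence intervals are not reproduced because they need standard errors that the lecture does not report.

```python
import math

# Log hazard ratios of mortality, DPCA vs placebo, as worked out in the lecture.
log_hazard_ratios = {
    "quartile 1": -0.07,
    "quartile 2": 0.21,
    "quartile 3": 0.03,
}

# Exponentiating a log hazard ratio gives the hazard ratio itself.
hazard_ratios = {name: math.exp(value) for name, value in log_hazard_ratios.items()}

for name, hr in hazard_ratios.items():
    print(f"{name}: HR = {hr:.2f}")
# quartile 1: HR = 0.93
# quartile 2: HR = 1.23
# quartile 3: HR = 1.03
```

A hazard ratio below 1 points toward lower mortality on the drug, and one above 1 toward higher mortality.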

So this just shows me that there's a slight benefit of the drug for

those in age quartile 1 but it's not statistically significant.

In age quartiles 2 and 3 the drug is positively associated with mortality

in this study, but again it's not statistically significant for either.

And then it looks like the results are promising for the oldest group.

In older persons there's an estimated reduction in

mortality that's notable on the order of 27%.

But again, unfortunately this result is not statistically significant either.

But what these interaction terms have allowed us to do is estimate

separate hazard ratios between mortality and

treatment for these four age quartiles, and

then with the computer's help put confidence intervals on these.

We can also, with the aid of a computer (we couldn't do it based on what I've given

you here), test whether any of these interaction terms

is statistically significant.

The idea being that if at least one of these interaction terms is different from 0, or

any combination of them, then the relationship between

mortality and treatment is different for at least two of the age groups.

And the resulting p-value on that is 0.74.

So unequivocally, both qualitatively

from looking at the confidence intervals and

from a formal test, we conclude there's no interaction between age and treatment.

And on the whole this drug didn't work in this population of patients.

So in conclusion here, the effect of the drug is not modified by age.

Although the results looked promising for the oldest age quartile,

this was not significant after accounting for sampling variability in the data.

So, hopefully this is a basic introduction to

the idea of assessing effect modification with an interaction term.

I want you to get the basic idea.

I'm not going to hold you responsible for parsing models with interaction terms.

That requires a little more practice than we can devote to it in this course, but

it really just involves accounting skills and

keeping track of what's turned on when, etc.,

and then combining terms where appropriate.

So, if the mechanics were a little daunting, don't worry about it.

But I do want you to appreciate that the inclusion of interaction terms allows us,

within the context of one model, to estimate separate outcome/predictor

associations for

different levels of a potential effect modifier.

For those of you who will go on to take further courses in statistics and/or

are interested in applying this in your own research, then this gives you

a primer on how to handle interaction terms in the regression modeling process.

In the next section we'll look at the use of interaction terms

in some of the published literature and how the authors report the results and

discuss their approach.
