A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods


From the lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Welcome to Lecture 2. In this set of lectures we'll take on the analysis method called simple logistic regression. Hopefully you'll see that, while the results may look different on paper than what we got with simple linear regression because our outcome type is different, the spirit is exactly the same.

So in this set of lectures we will develop a framework for simple logistic regression, which is a method for relating a binary outcome to a single predictor. The predictor can be binary, categorical, or continuous, and the way in which we'll relate these two things, the outcome to the predictor, is again via a linear equation.

So in this first section we're going to take on simple logistic regression when our predictor of interest is either binary (two categories) or categorical (more than two categories).

So hopefully, at the conclusion of this lecture section, you'll understand how logistic regression relates a function of the probability, or proportion, of individuals with a binary outcome to a predictor via a linear equation. And you'll be able to interpret the results, the intercept and slope or slopes, from a simple logistic regression model when the model includes either a binary or a categorical predictor. In the next section, we'll take on the situation where the predictor is continuous.

The first thing we have to establish formally is what the left-hand side of the linear equation for a logistic regression looks like, and this is a bit more convoluted than it was with linear regression. With linear regression, the left-hand side was simply the estimated mean of the continuous outcome, and we were predicting that by a linear function of our x1.

For a logistic regression, we have to work off a function of the probability, or proportion, of having the binary outcome. If the binary outcome is a y coded as 1 if the outcome occurs and 0 if not, then we're looking at the proportion, or probability, that y equals 1. For the function we're going to use, we're going to start by looking at the odds of the outcome occurring; recall that the odds is just the probability of it occurring divided by the probability of it not occurring. But we're going to look at this on the log scale.

So our equation is going to relate the log odds of a binary outcome to a single predictor x by this linear equation, and as noted previously, for any type of simple regression, x can be binary, nominal categorical, ordinal categorical, or continuous.
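As a quick aside, the left-hand side described here, the log odds, is easy to compute directly. Here is a minimal Python sketch (the function names are my own, just for illustration):

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def log_odds(p):
    """Log odds (the 'logit') of an event with probability p."""
    return math.log(odds(p))

# A probability of 0.75 corresponds to odds of 3 to 1,
# and a log odds of log(3), roughly 1.1.
print(odds(0.75))      # 3.0
print(log_odds(0.75))
```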

As with everything else we have done thus far, we're only going to be able to estimate the equation from a sample of data. So to indicate that the slope and intercept we get are estimates, again just like we did before, we'll put hats on top of them.

Technically speaking, we're only going to be able to estimate the relationship between the sample-based log odds and the predictor via this equation. So we should put hats on these p's here to indicate that we're estimating the proportions that in turn are used to estimate the odds and the log odds based only on our sample. But that is not consistently done in the literature; the hats are left off of the p's, so from here on in I will not include them.

You might say, well, log odds, that doesn't sound like a very natural way of thinking about things. For the moment, just take it on faith, if you will. In the next section, when we delve into the situation where x is a continuous predictor, we'll explore the reason for this choice of scaling and see why it's necessary and appropriate.

So what we've got here ultimately is an equation where, if you give me the value of x1 for a group of persons or subjects, I can plug that into my equation and estimate their log odds.

And this slope of x1: remember, when we started talking about linear equations, we focused on the comparison being made by the slopes. We said generically that a slope compares the estimate of the left-hand side for two groups whose x value differs by one. So what the slope does here is compare the log odds of the binary outcome for two groups who differ by one unit of x1, and hence this slope estimate is interpretable as an estimated difference in the log odds of the outcome between two groups who differ by one x value.

You might say, well, that's not very helpful; what is a difference in log odds? How can I make sense of that?

Well, you may recall those wonderful properties of logarithms, and one of them is the fact that a difference in logs can be re-expressed mathematically as the log of a ratio. So, generically speaking, I'm going to write out what the slope estimates here: it estimates the difference in the log odds of the outcome for a group whose x1 value is a plus 1, minus the log odds for a group whose x1 value is one unit less, a. I'm just trying to be generic here and not put in specific numbers; a could be any number. The point is that the two groups being compared in this difference differ by one unit in the predictor.

So you may recall that the log of one thing minus the log of another can be re-expressed as the log of the ratio of the first thing we're taking the log of, the odds for the group whose x1 value is a plus 1, divided by the second thing we're taking the log of, the odds for the group whose x value is a. The difference in logs can be written as the log of the ratio of those two things.

So now we're starting to see something that looks a little familiar. Odds divided by odds is nothing more than an odds ratio: the odds ratio comparing the odds of the outcome for two groups who differ by one unit in x1. So the slope is the log of an odds ratio, and we'll see this with a concrete example and numbers in a minute. If we have the log of the odds ratio, we can exponentiate, or anti-log, that to get an odds ratio estimate.
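This identity, that a difference in log odds equals the log of an odds ratio, can be checked numerically. A small Python sketch, with made-up probabilities for the two groups:

```python
import math

# Hypothetical outcome probabilities for two groups whose
# x1 values differ by one unit (a + 1 versus a).
p_high = 0.60   # group with x1 = a + 1
p_low = 0.40    # group with x1 = a

odds_high = p_high / (1 - p_high)
odds_low = p_low / (1 - p_low)

# Difference in log odds ...
diff_log_odds = math.log(odds_high) - math.log(odds_low)
# ... equals the log of the odds ratio.
log_odds_ratio = math.log(odds_high / odds_low)
assert math.isclose(diff_log_odds, log_odds_ratio)

# Exponentiating the slope (a log odds ratio) recovers the odds ratio.
odds_ratio = math.exp(diff_log_odds)
print(odds_ratio)  # ~2.25, i.e., (0.6/0.4) / (0.4/0.6)
```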

So let's start with an example. Let's go back to our data on anthropometric and other measures from a random sample of Nepalese children, but I'm going to expand the age range here to be between 0 and three years, so we can explore the relationship between breastfeeding and characteristics of children up to three years old.

The first thing we're going to look at is whether there's any relationship between breastfeeding and sex of the child, and we're going to quantify that. In our data set, the proportion of children in this age group who are breastfed is three quarters, or 75%, and the sex distribution slightly favors females in the sample: 52% of the sample is female and the remaining 48% is male.

Now, we've seen several times that if we want to handle a binary predictor, such as sex, as an x, we can code one of the groups as 1 and make the other group, coded 0, the reference group. So, just to mix things up, I'm going to make males the 1s and females the 0s, the reference group, for this analysis.

So what we're going to end up doing is estimating a logistic regression that looks like this: the log odds of being breastfed (I'm just going to shorten that to log odds) equals an intercept plus a slope times our value of sex, x1.

And despite the fact that this is an equation, we're really only estimating two outcomes here: the log odds of being breastfed amongst the males, and the log odds of being breastfed amongst the females.

So, just to get a little more practice and to understand where the difference in log odds comes from, let's write out the estimated log odds of being breastfed for both of these sex groups. If we do males, this is the log odds when x1 equals 1: we put in our intercept, beta nought hat, plus our slope times 1, and so the log odds estimate for male children is the sum of the intercept and the slope, beta 1 hat.

For female children, their value of sex is 0, so when we plug that in, the slope part disappears; it's just the intercept. So again, beta 1 hat here is the difference in the log odds of being breastfed for males minus the log odds for females.

Another way to think about this is what we said before: it's the difference in the log odds of being breastfed for males minus the log odds of being breastfed for females, and as we've just seen, this can be re-expressed as the log of the ratio of the odds of being breastfed for male children relative to the odds of being breastfed for female children. So this slope has a log odds ratio interpretation.

What about the intercept? Well, we saw it when we looked at the log odds for the reference group: for the females, whose value of x1 is equal to zero, we got the intercept. So this intercept estimates the log odds of breastfeeding for one group, the estimated log odds of breastfeeding for females, and we could certainly translate that into an odds. And we'll show in a subsequent section how to back-compute this into an estimated probability of being breastfed for females, for example.

So here's the result and equation that we get if we use a computer: the estimated log odds of being breastfed is equal to an intercept of 1.12 plus a slope of 0.002 times sex. This slope of 0.002 again estimates the log odds ratio of being breastfed for males compared to the reference group of females, and the intercept is equal to the log odds of being breastfed for female children in the sample.

These things are not so informative or helpful on the log scale, so let's antilog them to get some clarity. This beta 1 hat estimate, 0.002, is the estimated log odds ratio of breastfeeding for males to females, so if we exponentiate this, we get the estimated odds ratio, e to the 0.002, which is about equal to 1.002.

So what does this suggest? The odds ratio of being breastfed for male children to female children is 1.002; essentially, there's no difference in the odds. There's a slight 0.2% higher odds of males being breastfed relative to females, but this ratio is very close to one.

What is the intercept estimate? Well, the intercept here is equal to 1.12. This is a log odds; it does not compare two groups, it's the log odds of being breastfed for one group, the reference group, the group with x1 equals 0, females. If we exponentiate this, taking e to the 1.12, we get 3.06. This is the estimated odds that females are breastfed.

So the odds are roughly three to one that females in this sample are breastfed. Very shortly, and you may even be able to do it now if you think about it, we'll show how to convert an odds estimate for any one group to the probability, or proportion, with which that group has the outcome. So we'll be able to translate that into the estimate of the probability of females in the sample being breastfed.
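Using the fitted numbers from this example (intercept 1.12, slope 0.002), the back-calculations can be sketched in a few lines of Python:

```python
import math

intercept = 1.12   # estimated log odds of breastfeeding for females (x1 = 0)
slope = 0.002      # estimated log odds ratio, males vs. females

# Exponentiate the slope to get the odds ratio for males vs. females.
odds_ratio = math.exp(slope)
print(round(odds_ratio, 3))   # 1.002 -- essentially no difference

# Exponentiate the intercept to get the odds of breastfeeding for females.
odds_female = math.exp(intercept)
print(round(odds_female, 2))  # 3.06 -- roughly 3-to-1 odds

# Converting an odds back to a probability: p = odds / (1 + odds).
p_female = odds_female / (1 + odds_female)
print(round(p_female, 2))     # 0.75 -- matches the 75% breastfed in the sample
```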

The coding choice for this binary predictor is, again, completely arbitrary. For this breastfeeding and sex analysis, I want you to think about what the values of the intercept and slope would be if sex were coded as 1 for females and 0 for males, and then what the subsequent odds ratio comparing males and females, and the odds for females, would look like under this scenario. I'll leave this to the review exercises, so you can start thinking about it now, but actually go through it, if you choose to, when you do the review exercises.

Let's look at another example. This was a study published in 2010 in the Journal of the American Medical Association looking at the risk of respiratory failure after birth and gestational age.

The context here was that they were considering late preterm births, which account for an increasing proportion of prematurity-associated short-term morbidities, particularly respiratory, that require specialized care and prolonged neonatal hospital stays. They wanted to assess short-term respiratory morbidity in late preterm births compared with term births in a contemporary cohort of deliveries in the US.

So they retrospectively reviewed electronic data records from 19 hospitals across the US, and they gathered an impressive amount of data on over 230,000 deliveries in these 19 hospitals between 2002 and 2008. What they ultimately used was a multiple logistic regression analysis (which we'll get to shortly) comparing the risk of respiratory failure by gestational age.

So we're going to look at an unadjusted analysis to get this started. Here's what their data set, collected from US records, looks like: the majority, or 90%, of the sample had full-term gestational ages of 37 to 40 weeks. Another 5% came in late preterm at 36 weeks, another 3% at 35 weeks, and the remaining 2% had a 34-week gestational age.

So even though the gestational age categories are ordinal, the authors did not want to (and we'll talk more about this in the next section) treat this as a single measure and use the ordinality. Doing so would assume that the log odds of respiratory failure increases or decreases linearly with increasing gestational age, and maybe the jump isn't consistent for each additional week of gestational age.

So what we're going to do, just to explore this, is treat gestational age as categorical to start, and we'll talk a little bit about the ramifications of that. I'm going to make four categories, the ones I laid out before: 34, 35, and 36 weeks, and then 37 to 40 weeks. So we've got four categories, and what we're going to do, like we've done with categorical variables in linear regression, is make one group the reference and then create binary indicators separately for each of the three other groups.

To be consistent with how the authors did it, we're going to use full-term births, 37 to 40 weeks, as the reference, and then create individual indicators for births at 34 weeks, 35 weeks, and 36 weeks. We're going to estimate a logistic regression equation that looks like this: we relate the log odds of respiratory failure to these indicators via this equation.
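The indicator coding described here can be sketched as a small Python helper; the function name and return shape are my own, just for illustration:

```python
def gest_age_indicators(weeks):
    """Return the indicator variables (x1, x2, x3) for gestational age,
    with full term (37-40 weeks) as the reference group (all zeros)."""
    x1 = 1 if weeks == 34 else 0
    x2 = 1 if weeks == 35 else 0
    x3 = 1 if weeks == 36 else 0
    return (x1, x2, x3)

print(gest_age_indicators(34))  # (1, 0, 0)
print(gest_age_indicators(38))  # (0, 0, 0) -- reference group
```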

So what do we get? Suppose we're looking at children who were born at 34 weeks, so x1 equals 1, x2 equals 0, and x3 equals 0. This equation is going to estimate the log odds of respiratory failure for this group to be the intercept plus beta 1 hat times 1, plus the other slopes times 0, so the estimated log odds for the group born at 34 weeks is beta nought hat plus beta 1 hat. For the reference group of 37 to 40 weeks, all x's are 0, and the log odds for this group is simply the intercept.

So this slope for the indicator of 34 weeks is going to estimate the difference in the log odds of respiratory failure for those born at 34 weeks compared to the reference of 37 to 40 weeks.

So we've shown, via that setup, that the intercept is interpretable as the log odds of respiratory failure for the reference group. We showed that the slope beta 1 hat is equal to the difference in the log odds of respiratory failure for the gestational age group of 34 weeks compared to the reference, and a difference in log odds, as we've shown, is interpretable as a log odds ratio. You can go ahead and prove to yourself that beta 2 hat estimates the log of the odds ratio of respiratory failure for the group with x2 equals 1, the group with 35 weeks gestational age, compared to the same reference we had before, and that beta 3 hat estimates the log odds ratio for the group with 36 weeks gestational age compared to the reference of full-term births.

So here's what the results look like. The estimated intercept is negative 5.5. The slope for x1, the indicator of gestational age 34 weeks, is 3.4. The slope for x2, the indicator of gestational age 35 weeks, is 2.8, and the slope for x3 is equal to 2.0.

So let's just think about this for a minute. This 2.0 is the difference on the log scale between the group with 36 weeks gestational age and the reference of 37 to 40 weeks. So the difference for that one-unit difference, if you will (I've lumped 37 to 40 together, but you can think of them as qualitatively one unit higher than 36 weeks), is 2.0 on this scale. The difference between the group at 35 weeks and that same reference is 2.8. So this difference doesn't double when we go up by one unit in gestational age; it compounds by an additional 0.8. And then, when we go to 34 weeks compared to the reference, we get another 0.6, because the difference between this group and the reference is 3.4.

So it doesn't appear that the association, the difference between these lower gestational ages and the reference, is strictly linear; the difference does not compound constantly for each one-unit increase in gestational age. As such, it was a good idea that we made these categorical as opposed to treating gestational age as continuous, because incrementally the additional increase in the log odds is not the same for a one-unit increase in gestational age across these four levels.

So let's try and make sense of this; let's start with some odds ratios. We said beta 1 hat equals 3.4, so e to the beta 1 hat equals the odds ratio of respiratory failure for kids born at 34 weeks compared to the reference of full term, and that's essentially equal to 30, which is pretty shocking. This suggests that the relative odds of respiratory failure are 30 times that of the reference group for the group born at 34 weeks. So, a huge increase in risk, because the odds are increased 30 times here.

If we do the comparison for the 35-week gestational age group to the same reference, that's beta 2 hat, 2.8; if we exponentiate that, e to the 2.8 is 16.4. So not as high an increase as with the 34-week gestational age group, but certainly a substantial increase in the odds of respiratory failure. And then, if we do the last group and exponentiate that 2.0, we get 7.4.
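Exponentiating the reported slopes (3.4, 2.8, 2.0) and intercept (negative 5.5) reproduces these numbers; a quick sketch:

```python
import math

intercept = -5.5  # log odds of respiratory failure, full term (reference)
slopes = {34: 3.4, 35: 2.8, 36: 2.0}  # log odds ratios vs. full term

for weeks, beta in slopes.items():
    # Each slope is a log odds ratio; exponentiate to get the odds ratio.
    print(weeks, round(math.exp(beta), 1))
# 34 30.0
# 35 16.4
# 36 7.4

# Odds of respiratory failure in the reference (full-term) group:
print(round(math.exp(intercept), 3))  # 0.004
```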

So certainly getting closer to term is better, but being preterm, regardless of whether it's 34, 35, or 36 weeks, is associated with a large increase in the relative odds of having respiratory failure. What is that intercept interpretable as? It's the log odds of respiratory failure for the reference group, the full-term group, and that's equal to negative 5.5.

We exponentiate this to get the odds for this group, and there's at least some good news here: the odds are 0.004, which is relatively low, and as we'll see, that translates into a low probability. So luckily, the probability of respiratory failure, though I haven't shown it explicitly, is low in the group with the best outcomes. It's certainly worse for the other, lower gestational ages, but hopefully (and we'll talk about this in a subsequent section) that doesn't result in a large probability of respiratory failure, even though the odds are increased by a sizeable amount.

So, in summary, logistic regression, again, is a method for relating a binary outcome to a predictor x via a linear equation, and the predictor can be binary, categorical, or continuous, but we've only considered the first two situations thus far.

What we get is a linear equation that relates the log odds of the binary outcome to the predictor of interest, and we've shown that the slopes from the logistic regression, for our x or x's, have a log odds ratio interpretation and can be exponentiated to estimate odds ratios. And the intercept, for these situations, estimates the log odds of the binary outcome for the group whose x value or values are 0.

So, in the next section, we'll continue working on these ideas, but we'll talk about the situation where we treat x as a continuous measure.

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.