A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

43 ratings


From the lesson

Module 1B: More Simple Regression Methods

In this module, more detail is given regarding Cox regression, and its similarities to and differences from the other two regression models from module 1A. The basic structure of the model is detailed, as well as its assumptions, and multiple examples are presented.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

And hopefully, by the end of this section, you'll again be able to interpret the slopes from simple Cox regression as log hazard ratios, interpret the proper comparison being made when our predictor is continuous, and similarly interpret the exponentiated slopes as hazard ratios. We'll also talk about a process by which one can empirically assess whether the relationship between the log hazard of the outcome and a continuous predictor is linear, which is an assumption of the model. We don't have visual assessments like the ones we had with linear and logistic regression to do this. But there is a process by which we can compare the results from different regression models, treating our continuous predictor as categorical versus continuous, to look at this. And the reason I show this to you is that you'll sometimes see reference to this type of process in the methods sections of papers.

So let's look at our study of the 312 patients with primary biliary cirrhosis studied at the Mayo Clinic. This study was a randomized clinical trial to look at the association between treatment and mortality. Patients were followed from their time of enrollment until death or censoring, and the follow-up period was up to 12 years. But at the time of enrollment, some other measurements were taken on the patients as well, and we can use these to look at their associations with mortality. For example, the patient's bilirubin level was measured at enrollment in milligrams per deciliter. And we can look at the association between this, as a marker for progression of disease, and mortality.

So here is the distribution of serum bilirubin levels in mg/dL in the sample, in a boxplot presentation. We can see from this picture that these are right-skewed data. This is further evidenced by the fact that the mean, 3.3 mg/dL, is larger than the median of 1.4 mg/dL. The 25th percentile is 0.8 mg/dL and the 75th percentile is 3.5 mg/dL. And the range for individual values in this data set runs from 0.3 mg/dL to 28 mg/dL.

So for the moment, let's assume the relationship between the log hazard and bilirubin levels is linear. We're going to fit a Cox regression that looks like this. We're going to estimate the log hazard of mortality at a given time t as a function of some baseline log hazard plus a slope times x1, where x1 is bilirubin in mg/dL. And here's the result we get when we run this on the computer. We get a positive association estimate: the slope is positive. So there is increased log hazard of mortality with increased bilirubin, which translates into increased mortality with increased bilirubin.

But how do we interpret this slope of 0.15? Well, again, this is a slope that compares two groups who differ by one unit in the predictor, and compares them in terms of what is on the left-hand side. So beta 1 hat here, this 0.15, equals the difference in the log hazard of mortality for two groups whose bilirubin differs by one unit, or 1 mg/dL. So say the log hazard of mortality at time t for a group with bilirubin of b + 1, being generic, minus the log hazard of mortality at the same time in the follow-up period for a group whose bilirubin is just b. So b + 1 and b differ by one unit.

This is just a generic representation: it could be 2 mg/dL versus 1, 4 versus 3, 3.5 versus 2.5, etc. So when all the dust settles, we know we can rewrite this as the log (I won't write it out fully) of the ratio of the hazard for the group with the higher bilirubin to that for the group with the lower, where the difference is one unit. So this is the log of a hazard ratio. So if we wanted to get this to make sense to us, to give a number we can report in a journal article, we could exponentiate this estimate to get the estimated hazard ratio. And it would equal about 1.16. So each 1 mg/dL increase in bilirubin is associated with a 16% increase in the hazard of mortality. Or, if we compare the hazard of mortality at any point in the follow-up period for two groups who differ by 1 mg/dL in their baseline bilirubin levels, the group with the higher value would have a 16% greater hazard of mortality.
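Written out, the steps from slope to hazard ratio look roughly like this. The symbols h for the hazard and h_0 for the baseline hazard are notation assumed here for illustration; the slope of 0.15 and the hazard ratio of about 1.16 are the values quoted in the lecture.

```latex
\begin{align*}
\ln h(t \mid x_1) &= \ln h_0(t) + \hat{\beta}_1 x_1, \qquad \hat{\beta}_1 = 0.15 \\
\hat{\beta}_1 &= \ln h(t \mid b + 1) - \ln h(t \mid b)
              = \ln\!\left( \frac{h(t \mid b + 1)}{h(t \mid b)} \right) \\
\widehat{HR} &= e^{\hat{\beta}_1} = e^{0.15} \approx 1.16
\end{align*}
```

The baseline term cancels in the subtraction, which is why the comparison holds at any time t in the follow-up period.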

What is this intercept piece? Well, again, this estimates the log hazard of mortality for a baseline, or reference, group. But in this situation, where bilirubin is measured on a continuous scale, this is technically the log hazard at a given time when bilirubin is equal to 0. So this is a sort of fictional referent group, because we don't have any subjects in our sample with 0 bilirubin, and that's biologically not possible. So this function over time doesn't track the risk for any group in the population from which our sample is taken. But it provides a starting point at any given time for getting the log hazards for other groups, given their bilirubin levels. We start with this and add the appropriate multiple of 0.15, depending on the bilirubin level of the group we're looking at, to get the log hazard for that specific group at that time. So this is again like a placeholder that varies as a function of time.

So we can do other comparisons besides that given by the slope. We could estimate, for example, the hazard ratio of mortality for persons with bilirubin levels at the 75th percentile, 3.5 mg/dL, versus those at the 25th percentile, 0.8 mg/dL. So on our slope scale, we know the slope estimates the difference in log hazard per 1 mg/dL.

This difference we're looking at is 3.5 - 0.8, which is 2.7 mg/dL. So we could take that difference per 1 unit, the slope, and multiply it by the total difference here of 2.7. And this gives us a difference in the log hazards of 0.405. So if we exponentiate 0.405, we get our estimated hazard ratio of mortality at any point in the follow-up period for these two groups, and it's about 1.5. So this 16% per mg/dL compounds across this 2.7 mg/dL difference to give us an estimated 50% increase in the risk of mortality at any given time in the follow-up period for the group with starting bilirubin levels of 3.5 mg/dL compared to the group with starting bilirubin levels of 0.8 mg/dL.

So how can we assess? We've fit that model. We've stuck x1, bilirubin, in, enforcing a linear relationship between the log hazard and this continuous predictor. But how can we assess whether the linearity assumption is reasonable? Well, unlike with linear or logistic regression, there's no easy visual tool like a scatterplot or lowess graph to help with the assessment. One thing we could do if we were analyzing these data, though, is use what I call an empirical approach. We could take this continuous predictor, categorize it into several groups, and see if the difference in the log hazard between consecutive ordinal groups is similar.
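The 75th-versus-25th-percentile comparison worked through earlier generalizes to any two bilirubin levels, as long as the linearity assumption holds. The following is a minimal sketch of that arithmetic; the function name `hazard_ratio` is made up for illustration, and the slope of 0.15 is the estimate quoted in the lecture.

```python
import math


def hazard_ratio(slope, x_high, x_low):
    """Estimated hazard ratio for x_high versus x_low, assuming the
    log hazard is linear in the predictor (hypothetical helper)."""
    log_hazard_ratio = slope * (x_high - x_low)
    return math.exp(log_hazard_ratio)


SLOPE_BILIRUBIN = 0.15  # estimated slope per 1 mg/dL, from the lecture

# One-unit comparison: any two groups differing by 1 mg/dL.
hr_per_unit = hazard_ratio(SLOPE_BILIRUBIN, 1.0, 0.0)

# 75th percentile (3.5 mg/dL) versus 25th percentile (0.8 mg/dL).
hr_percentiles = hazard_ratio(SLOPE_BILIRUBIN, 3.5, 0.8)

print(f"{hr_per_unit:.2f}")     # 1.16
print(f"{hr_percentiles:.2f}")  # 1.50
```

Because the multiplication happens on the log hazard scale, the per-unit hazard ratio compounds rather than adds: 2.7 units gives exp(0.15 × 2.7), not 1 + 2.7 × 0.16.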

So let me give you an example of what I mean. Let's suppose I took bilirubin, which we initially measured on a continuous scale, and I categorized it into four quartiles. So those with bilirubin between the minimum and the 25th percentile were put into the first quartile.

Those between the 25th percentile and the median were put in the second quartile, etc. And then what I do is fit the model for the log hazard of mortality at any given time. And here, because bilirubin has been categorized, the starting log hazard is for those in quartile 1. That's the reference group. Then this is an indicator here of being in quartile 2 versus not.

Well, let's look at this. So, if I look at the increase in the log hazard at any given point in time going from the first quartile to the second, this jump is 1.3. The difference in the log hazard between the second quartile and that reference group is 1.3. If I go from the reference group to the third quartile, this jump is 2.2, which differs from the jump for the second quartile by 0.9. So the increase in log hazard going from quartile 2 to quartile 3 is 0.9. And then the increase going from quartile 3 to quartile 4, where it goes from 2.2 to 3.3, is 1.1. So what I see here (these are estimates, and there's some sampling variability, etc.) is, in my opinion, a relatively similar increase with each ordinal increase in the quartiles. The log hazard of mortality increases by relatively similar amounts with each increase in bilirubin quartile. So that to me reinforces the assumption that the association is relatively linear. With similar increases in magnitude, percentile-wise, we get similar increases in the log hazard, on the order of about one. So this reinforces to me that the linearity assumption may be reasonable, and there's some subjectivity in this. But this would cause me to feel comfortable about the model we fit previously, where we had bilirubin as a continuous predictor. And I would choose that model over this one because I only had to estimate one measure of association, the single slope, as opposed to the three separate quartile 2, quartile 3, and quartile 4 slopes. So I could use all the data to estimate one slope as opposed to three and get a more precise estimate. Another nicety, however, of categorizing bilirubin into quartiles or some other arbitrary categorization is, at least for visual purposes, that I can't display the relationship in Kaplan-Meier curves when bilirubin is continuous.
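The comparison of consecutive quartile jumps described above is simple arithmetic on the estimated coefficients. Here is a minimal sketch using the three log hazard differences quoted in the lecture (1.3, 2.2, and 3.3, each relative to quartile 1); the variable names are made up for illustration, and judging whether the increments are "similar enough" remains subjective.

```python
# Estimated log hazard differences versus the reference group (quartile 1),
# as quoted in the lecture. Quartile 1 itself is 0 by definition.
log_hazard_vs_q1 = [0.0, 1.3, 2.2, 3.3]

# Increment in log hazard between each pair of consecutive quartiles.
increments = [
    round(higher - lower, 2)
    for lower, higher in zip(log_hazard_vs_q1, log_hazard_vs_q1[1:])
]

for quartile, step in enumerate(increments, start=2):
    print(f"quartile {quartile - 1} -> {quartile}: {step}")
# quartile 1 -> 2: 1.3
# quartile 2 -> 3: 0.9
# quartile 3 -> 4: 1.1
```

Roughly equal steps are what one would expect if the log hazard rises steadily across the ordered groups; very unequal steps would argue for keeping the predictor categorical.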
So if I group it into four quartiles, I can at least show the time-to-mortality curves for the four different groups. And you can see pretty clearly that the lowest bilirubin quartile has the highest survival curve, followed in order by the second quartile, the third quartile, and the fourth. So this, at least, establishes the ordinality that would also be appropriate and necessary for the linearity assumption.

Let's look at another example, infant mortality and gestational age. Here's the gestational age distribution for our sample of 10,295 Nepali newborns who were followed for up to six months after birth to see about mortality. So this is a much more symmetric distribution than our bilirubin measurements in the previous example. The mean gestational age is 38 weeks, the median is also 38, and the range in the sample goes from 28 weeks to 46. So let's just for the moment assume that the relationship between the log hazard of mortality at any given time and gestational age is linear. And we're going to fit a model that looks like this: the log hazard of mortality at any given time t is equal to some starting log hazard plus a slope times the gestational age for the group we're looking at. If we do this, we actually end up with a slope estimate of -0.13. So what is this comparing? Well, I'm not going to write it out fully for this example. But this is the log of the hazard ratio of mortality at any point in the follow-up period for two groups who differ by one week in gestational age. And the group with the longer gestational age has the lower log hazard of mortality. So if we exponentiate this to get our estimated hazard ratio, it's equal to about 0.88. So this suggests that each additional week of gestational age is associated with a 12% decrease in the hazard of mortality in the six-month follow-up period.

What is the relative hazard of mortality, for example, for babies born at 40 weeks compared to babies born at 34 weeks? Well, based on this linearity assumption with this estimated slope, to get this difference on the log hazard scale, we take the difference in x, which would be 6 (40 - 34), times our slope of -0.13 to give a difference in log hazards, or log hazard ratio, of -0.78. If we exponentiate this, we get the hazard ratio, and it's 0.46. So this difference, 40 weeks gestational age compared to 34 weeks, is associated with a 54% reduction in mortality in the six months following birth.
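Turning a hazard ratio below 1 into a "percent reduction" is a step that is easy to get backwards, so here is a minimal sketch of it. The function name `percent_change_in_hazard` is hypothetical; the slope of -0.13 per week is the estimate from the lecture, and the calculation assumes the linear model holds.

```python
import math


def percent_change_in_hazard(slope, difference):
    """Percent change in the hazard for a given difference in the predictor,
    assuming a linear log hazard (hypothetical helper). Negative values
    are reductions."""
    hazard_ratio = math.exp(slope * difference)
    return (hazard_ratio - 1.0) * 100.0


SLOPE_GESTATIONAL_AGE = -0.13  # estimated slope per week, from the lecture

# One additional week of gestational age: roughly a 12% decrease.
one_week = percent_change_in_hazard(SLOPE_GESTATIONAL_AGE, 1)

# 40 weeks versus 34 weeks: roughly a 54% decrease.
six_weeks = percent_change_in_hazard(SLOPE_GESTATIONAL_AGE, 40 - 34)

print(f"{one_week:.0f}%  {six_weeks:.0f}%")  # -12%  -54%
```

Note that six weeks at 12% per week does not give a 72% reduction; the weekly hazard ratio of about 0.88 is multiplied six times over.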

So how can we assess whether our linearity assumption is reasonable? Well, again, there's no easy visual tool to look at, so we'll first try this empirical approach.

So what I did was categorize gestational age into five categories. I made a pre-term group, less than 36 weeks, and I'm going to use that as the reference to compare the other four groups to. And then I made categories of 36 to 38 weeks, 38 to 39 weeks, 39 to 41 weeks, and then 41 plus weeks. Since we had five categories, I made four indicator variables, and this just shows the respective values for each of these categories. The pre-term group, the reference group, takes on a value of 0 for all four indicators. x1 is an indicator of being in the 36 to 38 week group, so x1 is a 1 for that group and the other x's are 0s. And similarly, the rest of the table plays out, showing that each of the others is the indicator for another one of those four non-pre-term groups.

So we do this, and we estimate the association with the computer. We get that the ln(hazard of mortality at time t) is equal to a starting point. And this, because we've categorized, is going to be the log hazard for the reference group, the less than 36 week gestational age group, at any given time. This slope is going to estimate the difference between those who are 36 to 38 weeks and that reference of less than 36 weeks. This is going to be for the 38 to 39 week group, this for the 39 to 41 week group, and this for the 41 plus weeks group. So what do you see here? Well, as we go from pre-term to full term at 36 weeks, the 36 to 38 week group has a lower log hazard by about 0.88 relative to the reference of less than 36 weeks. But then if we look at the subsequent groups, they have a greater reduction, but not by much. And across these three groups, the reduction relative to that reference group of less than 36 weeks is similar. So I get the sense here that maybe the relationship isn't linear. Maybe there is more of a pretty big drop-off in the risk, or log risk, going from pre-term to full term.
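The indicator coding for the five gestational age categories described above can be sketched in a few lines. The lecture does not say which side of each boundary is inclusive, so this sketch assumes each category includes its lower bound (for example, exactly 38 weeks falls in the 38 to 39 week group); the function name is also made up for illustration.

```python
# Lower bounds of the four non-reference categories, in weeks:
# 36 to 38, 38 to 39, 39 to 41, and 41 plus.
# Assumption: intervals are closed on the left, open on the right.
CUTPOINTS = [36, 38, 39, 41]


def gestational_age_indicators(weeks):
    """Return (x1, x2, x3, x4) for a gestational age in weeks.

    The pre-term reference group (less than 36 weeks) gets all zeros;
    every other group gets a single 1 in its own indicator."""
    # Count how many cutpoints have been reached: 0 means pre-term.
    category = sum(weeks >= cutpoint for cutpoint in CUTPOINTS)
    return tuple(1 if category == k else 0 for k in range(1, 5))


print(gestational_age_indicators(34))  # (0, 0, 0, 0)
print(gestational_age_indicators(37))  # (1, 0, 0, 0)
print(gestational_age_indicators(40))  # (0, 0, 1, 0)
```

With this coding, each fitted slope is the log hazard difference between its own category and the pre-term reference, which is what allows the category-by-category comparison in the lecture.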

So what I am seeing with these numbers, if we're looking at less than 36 weeks versus these other categories, and this is the log hazard scale here, is that this just drops off really quickly and then sort of stabilizes. So before, we were fitting a linear function to this. We were interpolating across that entire shape. And we'd miss the big jump initially going from pre-term to full term. And so in order to capture the richness in what we're seeing here, I would suggest going with a model like this, as opposed to the previous model. Because the linear model is going to underestimate the decrease in mortality early in the gestational age scale and overestimate it later. So I think that this categorization scheme really shows us that there's a big jump going from pre-term to full term and then a slightly greater reduction going beyond 38 weeks, relative to 36 to 38. And it's not strictly linear.

If we look at the Kaplan-Meier curves for these five gestational age categories, we see pretty clear evidence of this. This is on the full scale on the y axis, so let me blow this up so that the y axis goes from 90% to 100%. We can see clearly that the less than 36 week group has a much greater risk of mortality over the follow-up period, visually, compared to these other groups. This tracks the proportion that were still alive. And among these other groups, there is a little bit of a distinction between the 36 to 38 week group and the other three. But in general, I think the big picture in these data is that being full term is very beneficial in terms of reducing mortality. And so the relationship isn't steadily decreasing across the entire gestational age spectrum. But there's this big jump from being pre-term to full term.

And so it's not strictly linear as a function of gestational age. So we may prefer to present it with that alternative representation, treating gestational age as categorical. So in general, slopes from simple Cox regressions with a continuous predictor, just like the other Cox regressions we looked at with binary and categorical predictors, have a log hazard ratio interpretation. And the slopes can be exponentiated to estimate hazard ratios. The assumption of linearity between the log hazard of the binary outcome and a continuous predictor x1 can be assessed empirically, by comparing the results from Cox regressions with x1 modelled as continuous versus where the continuous values are categorized into several ordinal groups. And I want you to be aware of that. We aren't doing the actual analyses ourselves in this class, but you'll occasionally see reference to this type of thinking in the methods sections of papers where Cox regression is used.
