A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods


From the lesson

Module 3B: More Multiple Regression Methods

This set of lectures extends the techniques debuted in lecture set 3 to allow for multiple predictors of a time-to-event outcome using a single, multivariable regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings, and welcome to Lecture Set 8, Section B. In this section we'll give a brief treatment of the basics of model selection, and show how the results from multiple Cox regression can be presented in terms of estimated survival curves, or outcomes, from the regression model.

So hopefully by the end of this lecture set, you'll be able to: appreciate the linearity assumption as it applies to multiple Cox regression; explain different strategies for choosing the final multiple Cox regression model among candidate models (this process is exactly the same as it was for linear and logistic regression, but it's worth reiterating); use the results of multiple Cox regression models to compare groups who differ by more than one predictor; and appreciate that the results from multiple Cox regression can be used to estimate group-specific survival curves, where the groups are defined by multiple predictor values.

So, let's briefly talk about the estimation process for Cox regression, what the computer is doing. The algorithm used to estimate the equation of the multiple Cox regression is called partial maximum likelihood estimation, the same process used with simple Cox regression. This estimates the slopes for our predictors: the values that make the observed data most likely among all possible choices for the slopes. It's a complex numerical algorithm that takes some starting guesses for the slopes and iterates until it finds the choice that maximizes the likelihood of the observed data. But before it does this (and this is what was added by Cox, and what makes this method so unique), there's a separate, behind-the-scenes algorithm that first estimates the shape, the function over time, of the baseline hazard: the one that's used as a starting point at each point in time to get to the hazard for the other groups defined by our x's. This is actually pretty neat, because it doesn't force us to make any assumptions about what the relationship between hazard and time looks like. We don't have to put in time as a continuous predictor, time as a categorical predictor, et cetera.

The algorithm figures out the general shape, and it isn't restricted by constraints such that it has to be linear, or can only change at certain points in time. It can estimate a very dynamic function.

But we have to leave that to the computer. After it does that, it runs the maximum likelihood estimation process for the slopes. All of this, of course, has to be done by a computer.
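For the curious, the model and the quantity being maximized can be written out in standard notation (this isn't shown in the lecture slides; here, δ_i = 1 marks an observed event and R(t_i) is the set of subjects still at risk at time t_i):

```latex
% The multiple Cox model on the log hazard scale:
\ln h(t \mid x) \;=\; \ln h_0(t) \;+\; \beta_1 x_1 + \cdots + \beta_p x_p

% The partial likelihood maximized to estimate the slopes; the baseline
% hazard h_0(t) cancels out of each ratio, which is why no assumption
% about its shape over time is needed:
L(\beta) \;=\; \prod_{i:\,\delta_i = 1}
  \frac{\exp(x_i^{\top}\beta)}{\sum_{j \in R(t_i)} \exp(x_j^{\top}\beta)}
```

The cancellation of h_0(t) in each ratio is exactly why the slopes can be estimated without specifying how the hazard changes with time.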

So what is the linearity assumption in multiple Cox regression? Well, it's similar to what we saw in simple Cox regression, with the additional part about adjustment. It assumes that the adjusted relationship being estimated between the log hazard of the binary event y (whether it's an event or censored) and each x is linear in nature. This is an issue for continuous predictors, but not for binary or multi-categorical predictors. As with simple Cox regression, there is no graphical way to assess this. But when fitting models, researchers can compare the results of treating a predictor as continuous versus putting it in as categorical, to see if there's evidence in the categorical formulation of a consistent, same-direction change in the log hazard for increasing ordinal categories. If that's the case, they may opt to treat it as a continuous predictor and estimate one overall association that exploits that relationship. Otherwise, they may present it as categorical.
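As a hypothetical sketch of that check (the quartile coefficients below are made up for illustration, not taken from the lecture's data): fit the predictor as indicator variables for its ordinal categories, then look at whether the estimated log hazard steps in a consistent direction across increasing categories.

```python
# Hypothetical slopes (log hazard ratios vs. the lowest quartile) from a
# categorical formulation of a continuous predictor; reference quartile is 0.
quartile_slopes = [0.0, 0.28, 0.55, 0.87]

# Check for a consistent, same-direction change across increasing categories.
steps = [b - a for a, b in zip(quartile_slopes, quartile_slopes[1:])]
monotone_increasing = all(s > 0 for s in steps)
monotone_decreasing = all(s < 0 for s in steps)

if monotone_increasing or monotone_decreasing:
    print("Roughly consistent trend: consider entering the predictor as continuous.")
else:
    print("Non-monotone pattern: keep the categorical formulation.")
```

This is only the informal eyeball check described in the lecture, not a formal test of linearity.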

So how do researchers, when they've got data and a bunch of potential predictors, choose a final model? Well, model building and selection, as we've said before, is a combination of science, statistics, and the research goals.

If the goal is to maximize the precision of the adjusted estimates, the strategy, as it was with linear and logistic regression, would be to keep only those predictors that are statistically significant in the final model. Don't contribute uncertainty to the model by estimating things that don't need to be there, which steal precision from the things that do actually correlate with the hazard.

If the goal is to present results comparable to the results of similar analyses presented by other researchers, on similar or different populations, then in writing things up, researchers would at least want to present one model that includes the same predictor set as the other research does, even if some of the findings are not statistically significant in this particular data set. Then they can make comparable comparisons, in terms of the factors that were adjusted for, et cetera, with the results from the other researchers' findings.

If the goal is to show what happens to the magnitude of associations with different levels of adjustment, then a researcher could present the results from several models that include different subsets or combinations of adjustment variables. And if the goal is prediction, well, again, this is a slightly more complicated story, and we will only be able to discuss it briefly; we'll touch on it towards the end of the course.

So let's look at the idea of prediction with Cox regression results. This is different than with linear and logistic regression, because we can't actually do it by hand: we are not presented with any output for the value of the intercept, the log hazard at baseline, as a function of time. It changes with time, so there's not one value, an intercept, that describes the starting point for all comparisons. This has to be done by a computer. But let's look at some of the results from the predictors of mortality in primary biliary cirrhosis patients. I'm going to use the results from the model that includes treatment, age, bilirubin, and sex as predictors to show (and I did this with the computer) the estimated survival over the follow-up period for different groups, depending on certain characteristics.

So this can be done, but it's computationally involved. What has to be done is, at each point in time, the log hazard for a particular group has to be computed based on the intercept, or starting hazard, at that point in time. This has to be done across the entire time period.

Then each of those time-specific log hazards has to be converted into a cumulative survival estimate, but this can be done by a computer. So for example, I'm not going to present all possible groupings here, but this, I think, nicely shows that the estimated survival curves complement the results from the Cox regression, and put some of those hazard ratios in the context of what they mean in terms of cumulative probabilities over the follow-up period. What I have here: these two curves down here show the survival trajectories for males with bilirubin of one milligram per deciliter and males with bilirubin of two milligrams per deciliter. So we can see that higher bilirubin was associated with increased hazard, which results in reduced survival; this lower curve is a function of that difference in bilirubin levels. The two curves up here are the same comparison: subjects in the DPCA group who are female, with bilirubins of one and two milligrams per deciliter, respectively. What you can see here, pretty clearly, is the same sort of difference in the estimated survivals over time as a function of bilirubin. But what you also get from this picture is how dramatic the difference is between males and females, if you compare these two sets of curves. I like this because it puts a face, if you will, on what those hazard ratios mean in terms of the decrease in cumulative survival over the follow-up period. It also gives a sense of the magnitude of the difference, in terms of predicted survivals, between groups who differ by bilirubin and groups who differ by sex. It's certainly not an exhaustive presentation, but it helps to contextualize the results from that Cox regression.
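The conversion just described, from a group's summed slopes to a survival curve, can be sketched in a few lines. The baseline cumulative hazard values below are made up for illustration (in a real analysis the software estimates them from the data); the identity used is S(t | x) = exp(-H0(t) · exp(x'β)).

```python
import math

# Hypothetical baseline cumulative hazard H0(t) at a few follow-up times
# (illustrative numbers, not from the lecture's PBC data); time in years.
baseline_cum_hazard = {1: 0.05, 3: 0.20, 5: 0.45}

def survival(times_to_H0, log_hazard_sum):
    """Convert a group's summed slopes (x'beta) into survival estimates:
    S(t | x) = exp(-H0(t) * exp(x'beta))."""
    return {t: math.exp(-H0 * math.exp(log_hazard_sum))
            for t, H0 in times_to_H0.items()}

# A reference group (x'beta = 0) versus a higher-risk group, using the
# 1.12 log hazard sum from the males in the lecture's example.
print(survival(baseline_cum_hazard, 0.0))
print(survival(baseline_cum_hazard, 1.12))
```

The higher-risk group's curve sits below the reference curve at every time point, which is the pattern the lecture's figure displays.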

We could write out the adjusted model on the regression scale, in terms of the generic intercept as a function of time and the slopes for each of our predictors. I got these from the computer, but you could get them by taking the respective logs of the hazard ratios presented in the previous table, from the second multiple regression model. So this is an indicator here: a one if they're in the drug group, a zero if they're in the placebo group.

Here are our indicators of the three age categories; remember, the reference is the first quartile. X2 is an indicator that the person is in the second quartile, x3 that the person is in the third quartile, and x4 that they're in the highest, or fourth, age quartile. This is bilirubin, entered continuously in milligrams per deciliter.

And then here is that negative slope for sex, where the x is a one for females and a zero for males. That's a difference on the log hazard scale, which translates into a hazard ratio on the order of 0.6, and we can see from the previous picture what that means in terms of the difference in survival for otherwise comparable men and women. So suppose we wanted to use these results to estimate the hazard ratio of mortality for 60-year-old males on DPCA with starting bilirubin levels of one milligram per deciliter, compared to 40-year-old females on the placebo arm with starting bilirubin equal to 0.5 milligrams per deciliter. Well, if we wanted to do this on the regression scale, we could simply write out the estimated log hazard of mortality at a given point in time for each of these groups by plugging in their x values. So this is what it looks like. The log hazard at any point in time is a function of the starting hazard on the log scale, whatever it is at that point in time, plus the slope for DPCA, plus the slope for being in the fourth age quartile (the fourth quartile is greater than or equal to 57 years, and these males are 60 years old, so they're in the fourth quartile), plus the slope for bilirubin times their starting level, which is one milligram per deciliter. And then, because they're male.

Their x value for sex is zero, so they don't pick up anything for being male. When you write this out in terms of the baseline log hazard at any given time, plus the cumulative impact of these other things, we get that the sum is equal to whatever the baseline hazard is on the log scale, plus 1.12, if you add up these three numbers. If we do the same thing for females who are 40 years old, in the placebo group, with bilirubin levels of 0.5: the log hazard is equal to the same starting log hazard at the comparable time we're making the comparison, which could be any time in the follow-up period. They're in the placebo group, so their value for the indicator of treatment group is zero, so they don't get anything for that. They're in the lowest age quartile, so they don't pick up anything for age, because they're the reference there. Their bilirubin level is 0.5, so we take the slope for bilirubin, 0.15, times 0.5. And because they're female, their x value for sex is a one, and the slope for that is -0.51. So when we combine the slope values into a sum, we get that the estimate at any given time is found by taking the log of the baseline hazard at that time plus -0.435: the cumulative impact of having a bilirubin level of 0.5 and being female. So if we actually took the difference in these estimates, for males 60 years old on DPCA with a bilirubin of 1 (that's this part here), and subtracted what we get for females, the difference in the estimated log hazards at any given time is 1.555. If we exponentiate that, we get a hazard ratio of 4.74. So males 60 years old in the drug group with starting bilirubins of 1 milligram per deciliter have 4.74 times the risk of mortality at any given time in the follow-up period, when compared to females 40 years old in the placebo group with a bilirubin level of 0.5. This difference is the culmination of the increased risk of being male, the increased risk of being older, the slightly increased risk of being in the DPCA group, and the increased risk of having higher bilirubin, compounded into a hazard ratio of 4.74.

I just want to show you something: a lot of times, papers will not give you the results on the log scale, though you could certainly take the logs of the reported hazard ratios and write out the equation. But I wanted to show you this. If, instead of mashing these together into one sum, I keep the components separate in this comparison, the difference between these two groups because of the difference in treatment groups is 0.1.

That's because the males were in the treatment group and the females were not. The difference between these two groups because of the age difference is 0.87: the males add that additional 0.87 to their log hazard because they were in the highest age quartile, compared to the females, who add nothing above and beyond the baseline because they were in the reference category, the lowest quartile. The difference in bilirubin levels is 1 for the first group minus 0.5 for the second group, so the bilirubin contribution to this sum is the slope for bilirubin times that difference. And then the first group is male, so they get a value of zero for sex, but the second group is female, so they get -0.51, because their sex value is 1. These are the components that, if we wrote this out, equal that 1.555 we saw before. But if you exponentiate this in its component parts and do a little mathematics, you can see that it's the product of e to the first thing, the slope for being on DPCA, times e to the slope for being in the oldest age quartile compared to the youngest, times e to the slope for bilirubin raised to the 0.5 power (because that's the difference in bilirubin levels for these groups), times e to the 0.51, which is the opposite of the slope for being female. What this turns out to be, on the product scale, is the adjusted hazard ratio for being in the DPCA group compared to the placebo group, times the adjusted hazard ratio from that table before for being in the oldest age quartile compared to the youngest, times the adjusted hazard ratio for a 0.5 difference in bilirubin, times 1 over the adjusted hazard ratio for being female, because we're comparing in the opposite direction of what that hazard ratio, taken at face value, compares: in this example, we're comparing males to females. So you can actually get this kind of comparison based on the adjusted hazard ratios just by multiplication. For things that are continuous, take the hazard ratio that represents a per-unit change in the continuous variable, like bilirubin, and raise it to the difference in that continuous value between the groups being compared. So if you're reading a paper and are interested in doing such a thing, you can do it directly from the reported results via this type of multiplication. It's just an FYI; it just shows how the math works.
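To make the arithmetic concrete, here is the lecture's comparison computed both ways, on the log hazard scale and as a product of adjusted hazard ratios, using the slopes quoted above:

```python
import math

# Slopes (log hazard ratios) quoted in the lecture:
b_dpca = 0.10       # DPCA vs. placebo
b_age_q4 = 0.87     # fourth age quartile vs. first
b_bili = 0.15       # per mg/dL of bilirubin
b_female = -0.51    # female vs. male

# Route 1: sum the slope contributions for each group, then take the difference.
males = b_dpca + b_age_q4 + b_bili * 1.0   # 60-year-old males, DPCA, bilirubin 1
females = b_bili * 0.5 + b_female          # 40-year-old females, placebo, bilirubin 0.5
log_hr = males - females                   # 1.12 - (-0.435) = 1.555
print(round(math.exp(log_hr), 2))          # prints 4.74

# Route 2: multiply the adjusted hazard ratios directly.
hr = (math.exp(b_dpca)
      * math.exp(b_age_q4)
      * math.exp(b_bili) ** (1.0 - 0.5)    # HR per mg/dL raised to the difference
      / math.exp(b_female))                # dividing flips the female-vs-male direction
print(round(hr, 2))                        # prints 4.74
```

Both routes give the same 4.74, because a sum on the log scale is a product on the ratio scale.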

So let's look at one more example of prediction with Cox regression. We looked at several models: unadjusted associations, and then two different adjusted models, looking at predictors of infant mortality. And we found in the first lecture that pretty much the only two things among our candidate predictors associated with infant mortality were gestational age and maternal parity.

And they each contribute independent information; neither of their adjusted associations was different than their unadjusted. But let's just look at what the results would look like if we presented different estimated survival curves for these infants, based on gestational age and two different parity categories, to look at the impact of parity above and beyond gestational age.

So I put these curves next to each other, and they give some sense of what's going on. What I have here, in this presentation, are the five estimated survival curves for the gestational age groups amongst first-born children, the group of children whose mothers had not had any previous children. We see clearly that the big jump here is the pre-term group, less than 36 weeks. Then here are the four other groups, and one is actually right on top of another, so it looks like there are three curves; but pretty much the story of gestational age shines through here. Pre-term is really a risk factor for mortality, and the reduction in hazard seen with full-term translates into an estimated difference in probabilities of survival on the order of ten or more percent over the follow-up period. So it's pretty dramatic. If we look at the same presentation, but for second-born children instead of first-born children, you'll notice (this is on the same scale) that if you compare these curves to the curves over here, they all shift up a bit. There's still quite a disadvantage, in terms of increased risk and decreased survival, to being pre-term: this curve, again, is very distant from the other gestational age categories. But there is a benefit, in terms of decreased risk and increased survival, to being the second-born child compared to the first. So you see these curves here, on the same scale, are shifted up relative to their counterparts among the first-born children. This kind of presentation is nice when authors do it, at least for some of the groups, to ground our understanding of what the underlying cumulative survival looks like over the follow-up period. That helps us take those hazard ratios and translate them into an understanding of what they mean in terms of cumulative differences in survival between groups across the follow-up period.
So, in summary: multiple Cox regression results can be used to estimate cumulative survival curves for time-to-event outcomes for a given subset of a population, given the subset's predictor values. Obviously we can't do this by hand, and you wouldn't be expected to recreate any of those curves given the equation for Cox regression, but I just want you to be aware that it can be done. So if you're working in a research group and thinking about how to present your results in a paper or report, you may advocate for publishing some estimated survival curves, above and beyond presenting the results from a multiple Cox regression, to contextualize the impact of these predictors in terms of the cumulative probabilities of survival over the follow-up period. Multiple Cox regression results can also be used to estimate hazard ratios between groups who differ by more than one characteristic. We looked at one example of that, and it was a very similar approach to what we did with linear regression, except that the results on the regression scale had to be exponentiated. It's analogous to what we did with logistic regression, because we were dealing with log ratios on that scale as well, in terms of the slopes.

Confidence intervals for these comparisons can be estimated, but they need to be done by the computer, because the standard error for a hazard ratio, or the log hazard ratio, that represents a comparison on multiple x's has to be estimated by the computer; it can't easily be done by hand. But the idea is exactly the same: if somebody gave you the standard error, you could get a confidence interval estimate by taking the log estimate, adding and subtracting two estimated standard errors, and exponentiating the results. In the next section, we'll look at some examples of Cox regression used in published articles from the public health and medical domains.
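Before moving on, that confidence-interval recipe can be sketched with made-up numbers; the standard error below is hypothetical, not something reported in the lecture.

```python
import math

# Hypothetical inputs: an estimated hazard ratio and the computer-supplied
# standard error of the log hazard ratio (illustrative values only).
hazard_ratio = 4.74
se_log_hr = 0.30

log_hr = math.log(hazard_ratio)
# Add and subtract two standard errors on the log scale (approximate 95% CI),
# then exponentiate back to the hazard ratio scale.
lower = math.exp(log_hr - 2 * se_log_hr)
upper = math.exp(log_hr + 2 * se_log_hr)
print(f"95% CI for the hazard ratio: ({lower:.2f}, {upper:.2f})")
```

Note the interval is not symmetric around the hazard ratio itself; it's symmetric on the log scale, which is where the sampling distribution is approximately normal.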
