A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 1B: More Simple Regression Methods

In this module, more detail is given regarding Cox regression, and its similarities to and differences from the other two regression models from module 1A. The basic structure of the model is detailed, as well as its assumptions, and multiple examples are presented.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

In this section we're going to look at a simple regression method for

when we have time-to-event data in the presence of censoring.

And this is something that in its full name is called Simple Cox Proportional Hazards

Regression, frequently abbreviated as Cox Regression.

And it's named after its inventor, Sir David Cox, an English statistician.

So in this set of lectures, we will develop a framework for

simple Cox proportional hazards regression.

Which is a method for relating a time-to-event outcome in the presence of

censoring to a single predictor that can be binary, categorical or continuous.

So at face value it's like the other two regressions we've looked at,

where we're relating an outcome or a

function of an outcome to a single predictor by a linear equation.

So in this first section, we'll just set up the idea of Simple Cox Regression.

And something that comes out in its full name,

proportional hazards, we'll define that concept.

So after viewing this lecture section,

hopefully you will be able to interpret the slope of a predictor

from a Cox regression model as a log hazard ratio,

and explain the concept of proportional hazards and

what it means with respect to interpreting the slope

and exponentiated slope from a Cox regression model.

So for Cox proportional hazards regression,

the equation is similar to logistic regression with an extra piece.

What this regression does is model the log risk, called the hazard,

and we'll define hazard more formally,

but the log risk of a binary outcome y as a function of our predictor x1

and also the follow-up time.

We'll just call that generically t.

Remember, what we're doing with time-to-event studies is following subjects from

time 0, when they start the study, over time,

to see if they have an event of interest,

if they make it to the end of the study without having the event,

or if they drop out without having the event.

So there's potential for some censoring.

What this model does is estimate the log of what's called the hazard of

our binary outcome

at any given time in the follow-up period, time t generically, as a function

that looks like an intercept, but depends on where we are in time.

So we'll name this lambda naught hat of t,

evaluated at t, plus a slope times our predictor.

So this slope times predictor looks like any other regression we've seen.

The only difference with Cox now is that our intercept piece depends on

where we are in the followup period.

And we'll delve into this more deeply in a few slides.

So, as with linear and logistic regression, our predictor x1

can be binary, nominal categorical, ordinal categorical, or continuous.
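
Written out as an equation, the model described in words above might look like the following. The notation is an assumption made for this write-up: the lecture names the pieces verbally, with the lambda term as the intercept that depends on follow-up time t and beta one hat as the slope for x1.

```latex
\log\big(\widehat{\text{hazard}}(t \mid x_1)\big) \;=\; \hat{\lambda}_0(t) \;+\; \hat{\beta}_1 x_1
```

Here the intercept term is already on the log hazard scale, which matches how it is used in the comparisons in this lecture.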

So let's just define this idea of hazard.

Hazard, in the English language, is synonymous to some degree with risk.

But, technically speaking, here is what hazard means in time-to-event studies.

And we were estimating a function of the hazard when we did the Kaplan-Meier curve,

but we didn't call it as such.

But what hazard is, is the instantaneous risk

of having the binary outcome at a given time in the follow-up period.

It's the risk of having the event among those who are at

risk of having the event at this time in the follow-up period.

So at a given time,

it's estimated among those who are still in the study,

who have not dropped out or had the event prior to that time.

It's the risk among those of having the event at that specific time.

So for interpretation purposes, we can think of the hazard as

the time-specific risk of having the event of interest.

So essentially, it's synonymous with risk.
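
To make the "risk among those still at risk" idea concrete, here is a minimal Python sketch. It uses a discrete, month-by-month approximation rather than a true instantaneous hazard, and every number in it is made up for illustration; the function name `time_specific_risk` is hypothetical.

```python
# Hypothetical follow-up summary, one row per month:
# (month, number still at risk at the start of the month, events in that month).
# The at-risk count drops by more than the events because some subjects
# are censored (drop out without having the event).
follow_up = [
    (1, 100, 5),
    (2, 90, 9),
    (3, 75, 3),
]


def time_specific_risk(at_risk, events):
    """Proportion having the event among those still at risk at that time."""
    if at_risk <= 0:
        raise ValueError("no subjects at risk")
    return events / at_risk


# Discrete approximation of the hazard at each month of follow-up.
risks = {month: time_specific_risk(n, d) for month, n, d in follow_up}

for month, risk in risks.items():
    print(f"month {month}: time-specific risk = {risk:.3f}")
```

Subjects who already had the event or dropped out are excluded from the denominator, which is what separates this from an overall cumulative risk.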

So, let's look at what this equation gives us then.

It says the log of this time-specific risk at

a specific time is equal to some intercept that depends on

the time, plus the business-as-usual slope times our predictor.

So just like in any other regression we've looked at, linear or logistic, the slope,

and we'll show this explicitly, gives the difference in the log

hazard between two groups who differ by one unit in x1 at the same time point.

And this slope measures or quantifies this difference in

the log hazard between any groups who differ by one unit in x1,

regardless of what time in the follow-up period we're comparing their risk at.

Or their log risk.

So let's just give an example.

Suppose x1 is binary for sex.

We only have two groups that we'll be comparing in the follow-up period, and

let's just arbitrarily make x1 a 1 for females and a 0 for males.

So let's see what this model estimates for

both groups at a specific time in the follow-up period.

Let's say t equals one month.

So the log hazard of the binary outcome for females at

one month in the follow-up period is equal to this function, this intercept.

I'll call it lambda naught.

Its value when t is equal to 1 plus our slope times x1, which is equal to 1.

The same estimate, but for males, when x1 is equal to 0,

is just this intercept, if you will, evaluated at time 1.

So this intercept function, if you will, estimates the log risk of

having the event, at that given time, for this reference group of males.

And so, if we take the difference of these estimates, this piece cancels, and

all we're left with is the slope for sex.

So this slope for sex, at least in this particular case, estimates

the difference in the log hazard of the outcome for females compared to males.

At one month of follow-up.

But we'll see that this holds

regardless of the follow-up time. Let's look at another example, 50 months.

Suppose we look at 50 months in the follow-up period, assuming our study went on for

at least 50 months.

The estimated log hazard of the outcome for

females at 50 months is equal to this intercept function evaluated at 50 months.

Lambda naught evaluated at 50 months plus our slope times 1.

And for males, they are the reference group.

So essentially this intercept evaluated at 50 months estimates the log hazard

of the binary outcome when x equals 0 at that time.

So again, if we look at the difference in the estimated log hazards between the two

groups we take the top and subtract the bottom.

This piece cancels and again we're just left with this same slope.

So, notice we could go back and do this for other times in the follow-up period.

But notice that regardless of what the log hazard is for

males, this intercept piece evaluated at a specific time,

the corresponding log hazard for

females at the same point in time, the same point in the follow-up period,

is this log hazard for males at the same time, plus this slope.

So in other words, the difference in the log hazard for females compared

to the log hazard for males at the same time, regardless of the time, is this slope.
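
The cancellation described above can be written out for a generic follow-up time t. The notation is an assumption for illustration; the lecture describes these steps verbally, using t equal to 1 month and t equal to 50 months.

```latex
\begin{aligned}
\text{females } (x_1 = 1):\quad
  & \log\big(\widehat{\text{hazard}}(t)\big) = \hat{\lambda}_0(t) + \hat{\beta}_1 \cdot 1 \\
\text{males } (x_1 = 0):\quad
  & \log\big(\widehat{\text{hazard}}(t)\big) = \hat{\lambda}_0(t) + \hat{\beta}_1 \cdot 0 = \hat{\lambda}_0(t) \\
\text{difference}:\quad
  & \big[\hat{\lambda}_0(t) + \hat{\beta}_1\big] - \hat{\lambda}_0(t) = \hat{\beta}_1
\end{aligned}
```

Because the intercept term appears in both lines with the same value of t, it drops out of the difference no matter which t is chosen.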

So what is this lambda naught hat of t?

Well, we've alluded to this in my verbiage before, but

think of this as a time-specific intercept.

At any given time in the follow-up period, this is the starting log hazard, for

the reference group, that we add the slope beta one hat to,

to get the time-specific log risk for,

in this case, females, since we only had two groups,

defined by x equals 1, females, or 0, the reference group, males.

So what this function does is estimates the log hazard for the reference group

over the follow-up period as a function of time when x1 is binary.

So what does this look like?

Well, if we were to think about this on the log hazard scale,

looking at the

function of time-specific risks over the follow-up period for

males, this doesn't have to follow any particular shape.

This can jump all over the place.

The risk on the log scale can go up for

males, it can come down at any given point in time,

go back up. I'm just making this up.

I don't have any data to support this particular manifestation.

But what this tells us then, if this is lambda naught of t as a function of t:

what this regression is telling us is that to get the same function estimated over time

for females, at any given point in time we take the value for males and add beta 1 here.

I'm assuming beta 1 is positive.

It doesn't have to be.

And so what these functions will look like, and

I'm going to try and draw this to scale, is they will be

parallel functions on the log hazard scale.

In other words, the difference between the log hazard for

males, given by this red curve, and females, given by this blue curve,

at any point in the follow-up period is that slope beta 1 hat.

So think about it, beta 1 hat estimates the difference in

log hazards between males and females at any given point in time.

So the difference in log hazards, the log hazard,

I'm just abbreviating this, for females minus the log hazard for males

at any given point in the follow-up period, can be re-expressed

as the log of the hazard or risk

for females, compared to the hazard or risk for males.

So this is the log of the risk ratio, or hazard ratio.

If we exponentiate this,

we get what's called a hazard ratio,

which is essentially a risk ratio of the outcome for females compared to

males at any given time in the follow-up period.
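
The step from a difference of logs to a hazard ratio can be sketched as follows. The subscripts F and M for females and males, and the symbol HR for the hazard ratio, are notation assumed here for illustration.

```latex
\begin{aligned}
\hat{\beta}_1
  &= \log\big(\widehat{\text{hazard}}_{F}(t)\big) - \log\big(\widehat{\text{hazard}}_{M}(t)\big)
   = \log\!\left(\frac{\widehat{\text{hazard}}_{F}(t)}{\widehat{\text{hazard}}_{M}(t)}\right) \\[4pt]
e^{\hat{\beta}_1}
  &= \frac{\widehat{\text{hazard}}_{F}(t)}{\widehat{\text{hazard}}_{M}(t)}
   = \widehat{HR}_{F \text{ vs. } M}
\end{aligned}
```

The right-hand side does not depend on t, which is why a single number summarizes the comparison across the whole follow-up period.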

So what this means, and it's going to be difficult to draw,

is on the hazard scale.

So suppose we exponentiated those curves that I looked at before, to get onto the hazard scale.

Let's suppose this is for males.

Now this is not drawn to match what I drew before, but this is males. If I

were to actually draw the corresponding hazard curve for females,

and I'm not going to draw this to proportion, I apologize,

it would be such that these are no longer

necessarily at a constant difference, but at a constant ratio.

And I did not draw this well, but

the ratio of the hazards

between these two groups at any point in time is equal to e to the beta one hat.

So, things that have a constant difference

on the log scale are in a constant ratio on the exponentiated scale.

This is the idea of proportional hazards, that regardless of what the risk is for

males at any given time in the follow-up period,

the risk for females is always the same multiple of the risk for

males, e to the beta 1 hat times that risk for males,

regardless of where we are in the follow-up.

So in other words, the relative risk for

females to males at any given point in the follow up period is e to the beta one hat,

regardless of what the starting risk for males is.

So this constant difference in the log hazards between females and

males, regardless of the comparison time,

and hence the constant ratio of the hazard for

females to males regardless of the comparison time,

is an example of the idea of proportional hazards.

The baseline hazard,

and hence the log hazard that we're estimating on the regression scale here, is

for males.

And this can vary over time, as can the hazard and hence log hazard for females.

But the ratio of the hazards between females and

males is constant regardless of time.

And this translates into a constant difference, or slope,

on the log hazard scale.
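
The binary example can be checked numerically with a short Python sketch. The baseline log hazard values and the slope below are invented purely for illustration, much like the hand-drawn curves in the lecture; the names `log_hazard_males` and `beta1_hat` are hypothetical.

```python
import math

# Hypothetical log hazard for the reference group (males) at a few follow-up
# months. Made-up values that deliberately rise and fall over time.
log_hazard_males = {1: -3.0, 12: -2.2, 50: -2.8}

# Hypothetical slope for sex (x1 = 1 for females, 0 for males).
beta1_hat = 0.4


def log_hazard(t, x1):
    """Time-specific intercept plus slope times the predictor."""
    return log_hazard_males[t] + beta1_hat * x1


# Difference on the log hazard scale, females minus males, at each time.
differences = {t: log_hazard(t, 1) - log_hazard(t, 0) for t in log_hazard_males}

# Ratio on the hazard scale, females to males, at each time.
ratios = {
    t: math.exp(log_hazard(t, 1)) / math.exp(log_hazard(t, 0))
    for t in log_hazard_males
}

for t in log_hazard_males:
    print(f"t={t:>2}: log hazard difference={differences[t]:.3f}, "
          f"hazard ratio={ratios[t]:.3f}")
```

Even though the male log hazard moves around from month to month, the difference stays at the slope and the ratio stays at the exponentiated slope at every time point, which is the proportional hazards idea in miniature.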

Let's look at another example.

Let's make our predictor now continuous.

So slopes again are interpretable as the difference in the log hazard per unit

difference in x1 at a given time in the follow-up period.

But if x is continuous,

it's still that same interpretation; it's just that we have more potential groups we can

compare for values of x.

So suppose x1 is age in years at the start of the study.

So let's just interpret our slope in this situation.

So let me postulate a specific time in the follow-up here.

Let's say one year of follow-up.

Let's compare two groups,

the log hazards of two groups who differ by one year in age.

So I'm going to generically make the ages a and

a plus 1 to indicate the one year difference.

So the log hazard of our outcome being one, for

the older group at one year, is equal to this

intercept function at one year, plus our slope times their age.

For the group that's one year younger, it's the intercept part evaluated at

the same point in the follow-up period, plus the slope times their age.

If we take this difference, this intercept piece at time one year will cancel.

And this part here is equivalent to beta one hat times a, plus an extra beta

one hat, so if we take that difference, all we're left with again is the slope.

So this slope estimates the difference in the log hazard of our outcome for

two groups who differ by one year in age

at the first year of follow-up.
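
The same algebra for the continuous case, written for ages a and a + 1 at a generic follow-up time t, can be sketched as follows (notation assumed for illustration; the lecture works through it verbally at one year of follow-up):

```latex
\begin{aligned}
\text{age } a + 1:\quad
  & \log\big(\widehat{\text{hazard}}(t)\big) = \hat{\lambda}_0(t) + \hat{\beta}_1 (a + 1) \\
\text{age } a:\quad
  & \log\big(\widehat{\text{hazard}}(t)\big) = \hat{\lambda}_0(t) + \hat{\beta}_1 a \\
\text{difference}:\quad
  & \big[\hat{\lambda}_0(t) + \hat{\beta}_1 a + \hat{\beta}_1\big]
    - \big[\hat{\lambda}_0(t) + \hat{\beta}_1 a\big] = \hat{\beta}_1
\end{aligned}
```

Both the time-specific intercept and the term with a cancel, so the result depends on neither the follow-up time nor the starting age.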

Suppose we go and look at them again, and we compare them.

Remember, age does not change over time here because it's

their age in years when the study starts.

So suppose we look at 4.5 years of follow-up time.

Well, the only thing that would change in this comparison is the intercept piece,

which would now be the function evaluated at 4.5 years, and

that would pertain to both groups.

So if we again looked at the difference in the log hazard between those two groups,

who differ by one year of age,

it would be beta one hat again.

So notice that regardless of what the log hazard is for

x1 equals 0, as it potentially varies as a function of time,

for two groups who differ by one year in age compared at

any point in time in the follow-up period, the difference in log hazards is beta 1 hat.

We take the log hazard of the group that was younger

at the start of the study, and add beta 1 hat to it to get the log hazard of

the group older by one year at that same time in the follow-up period.

So what is this lambda naught hat evaluated at a given time t?

Well, think of this again as a time-specific intercept, but here, when x is

continuous, at any given time this is sort of the starting log hazard

that we add multiples of the slope to, to get an age-specific log hazard at time t.

And again age is measured at baseline.

So technically, this is the log hazard as a function of time t for

persons who are 0 years old at the start of the study.

Depending on the population we're looking at, and hence the sample we have,

this may or may not describe a specific age group in our sample.

So this may just be a place holder.

Like an intercept for a logistic or linear regression when x was continuous,

but this intercept or this place holder will vary as a function of time.

So if we were to look at a picture,

we might have something like this on the log scale.

This is our function, the log hazard,

for people with x1 equals 0.

If we were to plot it for various other ages, like 30 years old,

the shape would be the same.

And the difference between any two

curves over time for two groups who differ

by one year in age would be beta 1 hat.

And for multiples of age, so 30 versus this reference group of age 0 here,

and this is not drawn to scale, it would be 30 beta one hat.

But that difference would be constant, on the log risk scale,

at any point in time in the follow-up period.

So as such, the slope estimates the difference in the log hazard

for any two groups who differ by one unit in x1.

At a specific time, the difference in log hazards is equivalent to

the log of the hazard ratio.

So when x1 is continuous, the exponentiated slope, e to the beta 1 hat,

gives the estimated hazard ratio of the outcome for two groups

who differ by one year in age at any given time in the study follow-up period.
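
As a small numeric illustration of the exponentiated slope for a continuous predictor, here is a Python sketch. The slope value is hypothetical and not taken from any data in this course; `hazard_ratio` is an illustrative helper, not a library function.

```python
import math

# Hypothetical slope: difference in log hazard per one year of baseline age.
beta1_hat = 0.05


def hazard_ratio(age_difference_years, slope=beta1_hat):
    """Estimated hazard ratio for two groups whose baseline ages differ
    by the given number of years, at any time in the follow-up period."""
    return math.exp(slope * age_difference_years)


hr_one_year = hazard_ratio(1)       # e to the beta 1 hat
hr_thirty_years = hazard_ratio(30)  # e to the (30 times beta 1 hat)

print(f"1-year age difference:  HR = {hr_one_year:.3f}")
print(f"30-year age difference: HR = {hr_thirty_years:.3f}")
```

A difference of 30 beta one hat on the log scale corresponds to the one-year hazard ratio raised to the 30th power on the hazard scale, since multiples on the log scale become powers after exponentiating.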

So this constant difference in the log hazards between groups who differ by

one or more years of age,

our x1 variable, and hence the constant ratio of the hazards for two such

groups regardless of the comparison time, is an example of proportional hazards.

The baseline hazard in this example, where x1 is continuous,

and hence the log hazard which we're estimating on

the regression scale, is for a potentially theoretical group,

those who are 0 years old at the start of the study. But nevertheless the ratio of

the hazards between different age groups is constant regardless of time.

So in summary, Cox proportional hazards regression allows for

the estimation of a log hazard ratio,

and hence a hazard ratio, comparing the hazard or

risk of the outcome over the follow-up period

between two groups who differ by one unit in the predictor x1.

This model estimates its result

under the assumption that the relative hazard, or hazard ratio, of the outcome for

any two groups being compared is constant across the entire follow-up period.

And hence on the log scale, the difference in

the log hazards is constant across the entire follow-up period.

In the next several sections, we'll look at some examples with real outcomes and

real data.

And estimate hazard ratios and

put confidence limits on them so we can put this in a substantive context.
