A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this section we'll deal with accounting for the uncertainty in

the estimates we get, the slope and intercept for

a logistic regression model, when we estimate them from a sample from some larger population.

So, I don't think there will be any surprises in this lecture, but

what we're going to show is how to create 95% confidence intervals, mainly for

slopes from logistic regression,

because we can convert those to

95% confidence intervals for the corresponding odds ratios. But

we can also do it for intercepts,

and convert these to 95% confidence intervals for

the odds of having the binary outcome in those whose x1 value is 0.

We'll also show how to estimate p-values for testing the null of no

association between the risk of the binary outcome and the predictor x1.

In terms of the slope, that would be the null that the slope is 0.

This is equivalent to the null that the exponentiated slope is 1.

Or in other words, the odds ratio of the binary outcome for

two groups who differ by 1 unit in x1 is 1, indicating no association.

So in the last two sections we showed the results from

several simple logistic regression models.

For example, the relationship between breastfeeding and child sex estimated

from a random sample of 236 Nepalese children between 0 and 36 months old

was given by the following equation, where x1 is 1 for males and 0 for females.

And we saw a very slight increase in the log odds, and

hence the log odds ratio, of being breastfed for males relative to females.

But how is this estimated?

Well, it's no surprise that I used a computer to do this.

But what is the algorithm the computer uses to estimate this equation?

There must be some algorithm that will always yield the same results for

the same sample of data.

So for logistic regression this approach is called maximum likelihood.

Actually, the approach we use for linear regression, least squares, is

a version of maximum likelihood applied to linear regression estimation.

But what maximum likelihood means is that the resulting estimates we get for

both the intercept and the slope are the values that make our

observed sample data most likely among all choices for the intercept and slope.

So the computer iterates through choices for beta naught-hat and

beta one-hat, evaluates how likely our sample results are given those choices, and

iterates until it can't make the likelihood any larger.

It finds the values that maximize it, and

those are the presented values for beta naught-hat and beta one-hat.

This has to be done by the computer and

can be computationally intense under certain circumstances.
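
To make the idea of iterating toward the largest likelihood concrete, here is a minimal sketch in Python. It uses Newton-Raphson steps, which is one common way to maximize a logistic likelihood; the lecture does not say which algorithm the software uses, so that choice is an assumption. The data below (10 observations per group, with 4 and 6 outcomes) are made up for illustration and are not the Nepal sample, and the function names are hypothetical.

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of a simple logistic model: log odds = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = b0 + b1 * x
        # log of p^y * (1 - p)^(1 - y), where p = 1 / (1 + exp(-eta))
        total += y * eta - math.log(1.0 + math.exp(eta))
    return total

def fit_simple_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Newton-Raphson iterations; returns (b0, b1, se_b0, se_b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix entries
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Standard errors: square roots of the diagonal of the inverse information
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)

# Hypothetical toy data: x = 0 group has 4 of 10 outcomes, x = 1 group has 6 of 10
xs = [0] * 10 + [1] * 10
ys = [1] * 4 + [0] * 6 + [1] * 6 + [0] * 4

b0_hat, b1_hat, se_b0, se_b1 = fit_simple_logistic(xs, ys)
print(b0_hat, b1_hat, se_b0, se_b1)
print(log_likelihood(b0_hat, b1_hat, xs, ys) >= log_likelihood(0.0, 0.0, xs, ys))
```

With a single binary predictor, the fitted intercept matches the observed log odds in the x = 0 group and the fitted slope matches the observed log odds ratio, which is a handy check on the iteration.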

So the values chosen for beta naught-hat and beta one-hat are just

estimates based on the single sample from a larger population.

Suppose by chance, we had gotten a different random sample of

236 children from the same population of children 0 to 36 months old.

Well, the resulting estimates we would get from the second sample would likely be

different, just by chance, from the estimates for beta naught-hat and

beta one-hat that we just looked at from our original sample.

So as such, there is potential for

variability in our estimates from sample to sample, and all regression coefficients

have an estimated standard error that estimates that potential variability.

And these can be used, coupled with the estimates, to make statements about

the true relationship for example between the log odds of our outcome and x1.

For example, the true underlying population slope and

we can do this based on the results of a single sample.

So let's go and look at the estimated regression equation.

This is the estimated regression equation for

the relationship between breastfeeding and sex of a child estimated from

a random sample of 236 Nepali children less than 36 months old.

And these are the numbers we just looked at.

But the computer not only gave me the slope and

intercept estimates, it gives an associated standard error for each one.

And it turns out, and this is probably no surprise,

because let's think of the slope for a moment, which estimates a log odds ratio.

We already discussed back in Statistical Reasoning 1 how the distribution of

ratios from sample to sample is not necessarily normal, but

their log values are normally distributed.

So you may recall we had to do our confidence interval estimation for

odds ratios on the log scale.

And since our slope is a log odds ratio, we can do

the inference right on the slope scale.

So what we're going to do is use the same old logic.

If we looked at the distribution of the slopes the log odds ratios from

multiple random samples of the same size and

did a histogram, they'd be roughly normally distributed.

There'd be some variability, but

on average they'd equal the true underlying population slope.

So, let's look at the results for relating breastfeeding to sex.

So the estimated slope was 0.002, and the standard error estimate for the slope from

the computer is 0.30.

We want to create a 95% confidence interval for the slope.

We take our estimate plus or minus 2 standard errors.

And it gives us a 95% confidence interval from negative 0.598 to positive 0.602.

So this interval for the slope includes the null value for

the log odds ratio of 0.

If you wanted to get the corresponding confidence interval for

the odds ratio associated with breastfeeding for males to females,

we could take the endpoints of the interval for the slope and exponentiate them.

And if we do so we take e to the negative

0.598 and e to the 0.602.

We get an interval approximately equal to 0.55 up to 1.83.

So our interval for the log odds ratio included the null value in

that scale of 0 and the interval for the odds ratio includes 1.
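
The arithmetic for this interval can be sketched in a few lines of Python. The slope of 0.002 is the value quoted in the lecture; the standard error of 0.30 is the value implied by the quoted interval (0.002 plus or minus 2 times 0.30 gives negative 0.598 to 0.602). The function name is hypothetical.

```python
import math

def ci_from_estimate(estimate, se, multiplier=2.0):
    """Estimate plus or minus `multiplier` standard errors."""
    return estimate - multiplier * se, estimate + multiplier * se

slope, se_slope = 0.002, 0.30            # log odds ratio and its standard error
lo, hi = ci_from_estimate(slope, se_slope)

# Exponentiate the endpoints to move from the log odds ratio scale
# to the odds ratio scale
or_lo, or_hi = math.exp(lo), math.exp(hi)

print(round(lo, 3), round(hi, 3))        # interval for the slope
print(round(or_lo, 2), round(or_hi, 2))  # interval for the odds ratio
```

The slope interval covers 0 exactly when the odds ratio interval covers 1, since e to the 0 is 1.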

So this says, after accounting for the uncertainty in our estimates,

the true difference in odds of being breastfed between males and

females in this population could be such that males have anywhere from

0.55 times the odds of females to 1.83 times the odds,

anywhere from 45% reduced odds up to 83% greater odds than females.

So there's no clear consensus here as to, you know, what the nature of

the association is and we cannot rule out the null of no association.

So these results indicate that we did not find a statistically significant

association between breastfeeding and sex in this population.

When we want to get a p-value for testing the null that

the slope is 0 versus the alternative that it's not 0,

or in other words, that the odds ratio is 1 versus not 1,

well, this is business as usual.

We do it on the slope scale.

We assume the null is true, that our data come from

a population where the true slope or log odds ratio is 0.

We measure how far our result is from 0 in standard errors.

We get a result that's 0.01 standard errors above what we'd expect,

very close to it.

We already knew the p-value for this test would be

greater than 0.05 given our previous intervals, but if we actually look this up, the

p-value is very large at 0.997.

So again, we fail to find a statistically significant association after

accounting for the uncertainty in our data.
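
As a rough check on this test, the distance in standard errors and a two-sided p-value from the standard normal distribution can be computed as below. This is a sketch under assumptions: it uses the rounded slope of 0.002 and a standard error of 0.30 (the value implied by the interval quoted earlier), and the function name is hypothetical. With these rounded inputs the p-value comes out near 0.995 rather than exactly the 0.997 quoted, which was presumably computed from unrounded estimates.

```python
import math

def two_sided_p_value(estimate, se, null_value=0.0):
    """Distance from the null in standard errors, and the two-sided normal p-value."""
    z = (estimate - null_value) / se
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

z, p = two_sided_p_value(0.002, 0.30)
print(round(z, 2), round(p, 3))
```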

Suppose we wanted to summarize these findings in a paragraph,

as if we were writing a results section for an abstract, for example.

We might say something like: logistic regression was used to

estimate the relationship between breastfeeding and sex of the child using

data from a random sample of 236 Nepalese children 0 to 36 months old.

The results show no substantive (now this is my interpretation,

because the estimated odds ratio in the sample was essentially 1)

or statistically significant association between breastfeeding status and sex.

And so I report the odds ratio, the estimate we got from the sample,

rounding to two decimals, of 1.00, with the confidence interval that we

just computed, 0.55 to 1.83.

So the log odds ratio and

its computations are necessary to get the interval on the ratio scale,

but we would ultimately present things on the ratio scale.

How about the risk of obesity and HDL cholesterol levels that we

estimated from NHANES data?

Recall the results looked like this.

The estimated intercept was negative 0.005,

negative 0.05, estimated slope was negative 0.033.

The computer will give me estimated standard errors for

each of these quantities, so

we could go ahead and create a confidence interval for this slope, for example.

So if we did that, we take our estimate, negative 0.033,

and add and subtract two estimated standard errors, where the standard error is 0.003.

And if you do this,

we get an interval from negative

0.039 to negative 0.027.

Now we can parlay this into an estimate for the odds ratio confidence interval,

and just to remind you the estimated odds ratio is actually equal to 0.967.

I rounded it in the previous set of lectures to 0.97, but

because this confidence interval, we'll see, is very tight on the odds ratio scale,

I'm going to present it here to three decimal places.

So if we actually get the 95% confidence interval for

the population level odds ratio of obesity per 1 unit change in

HDL cholesterol levels, we'd exponentiate these endpoints up here.

If you take e to the negative 0.039, you get roughly 0.96.

And if you take e to the negative 0.027, you get roughly 0.97.

The estimate is between those two,

but not perfectly symmetric because we're on the ratio scale.

So this is a very tight confidence interval, indicating that after accounting for

the uncertainty in our sample-based estimates,

the estimated association between obesity and HDL cholesterol level in

the population is on the order of anywhere from a 3% to 4% decrease in

the odds of obesity per 1 milligram per deciliter increase in HDL.

Suppose we wanted to do this for

a comparison of groups who differed by more than 1 unit of HDL cholesterol level, say 100 versus 80 milligrams per deciliter.

Well, we could do this in several different ways.

Perhaps the easiest way would be to do it on the slope scale.

We know that a difference of 20 units in x1,

because 100 minus 80 is 20, results in a difference of 20 times the slope in the log odds.

We've seen this many times by now.

And so the estimated difference in the log odds of obesity for these two groups

is 20 times the slope, which estimates the log odds ratio for a 1 unit difference.

If you do this, it's equal to 20 times

negative 0.033 or negative 0.66.

We exponentiate that to get our estimated odds ratio.

If we exponentiate, it's 0.52.

So this decrease compounds relatively quickly.

And we'd say, those with 100 milligrams per deciliter have nearly half the odds of

obesity compared to those with 80 milligrams per deciliter.

But this is just an estimate based on our sample.

So to get a confidence interval for a multiple of the slope, all we have

to do is take the endpoints of the confidence interval for the slope and

multiply them by this difference in the two groups we're comparing.

So we take 20 times the lower endpoint we just computed for

the confidence interval for the slope.

And we take 20 times the upper endpoint we just computed.

And I'll let you verify this.

But if you then exponentiate to get the confidence

interval on the ratio scale, it becomes 0.46 to 0.58.

So after accounting for the uncertainty in our estimate for

the odds ratio of this comparison, we see that this 20 unit difference in

HDL cholesterol levels is associated with a decrease in the odds of obesity of anywhere from

42%

to 54% at the population level.
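
For anyone who wants to verify these numbers, here is a small Python sketch of the scaling step. It uses the slope (negative 0.033) and standard error (0.003) quoted in the lecture; the function name and its `delta` parameter are hypothetical, and the sketch assumes a positive difference between the two groups.

```python
import math

def odds_ratio_ci(slope, se, delta=1.0, multiplier=2.0):
    """Odds ratio and interval for groups differing by `delta` units of x1.

    Assumes delta > 0, so the lower and upper endpoints keep their order.
    """
    lo = delta * (slope - multiplier * se)   # scaled lower endpoint, log scale
    hi = delta * (slope + multiplier * se)   # scaled upper endpoint, log scale
    return math.exp(delta * slope), math.exp(lo), math.exp(hi)

slope, se = -0.033, 0.003

# Per 1 mg/dL difference in HDL
or_1, lo_1, hi_1 = odds_ratio_ci(slope, se)

# 100 mg/dL versus 80 mg/dL: a 20 unit difference
or_20, lo_20, hi_20 = odds_ratio_ci(slope, se, delta=20)

print(round(lo_1, 2), round(hi_1, 2))
print(round(or_20, 2), round(lo_20, 2), round(hi_20, 2))
# Percent decrease in odds implied by each endpoint of the 20 unit interval
print(round(100 * (1 - hi_20)), round(100 * (1 - lo_20)))
```

Scaling happens on the log odds scale before exponentiating, so the 20 unit odds ratio is the per-unit odds ratio raised to the 20th power, not 20 times it.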

We've got one more example.

Respiratory failure and gestational age.

Remember, this is where we had an ordinal but categorical variable.

Our reference group, the one whose x1 values were equal to 0,

were children who were 37 plus weeks in gestational age.

And x1 for example is an indicator of being 34 weeks.

So let's just take that on for a minute.

First, and we haven't done this yet, but

we can easily do it, let's estimate the confidence interval for the intercept.

This estimate, in this case,

is a really useful quantity applicable to our dataset.

It's the log odds of respiratory failure for children who are full term, or

37 plus weeks, because that's our reference.

All x-values are 0 for that group.

The estimated intercept is negative 5.5, and its standard error is equal to 0.039.

So if we take negative 5.5 plus or

minus 2 times 0.039, we get

an interval that goes from negative 5.58,

roughly speaking, after rounding, to negative 5.42.

Certainly, this does not include our null value of 0 on this scale.

If we exponentiate these end points, we get an interval from 0.0038 to 0.0044.

This is the confidence interval, not for a ratio because there's no comparison here.

This estimates the confidence interval for the odds.

The odds in this group.

The estimated odds, I should have stated, e to the negative 5.5, is 0.004.

So we estimate very low odds, and that would translate into a very low probability,

which we'll see in the next section, of respiratory failure for

full-term children.

And after we account for the uncertainty, we see that it's relatively

tight on the odds scale and all the possibilities show a low odds.
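
The intercept calculation follows the same pattern, except that exponentiating gives odds for the reference group rather than an odds ratio. A short Python check, using the intercept (negative 5.5) and standard error (0.039) quoted in the lecture; the variable names are only illustrative:

```python
import math

intercept, se_intercept = -5.5, 0.039   # log odds in the 37+ week reference group

lo = intercept - 2 * se_intercept       # lower endpoint on the log odds scale
hi = intercept + 2 * se_intercept       # upper endpoint on the log odds scale

# Exponentiating an intercept gives odds, not an odds ratio
odds_est = math.exp(intercept)
odds_lo, odds_hi = math.exp(lo), math.exp(hi)

print(round(lo, 2), round(hi, 2))
print(round(odds_est, 3), round(odds_lo, 4), round(odds_hi, 4))
```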

If we wanted to get an odds ratio for the comparison of relative odds of

respiratory failure for those most premature in our data set, that is,

those with a gestational age of 34 weeks, compared to this reference group,

well, we have the estimated log odds ratio of 3.4.

We add and subtract 2 estimated standard errors.

If we do the math, we get a confidence interval here on the log odds ratio or

slope scale, 3.27 to 3.53.

Now, the estimated odds ratio we got was approximately 30.

If we actually exponentiate the endpoints of our confidence interval for

the slope to get a 95% confidence interval for the odds ratio,

it goes from 26.3 to 34.1.

So it's pretty convincing even after accounting for

the uncertainty in our data that being premature at 34 weeks is highly

associated with respiratory failure as compared with children who are full term.

It's a strong risk factor.

How would we get the p-value for the slope if we wanted to test?

We already know that the confidence interval for the odds ratio, the exponentiated

slope, did not include 1; it was way off from 1, as a matter of fact.

But if we wanted to get the exact p-value,

instead of just comparing it to 0.05, we need to do this computation.

We measure how far our result was from the log odds ratio

we'd expect under the null, 0, in terms of standard errors.

If you do this, we get a test statistic.

We get a distance measure of 51.2.

Our result is more than 51 standard errors above what we'd expect

if there were no association between low gestational age and

increased or decreased respiratory failure risk compared to full-term infants.

So I don't have to tell you that this p-value is very, very small.
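
To get a feel for just how small, here is a hedged Python sketch that starts from the quoted distance of 51.2 standard errors. A direct normal-tail calculation is too small to represent as an ordinary double, so the sketch also uses the standard large-z approximation, two-sided p roughly equal to 2 times phi(z) divided by z, on the log scale. The approximation is a choice made for this illustration, not something from the lecture.

```python
import math

z = 51.2  # distance from the null in standard errors, as quoted

# Direct two-sided normal tail area; expected to underflow to 0.0
p_direct = math.erfc(z / math.sqrt(2.0))

# Large-z approximation on the natural log scale:
# log(p) ~ log(2) - z^2 / 2 - 0.5 * log(2 * pi) - log(z)
log_p = math.log(2.0) - 0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(z)
log10_p = log_p / math.log(10.0)

print(p_direct)
print(round(log10_p))  # order of magnitude of the p-value, as a power of 10
```

In practice software would simply report this as p less than 0.001.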

So in summary, it's business as usual.

95% confidence intervals for the slopes and

intercepts can be found by generically taking our beta-hat,

whatever it is, naught or 1, and adding and subtracting 2 standard errors.

These results can be exponentiated to get confidence

intervals on the odds ratio and odds scales.

And if we want to test whether the odds ratio we're estimating with

the exponentiated slope is different from 1,

it's business as usual as well.

We measure how far our log odds ratio estimate is from 0 in terms of

standard errors and get a p-value based on that.

In the next section, we'll look at how to translate the results from logistic

regression into risks and probabilities under allowable study designs.
