A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

45 ratings


From the lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this section we'll deal with accounting for the uncertainty in the estimates we get, the slope and intercept for a logistic regression model, when we estimate from a sample from some larger population. So I don't think there will be any surprises in this lecture, but what we're going to show how to do is create 95% confidence intervals, mainly for slopes from logistic regression, because we can convert those slope intervals into 95% confidence intervals for the corresponding odds ratios. But we can also do it for intercepts, and convert these to 95% confidence intervals for the odds of having the binary outcome in those whose x1 value is 0. We're also going to show how to estimate p-values for testing the null of no association between the risk of the binary outcome and the predictor x1. In terms of the slope, that would be the null that the slope is 0.

This is equivalent to the null that the exponentiated slope is 1, or in other words, that the odds ratio of the binary outcome for two groups who differ by 1 unit in x1 is 1, indicating no association. In the last two sections we showed the results from several simple logistic regression models. For example, the relationship between breastfeeding and child sex, estimated from a random sample of 236 Nepalese children between 0 and 36 months old, was given by the following equation, where x1 is 1 for males and 0 for females. And we saw a very slight increase in the log odds, and hence log odds ratio, of being breastfed for males relative to females. But how is this estimated? Well, it's no surprise that I used a computer to do this. But what is the algorithm the computer uses to estimate this equation?

Actually, the approach we used for linear regression, least squares, is a version of maximum likelihood applied to linear regression estimation. What maximum likelihood means is that the resulting estimates we get for both the intercept and the slope are the values that make our observed sample data most likely among all choices for the intercept and slope. So the computer iterates over choices for beta naught-hat and beta one-hat, evaluates how likely our sample results are given those choices, and iterates until it can't make the likelihood any larger. It finds the values that maximize it, and those are the presented values for beta naught-hat and beta one-hat. This has to be done by the computer and can be computationally intense under certain circumstances. So the values chosen for beta naught-hat and beta one-hat are just estimates based on a single sample from a larger population. Suppose by chance we had gotten a different random sample of 236 children from the same population of children 0 to 36 months old. Well, the resulting estimates we would get from the second sample would likely differ, just by chance, from the estimates for beta naught-hat and beta one-hat in our original sample. As such, there is potential for variability in our estimates from sample to sample, and all regression coefficients have an estimated standard error that estimates that potential variability. This can be coupled with the estimates themselves to make statements about the true relationship, for example between the log odds of our outcome and x1 (that is, the true underlying population slope), and we can do this based on the results of a single sample.
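The iterative maximization described above can be sketched in a few lines of code. This is a minimal illustration with made-up data (not the actual Nepal sample), fitting log odds = b0 + b1*x by Newton-Raphson steps on the log-likelihood; real software uses essentially this idea with more numerical safeguards.

```python
import math

def fit_logistic(x, y, iters=25):
    """Maximum likelihood fit of log(odds) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Gradient of the log-likelihood and the (observed) information matrix
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # fitted probability
            g0 += yi - p
            g1 += (yi - p) * xi
            w = p * (1.0 - p)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: move the coefficients uphill on the likelihood
        b0 += ( h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

# Hypothetical data: 100 females (x=0), 100 males (x=1)
x = [0] * 100 + [1] * 100
y = [1] * 70 + [0] * 30 + [1] * 72 + [0] * 28
b0, b1 = fit_logistic(x, y)
# With a single binary predictor, the MLE reproduces the sample log odds:
# b0 = log(70/30), b1 = log(72/28) - log(70/30)
```

With a binary predictor the maximum likelihood solution simply matches the observed group odds, which is a handy way to check that the iteration converged.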

So let's go and look at the estimated regression equation. This is the estimated regression equation for the relationship between breastfeeding and sex of the child, estimated from a random sample of 236 Nepali children less than 36 months old, and these are the numbers we just looked at. But the computer not only gave me the slope and intercept estimates, it gives an associated standard error for each one. And this is probably no surprise, because let's think about the slope for a moment: it estimates a log odds ratio. We already discussed back in Statistical Reasoning 1 how the distribution of ratios from sample to sample is not necessarily normal, but their log values are normally distributed. So you may recall we had to do our confidence interval estimation for odds ratios on the log scale. And since our slope is a log odds ratio, we can do the inference right on the slope scale.

So what we're going to do is use the same old logic. If we looked at the distribution of the slopes the log odds ratios from multiple random samples of the same size and did a histogram, they'd be roughly normally distributed. There'd be some variability, but on average they'd equal the true underlying population slope.

So, let's look at the results for relating breastfeeding to sex. The estimated slope was 0.002, and the standard error estimate for the slope from the computer is 0.3. We want to create a 95% confidence interval for the slope, so we take our estimate plus or minus 2 standard errors.

And it gives us a 95% confidence interval from negative 0.598 to positive 0.602. So this interval for the slope includes the null value for the log odds ratio of 0.

If we wanted to get the corresponding confidence interval for the odds ratio of breastfeeding for males relative to females, we could take the endpoints of the slope interval and exponentiate them. If we do so, we take e to the negative 0.598 and e to the 0.602, and we get an interval approximately equal to 0.55 up to 1.83. So our interval for the log odds ratio included the null value on that scale, 0, and the interval for the odds ratio includes 1. This says that after accounting for the uncertainty in our estimates, the true difference in the odds of being breastfed between males and females in this population could be such that males have anywhere from 0.55 times the odds of females up to 1.83 times the odds: anywhere from 45% reduced odds up to 83% greater odds than females.
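Since the arithmetic here is just "estimate plus or minus 2 standard errors" followed by exponentiation, it can be checked in a couple of lines. The slope is 0.002 as reported above, and the reported interval (negative 0.598 to 0.602) implies a standard error of 0.3:

```python
import math

slope, se = 0.002, 0.3  # slope and implied standard error from the example

# 95% CI on the log odds ratio (slope) scale: estimate +/- 2 SEs
lo, hi = slope - 2 * se, slope + 2 * se            # -0.598 to 0.602

# Exponentiate the endpoints to get the odds ratio scale
or_lo, or_hi = math.exp(lo), math.exp(hi)          # ~0.55 to ~1.83
```

Note that the odds ratio interval is not symmetric around the point estimate of 1.00, because exponentiation stretches the upper half of the log-scale interval more than the lower half.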

So there's no clear consensus here as to what the nature of the association is, and we cannot rule out the null of no association. These results indicate that we did not find a statistically significant association between breastfeeding and sex in this population.

When we want to get a p-value, it's the p-value for testing the null that the slope is 0 versus the alternative that it's not 0, or in other words, that the odds ratio is 1 versus not 1. Well, this is business as usual. We do it on the slope scale. We assume the null is true, that our data comes from a population where the true slope, or log odds ratio, is 0. We measure how far our result is in standard errors, and we get a result that's 0.01 standard errors above what we'd expect, very close to it. We already knew the p-value for this test would be greater than 0.05 given our previous interval, but if we actually look it up, the p-value is very large at 0.997. So again, we fail to find a statistically significant association after accounting for the uncertainty in our data.
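The test statistic and two-sided p-value can be computed directly from the slope and its standard error. Here I use the rounded values reported above (0.002 and 0.3), so the p-value comes out around 0.99 rather than exactly the 0.997 the software reported from unrounded numbers:

```python
import math

slope, se = 0.002, 0.3

# Distance of the estimate from the null value (0) in standard errors
z = (slope - 0) / se

# Two-sided p-value from a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p = math.erfc(abs(z) / math.sqrt(2))
```

Either way, the p-value is nowhere near 0.05, matching the conclusion from the confidence interval.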

So we want to summarize these findings in a paragraph, as if we were writing a results section for an abstract, for example. We might say something like: logistic regression was used to estimate the relationship between breastfeeding and sex of the child, using data from a random sample of 236 Nepalese children 0 to 36 months old. The results showed no substantive (now, this is my interpretation, because the estimated odds ratio in the sample was essentially 1) or statistically significant association between breastfeeding status and sex. And so I report the odds ratio, the estimate we got from the sample, rounded to two decimals as 1.00, with the confidence interval we just computed, 0.55 to 1.83. So the computations necessary to get the interval are done on the log odds ratio scale, but we would ultimately present things on the ratio scale.

How about the relationship between the risk of obesity and HDL cholesterol levels that we estimated from the NHANES data? Recall the result looked like this: the estimated intercept was negative 0.05, and the estimated slope was negative 0.033. The computer will give me estimated standard errors for each of these quantities, so we could go ahead and create a confidence interval for the slope, for example. If we do that, we take our estimate, negative 0.033, and add and subtract two estimated standard errors of 0.003. If you do this, we get an interval from negative 0.039 to negative 0.027. Now we can parlay this into a confidence interval for the odds ratio, and just to remind you, the estimated odds ratio is actually equal to 0.967. I rounded it in the previous set of lectures to 0.97, but because this confidence interval, we'll see, is very tight on the ratio scale, I'm going to represent it here to three decimal places. So to get the 95% confidence interval for the population-level odds ratio of obesity per 1 unit change in HDL cholesterol, we exponentiate these endpoints.

If we take e to the negative 0.039, we get roughly 0.96, and if we take e to the negative 0.027, we get roughly 0.97. The point estimate is between those two, but not perfectly centered, because we're on the ratio scale. So this is a very tight confidence interval, indicating that after accounting for the uncertainty in our sample-based estimates, the estimated association between obesity and HDL cholesterol level in the population is on the order of anywhere from a 3% to 4% decrease in the odds of obesity per unit (milligram per deciliter) increase in HDL.

Suppose we wanted to do this for a comparison of groups who differed by more than 1 unit of HDL cholesterol, say 100 versus 80 milligrams per deciliter. Well, we could do this in several different ways, but perhaps the easiest is to do it on the slope scale. We know that a difference of 20 units in x1, because 100 minus 80 is 20, results in a difference of 20 times the slope in the log odds. We've seen this many times by now. So the estimated difference in the log odds of obesity for these two groups is 20 times the slope, which estimates the log odds ratio for a 1 unit difference. If you do this, it's equal to 20 times negative 0.033, or negative 0.66. We exponentiate that to get our estimated odds ratio.
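The HDL arithmetic above, including the 20-unit comparison, is the same recipe again: work on the log (slope) scale first, then exponentiate at the end. Using the slope and standard error as given in this passage:

```python
import math

slope, se = -0.033, 0.003  # per 1 mg/dL of HDL, as reported above

# 95% CI on the log odds ratio scale
lo, hi = slope - 2 * se, slope + 2 * se            # -0.039 to -0.027

# Odds ratio CI per 1 mg/dL: exponentiate the endpoints
or_1 = (math.exp(lo), math.exp(hi))                # ~0.962 to ~0.973

# For a 20 mg/dL contrast (100 vs 80), multiply on the log scale first
or_20_est = math.exp(20 * slope)                   # ~0.52
or_20_ci = (math.exp(20 * lo), math.exp(20 * hi))  # ~0.46 to ~0.58
```

Multiplying the log-scale endpoints by 20 before exponentiating is what turns a modest 3% to 4% per-unit decrease into nearly a halving of the odds for the 20-unit comparison.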

If we exponentiate, it's 0.52. So this decrease compounds relatively quickly, and we'd say those with 100 milligrams per deciliter have nearly half the odds of obesity compared to those with 80 milligrams per deciliter. But this is just an estimate based on our sample. To get the confidence interval for a multiple of the slope, all we have to do is take the endpoints of the slope's confidence interval and multiply them by the difference between the two groups we're comparing. So we take 20 times the lower endpoint we just computed for the confidence interval for the slope, and 20 times the upper endpoint, and I'll let you verify this. But if you then exponentiate to get the confidence interval on the ratio scale, it becomes 0.46 to 0.58. So after accounting for the uncertainty in our estimate of the odds ratio for this comparison, we see that this 20 unit difference in HDL cholesterol levels is associated with a decrease in the odds of obesity of anywhere from 42% to 54% at the population level.

We've got one more example: respiratory failure and gestational age. Remember, this is where we had an ordinal but categorical predictor. Our reference group, the one whose x values were all equal to 0, were children who were 37 plus weeks in gestational age, and x1, for example, is an indicator of being 34 weeks. So let's take that on for a minute. First, and we haven't done this yet, we can easily estimate the confidence interval for the intercept. The intercept, in this case, is a really useful quantity applicable to our dataset: it's the log odds of respiratory failure for children who are full term, 37 plus weeks, because that's our reference group, where all x values are 0. The estimated intercept is negative 5.5, and its standard error is equal to 0.039. So if we take negative 5.5 plus or minus 2 times 0.039, we get an interval that goes, roughly speaking after rounding, from negative 5.58 to negative 5.42. Certainly this does not include our null value of 0 on this scale. If we exponentiate these endpoints, we get an interval from 0.0038 to 0.0044. This is a confidence interval not for a ratio, because there's no comparison here; it's the confidence interval for the odds in this group. The estimated odds, I should have stated, e to the negative 5.5, is 0.004. So we estimate very low odds, and that would translate into a very low probability (which we'll see in the next section) of respiratory failure for full-term children. And after we account for the uncertainty, we see that the interval is relatively tight on the odds scale, and all the possibilities indicate low odds. Now, if we wanted an odds ratio comparing the relative odds of respiratory failure for those most premature in our dataset, those with a gestational age of 34 weeks, to this reference group, well, we have the estimated log odds ratio of 3.4, and we add and subtract 2 estimated standard errors.
If we do the math, we get a confidence interval here, on the log odds ratio or slope scale, of 3.27 to 3.53. Now, the estimated odds ratio we got was approximately 30. If we exponentiate the endpoints of our confidence interval for the slope, the 95% confidence interval for the odds ratio goes from 26.3 to 34.1. So it's pretty convincing, even after accounting for the uncertainty in our data, that being premature at 34 weeks is highly associated with respiratory failure as compared with children who are full term. It's a strong risk factor. How would we get the p-value for the slope if we wanted to test? We already know that the confidence interval for the odds ratio did not include 1; it was way off from 1, as a matter of fact. But to get the exact p-value, instead of just comparing the interval to the null, we need to do this computation: we measure how far our result was from the log odds ratio we'd expect under the null, 0, in terms of standard errors. If we do this, we get a test statistic, a distance measure, of 51.2. Our result is more than 51 standard errors above what we'd expect if there were no association between low gestational age and increased or decreased respiratory failure risk compared to full-term infants. So I don't have to tell you that this p-value is very, very small. In summary, it's business as usual: 95% confidence intervals for the slopes and intercepts can be found by generically taking our beta-hat, whether it's naught or one, and adding or subtracting 2 standard errors.
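Both gestational-age computations, the intercept interval and the 34-week slope interval, follow the same pattern. One assumption here: the reported slope interval 3.27 to 3.53 implies a standard error of about 0.065 (the transcript's test statistic of 51.2 suggests roughly 0.066, a small rounding difference):

```python
import math

# Intercept: log odds of respiratory failure for the full-term (37+ week)
# reference group, with its reported standard error.
b0, se0 = -5.5, 0.039
odds_ci = (math.exp(b0 - 2 * se0), math.exp(b0 + 2 * se0))  # ~0.0038 to ~0.0044
# Note: this is an interval for the ODDS themselves, not an odds ratio.

# Slope: log odds ratio for 34 weeks vs full term; SE inferred from the
# reported interval 3.27 to 3.53.
b1, se1 = 3.4, 0.065
or_ci = (math.exp(b1 - 2 * se1), math.exp(b1 + 2 * se1))    # ~26.3 to ~34.1

# Test statistic: distance from the null (slope = 0) in standard errors
z = (b1 - 0) / se1                                          # ~52
```

A z-statistic in the fifties puts the p-value far below any conventional threshold, consistent with the interval excluding 1 by a wide margin.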

These results can be exponentiated to get confidence intervals on the odds ratio and odds scales. And if we want to test whether the odds ratio we're estimating with the exponentiated slope is different from 1, it's business as usual as well: we measure how far our log odds ratio estimate is from 0 in terms of standard errors and get a p-value based on that. In the next section, we'll look at how to translate the results from logistic regression into risks and probabilities under allowable study designs.
