A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods


From the lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings, and welcome to lecture 6, section C. In this section we'll talk about multiple linear regression. We'll comment briefly on how researchers, when they're fitting regression models, decide which models to choose to present in terms of summarizing their research results. And we'll also talk briefly about using these models to estimate outcomes for different groups in the population, based on the sample results.

So in this section we'll extend the concept of least squares to the estimation of multiple linear regression models; understand what the linearity assumption is as it applies to multiple linear regression; explain different strategies for picking a final multiple linear regression model among candidates (or final models, if you want to show the results of more than one multiple regression model using the same outcome and different combinations of predictors or potential predictors); and then use the results of a multiple linear regression model to compare groups who differ in more than one predictor value and to estimate means for groups given their x values.

So let's first talk about least squares for multiple regression. The algorithm that estimates the equation of the multiple linear regression line (I've used the acronym MLR here just for brevity) is called the "least squares" estimation procedure. The idea is to find the line, and actually, as we'll discuss in more detail in a minute, when we have multiple predictors, more than one x, the object in space is multi-dimensional, like a plane or something with more than three dimensions. The idea is to find the object that gets closest to all of the points in the sample when they're plotted in the number of dimensions for the outcome and predictors they have.

So how do we define closeness to multiple points? Well, we can take these things that exist in multiple dimensions and reduce them to a single estimate by looking at the predicted mean for each observation, given its multiple x input values. So in regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of y-hat, the estimated mean for that point's set of x values; in other words, the squared distance between an observed y value and the estimated mean y value for all points with the same values of x. This distance can be computed for each data point in the sample, and the algorithm that chooses the values of the intercept and the slopes, least squares, chooses the values that minimize the cumulative squared distance of each outcome value from its estimated mean for all observations like it, based on having the same x values.

So the algorithmic approach is the same, and the computer does it, but with more than one predictor the linear regression model is no longer estimating a line in two-dimensional space. For example, with two predictors, the shape being described by the regression equation is a plane in three-dimensional space. And for more than two predictors, when we get into 4-, 5-, 6-, 8-, 12-dimensional spaces, mere mortals like ourselves cannot visualize the resulting shape being estimated with one graphic.
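Although the lecture leaves the fitting to the computer, the least squares idea sketches easily in code. Below is a minimal illustration on simulated data (the numbers and names here are hypothetical, not from the lecture's examples): with two predictors, the fitted object is a plane, and `np.linalg.lstsq` picks the intercept and slopes that minimize the cumulative squared distance between each observed y and its estimated mean.

```python
import numpy as np

# Hypothetical illustration: simulate an outcome driven by two predictors,
# then recover the plane (intercept + two slopes) by least squares.
rng = np.random.default_rng(0)
n = 150
x1 = rng.normal(60, 5, n)   # e.g., a height-like predictor
x2 = rng.normal(5, 1, n)    # e.g., a weight-like predictor
y = 10 + 0.2 * x1 + 1.5 * x2 + rng.normal(0, 1, n)

# Design matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# np.linalg.lstsq minimizes the sum of squared residuals sum((y - X @ b)**2)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to the true values (10, 0.2, 1.5)
```

With real data the same call applies; only the design matrix changes.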

So what does that mean, then, about the linearity assumption, which was relatively straightforward to explain and assess when we had one outcome and one predictor? Well, where the linearity assumption, and the name linear regression, comes in is this: multiple linear regression assumes that the adjusted association being estimated between y and each xi, if there are multiple x's, is linear. In other words, the relationship between the mean of y and each x, in a multiple model, is linear in nature.

This is an issue for continuous predictors; it's not for binary or multi-categorical predictors, because those essentially always have a linear relationship. With a binary predictor we're just connecting two means, and with multiple categories we're connecting the mean for the reference group to the respective means for each of the other categories. Again, two points define each line, so there's no question of whether the comparison is linear; it has to be, since there are only two points being compared by each slope. But for continuous predictors, this is the issue we have to contend with.

This can be assessed by looking at what are called adjusted scatter plots of each y, xi relationship, adjusted for all the other variables used in the multiple regression model. We won't cover that or do that in this class, but if you take a course in fitting models and doing data analysis, it will be covered. It's an extension of the idea of looking at a y/x scatter plot and making decisions about whether linearity is appropriate or not.
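For the curious, here is one minimal, hand-rolled version of that idea, a component-plus-residual (partial residual) plot, on simulated data. This goes slightly beyond what the lecture covers, and all names and numbers below are hypothetical:

```python
import numpy as np

# Hypothetical data: y depends linearly on x1 and x2
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 2 + 1.0 * x1 + 0.5 * x2 + rng.normal(0, 1, n)

# Fit the full multiple regression and keep the residuals
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Component-plus-residual for x1: residual plus x1's fitted contribution.
# Plotted against x1, this should look roughly linear if the linearity
# assumption holds for x1 after adjusting for x2.
partial_x1 = resid + beta[1] * x1
# e.g., plt.scatter(x1, partial_x1) to inspect visually
```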

So how do researchers go about choosing a final model? In this class we're more concerned with interpreting the results that others present in their research, but I just want to give some thoughts on how researchers choose what to present: when dealing with multiple regressions, which models to present for a given outcome, whether to present one final model or several, and how to choose those models. Model building and selection is a combination of science, statistics, and the research goals, which are related to the science. So just some general thoughts here.

If the goal is to maximize the precision of the adjusted comparison for each slope in the regression model, then the idea might be to keep only those predictors that are statistically significant in the final model. And of course, if you have a main predictor of interest, a main exposure, you would keep that in regardless of statistical significance. But the idea here is that if we're trying to estimate associations as precisely as possible, then things that don't add information statistically about the outcome, after we've accounted for other things, are just going to steal precision from the things that are truly associated in the population, because we need to estimate more slopes with the same amount of data. So by trimming, if you will, the dead weight, the things that don't add information statistically, we'll get more precise estimates of the things that do.

If the goal is to present research results comparable to the results of similar analyses presented by other researchers on similar or different populations, the choice is different. For example, if we wanted to look at the association between anthropometric measures in US children, and we wanted to compare it to the findings of the researchers in Nepal, who had presented a final analysis that included weight, height, and sex as predictors, then we would want to do the same even if some of those were not significant in our analysis. In order to make comparisons between findings adjusted for the same things, we want to be sure to include the same predictors that the other researchers did, regardless of statistical significance in our results. We could then comment on any difference in statistical significance that we found, tie that into the power of the study, etc., and make conclusions about potential differences in growth patterns in infants in the US versus Nepal.

Another goal is to show what happens to the magnitude of an association with different levels of adjustment. For example, suppose we're interested in one relationship, the relationship between BMI and the average amount of slow wave sleep somebody gets (we'll look at this example in the next section), but that association is potentially confounded by other variables. A nice way to show the sensitivity of that estimate is to start with the unadjusted estimate, then show the results of that estimate adjusted for various combinations of potential confounders, to show either how robust it is, regardless of what is used to adjust, or how much it changes with different levels of adjustment.

If we want to see how well this mean model, which estimates the mean as a function of multiple predictors, predicts for individuals from the same population who were not studied in our data set, well, that's a little more complicated, and we will discuss it briefly later in the course.

So let's talk about prediction. You might call it prediction even though I just said the models we're looking at are for estimation; when I say prediction in this context, we mean estimating means for different groups of a population based on the results from multiple regression models.

So recall the arm circumference results based on the sample of 150 Nepalese children less than 12 months old. This is what model 3, the largest multiple regression model presented, the one with the most predictors, looks like written out as an equation: the mean arm circumference is estimated by taking 14.4 + (-0.17) times the children's height + 1.46 times the children's weight + 0.3 times the children's sex (1 for female, 0 for male).

So if we want to estimate the mean arm circumference for female children who are 62 cm tall and 5.6 kg in weight, what would that look like? In this case it would look like 14.4 + (-0.17)(62) + 1.46(5.6) + 0.3, and if we do this out, it's equal to 12.336 centimeters. I'm just going to round that to 12.34 centimeters.
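That plug-in computation can be checked with a few lines of code. The coefficients are the ones quoted above; the function and argument names are mine:

```python
# Fitted model 3 from the lecture: mean arm circumference (cm)
#   = 14.4 - 0.17*height_cm + 1.46*weight_kg + 0.3*female
def mean_arm_circumference(height_cm, weight_kg, female):
    """Estimated mean arm circumference in cm; female is 1 or 0."""
    return 14.4 + (-0.17) * height_cm + 1.46 * weight_kg + 0.3 * female

# Female children, 62 cm tall, 5.6 kg in weight
est = mean_arm_circumference(62, 5.6, female=1)
print(round(est, 2))  # 12.34
```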

So this result is the estimated mean arm circumference for this group of children. Interestingly enough, the overall mean arm circumference for everyone, ignoring weight, sex, and height, is 12.4, so for this one prediction that's actually relatively close to what we would've predicted for this group had we ignored weight, height, and sex.

But for other groups, the estimated mean arm circumference will be very different, given their weight, height, and sex. This overall model, the one including weight, height, and sex, had an R-squared of 78%. The implication is that while there's still variation of individual values around their height-, weight-, and sex-specific predicted arm circumference means, the individual values will generally be a lot closer to those means than they would be to a single mean used for everyone, that is, the overall mean of the sample of 150 arm circumferences, which ignores height, weight, and sex.

Then I wanted you to estimate a mean difference. We already did this for the female children 62 cm tall and 5.6 kg in weight; we want to compare them to male children 58 cm tall and 4.5 kg, and I'm just going to rewrite this out to make a point. The estimate we got for the females, written out, was 14.4 + (-0.17)(62) + 1.46(5.6) + 0.3; that was our 12.34. If we do the same thing for the male group who are 58 cm tall and 4.5 kg, we get 14.4 + (-0.17)(58) + 1.46(4.5), plus 0 for sex because they're male and hence in the reference group. If we do this out, it turns out to be 11.11 centimeters, so this group mean is lower. And if we take the difference in these two means, it turns out to be 1.23 cm. We actually don't have the tools at hand to create a confidence interval for that, but a computer could produce a confidence interval for this estimated mean difference, and for these estimated group means as well.

I want you to notice something, though, if we do this piecewise, looking at the parts that cancel out and the parts that differ. If we subtract in pieces, the intercept cancels. The piece due to the 4-centimeter difference in height is the slope for height times the difference in height, -0.17 times 4. We're left with the slope for weight times the difference in weight, 5.6 minus 4.5, and we're left with the 0.3 for sex. So if you look at this piecewise and add the difference that's due to the height differences, the difference that's due to the weight differences, and the difference that's due to sex, it also turns out to equal 1.23. So we can break the overall difference down into its components, the slopes for each of the factors that differ between the two groups times those differences.
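As a quick sketch (same coefficients as before; the variable names are mine), the direct difference and the piecewise sum of slope-times-difference components agree, because the intercept cancels:

```python
# Lecture's model 3 coefficients: mean arm circumference (cm)
b0, b_ht, b_wt, b_fem = 14.4, -0.17, 1.46, 0.3

def est_mean(height_cm, weight_kg, female):
    return b0 + b_ht * height_cm + b_wt * weight_kg + b_fem * female

# Direct difference: females (62 cm, 5.6 kg) minus males (58 cm, 4.5 kg)
direct = est_mean(62, 5.6, 1) - est_mean(58, 4.5, 0)

# Piecewise: intercept cancels, leaving slope * difference per predictor
piecewise = b_ht * (62 - 58) + b_wt * (5.6 - 4.5) + b_fem * (1 - 0)

print(round(direct, 2), round(piecewise, 2))  # 1.23 1.23
```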

So we've shown that example; now let's look at the regression for emergency department waiting times. We're going to focus on the adjusted model and estimate means for different groups, knowing that there are real and statistically significant differences in the means across different groups in this population based on their race, sex, age, and payer types, but that there's still a lot of individual variability in waiting times around the means for each of those groups in the population. So the results we get from this model do not predict any one individual's waiting time well at all, but our mean estimates will be slightly better than if we used the overall waiting time in the sample for each of these groups.

For this, to start, I'd like you to estimate the mean waiting time for black males, 35 years old, with private insurance, using the multiple linear regression results. To make the approach easier for both of us, I'm going to write the estimation in a vertical way, adding the respective slopes of interest to the intercept.

We start with the intercept, which is 46.5. Then, because this group is on private insurance, the reference group for payer type, we add nothing. Because this group of black males is 35 years old, they fall in the age category of 35 to 54 years, so we add 4.9 minutes to the average. Because they identify as black, we add 19.0 minutes to the computation, and because they are male, we add -2.1 minutes. If you do this addition, 46.5 plus 0 plus 4.9 plus 19.0 plus -2.1, the average waiting time for this group is 68.3 minutes.

Then I wanted you to estimate the mean difference in waiting times between the black males 35 years old on private insurance, the group we just did with its mean estimate of 68.3 minutes, and white females 30 years old with public insurance. I'm just going to line these up again, quickly rewriting the components for the black males who are 35 years old on private insurance: the intercept of 46.5, 0 for being on private insurance, 4.9 for being 35 to 54 years old, 19.0 extra minutes on the average wait time for being black, and a reduction of 2.1 minutes for being male.

Now let's see how it plays out for the white females who are 30 years old on public insurance. They still get the intercept of 46.5. They're on public insurance, so that adds 3.3 minutes to their average. They're 30 years old, so they're in the 20 to 34 year old category, which adds 5.2 minutes to the average. They're white, so they're in the reference group and don't get anything for race, and they're female, so they're in the reference group and don't get anything for sex either.

So the estimated average wait time for this group of black males is 68.3 minutes, and the estimated average wait time for this group of white females is 55 minutes when you do out the math.

And so the difference in averages between these two groups is 13.3 minutes: the 35-year-old black males with private insurance have an average wait 13.3 minutes greater than the 30-year-old white females on public insurance. And if you go across, look at where the components differ, and add up those differences (the -2.1 that the first group gets for being male, the 19.0 for being black, the slightly smaller age contribution for being 35 to 54 years old compared to 20 to 34 years old, and zero minus 3.3 for payer type), you also get 13.3. So you can dissect this difference into its component parts across the different predictors on which these groups differ.
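The whole vertical tally can be sketched as code. A caveat: the coefficients below are as quoted in this section, with the race coefficient taken as 19.0 minutes, the value consistent with the quoted totals of 68.3, 55.0, and 13.3; the age reference category is not named in this excerpt, and the function and names are mine:

```python
# Adjusted ED waiting-time model, coefficients (minutes) as quoted above.
# Reference levels: private insurance, white, female; the age reference
# category is not named in this excerpt, so only the quoted levels appear.
INTERCEPT = 46.5
PAYER = {"private": 0.0, "public": 3.3}
AGE = {"20-34": 5.2, "35-54": 4.9}
RACE = {"white": 0.0, "black": 19.0}
SEX = {"female": 0.0, "male": -2.1}

def mean_wait(payer, age, race, sex):
    """Estimated mean waiting time in minutes for a group."""
    return INTERCEPT + PAYER[payer] + AGE[age] + RACE[race] + SEX[sex]

black_males = mean_wait("private", "35-54", "black", "male")
white_females = mean_wait("public", "20-34", "white", "female")
print(round(black_males, 1), round(white_females, 1),
      round(black_males - white_females, 1))  # 68.3 55.0 13.3
```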

So, in summary: multiple regression results can be used to estimate mean outcomes for a given subset of a population, given the predictor values put into the model. Multiple regression results can also be used to estimate mean differences between groups in the population who differ by more than one characteristic or predictor. Confidence intervals for the estimated means and the estimated mean differences can be computed using a computer, and their interpretation is the same as before, but we don't have an easy way to estimate the standard errors of these estimates by hand.

In the next section we'll look at the results of regressions from research articles, and we'll talk about how the authors report on how they chose their final regression models to present, and what their subsequent interpretations were.

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.