The numerator was disproportionately male and I'm just going to put a bunch of M's and

several F's to represent that the smokers were majority male,

and in the denominator, the reverse.

This is not drawn,

so to speak, to proportion but just to illustrate the point.

So, the numerator was over-represented by males who were less likely to have the disease,

and the denominator was over-represented by

females who were more likely to have the disease.

So when we compare smokers directly to non-smokers,

and don't take into account these differing sex distributions,

the risk in the numerator is pulled down because of

the higher presence of males who were less likely to have the disease.

So, what we're showing here in this example is something called Simpson's paradox.

So the nature of an association can change, even reverse direction, or

disappear when data from several groups are combined to form a single group.

So, we started with all males and females,

and we missed the relationship between disease and smoking.
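
To make the reversal concrete, here is a minimal sketch in Python. The counts are made up for illustration and are not the data from this lecture; the names `counts`, `risk`, and `relative_risk` are likewise just for this example. Males are mostly smokers and have low disease risk, females are mostly non-smokers and have higher risk.

```python
# Hypothetical counts (NOT the lecture's data): (number of people, number with disease)
counts = {
    ("male", "smoker"): (80, 16),
    ("male", "non-smoker"): (20, 2),
    ("female", "smoker"): (20, 12),
    ("female", "non-smoker"): (80, 32),
}


def risk(cells):
    """Proportion with disease after pooling the given (n, diseased) cells."""
    n = sum(c[0] for c in cells)
    diseased = sum(c[1] for c in cells)
    return diseased / n


def relative_risk(sexes):
    """Risk in smokers divided by risk in non-smokers, pooled over `sexes`."""
    smokers = [counts[(s, "smoker")] for s in sexes]
    non_smokers = [counts[(s, "non-smoker")] for s in sexes]
    return risk(smokers) / risk(non_smokers)


rr_male = relative_risk(["male"])
rr_female = relative_risk(["female"])
rr_crude = relative_risk(["male", "female"])  # sexes combined, sex ignored

print(f"RR among males:   {rr_male:.2f}")
print(f"RR among females: {rr_female:.2f}")
print(f"Crude RR:         {rr_crude:.2f}")
```

With these made-up counts, smokers have the higher risk within each sex (relative risks of 2.0 and 1.5), but the combined relative risk comes out at about 0.82, below 1, because the smokers are mostly low-risk males.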

Another way to say this is association between an exposure X,

like smoking in our situation,

and an outcome Y, like disease,

can be confounded by another lurking or

hidden variable Z or multiple variables Z1, Z2, etc.

So, a confounder Z, or multiple confounders Z1 through Zp for example,

can distort the true relation between X and Y.

This can happen if any of our confounders is related to both our exposure and outcome.

So just to recap,

our outcome of interest was disease.

We were assessing its relationship with smoking,

and we had this third variable sex,

which was related to both.

So, if we call this male sex,

males versus females, it was negatively associated with disease.

Males were less likely to have disease,

but males were more likely to smoke.

So, since sex was related to both of these, it had the potential to, and ultimately

did, distort our understanding of

the relationship when we combined males and females together.

So, in the next section, we'll talk about how to

actually come up with a single measure that removes

the distortion from that sex imbalance

in the distribution between smokers and non-smokers.

So again, another example of confounding in action:

here's an observational study (we looked at these data before) to

estimate the association between arm circumference and height in Nepali children.

So 150 randomly selected children,

0-12 months old, had their arm circumference, weight, and height measured.

This is certainly an observational study:

our exposure of interest is height, and

it's not possible to randomize subjects to height groups.

The data look like this: here are the ranges of values for

arm circumference, height, and weight in these data.

So you'll recall from the simple linear regression section we had done this analysis

for arm circumference and height, and there was a positive association.

Here's our scatter plot of the arm circumference values versus

the height values in

the Nepali children with the estimated regression line superimposed on the graph.

We know that there was relatively high correlation there,

and it was a positive association.

But as you might suspect, weight,

this third measure we could be thinking of, is certainly

associated with both arm circumference and height,

that is, with our outcome of arm circumference and our primary predictor, height.

So here's a scatter plot of the relationship between arm circumference and

weight with a regression line relating arm circumference to weight superimposed,

and the R-squared for that association is 0.7,

so it's a positive association.

The correlation is relatively high,

and if we look at height,

our predictor of interest, versus weight as well,

we see a strong positive association with an R-squared of 0.85.

So weight is certainly related to arm circumference and height.

So, if we take into account weight,

we may get a different understanding of

the relationship between arm circumference and height.

So, it turns out here's a scatter plot of

arm circumference by height after adjusting for weight.

What we have on this graph, one way to think of it,

is this is among persons of the same weight,

this shows the relationship between arm circumference and height.

So, now there's no longer variability in these data

in terms of weight values;

we're considering only persons of the same weight.

So, think about that for a moment.

Now the relationship between arm circumference and height is negative in nature.

Does that make some sense?

I want you to think about this.

If we restrict our assessment to persons of the same weight, would it make sense

in those weight groupings to have

a negative association between arm circumference and height?

Just chew on that, and we'll come back and talk about that in the next section.
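
If you want to see this kind of sign change for yourself, here is a rough sketch on simulated data. These are not the Nepali children data: the variables are in made-up standardized units, and the coefficients (0.9, 1.0, -0.5) and noise levels are assumptions chosen only so that weight is strongly related to both height and arm circumference. The helper `ols` is a name introduced just for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150  # same sample size as the study, but every value below is simulated

# Assumed data-generating model, for illustration only (standardized units).
weight = rng.normal(0.0, 1.0, n)
height = 0.9 * weight + rng.normal(0.0, 0.4, n)
arm = 1.0 * weight - 0.5 * height + rng.normal(0.0, 0.3, n)


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one slope per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Unadjusted: arm circumference on height alone.
crude_slope = ols(arm, height)[1]

# Adjusted: arm circumference on height, holding weight fixed.
adjusted_slope = ols(arm, height, weight)[1]

print(f"Unadjusted slope for height:      {crude_slope:.2f}")
print(f"Weight-adjusted slope for height: {adjusted_slope:.2f}")
```

In this simulation the unadjusted slope for height is positive and the weight-adjusted slope is negative. Including weight as a second predictor is one way to compare persons of the same weight; it is a sketch of the idea, not necessarily how the graph in the lecture was produced.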

Something to consider for those of you who are lab-based scientists:

confounding is a big issue in laboratory studies, and it has only

recently been talked about and accounted for, in the past 10-15 years.

This is something often called batch effects in lab-based analyses.

So, lab-based results can be influenced by the technician,

the laboratory used, the time of day,

the temperature in the lab.

If the goal of the study is to ascertain differences in lab measurements between groups,

for example, between diseased and non-diseased groups,

and the group is associated with at least some of the above characteristics,

the technician, the laboratory, etc.,

then there can be confounding. In the most egregious examples,

the most difficult to understand, a quantity or quantities were

measured on a group of patients with disease in Lab 1,

and on a group of patients with no disease (it was a case-control setup)

in Lab 2 at a different time,

and differences were found in some gene expressions between diseased and non-diseased.

But it was impossible because of the setup to disentangle

whether the disease results

in a different gene expression, or is associated with a different gene expression,

than in those without disease, or whether this was

an artifact because of using two different labs.

So, lab-based researchers have become more cognizant of

these ideas, not only thinking that they need to be adjusted for whenever possible,

but also this gives rise to some changes in how they might do the study design,

where they might for example,

randomize samples from diseased and non-diseased to one of the two labs.

So, there would be variability in

the outcome types in both the labs because of the randomization.

We'll get into talking about the role of randomization in reducing

or minimizing the potential for confounding,

and we'll talk about that in this situation as well.
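
As a rough illustration of that design idea, here is one possible way to randomize samples to the two labs. The sample IDs and group sizes are hypothetical, and shuffling within each disease group and then alternating labs is just one simple scheme that keeps the labs balanced; it is an assumption for this sketch, not a procedure described in the lecture.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the assignment can be reproduced

# Hypothetical sample IDs: 10 diseased (D0..D9) and 10 non-diseased (N0..N9).
samples = [(f"D{i}", "diseased") for i in range(10)]
samples += [(f"N{i}", "non-diseased") for i in range(10)]

assignment = {}
for status in ("diseased", "non-diseased"):
    ids = [sample_id for sample_id, s in samples if s == status]
    rng.shuffle(ids)  # random order within this disease group
    for position, sample_id in enumerate(ids):
        # Alternate labs down the shuffled list: half of each group per lab.
        assignment[sample_id] = "Lab 1" if position % 2 == 0 else "Lab 2"

# Cross-tabulate disease status by lab to check the balance.
status_of = dict(samples)
table = Counter((status_of[sid], lab) for sid, lab in assignment.items())
for key in sorted(table):
    print(key, table[key])
```

Each lab ends up with five diseased and five non-diseased samples, so a difference between the labs can no longer masquerade as a difference between diseased and non-diseased.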

So in summary, in non-randomized studies, outcome-exposure

relationships of interest may be confounded by other variables.

In such a situation, the relationship between the outcome and exposure

differs after taking into account the confounder or confounders of note.

In order to confound an outcome-exposure relationship,

a variable must be related to both the outcome and exposure.

So in the next section, we'll show how to

extend what we did here when breaking things out

into separate groups and estimating

the relationship separately for males and females, for example,

or for different weight groups.

We'll show how to summarize that in

a single number called the adjusted estimate of the association.