This actually happens in proteomics as well.

So this was, again, a proteomic signature that had been published

to identify ovarian cancer, and so this was a very exciting result.

But it turns out that that also had a batch effect, and so

there was a difference between the times that the samples were processed

that correlated with the biology of the samples being processed.

All of these cases represent examples of where there's confounding,

where one variable that you don't care about happens to be highly correlated with

both the outcome you care about and the genomic information.

In fact, it happens in almost every genomic technology.

This is a paper that's worth reading that talks about all the different ways that

this sort of batch effect, or these other confounders,

can be associated with gene expression or genetic information or

proteomic information, basically, any kind of high throughput genomic experiment.

So how do we deal with this problem?

Since it's such a big problem, there needs to be a way to deal with it.

At the experimental design stage, the way to deal with it is randomization.

So here I'm giving an example to show you what's going on.

So here, imagine that we have a confounding variable, and

the level of the confounding variable is given by this scale here.

So dark on this scale means the confounding variable is high, and

light on this scale means the confounding variable is low.

Now, suppose that we have a set of experimental units or

a set of samples, each dot represents a person.

So this person has a high value of the confounding variable, and

this person up at the top has low value of the confounding variable.

Now, suppose that we assign the treatment so that the first five samples go to

the people that have the low variables of the confounding variable, and

the next five of the different treatment goes to the people that have a high value

of the confounding variable.

In that case, there's a strong relationship between the confounding

variable and the treatment, and so you will get this problem where you

can't distinguish the association with any outcome between, whether that's

an effect of the treatment or whether it's an effect of the confounding variable.

An alternative is to randomize, so for every single unit, you flip a coin and

you decide whether it's going to get the green treatment or

whether it's going to get the red treatment.

Now that breaks the relationship with this confounding variable, but, moreover,

since it's random, since you're just flipping a coin,

it should break the relationship with any other confounding variable.

Now in any given sample, if you have a small sample size,

there might still be a relationship with some confounder that you didn't measure.

But as the sample size grows, if you continue to randomize,

there will be an independence between the confounding variable and

whatever treatment that you're trying to do.

So randomization is one way to deal with confounders.

Another way to deal with confounders is called blocking.

So this is a case where, suppose you know that there's a specific confounding

variable that you care about, and

I'm going to use an example here, courtesy of Karl Broman, to illustrate this idea.

So imagine that you're doing an experiment on mice and

you have 20 males and 20 females.

You're going to treat half of them and leave the other half untreated, and

you can only do this experiment on four individuals per day.

So there are different ways that you can assign groups to the mice,

and there are different ways that you could run those mice on different days.