An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One of the most important issues in designing an experiment is confounding,

and one good way to take care of confounding is to use randomization.

This lecture is a little bit about those two concepts.

So, remember the central dogma of statistics: basically, we want to know

something about a big global population, and we're going to use probability to

sample some smaller subset of the population over here.

Then we're going to calculate some statistics and use inference to then say

something about the population quantity that we care about.

When we do this, we're going to often measure a number of variables and

then there will be some that we won't measure.

Regardless of whether we measure it or not, if we're looking for

the association between two variables, say, for example, we're looking for

the association between gene expression and case control status,

there might be other variables that play a role in mediating that relationship.

So here I'm going to use a really simple example to illustrate this idea.

So here's a picture of me and my son.

My son has small shoes and he's not very literate yet.

I have big shoes and you'll have to take my word for it that I'm somewhat literate.

And so if you took a lot of data points like this,

you collected data on fathers and sons, and

you collected information on their shoe size and their literacy, you would see

that there is a strong correlation between shoe size and literacy.

So does that mean there's a really strong direct relationship

between shoe size and literacy?

Probably not.

The reason why is we've missed a variable here.

We didn't collect the age of the person.

So my son is very young and I'm middle-aged, and so

I've had more time to learn how to read than my son has.

And so actually there's an intermediate variable, age,

that's related to both shoe size and literacy.

This is called a confounder, a variable that relates both to

the variable of interest and the variable you're trying to correlate it with.
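The shoe size and literacy example can be sketched as a small simulation. This is a toy illustration, not data from the lecture: the ages, the linear relationships, and the noise levels are all made-up assumptions, and the helper names `pearson` and `residuals` are hypothetical. Both shoe size and literacy are generated from age alone, so any correlation between them comes only through the confounder.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """What is left of y after a simple least-squares regression on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(0)

# Assumed toy model: age drives both variables; they never affect each other.
age = [rng.uniform(2, 40) for _ in range(2000)]
shoe = [0.5 * a + rng.gauss(0, 1) for a in age]
literacy = [3.0 * a + rng.gauss(0, 5) for a in age]

# Correlation ignoring age, and correlation after removing age from both.
raw = pearson(shoe, literacy)
adjusted = pearson(residuals(shoe, age), residuals(literacy, age))

print(f"ignoring age:      {raw:.2f}")
print(f"adjusting for age: {adjusted:.2f}")
```

Under these assumptions the raw correlation is strong, while the correlation left over after accounting for age is close to zero.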

And so this is actually quite a major issue in genetics and genomics.

I'm going to illustrate that with just one concrete example.

So here's a study that was looking for differential expression between different

ethnic groups, and here they found 78% of genes were differentially

expressed between European samples and Asian samples.

So you can see that because there were tons of small p-values.

If you don't know what p-values are and can't interpret that plot, don't worry

about it right now, you'll learn about that more as the class goes on, but

the key point is that there were 78% of genes differentially expressed.

So, it turns out, if you look at all the previous studies that have been done,

there weren't that many studies that showed that

there were these huge differences in genetic information between populations,

and so it was quite surprising.

Then if you went and looked at when the samples were processed,

all of the European samples, here labeled in blue and called CEU for

the European population they came from, were processed from 2003 to 2005,

whereas the Asian samples, labeled ASN for the Asian population,

covered 2005 and later into 2006.

So the samples were actually processed on different days.

Now you can do the analysis in a couple of different ways.

First, you can look at just the differential expression between

populations without taking into account any other variables, and

you get that 78% number we talked about a minute ago.

If you look at just differential expression between years, so

which genes have expression that's associated with the year that the sample

was processed, you get 96% of genes that are differentially expressed.

So that means that there's a stronger signal associated with the year

variable than there is with the population variable.

Once you adjust for year that the sample was processed,

0% of genes are estimated to be differentially expressed.
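Here is a minimal sketch of how adjusting for processing time can make an apparent population difference disappear. It is not the analysis from the study: the single simulated gene, the two batches called "early" and "late", the 90/10 split, and the effect size are all assumptions chosen for illustration. Expression depends only on batch, and the adjustment is done by comparing populations within each batch.

```python
import random

rng = random.Random(1)

def mean(values):
    return sum(values) / len(values)

# One hypothetical gene. Batch shifts expression; population has no true effect.
# Population and batch are strongly (but not perfectly) associated.
samples = []
for population in ("CEU", "ASN"):
    p_late = 0.1 if population == "CEU" else 0.9
    for _ in range(2000):
        batch = "late" if rng.random() < p_late else "early"
        expression = (2.0 if batch == "late" else 0.0) + rng.gauss(0, 1)
        samples.append((population, batch, expression))

def population_difference(rows):
    """Mean expression in ASN minus mean expression in CEU."""
    asn = [e for p, _, e in rows if p == "ASN"]
    ceu = [e for p, _, e in rows if p == "CEU"]
    return mean(asn) - mean(ceu)

# Ignoring batch: the batch shift shows up as a population difference.
unadjusted = population_difference(samples)

# Adjusting for batch: compare populations within each batch, then average.
within_batch = [
    population_difference([row for row in samples if row[1] == b])
    for b in ("early", "late")
]
adjusted = mean(within_batch)

print(f"unadjusted difference: {unadjusted:.2f}")
print(f"adjusted difference:   {adjusted:.2f}")
```

Note that this only works because each population has at least some samples in each batch. If population and batch were perfectly aligned, there would be nothing to compare within a batch.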

So this is actually a very common thing that happens across many

different studies.

So here's another example.

This was a study that found that there were genetic signatures that might be

associated with really long life in humans.

So basically they identified a small subset of genetic variants that appeared

to be associated with living to be over 100.

But it turns out that those samples had been collected from the young people on

one technology and the older people on another technology, and the big

differences between the two populations were likely due to technology, not biology.

This actually happens in proteomics as well.

So this was, again, a proteomic signature that had been published

to identify ovarian cancer, and so this was a very exciting result.

But it turns out that that also had a batch effect, and so

there was a difference between the times that the samples were processed

that correlated with the biology of the samples being processed.

All of these cases represent examples of where there's confounding,

where one variable that you don't care about happens to be highly correlated with

both the outcome you care about and the genomic information.

In fact, it happens in almost every genomic technology.

This is a paper that's worth reading that talks about all the different ways that

this sort of batch effect, or these other confounders,

can be associated with gene expression or genetic information or

proteomic information, basically, any kind of high throughput genomic experiment.

So how do we deal with this problem?

Since it's such a big problem, there needs to be a way to deal with it.

At the experimental design stage, the way to deal with it is randomization.

So here I'm giving an example to show you what's going on.

So here, imagine that we have a confounding variable, and

the level of the confounding variable is given by this scale here.

So dark on this scale means the confounding variable is high, and

light on this scale means the confounding variable is low.

Now, suppose that we have a set of experimental units or

a set of samples, where each dot represents a person.

So this person has a high value of the confounding variable, and

this person up at the top has low value of the confounding variable.

Now, suppose that we assign the treatments so that the first treatment goes to

the five people that have low values of the confounding variable, and

the other treatment goes to the five people that have a high value

of the confounding variable.

In that case, there's a strong relationship between the confounding

variable and the treatment, and so you will get this problem where, for any

association with an outcome, you can't distinguish whether that's

an effect of the treatment or whether it's an effect of the confounding variable.

An alternative is to randomize, so for every single unit, you flip a coin and

you decide whether it's going to get the green treatment or

whether it's going to get the red treatment.

Now that breaks the relationship with this confounding variable, but, moreover,

since it's random, since you're just flipping a coin,

it should break the relationship with any other confounding variable.

Now in any given sample, if you have a small sample size,

there might still be a relationship with some confounder that you didn't measure.

But as the sample size grows, if you continue to randomize,

there will be an independence between the confounding variable and

whatever treatment that you're trying to do.
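The difference between assigning treatments in order and flipping a coin can be sketched like this. Everything here is an assumption for illustration: the confounder is a uniform random number, the sample sizes are arbitrary, and `treatment_confounder_correlation` is a hypothetical helper name.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns NaN if either input has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)

def treatment_confounder_correlation(n, randomize):
    """Correlation between treatment (0 or 1) and a simulated confounder."""
    # Units are listed from lowest to highest confounder value.
    confounder = sorted(rng.random() for _ in range(n))
    if randomize:
        # Flip a fair coin for every unit.
        treatment = [rng.randint(0, 1) for _ in range(n)]
    else:
        # First half gets one treatment, second half gets the other.
        treatment = [0] * (n // 2) + [1] * (n - n // 2)
    return pearson(treatment, confounder)

systematic = treatment_confounder_correlation(10000, randomize=False)
randomized = {
    n: treatment_confounder_correlation(n, randomize=True)
    for n in (10, 100, 10000)
}

print(f"systematic assignment: {systematic:.2f}")
for n, r in randomized.items():
    print(f"coin flip, n={n}: {r:.2f}")
```

With the ordered assignment the treatment is strongly tied to the confounder. With coin flips, a small sample can still show some relationship by chance, but at the largest sample size the correlation should be near zero.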

So randomization is one way to deal with confounders.

Another way to deal with confounders is called blocking.

So this is a case where, suppose you know that there's a specific confounding

variable that you care about, and

I'm going to use an example here, courtesy of Karl Broman, to illustrate this idea.

So imagine that you're doing an experiment on mice and

you have 20 males and 20 females.

You're going to treat half of them and leave the other half untreated, and

you can only do this experiment on four individuals per day.

So there are different ways that you can assign groups to the mice,

and there are different ways that you could run those mice on different days.

So here's one example of a study design that might not be so good.

So here you have two weeks, Week One and Week Two, and then you have four mice

per day, so that's the rows of this matrix here, and then you have a number of days.

So here you've run all of the control samples, and

you made all of the control samples to be females.

So in pink is the sex, female, and in blue the sex is male.

And so here you can see all the controls are female, and

they're all run in Week One.

And all the treated are males, and they're run in Week Two.

And so here again, it's going to be very difficult to distinguish

the control versus treated effect from, say, the female versus male effect, or

the Week One versus Week Two effect.

So an alternative study design is to block these so

that you get sort of an equal balance of all the different confounding

variables that you know about with the treatment variable.

So in this study design, you see that we run some males and

some females on Week One, and some males and some females on Week Two.

We run the treated and controls on both Week One and

Week Two, and we make sure that some of the males and some of the females,

in approximately equal proportions, are controls and treated.

So now all the variables are sort of balanced with respect to each other, and

so that means we will be able to estimate all of those different effects.

Since we have some females who were treated in Week One, we can estimate

the level that's associated with each of the different variables by doing that.
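One way to build a blocked design for this mouse experiment can be sketched as follows. This is only one possible layout, not necessarily the one on the slide: it assumes five days per week, and it makes the balance exact by running one control female, one treated female, one control male, and one treated male every day, with the run order within a day shuffled.

```python
import random
from collections import Counter

rng = random.Random(3)

DAYS_PER_WEEK = 5   # assumption: two weeks of five working days
N_DAYS = 10         # 40 mice at four mice per day

design = []  # one record per mouse
for day in range(1, N_DAYS + 1):
    week = 1 if day <= DAYS_PER_WEEK else 2
    # Each day is a block holding every sex-by-group combination once.
    block = [
        (sex, group)
        for sex in ("female", "male")
        for group in ("control", "treated")
    ]
    rng.shuffle(block)  # randomize the run order within the day
    for slot, (sex, group) in enumerate(block, start=1):
        design.append(
            {"week": week, "day": day, "slot": slot, "sex": sex, "group": group}
        )

# Count mice in every week-by-sex-by-group cell to check the balance.
cells = Counter((m["week"], m["sex"], m["group"]) for m in design)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

Every combination of week, sex, and group ends up with the same number of mice, so none of the three variables is tied to another.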

So this is actually a very long topic,

there's a lot more that you can learn about blocking,

randomization, and study design, but those are a couple of the key ideas.

I would say that there's a couple of other ideas that are worth

keeping in mind just at the high level in experimental design.

One is making sure that your experiment is balanced.

So, in other words, it's a better idea to make sure that you have an equal

number of treated and controls than to have way more treated than controls.
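A quick way to see why balance helps is to look at the variance of the difference between the treated and control means. The sketch below assumes independent measurements with the same variance in both groups, which gives a variance of sigma2 / n_treated + sigma2 / n_control. The total of 40 units and the function name are assumptions for illustration.

```python
def variance_of_difference(n_treated, n_control, sigma2=1.0):
    """Variance of (treated mean - control mean), assuming equal variances."""
    return sigma2 / n_treated + sigma2 / n_control

total = 40  # assumed total number of experimental units

# Try every way of splitting the units into treated and control groups.
splits = {
    n_treated: variance_of_difference(n_treated, total - n_treated)
    for n_treated in range(1, total)
}
best_split = min(splits, key=splits.get)

print(f"best number of treated units: {best_split}")
print(f"variance with 20 vs 20: {splits[20]:.3f}")
print(f"variance with 35 vs 5:  {splits[35]:.3f}")
```

Under these assumptions the even split gives the most precise estimate of the treatment effect, and a lopsided split such as 35 versus 5 more than doubles the variance.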

It's good to have replicates, just like we talked about in a previous lecture.

And it's good to have controls, both negative and positive controls,

where you know what the answer should look like.

This isn't always possible in genomic experiments where you

don't necessarily know whether genes are going to be turned on or turned off, but

it can be done with certain control measurements that are taken,

where you know what the answer is going to be.

So that's a little bit about confounding and randomization.
