An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

92 ratings

Johns Hopkins University

Course 7 of 8 in the Specialization Genomic Data Science

From the lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One of the most important issues in designing an experiment is confounding, and one good way to take care of confounding is to use randomization. This lecture is a little bit about those two concepts. Remember the central dogma of statistics: we want to know something about a big global population, and we use probability to sample some smaller subset of that population. Then we calculate some statistics and use inference to say something about the population quantity that we care about. When we do this, we will often measure a number of variables, and there will be some that we don't measure. Regardless of whether we measure it or not, if we're looking for the association between two variables (say, the association between gene expression and case-control status), there might be other variables that play a role in mediating that relationship.

So here I'm going to use a really simple example to illustrate this idea. So here's a picture of me and my son. My son has small shoes and he's not very literate yet. I have big shoes and you'll have to take my word for it that I'm somewhat literate. And so if you took a lot of data points like this, you collected data on fathers and sons, and you collected information on their shoe size and their literacy, you would see that there is a strong correlation between shoe size and literacy. So does that mean there's a really strong direct relationship between shoe size and literacy? Probably not. The reason why is we've missed a variable here. We didn't collect the age of the person. So my son is very young and I'm middle-aged, and so I've had more time to learn how to read than my son has. And so actually there's an intermediate variable age that's related to both shoe size and literacy.
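The shoe-size example can be simulated. Here is a minimal sketch in which a single confounder, age, drives both shoe size and literacy, so the two are strongly correlated overall but essentially uncorrelated once age is (roughly) held fixed. All of the numeric constants and units are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Age (in years) drives both variables; it is the confounder.
age = rng.uniform(2, 60, n)

# Shoe size and literacy each grow with age until adulthood, plus
# independent noise (hypothetical units, chosen only for illustration).
shoe_size = 20 + 0.3 * np.minimum(age, 18) + rng.normal(0, 1, n)
literacy = 2 * np.minimum(age, 18) + rng.normal(0, 3, n)

# Marginally, shoe size and literacy are strongly correlated...
r_marginal = np.corrcoef(shoe_size, literacy)[0, 1]

# ...but restricted to adults (age roughly held fixed), the
# correlation essentially vanishes.
adults = age >= 18
r_within = np.corrcoef(shoe_size[adults], literacy[adults])[0, 1]

print(f"marginal correlation:          {r_marginal:.2f}")
print(f"correlation among adults only: {r_within:.2f}")
```

The marginal correlation is large only because age moves both variables at once; conditioning on the confounder removes it.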

This is called a confounder: a variable that relates both to the variable of interest and to the variable you're trying to correlate it with. This is actually quite a major issue in genetics and genomics, and I'm going to illustrate it with just one concrete example. Here's a study that was looking for differential expression between different ethnic groups, and it found that 78% of genes were differentially expressed between European samples and Asian samples. You can see that because there were tons of small p-values. If you don't know what p-values are and can't interpret that plot, don't worry about it right now; you'll learn more about that as the class goes on. The key point is that 78% of genes were differentially expressed. It turns out that, if you look at all the previous studies that had been done, there weren't that many showing such huge differences in genetic information between populations, and so this was quite surprising.

Then if you looked at when the samples were processed, the European samples, labeled in blue and called CEU for the European population they came from, all came from 2003 to 2005, whereas the Asian samples, labeled ASN for the Asian population, covered 2005 into 2006. So the samples were actually processed in different years. Now you can do the analysis in a couple of different ways. First, you can look at just the differential expression between populations without taking into account any other variables, and you get that 78% number we talked about a minute ago.

If you look at just differential expression between years, so which genes have expression that's associated with the year that the sample was processed, you get 96% of genes that are differentially expressed. So that means that there's a stronger signal associated with the year variable than there is with the population variable.

Once you adjust for the year that the sample was processed, 0% of genes are estimated to be differentially expressed. This is actually a very common thing that happens across many different studies. Here's another example: a study that found genetic signatures that might be associated with really long life in humans. Basically, they identified a small subset of genetic variants that appeared to be associated with living to be over 100. But it turns out that those samples had been collected from the young people on one technology and the older people on another technology, and the big differences between the two populations were likely due to technology, not biology.
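To make the batch-adjustment idea concrete, here is a small simulated sketch (not the actual study data). Expression depends only on the processing batch, which is almost, but not perfectly, confounded with population. A naive population comparison finds apparent "signal" almost everywhere, while centering each batch first makes it largely disappear. The sample sizes, effect size, and the |t| > 2 cutoff are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 500

# 40 samples: first 20 "European" (pop = 0), last 20 "Asian" (pop = 1).
pop = np.repeat([0, 1], 20)

# Processing batch is almost confounded with population: only 4 samples
# from each population were run in the other batch.
batch = pop.copy()
batch[:4] = 1
batch[-4:] = 0

# Expression depends ONLY on batch (a 2-unit shift), never on population.
expr = rng.normal(0, 1, (n_genes, 40)) + 2.0 * batch

def group_t(y, g):
    """Per-gene two-sample t statistic for a 0/1 grouping vector g."""
    a, b = y[:, g == 0], y[:, g == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                 b.var(axis=1, ddof=1) / b.shape[1])
    return (b.mean(axis=1) - a.mean(axis=1)) / se

# Naive analysis: test population directly -> huge apparent signal.
naive_frac = np.mean(np.abs(group_t(expr, pop)) > 2)

# Adjusted analysis: center each batch first, then test population.
resid = expr.copy()
for b in (0, 1):
    resid[:, batch == b] -= resid[:, batch == b].mean(axis=1, keepdims=True)
adj_frac = np.mean(np.abs(group_t(resid, pop)) > 2)

print(f"fraction of genes with |t| > 2, naive:    {naive_frac:.2f}")
print(f"fraction of genes with |t| > 2, adjusted: {adj_frac:.2f}")
```

Note that the adjustment is only possible because the confounding is imperfect; if every European sample had been run in one batch and every Asian sample in the other, population and batch effects could not be separated at all.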

This actually happens in proteomics as well. This was, again, a proteomic signature that had been published to identify ovarian cancer, and so it was a very exciting result. But it turns out that it also had a batch effect: there was a difference between the times that the samples were processed that correlated with the biology of the samples being processed. All of these cases are examples of confounding, where one variable that you don't care about happens to be highly correlated with both the outcome you care about and the genomic information. In fact, it happens in almost every genomic technology. There's a paper worth reading that talks about all the different ways that this sort of batch effect, or these other confounders, can be associated with gene expression, genetic information, or proteomic information, basically any kind of high-throughput genomic experiment.

So how do we deal with this problem? Since it's such a big problem, there needs to be a way to deal with it. At the experimental design stage, the way to deal with it is randomization. Here's an example to show what's going on. Imagine that we have a confounding variable whose level is given by a scale: dark means the confounding variable is high, and light means it is low. Now suppose that we have a set of experimental units, or samples, where each dot represents a person. One person has a high value of the confounding variable, and another has a low value. Suppose that we assign the treatments so that the five samples of one treatment go to the people with low values of the confounding variable, and the five samples of the other treatment go to the people with high values.

In that case, there's a strong relationship between the confounding variable and the treatment, and you get the problem that you can't tell whether an association with the outcome is an effect of the treatment or an effect of the confounding variable. An alternative is to randomize: for every single unit, you flip a coin to decide whether it gets the green treatment or the red treatment. That breaks the relationship with this confounding variable; moreover, since you're just flipping a coin, it should break the relationship with any other confounding variable as well. In any given sample, if the sample size is small, there might still be a relationship with some confounder that you didn't measure, but as the sample size grows, if you continue to randomize, the confounding variables will be independent of whatever treatment you apply. So randomization is one way to deal with confounders.

Another way to deal with confounders is called blocking. This is for the case where you know there's a specific confounding variable that you care about, and I'm going to use an example here, courtesy of Karl Broman, to illustrate the idea. Imagine that you're doing an experiment on mice: you have 20 males and 20 females, you're going to treat half of them and leave the other half untreated, and you can only run the experiment on four individuals per day. So there are different ways that you can assign treatments to the mice, and different ways that you can run those mice on different days.
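The coin-flip randomization described a moment ago can be sketched with simulated data. The "bad" design deliberately assigns treatment by confounder level, while the randomized design flips a fair coin per unit; the confounder (its scale and the sample size are made up for illustration) ends up badly imbalanced in the first case and nearly balanced in the second.

```python
import numpy as np

rng = np.random.default_rng(2)

# A confounder (e.g. age, or sample quality) measured on 1000 units.
confounder = rng.normal(0, 1, 1000)

# Bad design: give the treatment to the units with the highest
# confounder values, leaving the lowest values as controls.
order = np.argsort(confounder)
bad = np.zeros(1000, dtype=int)
bad[order[500:]] = 1

# Randomized design: a fair coin flip per unit, ignoring everything else.
randomized = rng.integers(0, 2, 1000)

def mean_gap(assign):
    """Difference in mean confounder level between the two arms."""
    return confounder[assign == 1].mean() - confounder[assign == 0].mean()

print(f"confounder gap, sorted assignment:     {mean_gap(bad):.2f}")
print(f"confounder gap, randomized assignment: {mean_gap(randomized):.2f}")
```

Because the coin flips are independent of everything, the same near-zero gap holds for confounders you never measured, which is the real power of randomization.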

Here's one example of a study design that might not be so good. You have two weeks, Week One and Week Two; four mice per day, which are the rows of the matrix; and a number of days. In this design, all of the control samples are female (pink marks females, blue marks males), and they're all run in Week One, while all of the treated mice are male and run in Week Two. So again, it's going to be very difficult to distinguish the control-versus-treated effect from, say, the female-versus-male effect, or the Week One-versus-Week Two effect.

An alternative study design is to block, so that you get an approximately equal balance of all the confounding variables you know about with respect to the treatment variable. In that design, we run some males and some females in Week One, and some males and some females in Week Two; we run treated and control mice in both weeks; and we make sure that the males and the females, in approximately equal proportions, include both controls and treated. Now all the variables are balanced with respect to each other, which means we will be able to estimate all of the different effects. Since we have, for example, some females who were treated in Week One, we can estimate the level associated with each of the different variables. This is actually a very long topic; there's a lot more that you can learn about blocking, randomization, and study design, but those are a couple of the key ideas.

There are a few other ideas worth keeping in mind at a high level in experimental design. One is making sure that your experiment is balanced: in other words, it's better to have equal numbers of treated and controls than to have way more treated than controls. It's good to have replicates, just like we talked about in a previous lecture. And it's good to have controls, both negative and positive, where you know what the answer should look like. This isn't always possible in genomic experiments, where you don't necessarily know whether genes are going to be turned on or turned off, but it can be done with certain control measurements where you know what the answer is going to be. So that's a little bit about confounding and randomization.
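One way to construct a blocked design like the mouse example above is to put one mouse from each sex-by-treatment cell on every day, randomizing only the order within each cell. This is a sketch, not Broman's actual design; the mouse labels and the one-per-cell-per-day rule are assumptions made for illustration.

```python
import itertools
import random

random.seed(0)

# 40 mice: 20 male (M), 20 female (F); within each sex, 10 treated (T)
# and 10 control (C). Capacity is 4 mice per day, so 10 days in total.
# Block so that every day holds one mouse from each (sex, treatment) cell.
cells = {(sex, trt): [f"{sex}{trt}{i}" for i in range(10)]
         for sex, trt in itertools.product("MF", "TC")}
for mice in cells.values():
    random.shuffle(mice)          # randomize run order *within* each block

design = []
for day in range(10):
    for (sex, trt), mice in cells.items():
        design.append({"day": day + 1, "sex": sex,
                       "treatment": trt, "mouse": mice[day]})

# Every day now has 2 males, 2 females, 2 treated, and 2 controls, so
# day, sex, and treatment are all balanced against one another.
for row in design[:4]:
    print(row)
```

Because sex and treatment appear in equal proportions on every day, a day (or week) effect can no longer masquerade as a treatment effect.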
