An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

92 ratings

Course 7 of 8 in the Specialization Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Maybe the biggest confounder in most studies is what's called batch effects. So I'm going to talk a little bit about batch effects and confounders, how you adjust for them, and how you deal with them.

So what are the sources of batch effects? Batch effects is actually quite a broad term when statisticians talk about it. When it's used by biologists and genomic scientists, they often mean the batch the samples were processed in: the time at which a group of samples went through together, or the slide they went through together on. But it turns out that batch effects are a surrogate for lots of different confounders. There could be external factors, like the environment, that might affect the level of genomic measurements; there could be genetic or epigenetic factors that contribute to the overall expression of a gene, or the overall epigenetic profile; and then there could be technological factors. For example, if you have a diligent scientist versus a careless scientist doing an experiment, you might get different results. All of these things are confounders that you might need to adjust for when doing your experiment.

So here's a quick example of that. I've taken three studies; you can find the results in the papers I've linked to here, and I've colored the samples by different variables. In the first case, I've done a clustering and colored by environment, and you can see that the orange environment sort of clusters together into these two clusters. Here, I colored by processing year, and you can see that the purple processing year clusters together. And then here, in this study, I've colored the data by a particular allele, and you can see that the samples with the orange allele cluster together. So these are expression measurements that cluster together by different variables, and any one of these could be a confounder or a batch effect that you have to adjust for if it's not the variable of interest in your study.

This doesn't just affect continuous measurements like gene expression. This is actually data from the 1000 Genomes Project, for a particular genomic location between this base pair and this base pair, with the samples ordered by date here. Even after normalizing the samples to have the same read coverage on average, so the global distributions are the same, you can see that there's still a set of samples here that seem to have a much higher level of coverage for this region than the set of samples that were processed on a different date. So this sort of batch effect appears in lots of different types of genomic data.

So when can you deal with batch effects and when can you not? Well, if each biological group is run on its own batch, then it's impossible to tell the difference between the group's biology and the batch variable when you're doing your statistical analysis; the two are completely confounded. If, on the other hand, you run replicates of the different groups on the different batches, so you get some samples from each group on each batch, then it's possible to distinguish the batch effects from the group effects. So the first step in dealing with batch effects is good study design, with blocking or randomization of samples.
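The design point above can be checked directly: when each biological group is run entirely on its own batch, the group and batch columns of the design matrix are identical, so the model cannot separate them. A minimal NumPy sketch, with made-up sample sizes and labels:

```python
import numpy as np

# Hypothetical design: 2 biological groups, 2 processing batches, 8 samples.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Confounded design: each group run entirely on its own batch.
batch_confounded = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Balanced design: replicates of each group appear on each batch.
batch_balanced = np.array([0, 0, 1, 1, 0, 0, 1, 1])

def design_rank(group, batch):
    # Columns: intercept, group indicator, batch indicator.
    X = np.column_stack([np.ones_like(group), group, batch])
    return np.linalg.matrix_rank(X)

print(design_rank(group, batch_confounded))  # rank-deficient: batch == group
print(design_rank(group, batch_balanced))    # full rank: effects are separable
```

A rank of 2 for the confounded design (instead of the full 3) is exactly the "impossible to tell the difference" situation: the least-squares fit has no unique solution for the group and batch coefficients.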

So the next thing people do is fit a regression model to model the effect of batch. This only works, again, if there isn't high correlation between the phenotype and the batch. So here we're fitting a regression model where Y is the outcome, the genomic measurement we care about; P is the phenotype we care about; and B is the batch variable that we measured in this data set. If P and B are exactly the same thing, the two variables merge into one term and you can't adjust for batch. But if P and B are, say, uncorrelated or orthogonal to each other because of good experimental design, this is straightforward to estimate. Again, we fit many of these regression models: the data are stacked with genes in the rows and samples in the columns, and we relate each row to a set of primary variables and a set of adjustment variables, the batch variables, fitting that regression model over and over again.

You can actually do this a little more cleverly using an empirical Bayes method, which shrinks the batch-effect estimates toward their common mean. This is a very highly cited paper, with over 1,000 citations, for adjusting batch effects.

Another approach applies when you basically don't know what the batch effects are. This is really common in genomics experiments, where batch effects could be due to a large number of things. You end up with a model that looks like this: genes in the rows, samples in the columns, some primary variables that you care about, and then some random variation. Some of that random variation is due to sampling and the usual measurement error, but some of it might be due to batch effects or other things. So what you might want to do is decompose this into the random, independent variation that you would expect when doing linear modeling, and some kind of dependent variation, which you can further break down into an estimated batch variable. The idea is: can you use the data itself to estimate these terms, so you can estimate batch from the data and then adjust for it? There's an algorithm that's been developed for doing this, called Surrogate Variable Analysis.

So imagine you have some simulated data, with your genomic measurements in the rows and your samples in the columns. Suppose there's a difference due to a primary variable that you care about, the difference between the first ten samples and the last ten samples, but there's also a batch variable introducing some difference in expression due to that batch. The first thing you could do is come up with an estimate of the batch variable. You do that by starting from an original estimate of batch, which could be any estimate you come up with that is even remotely correlated with the true batch variable, and then refining it toward the true batch variable, which goes high for two samples, low for two samples, high for five samples, and so forth. Now look at this indicator: the indicator that each gene is not affected by the group variable you care about but is affected by the batch variable. One thing you could do is use the estimate of batch in a linear regression model to update these probability estimates. This algorithm is described in the paper here.
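The measured-batch regression described above can be sketched as a per-gene linear model Y = b0 + b1·P + b2·B, fit across all genes at once. This is a toy simulation with made-up effect sizes and sample counts, not the lecture's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 genes x 8 samples, phenotype P and measured batch B.
n_genes, n_samples = 100, 8
P = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # phenotype of interest
B = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # batch, balanced against P by design

# Simulate expression: batch shifts every gene; phenotype shifts the first 10.
Y = rng.normal(size=(n_genes, n_samples))
Y += 2.0 * B          # batch effect on all genes
Y[:10] += 3.0 * P     # true biological signal in 10 genes

# Fit Y_g = b0 + b1*P + b2*B for every gene simultaneously by least squares.
X = np.column_stack([np.ones(n_samples), P, B])
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # rows: intercept, P, B

print(coef[1, :10].mean())  # phenotype estimates for the signal genes (~3)
print(coef[2].mean())       # batch estimates averaged over genes (~2)
```

Because P and B are balanced here, the fit can recover both effects; if B had equaled P, the design matrix would be rank-deficient and the two terms would merge, as the lecture notes.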

So then you weight the matrix and recalculate some decomposition, say the principal components or the singular value decomposition, and update the estimate of the batch variable. You can then use that estimate to update the probability weights, re-weight the data, and re-estimate the batch variable. Once you've run this iterative algorithm, you've mostly removed the genes driven by the group variable and focused on the genes driven by batch, so the decomposition gives an estimate of batch that is pretty close to the real batch variable. You can then include this estimate as if it were a measured batch variable in your statistical analysis, to adjust for it and remove the batch effect. So this is what you would do if you don't have the batch variable measured.

If you want an introduction to batch effects, this review paper is very good. The paper I mentioned about adjusting for batch effects with empirical Bayes is also a very good introduction to adjusting for batch effects when they're known. And if you want to learn a lot more about surrogate variable analysis, that last technique for estimating batch effects from the data itself, you can check out this paper below.
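The surrogate variable idea can be illustrated with a stripped-down sketch: residualize out the primary variable, then take the top singular vector of the residuals as an estimate of the hidden batch. This omits the iterative reweighting step of the full SVA algorithm, and the effect sizes and batch pattern are assumptions of this simulation, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: group signal in some genes, an unmeasured batch in others.
n_genes, n_samples = 200, 20
group = np.repeat([0, 1], 10)  # primary variable: first vs last 10 samples
batch = np.array([1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
                  0, 0, 0, 0, 1, 1, 1, 0, 0, 0])  # hidden batch variable

Y = rng.normal(size=(n_genes, n_samples))
Y[:20] += 2.5 * group     # genes driven by the primary variable
Y[100:160] += 3.0 * batch  # genes driven by the unmeasured batch

# Step 1: remove the primary-variable fit, leaving residual variation.
X = np.column_stack([np.ones(n_samples), group])
beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
R = Y - (X @ beta).T

# Step 2: the top right singular vector of the residuals serves as a
# surrogate for the unmeasured batch variable.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
surrogate = Vt[0]

# The surrogate should be strongly correlated with the true hidden batch,
# so it can be included as an adjustment covariate in downstream models.
corr = abs(np.corrcoef(surrogate, batch)[0, 1])
print(corr)
```

Including `surrogate` as an extra column in each gene's regression then plays the role of the "measured batch variable" described above.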
