An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

94 ratings


Course 7 of 8 in the Specialization Genomic Data Science


From the lesson

Module 4

In this week we will cover a lot of the general pipelines people use to analyze specific data types like RNA-seq, GWAS, ChIP-Seq, and DNA Methylation studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

As we've seen in many of the analyses we've talked about throughout this class, there are a large number of steps involved in doing a statistical genomics project, from pre-processing and normalization, to statistical modeling, to post hoc analyses of the results that you get. So I wanted to talk a little bit about researcher degrees of freedom. This is an idea that was originally proposed in psychology, in a paper that said, basically, undisclosed flexibility in data collection and analysis allows for presenting anything as statistically significant. What are they talking about here? They're talking about the large number of steps in the data analytic pipeline: from experimental design, all the way from the raw data to the summary statistics, and then finally a p-value at the end. Usually when people talk about statistical significance, they talk about p-values or multiple-testing-corrected p-values, and often a lot depends on that p-value being small enough that a journal will publish the paper, or something like that. That dependence is going down a little bit over time, but originally there was a lot of focus on it. But there are a lot of steps underneath that process, before you get to a p-value, that could change what the p-value is. For example, if you throw out a particular outlier, or if you normalize the data a little bit differently, you might get different results. So there are lots of different ways you can analyze data. The danger here, the way they were talking about it in this paper, is the nefarious case where you keep doing everything you can until you get a p-value that's significant, but you could imagine doing this just by accident. You make a large number of choices when doing a genomic data analysis, and once you've made those choices, you get some result. Maybe you don't like that result, so you redo the analysis. So one thing you have to be very careful about when doing genomic analysis is redoing the analysis too many times. It makes sense to redo the analysis when there's new, updated software, or when new biological or scientific knowledge has been brought to bear. But if you keep redoing it over and over again, you fall into this trap.

You can imagine how that would happen with different teams. This comes from a recent analysis, not in genomics, but it illustrates the point: 29 different research teams were asked to see if referees were more likely to give red cards to dark-skinned players. Each team analyzed the data a little bit differently. The dots represent the different effect sizes that each team estimated, and you can see that they're all different. The confidence intervals, or uncertainty intervals, for each of these estimates are also different from each other. So while they're comfortingly similar for many of the estimates in the middle, you can get quite big variability just by changing the way you analyze the data. You have to be careful not to do this over and over again until you find the one case where you get a large estimate of the effect, even if that estimate is probably not due to anything other than the way you analyzed the data.

The difficult thing here is that if you do a different analysis, particularly if you adjust for different covariates, you are actually answering a different question. The question is conditional on what your model is. If you have a whole bunch of extra covariates in the model, then you're asking: is there a difference in gene expression after I account for all of these other variables? That's a very different question than: is there a gene expression difference overall, which might mean something totally different. So you have to be a little bit careful about this idea of researcher degrees of freedom as it relates to knowing what question you're answering.

This whole idea was summarized in a paper by Andrew Gelman and Eric Loken, where they talk about the garden of forking paths. What they mean is that you start off doing an analysis before you've seen the data, and maybe you have an analysis plan in mind. Then once you collect the data, you realize there's a problem of a particular type. This happens all the time with genomic data. Then you start making decisions based on the data that you've observed, and once you do that, you start playing into this researcher degrees of freedom idea. You're changing the way you analyze the data based on the data, and you can end up in a bit of trouble. So the key is to think ahead right from the beginning: how am I going to analyze these data, and what decisions am I going to make before looking at the data, so that you're not driven by the data and don't end up chasing a false positive.

So the key take-home message here: have a specific hypothesis that you're looking for. With genomic data there's a tendency to do discovery for the sake of doing discovery, without a specific hypothesis, and that can often lead toward this garden of forking paths or these researcher degrees of freedom. Another thing you can do is pre-specify your analysis plan, even if it's just internal to you: say, this is the way we're going to analyze the data, and we're going to stick to it. Then, even if you end up adapting the plan later, it's good to analyze the data once exactly how you planned on analyzing it, even if that plan has problems, just so you know what would have happened, and so you can see if there are big differences in the results and why those differences might be there.
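To make the danger concrete, here is a minimal simulation sketch of the idea above. It is not from the lecture; the four "analysis choices" (log-transform, dropping one outlier, switching to a nonparametric test) are hypothetical stand-ins for the kinds of flexibility being described. Both groups are drawn from the same distribution, so any "significant" result is a false positive; picking the best p-value across the choices inflates the false-positive rate above the nominal 5%.

```python
# Simulation sketch (hypothetical analysis choices): when there is NO real
# group difference, keeping the smallest p-value across several reasonable
# looking analyses inflates the false-positive rate above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def flexible_pvalue(x, y):
    """Smallest p-value across four common analysis choices."""
    ps = [
        stats.ttest_ind(x, y).pvalue,                    # plain t-test
        stats.ttest_ind(np.log(x), np.log(y)).pvalue,    # log-transform first
        stats.ttest_ind(np.delete(x, x.argmax()),        # drop one "outlier"
                        np.delete(y, y.argmax())).pvalue,
        stats.mannwhitneyu(x, y).pvalue,                 # nonparametric test
    ]
    return min(ps)

n_sim, hits_fixed, hits_flexible = 1000, 0, 0
for _ in range(n_sim):
    # Both groups come from the SAME distribution: the null is true.
    x = rng.lognormal(size=20)
    y = rng.lognormal(size=20)
    hits_fixed += stats.ttest_ind(x, y).pvalue < 0.05     # pre-specified plan
    hits_flexible += flexible_pvalue(x, y) < 0.05         # best-of-four

print(f"pre-specified false-positive rate: {hits_fixed / n_sim:.3f}")
print(f"best-of-four false-positive rate:  {hits_flexible / n_sim:.3f}")
```

Because the flexible strategy always includes the pre-specified test among its options, its p-value can only be smaller or equal, so its false-positive rate can only be higher or equal, and in practice it is noticeably higher.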

Another thing you can do, if you have enough data, although that's often not the case in genomics, is use training and test sets: split your data into a first analysis data set, and then validate the results that you get in the remaining data. And then analyze your data once. A very common temptation with genomics is to fit increasingly complicated models until you find more and more things, and that often leads to false positives. The other thing you can do, if you end up running multiple analyses, is report all of them; that gives people the opportunity to judge whether there's potential for data dredging or researcher degrees of freedom in your analysis. So this is a cautionary note: genomic data is complicated, and if you add complicated analysis on top, you can often run into extra false positives.
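The training/test idea above can be sketched as follows. This is a hypothetical example, not code from the course: the simulated expression matrix, group sizes, and thresholds are all made up for illustration. Hits are discovered on one half of the samples only, then checked once in the held-out half.

```python
# Sketch (simulated data): discover differentially expressed genes on a
# discovery half of the samples, then validate those hits once in a
# held-out half. Real signal replicates; analysis-driven noise should not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_per_group = 1000, 40
# Rows = genes, columns = samples (two groups of 40 samples each).
expr = rng.normal(size=(n_genes, 2 * n_per_group))
expr[:50, n_per_group:] += 1.5          # first 50 genes carry a real signal
group = np.array([0] * n_per_group + [1] * n_per_group)

# Split each group's samples in half: discovery vs. validation columns.
disc = np.r_[0:20, 40:60]
val = np.r_[20:40, 60:80]

def pvals(cols):
    """Per-gene two-sample t-test p-values using only the given samples."""
    sub = expr[:, cols]
    a = sub[:, group[cols] == 0]
    b = sub[:, group[cols] == 1]
    return stats.ttest_ind(a, b, axis=1).pvalue

# Discover with a Bonferroni threshold, then validate hits at nominal 0.05.
hits = np.where(pvals(disc) < 0.05 / n_genes)[0]
replicated = (pvals(val)[hits] < 0.05).mean()
print(f"{len(hits)} discoveries, {replicated:.0%} replicate in held-out data")
```

The key discipline is that the validation half is touched exactly once, with the analysis fixed in advance, so it cannot be steered by the discovery results.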
