An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

94 ratings

Course 7 of 8 in the Specialization Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

I'm going to pick up where I left off in the last set of calculations. So I've got these p-values that I've calculated from doing my multiple hypothesis tests, and now I want to correct them for multiple testing. The first correction we might consider is the Bonferroni correction, which controls the family-wise error rate, the probability of even one false positive. One way to calculate this is to use the p.adjust function in R. I pass it the p-values that I originally calculated and tell it to use the Bonferroni method, and it transforms those p-values so that I can apply a threshold directly to the transformed values. You can see that they're mostly 1 after the transformation, and I can look at the quantiles of that distribution to be really clear about how large they are for the most part. Then, if I want Bonferroni control at a family-wise error rate of 5%, I look for all Bonferroni-adjusted p-values less than 0.05.
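To make the Bonferroni step concrete, here's a minimal sketch in pure Python of what p.adjust(p, method = "bonferroni") computes; the function name is my own for illustration, not from the lecture:

```python
def bonferroni_adjust(pvals):
    """Multiply each p-value by the number of tests, capping at 1,
    mirroring R's p.adjust(p, method = "bonferroni")."""
    n = len(pvals)
    return [min(p * n, 1.0) for p in pvals]

pvals = [0.001, 0.02, 0.4, 0.9]
adjusted = bonferroni_adjust(pvals)
# Thresholding the adjusted values at 0.05 controls the
# family-wise error rate at 5%.
significant = [p < 0.05 for p in adjusted]
```

With four tests, only the adjusted 0.004 survives the 0.05 cutoff, which is exactly the thresholding described above.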

In this case there are none, so there are no statistically significant results at a Bonferroni-corrected level, okay? The other thing I could do is adjust for a false discovery rate. We're controlling a different error rate here, but I can use the same function, p.adjust: I pass it the p-values from the F-statistics object and tell it method BH. That's Benjamini-Hochberg, which is one of the ways to control the false discovery rate. The adjusted p-values are again set up so that if I call everything with an adjusted p-value less than 0.05 significant, it will control the false discovery rate at 5%, so I can look at the number of those that are less than 0.05. In this case nothing is significant either way, but that's how you check what's statistically significant. You could also do this quite easily with the qvalue package or with the limma package. If I want to use limma, I can calculate the limma-adjusted p-values with the topTable function: I pass it the eBayes limma model fit from the previous lecture and tell it I want all of the adjusted p-values out. Then, if I look at the adj.P.Val column, got to get the caps right, I end up with adjusted p-values for the limma model.
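For intuition, the Benjamini-Hochberg step-up adjustment that p.adjust(p, method = "BH") performs can be sketched in plain Python; this is my own illustration of the standard algorithm, not code from the lecture:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: working from the largest
    p-value down, take the running minimum of p * n / rank, where rank
    is the p-value's position in ascending order."""
    n = len(pvals)
    # indices of p-values sorted from largest to smallest
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k  # ascending rank of this p-value
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Calling everything with an adjusted value below 0.05 significant
# controls the false discovery rate at 5%.
adjusted = bh_adjust([0.005, 0.03, 0.5])
```

Unlike Bonferroni, the penalty shrinks with rank, which is why BH typically declares more results significant at the same threshold.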

So there are two that are significant after adjustment, not Bonferroni but Benjamini-Hochberg, so this p-value is the Benjamini-Hochberg adjusted one. You can also apply qvalue directly to control the false discovery rate, so here I pass the limma p-values to the qvalue function.

So here, at a p-value threshold it finds 2, and at a q-value threshold of 0.05 it finds 2 as well. It basically tells you, for different calculations and different threshold levels, how many it finds significant.

The other thing it can tell you, and this is the nice thing about qvalue compared to some of the other approaches, is the estimated fraction of null hypotheses. In this case, it estimates that the prior probability, the fraction of null hypotheses, is one. That means there is very little differential expression signal there. You can also apply qvalue to the edge object that we calculated earlier, and you similarly get q-values for the edge object for the likelihood ratio test. So again, here for the likelihood ratio test it's the unadjusted value. You can use the ODP statistic for edge and get slightly more power, but this is the direct F-statistic comparison for multiple testing. In any of these cases, you're calculating either a q-value or an adjusted p-value, comparing it to the usual threshold, and identifying how many things are statistically significant, and that will control the error rate for you.
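The estimated fraction of null hypotheses reported here is essentially Storey's π0 estimate. A simplified single-λ version can be sketched as follows; the real qvalue package smooths the estimate over a grid of λ values, and this function is just my illustration of the idea:

```python
def estimate_pi0(pvals, lam=0.5):
    """Single-lambda version of Storey's pi0 estimate: null p-values
    are roughly uniform on [0, 1], so the density of p-values above
    lam estimates the null fraction. Capped at 1, as a proportion."""
    n = len(pvals)
    return min(1.0, sum(p > lam for p in pvals) / (n * (1.0 - lam)))

# With no real signal, p-values look uniform and pi0 comes out near
# (or capped at) 1, matching the "very little differential
# expression" reading in the lecture.
pi0 = estimate_pi0([0.12, 0.55, 0.61, 0.78, 0.93])
```

When many tests are truly non-null, their small p-values pile up near zero, the fraction above λ drops, and π0 falls below 1.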
