0:02

Previously we talked about statistical significance.

But in general, in genomic studies, you're often considering more than one data set at a time. In other words, you might be analyzing the expression of every one of the genes in your body, or you might be looking at hundreds of thousands or millions of variants in the DNA, or many other multiple testing type scenarios.

So in these scenarios, what you're doing is calculating a measure of association between some phenotype that you care about, say cancer versus control, and every single data set that you collected. Say, a data set for each possible gene.

So in this case, what's happened is people are still applying the hypothesis testing framework. They're using P-values and things like that. But the issue is that that framework wasn't built for doing many, many hypothesis tests at once.

So if you remember when we talked about what a P-value was, it's the probability of observing a statistic as or more extreme than the one you calculated in the original sample. And one property of P-values that's very important, and that we should pay attention to, is that if there's nothing happening, that is, if there's absolutely no difference between the two groups that you're comparing, the P-values are what's called uniformly distributed.

So this is a histogram of some uniformly distributed data. On the x-axis you see the P-value, and on the y-axis the number of P-values that fall into each bin. And so this is what the uniform distribution looks like. What a uniform distribution means is that 5% of the P-values will be less than 0.05, 20% of the P-values will be less than 0.20, and so forth. In other words, when there is no signal, the P-value distribution is flat.
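This flat-null behavior is easy to check with a quick simulation (a sketch using numpy and scipy; the group sizes, number of tests, and random seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Run 10,000 two-sample t-tests where the null is true:
# both groups are drawn from the same N(0, 1) distribution.
group_a = rng.normal(size=(20, 10_000))
group_b = rng.normal(size=(20, 10_000))
pvals = stats.ttest_ind(group_a, group_b, axis=0).pvalue

# Under the null, the P-values are uniform on [0, 1]:
# roughly 5% fall below 0.05, roughly 20% below 0.20.
print(np.mean(pvals < 0.05))
print(np.mean(pvals < 0.20))
```

A histogram of `pvals` would look like the flat plot described above.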

So what does that mean? How does that play a role in a multiple testing problem? Here's an example with a cartoon. Imagine that you're trying to investigate whether jelly beans are associated with acne. So what you could do is perform a study where you compare people who eat a lot of jelly beans and people who don't, and look to see if they have acne or not.

And if you do that, you probably won't find anything. So at the first test, people go ahead and collect the data on the whole sample, they calculate the statistic, the P-value is greater than 0.05, and they conclude there's no statistically significant association between jelly beans and acne. But then you might consider: oh, well, it might be just one kind of jelly bean. So you could go back and test brown jelly beans and yellow jelly beans and so forth, and in each case, most of the time, the P-value would be greater than 0.05. So it would not be statistically significant, and you wouldn't report it.

But since P-values are uniformly distributed, about one out of every 20 tests that you do, even if there's absolutely no association between jelly beans and acne, will still show up with a P-value less than 0.05. And so the danger is that you do these many, many tests, find the one with a P-value less than 0.05, and just report that one.
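The arithmetic behind that one-in-20 intuition: with 20 independent null tests (one per jelly-bean color, an illustrative count), the chance of at least one spurious P < 0.05 is already well over half:

```python
# Each of 20 independent null tests has a 5% chance of P < 0.05,
# so the chance that at least one color "wins" by luck alone is:
p_at_least_one = 1 - 0.95 ** 20
print(round(p_at_least_one, 2))  # 0.64
```

So even with no real effect anywhere, a "green jelly beans cause acne" headline is more likely than not.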

So here's an example where there's a news article saying that green jelly beans have been linked to acne. So again, they're reporting this with a statistical significance measure that was designed for performing one hypothesis test, but in reality they performed many.

So how do we deal with this? How do we adapt the hypothesis testing framework to the situation where you're doing many hypothesis tests?

So the way that we do that is with different error rates. The two most commonly used error rates that you'll probably hear about when doing a genomic data analysis are the family-wise error rate and the false discovery rate.

So the family-wise error rate says that if we're going to do many, many hypothesis tests, we want to control the probability that there will be even one false positive. This is a very strict criterion. If you find many things that are significant at a family-wise error rate that's very low, you're saying that the probability of even one false positive is very small.

Â 3:40

Another very commonly used error measure is the false discovery rate. This is the expected number of false positives divided by the total number of discoveries. So what does this do? It quantifies, among the things that you're calling statistically significant, what fraction of them appear to be false positives. And so the false discovery rate is often a little bit more liberal than the family-wise error rate. You're not controlling the probability of even one false positive; you're allowing for some false positives in order to make more discoveries. But it quantifies the error rate at which you're making those discoveries.
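A minimal sketch of the two standard corrections in numpy: Bonferroni, the simplest family-wise error rate procedure, and Benjamini-Hochberg, the classic false discovery rate procedure. The mixture of P-values at the bottom is made up for illustration.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """FWER control: reject only P-values below alpha / m."""
    m = len(pvals)
    return pvals < alpha / m

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control (step-up): reject the k smallest P-values, where
    k is the largest i with p_(i) <= (i / m) * alpha."""
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    below = ranked <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# 10 "real" signals (tiny P-values) mixed with 90 null tests.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(0, 1e-4, 10), rng.uniform(0, 1, 90)])

print(bonferroni(pvals).sum())           # strict: few rejections
print(benjamini_hochberg(pvals).sum())   # more liberal: at least as many
```

Benjamini-Hochberg always rejects at least everything Bonferroni does, which is exactly the FWER-versus-FDR trade-off described above.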

And so to interpret these error rates you have to be very careful, because they actually have different interpretations. You do different things to the data, but you also have to interpret the results differently. So just because you find more statistically significant results when you use the false discovery rate than when you use the family-wise error rate, it doesn't mean that magically, all of a sudden, there were more results that were truly different. It just means that there's a different interpretation to the analysis that you do.

So I'm going to give you a very simple example. Suppose you're doing an analysis with 10,000 genes, a differential gene expression analysis, and you discover that 550 of those genes are significant at the 0.05 level. If those are raw, uncorrected P-values, then since 5% of null tests fall below 0.05 by chance alone, as many as 0.05 × 10,000 = 500 of those 550 calls could be false positives.

Â 5:19

Alternatively, suppose that when we declared those 550 to be significant we were using the false discovery rate. In this case we're quantifying, among the discoveries that we've made, the rate of errors we would expect to make. So about 5% times the 550 things we discovered equals about 27.5 false positives. So in this case, we discovered the same number of things, but by using a different error rate we control the error level much lower than if we had just calculated P-values less than 0.05.

Finally, suppose we use the family-wise error rate. In this case, if we had found 550 genes differentially expressed out of 10,000 at a family-wise error rate control of 0.05, that means the probability of even one of those 550 being a false positive is less than 0.05. So that means that almost all of them would probably be true positives.
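The back-of-the-envelope numbers for those three interpretations, using the 10,000 genes and 550 significant calls from the example, can be checked directly:

```python
m, discoveries, alpha = 10_000, 550, 0.05

# (1) Raw P < 0.05: about 5% of ALL null tests pass by chance,
#     so up to alpha * m null genes could sit among the 550 calls.
expected_fp_raw = alpha * m            # about 500

# (2) FDR controlled at 0.05: about 5% OF THE DISCOVERIES are false.
expected_fp_fdr = alpha * discoveries  # about 27.5

# (3) FWER controlled at 0.05: the probability of even ONE false
#     positive among the 550 is below 0.05 -- no expected count needed.
print(expected_fp_raw, expected_fp_fdr)
```

Same 550 discoveries, three very different guarantees about how many of them are wrong.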

So we've now illustrated the three ways that you could calculate statistical significance. In each case, "statistically significant" means something totally different: when you use those words, the meaning depends on what error rate you're controlling.

One last thing to consider when looking at multiple hypothesis tests is the inevitable scenario. Everybody who's done some real science has run into the scenario where the P-value that they calculated is just greater than 0.05. And the natural reaction is to be very sad and to think, game over, I've got to start all over again because my P-value's greater than 0.05. It's a really good idea not to do that.

First of all, it's important to report negative results, even if you can't get them into the best journals, to avoid what's called publication bias. But more importantly, it's important to be careful to avoid P-value hacking.

Â 6:54

So a very typical email a statistician might get after reporting a P-value greater than 0.05 is this one that my friend Ingo got. It said: curse you, Ingo! Yet another disappearing act! Because the P-value was greater than 0.05 after doing some correction. And so, while this is a joke and it was totally said in jest, in general there can be pressure to try to discover more things at a more statistically significant level. It's very important to avoid that temptation, because you'll run into something called P-value hacking.

So in general, statistics hacking means doing things to the data, or changing the way that you do the calculations, in order to manufacture a statistically significant result even when your original analysis didn't produce one. This is an example of a paper where people took a very simple simulated data set, made very sensible transformations to that data set with the statistical methods they used, and turned almost any result into a statistically significant result.

A way to avoid this is to specify a data analysis plan in advance of looking at the data, and stick to it.
