An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

The P-value is the most widely used statistic in the entire world

including for inference and for everything else.

It's so popular that if it were cited every time it was used, it would have at

least three million citations, making it the most highly cited paper ever created.

So the p-value is a very important statistic, and since it's such an important

statistic, there are lots of people that hate the p-value because it's so popular.

And so part of the reason why people hate it,

is because people consistently misinterpret the p-value.

And so the p-value is defined as the probability of observing a statistic

as extreme or more extreme than the one you calculated,

if the null hypothesis is true.

So a couple of things that the p-value is not, and

that will make statisticians see red if you say them: the p-value is not

the probability that the null hypothesis is true.

It's also not the probability that the alternative is true.

And in some sense it's not necessarily a measure of statistical evidence.

That's a philosophical term that people will worry about but in this case,

you need to interpret it very narrowly.

As the probability of observing a statistic as or more extreme than the one

you observed in the data, if the null hypothesis were true.

So here we're going to use that example again with the responders and

the non-responders to illustrate what's going on.

So again, we have responders and non-responders, and now we're looking at, say, for

gene one, a statistic that compares the responders to the non-responders.

So we might calculate a t-statistic: take the average expression level among

the responders, and subtract the average expression level among the non-responders.

And then standardize that by some measure of the variability, in this case,

the average variability in each of the two groups.
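
As a rough sketch of that calculation, the statistic could be computed as below. The lecture does not specify the exact standardization, so this assumes the Welch form (each group's variance divided by its size); the function name `t_statistic` and the expression values are hypothetical.

```python
import math
import statistics

def t_statistic(group_a, group_b):
    """Difference in group means divided by a standard error
    built from each group's sample variance (Welch form, an assumption)."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / standard_error

# Hypothetical expression levels for gene one
responders = [5.1, 4.8, 5.6, 5.3, 4.9]
non_responders = [4.2, 4.0, 4.5, 4.4, 3.9]

t_obs = t_statistic(responders, non_responders)
print(t_obs)
```

A positive value here means the responders have higher average expression; swapping the two groups flips the sign.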

So in a previous lecture we learned that one way that you could try to

quantify a null hypothesis,

the null hypothesis that the distributions are exactly the same among

the responders and the non-responders, is to permute the sample labels.

So when you permute the sample labels, you leave the relationship among the genes

unchanged, but you break the relationship between each

gene and the responder/non-responder label.

So if I recompute the statistic, after I do that, I get a distribution

under the permutations and then I have the original statistic that I calculated.

And so the p-value that I can calculate is based on the number of

permutation statistics I observed to be as large as or larger than

the statistic that I originally calculated.

And I do that in absolute value since in general the null hypothesis is

that the value is equal to zero.

That there's no difference between the two groups.

But the alternative could be that it's either positive or

it's negative.

And so I have to look in both directions, whether it's positive or negative.

And so I just count up the number of statistics that are more extreme in

each direction, and I divide by the total number of permutations.

So I basically compute the fraction of permutations where the statistic is as or

more extreme under this null hypothesis than the statistic I originally calculated, and

that gives me the p-value.
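
The counting procedure just described can be sketched as follows. This is a minimal illustration, not the course's own code: it uses a plain difference in means as the statistic to keep the example short, and the names `mean_difference`, `permutation_p_value`, `n_perm`, and the data are all hypothetical.

```python
import random
import statistics

def mean_difference(values, labels):
    """Mean of the values labeled True minus the mean of those labeled False."""
    group_1 = [v for v, is_resp in zip(values, labels) if is_resp]
    group_0 = [v for v, is_resp in zip(values, labels) if not is_resp]
    return statistics.mean(group_1) - statistics.mean(group_0)

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose statistic is as or more
    extreme (in absolute value) than the originally observed one."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between values and labels
        if abs(mean_difference(values, shuffled)) >= observed:
            n_extreme += 1
    return n_extreme / n_perm

# Hypothetical expression levels: first five are responders
values = [5.1, 4.8, 5.6, 5.3, 4.9, 4.2, 4.0, 4.5, 4.4, 3.9]
labels = [True] * 5 + [False] * 5

p_value = permutation_p_value(values, labels)
print(p_value)
```

Taking the absolute value is what makes this two-sided: a permuted difference counts as extreme whether it is large and positive or large and negative.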

So this p-value is often used as a measure, but

in general it's basically used as a hypothesis testing tool to be able to say,

if that p-value is small, you're going to reject the null hypothesis.

Because the statistic is very extreme

compared to the distribution that you would have got under the null.

So this is what p-value distributions look like for

genomic experiments that are done well.

So typically, you see a distribution like this where there's a spike near zero and

then there's a flat distribution as you move out here towards one.

So if you actually look at this and break it down into the different parts,

this part near zero, these p-values that are really small,

those are really the p-values that are coming from the alternative distribution.

Because remember,

the p-value is measuring the probability of observing a statistic more extreme

under the permutations than the statistic that you actually observed.

So if you observe a statistic that's very, very extreme,

the number of permuted statistics that will be larger than that is very small,

and you'll get a small p-value.

So these are the sort of p-values that you expect to be coming from

the cases that are not from the null distribution.

And then under the null, these are the p-values you get,

you get a flat distribution that goes out here to the right hand side.

So it turns out that a particular property of the p-value is that it's

uniformly distributed, it's equally likely to be any value between zero and

one if the null hypothesis is true.

What does that mean in general?

It means that even if you get a small p-value,

it might be from the null distribution because there's an equal chance that

it'll be any value between zero and one if the null is true.
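
One way to see this uniformity is a small simulation where the null is true by construction. The sketch below assumes a two-sample z-test with known standard deviation, chosen only because its p-value can be computed with the standard library; the function name, the sample sizes, and the number of repetitions are all hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

def two_sided_z_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming the
    standard deviation sigma is known (a simplifying assumption)."""
    standard_error = sigma * math.sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(1)
null_p_values = []
for _ in range(2000):
    # Both groups come from the same distribution, so the null is true
    a = [rng.gauss(0, 1) for _ in range(10)]
    b = [rng.gauss(0, 1) for _ in range(10)]
    null_p_values.append(two_sided_z_p_value(a, b))

# Count p-values in ten equal-width bins from 0 to 1
bin_counts = [0] * 10
for p in null_p_values:
    bin_counts[min(int(p * 10), 9)] += 1

fraction_below_005 = sum(p < 0.05 for p in null_p_values) / len(null_p_values)
print(bin_counts)
print(fraction_below_005)
```

Each bin should hold roughly a tenth of the p-values, and about 5% of them should fall below 0.05 even though no gene here is truly different.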

So this actually is a useful set of properties that can be used to

estimate things like the false discovery rate that we'll talk about when

we talk about multiple testing.

But the basic idea is that this distribution is a mixture of two

distributions.

There's a mixture of the p-values that come from the null hypotheses, and

the p-values that come from the alternative hypotheses.

And the null hypothesis p-values are supposed to be uniformly distributed.

And the alternative ones should be pushed up towards zero.

They should be skewed away from one.
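
The mixture picture can be illustrated by simulating test statistics directly. This is a sketch under stated assumptions: null z statistics are drawn from a standard normal, alternative ones from a normal shifted to a mean of 3, and the counts of null and alternative genes are made up for illustration.

```python
import random
from statistics import NormalDist

rng = random.Random(2)
std_normal = NormalDist()

def p_from_z(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - std_normal.cdf(abs(z)))

n_null, n_alt = 900, 100  # hypothetical numbers of genes
alt_shift = 3.0           # hypothetical mean of the alternative z statistics

null_p = [p_from_z(rng.gauss(0, 1)) for _ in range(n_null)]
alt_p = [p_from_z(rng.gauss(alt_shift, 1)) for _ in range(n_alt)]
all_p = null_p + alt_p

# Histogram of the mixture in ten equal-width bins from 0 to 1
bin_counts = [0] * 10
for p in all_p:
    bin_counts[min(int(p * 10), 9)] += 1

print(bin_counts)
```

The first bin should stand well above the rest, since it collects most of the alternative p-values on top of its share of the flat null ones.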

And the p-values almost always go to zero as the sample size grows.

That's another common misinterpretation of the p-value.

Just because you got a really small p-value,

it doesn't mean that the difference is huge.

It could just be that your sample size is really large, and so

the variability is small.

Even if you have any difference at all, as the sample size gets big,

the p-value will get small.
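
To make the sample size point concrete, here is a small deterministic sketch. It assumes a two-sample z-test with known standard deviation and asks what the p-value would be if the observed difference were exactly a fixed small value `delta`; the function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def p_value_at_difference(delta, n_per_group, sigma=1.0):
    """Two-sided z-test p-value when the observed mean difference is
    exactly delta, with n_per_group samples in each group."""
    standard_error = sigma * math.sqrt(2 / n_per_group)
    z = delta / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

delta = 0.05  # hypothetical tiny difference, in units of sigma
sample_sizes = [10, 100, 1000, 10000, 100000]
p_by_n = [p_value_at_difference(delta, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_by_n):
    print(n, p)
```

The difference never changes, only the sample size does, yet the p-value moves from clearly non-significant to essentially zero.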

The usual cut off that people use for calling p-values significant is 0.05.

This is if you're doing only a single hypothesis test, but

that number is basically just a made up number.

Any other threshold could also be used.

I mean it's useful to have a standard, but don't treat this as sort of religious

truth that 0.05 is the right way to tell if your p-value is significant.

And you should always report p-values in conjunction with estimates and

variances on the scale that's scientifically meaningful.

P-values can be useful as a complement to that,

as a way to sort of quantify statistical significance, as long as you pay attention

to the properties of the p-values and interpret them correctly.
