0:02

This lecture is about experimental design, and in particular about sample size and variability. If you remember from the previous lecture, the central dogma of statistics is that we have this big population, and it's expensive to take whatever measurement we want, genomic or otherwise, on that whole population. So we use probability to take a sample. Then on that sample we make our measurements and use statistical inference to say something about the population.

We talked a little bit about how the best guess that we get from our sample isn't all that we get; we also get an estimate of variability. So let's talk a little bit about variability and its relationship to good experimental design.

There's a sample size formula that you may have heard of. Say N is the number of measurements that you could take, or the number of people that you could sample. If you're doing scientific research, you often have to ask for grant money, and so N ends up being the number of dollars that you have divided by how much it costs to make a measurement. While this is one way to get at a sample size, it's maybe not the best way. The real idea behind sample size is to understand variability in the population.

1:06

Here's a really quick example of what I mean by that. Here are two synthetic, made-up data sets: one for Y and one for X. The measurement values are on the x-axis, and on the y-axis are the two data sets, Y and X. I have two lines here: the red line is the mean of the Y values and the blue line is the mean of the X values. What you can see is that the means are different from each other, but there's also quite a bit of variability around those means. Some measurements are lower, some are higher, and they overlap.

So the idea is: if the two means are different, how confident can we be about that? If we know the variation around the measurements that we've taken and the means that we have, how confident can we be that these two means are different from each other? This comes down to how many samples you need to collect, and how much variability you observe, to be able to say whether the two things are different or not.
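To make that question concrete, here is a minimal Python sketch (not from the lecture) that draws two made-up groups and computes the difference in means alongside its standard error. The group means of 0 and 5, the standard deviation of 10, and the function name `summarize_two_groups` are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def summarize_two_groups(x, y):
    """Return the difference in means (y minus x) and its standard error."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    # Standard error of the difference between two independent sample means
    se_diff = math.sqrt(var_x / len(x) + var_y / len(y))
    return mean_y - mean_x, se_diff

rng = random.Random(1)  # fixed seed so the made-up data are reproducible

# Hypothetical groups: true means 0 and 5, standard deviation 10 in each
x_small = [rng.gauss(0, 10) for _ in range(10)]
y_small = [rng.gauss(5, 10) for _ in range(10)]
x_large = [rng.gauss(0, 10) for _ in range(1000)]
y_large = [rng.gauss(5, 10) for _ in range(1000)]

diff_small, se_small = summarize_two_groups(x_small, y_small)
diff_large, se_large = summarize_two_groups(x_large, y_large)
print(f"n=10:   difference={diff_small:.2f}, standard error={se_small:.2f}")
print(f"n=1000: difference={diff_large:.2f}, standard error={se_large:.2f}")
```

With more samples per group the standard error shrinks, which is why the same difference in means becomes easier to trust.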

1:57

The way that people do this in advance, in experimental design, is with power. Basically, the power is the probability that, if there's a real effect in the data set, you'll be able to detect it. It depends on a few different things: the sample size, how different the means are between the two groups (like we saw with the red and the blue lines), and how variable the groups are (we saw that there was variation around the means in both the X and the Y data sets).

This is actually code from the R statistical programming language. You don't have to worry about the code in this lecture, but you can see, for example, that if we want to do a t-test comparing the two groups (a certain kind of statistical test), the probability that we'll detect an effect of size 5 (that's the delta there), with a standard deviation of 10 in each group and 10 samples, is 18%.
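The lecture's number comes from R. As a rough stand-in, the Python sketch below approximates the same power with a normal distribution instead of the exact t distribution. The function name `approx_power_two_sample` and the 0.05 significance level are assumptions, and because of the approximation the result comes out a little above the 18% quoted in the lecture.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_two_sample(n, delta, sd, alpha=0.05):
    """Approximate two-sided power for comparing two group means.

    Uses a normal approximation rather than the exact t distribution,
    so it runs slightly high when n is small.
    """
    z = NormalDist()
    se = sd * sqrt(2.0 / n)            # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = delta / se                 # true difference in standard-error units
    # Probability that the test statistic lands beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Effect size 5, standard deviation 10, 10 samples per group (the lecture's values)
power = approx_power_two_sample(n=10, delta=5, sd=10)
print(f"approximate power: {power:.2f}")
```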

So even if there's an effect, it's not very likely that we'll detect it. But you can also go back and make the calculation the other way. Say, as is customary, we want 80% power. In other words, we want an 80% chance of detecting an effect if it's really there. For an effect size of 5 and a standard deviation of 10, we can back out how many samples we need to collect. In this case, by doing the calculation, we see that we need 64 samples from each group in order to have an 80% chance of detecting this particular effect size.
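For a sense of where a number like that comes from, here is a rough worked version using the textbook normal approximation. This is not the exact t-based calculation from the lecture. The symbols are my own labels: δ is the effect size, σ the standard deviation in each group, α the significance level (assumed to be 0.05), and 1 − β the desired power.

```latex
n \;\approx\; \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}}
  \;=\; \frac{2\,(1.96 + 0.84)^{2}\,(10)^{2}}{5^{2}}
  \;\approx\; 62.7
```

Rounding up gives 63 per group. The t-based calculation is a little more conservative, which is consistent with the 64 quoted in the lecture.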

Similarly, you can do that calculation to ask how many samples you need in each group if you're only going to be doing a test in one direction or the other. Suppose I know that expression levels will always be higher in the cancer samples than in the control samples. Then it's possible to collect fewer samples and still get the same power, because you actually have a little bit more information.
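Here is a small Python sketch of that comparison, again using a normal approximation rather than the exact t-based calculation. The function name `approx_n_per_group`, the `one_sided` flag, and the 0.05 significance level are assumptions for illustration. The approximate counts come out slightly below what an exact calculation would give.

```python
from math import ceil
from statistics import NormalDist

def approx_n_per_group(delta, sd, power=0.80, alpha=0.05, one_sided=False):
    """Approximate per-group sample size for comparing two means.

    Normal approximation; an exact t-based calculation asks for
    slightly more samples.
    """
    z = NormalDist()
    # A one-sided test spends all of alpha in a single tail
    tail = alpha if one_sided else alpha / 2
    z_alpha = z.inv_cdf(1 - tail)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)

# Effect size 5 and standard deviation 10, as in the lecture
n_two_sided = approx_n_per_group(delta=5, sd=10)
n_one_sided = approx_n_per_group(delta=5, sd=10, one_sided=True)
print(f"two-sided: about {n_two_sided} per group")
print(f"one-sided: about {n_one_sided} per group")
```

Knowing the direction of the effect in advance lowers the critical value the test has to clear, and that is where the saving in samples comes from.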

Later classes and statistics classes will talk more about power and how you calculate it. But the basic idea to keep in mind is that the power is actually a curve. It's never just one number. Even though you might hear 80% thrown around quite a bit when talking about power, the idea is that there is a curve.

In this plot, I'm showing on the x-axis all the different potential sizes of an effect. It could be 0, which is the center of the plot, or it could be very high or very low. On the y-axis is the power for different sample sizes. The black line corresponds to a sample size of 5, the blue line to a sample size of 10, and the red line to a sample size of 20. As you can see, as you move out from the center of the plot, the power goes up: the bigger the effect, the easier it is to detect. Also, as the sample size goes up, from the black to the blue to the red curve, you get more power as well.
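A power curve like this can be sketched in a few lines of Python. The version below uses a normal approximation, and the grid of effect sizes and the standard deviation of 10 are assumptions. The lecture does not say which values were used to draw the plot. Only the sample sizes of 5, 10, and 20 come from the lecture.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n, delta, sd, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison."""
    z = NormalDist()
    shift = delta / (sd * sqrt(2.0 / n))  # effect in standard-error units
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

effect_sizes = [-15, -10, -5, 0, 5, 10, 15]  # hypothetical grid, centered on no effect
sample_sizes = [5, 10, 20]                   # per-group sizes named in the lecture
sd = 10                                      # assumed standard deviation

# One curve per sample size: power at each candidate effect size
curves = {n: [approx_power(n, d, sd) for d in effect_sizes] for n in sample_sizes}

print("delta " + "".join(f"{d:>7}" for d in effect_sizes))
for n, powers in curves.items():
    print(f"n={n:<4}" + "".join(f"{p:>7.2f}" for p in powers))
```

Each printed row is one curve: it bottoms out at the significance level when the effect is 0 and rises toward 1 as the effect gets larger in either direction.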

So as you vary these different parameters, you get different power. A power calculation is a hypothetical calculation based on what you think the effect size might be and what sample size you can get. It's important to pay attention, before performing a study, to the power that you might have, so that you don't run the study and end up at the end of the day without detecting any difference even when there might have been one there.

5:13

The variability of a genomic measurement can be broken down into three types. The first is phenotypic variability. Imagine you're doing a comparison between cancers and controls. Then there's variability between the cancer patients and the control patients in their genomic measurements. This is often the variability that we care about: we want to detect differences between groups.

There's also measurement error. All genomic technologies, whether they measure gene expression, methylation, or the alleles in a DNA study, measure with error. So we have to take into account how well the machine actually measures the reads, how we quantify the reads, and so forth. There's also a component of variation that often gets ignored or missed, which is natural biological variation.

5:55

For every kind of genomic measurement that we take, there's natural variation between people. Even if you have two people who are healthy and have the same phenotype in every possible way (they're the same sex, the same age, they eat the same breakfast), there is still going to be variation between them. That natural biological variability has to be accounted for when performing statistical modeling as well.

An important consideration is that when there's a new technology, there's often a rush to claim that it is so much better than the previous technology. One way people do that is by saying that the variability is much lower. That may be true for the technical, or measurement error, component of variability, but it doesn't eliminate biological variability.
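A small simulation can illustrate the point. The Python sketch below assumes that each observed value is a person's true level plus independent measurement error, so the observed variance is roughly the biological variance plus the technical variance. All the numbers (a biological standard deviation of 2, technical standard deviations of 1.0 and 0.2, 2000 people) are made up for illustration.

```python
import random
import statistics

def simulate_measurements(true_levels, technical_sd, rng):
    """Add independent measurement error to each person's true level."""
    return [level + rng.gauss(0, technical_sd) for level in true_levels]

rng = random.Random(42)  # fixed seed so the made-up data are reproducible
biological_sd = 2.0      # assumed natural person-to-person variation

# Hypothetical true expression levels for 2000 healthy people
true_levels = [rng.gauss(10, biological_sd) for _ in range(2000)]

# The same people measured with a noisier and a quieter technology
old_tech = simulate_measurements(true_levels, technical_sd=1.0, rng=rng)
new_tech = simulate_measurements(true_levels, technical_sd=0.2, rng=rng)

var_old = statistics.variance(old_tech)
var_new = statistics.variance(new_tech)
print(f"biological variance (assumed): {biological_sd ** 2:.2f}")
print(f"observed variance, noisier technology: {var_old:.2f}")
print(f"observed variance, quieter technology: {var_new:.2f}")
```

Cutting the measurement error brings the observed variance down toward the biological variance, but not below it.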

Here I'm showing an example of that. There are four plots in this picture that you're looking at. The top two plots show data that was collected using next-generation sequencing. The bottom two plots show data that was collected with microarrays, an older technology. Each dot corresponds to the same sample, so it's the same samples in all four plots.

What you can see is that for the gene on the left, the pink gene, there's lower variability across people. This is true whether you measure it on the top with sequencing or on the bottom with arrays. Similarly, the gene on the right, which I've colored in blue here, is highly variable whether it's measured with sequencing or with arrays. What this suggests is that biological variation is a natural phenomenon that is always a component of modeling data in genomics, and it does not get eliminated by technology.
