0:00

This lecture is about the basics of experimental design.

You'll see a lot more about this topic in other courses, like Inference and Prediction, but I thought I'd just give you a little bit of the basics so you can get started.

0:11

So why should you care about experimental design?

I'm going to use an example to illustrate why you should. This is an example that actually comes from my area of research. This is an area where people have used genomic information, so information about your genome, to predict what kind of chemotherapy you might respond to best. This was an incredibly exciting result, because it would open the door for all sorts of personalized medicine within cancer therapeutics.

0:44

Unfortunately, the study had already been ongoing for a long time, and they'd actually started clinical trials. But they were using those predictions they'd developed in the earlier study to decide what chemotherapies people should get. This was causing all sorts of problems for people, because they had performed a poor analysis. So this ultimately led to a lawsuit from the people that were enrolled in the clinical trial against the original investigators who developed the predictive model. This illustrates how a very exciting result can lead you astray if you are not very careful about experimental design and analysis.

1:18

The first thing to be aware of when performing any sort of experimental design or data science project is to care about the analysis plan. This is actually a published paper where, believe it or not, the abstract says "insert statistical method here", because the person writing the paper didn't know what the statistical method was. It's critically important that you pay attention to all aspects of the design and analysis of the study, from the data cleaning to the data analysis to the reporting, so that you don't end up, first of all, in silly, embarrassing situations like this, but more importantly so that you're aware of the key issues in the study design that can trip you up.

1:57

Regardless of what study you are doing, you need to plan for data and code sharing. You all now have GitHub accounts that you have created either through this course or prior to it, so that is a good place to share your code. If you have a very small amount of data, you can share it on GitHub as well. Larger amounts might go on a site like Figshare, where you can share your scientific data, and also other kinds of data, with other people. But you might need a larger data sharing plan if your data are much more unwieldy, or so large they can't be shared on these websites. If you don't have a data sharing plan, can I recommend mine? The Leek group has developed a guide to data sharing that you can find at this website here, which outlines all the steps involved in taking a broad data set and making it available to other people.

2:46

So the first thing that you need to do when you're actually performing an experiment is formulate your question in advance. A key component of data science, if we were going to define it for the purposes of this course track, is that it's actually a scientific discipline. And a scientific discipline requires that you're answering a specific question when you're using data. So here's one example of that. This is actually a study that was performed by Barack Obama's campaign when he was running for re-election to the United States Presidency.

3:39

They wouldn't want to show the website to every single person, because it would be expensive to run that experiment. So the population is all possible donors in the United States, all possible people that might go to Barack Obama's website to donate. They might select some small subset of those, using a probability argument to decide whether to show them one version or the other of the website. This would result in a sample, a much smaller number of people that had observed the websites and decided whether to donate or not. They would then calculate descriptive statistics, say the total number of donations they got over a certain number of visits, or the average amount of donations they got over a certain number of visits. And then they would use inferential statistics to try to decide whether the statistics they calculated on this small sample would play out in the same way if applied to the large population of all people that might come to the website.

4:38

So here is one scenario. Suppose the data they collected hypothetically turned out this way. There were two versions of the website: there was the donate version and there was the sign-up version. And so what they could do is say, for every 1,000 visitors, what was the total number of dollars that were donated, or the average number of dollars donated?

5:01

So for 1,000 visits you might get about $6,000 on average over the course of a week. And suppose you ran that experiment three different times. You might get observations of very different amounts of dollars for the donate version and for the sign-up version of the website.

5:21

So what you see here is that the average number of donations may be more or less the same, or it may be slightly different, but it's very hard to tell, because in each of the different experiments you get a highly variable answer under both versions of the website.

5:35

Here's another hypothetical case. Suppose in this case you got a slightly smaller number of dollars donated when you showed the sign-up version than when you showed the donate version. Here the variability is much smaller, so you can tell for sure that the donate version gives you more money. You might implement the donate version, but it's not clear that it would be a huge benefit.

5:58

The much bigger benefit comes when you run the same experiment, and the donate version has small variability and the sign-up version has small variability in the total number of dollars donated, but there's a large difference between the amount people donated when they saw the donate version and the amount they donated when they saw the sign-up version. This overwhelms the variability they saw in the experiment, so it is very interesting for them and would suggest they should show only the donate version of the website to people that came to visit.
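The three scenarios above can be sketched as a small simulation. Everything here is hypothetical: the dollar amounts, the noise levels, and the number of runs are assumptions for illustration, not the campaign's actual data.

```python
import random
import statistics

def simulate_runs(mean_a, mean_b, noise_sd, n_runs=3, seed=0):
    """Simulate repeated A/B runs: dollars donated per 1,000 visits
    for version A (donate) and version B (sign-up)."""
    rng = random.Random(seed)
    a = [rng.gauss(mean_a, noise_sd) for _ in range(n_runs)]
    b = [rng.gauss(mean_b, noise_sd) for _ in range(n_runs)]
    return a, b

def signal_to_noise(a, b):
    """Difference in means divided by the pooled spread: a rough check
    of whether the signal overwhelms the run-to-run variability."""
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / statistics.stdev(a + b)

# High variability: a $500 true difference is hard to see.
noisy = signal_to_noise(*simulate_runs(6000, 5500, noise_sd=2000))

# Low variability: the same $500 difference stands out clearly.
clean = signal_to_noise(*simulate_runs(6000, 5500, noise_sd=50))
```

With the noisy runs the ratio tends to sit near zero, while with the quiet runs the difference dominates the spread, which is the third scenario in the lecture.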

Another important issue is the issue of confounding. Suppose in a particular study you measured both shoe size and literacy, and suppose you were looking for correlations between shoe size and literacy.

6:43

You might actually observe quite a few of these correlations, because people with small shoes tend to have lower literacy. But what you might be missing is that age is actually the variable causing this relationship. When you're very young, say a baby, you have a very small shoe size and also very low literacy.

7:02

As you get older, you get a larger shoe size and higher literacy, so the variable age is actually a confounder for the relationship between shoe size and literacy, which might be very, very small once you account for age. So if you only measure shoe size and literacy, you might be led astray when you observe the correlation between these two variables. This is what is called confounding, and it's why you should pay attention to what other variables might be causing a relationship.
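The shoe-size example can be sketched with simulated data. The effect sizes and age range below are made up purely to show the mechanism: age drives both variables, so they correlate even though neither causes the other.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
# Age drives both shoe size and literacy; they have no direct link.
age = [rng.uniform(1, 18) for _ in range(500)]
shoe = [0.8 * a + rng.gauss(0, 1) for a in age]       # grows with age
literacy = [5.0 * a + rng.gauss(0, 8) for a in age]   # also grows with age

r_raw = pearson(shoe, literacy)  # large, but driven entirely by age

# Crudely "account for age" by restricting to a narrow age band,
# so age is roughly constant within the comparison.
band = [(s, l) for a, s, l in zip(age, shoe, literacy) if 9 <= a <= 10]
r_adjusted = pearson([s for s, _ in band], [l for _, l in band])
```

Restricting to one age band is the simplest stand-in for adjusting for a confounder; regression with age as a covariate is the more usual tool.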

This plays out all the time in lots of different studies, so here's one example. This is actually from a possibly serious paper in The New England Journal of Medicine, where they plotted chocolate consumption in kilograms per year per capita versus Nobel laureates per ten million in the population. And they observed that there's a relationship between these two variables.

7:53

So what could be going on here? There are lots of reasons why a country might consume more chocolate and have more Nobel Prize winners. For example, a country might have a better education system if it has more money, and so it would have more Nobel Laureates; and if it has more money per capita, it would also probably consume more chocolate per capita. So that's one reason, among many, why this observed correlation may not reflect an actual scientific relationship between chocolate consumption and Nobel Laureates per capita. This is sometimes called spurious correlation, and it's also why you often hear the phrase "correlation is not causation." Even if you observe that two variables are correlated with each other, you have to convince yourself that they're not correlated because of some other variable you didn't measure.

So there are some ways that you can deal with potential confounders. One way is to fix a variable. For example, in the case where you're considering websites for Obama 2012, you can fix "Obama 2012" on all the websites. That way, that variable doesn't change no matter what text you're showing people; you fix that variable so it can't be a confounder.

Another way is that you can stratify. Suppose you have two website colors, and you want to try out both of those colors with both phrases, and you want to know which phrase works better. Then what you can do is use both phrases equally on both colors. That way you've stratified your sample, so both website colors are used equally with both phrases.
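The stratified design described above can be written out directly. The two colors and two phrases are placeholder values; the point is that every phrase appears equally often with every color, so color can't confound the phrase comparison.

```python
import itertools
import random
from collections import Counter

colors = ["blue", "white"]        # hypothetical website colors
phrases = ["donate", "sign up"]   # the two phrases being compared

# Balanced design: every color-phrase combination is shown to the
# same number of visitors (25 each, 100 visitors total).
design = list(itertools.product(colors, phrases)) * 25

# Shuffle so the order in which visitors see variants is still random.
random.Random(0).shuffle(design)

counts = Counter(design)  # each of the 4 (color, phrase) pairs: 25
```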

9:28

If you can't do either of those two things, if you can't fix a variable or stratify on it, then what you can do is randomize it. What we mean by randomization is that you use a computer program, or flip a coin, to randomly assign people to groups.

Why does this help? Suppose there's a particular study that we're performing, and suppose there are ten experimental units that look like this. Suppose there's a confounding variable, where lower values of that confounding variable correspond to lighter colors and higher values correspond to darker colors. So you have these experimental units, ordered by the value of the confounding variable, and suppose we are silly enough to apply one treatment to all of the people that have a high value of the confounding variable, and another treatment to all the people that have a low value of the confounding variable.

Then it would be impossible to distinguish whether any difference we saw was due to differences in treatment or to differences in the confounding variable. But if we randomly assign the treatment, then some of the people that get the green treatment will have low values of the confounding variable and some will have high values. And since that will be balanced, we'll be able to determine whether the difference in treatment is what's actually driving the difference we observe in the outcome.
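A minimal sketch of that randomization step, assuming ten units ordered by their confounder value as in the figure:

```python
import random

def randomize(units, seed=42):
    """Randomly split units into two equal-sized treatment groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Ten units ordered by confounder value (0 = lightest, 9 = darkest).
units = list(range(10))

# Bad design: treatment is perfectly entangled with the confounder.
bad_a, bad_b = units[:5], units[5:]

# Randomized design: each group tends to get a mix of low and high
# confounder values, so the confounder is balanced on average.
good_a, good_b = randomize(units)
```

In the bad split, any difference between groups could be the treatment or could be the confounder; after random assignment, only the treatment differs systematically between groups.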

10:52

For a prediction study, you actually have slightly different issues than in the inferential case. Again, you might have a population of individuals, and you might not be able to collect data on all those individuals, but you want to predict something about them. For example, individuals might come in, we measure something about their genome, and we want to predict whether they're going to respond to chemotherapy or not.

11:15

So what we do is collect observations from people that did respond to chemotherapy and people that did not respond to chemotherapy. And then what we want to do is build a predictive function so that if we get a new individual, we can predict whether they're going to respond to chemotherapy, up here as an orange person, or not respond, down here as a green person. The idea is that we'll still need to deal with probability and sampling and potential confounding variables, because when we're building this prediction function, we want it to be highly accurate.

11:48

Another issue that comes up is that prediction is slightly more challenging than inference. For example, if you look at these two populations, the population for the light grey curve has a mean value about here, and the population for the dark grey curve has a mean value about here. If you look at the distribution of observations from these two populations, what you can see is that there's a difference in the means of these two populations.

But if I tell you, for example, that I've observed a value that comes, say, right here, it's very difficult to know which of these two populations it came from, because it's relatively likely that it came from the light grey population, but it's also relatively likely that it came from the dark grey population, so it's very difficult to tell the difference.

For prediction, you actually need the distributions to be a little bit more separated. These two distributions also have different means, but now they're far enough apart, relative to their variability, that if I give you an observation that lands right about here, you know it probably came from the dark grey population, whereas if I give you an observation here, it probably came from the light grey population. So it's important to pay attention to the relative size of effects when considering prediction versus inference.
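The separation argument can be quantified with a quick simulation: classify each draw to the nearer population mean and see how often that's right. The means and standard deviations below are arbitrary; only their ratio matters.

```python
import random

def classification_accuracy(mean_light, mean_dark, sd, n=2000, seed=3):
    """Draw n values from each of two normal populations and assign
    each draw to the nearer mean; return the fraction assigned correctly."""
    rng = random.Random(seed)
    cutoff = (mean_light + mean_dark) / 2
    correct = 0
    for _ in range(n):
        correct += rng.gauss(mean_light, sd) < cutoff   # light should fall below
        correct += rng.gauss(mean_dark, sd) >= cutoff   # dark should fall above
    return correct / (2 * n)

# Means one standard deviation apart: the distributions overlap heavily,
# so many individual observations are ambiguous.
overlapping = classification_accuracy(0.0, 1.0, sd=1.0)

# Means six standard deviations apart: almost every observation
# clearly belongs to one population.
separated = classification_accuracy(0.0, 6.0, sd=1.0)
```

The overlapping case still has a real difference in means, which inference can detect with enough data, but prediction for a single individual stays unreliable until the distributions separate.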

13:02

For prediction there are several key quantities that we need to be aware of. The first set of key quantities is positive and negative statuses; this comes from a medical testing scenario. Suppose you can either have a disease or not have a disease, so plus means you have the disease and minus means you don't, and you have a test to determine whether you have that disease or not. A plus means the test was positive, and a minus means the test was negative.

13:39

If you have the disease and you test positive, that's a true positive. If you do not have the disease and you test positive, that's a false positive. If you have the disease but don't test positive, that's a false negative. And if you don't have the disease and you test negative, that's a true negative.

13:53

Some quantities that you need to pay attention to are: the probability you have a positive test given you have the disease, which is the sensitivity; the probability you have a negative test given you have no disease, which is the specificity; and the probability you have the disease given you have a positive test. That last one is probably what you care about if you come into the clinic and I tell you that you have a positive test: you want to know the probability that you have the disease. Similarly, if I tell you that you have a negative test, you might want to know the probability that you don't have the disease. Accuracy is just the probability that you are correct about the outcome, regardless of whether the test is positive or negative.

14:32

Another key component of paying attention to big data and data science is being aware of data dredging. Suppose you want to consider something silly, like whether jelly beans cause acne. This comes from an XKCD cartoon. So scientists could investigate, and they find out that jelly beans aren't related to causing acne.

So then they say, well, that settles that. But you could try to change your hypothesis. You could say, oh, well, actually it's just purple jelly beans that cause acne. No, it's brown jelly beans that cause acne. No, it's pink jelly beans. And you keep going and going and going until you find a result that you like, and you come up with "good news: green jelly beans are linked to acne." But this of course ignores the fact that you tried a whole bunch of different things first. So one thing you have to pay attention to when dealing with big data, or any data science problem, is the problem of data dredging.
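The jelly-bean problem is multiple testing, and it's easy to reproduce: test 20 "colors" that all have zero true effect, and roughly one of them will still come out "significant" at the usual 0.05 threshold. The group sizes and threshold below are arbitrary choices for the sketch.

```python
import random

def count_false_positives(n_tests=20, n_per_group=50, seed=7):
    """Run n_tests two-group comparisons where the null is TRUE for
    every test, and count how many clear |z| > 1.96 (p < 0.05)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        z = diff / (2 / n_per_group) ** 0.5  # sd of a difference of means
        hits += abs(z) > 1.96
    return hits

# Across many repetitions, about 5% of null tests come out "significant",
# i.e. roughly one jelly-bean color per cartoon.
rate = sum(count_false_positives(seed=s) for s in range(200)) / (200 * 20)
```

Corrections such as Bonferroni, testing each color at 0.05/20 instead of 0.05, are the standard guard against exactly this.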

15:27

So in summary, good experiments have replication, so that you can measure variability. They measure that variability and compare it to the signal they are looking for, and they generalize to the problem you care about. Also important: good experiments are transparent in both their code and their data.

15:50

Also, in any data science problem, you need to be aware of data dredging. The other important issues with experimental design will be covered in later classes.