0:12

And then we talked about bootstrap samples to derive a confidence interval, and really any other kind of statistic you may need. One question is, why do we even need that shuffle test at all? Why can't we just use the confidence interval itself? After all, it's possible to reason about significance even with a confidence interval: you can see whether zero falls within the confidence interval or not.

0:47

When you're testing significance, you're starting with a null hypothesis and you're deciding whether to reject it or not, and you can see intuitively when that might make a difference. Consider an admittedly contrived case where you only have two data values, one treatment and one placebo.

1:03

Sorry, I said placebo; in this case, well, let's go with placebo. So one is a treatment, the other's a placebo, and we're trying to measure the difference between them, right? And let's imagine this is still survival days. So the one patient, with the treatment, survives longer, 36 days, and with the placebo the other only survived 27 days. So a 95% confidence interval will range from the value 9 to the value 9, meaning it's a very, very tight interval.

With the shuffle test, you'll shuffle the labels here, do a bunch of experiments, and quickly determine that the p-value's about 0.5, which is of course not significant, right? Half the time the difference is 9, the other half it's negative 9, and so the p-value will be quite high. So you always wanna start with the null hypothesis, that there is no difference, test the significance of that with the shuffle test, and then measure the confidence interval as a second step.
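A minimal sketch of this two-patient example (the survival values 36 and 27 come from the lecture; everything else, including the trial counts, is illustrative):

```python
import random

# Two-patient example from the lecture: one treatment value, one placebo
# value (survival days).
treatment = [36]
placebo = [27]
observed_diff = treatment[0] - placebo[0]  # 9 days

# Shuffle test: pool the values, reassign the labels at random many times,
# and count how often the shuffled difference is at least as extreme.
pooled = treatment + placebo
random.seed(0)
trials = 10_000
hits = 0
for _ in range(trials):
    random.shuffle(pooled)
    if pooled[0] - pooled[1] >= observed_diff:
        hits += 1
p_value = hits / trials
# Only two label assignments exist: the difference is +9 half the time
# and -9 the other half, so p_value comes out near 0.5 -- not significant.

# Bootstrap CI on the same data: resampling a one-value cohort always
# returns that value, so every bootstrap difference is exactly 9 and the
# "95% interval" collapses to [9, 9], which never contains zero.
boot_diffs = [random.choice(treatment) - random.choice(placebo)
              for _ in range(1000)]
```

This is why the CI alone misleads here: the degenerate interval [9, 9] excludes zero, yet the shuffle test shows the result is no better than a coin flip.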

Okay, so caveats about the bootstrap. When is it dangerous to use this? Well, really it is pretty robust. It actually makes fewer assumptions than a lot of the classical methods make, as we've described. It may underestimate the confidence interval for very small samples, but in these data science regimes, in these big data regimes, very small samples are not the concern we typically have to worry about; we often have massive datasets. Bootstrapping can't be used to estimate the minimum or maximum of a population, and you should consider why that might be.

2:38

So when we're taking averages, it works pretty well. But the minimum or maximum is very, very sensitive to outliers, to individual extreme values. If even one more value, as you scan through the data, is higher, then that will be the maximum; one point can have an arbitrarily large effect on the statistic. And if you take a bootstrap sample with replacement, you might miss that one outlier that really dramatically changes the value. And so you'll get a distribution that doesn't match the distribution of the maximum, okay?
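A small simulation makes the failure concrete. The data here are made up for illustration, not from the lecture:

```python
import random

random.seed(1)
# Hypothetical data: 99 ordinary values plus one huge outlier (the true max).
data = [random.gauss(50, 5) for _ in range(99)] + [500.0]
n = len(data)
true_max = max(data)  # 500.0

# Bootstrap the maximum: a resample of size n drawn with replacement
# includes any given point with probability 1 - (1 - 1/n)**n, about 63%,
# so roughly 37% of resamples miss the outlier entirely.
boot_maxes = [max(random.choices(data, k=n)) for _ in range(2000)]
missed = sum(1 for m in boot_maxes if m < true_max) / len(boot_maxes)
# missed comes out around 0.37: the bootstrap distribution of the maximum
# is badly skewed and does not track the real sampling distribution.
```

Contrast this with the mean, where no single point dominates and the bootstrap distribution behaves well.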

And so more generally, outliers can cause trouble with bootstrapping, but they sorta can with any method. So that's not a significant weakness of resampling methods in general, and in fact, in a moment, we'll see one thing you can do about outliers. Okay, so it is a little bit sensitive: resampling can be done incorrectly with more complex examples that fail to preserve the original sampling structure.

We saw this a little bit with the example of the confidence interval just now, where you need to do a bootstrap sample of one cohort and a bootstrap sample of the other cohort, as opposed to a bootstrap sample where you pool them together. There are ways to actually do it in kind of a pooled way, but in general, the structure needs to match the experiment you're trying to do.
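A sketch of the cohort-respecting version, with hypothetical survival-day values invented for illustration:

```python
import random

random.seed(2)
# Hypothetical survival-day cohorts (values made up for illustration).
treatment = [36, 41, 29, 45, 38, 33, 40, 37]
placebo = [27, 30, 24, 31, 26, 29, 25, 28]

def mean(xs):
    return sum(xs) / len(xs)

# Correct structure: resample each cohort separately, matching how the
# data were actually collected, then take the difference of means.
diffs = []
for _ in range(5000):
    t = random.choices(treatment, k=len(treatment))
    p = random.choices(placebo, k=len(placebo))
    diffs.append(mean(t) - mean(p))
diffs.sort()
lo, hi = diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs))]
# (lo, hi) is a 95% percentile interval for the difference in means.
# Pooling both cohorts into one resample would instead simulate the
# null hypothesis of no difference -- that's the shuffle test, not a CI.
```

The per-cohort resampling is what preserves the two-sample design of the original experiment.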

Okay, another case: if the data are not independent, you can't use the bootstrap sample. And it may be tempting to do so, because it may not be obvious that it's not valid. There are sometimes tricks you can play. The trick is, if you have a sequence of mutually dependent values, so that the independence assumption doesn't hold, what you can do is break those into sub-samples that don't overlap. So let's say a time series of, like, stock prices or something, which are not independent, of course, right? Every value of a stock price depends on the value previously.

4:44

Take non-overlapping sub-sequences of these stock prices. Break a long history into chunks, and then treat each chunk as an individual observation: compute some statistic on each, and then do resampling on those chunks. And that's been shown to work, so there are sometimes ways to apply this general technique in kind of creative ways.
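A rough sketch of this chunk-based resampling on a simulated random walk (all numbers here are made up; the chunk statistic is just one possible choice):

```python
import random

random.seed(3)
# Hypothetical price series: a random walk, so consecutive values are
# dependent and naively bootstrapping individual points is invalid.
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] + random.gauss(0, 1))

# Break the history into non-overlapping chunks, compute a statistic per
# chunk (here, mean daily change), then resample the chunk-level
# statistics as if they were independent observations.
block = 25
chunks = [prices[i:i + block] for i in range(0, len(prices), block)]
chunk_means = [(c[-1] - c[0]) / (len(c) - 1) for c in chunks]

boot_stats = []
for _ in range(2000):
    sample = random.choices(chunk_means, k=len(chunk_means))
    boot_stats.append(sum(sample) / len(sample))
boot_stats.sort()
ci = (boot_stats[50], boot_stats[1949])  # ~95% interval for mean drift
```

The chunks are only approximately independent, which is the trade-off this trick makes; longer chunks weaken the dependence between them at the cost of having fewer observations to resample.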

So it's not particularly limited to a few examples, certainly not the few examples that I'm gonna show.

5:10

But think of it as a very general approach. You know, write a program, do the simulation to simulate the experiments that you want. You may have to think creatively about how to apply this in a safe manner, but it's a very general approach.

5:38

And you know, one quote here that I copied down that I sort of liked: it's tempting for somebody to say, oh, just let the data speak for itself. And Nate Silver has this quote, the numbers have no way of speaking for themselves. There's always gonna be an interpretation about them. There's always gonna be some knowledge about where that data came from. There's always gonna be biases lurking in there that only the source of the data can reveal.

6:03

So putting the statistician's hat on, as opposed to, say, the machine learning or computer scientist's hat: understanding the source of the data, and understanding the experimental design that led to that data (or, if you weren't aware of the experimental design, understanding the other kinds of biases or contexts in which it was sampled) is always gonna be crucially important, whether or not you're using resampling methods or classical methods.
