Okay, so you would get exactly the distribution of the sample median of ten
die rolls. And if you wanted to know the sampling distribution of the median of
twenty die rolls, you'd have to roll the die twenty times, get a sample median,
and repeat that process over and over again; that would do it for you.
Okay, so now we know, if we can actually
sample from the population distribution over and over and over again, how we would
get the sampling distribution of a statistic.
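Here's a minimal sketch of what that simulation might look like in R (the variable names are just illustrative):

```r
## Simulate the sampling distribution of the sample median of ten die rolls:
## roll a fair die ten times, take the median, and repeat many times
medians <- replicate(10000, median(sample(1 : 6, 10, replace = TRUE)))
hist(medians)   # an estimate of the sampling distribution of the median
```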
But when confronted with real data, we can't roll the die.
Right. We don't know what the population
distribution is, so we can't do it. But what we can do is roll a die where on
every side we've put a number associated with an observed data point. Then we're
not drawing from the population distribution; we're drawing from the empirical
distribution. Okay, so say we had ten data points and we want to know what the
distribution of the sample median of ten observations is.
Well, we can't draw from the population distribution, but what we can do is draw
samples of size ten from the distribution defined by the data we observed, and look
at what the distribution of the sample median is for those.
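In R terms, the only change from the die simulation is that we sample from the observed data rather than from the faces of a die. A one-line sketch, where x is a hypothetical vector holding the ten observed data points:

```r
## One draw from the sampling distribution of the median under the
## empirical distribution: resample the observed points with replacement
median(sample(x, length(x), replace = TRUE))
```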
And that is exactly what the bootstrap does in practice, via resampling. It
basically says: we know exactly what we would do if we actually knew the
population distribution, so why don't we just do that, using the empirical
distribution in its place, and see how that works? It's a really nifty idea.
So again, let's just take our 630 measurements of grey matter volume from
workers at a lead manufacturing plant. The median grey matter volume is
about 589 cubic centimeters, and we want a confidence interval for the
median of these measurements. How do we do that?
So here's our bootstrap procedure for calculating a confidence interval for the
median of a data set of N observations, where we know nothing about the
sampling distribution of medians of N observations.
We would sample N observations with replacement from the observed data,
resulting in one simulated complete data set.
We would take the median of this simulated complete data set.
That would give us one bootstrap resample and one bootstrap resampled median.
Then we would repeat this step B times, let's say, resulting in B simulated
medians of N observations, those N observations having been drawn with
replacement from the collection of observed data. These medians are exactly
draws from the sampling distribution of the median of N observations from the
distribution of the observed data, and we're going to say that's approximately
equal to the sampling distribution of the median of N observations drawn from
the population distribution.
That's the leap of faith we're making: that this bootstrap process approximates
what we would get if, instead of drawing from the observed data, we were drawing
from the actual population distribution. And we could take these B sample
medians and draw a histogram of them. Then, say we wanted a 95% confidence
interval: why not take the 2.5th and 97.5th percentiles and call that a
confidence interval for the median? That's exactly the so-called bootstrap
percentile confidence interval.
So it's hard to describe, and I know I'm butchering it; if I were Efron I'd be
doing a much better job, but unfortunately you have me and not Efron. It's
difficult to describe, for me at least. On the next page I'm showing you the R
code for doing this, and even though I've neatened up the R code a little bit,
it's probably a little longer than it needs to be; you could do this in about
four lines.
So here B is my number of bootstrap resamples. I said let's just do it a
thousand times, but you want to set this number B big enough that you don't
have to worry about the Monte Carlo error in your resampling; you don't want
the number of times you've rolled the die to be a factor in what you're doing.
So here I did 1,000, but crank it up until you're tired of waiting, at least.
There is a science to how you pick B, but we're not going to talk about it in
this class. And N is the number of observations that I have.
Then the resamples: this code right here just draws with replacement from the
collection of N observations; it draws B complete data sets of size N from that
distribution. The replace = TRUE means that we're sampling with replacement.
Then I dump all of these resamples into a matrix, so that every row is a
complete data set; there are B rows and N columns. Then I go through every row
and calculate the median in the next line, and that gives me B medians, where
each median was obtained from a resample of N observations from the observed
data. If you take the standard deviation of these medians, that is a bootstrap
estimate of the standard deviation of the sampling distribution of the median.
If you take the quantiles, the 2.5th and 97.5th, you get 582 to 595. That is a
bootstrap confidence interval for the median of grey matter volumes,
constructed in the nonparametric way. And it's always informative in the
bootstrap to plot a histogram of your resampled statistics, in this case
medians.
Okay, so here is my histogram of my resampled medians. The 2.5th and 97.5th
quantiles of my bootstrap resampled medians are drawn here as dashed lines, so
95% of my resampled medians lie between these two lines, and we're going to
call that a bootstrap confidence interval.
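Here's a minimal sketch of the code being described, under the assumption that the 630 measurements sit in a vector; gmVol is a hypothetical name for that vector, and the lecture's actual code may differ in its details:

```r
B <- 1000                      # number of bootstrap resamples
n <- length(gmVol)             # number of observations

## Draw B complete data sets of size n with replacement from the observed
## data, and dump them into a matrix: every row is one complete data set,
## so there are B rows and n columns
resamples <- matrix(sample(gmVol, n * B, replace = TRUE), B, n)

## The median of every row: B bootstrap resampled medians
medians <- apply(resamples, 1, median)

sd(medians)                            # bootstrap estimate of the SE of the median
quantile(medians, c(0.025, 0.975))     # percentile interval (about 582 to 595 here)

## Histogram of the resampled medians with the interval endpoints dashed
hist(medians)
abline(v = quantile(medians, c(0.025, 0.975)), lty = 2)
```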
Now, I'm going to give you some notes on the bootstrap. For both the bootstrap
and the jackknife, today's lecture is really just a teaser. As you can probably
guess from my description, they're sufficiently difficult techniques that you
don't want to take these lectures and view them as enough knowledge to just run
out and use them willy-nilly. I just wanted to give you a teaser so that if you
hear the terms you know what people are talking about.
So the bootstrap, the one that I described today, is nonparametric; it makes
very few assumptions about the population distribution. The theoretical
arguments proving the validity of the bootstrap tend to rely on large samples,
so there's a question about when and how you can apply it, but I find it to be
a very handy tool in general.
The confidence interval procedure that I gave you, these percentile confidence
intervals, they're not very good. You can improve on bootstrap confidence
intervals by correcting the endpoints of the intervals, and the procedure I
would recommend is the so-called BCa confidence interval; the bootstrap package
in R will calculate these for you directly if you like. That's what I mean when
I say here that better percentile bootstrap confidence intervals correct for
bias. And then there are lots and lots of variations on the bootstrap
procedure. There's parametric bootstrapping; there's bootstrapping for time
series, where you have to do something different. There are all sorts of
different ways to think about the bootstrap, and data resampling in general.
And the book An Introduction to the Bootstrap by Efron and Tibshirani is, for
anyone who's taken this class and absorbed the material, at a level that you
should be able to understand. It's beautifully written, and it's a wonderful
treatment of the subject. In addition, there are lots and lots of other books
on the topic of the bootstrap, probably too many good ones to name; some of
them are unbelievably theoretical, and others are quite accessible. I think the
Efron and Tibshirani book strikes a very nice balance between giving you why
things work and how to do things. It also covers the jackknife and other data
resampling procedures. The last thing I wanted to mention is that I
gave you the exact code that you could use to generate for yourself the bootstrap
sampling distribution. You could, of course, use the bootstrap package in R,
which in this case takes about as many lines of code as programming it up
yourself, and on this last slide I go through actually using the bootstrap
package. But the nice thing about the bootstrap package is that it will
actually give you this bias-corrected interval.
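As a rough sketch of what that last slide does: here I'm using the boot package (a close relative of the bootstrap package mentioned above), whose boot() and boot.ci() functions compute the BCa interval directly; gmVol is again a hypothetical name for the data vector:

```r
library(boot)

## The statistic must accept the data and a vector of resampled indices
medianStat <- function(data, indices) median(data[indices])

b <- boot(gmVol, statistic = medianStat, R = 1000)

## Percentile and bias-corrected-and-accelerated (BCa) intervals
boot.ci(b, type = c("perc", "bca"))
```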
In this case you can see that the bias-corrected interval is nearly identical
to the percentile interval, so it didn't make a big difference. But you can, of
course, come up with instances where the bias-corrected interval is a little
bit better. So that's the end of today's lecture.
That was a teaser on the idea of bootstrap resampling and a little bit on the
use of the jackknife. I hope this inspired you to go learn a little bit more
about these tools; they are among the wide class of tools that became available
as modern computing came about. The idea is being able to use our data,
especially when we have large data sets, more fully, and to use the data to
come up with things like sampling distributions instead of relying on
mathematics and assumptions and that sort of thing. So it was a neat idea
brought about by the computational revolution, and it's a very nifty technique.
Well, next time will be our last lecture, and I look forward to talking about
it with you.