1:57

And so when we had a two stage cluster sample,

what we talked about doing was estimating the variance of a statistic in this case

a proportion, for our current design, let's call it design one.

And we would calculate that by using the available data to

fit into the formula shown here.

1 minus the sampling fraction f, that combination (1-f), the finite population

correction, divided by the number of random events in the sample, in our case a.

Actually it's the number of random events in our variance calculation, but

lowercase a, the number of clusters in the sample, times s sub a squared,

the variability of a cluster characteristic across the clusters.

We won't go through the formulas, but

we coupled that then as part of the estimation.

That was enough, we would take a square root and

go down one path from that to calculate a standard error and a confidence interval.
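The path just described, from the cluster variance to a standard error and a confidence interval, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, the value of s sub a squared is a placeholder rather than a lecture figure, and the interval uses a normal multiplier of 1.96, which the lecture does not specify.

```python
import math

def cluster_sample_variance(s_a_sq, a, f=0.0):
    """Variance of the proportion under the cluster design: (1 - f) / a * s_a^2."""
    return (1.0 - f) / a * s_a_sq

def standard_error_and_ci(p, variance, z=1.96):
    """Standard error and a normal-approximation interval (z = 1.96 is an assumption)."""
    se = math.sqrt(variance)
    return se, (p - z * se, p + z * se)

# Placeholder inputs: s_a_sq and f are hypothetical, not values from the lecture.
var_p = cluster_sample_variance(s_a_sq=0.0131, a=60, f=0.0)
se_p, (lower, upper) = standard_error_and_ci(p=0.4, variance=var_p)
print(f"variance = {var_p:.6f}, se = {se_p:.4f}, interval = ({lower:.3f}, {upper:.3f})")
```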

2:52

But we also did something a little bit different, and that was to say,

we also would like to understand the impact of the design on our outcomes.

So let's compare it back to a simple random sample,

the idea of building a design effect.

To estimate a design effect then, what we're going to do is

calculate from the same data the variance of that same statistic,

in this case a proportion, for the existing data, design one,

if you will, under simple random sampling assumptions.

Now for a proportion, it turns out that that variance calculation we

looked at very briefly was 1 minus the sampling fraction, 1-f,

that finite population correction, the whole thing all together,

times p(1-p) over the sample size minus 1, n-1, from that existing design.
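As a rough sketch of that comparison, the simple random sampling variance and the design effect, taken as the ratio of the design variance to it, could be computed as follows. The function names are hypothetical, the sampling fraction is set to zero for simplicity, and the design variance of 0.000218 is the figure quoted for the original design in this lecture.

```python
def srs_variance_proportion(p, n, f=0.0):
    """Simple random sampling variance of a proportion: (1 - f) * p * (1 - p) / (n - 1)."""
    return (1.0 - f) * p * (1.0 - p) / (n - 1)

def design_effect(var_design, var_srs):
    """Design effect: variance under the actual design over the SRS variance."""
    return var_design / var_srs

# p = 0.4 and n = 2,400 are the original-design values used in this lecture;
# f = 0 is an assumption made here to keep the example simple.
var_srs = srs_variance_proportion(p=0.4, n=2400)
deff_hat = design_effect(var_design=0.000218, var_srs=var_srs)
print(f"SRS variance = {var_srs:.7f}, design effect = {deff_hat:.3f}")
```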

4:07

And then finally from that design effect we said we also note that

there's something that drives it.

Let's estimate that driving factor,

that rate of homogeneity, by taking that design effect -1.

Stripping out from the design effect, if you will, the base,

the simple random sampling, and only the added effect from the clustering.

And then dividing that by b-1, taking out the effect of the sub-sample size,

the b, from our existing design: our design effect -1 divided by b-1.

So we have a value now for homogeneity.
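That calculation is short enough to write out directly. The sketch below uses a made-up function name and plugs in the design effect of 2.1795 and the sub-sample size of 40 that are quoted for the existing design in this lecture.

```python
def rate_of_homogeneity(deff, b):
    """roh = (deff - 1) / (b - 1): the clustering effect per added element in a cluster."""
    if b <= 1:
        raise ValueError("sub-sample size b must be greater than 1")
    return (deff - 1.0) / (b - 1.0)

# Design effect and sub-sample size of the existing design, as quoted in this lecture.
roh = rate_of_homogeneity(deff=2.1795, b=40)
print(f"roh = {roh:.4f}")  # roughly 0.0302
```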

Now this is kind of a second thread that we might follow.

The first was to get towards confidence intervals,

that's the one that's most useful for us analytically.

But here, these are building up for the next step in the process,

because what we're going to do is think about our next design.

We're going to build on this one.

We may do a new application in a different population.

We may go back to the same population at a different point in time.

We may go back to the same population at a different point in time and

change the sample design in some way, change the sub-sample size,

the number of clusters, the overall sample size.

But what we're going to need to do then is projection.

What is going to happen in a new setting without having drawn the sample?

So this is an essential design feature that comes up in any number of applied

math areas, whether it's in statistics, and in this case a branch of statistics

dealing with surveys, or in engineering, in civil engineering, or some other area.

What's going to be the outcome?

Well, what's the key outcome here?

5:48

There are two that are primary in our thinking, one is the mean, or

in this case the proportion, and the second is its standard error.

So what kind of standard error would we actually get if we changed the design?

In order to do this process,

what we're going to need to do is build up from the past information we have,

our second thread, starting with that roh value, that rate of homogeneity.

That rate of homogeneity becomes portable, in a certain sense.

It becomes the foundation,

the building block upon which we're going to build our projection.

We're going to use the roh value that we've calculated from the past survey

to project a design effect,

but now we're not calculating the design effect here as a ratio of variances.

We're calculating a design effect as a combination of the simple random

sampling base, 1, plus a sub-sample size, b sub 2,

the one that we're going to use in our new design, minus 1, times roh.

Well, we're not going to invent a new roh, we're going to borrow roh from past data.

Because we are using very similar clusters,

last time we used schools, now we're using schools.

Last time we used blocks, we're using blocks now.

Last time we used enumeration areas, we're using enumeration areas now.

And we're also measuring a very similar characteristic,

roh is specific to that variable.

So we can build a new design effect for that variable, for that particular design.
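To see how a borrowed roh turns into a projected design effect, here is a minimal sketch. The function name is hypothetical, the roh value is backed out from the design effect of 2.1795 at b = 40 quoted in this lecture, and the sub-sample sizes in the loop are only examples.

```python
def projected_design_effect(roh, b_new):
    """Projected design effect for a new sub-sample size: 1 + (b_new - 1) * roh."""
    return 1.0 + (b_new - 1.0) * roh

# roh borrowed from the past survey (design effect 2.1795 with b = 40).
roh_past = (2.1795 - 1.0) / (40 - 1)

# Illustrative sub-sample sizes; the design effect grows linearly with b.
for b_new in (10, 20, 40, 80):
    print(f"b = {b_new:3d}  projected deff = {projected_design_effect(roh_past, b_new):.3f}")
```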

7:19

But we also know that if we were to calculate or have available a simple

random sampling variance, we could use the combination of the design effect and

the simple random sampling variance to project an actual variance,

to calculate an actual variance.

To come up with a numeric representation of our uncertainty,

our anticipated uncertainty under the new design.

So, we can calculate a simple random sampling variance.

Now here, this simple random sampling variance ignores the 1-f.

I probably could've ignored it on the other side, but

here just to simplify the calculations, we're rounding the 1-f to 1.

We're using a proportion again, p(1-p),

giving us essentially our element variance.

And we're dividing by the new sample size, n sub 2.

So we've got a new sub-sample size, a new sample size, an old value of roh and

maybe a past value of the proportion or a new value of the proportion,

because we think that it's going to change in a particular way.
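Putting those pieces together, a projected variance is the simple random sampling variance, with 1-f rounded to 1, inflated by the projected design effect. The sketch below uses a made-up function name and, as a sanity check, feeds the original design back in (2,400 cases, sub-samples of 40, p = 0.4), which should come back to about the 0.000218 quoted for that design.

```python
def projected_variance(p, n_new, b_new, roh):
    """Projected variance of a proportion: design effect times SRS variance."""
    srs_var = p * (1.0 - p) / n_new       # element variance over n, with 1 - f taken as 1
    deff = 1.0 + (b_new - 1.0) * roh      # design effect built from the borrowed roh
    return deff * srs_var

# roh backed out from the past design effect of 2.1795 with b = 40.
roh_past = (2.1795 - 1.0) / (40 - 1)

check = projected_variance(p=0.4, n_new=2400, b_new=40, roh=roh_past)
print(f"projected variance = {check:.6f}")
```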

9:53

Now that's in contrast to our alternative design B, same reduced sample size,

1,200, but now we're going to keep the same number of clusters,

which implies now that the sub-sample size goes from 40 down to 20.

What would happen in this case?

Do we have sufficient tools to do this projection?

In a way, this is like what's going on with climate change projections.

They're projecting what would happen under alternative models.

We have one basic model here that involves the design effect, homogeneity and

sub-sample size, along with a simple random sampling variance that we can

inflate or adjust for the clustering effect reflected in our design effect.

10:38

So how would we do this?

For A, what we're going to do is compute the simple random sampling variance, or

I suppose we could compute the design effect first, and

then the simple random sampling variance, but we're going to have a new sample size,

a proportion that we're going to have to make an assumption about.

In step two, a design effect, in which we're going to use a past value of roh and

our new sub-sample size, or in our case the old sub-sample size: in step two,

for alternative A, b is exactly the same as before, it's 40.

And then we're going to compute the product of the simple random sampling

variance and

the design effect to give us our projected sampling variance under our new design.

And for alternative B,

we'd repeat these steps as well, replacing b = 40 with b = 20,

with the same roh value, the same proportion, and

that would allow us to compare sampling variances between these two designs.
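The three steps can be run side by side for the two alternatives. This is a sketch, not the lecture's own worksheet: the function and variable names are made up, p = 0.4 and the design effect of 2.1795 at b = 40 are the figures quoted in this lecture, and the finite population correction is ignored.

```python
import math

P = 0.4                                   # assumed proportion, carried over from the past survey
ROH = (2.1795 - 1.0) / (40 - 1)           # roh borrowed from the past design effect

def project(n, b, p=P, roh=ROH):
    """Steps 1-3: SRS variance, design effect, and their product."""
    srs_var = p * (1.0 - p) / n           # step 1: simple random sampling variance
    deff = 1.0 + (b - 1.0) * roh          # step 2: projected design effect
    var = srs_var * deff                  # step 3: projected sampling variance
    return {"srs_var": srs_var, "deff": deff, "var": var, "se": math.sqrt(var)}

design_a = project(n=1200, b=40)          # 30 clusters of 40
design_b = project(n=1200, b=20)          # 60 clusters of 20

for name, d in (("A", design_a), ("B", design_b)):
    print(f"design {name}: deff = {d['deff']:.4f}, variance = {d['var']:.7f}, se = {d['se']:.4f}")
```

With the same total sample size, the smaller sub-sample in design B gives the smaller projected variance, because only the design effect differs between the two.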

For design A, for example, with 1,200 as our sample size and

30 clusters of 40 elements each, our design effect in this

particular case is 2.1795, it's the same one that we had before.

We haven't changed anything from before, we're using the same value of roh,

the same value of b, so it's the same.

So all we need to do in this particular case then

is compute a new simple random sampling variance.

And ignoring the finite population correction, which we actually did anyway

in that prior calculation in the illustration we had done before.

We have p(1-p), 0.4, the value we had before,

x 0.6 divided by 1,200, or

a sampling variance of the proportion of 0.0002.

The product of that with our design effect gives us a sampling variance of 0.0004358.

We would take a square root to get a standard error of course, but we'll stop

there, because we're going to compare this variance to the one under design B.

Under design B we have a design effect now that is different.

b has changed from 40 to 20, and that design effect goes

from 2.17, 2.18, down to 1.575.

It's not cut in half just because we cut the sub-sample size in half;

it's the added clustering effect, the part above 1, that has been roughly cut in half.

And so now in this particular case then, when we multiply together this

projected design effect, 1.575 under this new design,

by the simple random sampling variance, which hasn't changed.

It's still the same sample size and the same proportions,

we get a variance of 0.000315.

Well we can compare these.

And here's the table comparing what we had before in our original design, 2,400,

that's that first row of numbers, where we had 60 clusters of 40 elements each and

a design effect of 2.17, 2.18 and a variance of 0.000218.

In our projected A version, with 1,200 cases in the sample and

30 clusters of 40 elements each, same design effect as we've noted.

But our variance is now twice as large, it's doubled.

That is, if we take the sample size and cut it in half,

by cutting the number of clusters in half, we double the sampling variance.

It goes the other way too.

If we were to take the sample size and double it, by doubling the number of

clusters, our sampling variance would decrease by one-half.
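The halving and doubling follows from holding the sub-sample size fixed: the design effect stays put, so the variance moves inversely with the number of clusters. A small sketch, with a hypothetical function name and the lecture's p = 0.4, b = 40 and borrowed roh, makes that visible.

```python
def clustered_variance(a, b, p, roh):
    """Projected variance for a clusters of b elements each (1 - f taken as 1)."""
    n = a * b
    return (1.0 + (b - 1.0) * roh) * p * (1.0 - p) / n

roh_past = (2.1795 - 1.0) / (40 - 1)      # roh borrowed from the past design

# Same sub-sample size, with the number of clusters halved and doubled around 60.
variances = {a: clustered_variance(a, b=40, p=0.4, roh=roh_past) for a in (30, 60, 120)}
for a, v in variances.items():
    print(f"a = {a:3d}, n = {a * 40:4d}, variance = {v:.7f}, ratio to a=60: {v / variances[60]:.2f}")
```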

14:23

Now we can begin to see alternatives here in a quantitative way and

compare them and make decisions about what's the best approach to use.

We're going to do a more refined version of this,

because one of the key issues will be, which of these designs should we use?

Should we use one that has sub-samples of size 40 or

sub-samples of size 20 or some other number.

What's the best sub-sample size?

As we look at alternative sub-sample sizes, we're going to need to choose one

that's appropriate, but we're going to look at that in lecture 6.

Here we have two more things to look at before we wrap up this lecture.

One, the impact of the design effect on sample size, and two,

the impact of the design effect or

these projected variances on confidence intervals and their width.

Let's look at those in the second part of our lecture four

on designing two stage samples, thank you.