0:44

The basic problem is this. Suppose we're thinking about, say,

a nationally representative household survey with in-person visits.

Then in a large country, many of the households in a truly random sample,

that is, a simple random sample, may be the only sampled household in their city,

or their town, or their village.

1:06

So we may end up with a situation where we have hundreds of towns or

villages, each with a single household

that has to be visited to conduct these in-person interviews.

So our travel budget could be immense.

And so the time and

the budget requirements could easily become overwhelming.

Let me try to clarify how this works.

So say that we have a total population

of 25,000 households in some hypothetical country.

And these households are distributed across towns and

1:48

villages of 10,000 households, 1,000 households, or 100 households.

Now if they're all adjacent to each other, that may not be that big of a problem. But

imagine if they're spread over a very

large geographic area, perhaps a continent or just a very large country.

Now, just by the laws of probability, suppose we take our list of 25,000

households, which is our sampling frame for this population,

and we draw 250 households at random.

On average, we're going to have households that we have to visit,

by luck of the draw, in almost every single one of our towns and villages.

So take our towns that have 10,000 households.

Our sample is 250 out of 25,000,

so one out of 100.

So on average a town of 10,000 households will have 100 sampled households.

Now that isn't that much of a problem.

But each of the villages that have just, say,

100 households is, on average, by luck of the draw,

going to have one household that we have to visit.

Now, of course, in practice, some of them may have zero,

some of them may have two.

But it's entirely possible that we would have to physically visit every single one

of the administrative units, the towns and

villages, in the population that we want to study.

For example, if we look at the villages with only

100 households, there are ten such villages.

We might have to visit possibly every single one of them, and if they're remote,

maybe they require a drive, or maybe somebody has to fly there.

Our travel budget is going to go up very, very rapidly.

It might become unwieldy.
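
To make the luck-of-the-draw argument concrete, here is a minimal Python sketch that repeatedly draws a simple random sample of 250 households and counts how many distinct places would need a visit. It assumes a population of 2 towns of 10,000 households, 4 towns of 1,000, and 10 villages of 100; the count of four mid-sized towns is an inference that makes the total come to 25,000, and the names `SIZES`, `build_frame` and `places_visited` are made up for illustration.

```python
import random

# Assumed population: 2 towns of 10,000 households, 4 towns of 1,000,
# and 10 villages of 100 (25,000 households in total).
SIZES = [10_000] * 2 + [1_000] * 4 + [100] * 10

def build_frame(sizes):
    """Return the sampling frame: one entry per household, labelled by place."""
    return [place for place, size in enumerate(sizes) for _ in range(size)]

def places_visited(frame, n, rng):
    """Draw a simple random sample of n households; return the places it touches."""
    return set(rng.sample(frame, n))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
frame = build_frame(SIZES)

# Repeat the draw many times and average the number of distinct places.
runs = [len(places_visited(frame, 250, rng)) for _ in range(1_000)]
avg_places = sum(runs) / len(runs)

# Expected sampled households per place under a 1-in-100 sampling fraction.
expected = [size * 250 / len(frame) for size in SIZES]

print(f"average places visited: {avg_places:.1f} of {len(SIZES)}")
print("expected households per place:", expected)
```

Under these assumptions each 100-household village has roughly a 37 percent chance of being missed entirely in a given draw, so a typical sample touches about 12 of the 16 places, and most of the villages visited contribute only one or two interviews.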

3:53

The most common approach to dealing with this problem is multistage

cluster sampling.

When we talk about clusters, these refer to geographic or

administrative units in which the sampling units, the individuals or

households that we're interested in, may be nested.

And the basic idea is that we sample clusters,

and then within clusters we sample our individuals or households.

Or if we're truly multistage, we may sample further clusters within clusters

until we get all the way down to individuals and households.

4:26

"Stage" in multistage refers to the level of aggregation.

So as we'll see in an example, we might have multiple stages,

where we first draw a sample of states.

We define states as clusters, and

we draw some states at random.

And then, at the next level down, the next stage, within each state

we define counties as clusters and randomly draw a selection of counties.

And then perhaps go even further down to further stages.

It could be city blocks and so forth.

Clusters are sampled at each level, at each stage.

Could be states, could be counties.

And then within each cluster, either lower-level, lower-stage clusters are sampled,

or we actually sample the units that are the focus of our analysis,

so perhaps individuals or households.

I'll go through an example in just a second.

5:49

Now, for the reasons that we just discussed,

this could be expensive and time consuming.

We might have a lot of households in New York and L.A.,

where we could just fly a team of interviewers into one of those cities,

and they could visit 50 or 100 households fairly straightforwardly.

But we might also have dozens or hundreds of households that were,

in every case, the only sampled household in their own town.

So in places like Montana, or small towns in Utah or

Tennessee, we might have to fly an interview team into

the nearest city with an airport.

And then they might have to rent a car and spend an entire day just getting

to a household in a particular town to conduct just one interview.

Again, this would be very expensive, very time consuming.

And we almost never have the budget for doing a simple random sample like that.

So if we took a multistage clustered

sampling approach, just hypothetically, to cut down on the number of places that we

have to actually physically visit, we might first select five states.

And then within each of these five states we select five counties.

And then within each county we select 10 residential blocks.

And then on each block we randomly select 10 households to visit.

That gives us a total sample size of 2,500.

And it turns out to be nationally representative.

But we've really cut down on the amount of travel that we have to

do in order to visit each of the households in our survey.

So again, 5 x 5 x 10 x 10 is 2,500.

So we have the same sample size, but at much, much reduced cost.
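
The nesting in that hypothetical design can be sketched in a few lines of Python. This is only an illustration: the frame sizes (50 states, 20 counties per state, 30 blocks per county, 40 households per block) are invented, the function name `multistage_sample` is made up, and every stage uses plain equal-probability selection to keep the sketch short. Only the numbers drawn at each stage (5, 5, 10, 10) come from the example.

```python
import random

def multistage_sample(rng, frame=(50, 20, 30, 40), take=(5, 5, 10, 10)):
    """Four-stage sample: states -> counties -> blocks -> households.

    frame: hypothetical number of units available at each stage.
    take:  number of units drawn at each stage (5, 5, 10, 10 in the example).
    Each stage uses equal-probability selection, which is a simplification.
    """
    n_states, n_counties, n_blocks, n_households = frame
    sample = []
    for state in rng.sample(range(n_states), take[0]):
        for county in rng.sample(range(n_counties), take[1]):
            for block in rng.sample(range(n_blocks), take[2]):
                for household in rng.sample(range(n_households), take[3]):
                    sample.append((state, county, block, household))
    return sample

rng = random.Random(0)
sample = multistage_sample(rng)

# Distinct places the interview teams would have to travel to.
counties_visited = {(s, c) for s, c, _, _ in sample}
blocks_visited = {(s, c, b) for s, c, b, _ in sample}

print(len(sample), len(counties_visited), len(blocks_visited))  # 2500 25 250
```

All 2,500 interviews fall in just 25 counties, which is where the savings in travel come from.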

7:42

Now one thing that we have to deal with is that clusters, which could be states,

could be counties, could be provinces, depending on what country we're

studying, are likely to vary in size.

And this affects what we do when we sample clusters.

Basically, what we need to do is make sure that the probability of a cluster

being selected, perhaps a state if that's our highest level, or

a province, is normally proportional to its size,

that is, proportional to its population.

If clusters are equally likely to be selected but

the number of people in each cluster varies,

then it turns out, and I'll illustrate this in just a second, that individuals

living in smaller clusters will be overrepresented in our final sample.

8:33

So sampling with probability proportional to size will address this issue.

And it can be repeated at multiple levels.

So if we're first sampling states, and then within states counties,

and then within counties residential blocks, we can make sure at each level

that the probability that a cluster is drawn, a state, a county, a block,

is proportional to the total number of people living there.
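
One simple way to draw clusters with probability proportional to size is sketched below, assuming the cluster populations are known. The four population figures and the name `pps_draw` are hypothetical. The sketch draws with replacement using the standard library's `random.choices`; surveys in practice often select without replacement instead, for instance with a systematic procedure, so treat this as an illustration of the idea rather than a production method.

```python
import random
from collections import Counter

def pps_draw(sizes, n, rng):
    """Draw n cluster indices with replacement, each draw proportional to size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=n)

# Hypothetical cluster populations (not from the lecture).
sizes = [8_000_000, 4_000_000, 2_000_000, 1_000_000]

rng = random.Random(0)

# Many repeated draws, to check that selection frequency tracks size.
draws = pps_draw(sizes, 20_000, rng)
counts = Counter(draws)
shares = [counts[i] / len(draws) for i in range(len(sizes))]
targets = [s / sum(sizes) for s in sizes]

for share, target in zip(shares, targets):
    print(f"drawn {share:.3f}  vs  size share {target:.3f}")
```

The same function could be applied again at the next stage, with county or block populations as the weights.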

9:03

One piece of terminology I should mention is that we refer

to the first-stage units as primary sampling units, PSU, or first-stage units, FSU,

and to the units at later stages as secondary sampling units, SSU, and tertiary sampling units, TSU.

You may see these expressions, PSU, FSU and so forth, in papers.

9:23

Now let me clarify what the problem with an equal

likelihood of selecting clusters is.

So say we start with a population which is made up of

towns of 10,000, 1,000, or 100 people.

And we decide that we're going to sample five of these clusters at random,

and then within each of them sample 50 people.

If each of these units, whether it has 100 or 1,000 or

10,000 people, is equally likely to be selected, and we select five of them at random,

then based on the laws of probability it's likely that

we'll end up with a fairly large number of small clusters,

the towns and villages with just a hundred people,

in our final sample, each of which will contribute 50 people, and

maybe, again it could be luck of the draw, just one larger town.

So we'll end up with a sample of 250 people out of a population of 25,000,

but 200 out of those 250 people in this

hypothetical example live in small towns of 100 people each, which only account for

a small fraction of the overall population.

10:49

So if we redo this with probability proportional to size, where we

adjust our sampling mechanism so that the probability of a cluster,

a city or a town or a village, being selected is proportional to its size,

we're much more likely to get a sample of clusters that

produces a sample that resembles the population as a whole.

So we might end up with, say, both of the largest cities,

the 10,000-person cities, in the final sample,

perhaps a city of 1,000, and then maybe just two places of 100.

And then 50 people in each of these clusters, for a total of 250 people.

So if you look at the way the 250 people in our sample are

distributed across cities, towns and villages,

that distribution will resemble what we would see in the larger population.

So again, for the mathematics of

this you would have to take a more advanced class in survey sampling.

I can't go into that much detail here.

I just want to alert you to this issue.
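
Without going into the full mathematics, the core arithmetic of the comparison can be sketched as follows. The sketch assumes 2 cities of 10,000 people, 4 towns of 1,000 and 10 villages of 100 (the count of four towns is inferred so that the total is 25,000), five clusters, and 50 people per selected cluster. PPS is modelled as five draws with replacement, so a large city can be drawn more than once; the function names are made up for illustration.

```python
from fractions import Fraction

# Assumed population: 2 cities of 10,000, 4 towns of 1,000, 10 villages of 100.
SIZES = [10_000] * 2 + [1_000] * 4 + [100] * 10
N_CLUSTERS, TAKE = 5, 50
TOTAL = sum(SIZES)  # 25,000 people

def chance_equal(size):
    """Equal-probability clusters: P(cluster drawn) * P(person drawn within it)."""
    return Fraction(N_CLUSTERS, len(SIZES)) * Fraction(TAKE, size)

def chance_pps(size):
    """PPS clusters: expected draws of the cluster * P(person drawn per draw)."""
    return Fraction(N_CLUSTERS * size, TOTAL) * Fraction(TAKE, size)

# Chance that any one person ends up in the sample, by size of place.
for size in (10_000, 1_000, 100):
    print(f"place of {size:>6}: equal = {float(chance_equal(size)):.4f}, "
          f"PPS = {float(chance_pps(size)):.4f}")

# Expected make-up of the 250-person sample, by size of place.
for size in (10_000, 1_000, 100):
    k = SIZES.count(size)
    equal_share = Fraction(k, len(SIZES))
    pps_share = Fraction(k * size, TOTAL)
    print(f"places of {size:>6}: equal {float(equal_share):.1%}, "
          f"PPS {float(pps_share):.1%}")
```

Under these assumptions, equal-probability selection makes a villager 100 times more likely to be sampled than a city resident, while PPS with a fixed number of people per cluster gives everyone the same 1 in 100 chance, because the cluster size cancels out.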

12:21

And then, within each of these urban districts or rural counties,

they sample urban neighborhoods or rural villages.

And then within each of these, they sample households.

So again, this makes it possible to stay within a reasonable budget

while conducting a survey that is nationally representative for all of China.

12:58

And then finally, again, down to households.

So multiple stages or multiple levels.

Finally, the General Social Survey starts with standard metropolitan statistical

areas, as defined by the Census Bureau, or rural counties.

Within these, block groups or enumeration districts.

These are technical terms.

I can't get into much detail,

but they come from the census.

A block group is a collection of city blocks.

And an enumeration district might be an area within a rural area.

13:29

Then actual blocks.

And finally down to individuals.

One thing that we have to keep in mind if we're conducting a clustered sample is that

there are some implications for statistical inference.

Units within the same cluster may resemble each other. That is, households or

individuals living in the same village, the same town,

may have more in common with each other than they do with households and

individuals elsewhere in the country.

14:00

So units drawn from sampled clusters may not vary as much

as units would if they were drawn evenly from the population at large.

So if we drew a simple random sample from an entire country, we'd get a lot of

variation between the households and the individuals that are in our sample.

But if we are drawing from clusters,

because people living together in the same town,

the same village, the same city, may have more things in common than they do with,

say, random people from elsewhere in the country,

we might not get quite as much variation from a clustered sample as we would get

from a simple random sample.

14:41

Technically, and this requires more study in a class

focused on sample survey design, this implies that

clustering increases sampling variance and therefore standard errors.

So what this means is that, from one survey to another using

a clustered approach, we might see our estimates bounce around

more than if we were conducting a simple random sample.

This effect is more pronounced when clusters are fewer.

So if we have a lot of clusters, each with a small number of units, individuals or

households, it's less of a problem.
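
A common rule of thumb for the size of this effect is Kish's design effect approximation, deff = 1 + (m - 1) * rho, where m is the number of units per cluster and rho is the intraclass correlation, that is, how alike units in the same cluster are. The sketch below applies it to a sample of 2,500 split into either 25 or 250 equal-size clusters. The value rho = 0.05 is a made-up illustration, and the formula assumes equal-size clusters; the lecture itself does not give a formula.

```python
def design_effect(cluster_size, icc):
    """Kish's approximation: deff = 1 + (m - 1) * rho, for equal-size clusters."""
    return 1 + (cluster_size - 1) * icc

N = 2_500    # total sample size, as in the 5 x 5 x 10 x 10 example
ICC = 0.05   # hypothetical intraclass correlation, chosen only for illustration

results = {}
for n_clusters in (25, 250):
    m = N // n_clusters                 # units interviewed per cluster
    deff = design_effect(m, ICC)
    results[n_clusters] = deff
    print(f"{n_clusters:>3} clusters of {m:>3}: deff = {deff:.2f}, "
          f"effective n = {N / deff:.0f}, "
          f"standard errors inflated {deff ** 0.5:.2f}x")
```

With the same total sample size, spreading the interviews over more, smaller clusters gives a smaller design effect, which is the sense in which having many clusters makes clustering less of a problem.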

15:20

This will affect our calculations of statistical significance in our

statistical tests.

And typically, when people make use of data that come from multistage clustered

samples, they may apply sample weights

or other clustering adjustments to essentially fix the issues that

arise with the tests of statistical significance.
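
As one illustration of what a clustering adjustment can look like, the sketch below compares a naive standard error, which treats every respondent as independent, with one computed from the variation between cluster means. Everything here is simulated and hypothetical: 25 equal-size clusters of 100, a shared cluster-level component with standard deviation 1, and individual noise with standard deviation 2. Real survey software uses more general estimators, so this is only a sketch of the idea for the simplest case.

```python
import random
import statistics

rng = random.Random(1)
N_CLUSTERS, PER_CLUSTER = 25, 100   # hypothetical design

# Simulate responses: people in the same cluster share a common component.
clusters = []
for _ in range(N_CLUSTERS):
    shared = rng.gauss(0, 1)        # what neighbours have in common
    clusters.append([shared + rng.gauss(0, 2) for _ in range(PER_CLUSTER)])

values = [y for cluster in clusters for y in cluster]

# Naive standard error: pretends all 2,500 responses are independent.
naive_se = statistics.stdev(values) / len(values) ** 0.5

# Cluster-based standard error: treats the 25 cluster means as the data.
cluster_means = [statistics.fmean(cluster) for cluster in clusters]
cluster_se = statistics.stdev(cluster_means) / N_CLUSTERS ** 0.5

print(f"naive SE:         {naive_se:.3f}")
print(f"cluster-based SE: {cluster_se:.3f}")
```

In this setup the cluster-based standard error comes out several times larger than the naive one, so a test that ignored the clustering would overstate statistical significance.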

I would like to recap some of the main issues that come up with multistage

clustered sampling.

Again, it's most relevant for in-person interviews,

that is, where we have to send a team out, and it's expensive and

time consuming for every single household or person that we want to visit.

It can save a lot of time and

money by reducing the number of physical locations that we have to send a crew to.

16:26

Another issue that I really couldn't get into here, but

which you would learn about if you took a more advanced class,

is that probability proportional to size may actually be more difficult in settings

where the populations of clusters are not known.

So PPS may be straightforward in the United States,

where you have pretty good census data and

pretty good estimates of numbers of people living in states, counties and so forth.

And then it's fairly easy to draw samples with probability proportional to size.

It can be a real problem, though, in other countries that don't have

well-developed statistical systems, where we actually may not know how many

people are living in any given province or any given town.

And then we may not know what weight to give those towns when

we're sampling those clusters.

Again, that's an issue for a much more advanced class.
