0:22
Now, what do we mean by that?
We mean that, with respect to some sort of random mechanism,
you can say that the estimates you've created represent the full population.
So the interpretation can be in terms of repeated sampling.
In other words, suppose we drew the sample over and over in the same way
that we drew the particular probability sample we have,
made an estimate from each one, and looked at that ensemble of estimates.
Then those estimates would make sense.
I mean, they would average out to the population value, for example.
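To make that concrete, here is a minimal simulation sketch in Python (my illustration, not part of the lecture; the population and the sample size are invented) that draws many simple random samples, computes the mean from each, and checks that the ensemble of estimates averages out to the population value:

```python
import numpy as np

rng = np.random.default_rng(42)

# An invented finite population of 10,000 y-values.
population = rng.gamma(shape=2.0, scale=50.0, size=10_000)
census_value = population.mean()

# Draw the sample over and over in the same way: simple random
# sampling without replacement, n = 100 each time.
estimates = np.array([
    rng.choice(population, size=100, replace=False).mean()
    for _ in range(5_000)
])

# The ensemble of estimates averages out to the census value.
print(f"census value:         {census_value:.2f}")
print(f"average of estimates: {estimates.mean():.2f}")
```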
1:06
The interpretation can, on the other hand, be in terms of models.
And we have to do this for non-probability samples because
we don't have a repeated sampling mechanism to fall back on,
a repeated sampling mechanism that we control.
The modeling consists of either modeling how units show up in the sample,
which could be done in some kind of quasi-randomization way, or
modeling the structure of the y-values that we're measuring in the population.
Either one of those is a possibility for modeling.
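As a rough sketch of the quasi-randomization route (my own illustration, not a method given in the lecture; the covariates, the reference sample, and the logistic model are all assumptions), one common idea is to stack the non-probability sample on a reference probability sample, model which units show up in the non-probability sample, and invert the estimated propensities into pseudo-weights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented covariates (say, age and income) for a volunteer sample
# and for a reference probability sample of the same population.
X_volunteer = rng.normal(loc=[45.0, 60.0], scale=[12.0, 20.0], size=(500, 2))
X_reference = rng.normal(loc=[50.0, 55.0], scale=[15.0, 25.0], size=(2_000, 2))

# Label 1 = showed up in the volunteer sample, 0 = reference sample.
X = np.vstack([X_volunteer, X_reference])
z = np.concatenate([np.ones(len(X_volunteer)), np.zeros(len(X_reference))])

# Model how units show up in the sample (the quasi-randomization model).
propensity = LogisticRegression().fit(X, z).predict_proba(X_volunteer)[:, 1]

# Inverted propensities act as pseudo-weights for the volunteer units.
# (A production version would also account for the reference sample's
# own survey weights; this sketch ignores that.)
pseudo_weights = 1.0 / propensity
```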
Now, in probability sampling, we say an estimator is
unbiased if, over all the random samples that could be selected,
the values we compute from each of those samples average out to the census value.
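In symbols (my notation, not shown in the lecture): if p(s) is the probability of selecting sample s, theta-hat(s) is the value computed from that sample, and theta_U is the census value, design-unbiasedness means

```latex
E_p\big[\hat{\theta}\big] \;=\; \sum_{s \in \mathcal{S}} p(s)\,\hat{\theta}(s) \;=\; \theta_U
```

where the sum runs over the set of all samples that could be selected.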
2:09
Another important concept is consistency.
So an estimator is said to be consistent if, as the sample
size gets big, the estimator gets closer and closer to the census value.
This is actually a more desirable property than unbiasedness.
This is saying that as the sample size increases, we get closer and
closer to what we're trying to estimate.
Unbiasedness, on the other hand, can hold even if our estimates
bounce all around the population value by quite a large distance.
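Written out (again my notation), consistency is the standard requirement that the estimator lands arbitrarily close to the census value with probability approaching one as the sample size n grows:

```latex
\lim_{n \to \infty} P\big( \lvert \hat{\theta}_n - \theta_U \rvert > \epsilon \big) = 0
\qquad \text{for every } \epsilon > 0 .
```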
3:05
And we want these things to be true even for
complicated statistics, like medians and quartiles.
Now, as for probability samples,
we've met various kinds in Course 4, which was about sampling people and records.
Some examples that you learned about there are simple random sampling,
stratified simple random sampling, stratified systematic random sampling,
two-stage sampling, multi-stage sampling.
We can sample with probabilities proportional to some measure of size.
This is often done for businesses or institutions.
All of those are possibilities, and all of them rely on a random selection mechanism that we control.
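As a sketch of the last design on that list (mine, not the course's code; the size measure is invented), here is one simple way to approximate probability-proportional-to-size selection in Python; note that this sequential draw only approximates strict PPS inclusion probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented measure of size for 1,000 businesses (e.g., employee counts).
size = rng.lognormal(mean=3.0, sigma=1.0, size=1_000)

# Draw probabilities proportional to the size measure.
p = size / size.sum()

# Sequential ("successive") PPS draw of 50 units without replacement.
sample_idx = rng.choice(len(size), size=50, replace=False, p=p)
```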
3:53
Now on the other hand, non-probability samples are often used to make inference,
and we've got to think clearly about what we're doing there.
As I said a minute ago, the unbiasedness and
consistency properties have to be with respect to some sort of model.
4:11
And we need to be able to estimate the population model from the sample.
So in that sense, our sample has to be projectable to the full population,
even though we didn't obtain it in a random way.
So if a sample has serious holes in coverage, then
it's hard to justify saying its estimators are aiming at the right thing.
They may be biased. They may be estimators, but
not of the full population that we're interested in.
For example, suppose we've got a volunteer web panel and
it has no African-American women over 70 years old.
Well, if those women are an important part of the population and
they behave differently from the rest of the population
on whatever we're measuring, then we've got trouble.
5:14
Now, what types of non-probability samples are there?
There are many. I've just listed three general
categories here.
One might be a convenience sample.
For example, if you take all your students in an introductory psychology course and
you experiment on them in some way, that's a convenience sample.
Those students don't represent the entire population of a country or
even a subset of the country.
6:07
If we recruit persons from those who visit a particular website or
set of websites, that's another way of doing it.
A popup ad comes up and says, do you want to be part of a survey?
You say yes and you do it.
That would be a volunteer panel.
A little more organized way of doing this is called
a river sample, where you post your ads on some carefully
selected set of websites that people may visit, and
you ask visitors if they want to be part of a panel that does surveys.
And they go through some steps to actually get into the panel.
But these are not random samples or probability samples of an entire
finite population, because the sampler doesn't have control over who shows up.
7:07
On the other hand,
there are probability samples that really suffer from non-response.
In fact, they may have such a huge amount of it that you begin to wonder,
should we even treat them as probability samples?
For example, if you do an overnight election poll these days
in the US by telephone, it doesn't matter whether you include
7:33
landlines and cellphones; you'll still get the same sort of phenomenon.
You'll only get about 5% of the people
to actually answer the phone and cooperate with you.
Well, a 5% response rate is hardly what you'd call a good sample.
And it's hardly what you would be willing to defend as a probability sample.
Now, even if we draw a probability sample, we may have coverage errors, but
certainly, we'll have coverage errors in non-probability samples.
It could be under- or over-coverage,
depending on the frame that we're drawing from.
What we try to do to combat that is
something called calibrating the weights with auxiliary data.
So what we need is target population control totals that we know,
not necessarily for every individual in the population,
but at least we know grand totals for the population.
And we can adjust our sample weights so
that weighted estimates of these control
variables will match the population or census counts.
So if we do that, then what we hope is that the sample can be
projected to the target population using those covariates.
And we typically have to put a model-based interpretation on that.
So, for example, if we're surveying persons, a human population,
some of the covariates might be counts.
Counts by age, race, ethnicity, and
gender might be used as calibrating variables.
So the units we've got in our sample have to be expanded
using weights to represent the full population.
And at least we can do it in such a way that the weights will reproduce
the population control totals.
It doesn't necessarily mean that we do it for
all those other y-variables we're trying to estimate.
But if we can do it for
the control totals, then that's a step in the right direction.
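As a minimal sketch of that idea (my example, not the course's software; the age groups and control totals are invented), here is simple post-stratification, one basic form of calibration, which rescales the weights within groups so that the weighted counts reproduce the population control totals:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# An invented sample: a base weight and an age group for each respondent.
sample = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35-54", "55+"], size=400,
                            p=[0.5, 0.3, 0.2]),
    "weight": np.full(400, 250.0),  # base weights summing to 100,000
})

# Known target-population control totals by age group.
controls = pd.Series({"18-34": 38_000, "35-54": 34_000, "55+": 28_000})

# Rescale weights within each group so weighted counts match the controls.
group_weight_sum = sample.groupby("age_group")["weight"].transform("sum")
sample["cal_weight"] = (sample["weight"]
                        * sample["age_group"].map(controls) / group_weight_sum)

# The weighted estimates of the control variables now match the controls.
print(sample.groupby("age_group")["cal_weight"].sum())
```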
So we'll learn more about how to do that in later sections.