0:58

Now systematic sampling is a very simple method of

making a sample selection from a list, taking every so-many elements.

So suppose that what we had was a population

of transactions in this case, records.

These are billing records from credit cards.

And in this particular case,

there's a little bit of information about each of these.

There's a date and time, and a reference number that's quite a few digits.

A category, when the bill comes in,

it's classified with respect to the type of business or transaction that occurred.

A subcategory, some additional credit card information, and

then the amount, which is shown in the last column there.

1:43

And so in this case, we may be interested in drawing a sample of these,

even though we've got all of them.

We may be interested in drawing a sample of them,

because we're going to apply some additional process to them,

to understand something about the nature of the billings that we're receiving.

We may be drawing a sample in order to call the card holder, and

ask them questions about the transaction that is not part of this record.

We may be calling a sample of individuals to talk about

other kinds of purchases they might make with our credit card, other kinds of

things for which we need additional data that's not present with these transactions.

And so that's a reason for sampling these records and

then dealing with them in a sample setting that involves survey data collection.

3:01

When do we stop?

Well, we're going to stop when we have n selections.

Lowercase n is our sample size.

Now this list has capital N elements.

Let's suppose that in this particular case, this list had 1,000 elements in it,

and that we decided that we only needed 50 of them.

Well, there's an obvious problem if we start

taking every tenth element beginning with the first one.

When we're done with our sample,

we will only have sampled from the first half of the list.

To get 50 selections taking every tenth starting with the first,

we take the 1st and the 11th and the 21st and so on.

And if I do that 50 times, my last selection would be element 491.

3:42

That means that elements 492 to 1,000 have zero chance of being selected.

As do, by the way, having chosen the first element as the start,

elements 2, 3, 4 through 10,

elements 12, 13, 14 through 20, and so on.

There's a problem here, obviously, in doing systematic sampling and

always starting with the first and then taking every tenth.

We don't spread our sample out across the entire list, and

if there's something different about the transactions

in the first half of the list compared to the second, we've missed it.
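To see the problem concretely, here's a short sketch in Python (the variable names are my own, not from the lecture) of what always taking every tenth element, starting with the first, does on a 1,000-element list:

```python
# Naive systematic sampling: every 10th element, always starting at position 1.
N = 1000       # population size (capital N)
n = 50         # desired sample size (lowercase n)
interval = 10  # a "convenient" interval that ignores N and n

# 1-based positions of the selected elements: 1, 11, 21, ...
selections = [1 + interval * i for i in range(n)]

print(selections[:3], selections[-1])  # [1, 11, 21] 491
# The sample never reaches past position 491, so elements 492 to 1,000
# (and every skipped position in between) have zero chance of selection.
```

The whole sample lands in the first half of the list, which is exactly the coverage problem described above.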

So we need to spread our sample out over the whole list.

We're going to need to vary the counting interval to account for the size.

We're going to have to scale it to the size of the list.

And we should also vary the selection start.

There's no randomization in this if I always start with the first one.

That poor first transaction's always going to be in all of my samples.

I once did a sample of students, from a population registry in

a registrar's office for a university, and they had been doing this all along.

They had always been sampling by starting with the first case.

The programmer had an algorithm they found in a cookbook,

in a set of algorithms for random sampling;

it was actually systematic sampling, and it always started with the first case.

I pity that poor student who was first on the list, because they were

in all the samples, as long as they were first on the list.

We're going to vary that, so

we're going to do two things to modify this procedure.

What we're going to do is not take every 10th, but every 20th.

If there's 1,000 in the list and we need to get our sample spread across the whole

list, what we're going to do is take 1,000 divided by 50 to figure out the interval.

Not just an interval that's convenient, like 10, but

an interval that fits the size of the list.

So we're still going to count through the list, but

our counting interval may vary depending on the size of the list.

In addition, we won't start with the first. If we're going to take

every 20th, what we need to do is

vary the selection by starting at a random place among the first 20.

That way, when we start in the first 20, choose one at random, and

keep adding 20 to that to get our sample selections,

we will actually have a sample of size 50 before we run off the end of the list,

because of the scaling that we've done with respect to the population size.

So back to our list then.

In our list,

our transaction list, we randomly choose to start with the fourth one.

We've generated a random number from our software system, and we start with that

random selection: we take the 4th, then we add the interval to get the 24th,

add the interval again to get the 44th, and so on.

And so we've got a very even division of our population distribution, shown on

the lower left-hand side.

A very even spacing of our sample selections, such that we get our required

sample size.

And we start at random.

There's a random element to this.

So we've adapted our selection process to the size of the sample and

the size of the list.

We've calculated an interval to make it more formal.

An interval, let's call it k, that is equal to the population size divided

by the sample size.

In this case, 1,000 divided by 50, or 20.

And we choose the random start anywhere from 1 up to k, at random.
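As a minimal sketch, the whole procedure described above, the scaled interval k = N/n plus the random start from 1 to k, might look like this in Python (the function and variable names are my own, not from the lecture):

```python
import random

def systematic_sample(N, n, seed=None):
    """Select n of N elements (1-based positions) by systematic sampling."""
    rng = random.Random(seed)
    k = N // n                 # interval: population size / sample size
    start = rng.randint(1, k)  # random start, anywhere from 1 up to k
    return [start + k * i for i in range(n)]

sample = systematic_sample(N=1000, n=50)
# Always n selections, evenly spaced k apart, spread across the whole list.
print(len(sample), sample[1] - sample[0], max(sample) <= 1000)  # 50 20 True
```

With a random start of 4 this produces exactly the selections in the transaction example: 4, 24, 44, and so on.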

7:59

And then the rows are the sample selections.

1, 2, 3, up to 50.

And notice that if we had chosen to start at random with 1,

our first selection would've been 1, then 21, then 41, and so on.

Once we start, that sample is fixed.

It's systematic in that regard.

Always, when we start with 1, we get the same sample from this population.

But if our random start is 4, as we had talked about before: 4,

24, 44 and so on, and that set of selections is always the same.

Conceptually, what's actually been done here is the equivalent of

cluster sampling.

Every column which is a possible sample is a cluster.

It's a set of elements that always come into the sample together,

that's what a cluster is.

It's a school where all of the students come in together.

It's a block where all the housing units come in together.

Here it's a set of elements that always come in together because they've been

systematically selected.

And they're all the same size.

This is very interesting.

Here's a case where we have clusters of equal size, and there are 20 of them.

And by choosing one of them as a random starting point,

a random start from 1 to 20, we've chosen one of the clusters.

So this is equivalent to cluster sampling.

Each possible systematic sample is a cluster of lowercase n elements.
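That equivalence is easy to check: the k = 20 possible systematic samples, one per random start, partition the 1,000 elements into 20 clusters of 50 each, so choosing the random start chooses one cluster. A sketch in Python (the names are my own):

```python
N, n = 1000, 50
k = N // n  # 20 possible random starts, hence 20 possible samples

# Each possible start defines one "cluster": the sample that start would produce.
clusters = {start: [start + k * i for i in range(n)] for start in range(1, k + 1)}

# Every element of the population belongs to exactly one cluster,
# so choosing a random start is choosing one cluster at random.
all_selected = sorted(pos for members in clusters.values() for pos in members)
print(all_selected == list(range(1, N + 1)))  # True
```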

9:21

Well, that means there's a little more to this than we had first thought.

We thought it was just a simple counting procedure, but now we've scaled it to

the population size relative to the sample size, and we've added a random variation.

Let's talk a little bit more about some other features of systematic sampling in

our next lecture by turning to talk about those intervals.

Because sometimes, actually most of the time,

the interval will not be a whole number like that.

It won't be 20.

There may be some fractional part: 20.2,

20.57, 100 with some fraction, some decimal fraction.

What do we do with that when we do our sample selection?

And that will be our next lecture,

as we continue our discussion about systematic sampling.

Thank you.