0:15

So the package I'm going to use is the R survey package, which has a lot of capabilities. This is the one that's written by Thomas Lumley at the University of Auckland in New Zealand. And there's a data set in there called the academic performance index, api, which I'll use. So I require the survey package, and then I tell R that this is the data I want by saying data(api).

0:45

And then you define a design object. With any of the software we're going to use to handle survey data, you have to tell it what the design features are. I'll talk more about this in course six, the meaning of the svydesign function and how you do it in other packages, but I'll sketch it here. The first thing you need to tell R is what the first-stage units are. The parameter in R is id. In this case, I'm saying that dnum is the psu, or cluster, definition. And dnum is short for district number.
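As a sketch, that call looks like this; pw and fpc are the weight and population-count variables in the apiclus1 data frame that ships with the survey package:

```r
# Load the survey package and the api data sets
library(survey)
data(api)

# One-stage cluster sample of school districts: dnum identifies the
# PSUs (clusters), pw holds the sampling weights, fpc the number of
# districts in the population
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
```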

2:32

And notice that the first-stage unit is given as a formula, with a tilde in front. The weights and the fpc are given as formulas too; the data set itself is the one argument you specify without a tilde. Now, the next thing you have to do is specify the totals for the population auxiliary variables that I'm going to use.

So what I've done here is I created a data frame using this data.frame statement. The first column is going to be school type. The labels for that are E, H, and M, which stand for elementary, high school, and middle. These are just different grade ranges that are used in the US. And then I give the counts of schools, 4421, 755, and 1018, in those three school types.
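A sketch of that data frame; stype is the school-type variable in the api data, and Freq is the column name the survey package expects for the population counts:

```r
# Population counts of schools by school type:
# E = elementary, H = high school, M = middle
pop.types <- data.frame(stype = c("E", "H", "M"),
                        Freq  = c(4421, 755, 1018))
```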

Now, how do I post-stratify? I just invoke this command, postStratify.

3:38

Notice here that that's a capital S. R is case sensitive, so if you used a lower-case s there, it would bark at you with a could-not-find-function error. You've got to be careful to see exactly how your function name is spelled. So I'm operating on this design object, dclus1, I'm post-stratifying by this index, stype, and notice that's a formula again, with a tilde in front. And then I give it the control totals, pop.types, which I defined back in the previous line. And that's all there is to it. It goes through, it creates these post-stratified weights, and that information is saved into this new object called dclus1p, p for post-stratification.
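Putting the pieces together, a minimal sketch (the design-object setup repeats the earlier steps so the snippet runs on its own):

```r
library(survey)
data(api)

# Original one-stage cluster design
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Control totals for the three school types
pop.types <- data.frame(stype = c("E", "H", "M"),
                        Freq  = c(4421, 755, 1018))

# Post-stratify on school type -- note the capital S in postStratify
dclus1p <- postStratify(dclus1, ~stype, pop.types)
```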

So just take a look at the weights that came out of this. What I've done is I've used rbind to stack a summary of the weights for the non-post-stratified design object, dclus1, on top of the weights for the post-stratified dclus1p object. This weights function right here is an extractor kind of function: it'll pull the weights out of a design object and show them to you.
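As a sketch (again setting up the two design objects first so the snippet stands alone):

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
dclus1p <- postStratify(dclus1, ~stype, pop.types)

# weights() extracts the case weights from a design object;
# stack the two summaries for a side-by-side comparison
rbind(summary(weights(dclus1)), summary(weights(dclus1p)))
```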

5:11

So what have I got? If I look at the first row here, that's the non-post-stratified object. You see all the weights are the same, so I've got an equal-probability sample. In the second row, those are the post-stratified weights, and you can see those are spread out from 30.7 up to 53.93. And why is that? It's because the sample itself is not proportionally allocated among these school types. So post-stratifying, in a sense, corrects that, and we hope that that will reduce variances. If I had coverage errors, we also hope that it will reduce those.

5:57

So let's look at a couple of results just to see how the point estimates can change. The first thing that I call for here is the mean of the variable called enrollment, and svymean is the function that I use to do that. Enroll is the variable, and it has a tilde there again; it has to be specified as a formula. And here's the design object, dclus1. So this is the before-post-stratification version. I get a point estimate of 549.72 students enrolled per school, standard error 45.19. If I do the same thing on the post-stratified object, you can see my point estimate of the mean changed some. I'm up to 594.27, and the standard error got bigger too.
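A sketch of those two calls (setup repeated so the snippet runs on its own):

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
dclus1p <- postStratify(dclus1, ~stype, pop.types)

# Mean enrollment per school, before and after post-stratification
svymean(~enroll, dclus1)
svymean(~enroll, dclus1p)
```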

7:00

Now, let's take a look at the total. So note this: things can go in either direction. If I compare the standard error before and after post-stratification, post-stratifying actually made things worse for the mean in terms of standard error. The mean got bigger, so the coefficient of variation could actually come out smaller, but in this case it doesn't. There's no guarantee that you're going to improve estimates of the mean with post-stratification, or of the total, although you may. So let's look at the total, and we can do that with svytotal on enroll again, same variable. So here's the answer: I've got a total of about 3.4 million, standard error of 932 thousand and some. And if I use the post-stratified version, dclus1p, then what I get is this line right here. And so you see, the total changed a bit, not tremendously, but the standard error did change quite a lot. If I compare these two values: before post-stratification, I'm at 932,000; after post-stratification, I go down to 406,000. So I cut the standard error by over 50% just by post-stratifying, and that's on the estimated total.
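And a sketch of the total, again with the setup repeated so it stands alone:

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
dclus1p <- postStratify(dclus1, ~stype, pop.types)

# Total enrollment, before and after post-stratification
svytotal(~enroll, dclus1)
svytotal(~enroll, dclus1p)
```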

Now, I can look at cvs, and the function that will do that is called, simply, cv in the survey package. So here I just collect together the coefficient of variation for the mean of enrollment, first from the non-post-stratified dclus1 object and then from the post-stratified object. You see right here, I go from a cv of about 0.082 to 0.110, so either in terms of standard error or cv, I made things worse by post-stratifying here. If I look at totals, on the other hand, and compare the post-stratified and the non-post-stratified objects, I go from 0.2737 or so down to 0.1103. In other words, I gained quite a lot in terms of cv and standard error by post-stratifying.
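A sketch of those comparisons, using the cv extractor in the survey package (setup repeated):

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
dclus1p <- postStratify(dclus1, ~stype, pop.types)

# Coefficients of variation (SE / estimate) for the mean and the total
cv(svymean(~enroll, dclus1))
cv(svymean(~enroll, dclus1p))
cv(svytotal(~enroll, dclus1))
cv(svytotal(~enroll, dclus1p))
```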

So notice also that these two are the same, the cv on the mean and the cv on the total, after post-stratification. Now, why is that?

10:06

It's because, when I divide by the sum of the weights to get the mean, I force the sum of the post-stratified weights to equal the population count. Now, that's going to be true in every sample, so there's no extra variation; after I post-stratify, that estimated population count is like a constant. And what that leads to is that the standard error, relative to what you're estimating, is exactly the same for the mean and the total.

10:47

Now, how do you decide whether this post-stratifying is a good idea or not? There are different ways of doing it, but one way to think about it is that every estimator has an implied model behind it. And by model, I mean a structural model that relates y to whatever covariates you're using in your estimator. In the post-stratification case, it's really simple: a common mean in every poststratum, call it beta sub gamma, and a common variance for every element in a given poststratum, call it sigma squared sub gamma.
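In symbols, the implied working model can be sketched as:

```latex
% Common mean and common variance within poststratum gamma:
E_M(y_k) = \beta_\gamma , \qquad
\mathrm{Var}_M(y_k) = \sigma^2_\gamma
\quad \text{for every unit } k \text{ in poststratum } \gamma .
```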

11:35

Think of this as the way I would predict an additional school within poststratum gamma: I take the mean of what I saw and predict that the next value would be equal to that mean. If the common mean is a good predictor, then you get variance reduction; if it's not, you don't, and that's what this bullet says. The one nice thing about the post-stratified estimator is that it'll be approximately design-unbiased, meaning in repeated sampling, if you do it over and over again, you'll average out to the right thing even if the model is wrong.

13:07

So I cross age by gender, and that creates a number of post strata, the number of age groups times two. But suppose that you had two other variables that you should have considered, race-ethnicity and income level, because those are good predictors of the y's that you're analyzing from your data. Then you'll have the wrong model, and post-stratification will be less efficient than it could be. How do you take up the slack for that, or improve your estimator? You could think about using raking, where you include race-ethnicity and income level as margins to rake to. Or you could use GREG, which would accommodate a quantitative income variable along with the categorical age, gender, and race-ethnicity variables. So you can make post-stratification fairly flexible, but if you do want to include both qualitative and quantitative variables, then your best choice may be this thing, GREG.
