0:06

Now that we've heard about some reasons for imputing, let's look at particular methods. I'll talk about means and hot deck in particular. But first, let's look at a list of all the possibilities that we'll cover in this course.

The first one here is imputation based on logical rules. Now, that is not normally what you'd think of as an imputation; it's more like an edit check. The idea there is that if, say, at one point in a questionnaire you get somebody's date of birth, and later on age is missing, then you can do a little calculation and fill in what age is. So that's a sort of fill-in imputation.
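The date-of-birth example can be sketched in a few lines. This is a minimal illustration: the field names (`dob`, `age`) and the reference date are assumptions, not from any particular survey.

```python
from datetime import date

def impute_age(record, as_of=date(2024, 1, 1)):
    # Logical-rule "imputation": if age is missing but date of birth is
    # reported, derive age deterministically. Field names ("dob", "age")
    # and the reference date are illustrative.
    if record["age"] is None and record["dob"] is not None:
        dob = record["dob"]
        # Subtract one year if the birthday hasn't occurred yet this year.
        record["age"] = (as_of.year - dob.year
                         - ((as_of.month, as_of.day) < (dob.month, dob.day)))
    return record

r = impute_age({"dob": date(1990, 6, 15), "age": None})
```

Because the rule is deterministic, there is no imputation error to worry about; the filled-in value is simply what a consistent respondent would have reported.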

Â 0:57

Another choice that we've got would be filling in a mean, usually within cells defined on characteristics that you need to know about for all the units in your sample. And you may or may not add a random error to that.

The third choice is what's called cold deck. The idea there is that this works if you've got a continuing survey with a previous edition of the survey. If a unit responded in the previous version, then you go back, get that value, or possibly an indexed-forward version of it, and fill it in for the current missing value. It's called cold because you're referring back to a dataset that is already in your hands; it's not the current one that you're dealing with.
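A cold-deck fill can be sketched as a lookup into the previous wave, with an optional index factor to carry the old value forward. The dictionaries keyed by unit ID are an assumption for illustration; a real survey would match on its own unit identifier.

```python
def cold_deck(current, previous, index_factor=1.0):
    # Cold-deck imputation: fill missing values (None) in the current wave
    # from the previous wave, optionally indexed forward (e.g. for inflation).
    filled = {}
    for unit, value in current.items():
        if value is None and previous.get(unit) is not None:
            filled[unit] = previous[unit] * index_factor
        else:
            filled[unit] = value
    return filled

curr = {"A": 10.0, "B": None, "C": None}
prev = {"A": 9.0, "B": 20.0}
out = cold_deck(curr, prev, index_factor=1.05)
# B is filled from the previous wave, indexed forward; C has no prior value
# and stays missing.
```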

Now, in contrast to that is something called hot deck, and what that amounts to is you look at your current dataset. If you've got a missing value, you find a similar case that has complete data for the variable, and you just grab that value and fill it in. That's usually done within cells, also.

A fifth possibility is regression prediction. Based on covariates that are available for all units, whether they are missing or non-missing on the analysis variable, you generate a regression prediction, and you may or may not add a random error to that.

Now, somewhat related to that is a method called predictive mean matching. You find the unit whose observed value is closest to the value predicted by regression for your missing case. So you've got a missing case, and you make a regression prediction based on some covariates, which means you've got to fit that regression from the complete cases. Then you look at that prediction, find a complete case that actually has reported data close to that prediction, and fill in that value. So it's got the virtue of taking advantage of any sort of regression relationship between the covariates and your analysis variable.
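The matching step can be sketched as follows, again with a single covariate as a simplifying assumption: fit on the complete cases, predict for everyone, and donate the observed value of the nearest-prediction respondent.

```python
from statistics import mean

def pmm_impute(x, y):
    # Predictive mean matching: fit y ~ x on the complete cases, predict
    # for all units, and fill each missing y with the *observed* y of the
    # complete case whose prediction is closest to the missing unit's.
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xbar = mean(xi for xi, _ in obs)
    ybar = mean(yi for _, yi in obs)
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in obs)
             / sum((xi - xbar) ** 2 for xi, _ in obs))
    intercept = ybar - slope * xbar
    pred = lambda xi: intercept + slope * xi
    out = []
    for xi, yi in zip(x, y):
        if yi is not None:
            out.append(yi)
        else:
            # Donor = complete case whose prediction is nearest to ours.
            donor = min(obs, key=lambda d: abs(pred(d[0]) - pred(xi)))
            out.append(donor[1])
    return out
```

Because the donated value is one that a respondent actually reported, predictive mean matching shares hot deck's property of never imputing an impossible value.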

Â 3:27

Now, each one of methods 2 through 6 can be done sequentially. You find the item with the fewest missing values and fill all those in. Then you go to the item with the next fewest missing, use your complete data plus the imputations you just made, and impute the missing values for this new variable, and you keep going in a sequential method. A variation on that, which we'll get to later, is called imputation through chained equations, and we'll look at some software that will do that for you.
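The sequential loop itself can be sketched generically. The data layout (a dict of columns) and the single-variable imputer interface are assumptions for illustration; any of the methods above could play the role of `impute_one`.

```python
from statistics import mean

def sequential_impute(data, impute_one):
    # Sequential imputation: visit variables from fewest to most missing
    # values; each step sees the original data plus imputations already made.
    # `data` maps variable name -> list of values (None = missing), and
    # `impute_one(column, data)` is any single-variable imputation routine.
    order = sorted(data, key=lambda v: sum(val is None for val in data[v]))
    for var in order:
        data[var] = impute_one(data[var], data)
    return data

def mean_fill(col, _all):  # simplest possible single-variable imputer
    m = mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

done = sequential_impute({"a": [1.0, None, 3.0], "b": [None, 2.0, None]}, mean_fill)
```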

Â 4:08

Now, looking specifically at mean imputation, one of the troubles with it is that if you've got a lot of missing values and you keep repeatedly filling in the mean, you're going to introduce a kind of spike in the distribution: a lot of values at that one mean value. One way to work around that is to add a random error to the mean, which would help reduce the distortion of repeatedly imputing the same value. So what's the error? The error could be normal with mean 0 and a variance equal to the observed element variance of the nonmissing values. But you don't have to use the normal; there's no law that says that's the way data are distributed, so you could certainly use distributions other than the normal. If you looked at your complete data and found that they were distributed like a gamma or some sort of chi-square distribution, then certainly that could be used.
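Mean imputation with a normal random error can be sketched as below; the normal draw is one choice, as noted, and the seed is just for reproducibility.

```python
import random
from statistics import mean, stdev

def mean_impute_with_error(values, seed=1):
    # Fill each missing value (None) with the observed mean plus a normal
    # error whose SD is the observed element standard deviation, so the
    # imputations don't all pile up in a spike at the mean.
    rng = random.Random(seed)
    obs = [v for v in values if v is not None]
    m, s = mean(obs), stdev(obs)
    return [v if v is not None else m + rng.gauss(0, s) for v in values]

filled = mean_impute_with_error([4.0, 6.0, 5.0, None, None])
```

Swapping `rng.gauss` for `rng.gammavariate` (with parameters matched to the complete data) would give the gamma variant mentioned above.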

Â 5:16

The cells or subgroups that you form for mean imputation are a way of accounting for the possibility that the value depends on covariates. So in that case, you're really implicitly using a model. And the regression model that is implicit in this is a kind of ANOVA (analysis of variance) model, where all the covariates are categorical; those are the ones used to form the cells. And you're imputing a mean based on those categorical covariates.

Â 5:55

Now, hot deck imputation is somewhat different. Usually, you put things into cells. For example, if you've got a business survey, you might use type of business x size; if you're doing a survey of persons, you might use age x gender. Within each of those cells, if a unit is missing a value, what you do is find the non-missing units, draw one of those at random, and fill in its value as the imputed value. So that's got the advantage of not imputing impossible values, because you're using observed values. It does have the implicit assumption that all units in a cell have a common mean, so it's got an ANOVA-model type of assumption underlying it. That's the case where it makes the most sense to use a kind of hot deck imputation.
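The within-cell donor draw can be sketched as follows. The record layout and field names (`age`, `sex`, `income`) are illustrative assumptions, and the seed is only for reproducibility.

```python
import random
from collections import defaultdict

def hot_deck(records, cell_keys, var, seed=7):
    # Hot-deck imputation within cells: group respondents by covariates
    # (e.g. age x gender), then fill each missing value with the value of
    # a donor drawn at random from the same cell's respondents.
    rng = random.Random(seed)
    donors = defaultdict(list)
    for r in records:
        if r[var] is not None:
            donors[tuple(r[k] for k in cell_keys)].append(r[var])
    for r in records:
        if r[var] is None:
            cell = donors[tuple(r[k] for k in cell_keys)]
            if cell:  # leave the value missing if the cell has no donors
                r[var] = rng.choice(cell)
    return records

recs = [{"age": "30-39", "sex": "F", "income": 50.0},
        {"age": "30-39", "sex": "F", "income": None},
        {"age": "40-49", "sex": "M", "income": 70.0}]
hot_deck(recs, ("age", "sex"), "income")
```

Note that every imputed value is one actually observed in the same cell, which is exactly the "no impossible values" property described above.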
