Missing Data? Try Multiple Imputation
It is common for researchers to exclude participants with missing data — but what are some ways to keep the participants and analyse the data in unbiased ways?
Multiple imputation is a Monte Carlo technique that replaces each missing value with several plausible simulated values rather than a single one. The procedure runs as follows:
1. you have missing data in a data set
2. the missing values are simulated several times, which gives several completed data sets
3. analyze data set version #1 as if it were a complete data set
4. repeat step 3 with the other data set versions (e.g., #2, #3, …, #N). [for low rates of missing data, only 3-10 simulated data sets are needed.]
5. combine the results (point estimates are averaged; standard errors combine the within- and between-data-set variability) to produce a single set of point estimates and confidence intervals (or p-values) that incorporate missing-data uncertainty.
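Step 5 is usually done with Rubin's rules: the pooled point estimate is the average of the per-data-set estimates, and the pooled variance is the average within-data-set variance plus the between-data-set variance inflated by (1 + 1/m). Here is a minimal sketch, assuming each analysis has already produced one estimate and its squared standard error. The function name `pool_rubin` and the example numbers are made up for illustration.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m per-data-set estimates and their variances with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    return q_bar, math.sqrt(total)

# Made-up coefficients and squared standard errors from m = 3 imputed data sets
pooled_estimate, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled_estimate, round(pooled_se, 3))
```

The pooled standard error comes out larger than the square root of the average variance alone (about 0.707 here); that gap is the missing-data uncertainty the last step is meant to carry through.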
How do I generate imputations for the missing values?
The imputation model:
Impose a probability model on the complete data (observed and missing values).
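To make this concrete, here is a toy sketch with one fully observed variable `x` and one partly missing variable `y`. A normal linear regression stands in for the probability model, and each imputation refits it on a bootstrap resample so that the parameters vary between imputed data sets. The data, the names, and the bootstrap step are all assumptions for illustration; this is not a description of how Amelia II or any other package works internally.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

def impute_once(xs, ys, rng):
    """Return a completed copy of ys, where None marks a missing value."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Refit on a bootstrap resample so the parameters differ between imputations
    boot = [rng.choice(observed) for _ in observed]
    a, b, sd = fit_line([x for x, _ in boot], [y for _, y in boot])
    # Fill each gap with a prediction plus random noise; keep observed values as they are
    return [y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)]

# Simulated example data: every fifth y value is removed
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]
ys_with_gaps = [None if i % 5 == 0 else y for i, y in enumerate(ys)]

m = 5  # number of imputed data sets
completed_sets = [impute_once(xs, ys_with_gaps, rng) for _ in range(m)]
```

Each of the `m` completed lists keeps the observed values untouched and differs only in the filled-in entries, which is what lets the later analyses reflect how uncertain those entries are.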
Key points on what to include in the imputation model: [~ 30 min into the Amelia I video (link below)]
1. Include all the variables that you want to include in your analysis model
e.g., age, education, ideology, income: every variable you will need later at the analysis stage has to be included
2. Include variables that are highly predictive of the variables you are going to analyse
e.g., in a voter turnout analysis that uses ideology, include views on homelessness or abortion, since they predict the variable you're interested in
3. Include variables that are highly predictive of the missingness of your data
e.g., if income predicts missingness, throw it in the model as well
“…Because you are throwing in a lot more variables in the imputation model than the variables you would be looking at in the analysis stage and that’s OK!”
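One rough way to spot a variable that predicts missingness (point 3) is to compare it between rows where the variable of interest is missing and rows where it is observed. The sketch below uses invented survey values and hypothetical names. A large gap is only a hint that the variable belongs in the imputation model, not a formal test.

```python
import statistics

# Invented data: income (in thousands) is fully observed, turnout has gaps (None)
incomes = [18, 22, 25, 31, 40, 52, 58, 63, 75, 90]
turnout = [None, None, 0, None, 1, 1, 0, 1, 1, 1]

# Split income by whether turnout is missing or observed on the same row
income_when_missing = [inc for inc, t in zip(incomes, turnout) if t is None]
income_when_observed = [inc for inc, t in zip(incomes, turnout) if t is not None]

# A clear difference in means suggests income is related to missingness
gap = statistics.mean(income_when_observed) - statistics.mean(income_when_missing)
print(round(gap, 1))
```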
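*Note: this method assumes the data are MAR (Missing At Random), i.e., whether a value is missing may depend on the observed data but not on the unobserved values themselves.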
– “imputation model should be compatible with the analyses to be performed on the imputed datasets… In general, any association that may be important in subsequent analyses should be in the imputation model” … that means include those relevant variables in the imputation model.
– On the other hand, you don’t necessarily have to examine those variables in your final analyses (unless they are of interest with respect to the outcome).
“When working with binary or ordered categorical variables, it’s often acceptable to impute under a normality assumption and then round off the continuous imputed values to the nearest category. Variables whose distributions are heavily skewed may be transformed (e.g., logs) to approximate normality and then transformed back to their original scale after imputation.” – the multiple imputation FAQ page
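The rounding and back-transformation described in the quote can be sketched as follows. All values are hypothetical stand-ins for draws from an imputation model, and `nearest_category` is a helper written for this example, not a function from any package.

```python
import math

def nearest_category(value, categories):
    """Snap a continuous imputed value to the closest allowed category code."""
    return min(categories, key=lambda c: abs(c - value))

# Hypothetical continuous draws for a binary variable coded 0/1
binary_draws = [-0.2, 0.4, 0.61, 1.3]
binary_imputed = [nearest_category(v, [0, 1]) for v in binary_draws]

# Hypothetical continuous draws for an ordinal variable coded 1 to 5
ordinal_draws = [0.3, 2.6, 4.49, 5.8]
ordinal_imputed = [nearest_category(v, [1, 2, 3, 4, 5]) for v in ordinal_draws]

# Skewed variable: work on the log scale, impute there, then transform back
incomes = [20000.0, 35000.0, None, 120000.0]
log_incomes = [None if v is None else math.log(v) for v in incomes]
imputed_log_income = 10.8  # stand-in for a value drawn by the imputation model
completed_incomes = [math.exp(imputed_log_income) if v is None else math.exp(v)
                     for v in log_incomes]
```

Snapping to the nearest category also pulls out-of-range draws (such as -0.2 or 5.8) back to a valid code.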
– if you have ordinal variables – code them as close to an interval scale as possible
– include any non-linear terms from your analysis model, e.g., if voter turnout is modelled with age and age-squared, include age-squared in your imputation model
– if you will look at any interactions, throw those terms in your imputation model as well.
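One possible way to get squared and interaction terms into the imputation model is to compute them as ordinary columns before imputing, so they are treated like any other variable. The sketch below assumes records stored as dictionaries with `None` for missing values. The field names and data are invented for illustration.

```python
def add_derived_terms(row):
    """Add age-squared and an age-by-education interaction to one record."""
    out = dict(row)
    age, edu = row["age"], row["education"]
    # A derived term stays missing when any of its components is missing,
    # so it is imputed along with the other variables
    out["age_sq"] = None if age is None else age ** 2
    out["age_x_education"] = None if age is None or edu is None else age * edu
    return out

rows = [
    {"age": 25, "education": 12, "turnout": 1},
    {"age": 40, "education": None, "turnout": 0},
    {"age": None, "education": 16, "turnout": None},
]
imputation_input = [add_derived_terms(r) for r in rows]
```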
Software for Multiple Imputation:
– R (Amelia II, mi, etc.)
– Stata (mi, ice, mim, etc.)
There are many software packages for multiple imputation, but if you use R, you can check out Amelia II
– the user guide is very clear and helpful, and you can run through its example dataset and code.
video – Innovation in Amelia I (explanation of multiple imputation in general): http://vimeo.com/18534025
the multiple imputation FAQ page: http://sites.stat.psu.edu/~jls/mifaq.html