Abstract
Multivariate imputation using chained equations (MICE) is a popular algorithm for imputing missing data that entails
specifying multivariate models through conditional distributions. For imputing missing continuous variables, two common
imputation methods are the use of parametric imputation using a linear model and predictive mean matching. When
imputing missing binary variables, the default approach is parametric imputation using a logistic regression model. In the
R implementation of MICE, the use of predictive mean matching can be substantially faster than using logistic regression
as the imputation model for missing binary variables. However, there is a paucity of research into the statistical performance of predictive mean matching for imputing missing binary variables. Our objective was to compare the statistical
performance of predictive mean matching with that of logistic regression for imputing missing binary variables. We used Monte
Carlo simulations to make this comparison for imputing missing binary outcomes when the analysis model of scientific interest
was a multivariable logistic regression model. We varied the size of the analysis samples (N = 250, 500, 1,000, 5,000, and 10,000) and the prevalence
of missing data (5%–50% in increments of 5%). In general, the statistical performance of predictive mean matching was
virtually identical to that of logistic regression for imputing missing binary variables when the analysis model was a logistic
regression model. This was true across a wide range of scenarios defined by sample size and the prevalence of missing
data. In conclusion, predictive mean matching can be used to impute missing binary variables. The use of predictive mean
matching to impute missing binary variables can result in a substantial reduction in computer processing time when
conducting simulations of multiple imputation.
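
To make the idea of predictive mean matching for a binary variable concrete, the following is a minimal single-imputation sketch in Python with NumPy. It is not the R implementation of MICE and not the code used in the study. It assumes a least-squares linear predictor fitted to the 0/1 variable and omits the random draw of regression coefficients that multiple-imputation software typically adds, and the names `pmm_impute_binary` and `n_donors` are hypothetical. Because each imputed value is copied from an observed donor, the result is always a valid 0 or 1.

```python
import numpy as np


def pmm_impute_binary(X, y, n_donors=5, rng=None):
    """Fill NaN entries of a 0/1 vector y by predictive mean matching.

    Simplified sketch: one least-squares fit, no parameter draw.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    missing = np.isnan(y)
    observed = ~missing

    # Linear predictor fitted on the cases where y is observed.
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ beta

    donor_pred = predicted[observed]
    donor_vals = y[observed]
    k = min(n_donors, donor_vals.size)

    imputed = y.copy()
    for i in np.flatnonzero(missing):
        # The k observed cases whose predicted means are closest to case i.
        distances = np.abs(donor_pred - predicted[i])
        nearest = np.argpartition(distances, k - 1)[:k]
        # Copy the observed value of one randomly chosen donor.
        imputed[i] = donor_vals[rng.choice(nearest)]
    return imputed


# Illustrative data only: a binary variable with roughly 20% of values set to missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
p = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, p).astype(float)
y_missing = y.copy()
y_missing[rng.random(500) < 0.2] = np.nan

completed = pmm_impute_binary(X, y_missing, n_donors=5, rng=1)
```

For multiple imputation, a step like this would be repeated to create several completed data sets, each with freshly drawn model parameters, and the analysis model would then be fitted to every completed data set and the estimates pooled.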
Original language | English |
---|---|
Pages (from-to) | 2172-2183 |
Journal | Statistical Methods in Medical Research |
Volume | 32 |
Issue number | 11 |
DOIs | |
Publication status | Published - Nov 2023 |
Keywords
- Missing data
- Multiple imputation
- Monte Carlo simulations