Abstract
We examined the setting in which a variable that is subject to missingness is used
both as an inclusion/exclusion criterion for creating the analytic sample and
subsequently as the primary exposure in the analysis model that is of scientific
interest. An example is cancer stage, where patients with stage IV cancer are
often excluded from the analytic sample, and cancer stage (I to III) is an exposure
variable in the analysis model. We considered two analytic strategies. The first strategy, referred to as “exclude-then-impute,” excludes
subjects for whom the observed value of the target variable is equal
to the specified value and then uses multiple imputation to complete the data in the resultant sample. The second strategy, referred to
as “impute-then-exclude,” first uses multiple imputation to complete
the data and then excludes subjects based on the observed or filled-in values
in the completed samples. Monte Carlo simulations were used to compare
five methods (one based on “exclude-then-impute” and four based on
“impute-then-exclude”) along with a complete-case analysis. We
considered two missing-data mechanisms: missing completely at random and missing at random. We found that an impute-then-exclude strategy using
substantive model compatible fully conditional specification tended to have
superior performance across 72 different scenarios. We illustrated the
application of these methods using empirical data on patients hospitalized
with heart failure when heart failure subtype was used for cohort creation
(excluding subjects with heart failure with preserved ejection fraction) and was
also an exposure in the analysis model.
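The difference between the two strategies can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names, the toy `records` data, and the `stage` field are hypothetical, and the imputer is a simple random hot-deck draw from observed values, standing in for the multiple imputation methods compared in the paper (it is not substantive model compatible fully conditional specification, and no pooling of estimates is shown).

```python
import random


def hot_deck_impute(records, key, rng):
    """Toy imputer (an assumption): fill missing `key` values by drawing from observed ones."""
    donors = [r[key] for r in records if r[key] is not None]
    return [
        dict(r, **{key: r[key] if r[key] is not None else rng.choice(donors)})
        for r in records
    ]


def exclude_then_impute(records, key, excluded, m, rng):
    """Drop subjects whose OBSERVED value equals `excluded`, then impute m times."""
    # Subjects with a missing value are retained; None != excluded is True.
    kept = [r for r in records if r[key] != excluded]
    return [hot_deck_impute(kept, key, rng) for _ in range(m)]


def impute_then_exclude(records, key, excluded, m, rng):
    """Impute m times in the full sample, then drop subjects whose observed
    or filled-in value equals `excluded` within each completed dataset."""
    completed = [hot_deck_impute(records, key, rng) for _ in range(m)]
    return [[r for r in data if r[key] != excluded] for data in completed]


# Hypothetical toy data: cancer stage 1 to 4, with None marking a missing value.
rng = random.Random(0)
stages = [1, 2, 3, 4, None, 2, None, 4, 1, None]
records = [{"id": i, "stage": s} for i, s in enumerate(stages)]

eti = exclude_then_impute(records, "stage", excluded=4, m=5, rng=rng)
ite = impute_then_exclude(records, "stage", excluded=4, m=5, rng=rng)

print("exclude-then-impute sample sizes:", [len(d) for d in eti])
print("impute-then-exclude sample sizes:", [len(d) for d in ite])
```

Under exclude-then-impute, every completed dataset contains the same subjects, and a missing stage can never be filled in as the excluded value because no donors with that value remain. Under impute-then-exclude, a missing stage may be imputed as the excluded value, so the analytic sample can differ in size from one completed dataset to the next.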
Original language | English |
---|---|
Pages (from-to) | 1525-1541 |
Journal | Statistics in Medicine |
Volume | 42 |
Issue number | 10 |
DOIs | |
Publication status | Published - 10 May 2023 |
Keywords
- missing data
- Monte Carlo simulations
- multiple imputation