Abstract
BACKGROUND: Leave-one-out cross-validation that fails to account for variable selection does not properly reflect prediction accuracy when the number of training sites is small. The impact on health effect estimates has rarely been studied.
METHODS: We randomly generated ten training and test sets for nitrogen dioxide and particulate matter. For each training set we developed models and evaluated them using a cross-holdout validation approach. Cross-holdout validation develops a new model, including variable selection, for each evaluation, whereas standard leave-one-out cross-validation only refits the model without repeating variable selection. We also implemented holdout validation, which evaluates model predictions using independent test sets. We evaluated the relationship between cross-holdout validation and holdout validation R² and estimates of the association between air pollution and forced vital capacity in the Dutch birth cohort.
RESULTS: Cross-holdout validation R²s were generally identical to holdout validation R²s, but were notably smaller than leave-one-out cross-validation R²s. Decreases in forced vital capacity in relation to air pollution exposure were larger for land-use regression models that had larger holdout validation and cross-holdout validation R²s rather than larger leave-one-out cross-validation R²s.
CONCLUSIONS: Cross-holdout validation accurately reflects predictive ability of land-use regression models and is a useful validation approach for small datasets. Land-use regression predictive ability in terms of holdout validation and cross-holdout validation rather than leave-one-out cross-validation was associated with the magnitude of health effect estimates in a case study.
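
The contrast drawn in METHODS, between refitting a fixed model and redoing variable selection for every evaluation, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names are hypothetical, variable selection is reduced to picking the single best-fitting predictor, each holdout is assumed to be one site, and the data are simulated noise rather than air pollution measurements.

```python
import random

def fit_simple(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(obs, pred):
    """R^2 computed as 1 - SSE/SST; can be negative for poor predictions."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / sst

def select_variable(X, y):
    """Assumed stand-in for variable selection: the column with the smallest training SSE."""
    best_sse, best_j = None, None
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        slope, intercept = fit_simple(col, y)
        sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(col, y))
        if best_sse is None or sse < best_sse:
            best_sse, best_j = sse, j
    return best_j

def loocv_fixed_selection(X, y):
    """Variable chosen once on all sites; only the coefficients are refit per fold."""
    j = select_variable(X, y)
    preds = []
    for i in range(len(y)):
        x_train = [row[j] for k, row in enumerate(X) if k != i]
        y_train = [v for k, v in enumerate(y) if k != i]
        slope, intercept = fit_simple(x_train, y_train)
        preds.append(intercept + slope * X[i][j])
    return r_squared(y, preds)

def loocv_reselect(X, y):
    """Variable selection repeated inside every fold, so the held-out site never influences it."""
    preds = []
    for i in range(len(y)):
        X_train = [row for k, row in enumerate(X) if k != i]
        y_train = [v for k, v in enumerate(y) if k != i]
        j = select_variable(X_train, y_train)
        slope, intercept = fit_simple([row[j] for row in X_train], y_train)
        preds.append(intercept + slope * X[i][j])
    return r_squared(y, preds)

if __name__ == "__main__":
    # Simulated example: 20 sites, 30 candidate predictors, outcome unrelated to all of them.
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(20)]
    y = [rng.gauss(0, 1) for _ in range(20)]
    print("fixed selection R^2:", loocv_fixed_selection(X, y))
    print("reselection R^2:   ", loocv_reselect(X, y))
```

With few sites and many candidate predictors, the fixed-selection variant tends to look more favorable because the held-out site has already influenced which predictor was chosen; repeating the selection within each fold removes that leakage.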
Original language | English |
---|---|
Pages (from-to) | 51-56 |
Number of pages | 6 |
Journal | Epidemiology |
Volume | 27 |
Issue number | 1 |
Publication status | Published - Jan 2016 |