AUTHOR=van Os Hendrikus J. A., Kanning Jos P., Wermer Marieke J. H., Chavannes Niels H., Numans Mattijs E., Ruigrok Ynte M., van Zwet Erik W., Putter Hein, Steyerberg Ewout W., Groenwold Rolf H. H.

TITLE=Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data: The Impact of Data Preparation Choices on Model Performance

JOURNAL=Frontiers in Epidemiology

VOLUME=Volume 2 - 2022

YEAR=2022

URL=https://www.frontiersin.org/journals/epidemiology/articles/10.3389/fepid.2022.871630

DOI=10.3389/fepid.2022.871630

ISSN=2674-1199

ABSTRACT=Objective
To quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR).
Study Design and Setting
Cox proportional hazards models were developed to predict first-ever main adverse cardiovascular events using Dutch primary care EHR data. The reference model used a one-year run-in period, defined cardiovascular events based on both EHR diagnosis and medication codes, and handled missing values by multiple imputation. We compared data preparation choices regarding i) length of the run-in period (two- or three-year run-in); ii) outcome definition (EHR diagnosis codes only or medication codes only); and iii) methods addressing missing values (mean imputation or complete case analysis). Each alternative was applied to the derivation set, and its impact on model performance was tested in a validation set.
Results
We included 89,491 patients in whom 6,736 first-ever main adverse cardiovascular events occurred during a median follow-up of eight years. Outcome definition based only on diagnosis codes led to systematic underestimation of risk (calibration curve intercept: 0.84; 95% CI: 0.83 to 0.84), while complete case analysis led to overestimation (calibration curve intercept: -0.52; 95% CI: -0.53 to -0.51). Differences in length of run-in period showed no relevant impact on calibration and discrimination.
Conclusion
Data preparation choices regarding outcome definition or methods to address missing values can have a substantial impact on the calibration of predictions, hampering reliable clinical decision support. This study further illustrates the urgency of transparent reporting of modelling choices in an EHR data setting.