AUTHOR=Preo Nicolo' , Capobianco Enrico TITLE=Significant EHR Feature-Driven T2D Inference: Predictive Machine Learning and Networks JOURNAL=Frontiers in Big Data VOLUME=2 YEAR=2019 URL=https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2019.00030 DOI=10.3389/fdata.2019.00030 ISSN=2624-909X ABSTRACT=

Background: Electronic health records (EHR) play an important role in the redefinition of phenotypes, given the wealth and heterogeneity of information now available from disparate data sources. A recent cross-sectional retrospective study described the potential of EHR for type 2 diabetes mellitus (T2D) screening when ad hoc models are used. About 10,000 US patients were analyzed through a variety of inference techniques applied to all records, which varied in their degree of completeness. The analyses conducted in the reference study indicated that EHR phenotypes significantly improved T2D detection.

Methods: Using the same US patients and the T2D data reported in the above study, we propose an integrative inference approach that leverages the predictive power of EHR features selected by two well-known methods, Random Forests and Lasso. The goal is twofold: to reduce the Big Data redundancies potentially harmful to the predictive learning task, and to exploit the interconnectivity of EHR features. A mutual information (MI) network serves as the inference tool for identifying communities that help prioritize the significant T2D features underlying the similarity between patients.
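To make the MI-network idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary features, a plug-in (count-based) MI estimate, a hypothetical edge threshold of 0.05 nats, and connected components as a crude stand-in for community detection; the abstract does not specify the estimator, the threshold, or the community algorithm actually used. The feature names and values are invented toy data with no clinical meaning.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Plug-in estimate of mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi


def mi_network(features, threshold):
    """Return {(f, g): MI} for feature pairs whose MI exceeds the threshold."""
    edges = {}
    for f, g in combinations(sorted(features), 2):
        mi = mutual_information(features[f], features[g])
        if mi > threshold:
            edges[(f, g)] = mi
    return edges


def connected_components(nodes, edges):
    """Group nodes linked by edges (assumed simplification of community detection)."""
    adjacency = {node: set() for node in nodes}
    for f, g in edges:
        adjacency[f].add(g)
        adjacency[g].add(f)
    seen, components = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy data: one 0/1 value per patient for each feature.
toy = {
    "hypertension": [1, 1, 1, 1, 0, 0, 0, 0],
    "obesity":      [1, 1, 1, 0, 0, 0, 0, 1],
    "smoker":       [1, 0, 1, 0, 1, 0, 1, 0],
    "asthma":       [0, 1, 1, 0, 1, 0, 1, 0],
}

edges = mi_network(toy, threshold=0.05)  # threshold is an illustrative assumption
communities = connected_components(toy.keys(), edges)
print(edges)
print(communities)
```

In this toy setting only two feature pairs share information above the threshold, so the network splits into two groups. On real EHR data, a dedicated community detection algorithm and a principled threshold or significance test would likely be preferable to connected components.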

Results: With differing degrees of granularity, the communities detected after the application of both methods centered mainly on T2D comorbidities and risk factors. As such, they appear highly relevant for assessing two main issues: T2D disease burden and prevention.

Conclusions: Our analytical approach offers a solution for managing the EHR scale factor in a complex disease context. EHR are rich sources of phenotypic diversity, from which novel stratifications of patients are expected. To enable these results, both pre-screening of variables and calibration of risk prediction methods become necessary steps in EHR analyses. We have presented networks identifying major T2D communities. The specific significance assigned to comorbidities and risk factors in relation to T2D can be inferred accurately from a suitably reduced number of EHR features.