The final, formatted version of the article will be published soon.
ORIGINAL RESEARCH article
Front. Genet.
Sec. Computational Genomics
Volume 16 - 2025 | doi: 10.3389/fgene.2025.1451290
This article is part of the Research Topic Critical Assessment of Massive Data Analysis (CAMDA) Annual Conference 2023
Predicting Diabetes Complications from Electronic Health Record Visits Using
Provisionally accepted
- 1 Al-Quds University, Jerusalem, Palestine
- 2 Zefat Academic College, Safed, Israel
- 3 Abdullah Gül University, Kayseri, Türkiye
Diabetes affects millions of people worldwide and causes substantial morbidity, disability, and mortality. Predicting diabetes-related complications from health records is crucial for early prevention and for developing effective treatment plans. This study introduces a novel feature engineering approach for predicting four complications of diabetes mellitus: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. To build the classification models, we use the XGBoost feature selection method together with several supervised machine learning algorithms: Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree. The models were trained on synthetic electronic health records (EHRs) generated by dual-adversarial autoencoders. These EHRs represent nearly 1 million synthetic patients derived from an authentic cohort of 979,308 individuals with diabetes. The model inputs were each patient's age range and the chronic diseases recorded at visits from the onset of diabetes onward. Across the experiments, XGBoost and Random Forest showed the best overall prediction performance. The final models, each tailored to one complication and trained with our feature engineering approach, achieved an accuracy between 69% and 77% and an AUC between 77% and 84% under cross-validation, while the partitioned validation approach yielded an accuracy between 59% and 78% and an AUC between 66% and 85%. These results indicate that our method outperforms the traditional Bag-of-Features approach, highlighting its effectiveness in improving model accuracy and robustness.
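The abstract names a Bag-of-Features baseline but does not say how it is encoded. As a rough illustration only, the sketch below shows one common way such a baseline could turn a patient's visits into a count vector. The visit records, disease labels, and the `age:`/`dx:` feature prefixes are all hypothetical, and this is not the authors' feature engineering approach, which the abstract does not describe in detail.

```python
from collections import Counter

# Hypothetical visit records for one patient: (age_range, [chronic diseases]).
# These values are invented for illustration and are not from the study data.
visits = [
    ("50-54", ["hypertension", "obesity"]),
    ("55-59", ["hypertension", "dyslipidemia"]),
    ("55-59", ["hypertension"]),
]

def bag_of_features(visits):
    """Count occurrences of each age range and disease, ignoring visit order."""
    counts = Counter()
    for age_range, diseases in visits:
        counts[f"age:{age_range}"] += 1
        for disease in diseases:
            counts[f"dx:{disease}"] += 1
    return counts

def to_vector(counts, vocabulary):
    """Project the counts onto a fixed, ordered vocabulary of feature names."""
    return [counts.get(name, 0) for name in vocabulary]

patient = bag_of_features(visits)
vocabulary = sorted(patient)  # in practice, built from all patients
vector = to_vector(patient, vocabulary)
```

A vector like this discards the order and timing of visits, which is the kind of information a visit-aware feature engineering scheme could try to retain.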
Keywords: diabetes mellitus, diabetes complications, machine learning, Random Forest, XGBoost, LogitBoost, AdaBoost, Decision Tree
Received: 18 Jun 2024; Accepted: 31 Jan 2025.
Copyright: © 2025 Voskergian, Yousef and Bakir-Gungor. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Daniel Voskergian, Al-Quds University, Jerusalem, Palestine
Malik Yousef, Zefat Academic College, Safed, Israel
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.