ORIGINAL RESEARCH article
Front. Oncol.
Sec. Gastrointestinal Cancers: Colorectal Cancer
Volume 15 - 2025 | doi: 10.3389/fonc.2025.1508455
Early Prediction of Colorectal Adenoma Risk: Leveraging Large-Language Model for Clinical Electronic Medical Record Data
Provisionally accepted
- 1Peking University Third Hospital, Haidian, China
- 2Information Management and Big Data Center, Peking University Third Hospital, Beijing, China
- 3Department of Gastroenterology, Peking University Third Hospital, Beijing, China
- 4Goodwill Hessian Health Technology Co. Ltd, Beijing, China
Objective: To develop a non-invasive, radiation-free model for early colorectal adenoma prediction using clinical electronic medical record (EMR) data, addressing the limitations of current diagnostic approaches for large-scale screening.
Methods: This retrospective analysis used 92,681 cases with EMR data, spanning 2012 to 2022, as the training cohort. Testing was performed on an independent test cohort of 19,265 cases from 2023. Several classical machine learning algorithms were applied in combination with the BGE-M3 large language model (LLM) for enhanced semantic feature extraction. The area under the receiver operating characteristic curve (AUC) was the primary metric for evaluating model performance. The Shapley additive explanations (SHAP) method was employed to identify the most influential risk factors.
Results: The XGBoost algorithm, integrated with BGE-M3, achieved the best performance (AUC = 0.9847) in the validation cohort. When applied to the independent test cohort, XGBoost maintained its strong predictive ability, with an AUC of 0.9839 and an average advance prediction time of 6.88 hours, underscoring the effectiveness of the BGE-M3 model. The SHAP analysis further identified 16 high-impact risk factors, highlighting the interplay of genetic, lifestyle, and environmental influences on colorectal adenoma risk.
Conclusion: This study developed a robust machine learning-based model for colorectal adenoma risk prediction that leverages clinical EMR data and an LLM. The proposed model demonstrates high predictive accuracy and has the potential to enhance early detection, making it well-suited to large-scale screening programs. By facilitating early identification of individuals at risk, this approach may contribute to reducing the incidence and mortality associated with colorectal cancer.
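The abstract reports AUC as its main performance metric. As a minimal sketch of what that number measures, and not the authors' evaluation code, the snippet below computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the study.

```python
def roc_auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U): P(score of positive > score of negative).

    labels: iterable of 0/1 outcomes (1 = adenoma in this illustration).
    scores: iterable of model risk scores, higher meaning higher risk.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")

    # U statistic: rank sum of positives minus its minimum possible value.
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example with made-up values (not study data).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

Here three of the four positive-negative pairs are ordered correctly, giving 0.75. An AUC close to 1, such as the values reported above, means almost every such pair is ranked correctly.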
Funding: This study was supported by the Beijing Municipal Natural Science Foundation (Z210003).
Findings from this study support integration of predictive models into clinical practice, highlighting the potential for optimizing colorectal cancer management through targeted strategies. This could lead to the development of novel, tailored interventions for individuals at high risk.
Keywords: Adenoma, Colorectal adenoma, Large Language Model, Early prediction, Electronic Medical Record, Colorectal cancer, 18F-FDG PET/CT
Received: 09 Oct 2024; Accepted: 22 Apr 2025.
Copyright: © 2025 Yang, Xu, Ji, Li, Yang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Jinjian Xu, Peking University Third Hospital, Haidian, China
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.