The final, formatted version of the article will be published soon.
ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Natural Language Processing
Volume 8 - 2025 |
doi: 10.3389/frai.2025.1513674
Targeted Generative Data Augmentation for Automatic Metastases Detection from Free-text Radiology Reports
Provisionally accepted- 1 Queen's University, Kingston, Canada
- 2 Memorial Sloan Kettering Cancer Center, New York, New York, United States
Automatic identification of metastatic sites in cancer patients from electronic health records is a challenging yet crucial task with significant implications for diagnosis and treatment. In this study, we demonstrate how advancements in natural language processing, namely the instruction-following capability of recent large language models and extensive model pretraining, made it possible to automate metastases detection from radiology reports texts with a limited amount of gold-labelled data. Specifically, we prompt Llama3, an open-source instruction-tuned large language model, to generate synthetic training data to expand our limited labelled data and adapt BERT, a small pretrained language model, to the task. We further investigate three targeted data augmentation techniques which selectively expand the original training samples, leading to comparable or superior performance compared to vanilla data augmentation, in most cases, while being substantially more computationally efficient. In our experiments, data augmentation improved the average F1-score by 2.3, 3.5, and 3.9 points for lung, liver, and adrenal glands, the organs for which we had access to expert-annotated data. This observation suggests that Llama3, which has not been specifically tailored to this task or clinical data in general, can generate high-quality synthetic data through paraphrasing in the clinical context.We also compare metastasis identification accuracy between models utilizing institutionally standardized reports versus non-structured reports, which complicate the extraction of relevant information, and show how including patient history with a customized model architecture narrows the gap between those two setups from 7.3 to 4.5 points on F1-score under LoRA tuning. Our work delivers a broadly applicable solution with remarkable performance that does not require model customization for each institution, making large-scale, low-cost spatio-temporal cancer progression pattern extraction possible.
Keywords: Synthetic data generation, targeted data augmentation, metastases detection, Natural Language Processing, Large language models, free-text radiology report Generative Data Augmentation for Metastases Detection
Received: 18 Oct 2024; Accepted: 15 Jan 2025.
Copyright: © 2025 Ashofteh Barabadi, Zhu, Chan, Simpson and Do. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Maede Ashofteh Barabadi, Queen's University, Kingston, Canada
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.