ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Informatics

Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1561358

This article is part of the Research Topic: Healthcare Text Analytics: Unlocking the Evidence from Free Text, Volume V

ArcTEX: a novel clinical data enrichment pipeline to support real-world evidence oncology studies

Provisionally accepted
Keiran Tait1*, Joseph Cronin1, Olivia Wiper1, Jamie Wallis1, Jim Davies2, Robert Dürichen1
  • 1Arcturis Data, Oxford, United Kingdom
  • 2Department of Computer Science, University of Oxford, Oxford, United Kingdom

The final, formatted version of the article will be published soon.

Data stored within electronic health records (EHRs) offer a valuable source of information for real-world evidence (RWE) studies in oncology. However, many key clinical features are only available within unstructured notes. We present ArcTEX, a novel data enrichment pipeline developed to extract oncological features from NHS unstructured clinical notes with high accuracy, even in resource-constrained environments where the availability of GPUs might be limited. By design, the predicted outcomes of ArcTEX are free of patient-identifiable information, making this pipeline ideally suited for use in Trust environments. We compare our pipeline to existing discriminative and generative models, demonstrating its superiority over approaches such as Llama3/3.1/3.2 and BERT-based models, with a mean accuracy of 98.67% for several essential clinical features in endometrial and breast cancer. Additionally, we show that as few as 50 annotated training examples are needed to adapt the model to a different oncology area, such as lung cancer, with a different set of priority clinical features, achieving a comparable mean accuracy of 95%.

Keywords: Natural Language Processing, Data enrichment, Real world data, Electronic Health Records, oncology

Received: 15 Jan 2025; Accepted: 23 Apr 2025.

Copyright: © 2025 Tait, Cronin, Wiper, Wallis, Davies and Dürichen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Keiran Tait, Arcturis Data, Oxford, United Kingdom

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.