ORIGINAL RESEARCH article

Front. Med.
Sec. Regulatory Science
Volume 11 - 2024 | doi: 10.3389/fmed.2024.1393123
This article is part of the Research Topic Unlocking The Potential of Health Data Spaces With The Proliferation of New Tools, Technologies and Digital Solutions

A Scalable and Transparent Data Pipeline for AI-Enabled Health Data Ecosystems

Provisionally accepted
  • 1 Software Research and Development Consulting, Ankara, Türkiye
  • 2 San Juan de Dios Research Foundation, Barcelona, Catalonia, Spain

The final, formatted version of the article will be published soon.

    Introduction: Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). The lack of transparency in the data preparation process is a significant obstacle to developing reliable AI systems and can lead to issues related to reproducibility, debugging AI models, bias and fairness, and compliance and regulation. We introduce a formal data preparation pipeline specification to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications, with a focus on traceability.

    We propose a declarative language to define the extraction of AI-ready datasets from health data adhering to a common data model, particularly data conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We use FHIR profiling to develop a common data model tailored to an AI use case, which enables the explicit declaration of the needed information such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health record data represented with irregular time series sampling to a flat structure by defining a target population, feature groups and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations.

    We implement a scalable, high-performance feature repository to execute the data preparation pipeline definitions. This software not only ensures reliable, fault-tolerant distributed processing to produce AI-ready datasets and their metadata, including many statistics, but also serves as a pluggable component of a decision support application based on a trained AI model, automatically preparing the feature values of individual entities during online prediction. We deployed and tested the proposed methodology and the implementation in three different research projects. We present the FHIR profiles developed as a common data model, together with the feature group definitions and feature definitions of a data preparation pipeline used to train an AI model for "predicting complications after cardiac surgeries".

    Discussion: Through implementation across various pilot use cases, we have demonstrated that our framework has the breadth and flexibility needed to define a diverse array of features, each tailored to specific temporal and contextual criteria.
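
    The flattening step described in the abstract (irregularly sampled records reduced to one row per member of a target population) can be illustrated with a minimal Python sketch. This is not the authors' declarative language or feature repository; the names Observation, flatten, the "creatinine" code and the 7-day lookback window are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Observation:
    """A simplified, hypothetical stand-in for a timestamped clinical record."""
    patient_id: str
    code: str
    value: float
    time: datetime


def flatten(population, observations, features):
    """Build one flat row per patient in the target population.

    population: {patient_id: reference_time}
    features:   {column_name: (code, lookback, aggregate_fn)}
    Each feature aggregates the values recorded within `lookback`
    before the patient's reference time; missing data yields None.
    """
    ordered = sorted(observations, key=lambda o: o.time)
    rows = []
    for pid, ref in population.items():
        row = {"patient_id": pid}
        for name, (code, lookback, agg) in features.items():
            values = [
                o.value for o in ordered
                if o.patient_id == pid and o.code == code
                and ref - lookback <= o.time <= ref
            ]
            row[name] = agg(values) if values else None
        rows.append(row)
    return rows


# Hypothetical example data (assumed, not from the article).
population = {"p1": datetime(2024, 1, 10), "p2": datetime(2024, 1, 12)}
observations = [
    Observation("p1", "creatinine", 1.0, datetime(2024, 1, 8)),
    Observation("p1", "creatinine", 2.0, datetime(2024, 1, 9, 18)),
    Observation("p1", "creatinine", 0.5, datetime(2023, 12, 1)),  # outside window
    Observation("p3", "creatinine", 3.0, datetime(2024, 1, 9)),   # not in population
]
features = {
    "creatinine_mean_7d": ("creatinine", timedelta(days=7), mean),
    "creatinine_last_7d": ("creatinine", timedelta(days=7), lambda v: v[-1]),
}
rows = flatten(population, observations, features)
```

    In this sketch the population, the group of features and the resulting rows loosely mirror the target population, feature groups and final datasets named in the abstract. Keeping each feature as data (code, window, aggregate) rather than ad hoc extraction code is what makes the definition inspectable and hence traceable.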

    Keywords: artificial intelligence, dataset, harmonization, transparency, FHIR, interoperability, health data spaces

    Received: 28 Feb 2024; Accepted: 16 Jul 2024.

    Copyright: © 2024 Namli, SINACI, Gönül, Herguido, Cañadilla, Muñoz, Esteve and Laleci Erturkmen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence:
    Tuncay Namli, Software Research and Development Consulting, Ankara, Türkiye
    A. Anil SINACI, Software Research and Development Consulting, Ankara, Türkiye
    Gokce Banu Laleci Erturkmen, Software Research and Development Consulting, Ankara, Türkiye

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.