ORIGINAL RESEARCH article

Front. Med.
Sec. Translational Medicine
Volume 12 - 2025 | doi: 10.3389/fmed.2025.1435428

Diagnostics of lung cancer by fragmentated blood circulating cell-free DNA based on machine learning methods

Provisionally accepted
Ivan O Meshkov 1, Alexander P Koturgin 1, Pavel V Ershov 1*, Liubov A Safonova 1, Julia A Remizova 1, Valentina V Maksyutina 1, Ekaterina D Maralova 1, Vasilisa A Astafieva 1, Alexey A Ivashechkin 1, Boris D Ignatiev 1, Antonida V Makhotenko 1, Ekaterina A Snigir 1, Valentin V Makarov 1, Vladimir S Yudin 1, Anton A Keskinov 1, Sergey M Yudin 1, Anna S Makarova 1, Veronika I Skvortsova 2
  • 1 Centre for Strategic Planning, of the Federal medical and biological agency, Moscow, Russia
  • 2 Federal Medical & Biological Agency of Russia, Moscow, Moscow Oblast, Russia

The final, formatted version of the article will be published soon.

    Introduction: Minimally invasive diagnostics based on liquid biopsy enables early detection of lung cancer (LC). Circulating cell-free DNA (cfDNA) fragments in blood plasma reflect genome and chromatin status and are considered integral cancer biomarkers and biological entities for 'cancer-of-origin' prediction. The aim of this work was to create a method for processing next-generation sequencing (NGS) data and an interpretable binary classification model (CM) that analyzes cfDNA fragmentation features to distinguish healthy subjects from subjects with LC.

    Methods: The study included 148 healthy subjects and 138 subjects with LC. cfDNA fractions isolated from blood plasma biospecimens were used for DNA library preparation and NGS on the Illumina NovaSeq 6000 system with a coverage of 100 million reads per sample. Genomic fragmentation was characterized by 12 variables describing the abundance and length distribution of cfDNA fragments within each genomic interval and by 40 variables, based on the values of position-weight matrices, describing combinations of 5-bp-long terminal motifs of cfDNA fragments. The first-phase classification models were based either on logistic regression with L1 and L2 regularization or were probabilistic CMs based on Gaussian processes. The second-phase CM was based on kernel logistic regression.

    Results: The final CM distinguished healthy subjects from subjects with LC with AUC values of 0.870-0.875. The performance of the developed CM was evaluated using datum and testing sets for each LC stage category. Sensitivity ranged from 66.7% to 85.7%, from 77.8% to 100%, and from 70% to 80% for LC stages I, II, and III, respectively. Specificity ranged from 79.3% to 90.0%.

    Discussion: Thus, the CM has good diagnostic value and does not require clinical or other data on tumor-associated biomarkers. The current method for LC detection has several advantages for future clinical implementation as a decision-making support system: the CM requires data exclusively from NGS analysis of blood plasma cfDNA fragmentation; its accuracy does not depend on any additional clinical data; it is highly interpretable and traceable; and it has an appropriate modular architecture.
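
    As an illustration of the terminal-motif features mentioned in the abstract, the following is a minimal sketch of how base frequencies at the first five positions of cfDNA fragments could be tabulated into a position frequency matrix, from which a position-weight matrix could in principle be derived. It is not the authors' pipeline: the function names (`end_motif_frequency_matrix`, `flatten`), the choice of 5' ends only, and the rule for skipping short or ambiguous fragments are assumptions made for this example.

```python
from collections import Counter

BASES = "ACGT"
MOTIF_LEN = 5  # 5-bp terminal motifs, as stated in the abstract


def end_motif_frequency_matrix(fragments):
    """Return a MOTIF_LEN x 4 matrix of base frequencies at fragment 5' ends.

    Row i holds the frequencies of A, C, G, T at position i of the terminal
    motif. Fragments shorter than MOTIF_LEN, or with a non-ACGT base inside
    the motif, are skipped (an assumption of this sketch).
    """
    counts = [Counter() for _ in range(MOTIF_LEN)]
    used = 0
    for seq in fragments:
        motif = seq[:MOTIF_LEN].upper()
        if len(motif) < MOTIF_LEN or any(base not in BASES for base in motif):
            continue
        for pos, base in enumerate(motif):
            counts[pos][base] += 1
        used += 1
    if used == 0:
        raise ValueError("no usable fragments")
    return [[counts[pos][base] / used for base in BASES]
            for pos in range(MOTIF_LEN)]


def flatten(matrix):
    """Flatten the matrix row by row into a feature vector of 20 values."""
    return [value for row in matrix for value in row]


# Hypothetical toy fragments; real input would be read sequences from NGS data.
toy_fragments = ["ACGTAGG", "ACGTA", "CCGTATT", "NNNNNAA", "ACG"]
matrix = end_motif_frequency_matrix(toy_fragments)
features = flatten(matrix)
```

    Each row of the matrix sums to 1, so one fragment end yields 20 values per sample. How the 40 variables reported in the abstract are actually constructed is not specified there, so this sketch should be read only as one plausible way to turn terminal motifs into numeric features.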

    Keywords: machine learning methods, lung cancer, fragmentome, circulating cell-free DNA, cfDNA, diagnostics classification model, cancer early detection

    Received: 20 May 2024; Accepted: 06 Jan 2025.

    Copyright: © 2025 Meshkov, Koturgin, Ershov, Safonova, Remizova, Maksyutina, Maralova, Astafieva, Ivashechkin, Ignatiev, Makhotenko, Snigir, Makarov, Yudin, Keskinov, Yudin, Makarova and Skvortsova. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Pavel V Ershov, Centre for Strategic Planning, of the Federal medical and biological agency, Moscow, Russia

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.