Introduction

AUTHOR=Tayyab Maryam, Metz Luanne M., Li David K.B., Kolind Shannon, Carruthers Robert, Traboulsee Anthony, Tam Roger C.

TITLE=Accounting for uncertainty in training data to improve machine learning performance in predicting new disease activity in early multiple sclerosis

JOURNAL=Frontiers in Neurology

VOLUME=Volume 14 - 2023

YEAR=2023

URL=https://www.frontiersin.org/journals/neurology/articles/10.3389/fneur.2023.1165267

DOI=10.3389/fneur.2023.1165267

ISSN=1664-2295

ABSTRACT=<sec><title>Introduction</title><p>Machine learning (ML) has great potential for using health data to predict clinical outcomes in individual patients. Missing data are a common challenge in training ML algorithms, such as when subjects withdraw from a clinical study, leaving some samples with missing outcome labels. In this study, we compared three ML models to determine whether accounting for label uncertainty can improve a model’s predictions.</p></sec><sec><title>Methods</title><p>We used a dataset from a completed phase-III clinical trial that evaluated the efficacy of minocycline for delaying the conversion from clinically isolated syndrome to multiple sclerosis (MS), using the McDonald 2005 diagnostic criteria. Of the 142 participants, 81 had converted to MS at the 2-year follow-up, 29 remained stable, and 32 had uncertain outcomes. In a stratified 7-fold cross-validation, we trained three random forest (RF) ML models using MRI volumetric features and clinical variables to predict the conversion outcome, which represented new disease activity within 2 years of a first clinical demyelinating event.
One RF was trained with the subjects who had uncertain labels excluded (RF<sub>exclude</sub>), another RF was trained using the entire dataset but with assumed labels for the uncertain group (RF<sub>naive</sub>), and a third, a probabilistic RF (PRF; a type of RF that can model label uncertainty), was trained on the entire dataset, with probabilistic labels assigned to the uncertain group.</p></sec><sec><title>Results</title><p>The PRF outperformed both RF models, achieving the highest AUC (0.76, compared with 0.69 for RF<sub>exclude</sub> and 0.71 for RF<sub>naive</sub>) and the highest F1-score (86.6%, compared with 82.6% for RF<sub>exclude</sub> and 76.8% for RF<sub>naive</sub>).</p></sec><sec><title>Conclusion</title><p>Machine learning algorithms capable of modeling label uncertainty can improve predictive performance in datasets in which a substantial number of subjects have unknown outcomes.</p></sec>
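
The three labeling strategies and the stratified 7-fold split described in the Methods can be illustrated with a minimal sketch. It uses only the group sizes given in the abstract (81 converted, 29 stable, 32 uncertain) and is not the authors' implementation. The abstract does not state which label RF<sub>naive</sub> assumed for the uncertain group, what probabilities the PRF was given, or how the uncertain group was handled during stratification, so `ASSUMED_LABEL`, `P_CONVERT_UNCERTAIN`, and the choice to treat the uncertain group as its own stratum are placeholders, and all function names are hypothetical.

```python
import random

# Group sizes reported in the abstract.
N_CONVERTED, N_STABLE, N_UNCERTAIN = 81, 29, 32

# Placeholder values (assumptions, not taken from the source).
ASSUMED_LABEL = 1          # label given to uncertain subjects in the naive setup
P_CONVERT_UNCERTAIN = 0.5  # conversion probability given to uncertain subjects


def build_cohort():
    """Outcome per subject: 1 = converted, 0 = stable, None = uncertain."""
    return [1] * N_CONVERTED + [0] * N_STABLE + [None] * N_UNCERTAIN


def labels_exclude(outcomes):
    """RF_exclude-style training set: drop subjects with uncertain outcomes."""
    return [(i, y) for i, y in enumerate(outcomes) if y is not None]


def labels_naive(outcomes):
    """RF_naive-style training set: keep everyone, assume a hard label."""
    return [(i, ASSUMED_LABEL if y is None else y) for i, y in enumerate(outcomes)]


def labels_probabilistic(outcomes):
    """PRF-style training set: keep everyone, use a probability as the label."""
    return [
        (i, P_CONVERT_UNCERTAIN if y is None else float(y))
        for i, y in enumerate(outcomes)
    ]


def stratified_folds(outcomes, k=7, seed=0):
    """Split subject indices into k folds with similar group proportions.

    Assumption: uncertain subjects form their own stratum.
    """
    rng = random.Random(seed)
    groups = {}
    for i, y in enumerate(outcomes):
        groups.setdefault(y, []).append(i)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across groups to balance fold sizes
    for members in groups.values():
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


cohort = build_cohort()
folds = stratified_folds(cohort, k=7)
```

Under these assumptions, the excluded-label training set contains 110 of the 142 subjects, while the naive and probabilistic sets retain all 142 and differ only in how the 32 uncertain subjects are labeled.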