AUTHOR=Harkness Rachael, Frangi Alejandro F., Zucker Kieran, Ravikumar Nishant TITLE=Multi-centre benchmarking of deep learning models for COVID-19 detection in chest x-rays JOURNAL=Frontiers in Radiology VOLUME=4 YEAR=2024 URL=https://www.frontiersin.org/journals/radiology/articles/10.3389/fradi.2024.1386906 DOI=10.3389/fradi.2024.1386906 ISSN=2673-8740 ABSTRACT=

Introduction

This study retrospectively evaluates deep learning models developed to detect COVID-19 from chest x-rays, with the goal of assessing the suitability of such systems as clinical decision support tools.

Methods

Models were trained on the National COVID-19 Chest Imaging Database (NCCID), a UK-wide multi-centre dataset from 26 NHS hospitals, and evaluated on independent multi-national clinical datasets. The evaluation considers clinical and technical contributors to model error and potential model bias. Model predictions are examined for spurious feature correlations using techniques for explainable prediction.

Results

Models performed adequately on NHS populations, with performance comparable to radiologists, but generalised poorly to international populations. Models performed better in males than females, and performance varied across age groups. Alarmingly, models routinely failed when applied to complex clinical cases with confounding pathologies and when applied to radiologist-defined “mild” cases.
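
As an illustration of the kind of subgroup comparison summarised above, the following is a minimal sketch of how sensitivity and specificity might be stratified by a demographic attribute such as sex. It is not the authors' evaluation code: the function name `subgroup_rates`, the record fields (`sex`, `label`, `pred`), and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def subgroup_rates(records, group_key):
    """Return {group: (sensitivity, specificity)} for binary predictions.

    Each record is a dict with a ground-truth "label" (1 = positive),
    a model "pred" (1 = predicted positive), and a grouping field.
    All field names here are assumptions for this sketch.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for r in records:
        c = counts[r[group_key]]
        if r["label"] == 1:
            c["tp" if r["pred"] == 1 else "fn"] += 1
        else:
            c["tn" if r["pred"] == 0 else "fp"] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        # Leave a rate undefined (None) when a group has no cases of that class.
        sensitivity = c["tp"] / positives if positives else None
        specificity = c["tn"] / negatives if negatives else None
        rates[group] = (sensitivity, specificity)
    return rates


# Toy, made-up records; they do not reflect any result from the study.
records = [
    {"sex": "M", "label": 1, "pred": 1},
    {"sex": "M", "label": 1, "pred": 1},
    {"sex": "M", "label": 0, "pred": 0},
    {"sex": "M", "label": 0, "pred": 1},
    {"sex": "F", "label": 1, "pred": 1},
    {"sex": "F", "label": 1, "pred": 0},
    {"sex": "F", "label": 0, "pred": 0},
    {"sex": "F", "label": 0, "pred": 0},
]

rates = subgroup_rates(records, "sex")
for group, (sens, spec) in sorted(rates.items()):
    print(group, "sensitivity:", sens, "specificity:", spec)
```

The same function could be applied to an age-band field or a disease-severity field; reporting both rates per group, rather than a single pooled accuracy, is what makes gaps between subgroups visible.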

Discussion

This comprehensive benchmarking study examines the pitfalls in current practices that have led to impractical model development. Key findings highlight the need for clinician involvement at all stages of model development, from data curation and label definition to model evaluation, so that all clinical factors and disease features are appropriately considered during model design. This is imperative to ensure that automated approaches developed for disease detection are fit for purpose in a clinical setting.