This study is a retrospective evaluation of the performance of deep learning models that were developed for the detection of COVID-19 from chest x-rays, undertaken with the goal of assessing the suitability of such systems as clinical decision support tools.
Models were trained on the National COVID-19 Chest Imaging Database (NCCID), a UK-wide multi-centre dataset from 26 different NHS hospitals and evaluated on independent multi-national clinical datasets. The evaluation considers clinical and technical contributors to model error and potential model bias. Model predictions are examined for spurious feature correlations using techniques for explainable prediction.
Models performed adequately on NHS populations, with performance comparable to radiologists, but generalised poorly to international populations. Models performed better in males than females, and performance varied across age groups. Alarmingly, models routinely failed when applied to complex clinical cases with confounding pathologies and when applied to radiologist defined “mild” cases.
This comprehensive benchmarking study examines the pitfalls in current practices that have led to impractical model development. Key findings highlight the need for clinician involvement at all stages of model development, from data curation and label definition, to model evaluation, to ensure that all clinical factors and disease features are appropriately considered during model design. This is imperative to ensure automated approaches developed for disease detection are fit-for-purpose in a clinical setting.