
BRIEF RESEARCH REPORT article

Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 7 - 2024 | doi: 10.3389/frai.2024.1379297
This article is part of the Research Topic: Soft Computing and Machine Learning Applications for Healthcare Systems.

A comparison of the diagnostic ability of large language models in challenging clinical cases

Provisionally accepted
Maria P. Khan 1, Eoin D. O'Sullivan 2,3*
  • 1 Metro North Hospital and Health Service, Herston, Queensland, Australia
  • 2 Kidney Health Service, Metro North Hospital and Health Service, Royal Brisbane and Women’s Hospital, Brisbane, Australia
  • 3 Institute for Molecular Bioscience, The University of Queensland, St Lucia, Queensland, Australia

The final, formatted version of the article will be published soon.

Background: The rise of accessible, consumer-facing large language models (LLMs) provides an opportunity for immediate diagnostic support for clinicians. We aimed to compare the performance characteristics of common LLMs in solving complex clinical cases and to assess the utility of a novel tool for grading LLM output.

Methods: Using a newly developed rubric to assess the models' diagnostic utility, we measured each model's ability to answer cases according to accuracy, readability, clinical interpretability, and safety. We present a comparative analysis of three LLMs (Bing, ChatGPT, and Gemini) across a diverse set of clinical cases drawn from the New England Journal of Medicine's case series.

Results: The models performed differently when presented with identical clinical information, with Gemini performing best. Our grading tool showed low interobserver variability and proved reliable for grading LLM clinical output.

Conclusion: This research underscores the variation in model performance across clinical scenarios and highlights the importance of evaluating diagnostic model performance in diverse clinical settings prior to deployment. Furthermore, we provide a new tool for assessing LLM output.
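To make the grading approach concrete, the sketch below illustrates one way such a rubric could be scored and its interobserver variability quantified. This is a hypothetical example, not the authors' published tool: the 0-2 ordinal scale, the dimension names, the two-rater design, and the toy scores are assumptions for illustration, and Cohen's kappa is used here simply as a common agreement statistic.

    # Hypothetical sketch of rubric-based grading of LLM output with an
    # interobserver-agreement check. Not the authors' actual rubric or
    # analysis code: the 0-2 ordinal scale and all data are illustrative.
    from sklearn.metrics import cohen_kappa_score

    DIMENSIONS = ["accuracy", "readability", "interpretability", "safety"]

    # Toy scores: one 0-2 rating per dimension per case, from two raters.
    rater_a = {
        "accuracy":         [2, 1, 2, 0, 1],
        "readability":      [2, 2, 1, 1, 2],
        "interpretability": [1, 1, 2, 0, 1],
        "safety":           [2, 2, 2, 1, 2],
    }
    rater_b = {
        "accuracy":         [2, 1, 1, 0, 1],
        "readability":      [2, 2, 1, 1, 2],
        "interpretability": [1, 2, 2, 0, 1],
        "safety":           [2, 2, 2, 1, 2],
    }

    for dim in DIMENSIONS:
        # Linearly weighted kappa respects the ordinal 0-2 scale;
        # values near 1 indicate low interobserver variability.
        kappa = cohen_kappa_score(rater_a[dim], rater_b[dim], weights="linear")
        mean_score = sum(rater_a[dim] + rater_b[dim]) / (2 * len(rater_a[dim]))
        print(f"{dim:16s} mean={mean_score:.2f} kappa={kappa:.2f}")

In a design like this, per-dimension means would support the model comparison while the kappa values would support the reliability claim; the paper's actual scale, rater count, and statistics are not specified in the abstract.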

Keywords: artificial intelligence, machine learning, clinical medicine, LLM, diagnostics

    Received: 30 Jan 2024; Accepted: 23 Jul 2024.

    Copyright: © 2024 Khan and O'Sullivan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Eoin D. O'Sullivan, Kidney Health Service, Metro North Hospital and Health Service, Royal Brisbane and Women’s Hospital, Brisbane, Australia

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.