
ORIGINAL RESEARCH article

Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 7 - 2024 | doi: 10.3389/frai.2024.1514896

Evaluating accuracy and reproducibility of large language model performance on critical care assessments in pharmacy education

Provisionally accepted
  • 1 University of Virginia, Charlottesville, Virginia, United States
  • 2 College of Pharmacy, University of Georgia, Athens, United States
  • 3 University of Colorado, Denver, Colorado, United States

The final, formatted version of the article will be published soon.

    Background: Large language models (LLMs) have demonstrated impressive performance on medical licensing and diagnosis-related examinations. However, comparative evaluations to optimize LLM performance and ability in the domain of comprehensive medication management (CMM) are lacking. The purpose of this evaluation was to test the performance of various LLMs, and strategies for optimizing that performance, on critical care pharmacotherapy questions used in the assessment of Doctor of Pharmacy students.

    Methods: In a comparative analysis using 219 multiple-choice pharmacotherapy questions, five LLMs (GPT-3.5, GPT-4, Claude 2, Llama2-7b, and Llama2-13b) were evaluated. Each LLM was queried five times to evaluate the primary outcomes of accuracy (i.e., correctness) and variance. Secondary outcomes explored the impact of prompt engineering techniques (e.g., chain-of-thought, CoT), training of a customized GPT, and comparison to third-year Doctor of Pharmacy students on knowledge recall versus knowledge application questions. Accuracy and variance under different model settings were compared with Student's t-tests.

    Results: ChatGPT-4 exhibited the highest accuracy (71.6%), while Llama2-13b had the lowest variance (0.070). All LLMs performed more accurately on knowledge recall than on knowledge application questions (e.g., ChatGPT-4: 87% vs. 67%). When applied to ChatGPT-4, few-shot CoT across five runs improved accuracy (77.4% vs. 71.5%) with no effect on variance. Self-consistency and the custom-trained GPT demonstrated accuracy similar to ChatGPT-4 with few-shot CoT. Overall pharmacy student accuracy was 81%, compared to an optimal overall LLM accuracy of 73%. Comparing question types, six of the LLM configurations demonstrated equivalent or higher accuracy than pharmacy students on knowledge recall questions (e.g., self-consistency vs. students: 93% vs. 84%), but pharmacy students achieved higher accuracy than all LLMs on knowledge application questions (e.g., self-consistency vs. students: 68% vs. 80%).

    Conclusions: ChatGPT-4 was the most accurate LLM on critical care pharmacy questions, and few-shot CoT improved accuracy the most. Average student accuracy was similar to that of the LLMs overall, and higher on knowledge application questions. These findings support the need for future assessment of customized training for the type of output needed. Reliance on LLMs is supported only for recall-based questions.
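
    For readers interested in reproducing this style of evaluation, the sketch below illustrates the repeated-query protocol described in the abstract: each model setting is run five times over the question bank, mean accuracy and run-to-run variance are computed, and two settings are compared with a Student's t-test; a small helper also illustrates self-consistency voting. The query_model function, the question format, and the A-D answer grading are illustrative assumptions (the LLM call is simulated with random answers so the script runs without API access), not the authors' actual harness.

        # Minimal sketch of the repeated-query evaluation protocol; the LLM
        # call is a hypothetical placeholder simulated with random answers.
        import random
        from collections import Counter
        from statistics import mean, variance

        from scipy.stats import ttest_ind  # two-sample Student's t-test

        N_RUNS = 5        # each model setting is queried five times
        CHOICES = "ABCD"  # multiple-choice answer options

        def query_model(prompt: str) -> str:
            """Hypothetical LLM call; simulated here with a random choice."""
            return random.choice(CHOICES)

        def run_accuracy(bank: list[dict]) -> float:
            """Fraction of questions answered correctly in one full pass."""
            hits = sum(query_model(q["prompt"]) == q["key"] for q in bank)
            return hits / len(bank)

        def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
            """Self-consistency: sample several answers, take the majority vote."""
            votes = Counter(query_model(prompt) for _ in range(n_samples))
            return votes.most_common(1)[0][0]

        if __name__ == "__main__":
            # Simulated question bank matching the study's size (219 MCQs).
            bank = [{"prompt": f"Q{i}", "key": random.choice(CHOICES)}
                    for i in range(219)]
            scores_a = [run_accuracy(bank) for _ in range(N_RUNS)]  # e.g., zero-shot
            scores_b = [run_accuracy(bank) for _ in range(N_RUNS)]  # e.g., few-shot CoT
            t, p = ttest_ind(scores_a, scores_b)
            print(f"A: accuracy={mean(scores_a):.3f}, variance={variance(scores_a):.4f}")
            print(f"B: accuracy={mean(scores_b):.3f}, variance={variance(scores_b):.4f}")
            print(f"Student's t-test: t={t:.2f}, p={p:.3f}")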

    Keywords: large language model, artificial intelligence, pharmacy, education, critical care, medical education, higher education, machine learning

    Received: 21 Oct 2024; Accepted: 23 Dec 2024.

    Copyright: © 2024 Yang, Hu, Most, Murray, Smith, Hawkins, Li and Sikora. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Andrea Sikora, College of Pharmacy, University of Georgia, Athens, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.