BRIEF RESEARCH REPORT article

Front. Vet. Sci.
Sec. Comparative and Clinical Medicine
Volume 11 - 2024 | doi: 10.3389/fvets.2024.1490030
This article is part of the Research Topic Utilizing Real World Data and Real World Evidence in Veterinary Medicine: Current Practices and Future Potentials

Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records

Provisionally accepted
Judit Magnusson Wulcan 1*, Kevin Lee Jacques 1, Mary Ann Lee 2, Samantha Lee Kovacs 1, Nicole Dausend 3, Lauren E Prince 1, Jonatan Wulcan 4, Sina Marsilio 3, Stefan M Keller 1*
  • 1 School of Veterinary Medicine, Department of Pathology, Microbiology and Immunology, University of California, Davis, Davis, United States
  • 2 College of Veterinary Medicine and Biomedical Sciences, James L. Voss Veterinary Teaching Hospital, Colorado State University, Fort Collins, Colorado, United States
  • 3 School of Veterinary Medicine, Department of Medicine and Epidemiology, University of California, Davis, Davis, California, United States
  • 4 Independent researcher, Malmoe, Sweden

The final, formatted version of the article will be published soon.

    Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of hyperparameter settings, and the influence of text ambiguity have not previously been evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and by investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. Compared to the majority opinion of human respondents, GPT-4o achieved 96.9% sensitivity (interquartile range [IQR] 92.9-99.3%), 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). GPT-4o performed significantly better than its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity, where GPT-3.5 Turbo achieved only 81.7% (IQR 78.9-84.8%). GPT-4o demonstrated greater reproducibility than human pairs, with an average Cohen's kappa of 0.98 (IQR 0.98-0.99) compared to 0.8 (IQR 0.78-0.81) for humans. Most GPT-4o errors occurred in instances where humans disagreed (35/43 errors [81.4%]), suggesting that these errors were more likely caused by ambiguity in the EHRs than by explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
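
    The metrics reported above (sensitivity, specificity, predictive values, F1 score, balanced accuracy, and Cohen's kappa) follow standard definitions for binary classification. As an illustration only, the sketch below shows how they could be computed for one clinical sign, using the majority opinion of five raters as the reference. It is not the authors' code: the function names and the toy ratings are hypothetical, and it does not reproduce how the study aggregated results into medians and IQRs.

```python
import math


def majority_vote(votes):
    """Return True if more than half of the raters marked the sign as present."""
    return sum(votes) > len(votes) / 2


def _ratio(numerator, denominator):
    """Divide safely; undefined ratios (zero denominator) become NaN."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(reference, predicted):
    """Confusion-matrix metrics for binary labels (True = sign present)."""
    pairs = list(zip(reference, predicted))
    tp = sum(r and p for r, p in pairs)
    tn = sum(not r and not p for r, p in pairs)
    fp = sum(not r and p for r, p in pairs)
    fn = sum(r and not p for r, p in pairs)
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # fraction rater A marked positive
    p_b = sum(rater_b) / n  # fraction rater B marked positive
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return _ratio(p_observed - p_expected, 1 - p_expected)


# Toy data (hypothetical): 8 records, each rated by 5 humans for one sign.
human_votes = [
    (1, 1, 1, 1, 1),
    (1, 1, 1, 0, 0),
    (0, 0, 1, 1, 0),
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (0, 1, 0, 0, 0),
    (1, 1, 0, 1, 0),
    (0, 0, 0, 0, 0),
]
# Hypothetical model output for the same 8 records.
model_labels = [True, True, True, False, False, False, False, False]

reference = [majority_vote(votes) for votes in human_votes]
metrics = classification_metrics(reference, model_labels)
kappa = cohen_kappa(reference, model_labels)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"kappa (model vs. majority): {kappa:.3f}")
```

    Using an odd number of raters, as in the study's five-human panel, means a binary majority vote cannot tie. The same `cohen_kappa` function could be applied to pairs of repeated model runs or pairs of human raters to compare reproducibility.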

    Keywords: machine learning, artificial intelligence, generative-pretrained transformers, Chat-GPT, text mining, feline chronic enteropathy, Real world evidence (RWE), Real world data (RWD)

    Received: 02 Sep 2024; Accepted: 19 Dec 2024.

    Copyright: © 2024 Magnusson Wulcan, Jacques, Lee, Kovacs, Dausend, Prince, Wulcan, Marsilio and Keller. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence:
    Judit Magnusson Wulcan, School of Veterinary Medicine, Department of Pathology, Microbiology and Immunology, University of California, Davis, Davis, United States
    Stefan M Keller, School of Veterinary Medicine, Department of Pathology, Microbiology and Immunology, University of California, Davis, Davis, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.