ORIGINAL RESEARCH article

Front. Med.

Sec. Healthcare Professions Education

Volume 12 - 2025 | doi: 10.3389/fmed.2025.1545730

This article is part of the Research Topic: Large Language Models for Medical Applications

Comparison of Medical History Documentation Efficiency and Quality Based on GPT-4o: A Study Comparing Residents and Artificial Intelligence

Provisionally accepted
Li Xiaoyang1*, Xiaoyang Lu1, Xinyi Wang1, Zhenye Gong1, Jie Cheng1, Weiguo Hu1, Shaun Wu2, Rong Wang3, Xinqi Gao1
  • 1Ruijin Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
  • 2WORK Medical Technology Group LTD, Hangzhou, Zhejiang Province, China
  • 3Shanghai Resident Standardized Training Center, Shanghai, China

The final, formatted version of the article will be published soon.

Background: As medical technology advances, physicians' responsibilities in clinical practice continue to grow, and medical history documentation remains an essential component of that workload. Artificial Intelligence (AI) technologies, particularly advances in Natural Language Processing (NLP), have introduced new possibilities for medical documentation. This study aims to evaluate the efficiency and quality of medical history documentation by ChatGPT-4o compared to resident physicians and to explore potential applications of AI in clinical documentation.

Methods: Using a non-inferiority design, this study compared documentation time and quality scores between 5 resident physicians from the hematology department (average clinical experience 2.4 years) and ChatGPT-4o, based on identical case materials. Medical history quality was evaluated by two attending physicians, each with over 10 years of clinical experience, against ten case content criteria. Detailed scoring criteria covered completeness (coverage of history elements), accuracy (correctness of information), logic (organization and coherence of content), and professionalism (appropriate use of medical terminology and format), each rated on a 10-point scale. Data were analyzed using paired t-tests and Wilcoxon signed-rank tests, with Kappa coefficients used to assess scoring consistency.

Results: In terms of medical history quality, ChatGPT-4o achieved an average score of 88.9 versus 89.6 for resident physicians, with no statistically significant difference between the two (p=0.25). The Kappa coefficient between the two evaluators was 0.82, indicating good scoring consistency. Non-inferiority testing showed that ChatGPT-4o's quality scores fell within the preset non-inferiority margin (5 points), indicating that its documentation quality was not inferior to that of the resident physicians. ChatGPT-4o's average documentation time was 40.1 seconds, significantly shorter than the residents' average of 14.9 minutes (p<0.001).

Conclusion: ChatGPT-4o significantly reduced the time required for medical history documentation while matching resident physicians in quality. Despite these positive results, practical considerations such as data preprocessing, data security, and privacy protection must be addressed in real-world applications. Future research should further explore ChatGPT-4o's capabilities in handling complex cases and its applicability across different clinical settings.
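The abstract does not include the authors' analysis code, but the statistical workflow it describes (paired t-test, Wilcoxon signed-rank test, a one-sided non-inferiority check against a 5-point margin, and inter-rater kappa) can be sketched as follows. This is a minimal illustration only: the score arrays, the per-case rater values, and the score binning for kappa are hypothetical stand-ins chosen to roughly mirror the reported means, not the study's data.

```python
# Illustrative sketch of the reported analysis; all data below are hypothetical.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired quality scores (0-100) on the same cases,
# chosen to approximate the reported means (~88.9 vs ~89.6).
gpt_scores = np.array([88.0, 90.2, 88.5, 89.0, 89.5])
resident_scores = np.array([90.0, 89.0, 90.0, 90.0, 90.0])

# Paired t-test and Wilcoxon signed-rank test on the paired scores.
t_stat, t_p = stats.ttest_rel(gpt_scores, resident_scores)
w_stat, w_p = stats.wilcoxon(gpt_scores, resident_scores)

# Non-inferiority check: the lower bound of a one-sided 95% CI for the
# mean difference (GPT minus resident) must lie above the -5 point margin.
margin = -5.0
diff = gpt_scores - resident_scores
ci_lower = diff.mean() - stats.t.ppf(0.95, len(diff) - 1) * stats.sem(diff)
non_inferior = ci_lower > margin

# Inter-rater agreement between the two attending evaluators. Cohen's kappa
# needs categorical ratings, so per-case scores are binned here purely
# for illustration.
rater1 = [9, 8, 9, 9, 8]  # hypothetical binned ratings, evaluator 1
rater2 = [9, 8, 9, 8, 8]  # hypothetical binned ratings, evaluator 2
kappa = cohen_kappa_score(rater1, rater2)

print(f"paired t-test p={t_p:.3f}, Wilcoxon p={w_p:.3f}")
print(f"non-inferior (margin {margin}): {non_inferior}, kappa={kappa:.2f}")
```

Note that the confidence-interval approach shown here is one standard way to operationalize a non-inferiority margin; the abstract does not state which specific procedure the authors used.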

Keywords: artificial intelligence, GPT-4o, medical history documentation, quality, efficiency

Received: 15 Dec 2024; Accepted: 24 Apr 2025.

Copyright: © 2025 Xiaoyang, Lu, Wang, Gong, Cheng, Hu, Wu, Wang and Gao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Li Xiaoyang, Ruijin Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.