ORIGINAL RESEARCH article

Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 7 - 2024 | doi: 10.3389/frai.2024.1442975

Evaluation of Large Language Models under Different Training Backgrounds in the Chinese Medical Examination: A Comparative Study

Provisionally accepted
Siwen Zhang 1, Qi Chu 2, Yujun Li 1, Jialu Liu 1, Jiayi Wang 1, Chi Yan 1, Wenxi Liu 1, Yizhen Wang 1, Chengcheng Zhao 1, Xinyue Zhang 1, Yuwen Chen 1*
  • 1 Shenyang Pharmaceutical University, Shenyang, China
  • 2 Department of Clinical Laboratory, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Beijing, China

The final, formatted version of the article will be published soon.

    Background: Large language models (LLMs) have recently shown impressive potential in medical services. However, existing research primarily examines LLMs developed in English within English-speaking medical contexts, overlooking how LLMs developed under different linguistic environments perform in Chinese clinical medicine.

    Objective: Through a comparative analysis of three LLMs developed under different training backgrounds, we evaluate their potential as medical service tools in a Chinese-language environment and identify the limitations of their application in Chinese medical practice.

    Method: Using the APIs provided by the three LLMs, we conducted an automated assessment of their performance on the 2023 Chinese Medical Licensing Examination (CMLE). We examined the accuracy of the three LLMs across question types and categorized the causes of their errors. We also ran repeated trials on selected questions to evaluate the stability of the outputs generated by the LLMs.

    Results: The accuracy of GPT-4, ERNIE Bot, and DISC-MedLLM on the CMLE was 65.2%, 61.7%, and 25.3%, respectively. Among error types, knowledge errors accounted for 52.2% of GPT-4's errors and 51.7% of ERNIE Bot's, while hallucination errors accounted for 36.4% and 52.6%, respectively. In the Chinese text generation experiment, the general-purpose LLMs demonstrated strong natural language understanding and generated clear, standardized Chinese text. In the repeated trials, the LLMs showed output stability of about 70%, though inconsistent outputs still occurred.

    Conclusion: General-purpose LLMs, represented by GPT-4 and ERNIE Bot, can meet the standards of the CMLE. Despite being developed and trained in different linguistic contexts, they excel at understanding Chinese natural language and Chinese clinical knowledge, highlighting their broad potential for application in Chinese medical practice. However, these models still show deficiencies in mastering specialized knowledge, addressing ethical issues, and maintaining output stability. They also tend toward risk avoidance when providing medical advice.
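The two quantities reported above, per-model accuracy on multiple-choice items and output stability across repeated runs, can be sketched with a few lines of Python. This is a minimal illustration of the scoring logic only; the function names and data format are our assumptions, not taken from the paper, and real use would plug in answers returned by each model's API.

```python
def accuracy(gold, predictions):
    """Fraction of multiple-choice items answered correctly (exact match)."""
    correct = sum(g == p for g, p in zip(gold, predictions))
    return correct / len(gold)

def output_stability(runs):
    """Fraction of items whose answer is identical across all repeated runs.

    runs: a list of runs, each a list of per-item answers
    (e.g. three repeated passes over the same question set).
    """
    n_items = len(runs[0])
    stable = sum(
        1 for i in range(n_items)
        if len({run[i] for run in runs}) == 1
    )
    return stable / n_items

# Toy example with invented data:
gold = ["A", "B", "C", "D"]
preds = ["A", "B", "C", "A"]
print(accuracy(gold, preds))  # 3 of 4 correct -> 0.75

runs = [["A", "B"], ["A", "C"], ["A", "B"]]
print(output_stability(runs))  # only item 0 agrees across runs -> 0.5
```

Errors flagged by `accuracy` could then be hand-labeled into the paper's categories (e.g. knowledge errors vs. hallucination errors); that classification step requires human judgment and is not automated here.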

    Keywords: General Large Language Model, Medical Large Language Model, CMLE, artificial intelligence, OpenAI

    Received: 10 Jun 2024; Accepted: 06 Nov 2024.

    Copyright: © 2024 Zhang, Chu, Li, Liu, Wang, Yan, Liu, Wang, Zhao, Zhang and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Yuwen Chen, Shenyang Pharmaceutical University, Shenyang, China

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.