AUTHOR=Holmes Jason, Liu Zhengliang, Zhang Lian, Ding Yuzhen, Sio Terence T., McGee Lisa A., Ashman Jonathan B., Li Xiang, Liu Tianming, Shen Jiajian, Liu Wei TITLE=Evaluating large language models on a highly-specialized topic, radiation oncology physics JOURNAL=Frontiers in Oncology VOLUME=13 YEAR=2023 URL=https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2023.1219326 DOI=10.3389/fonc.2023.1219326 ISSN=2234-943X ABSTRACT=
Purpose

We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow accurate assessment of the true potential of LLMs. This paper proposes evaluating LLMs on a highly specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities while also serving as a valuable benchmark of LLMs.

Methods

We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by prompting it to explain its reasoning first and then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (replacing the correct answer with “None of the above choices is the correct answer.”). A majority vote analysis was used to approximate how well each group could score when working together.
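
For illustration only, a minimal Python sketch of the two prompting conditions described above, assuming a generic multiple-choice format; the helper names (build_prompt, substitute_correct_answer, query_llm) are hypothetical stand-ins and do not correspond to the authors' actual code.

    # Hypothetical sketch; query_llm() stands in for whichever chat API is used.
    def build_prompt(question, choices, explain_first=False):
        """Format a multiple-choice radiation oncology physics question."""
        lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
        instruction = ("Explain your reasoning step by step, then state the answer letter."
                       if explain_first else "State the answer letter.")
        return f"{question}\n{lettered}\n{instruction}"

    def substitute_correct_answer(choices, correct_index):
        """Probe deductive reasoning: the correct choice is removed, so the model
        should select 'None of the above choices is the correct answer.'"""
        modified = list(choices)
        modified[correct_index] = "None of the above choices is the correct answer."
        return modified

    # Example usage:
    # prompt = build_prompt(q["text"], q["choices"], explain_first=True)
    # reply = query_llm(prompt)  # placeholder for the chat API call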

Results

ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in their answer choices across repeated trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its high answer consistency meant that scoring by majority vote across trials yielded no further improvement. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.
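
For illustration only, a brief Python sketch of how majority-vote scoring across repeated trials (or across a group of human test-takers) might be computed; the data layout and function name are illustrative assumptions, not the authors' implementation.

    from collections import Counter

    def majority_vote_score(answer_sets, answer_key):
        """answer_sets: one list of answer letters per trial (or per participant).
        Returns the fraction of questions the majority choice answers correctly."""
        correct = 0
        for i, key in enumerate(answer_key):
            votes = Counter(answers[i] for answers in answer_sets)
            majority_choice, _ = votes.most_common(1)[0]
            correct += (majority_choice == key)
        return correct / len(answer_key)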

Conclusion

This study suggests great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.