
ORIGINAL RESEARCH article

Front. Med., 19 September 2023
Sec. Healthcare Professions Education

Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment

U Hin Lai1,2*, Keng Sam Wu1,3*, Ting-Yu Hsu2,3 and Jessie Kai Ching Kan2,4
  • 1Sandwell and West Birmingham NHS Trust, West Bromwich, United Kingdom
  • 2Aston Medical School, Birmingham, United Kingdom
  • 3University Hospitals Birmingham NHS Trust, Birmingham, United Kingdom
  • 4Worcestershire Acute Hospitals NHS Trust, Worcester, United Kingdom

Introduction: Recent developments in artificial intelligence large language models (LLMs), such as ChatGPT, have allowed for the understanding and generation of human-like text. Studies have found that LLMs perform well in various examinations, including law, business and medicine. This study aims to evaluate the performance of ChatGPT in the United Kingdom Medical Licensing Assessment (UKMLA).

Methods: Two publicly available UKMLA papers consisting of 200 single-best-answer (SBA) questions were screened. Nine SBAs were omitted as they contained images that were not suitable for input. Each question was assigned a specialty based on the UKMLA content map published by the General Medical Council. A total of 191 SBAs were input into ChatGPT-4 across three attempts over the course of 3 weeks (once per week).

Results: ChatGPT scored 74.9% (143/191), 78.0% (149/191) and 75.6% (145/191) on the three attempts, respectively. The average across all three attempts was 76.3% (437/573), with a 95% confidence interval of 74.46–78.08%. ChatGPT answered 129 SBAs correctly and 32 SBAs incorrectly on all three attempts. Across the three attempts, ChatGPT performed well in mental health (8/9 SBAs), cancer (11/14 SBAs) and cardiovascular (10/13 SBAs), and did not perform well in clinical haematology (3/7 SBAs), endocrine and metabolic (2/5 SBAs) and gastrointestinal including liver (3/10 SBAs). Regarding response consistency, ChatGPT provided consistently correct answers in 67.5% (129/191) of SBAs, consistently incorrect answers in 12.6% (24/191) and inconsistent responses in 19.9% (38/191) of SBAs.

Discussion and conclusion: This study suggests ChatGPT performs well in the UKMLA, and that its performance may vary by specialty. The ability of LLMs to correctly answer SBAs suggests that they could be utilised as a supplementary learning tool in medical education with appropriate medical educator supervision.

Introduction

Artificial intelligence (AI) can be defined as “human intelligence exhibited by machine” or, more sophisticatedly in the field of AI research, “the study of intelligent agents, which are devices that perceive their environment and take actions to maximize their chance of success at some goal” (1). The initial idea of AI can be traced back to 1950 when Turing (2) proposed the question “Can machines think?” and his concept of the Imitation Game. The Turing test is a “method of inquiry in AI for determining whether or not a computer is capable of thinking like a human being” (3). Turing’s test has become an essential concept in AI philosophy and has been widely discussed and debated over the past several decades (4).

Seventy years after the Imitation Game was first proposed, advances in computer chips and microprocessors and the development of deep neural networks (DNNs) have enabled computers to exhibit characteristics of experiential learning that resemble human intelligence (1, 5), in that computers demonstrate the capacity to learn by refining their analysis through computational algorithms. AI is widespread today, particularly in business and finance, supply chain management, and cybersecurity (6, 7). Its increasing applications in healthcare have also been evident, for example, histopathological and radiological imaging analysis (8, 9), AI-assisted endoscopy (10), and risk stratification of patients with carotid artery disease (9).

Given the growing phenomenon of AI in healthcare, medical educators should prepare for its potential impact on medical education to maximize learners’ learning opportunities. There have been suggestions that developing AI-driven intelligent tutoring systems can identify gaps in learners’ knowledge; facilitate learning with constructivist approaches; provide thoughtful feedback to students and teachers; and perform time-consuming tasks efficiently, such as recording attendance and grading assessments (11). Nevertheless, given that medicine is a high-stakes profession in which training requires high-level accountability and transparency, developing an AI system with accurate and reliable medical knowledge is paramount.

ChatGPT is an AI chatbot developed based on large language models (LLMs). These machine-learning systems can learn autonomously after training on large quantities of unlabeled text, producing sophisticated and seemingly intelligent writing (12). Since the launch of ChatGPT in November 2022, multiple studies (13–15) have demonstrated its capability of displaying comprehensible reasoning in professional examinations across different disciplines, including the United States Medical Licensing Examination (USMLE), the primary fellowship examination of the Royal College of Anaesthetists and the New York State Bar Examination. There have been suggestions that ChatGPT can be used to improve the quality of medical education in several dimensions: automated scoring, teaching assistance, personalized learning, quick access to information, and generating case scenarios (16). Nevertheless, to our knowledge, there has yet to be a study assessing the performance of ChatGPT in UK undergraduate medical examinations.

With the increasing use of tablet computing in medical education, medical students have found accessing broad medical knowledge easier in classrooms and clinical settings (17). The advent of ChatGPT, alongside the extensive use of technology, has allowed the synthesis of large bodies of medical knowledge to produce a personalized response to a question (18). Differential diagnosis generation is one particular use for junior medical students, helping to highlight “red flag” conditions that cannot be missed. It has also been proposed that ChatGPT can be used to prepare medical students for practical examinations that assess clinical skills, also known as Objective Structured Clinical Examinations (OSCEs) (19). The increasing use of LLMs highlights the role of AI in medical education and the need for further assessment of the accuracy and consistency of this technology.

The United Kingdom Medical Licensing Assessment (UKMLA) is a newly derived national undergraduate medical exit examination. From 2024 onwards, all final-year medical students in the United Kingdom (UK) must pass the UKMLA before graduation (20). The UKMLA is divided into the Applied Knowledge Test (AKT) and the Clinical and Professional Skills Assessment (CPSA). The AKT is a multiple-choice exam consisting of 200 single-best-answer (SBA) questions. Candidates must choose the best answer out of the five, and there is no negative marking. The standard of the AKT is set by a national panel of experts from medical schools across the UK (21). Recently, a study reported that ChatGPT correctly answered 140/191 SBAs (73.3%) on the UKMLA (22).

In this study, we aim to evaluate the performance of ChatGPT on the AKT practice papers published by the Medical Schools Council and to assess whether the answers generated by ChatGPT are consistent (23). This can serve as an indicator of the clinical knowledge that ChatGPT currently possesses and whether it is a reliable and accountable AI system to facilitate human learning in medical education.

Methods

Data collection

Publicly available UKMLA practice materials were utilized for this study (23). These include two AKT practice papers last updated in February 2023. Each practice paper consisted of 100 SBA questions with five choices. Two hundred SBAs were screened for suitability. Nine SBAs were excluded from the study as they included images that could not be input into ChatGPT.

In total, 191 SBAs were input individually into ChatGPT-4 May 24 Version 3.5 (OpenAI). ChatGPT answered each question with one of the five choices (A, B, C, D or E) alongside an explanation of why it considered that choice correct. Each attempt was treated as a new attempt; therefore, a “New Chat” was used for each attempt. ChatGPT was given three attempts at the complete set of SBAs over 3 weeks (once per week). Three attempts were used to establish whether ChatGPT would generate consistent responses to the same question on each attempt.
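
The SBAs in this study were entered manually through the ChatGPT web interface. Purely as an illustration for readers who wish to automate a similar workflow, the Python sketch below submits each SBA as a fresh, stateless request via the OpenAI Python client; the file name, column layout, prompt wording and model identifier are assumptions made for this example and were not part of the study.

import csv

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_single_best_answer(stem, options):
    """Submit one SBA with no prior conversation history (analogous to a "New Chat")."""
    prompt = (
        "Answer this single-best-answer question with one letter (A-E), "
        "then briefly explain your reasoning.\n\n" + stem + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in options.items())
    )
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier for illustration only
        messages=[{"role": "user", "content": prompt}],  # fresh conversation per question
    )
    return response.choices[0].message.content

# Hypothetical input file with columns: id, stem, A, B, C, D, E
with open("ukmla_sbas.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["id"], ask_single_best_answer(row["stem"], {k: row[k] for k in "ABCDE"}))

Because each request carries no earlier messages, answers cannot be influenced by previous exchanges.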

Data analysis

The answers generated by ChatGPT on each attempt at the UKMLA practice materials were recorded in a Microsoft Excel spreadsheet. The answers were then compared with the answer key provided by the Medical Schools Council to determine whether ChatGPT provided the correct response to each question. Furthermore, each SBA input into ChatGPT was assigned a specialty based on the UKMLA Content Map published by the General Medical Council (GMC) in September 2019 (24), allowing the evaluation of ChatGPT’s performance across different specialties. In addition, the variability in ChatGPT’s responses to the same question across attempts was recorded, providing a measure of the consistency of the responses generated by ChatGPT. Statistical analysis of the data was performed with Microsoft Excel formulas.
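
The statistical analysis itself was carried out with Microsoft Excel formulas, as described above. As an illustrative equivalent only, the Python sketch below computes pooled accuracy, a normal-approximation 95% confidence interval (the interval method actually used is not stated here), per-specialty scores and the three consistency categories; the record structure and the two example entries are invented.

import math
from collections import defaultdict

# Invented example records; in practice there would be one entry per SBA (191 in this study),
# with the specialty assigned from the GMC content map and the three recorded responses.
records = [
    {"specialty": "Cardiovascular", "key": "B", "answers": ["B", "B", "B"]},
    {"specialty": "Clinical haematology", "key": "D", "answers": ["A", "D", "C"]},
]

# Pooled accuracy across all attempts.
total = sum(len(r["answers"]) for r in records)
correct = sum(a == r["key"] for r in records for a in r["answers"])
p = correct / total

# Normal-approximation 95% confidence interval for the pooled proportion.
se = math.sqrt(p * (1 - p) / total)
print(f"Accuracy {p:.1%}, 95% CI {p - 1.96 * se:.2%} to {p + 1.96 * se:.2%}")

# Accuracy per specialty.
by_specialty = defaultdict(lambda: [0, 0])  # specialty -> [correct, attempted]
for r in records:
    for a in r["answers"]:
        by_specialty[r["specialty"]][0] += a == r["key"]
        by_specialty[r["specialty"]][1] += 1

# Consistency: identical correct answers, identical incorrect answers, or differing answers.
def classify(record):
    answers = record["answers"]
    if len(set(answers)) > 1:
        return "inconsistent"
    return "consistently correct" if answers[0] == record["key"] else "consistently incorrect"

for r in records:
    print(r["specialty"], classify(r))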

Results

Performance of ChatGPT on the UKMLA

One hundred ninety-one SBAs were input into ChatGPT. ChatGPT scored 74.9% (143/191), 78.0% (149/191), and 75.6% (145/191) on the three attempts, respectively (Figure 1). Averaging the three attempts, ChatGPT scored 76.3% (437/573), with a 95% confidence interval of 74.46–78.08%.


Figure 1. Results of ChatGPT on United Kingdom Medical Licensing Assessment (UKMLA) single-best-answer questions (SBAs) (n = 191) on each of three attempts.

Breakdown analysis of the AKT papers showed that the most tested specialties were obstetrics and gynecology (16 SBAs), acute and emergency (15 SBAs), cancer (14 SBAs), cardiovascular (13 SBAs), musculoskeletal (12 SBAs), infection (12 SBAs) and child health (12 SBAs) (Table 1). Of note, two specialties were tested with only one SBA each, both of which ChatGPT answered correctly. In terms of the proportion of SBAs answered correctly across different specialties, ChatGPT performed best in mental health (88.9%), cancer (78.6%), and cardiovascular (77.0%). Its performance in clinical hematology (28.6%), perioperative medicine and anesthesia (33.3%), and endocrine and metabolic (40.0%) was the worst in our study.


Table 1. Performance of ChatGPT in relation to specialties tested in the United Kingdom Medical Licensing Assessment (UKMLA).

An analysis of the SBAs that ChatGPT answered incorrectly on all three attempts was also conducted. Clinical hematology (3/7), endocrine and metabolic (2/5), and gastrointestinal including liver (3/10) were the specialties in which ChatGPT had the highest tendency to consistently provide an incorrect response, when accounting for the number of SBAs within each specialty (Table 2).


Table 2. Results where ChatGPT generated the incorrect answer on all three attempts.

Consistency of ChatGPT on UKMLA

ChatGPT consistently answered 67.5% (129/191) of SBAs correctly and 12.6% (24/191) of SBAs incorrectly on all three attempts. Of note, ChatGPT provided different answers to the same question, with identical phrasing, in 19.9% (38/191) of SBAs.

Amongst the different specialties, accounting for the number of SBAs within each, ChatGPT provided inconsistent responses most often in clinical hematology (4/7), genetics and genomes (1/2), ophthalmology (2/5) and medicine of older adults (4/10). In contrast, ChatGPT provided consistent responses in General Practice and Primary Healthcare (4/4) and Social and Population Health (6/6) (Table 3).


Table 3. Percentage of response consistency of ChatGPT in different specialties adjusted to the number of SBAs in different specialties.

Discussion

Discussion on the performance of ChatGPT

Our study suggests ChatGPT performed reasonably well on the UKMLA. As the UKMLA is relatively new and will only be fully implemented in early 2024, there is not yet publicly available data on pass marks set by subject-matter experts through the modified Angoff method with which to determine whether ChatGPT “passed” the examination (23). It should be noted that postgraduate medical examinations are well established and have statistics relating to pass marks set by standard-setting committees.

Analysing ChatGPT’s performance by specialty on the UKMLA demonstrated that it did not perform well on gastrointestinal including liver, scoring only 30% (3/10) of SBAs correctly. Interestingly, ChatGPT also scored poorly on the American College of Gastroenterology self-assessment, suggesting a possible correlation between specialty and ChatGPT performance (25). In our study, ChatGPT performed less well in the fields of hematology, perioperative medicine, and anesthetics. To date, further studies on ChatGPT’s performance in these postgraduate specialties have not been conducted. ChatGPT also performed below standard in acute and emergency medicine, answering only 53.3% (8/15) of the questions correctly. Given that recognition and management of life-threatening emergencies are essential competencies of final-year medical students, the underperformance in this area indicates that ChatGPT may not be an appropriate learning tool for undergraduate learners in this specialty, or, at least, that it should not be used as a main learning resource.

A study by Jang and Lukasiewicz (26) found that generated answers are inconsistent when input text is paraphrased. Currently, there are no known studies on the consistency of answers generated by ChatGPT in a medical education context. Notably, Al-Shakarachi and Haq (22) also conducted a study of ChatGPT’s performance on UKMLA practice papers and found that it achieved 100% accuracy in emergency medicine, palliative care, and otolaryngology. Our study found that ChatGPT performed differently in these specialties, with 8/15 (53.3%), 3/5 (60.0%), and 2/3 (66.7%) SBAs answered correctly, respectively. This highlights an inconsistency in the performance of ChatGPT that could affect its ability to act as a tool in medical education.

To understand the inconsistency of the answers provided by ChatGPT, we need to discuss how an LLM functions. As an LLM, the fundamental principle of how ChatGPT operates is to predict the next most reasonable word from the existing text to create an adequate response, utilizing the vast amount of text and information that was fed into it during training (27). From this, the probabilities of candidate words are compared, informing ChatGPT which words are most likely to formulate a satisfactory response. However, if the highest-probability word were always used, the response generated would be repetitive within itself; to avoid creating a paragraph that frequently repeats the same sentence, randomization is added to ChatGPT’s response. This gives ChatGPT a strong ability to generate text from scratch, creating fascinating poems and stories that do not exist in its training data, but it consequently leads to the inconsistency and inaccuracy demonstrated in our study. As noted by Homolak (28), ChatGPT tends to confabulate references to create a sense of plausibility. Moreover, its training data only extends to September 2021, which implies that some of its information may be outdated. Nevertheless, ChatGPT is still early in its developmental stage, and in our study it performed reasonably well, scoring 76.3% on the UKMLA. The role of ChatGPT in medical education is still widely debated, particularly given the high-stakes nature of the medical profession. However, with the unprecedented speed of advancement in AI, further improvement in the accuracy and reliability of ChatGPT is conceivable. This makes LLM software, such as ChatGPT, a potentially valuable tool in both medical education and clinical practice in the future.
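
To make the role of this added randomization concrete, the toy sketch below samples the next word from a softmax distribution controlled by a temperature parameter. The candidate words and scores are invented for illustration only; real LLMs operate over subword tokens and far larger vocabularies, and this is not a description of ChatGPT’s actual implementation.

import math
import random

candidates = {"stethoscope": 2.1, "syringe": 1.4, "banana": -0.5}  # invented raw scores

def sample_next_word(scores, temperature):
    """Softmax sampling: low temperature almost always picks the top-scoring word,
    while higher temperature flattens the distribution and admits more randomness."""
    weights = {w: math.exp(s / temperature) for w, s in scores.items()}
    total = sum(weights.values())
    return random.choices(list(weights), [v / total for v in weights.values()])[0]

# Near-deterministic vs. noticeably random behaviour on repeated runs: this mirrors
# why the same prompt can receive different answers on different attempts.
print([sample_next_word(candidates, 0.2) for _ in range(5)])
print([sample_next_word(candidates, 1.5) for _ in range(5)])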

Comparative analysis of ChatGPT on examinations

The AKT for Membership of the Royal College of General Practitioners in the United Kingdom consists of 200 SBAs sat over 3 h and 10 min, akin to the UKMLA (29). On average, the pass mark was set at around 70% (141/200) between April 2021 and April 2023 (30–36). Our average result of 76.3% (437/573) exceeds this benchmark, suggesting that ChatGPT could pass the final-year medical school examination. ChatGPT has been studied on various postgraduate medical examinations such as the Fellowship of the Royal College of Ophthalmologists (FRCOphth), the American Radiology Board examination, the Chinese National Medical Licensing Examination, the Taiwanese Pharmacist Licensing Examination and the American College of Gastroenterology self-assessment tests (25, 37–40). ChatGPT fared well on both the FRCOphth and the American Radiology Board examination. It was unable to pass the Chinese National Medical Licensing Examination and the Taiwanese Pharmacist Licensing Examination. Additionally, it did not pass the American College of Gastroenterology self-assessment, scoring 62.4% against a pass mark of 70%. It should be noted that these studies also used publicly available question banks where the pass mark was not standard-set through the traditional modified Angoff method.

On medical school entrance examinations, ChatGPT has performed accurately on the National Eligibility cum Entrance Test (NEET) in India (41). ChatGPT-4 scored 72.5%, 44.4%, and 50.5% in physics, chemistry, and biology on the NEET, respectively. The authors of that study suggest that a potential application of LLMs could be to act as a supplementary tool to aid students in preparing for pre-medical examinations and beyond. Additionally, ChatGPT has been found to fare well in non-medical examinations such as the Test of Mathematics for University Admission (TMUA), the Law National Aptitude Test (LNAT), and the Thinking Skills Assessment (TSA) (42), further highlighting the promising potential of LLMs as a supplementary learning tool. However, another study that evaluated the performance of ChatGPT on the Japanese National Medical Practitioners Qualifying Examination (NMPQE) raises some concern about using LLMs as a supplementary learning tool (43). The NMPQE consists of 400 multiple-choice questions taken by medical students in Japan in their final year of medical school. Interestingly, the exam contains 25 or more “prohibited choices.” These prohibited choices are responses that are strictly avoided in medical practice in Japan, such as euthanasia, which is illegal in Japan. A candidate for the NMPQE would fail the examination if they selected more than three prohibited choices. These prohibited choices range from ethically or legally incorrect responses to clinically incorrect responses, such as offering oral hypoglycemics to pregnant women. The authors of the study found that ChatGPT-4 tends to select prohibited choices, such as offering euthanasia, more often than human candidates. This highlights a potential limitation on the use of LLMs as a learning tool for students in healthcare.

Applications

Despite its suboptimal performance in certain specialties in the UKMLA, the overall high accuracy of ChatGPT on undergraduate medical examinations suggests that it could be utilized, with caution, as a supplementary tool to facilitate learning by UK medical educators and learners in both undergraduate and postgraduate medical education settings. It has been shown that ChatGPT can generate clinical scenarios that could be applied in medical education (16, 44), although whether these clinical scenarios are accurate and of sufficient quality has not been studied. Nevertheless, review of generated scenarios by clinicians before their use in medical education within a given school would help to ensure accuracy and quality. Students in medical school could use ChatGPT as an individualized “personal tutor”: ChatGPT can explain medical concepts, generate questions and give feedback to students. Another potential application of ChatGPT in undergraduate medical education is within problem-based learning (PBL).

PBL is widely adopted in medical schools across the world. Typically, it centers on a clinical case, or “problem,” which a group of medical students discusses and solves under the supervision of a clinical tutor (45). Medical students have been shown to build their knowledge base and improve higher-level thinking using PBL (46). Disadvantages of PBL include significant time investment by both students and staff, financial costs, and, depending on the university, a lack of suitable staff to undertake the role of clinical tutor. ChatGPT could play a role in these PBL sessions to address these disadvantages. Based on our study, ChatGPT answering around 25% of SBAs incorrectly on the UKMLA suggests that there are limitations to its use and that clinical tutors remain vital in promoting accuracy, quality, and higher-level thinking. It should also be noted that ChatGPT in its current form does not access real-time information from the internet (18); therefore, up-to-date medical practice cannot be reflected in its use for medical education.

Limitations

A significant limitation of our study was the small sample size of 191 SBAs from the UKMLA practice materials, particularly in certain specialties such as medical ethics and law and surgery. Although our study demonstrated ChatGPT performance of 100% in these specialties, there was only one question relevant to ethics and law and to surgery, respectively, in the practice materials. These results may not be reliable given the small sample size. Moreover, laws, ethics, and clinical guidance differ by country, so the performance of ChatGPT in certain specialties may not be extrapolatable to professional medical examinations in other countries. Furthermore, we cannot ascertain whether SBAs in the publicly available UKMLA practice materials have undergone a similar standard-setting process to the official UKMLA examination. The practice materials simulate the style of SBAs in the official examination, but they may not be an accurate representation in terms of either the level of difficulty or the proportion of SBAs across different specialties. Further studies of ChatGPT’s performance on the official UKMLA examination with a larger sample of SBAs will be needed to address this. Given that SBAs in the official UKMLA will undergo a standard-setting process, such as the modified Angoff method, each SBA can be assigned an index of difficulty; it would be interesting for future studies to examine ChatGPT’s performance on SBAs across different ranges of difficulty. Finally, the current version of ChatGPT only allows for text-based input, so nine image-based SBAs were excluded from our study. This limitation may have affected the measured performance of ChatGPT.

Conclusion

Our study has demonstrated the ability of ChatGPT to answer SBAs on the UKMLA. It also noted a potential association between specialty and the performance of ChatGPT. We also noted the possibility of utilizing LLMs as a supplementary learning tool in medical education under the supervision of appropriately trained medical educators. Further avenues of research involving standard-set UKMLA papers, the specialty-dependent performance of ChatGPT, and the use of LLMs in medical education could be pursued in the future.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding authors.

Ethics statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent from the participants was not required to participate in this study in accordance with the national legislation and the institutional requirements.

Author contributions

UL conceived this paper. UL and KW prepared the manuscript. T-YH prepared figures used in the manuscript. Data collection and analysis were done by UL, KW, T-YH, and JK. All authors contributed to the article and approved the submitted version.

Acknowledgments

This study on the UKMLA utilized ChatGPT-4 May 24 Version 3.5 by OpenAI. Available at: https://openai.com/chatgpt (Accessed July 5, 2023). This article was written by the authors and no forms of artificial intelligence were used.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmed.2023.1240915/full#supplementary-material

References

1. Bini, SA . Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplast. (2018) 33:2358–61. doi: 10.1016/j.arth.2018.02.067


2. Turing, AM . Computing machinery and intelligence. Mind. (1950) LIX:433–60. doi: 10.1093/mind/LIX.236.433


3. St George, B , and Gillis, AS . Turing test. Available at: https://www.techtarget.com/searchenterpriseai/definition/Turing-test (Accessed June 14, 2023)


4. Saygin, AP , Cicekli, I , and Akman, V . Turing test: 50 years later. Mind Mach. (2000) 10:463–518. doi: 10.1023/A:1011288000451


5. Helm, JM , Swiergosz, AM , Haeberie, HS , Karnuta, JM , Schaffer, JL , Krebs, VE, et al. Machine learning and artificial intelligence: definitions, applications, and future directions. Curr Rev Musculoskelet Med. (2020) 13:69–76. doi: 10.1007/s12178-020-09600-8


6. Collins, C , Dennehy, D , Conboy, K , and Mikalef, P . Artificial intelligence in information systems research: a systematic literature review and research agenda. Int J Inf Manag. (2021) 60:102383–17. doi: 10.1016/j.ijinfomgt.2021.102383


7. Sarker, IH . AI-based modeling: techniques, applications and research issues towards automation, intelligent and smart systems. SN Comput Sci. (2022) 3:158. doi: 10.1007/s42979-022-01043-x


8. Stifanic, J , Stifanic, D , Zulijani, A , and Car, Z . Application of AI in histopathological image analysis In: N Filipovic , editor. Applied artificial intelligence: medicine, biology, chemistry, financial, games, engineering. Cham: Springer (2023). 121–31.


9. Blagojevic, A , and Geroski, T . A review of the application of artificial intelligence in medicine: from data to personalised models In: N Filipovic , editor. Applied artificial intelligence: medicine, biology, chemistry, financial, games, engineering. Cham: Springer (2023). 271–306.


10. Nakase, H , Hirano, T , Wagatsuma, K , Ichimiya, T , Yamakawa, T , Yokoyama, Y, et al. Artificial intelligence-assisted endoscopy changes the definition of mucosal healing in ulcerative colitis. Dig Endosc. (2021) 33:903–11. doi: 10.1111/den.13825


11. Masters, K . Artificial intelligence in medical education. Med Teach. (2019) 41:976–80. doi: 10.1080/0142159X.2019.1595557


12. van Dis, EAM , Bollen, J , Zuidema, W , van Rooij, R , and Bockting, CL . ChatGPT: five priorities for research. Nature. (2023) 614:224–6. doi: 10.1038/d41586-023-00288-7


13. Kung, TH , Cheatham, M , Medenilla, A , Sillos, C , De Leon, L , Elepano, C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS Digit Health. (2023) 2:1–12. doi: 10.1371/journal.pdig.0000198


14. Bommarito, M , and Katz, DM . GPT takes the Bar Exam. arXiv (2022). Available at: https://doi.org/10.48550/arXiv.2212.14402. [Epub ahead of preprint]. (Accessed June 14, 2023)


15. Aldridge, MJ , and Penders, R . Artificial intelligence and anaesthesia examinations: exploring ChatGPT as a prelude to the future. Br J Anaesth. (2023) 131:e36–7. doi: 10.1016/j.bja.2023.04.033


16. Kan, RA , Jawaid, M , Khan, QR , and Sajjad, M . ChatGPT—reshaping medical education and clinical management. Pak J Med Sci. (2023) 39:605–7. doi: 10.12669/pjms.39.2.7653


17. Leydon, GB , and Schwartz, ML . The use of mobile devices to enhance engagement and integration with curricular content. Yale J Bio Med. (2020) 93:453–60.


18. Tsang, R . Practical applications of ChatGPT in undergraduate medical education. J Med Educ Curric Dev. (2023) 10:238212052311784. doi: 10.1177/23821205231178449


19. Soong, TK , and Ho, C-M . Artificial intelligence in medical OSCES: reflections and future developments. Adv Med Educ Pract. (2021) 12:167–73. doi: 10.2147/AMEP.S287926


20. General Medical Council . Medical Licensing Assessment. Available at: https://www.gmc-uk.org/education/medical-licensing-assessment (Accessed June 14, 2023)


21. Medical Schools Council . The Applied Knowledge Test—FAQs for UK medical students. (2023). Available at: https://www.medschools.ac.uk/media/3015/the-akt-faqs-for-uk-medical-students-january-2023.pdf (Accessed June 14, 2023)


22. Al-Shakarachi, NJ , and Haq, IU . ChatGPT performance in the UK medical licensing assessment: how to train the next generation? Mayo Clinic Proc. (2023) 1:309–10. doi: 10.1016/j.mcpdig.2023.06.004


23. Medical Schools Council . Medical Licensing Assessment—practice materials. (2023). Available at: https://www.medschools.ac.uk/studying-medicine/medical-licensing-assessment/practice-materials (Accessed June 14, 2023)


24. The General Medical Council . MLA content map. Available at: https://www.gmc-uk.org/-/media/documents/mla-content-map-_pdf-85707770.pdf (Accessed June 14, 2023)


25. Suchman, K , Garg, S , and Trindade, AJ . Chat Generative Pretrained Transformer fails the multiple-choice American College of Gastroenterology self-assessment test. Am J Gastroenterol. (2023). doi: 10.14309/ajg.0000000000002320


26. Jang, M , and Lukasiewicz, T . Consistency analysis of ChatGPT. arXiv. (2023). Available at: https://doi.org/10.48550/arXiv.2303.06273. [Epub ahead of preprint]


27. Wolfram, S . What is ChatGPT doing … and why does it work?. Available at: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/


28. Homolak, J . Opportunities and risks of ChatGPT in medicine, science and academic publishing: a modern promethean dilemma. Croat Med J. (2023) 64:1–3. doi: 10.3325/cmj.2023.64.1


29. Royal College of General Practitioners . MRCGP: Applied Knowledge Test (AKT). Available at: https://www.rcgp.org.uk/mrcgp-exams/applied-knowledge-test (Accessed June 14, 2023)


30. Royal College of General Practitioners . Feedback on the MRCGP Applied Knowledge Test (AKT) April 2021, AKT 42. Available at: https://www.rcgp.org.uk/getmedia/0c476e63-ccc7-4738-a2d7-7b75978564ec/akt-feedback-report-apr21.pdf (Accessed June 14, 2023)


31. Royal College of General Practitioners . Feedback on the MRCGP Applied Knowledge Test (AKT) October 2021, AKT 43. Available at: https://www.rcgp.org.uk/getmedia/0f74e174-62b9-45f3-a060-306eaeba6563/AKT-feedback-report-October-2021.pdf (Accessed June 14, 2023)


32. Royal College of General Practitioners . Feedback on the MRCGP Applied Knowledge Test (AKT) January 2022, AKT 44. Available at: https://www.rcgp.org.uk/getmedia/568d3ca9-2c3b-4be5-bf7b-26fca49cd171/AKT-44-feedback-draft_FINAL.pdf (Accessed June 14, 2023)


33. Royal College of General Practitioners . Feedback on the MRCGP Applied Knowledge Test (AKT) April 2022, AKT 45. Available at: https://www.rcgp.org.uk/getmedia/4194f100-08f7-4310-ab95-cdfd4c7c03e6/April-2022-AKT-feedback-report.pdf (Accessed June 14, 2023)


34. Royal College of General Practitioners . Feedback on the MRCGP Applied Knowledge Test (AKT) October 2022, AKT 46. Available at: https://www.rcgp.org.uk/getmedia/deeef4b0-38fc-4ade-b4a4-d5948b32074d/October-2022-AKT-Feedback-Report.pdf (Accessed June 14, 2023)


35. Royal College of General Practitioners . Feedback on the MRCGP Applied Knowledge Test (AKT) January 2023, AKT 47. Available at: https://www.rcgp.org.uk/getmedia/5695c150-4460-4484-8fcb-259025cd05ea/january-2023-akt-feedback-report.pdf (Accessed June 14, 2023)


36. Royal College of General Practitioners . Feedback on the MRCGP Applied Knowledge Test (AKT) April 2023, AKT 48. Available at: https://www.rcgp.org.uk/getmedia/df2c3ea2-695c-4c6c-aa89-8ad37ad6a0d5/Feedback-on-the-MRCGP-Applied-Knowledge-Test-(AKT)-April-2023.pdf (Accessed June 14, 2023)


37. Raimondi, R , Tzoumas, N , Salisbury, T , Di Simplicio, S , and Romano, MR, North East Trainee Research in Ophthalmology Network (NETRiON). Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams. Eye. (2023). doi: 10.1038/s41433-023-02563-3


38. Bhayana, R , Krishna, S , and Bleakney, RR . Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. (2023) 307:1–8. doi: 10.1148/radiol.230582


39. Wang, YM , Shen, HW , and Chen, TJ . Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc. (2023) 86:653–8. doi: 10.1097/JCMA.0000000000000942


40. Wang, X , Gong, Z , Wang, G , Jia, J , Xu, Y , Zhao, J, et al. ChatGPT performs on the Chinese National Medical Licensing Examination. J Med Syst. (2023) 47:86. doi: 10.1007/s10916-023-01961-0


41. Farhat, F , Chaudry, BM , Nadeem, M , Sohail, SS , and Madsen, DO Evaluating AI models for the National Pre-Medical Exam in India: a head-to-head analysis of ChatGPT-3.5, GPT-4 and Bard. JMIR Preprints. Available at: https://preprints.jmir.org/preprint/51523. [Epub ahead of preprint]


42. Giannos, P , and Delardas, O . Performance of ChatGPT on UK standardized admission tests: insights from the BMAT, TMUA, LNAT and TSA examinations. JMIR Med Educ. (2023) 9:e47737. doi: 10.2196/47737


43. Kasai, J , Kasai, Y , Sakaguchi, K , Yamada, Y , and Radev, D Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations. arXiv. Available at: https://doi.org/10.48550/arXiv.2303.18027. [Epub ahead of preprint].


44. Eysenbach, G . The role of ChatGPT, generative language models, and artificial intelligence in medical education: a conversation with ChatGPT and a call for papers. JMIR Med Educ. (2023) 9:1–13. doi: 10.2196/46885


45. Jaffarey, NA . Problem based learning. J Pak Med Assoc. (2001) 51:1–3.


46. Abdelkarim, A , Schween, D , and Timothy, F . Advantages and disadvantages of problem-based learning from the professional perspective of medical and dental faculty. EC Dent Sci. (2018) 17:1–7.


Keywords: examination, ChatGPT, assessment, United Kingdom Medical Licensing Assessment, medical education, medicine, Medical Licensing Examination

Citation: Lai UH, Wu KS, Hsu T-Y and Kan JKC (2023) Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment. Front. Med. 10:1240915. doi: 10.3389/fmed.2023.1240915

Received: 15 June 2023; Accepted: 30 August 2023;
Published: 19 September 2023.

Edited by:

Dag Øivind Madsen, University of South-Eastern Norway (USN), Norway

Reviewed by:

M. Ahmed, Phcog.Net, India
Sandeep Reddy, Deakin University, Australia
Shahab Saquib Sohail, Jamia Hamdard University, India

Copyright © 2023 Lai, Wu, Hsu and Kan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: U Hin Lai, UHL@doctors.org.uk; u.lai1@nhs.net; Keng Sam Wu, kengsamwu@doctors.org.uk; Keng.sam-wu@nhs.net
