
Large Language Models for Generating Medical Examinations: Systematic Review

Overview
Journal BMC Med Educ
Publisher BioMed Central
Specialty Medical Education
Date 2024 Mar 30
PMID 38553693
Abstract

Background: Writing multiple-choice questions (MCQs) for medical exams is challenging, requiring extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.

Methods: The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the year range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool.

Results: Overall, eight studies published between April 2023 and October 2023 were included. Six studies used ChatGPT-3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models. One other study compared LLM-generated questions with those written by humans. All studies presented some faulty questions that were deemed inappropriate for medical exams, and some questions required additional modifications in order to qualify. Two studies were at high risk of bias.

Conclusions: LLMs can be used to write MCQs for medical examinations; however, their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

Citing Articles

Application of large language models in healthcare: A bibliometric analysis.

Zhang L, Zhao Q, Zhang D, Song M, Zhang Y, Wang X Digit Health. 2025; 11:20552076251324444.

PMID: 40035041 PMC: 11873863. DOI: 10.1177/20552076251324444.


Quality assurance and validity of AI-generated single best answer questions.

Ahmed A, Kerr E, O'Malley A BMC Med Educ. 2025; 25(1):300.

PMID: 40001164 PMC: 11854382. DOI: 10.1186/s12909-025-06881-w.


Education and Training Assessment and Artificial Intelligence: A Pragmatic Guide for Educators.

Newton P, Jones S Br J Biomed Sci. 2025; 81:14049.

PMID: 39973890 PMC: 11837776. DOI: 10.3389/bjbs.2024.14049.


AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination.

Law A, So J, Lui C, Choi Y, Cheung K, Kei-Ching Hung K BMC Med Educ. 2025; 25(1):208.

PMID: 39923067 PMC: 11806894. DOI: 10.1186/s12909-025-06796-6.


Which curriculum components do medical students find most helpful for evaluating AI outputs?

Waldock W, Lam G, Baptista A, Walls R, Sam A BMC Med Educ. 2025; 25(1):195.

PMID: 39915801 PMC: 11804085. DOI: 10.1186/s12909-025-06735-5.

