Performance of Artificial Intelligence Chatbots in Sleep Medicine Certification Board Exams: ChatGPT Versus Google Bard
Purpose: To conduct a comparative performance evaluation of GPT-3.5, GPT-4, and Google Bard on self-assessment questions at the level of the American Sleep Medicine Certification Board exam.
Methods: A total of 301 text-based, single-best-answer multiple-choice questions with four answer options each, spanning 10 categories, were included in the study and transcribed as inputs for GPT-3.5, GPT-4, and Google Bard. The first response generated by each model was selected and matched for accuracy against the gold-standard answer provided by the American Academy of Sleep Medicine for each question. Human sleep medicine specialists must score 80% or above to pass each exam category.
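The grading procedure described above can be sketched as a short script. This is a hypothetical illustration, not the authors' actual pipeline: each model's first response is matched against the gold-standard answer, per-category accuracy is computed, and categories are checked against the 80% pass mark. All function names and the example data are assumptions.

```python
# Hypothetical sketch of the grading procedure: match each model answer
# against the gold-standard answer, then compute per-category accuracy
# and compare it to the 80% pass mark.
from collections import defaultdict

PASS_MARK = 0.80  # pass threshold per exam category

def score_responses(items):
    """items: list of (category, model_answer, gold_answer) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, model_answer, gold_answer in items:
        total[category] += 1
        if model_answer == gold_answer:
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}

def passed_categories(scores):
    """Return the categories meeting or exceeding the pass mark."""
    return [c for c, s in scores.items() if s >= PASS_MARK]

# Made-up example data for illustration only:
items = [
    ("Insomnia", "B", "B"),
    ("Insomnia", "C", "C"),
    ("Insomnia", "A", "D"),
    ("Parasomnias", "D", "D"),
]
scores = score_responses(items)
```

With this toy data, "Insomnia" scores 2/3 (below the pass mark) while "Parasomnias" scores 1/1 (a pass).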
Results: GPT-4 achieved the pass mark of 80% or above in five of the 10 exam categories: the Normal Sleep and Variants Self-Assessment Exam (2021), Circadian Rhythm Sleep-Wake Disorders Self-Assessment Exam (2021), Insomnia Self-Assessment Exam (2022), Parasomnias Self-Assessment Exam (2022), and Sleep-Related Movements Self-Assessment Exam (2023). GPT-4 outperformed both comparators in every exam category and achieved a higher overall score of 68.1% than GPT-3.5 (46.8%) and Google Bard (45.5%), a statistically significant difference (p < 0.001). There was no significant difference in overall score between GPT-3.5 and Google Bard.
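One plausible way to reproduce a significance result like the one reported is a two-proportion z-test on the overall scores. The correct-answer counts below (205/301 for GPT-4, 141/301 for GPT-3.5) are back-calculated from the reported percentages and are assumptions; the paper's actual statistical test is not specified in this abstract.

```python
# Hedged sketch: two-proportion z-test comparing overall accuracy of two
# models on the same 301-question set. Counts are approximations
# back-calculated from the reported percentages (68.1% and 46.8%).
import math

N = 301  # total questions in the study

def two_proportion_z(correct_a, correct_b, n=N):
    """Two-sided z-test for a difference in accuracy between two models."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)       # pooled proportion
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))  # pooled standard error
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal p-value
    return z, p_value

# GPT-4 (~205/301 = 68.1%) vs GPT-3.5 (~141/301 = 46.8%)
z, p = two_proportion_z(205, 141)
```

Under these assumed counts, the difference is highly significant (p well below 0.001), consistent with the reported result.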
Conclusions: Otolaryngologists and sleep medicine physicians have a crucial role to play, through agile and robust research, in ensuring that next-generation AI chatbots are built safely and responsibly.