
Assessing ChatGPT's Orthopedic In-service Training Exam Performance and Applicability in the Field

Overview
Publisher: Biomed Central
Specialty: Orthopedics
Date: 2024 Jan 3
PMID: 38167093
Abstract

Background: ChatGPT has gained widespread attention for its ability to understand and provide human-like responses to inputs. However, few works have focused on its use in Orthopedics. This study assessed ChatGPT's performance on the Orthopedic In-Service Training Exam (OITE) and evaluated its decision-making process to determine whether adoption as a resource in the field is practical.

Methods: ChatGPT's performance on three OITE exams was evaluated by inputting their multiple-choice questions. Questions were classified by orthopedic subject area. Yearly OITE technical reports were used to gauge ChatGPT's scores against those of resident physicians. ChatGPT's rationales were compared with testmaker explanations using six categories denoting answer accuracy and logic consistency. Variables were analyzed with contingency tables and chi-squared analyses.

Results: Of 635 questions, 360 (56.7%) were usable as inputs. ChatGPT-3.5 scored 55.8%, 47.7%, and 54.0% on the 2020, 2021, and 2022 exams, respectively. Of 190 correct outputs, 179 (94.2%) provided consistent logic. Of 170 incorrect outputs, 133 (78.2%) provided inconsistent logic. Significant associations were found between tested topic and answer correctness (p = 0.011) and between tested topic and type of logic used (p < 0.001). Basic Science and Sports had adjusted residuals greater than 1.96, as did the combinations Basic Science with correct answer/no logic, Basic Science with incorrect answer/inconsistent logic, Sports with correct answer/no logic, and Sports with incorrect answer/inconsistent logic.
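
The analysis described above is a chi-squared test of independence on a topic-by-outcome contingency table, with cells flagged when their adjusted standardized residual exceeds 1.96 in magnitude. The sketch below illustrates that procedure in Python using hypothetical counts; scipy's chi2_contingency and the adjusted-residual formula shown are assumptions about how such an analysis could be run, not the authors' code.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical topic-by-outcome counts (rows: subject areas, columns: correct/incorrect).
# These numbers are illustrative only, not the study's data.
table = np.array([
    [25, 10],   # e.g., Basic Science
    [30, 15],   # e.g., Sports
    [20, 30],   # e.g., another subject area
])

chi2, p, dof, expected = chi2_contingency(table)

# Adjusted standardized residuals: (observed - expected) / sqrt(variance),
# with variance = expected * (1 - row_total/N) * (1 - column_total/N).
n = table.sum()
row_frac = table.sum(axis=1, keepdims=True) / n
col_frac = table.sum(axis=0, keepdims=True) / n
adj_resid = (table - expected) / np.sqrt(expected * (1 - row_frac) * (1 - col_frac))

print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
print("cells with |adjusted residual| > 1.96:")
print(np.abs(adj_resid) > 1.96)

Cells flagged True correspond to topic/outcome combinations occurring more (or less) often than expected under independence, which is how the paper identifies Basic Science and Sports as standout categories.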

Conclusions: Based on annual OITE technical reports for resident physicians, ChatGPT-3.5 performed at roughly the PGY-1 level. When answering correctly, it displayed reasoning congruent with that of the testmakers. When answering incorrectly, it exhibited some understanding of the correct answer. It performed comparatively well in Basic Science and Sports, likely owing to its ability to output rote facts. These findings suggest that, in its current form, ChatGPT lacks the fundamental capabilities to be a comprehensive tool in Orthopedic Surgery.

Level Of Evidence: II.

Citing Articles

Which current chatbot is more competent in urological theoretical knowledge? A comparative analysis by the European board of urology in-service assessment.

Sahin M, Dogan C, Topkac E, Seramet S, Tuncer F, Yazici C. World J Urol. 2025; 43(1):116.

PMID: 39932577; PMC: 11813998; DOI: 10.1007/s00345-025-05499-3.


Examining the Role of Large Language Models in Orthopedics: Systematic Review.

Zhang C, Liu S, Zhou X, Zhou S, Tian Y, Wang S. J Med Internet Res. 2024; 26:e59607.

PMID: 39546795; PMC: 11607553; DOI: 10.2196/59607.


ChatGPT provides acceptable responses to patient questions regarding common shoulder pathology.

Ghilzai U, Fiedler B, Ghali A, Singh A, Cass B, Young A. Shoulder Elbow. 2024; :17585732241283971.

PMID: 39545009; PMC: 11559869; DOI: 10.1177/17585732241283971.


Discrepancies in ChatGPT's Hip Fracture Recommendations in Older Adults for 2021 AAOS Evidence-Based Guidelines.

Kim H, Yoon P, Yoon J, Kim H, Choi Y, Park S. J Clin Med. 2024; 13(19).

PMID: 39408030; PMC: 11477870; DOI: 10.3390/jcm13195971.


ChatGPT-4 Surpasses Residents: A Study of Artificial Intelligence Competency in Plastic Surgery In-service Examinations and Its Advancements from ChatGPT-3.5.

Hubany S, Scala F, Hashemi K, Kapoor S, Fedorova J, Vaccaro M. Plast Reconstr Surg Glob Open. 2024; 12(9):e6136.

PMID: 39239234; PMC: 11377087; DOI: 10.1097/GOX.0000000000006136.

