The Performance of Artificial Intelligence Large Language Model-linked Chatbots in Surgical Decision-making for Gastroesophageal Reflux Disease

Overview

Journal Surg Endosc

Publisher Springer

Specialties Gastroenterology
General Surgery
Radiology

Date 2024 Apr 17

PMID 38630178

Authors

Bright Huo

Elisa Calabrese

Patricia Sylla

Sunjay Kumar

Romeo C Ignacio

Rodolfo Oviedo

Imran Hassan

Bethany J Slater

Andreas Kaiser

Danielle S Walsh

Wesley Vosburg

Abstract

Background: Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD).

Methods: Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16th, 2023, for recommendations regarding the surgical management of GERD. Chatbot accuracy was measured as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported with counts and percentages.

Results: For the surgical management of GERD in an adult patient, surgeons were given recommendations consistent with the SAGES guidelines for 5/7 (71.4%) key questions (KQs) by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity. Patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity. In a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. Patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity.
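The percentages in the Results can be rechecked from the reported counts. The following is a minimal sketch in Python, not the authors' analysis code: the counts are copied from the paragraph above, while the names `reported` and `percent` and the rounding to one decimal place are assumptions made for illustration.

```python
# Illustrative sketch only; not the authors' analysis code.
# (accurate, total) counts are copied from the Results paragraph.
# The names `reported` and `percent` are hypothetical.
reported = {
    ("adult", "surgeon"): {
        "ChatGPT-4": (5, 7), "Copilot": (3, 7),
        "Google Bard": (6, 7), "Perplexity": (3, 7),
    },
    ("adult", "patient"): {
        "ChatGPT-4": (3, 5), "Copilot": (2, 5),
        "Google Bard": (4, 5), "Perplexity": (1, 5),
    },
    ("pediatric", "surgeon"): {
        "ChatGPT-4": (2, 3), "Copilot": (3, 3),
        "Google Bard": (3, 3), "Perplexity": (2, 3),
    },
    ("pediatric", "patient"): {
        "ChatGPT-4": (2, 2), "Copilot": (2, 2),
        "Google Bard": (1, 2), "Perplexity": (1, 2),
    },
}


def percent(accurate: int, total: int) -> float:
    """Share of guideline-concordant responses, as a percentage (1 decimal)."""
    return round(100 * accurate / total, 1)


if __name__ == "__main__":
    # Print one line per scenario, audience, and chatbot.
    for (case, audience), by_bot in reported.items():
        for bot, (accurate, total) in by_bot.items():
            print(f"{case:9s} {audience:7s} {bot:11s} "
                  f"{accurate}/{total} ({percent(accurate, total):.1f}%)")
```

Running the script lists each chatbot's count and recomputed percentage for the four scenario and audience combinations, which can be compared line by line with the figures reported in the abstract.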

Conclusions: Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and pitfalls of LLMs when utilized for advice on surgical management of GERD. Additional training of LLMs using evidence-based health information is needed.

Citing Articles

Large Language Models for Chatbot Health Advice Studies: A Systematic Review.

Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen J, McKechnie T JAMA Netw Open. 2025; 8(2):e2457879.

PMID: 39903463 PMC: 11795331. DOI: 10.1001/jamanetworkopen.2024.57879.

A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity.

Reyhan A, Mutaf C, Uzun I, Yuksekyayla F J Clin Med. 2024; 13(21).

PMID: 39518652 PMC: 11547000. DOI: 10.3390/jcm13216512.

Assessing the Accuracy of Artificial Intelligence Models in Scoliosis Classification and Suggested Therapeutic Approaches.

Fabijan A, Zawadzka-Fabijan A, Fabijan R, Zakrzewski K, Nowoslawska E, Polis B J Clin Med. 2024; 13(14).

PMID: 39064053 PMC: 11278075. DOI: 10.3390/jcm13144013.
