PMID: 39606472

Disagreements in Medical Ethics Question Answering Between Large Language Models and Physicians

Abstract

Importance: Medical ethics is inherently complex, shaped by a broad spectrum of opinions, experiences, and cultural perspectives. The integration of large language models (LLMs) into healthcare is recent and requires an understanding of how consistently they adhere to ethical standards.

Objective: To compare agreement rates in answering questions about ethically ambiguous situations among three frontier LLMs (GPT-4, Gemini-pro-1.5, and Llama-3-70b) and a multidisciplinary physician group.

Methods: In this cross-sectional study, three LLMs generated 1,248 medical ethics questions. These questions were derived from the principles outlined in the American College of Physicians Ethics Manual, and their topics spanned traditional, inclusive, interdisciplinary, and contemporary themes. Each model was then tasked with answering all generated questions. Twelve practicing physicians evaluated and responded to a randomly selected 10% subset of these questions. We compared agreement rates in question answering among the physicians, between the physicians and LLMs, and among the LLMs.
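The study does not publish its scoring code; as a minimal sketch, a pairwise agreement rate of the kind compared here can be computed as the mean fraction of identical answers across all pairs of responders. The responder names and answers below are purely illustrative, not study data:

```python
from itertools import combinations

def pairwise_agreement(answers):
    """Mean fraction of identical answers across all responder pairs.

    `answers` maps responder -> list of answers in the same question order.
    """
    pair_rates = []
    for a, b in combinations(answers, 2):
        matches = sum(x == y for x, y in zip(answers[a], answers[b]))
        pair_rates.append(matches / len(answers[a]))
    # Average the per-pair agreement fractions.
    return sum(pair_rates) / len(pair_rates)

# Hypothetical answers for illustration only (not the study's data):
llm_answers = {
    "GPT-4":          ["yes", "no",  "yes", "no"],
    "Gemini-pro-1.5": ["yes", "no",  "no",  "no"],
    "Llama-3-70b":    ["yes", "yes", "yes", "no"],
}
```

The same function applies unchanged to physician-physician or physician-LLM pairs, which is what makes the three reported agreement rates directly comparable.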

Results: The models generated a total of 3,744 answers. Although physicians rated the questions' complexity as moderate, scoring them between 2 and 3 on a 5-point scale, their agreement rate was only 55.9%. Agreement between physicians and LLMs was similarly low at 57.9%. In contrast, the agreement rate among LLMs was notably higher at 76.8% (p < 0.001), indicating greater consistency among LLM responses than in either physician-physician or physician-LLM agreement.

Conclusions: LLMs demonstrate higher agreement rates in ethically complex scenarios compared to physicians, suggesting their potential utility as consultants in ambiguous ethical situations. Future research should explore how LLMs can enhance consistency while adapting to the complexities of real-world ethical dilemmas.
