Establishing Best Practices in Large Language Model Research: an Application to Repeat Prompting

Overview

Journal J Am Med Inform Assoc

Publisher Oxford University Press

Specialty Medical Informatics

Date 2024 Dec 10

PMID 39656836

Authors

Robert J Gallo

Michael Baiocchi

Thomas R Savage

Jonathan H Chen

Abstract

Objectives: We aimed to demonstrate the importance of establishing best practices in large language model research, using repeat prompting as an illustrative example.

Materials and Methods: Using data from a prior study investigating potential model bias in peer review of medical abstracts, we compared methods that ignore the correlation among model outputs produced by repeated prompting with a random effects method that accounts for this correlation.
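The contrast described in the methods can be illustrated with a minimal sketch on simulated data. This is not the authors' analysis: it swaps in a simpler cluster-mean summary in place of a random effects model, and every name and parameter below (simulate_clusters, the number of clusters, the standard deviations) is a hypothetical choice made for illustration. The standard deviations are chosen so that the implied intraclass correlation is about 0.67.

```python
import random
import statistics
from math import sqrt


def simulate_clusters(n_clusters, n_repeats, between_sd, within_sd, seed=0):
    """Simulate repeated scores: each cluster (for example, one prompt) has its
    own latent level, and each repeated run adds independent noise."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_clusters):
        level = rng.gauss(0.0, between_sd)
        clusters.append([level + rng.gauss(0.0, within_sd) for _ in range(n_repeats)])
    return clusters


def naive_se(clusters):
    """Standard error of the overall mean, treating every output as independent."""
    flat = [x for cluster in clusters for x in cluster]
    return statistics.stdev(flat) / sqrt(len(flat))


def cluster_mean_se(clusters):
    """Standard error of the overall mean, using one summary value per cluster.
    With equal cluster sizes this respects the within-cluster correlation."""
    means = [statistics.fmean(cluster) for cluster in clusters]
    return statistics.stdev(means) / sqrt(len(means))


# Hypothetical setup: 20 clusters, 100 repeated outputs each.
clusters = simulate_clusters(n_clusters=20, n_repeats=100, between_sd=1.0, within_sd=0.7)
print(f"naive SE:        {naive_se(clusters):.3f}")
print(f"cluster-mean SE: {cluster_mean_se(clusters):.3f}")
```

In this sketch the naive standard error is much smaller than the cluster-level one, because the repeated outputs within a cluster carry far less independent information than their count suggests.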

Results: Repeatedly prompting the model produced highly correlated outputs within groups, with an intraclass correlation coefficient of 0.69. Ignoring this inherent correlation in the data led to over 100-fold inflation of the effective sample size. After appropriately accounting for this issue, the prior study authors' results reverse from a small but highly significant finding to no evidence of model bias.
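The link between the intraclass correlation coefficient and the effective sample size can be sketched with the standard design-effect formula for equal-sized clusters, DE = 1 + (m - 1) * ICC, where m is the number of outputs per cluster. The abstract does not state how the authors computed the inflation, so the formula choice is an assumption, and the cluster count and number of repeats below are hypothetical values, not figures from the study. Only the ICC of 0.69 comes from the abstract.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_total: float, cluster_size: float, icc: float) -> float:
    """Number of independent observations that carry the same information."""
    return n_total / design_effect(cluster_size, icc)


icc = 0.69          # reported in the abstract
n_clusters = 10     # hypothetical number of groups
repeats = 200       # hypothetical number of repeated prompts per group
n_total = n_clusters * repeats

de = design_effect(repeats, icc)
ess = effective_sample_size(n_total, repeats, icc)
print(f"nominal n = {n_total}, design effect = {de:.2f}, effective n = {ess:.1f}")
```

Under these hypothetical values, the nominal sample size overstates the effective sample size by a factor equal to the design effect, which exceeds 100 once each group contains roughly 150 or more repeated outputs at this ICC.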

Discussion: The establishment of best practices for LLM research is urgently needed, as demonstrated in this case where accounting for repeat prompting in analyses was critical for accurate study conclusions.

Citing Articles

GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial.

Goh E, Gallo R, Strong E, Weng Y, Kerman H, Freed J. Nat Med. 2025.

PMID: 39910272 DOI: 10.1038/s41591-024-03456-y.

References

Savage T, Wang J, Gallo R, Boukil A, Patel V, Safavi-Naini S. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. J Am Med Inform Assoc. 2024; 32(1):139-149. PMC: 11648734. DOI: 10.1093/jamia/ocae254.

Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open. 2024; 7(10):e2440969. PMC: 11519755. DOI: 10.1001/jamanetworkopen.2024.40969.

von Wedel D, Schmitt R, Thiele M, Leuner R, Shay D, Redaelli S. Affiliation Bias in Peer Review of Abstracts by a Large Language Model. JAMA. 2023; 331(3):252-253. PMC: 10753437. DOI: 10.1001/jama.2023.24641.

Riley R, Cole T, Deeks J, Kirkham J, Morris J, Perera R. On the 12th Day of Christmas, a Statistician Sent to Me . . . BMJ. 2023; 379:e072883. PMC: 9844255. DOI: 10.1136/bmj-2022-072883.

Bland J, Altman D. Correlation, regression, and repeated data. BMJ. 1994; 308(6933):896. PMC: 2539813. DOI: 10.1136/bmj.308.6933.896.

Gallo R, Savage T, Chen J. Affiliation Bias in Peer Review of Abstracts. JAMA. 2024; 331(14):1234-1235. DOI: 10.1001/jama.2024.3520.

Savage T, Nayak A, Gallo R, Rangan E, Chen J. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med. 2024; 7(1):20. PMC: 10808088. DOI: 10.1038/s41746-024-01010-1.

Rutterford C, Copas A, Eldridge S. Methods for sample size determination in cluster randomized trials. Int J Epidemiol. 2015; 44(3):1051-67. PMC: 4521133. DOI: 10.1093/ije/dyv113.

Hager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, Knauer M. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024; 30(9):2613-2622. PMC: 11405275. DOI: 10.1038/s41591-024-03097-1.

Hemming K, Eldridge S, Forbes G, Weijer C, Taljaard M. How to design efficient cluster randomised trials. BMJ. 2017; 358:j3064. PMC: 5508848. DOI: 10.1136/bmj.j3064.

Perlis R, Fihn S. Evaluating the Application of Large Language Models in Clinical Research Contexts. JAMA Netw Open. 2023; 6(10):e2335924. DOI: 10.1001/jamanetworkopen.2023.35924.

Zack T, Lehman E, Suzgun M, Rodriguez J, Celi L, Gichoya J. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2023; 6(1):e12-e22. DOI: 10.1016/S2589-7500(23)00225-X.

von Wedel D, Shay D, Schaefer M. Affiliation Bias in Peer Review of Abstracts-Reply. JAMA. 2024; 331(14):1235-1236. DOI: 10.1001/jama.2024.3523.

Wang L, Chen X, Deng X, Wen H, You M, Liu W. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med. 2024; 7(1):41. PMC: 10879172. DOI: 10.1038/s41746-024-01029-4.