PMID: 37973919

A Study of Generative Large Language Model for Medical Research and Healthcare

Abstract

There is enormous enthusiasm, as well as concern, about applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which were not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text, including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical NLP. We apply GatorTronGPT to generate 20 billion words of synthetic text. NLP models trained on this synthetic text outperform models trained on real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT vs. 6.93 for human) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT vs. 6.97 for human), and physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
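The abstract reports p-values comparing physician ratings of synthetic and human-written notes. As an illustration only, a minimal sketch of how such a comparison could be computed is a two-sided permutation test on the difference in mean ratings; the rating arrays below are made up for demonstration and are not the study's data:

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the fraction of random relabelings whose absolute mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # randomly reassign ratings to the two groups
        perm_a = pooled[:len(a)]
        perm_b = pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return count / n_iter

# Hypothetical 1-9 readability ratings (illustrative only)
gatortron_scores = [7, 6, 7, 6, 7, 7, 6, 7, 6, 7]
human_scores = [7, 7, 6, 7, 7, 8, 7, 6, 7, 7]

p = permutation_test(gatortron_scores, human_scores)
print(f"p-value: {p:.3f}")
```

A large p-value here, as in the study's readability and relevance comparisons, would indicate no detectable difference between the two sets of ratings.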

Citing Articles

Low responsiveness of machine learning models to critical or deteriorating health conditions.

Pias T, Afrose S, Tuli M, Trisha I, Deng X, Nemeroff C Commun Med (Lond). 2025; 5(1):62.

PMID: 40069422 PMC: 11897252. DOI: 10.1038/s43856-025-00775-0.


Agents for Change: Artificial Intelligent Workflows for Quantitative Clinical Pharmacology and Translational Sciences.

Shahin M, Goswami S, Lobentanzer S, Corrigan B Clin Transl Sci. 2025; 18(3):e70188.

PMID: 40055986 PMC: 11889410. DOI: 10.1111/cts.70188.


Medical foundation large language models for comprehensive text analysis and beyond.

Xie Q, Chen Q, Chen A, Peng C, Hu Y, Lin F NPJ Digit Med. 2025; 8(1):141.

PMID: 40044845 PMC: 11882967. DOI: 10.1038/s41746-025-01533-1.


Analyzing patient perspectives with large language models: a cross-sectional study of sentiment and thematic classification on exception from informed consent.

Kornblith A, Singh C, Innes J, Chang T, Adelgais K, Holsti M Sci Rep. 2025; 15(1):6179.

PMID: 39979559 PMC: 11842787. DOI: 10.1038/s41598-025-89996-w.


InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis.

Selcuk Y, Kim E, Ahn I JMIR Med Inform. 2025; 13:e63881.

PMID: 39928922 PMC: 11851044. DOI: 10.2196/63881.

