Me-LLaMA: Foundation Large Language Models for Medical Applications

Overview

Journal Res Sq

Date 2024 Jun 3

PMID 38826372

Authors

Qianqian Xie

Qingyu Chen

Aokun Chen

Cheng Peng

Yan Hu

Fongci Lin

Xueqing Peng

Jimin Huang

Jeffrey Zhang

Vipina Keloth

Xinyu Zhou

Huan He

Lucila Ohno-Machado

Yonghui Wu

Hua Xu

Jiang Bian

Abstract

Recent advances in large language models (LLMs) such as ChatGPT and LLaMA suggest that they could transform medical applications, yet in clinical settings their performance is often limited by a lack of specialized training on medical data. To address this challenge, this study introduces Me-LLaMA, a new family of medical LLMs comprising the foundation models Me-LLaMA 13B and 70B and their chat-enhanced versions, Me-LLaMA 13B-chat and 70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 on large medical datasets. Our methodology leverages a comprehensive domain-specific data suite: a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) spanning six critical medical tasks and 12 datasets. Our extensive evaluation on MIBE shows that Me-LLaMA models achieve better overall performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised learning settings. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models mitigate it better than other open-source medical LLMs. Me-LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It performs better across both general and medical tasks than other open-source medical LLMs, making it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.
