A Large Language Model Framework for Literature-based Disease-gene Association Prediction

Overview

Journal Brief Bioinform

Publisher Oxford University Press

Specialty Biology

Date 2025 Feb 25

PMID 39998433

Authors

Peng-Hsuan Li

Yih-Yun Sun

Hsueh-Fen Juan

Chien-Yu Chen

Huai-Kuang Tsai

Jia-Hsin Huang

Abstract

With the exponential growth of biomedical literature, leveraging Large Language Models (LLMs) for automated medical knowledge understanding has become increasingly critical for advancing precision medicine. However, current approaches face significant challenges in reliability, verifiability, and scalability when extracting complex biological relationships from scientific literature using LLMs. To overcome these obstacles to LLM-based biomedical literature understanding, we propose LORE, a novel unsupervised two-stage reading methodology that uses LLMs to model literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. When applied to PubMed abstracts for large-scale understanding of disease-gene relationships, LORE captured essential gene pathogenicity information. We demonstrated that modeling a latent pathogenic flow in the semantic embedding, with supervision from the ClinVar database, led to a 90% mean average precision in identifying relevant genes across 2097 diseases. This work provides a scalable and reproducible approach for leveraging LLMs in biomedical literature analysis, offering new opportunities for researchers to identify therapeutic targets efficiently.
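For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean average precision over per-disease gene rankings is commonly computed. It is not the authors' evaluation code: the function names, the toy disease and gene identifiers, and the choice to normalize by the number of relevant genes are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked_genes: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked gene list against a set of relevant genes."""
    hits = 0
    precision_sum = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at the rank of this hit
    # Assumption: normalize by the total number of relevant genes.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(
    rankings: Dict[str, List[str]], labels: Dict[str, Set[str]]
) -> float:
    """Mean of per-disease average precision over diseases that have a ranking."""
    scores = [
        average_precision(rankings[disease], labels[disease])
        for disease in labels
        if disease in rankings
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two diseases, each with a ranked candidate gene list
# and a set of genes treated as relevant (for example, from a curated database).
rankings = {
    "disease_A": ["GENE1", "GENE2", "GENE3", "GENE4"],
    "disease_B": ["GENE5", "GENE6", "GENE7"],
}
labels = {
    "disease_A": {"GENE1", "GENE3"},
    "disease_B": {"GENE6"},
}

map_score = mean_average_precision(rankings, labels)
print(f"MAP = {map_score:.4f}")
```

In this toy case, disease_A scores (1/1 + 2/3) / 2 and disease_B scores 1/2, so the mean is 2/3. Averaging per disease rather than pooling all gene predictions gives every disease equal weight, regardless of how many relevant genes it has.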
