
Extraction of Relations Between Genes and Diseases from Text and Large-scale Data Analysis: Implications for Translational Research

Overview
Publisher BioMed Central
Specialty Biology
Date 2015 Apr 19
PMID 25886734
Citations 87
Abstract

Background: Current biomedical research needs to exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key to identifying actionable knowledge in free-text repositories. We present BeFree, a system for identifying relationships between biomedical entities, with a special focus on genes and their associated diseases.

Results: By exploiting morpho-syntactic information in the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided insights into the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.

Conclusions: BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and that only a small proportion of this dataset (2%) is recorded in curated resources, raising several issues regarding data prioritization and curation. We propose joint analysis of text-mined data with expert-curated data as a suitable approach to both assess data quality and highlight novel and interesting information.
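
The overlap figure reported in the conclusions can be pictured as a simple set comparison between text-mined and curated gene-disease pairs. The following is a minimal sketch under stated assumptions, not the BeFree implementation: the function name `curated_fraction` and the placeholder pairs are hypothetical and exist only to illustrate the arithmetic.

```python
from typing import Set, Tuple

# A gene-disease association represented as a (gene, disease) pair.
# This representation is an assumption made for illustration.
Association = Tuple[str, str]


def curated_fraction(text_mined: Set[Association],
                     curated: Set[Association]) -> float:
    """Return the share of text-mined associations also found in curated resources."""
    if not text_mined:
        return 0.0
    return len(text_mined & curated) / len(text_mined)


# Hypothetical placeholder data, not results from the paper.
text_mined = {
    ("GENE_A", "disease_1"),
    ("GENE_B", "disease_1"),
    ("GENE_C", "disease_2"),
    ("GENE_D", "disease_3"),
}
curated = {
    ("GENE_D", "disease_3"),
    ("GENE_E", "disease_2"),
}

# Pairs supported by both sources can serve as a data-quality check;
# pairs found only by text mining are candidates for review and prioritization.
shared = text_mined & curated
novel = text_mined - curated

print(f"Fraction already curated: {curated_fraction(text_mined, curated):.0%}")
print(f"Candidates for prioritization: {len(novel)}")
```

In this toy case one of four text-mined pairs is already curated; the abstract reports that the corresponding proportion in the real dataset is about 2%.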

Citing Articles

A large language model framework for literature-based disease-gene association prediction.

Li P, Sun Y, Juan H, Chen C, Tsai H, Huang J Brief Bioinform. 2025; 26(1).

PMID: 39998433 PMC: 11851487. DOI: 10.1093/bib/bbaf070.


State of the interactomes: an evaluation of molecular networks for generating biological insights.

Wright S, Colton S, Schaffer L, Pillich R, Churas C, Pratt D Mol Syst Biol. 2024; 21(1):1-29.

PMID: 39653848 PMC: 11697402. DOI: 10.1038/s44320-024-00077-y.


AlzGenPred - CatBoost-based gene classifier for predicting Alzheimer's disease using high-throughput sequencing data.

Shukla R, Singh T Sci Rep. 2024; 14(1):30294.

PMID: 39639110 PMC: 11621786. DOI: 10.1038/s41598-024-82208-x.


Transforming Clinical Research: The Power of High-Throughput Omics Integration.

Vitorino R Proteomes. 2024; 12(3).

PMID: 39311198 PMC: 11417901. DOI: 10.3390/proteomes12030025.


Dataset of miRNA-disease relations extracted from textual data using transformer-based neural networks.

Madan S, Kuhnel L, Frohlich H, Hofmann-Apitius M, Fluck J Database (Oxford). 2024; 2024.

PMID: 39104284 PMC: 11300841. DOI: 10.1093/database/baae066.

