RENET2: High-performance Full-text Gene-disease Relation Extraction with Iterative Training Data Expansion

Overview

Journal NAR Genom Bioinform

Publisher Oxford University Press

Specialty Biology

Date 2021 Jul 8

PMID 34235433

Citations 5

Authors

Junhao Su

Ye Wu

Hing-Fung Ting

Tak-Wah Lam

Ruibang Luo

Abstract

Relation extraction (RE) is a fundamental task for extracting gene-disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene-disease associations only from single sentences or abstract texts. A few studies have explored extracting gene-disease associations from full-text articles, but there is substantial room for improvement. In this work, we propose RENET2, a deep learning-based RE method, which implements Section Filtering and ambiguous relations modeling to extract gene-disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset to resolve the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene-disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central and found ∼3.72M gene-disease associations; and (ii) the LitCovid articles and ranked the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene-disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available at GitHub.
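
For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how an F1-score can be computed over extracted gene-disease pairs. It is not taken from the RENET2 code base: the function name `f1_score`, the representation of an association as a `(gene, disease)` tuple, and the example pairs are all assumptions made for illustration, and the paper's actual evaluation protocol may differ.

```python
def f1_score(predicted, gold):
    """Return (precision, recall, F1) for predicted vs. gold relation pairs.

    Both arguments are iterables of hashable items, here (gene, disease) tuples.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)  # pairs that are both predicted and annotated
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations and predictions, for illustration only
gold = {
    ("PTEN", "Cowden disease"),
    ("HFE", "haemochromatosis"),
    ("ACE2", "COVID-19"),
}
predicted = {
    ("PTEN", "Cowden disease"),
    ("ACE2", "COVID-19"),
    ("ACE2", "haemochromatosis"),  # a false positive in this toy example
}

precision, recall, f1 = f1_score(predicted, gold)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Treating associations as a set means each distinct pair is counted once per evaluation, regardless of how many sentences mention it; whether that matches a given benchmark depends on how its annotations are defined.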

Citing Articles

An NLP-based method to mine gene and function relationships from published articles.

Kumar N, Mukhtar M Sci Rep. 2025; 15(1):7503.

PMID: 40033048 PMC: 11876572. DOI: 10.1038/s41598-025-91809-z.

LSD600: the first corpus of biomedical abstracts annotated with lifestyle-disease relations.

Nourani E, Makri E, Mao X, Pyysalo S, Brunak S, Nastou K Database (Oxford). 2025; 2025.

PMID: 39824652 PMC: 11756709. DOI: 10.1093/database/baae129.

RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature.

Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen L Database (Oxford). 2024; 2024.

PMID: 39265993 PMC: 11394941. DOI: 10.1093/database/baae095.

DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations.

Nachtegael C, De Stefani J, Cnudde A, Lenaerts T Database (Oxford). 2024; 2024.

PMID: 38805753 PMC: 11131422. DOI: 10.1093/database/baae039.

Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource.

Huang M, Han J, Lin P, You Y, Tsai R, Hsu W Brief Bioinform. 2024; 25(3).

PMID: 38609331 PMC: 11014787. DOI: 10.1093/bib/bbae132.
