RENET2: High-performance Full-text Gene-disease Relation Extraction with Iterative Training Data Expansion
Overview
Relation extraction (RE) is a fundamental task for extracting gene-disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene-disease associations only from single sentences or abstracts. A few studies have explored extracting gene-disease associations from full-text articles, but there is substantial room for improvement. In this work, we propose RENET2, a deep learning-based RE method that implements Section Filtering and ambiguous relations modeling to extract gene-disease associations from full-text articles. To address the scarcity of labels for full-text articles, we designed a novel iterative training data expansion strategy to build an annotated full-text dataset. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene-disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central, finding ∼3.72M gene-disease associations; and (ii) the LitCovid articles, ranking the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene-disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available on GitHub.
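For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score over extracted gene-disease pairs can be computed. It is not the RENET2 evaluation code: the function name `f1_score`, the triple layout (article, gene, disease), and all identifiers in the toy data are placeholders assumed for illustration only.

```python
from typing import Set, Tuple

# Assumed layout for illustration: (article ID, gene ID, disease ID).
Pair = Tuple[str, str, str]


def f1_score(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Harmonic mean of precision and recall over sets of extracted pairs."""
    true_pos = len(predicted & gold)  # pairs both predicted and annotated
    if true_pos == 0:
        # No overlap (or an empty set): define F1 as 0 and avoid dividing by zero.
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data with placeholder identifiers, not real annotations.
gold = {
    ("ARTICLE_1", "GENE_A", "DISEASE_X"),
    ("ARTICLE_1", "GENE_B", "DISEASE_X"),
    ("ARTICLE_2", "GENE_C", "DISEASE_Y"),
}
predicted = {
    ("ARTICLE_1", "GENE_A", "DISEASE_X"),
    ("ARTICLE_2", "GENE_C", "DISEASE_Y"),
    ("ARTICLE_2", "GENE_D", "DISEASE_Y"),
}

score = f1_score(predicted, gold)
print(f"F1 = {score:.4f}")
```

In this toy case two of the three predicted pairs are correct and two of the three annotated pairs are recovered, so precision and recall are both 2/3 and the F1-score equals 2/3.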