
RENET2: High-performance Full-text Gene-disease Relation Extraction with Iterative Training Data Expansion

Overview
Specialty Biology
Date 2021 Jul 8
PMID 34235433
Citations 5
Abstract

Relation extraction (RE) is a fundamental task for extracting gene-disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene-disease associations only from single sentences or abstracts. A few studies have explored extracting gene-disease associations from full-text articles, but there is substantial room for improvement. In this work, we propose RENET2, a deep learning-based RE method that implements Section Filtering and ambiguous relation modeling to extract gene-disease associations from full-text articles. To address the scarcity of labels for full-text articles, we designed a novel iterative training data expansion strategy to build an annotated full-text dataset. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene-disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central, finding ∼3.72M gene-disease associations; and (ii) the LitCovid articles, ranking the top 15 proteins associated with COVID-19, which are supported by recent articles. RENET2 is an efficient and accurate method for full-text gene-disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available on GitHub.
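
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how precision, recall and F1 can be computed over extracted gene-disease pairs. It is not the RENET2 evaluation code; the function name `pair_f1`, the representation of an association as a `(gene, disease)` tuple, and the placeholder identifiers are all assumptions made for illustration.

```python
def pair_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of (gene, disease) pairs.

    Illustrative only: assumes each association is reduced to a single
    hashable pair, so duplicates are collapsed by the set conversion.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical placeholder identifiers, not data from the paper.
gold = {("GENE_A", "DISEASE_X"), ("GENE_B", "DISEASE_X"), ("GENE_C", "DISEASE_Y")}
predicted = {("GENE_A", "DISEASE_X"), ("GENE_C", "DISEASE_Y"), ("GENE_D", "DISEASE_Z")}

precision, recall, f1 = pair_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so a method scores well only when it both avoids spurious associations and recovers most of the annotated ones.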

Citing Articles

An NLP-based method to mine gene and function relationships from published articles.

Kumar N, Mukhtar M. Sci Rep. 2025; 15(1):7503.

PMID: 40033048 PMC: 11876572. DOI: 10.1038/s41598-025-91809-z.


LSD600: the first corpus of biomedical abstracts annotated with lifestyle-disease relations.

Nourani E, Makri E, Mao X, Pyysalo S, Brunak S, Nastou K. Database (Oxford). 2025; 2025.

PMID: 39824652 PMC: 11756709. DOI: 10.1093/database/baae129.


RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature.

Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen L. Database (Oxford). 2024; 2024.

PMID: 39265993 PMC: 11394941. DOI: 10.1093/database/baae095.


DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations.

Nachtegael C, De Stefani J, Cnudde A, Lenaerts T. Database (Oxford). 2024; 2024.

PMID: 38805753 PMC: 11131422. DOI: 10.1093/database/baae039.


Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource.

Huang M, Han J, Lin P, You Y, Tsai R, Hsu W. Brief Bioinform. 2024; 25(3).

PMID: 38609331 PMC: 11014787. DOI: 10.1093/bib/bbae132.

