GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains

Overview

Journal Biomed Res Int

Publisher Wiley

Specialties Biomedical Engineering, Biotechnology, General Medicine

Date 2015 Sep 18

PMID 26380306

Citations 107

Authors

Chih-Hsuan Wei

Hung-Yu Kao

Zhiyong Lu

Abstract

The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available, and these tools are typically restricted to detecting either mention-level gene names or document-level gene identifiers, but not both. In this work, we report GNormPlus: an end-to-end, open-source system that handles both gene mention and identifier detection. To support the development of GNormPlus, we created a new corpus of 694 PubMed articles containing manual annotations not only for gene names and their identifiers, but also for closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving an F1-score of 86.7% on the BioCreative II Gene Normalization task dataset and 50.1% on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator.
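
As a reading aid for the F1-scores quoted above, the following is a minimal sketch of how precision, recall, and F1 can be computed for document-level gene identifier normalization. It is not the GNormPlus code or the official BioCreative scorer, whose details may differ. The function name `prf1`, the document keys, and the toy identifier sets are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple


def prf1(gold: Dict[str, Set[str]],
         predicted: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over (document, gene ID) pairs."""
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        g = gold.get(doc_id, set())
        p = predicted.get(doc_id, set())
        tp += len(g & p)   # identifiers found in both gold and prediction
        fp += len(p - g)   # predicted identifiers absent from gold
        fn += len(g - p)   # gold identifiers the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy data: invented document keys, illustrative gene identifiers.
gold = {"doc1": {"672", "7157"}, "doc2": {"1956"}}
predicted = {"doc1": {"672", "675"}, "doc2": {"1956"}}

precision, recall, f1 = prf1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case the system gets two identifiers right, predicts one spurious identifier, and misses one, so precision, recall, and F1 all come out to 2/3.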

Citing Articles

Assessing Artificial Intelligence (AI) Implementation for Assisting Gene Linking (at the National Library of Medicine).

Islamaj R, Wei C, Lai P, Huston M, Coss C, Kochar P. JAMIA Open. 2025; 8(1):ooae129.

PMID: 39776621 PMC: 11706533. DOI: 10.1093/jamiaopen/ooae129.

HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools.

Sanger M, Garda S, Wang X, Weber-Genzel L, Droop P, Fuchs B. Bioinformatics. 2024; 40(10).

PMID: 39302686 PMC: 11453098. DOI: 10.1093/bioinformatics/btae564.

BioTextQuest v2.0: An evolved tool for biomedical literature mining and concept discovery.

Theodosiou T, Vrettos K, Baltsavia I, Baltoumas F, Papanikolaou N, Antonakis A. Comput Struct Biotechnol J. 2024; 23:3247-3253.

PMID: 39279874 PMC: 11399685. DOI: 10.1016/j.csbj.2024.08.016.

EnzChemRED, a rich enzyme chemistry relation extraction dataset.

Lai P, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E. Sci Data. 2024; 11(1):982.

PMID: 39251610 PMC: 11384730. DOI: 10.1038/s41597-024-03835-7.

Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach.

Sarol M, Hong G, Guerra E, Kilicoglu H. Database (Oxford). 2024; 2024.

PMID: 39197056 PMC: 11352595. DOI: 10.1093/database/baae079.
