Sieve-based Relation Extraction of Gene Regulatory Networks from Biological Literature

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2015 Nov 10

PMID 26551454

Citations 3

Authors

Slavko Zitnik

Marinka Zitnik

Blaz Zupan

Marko Bajec

Abstract

Background: Relation extraction is an essential procedure in literature mining. It focuses on extracting semantic relations between parts of text, called mentions. Biomedical literature includes an enormous number of textual descriptions of biological entities, their interactions, and the results of related experiments. To extract them in an explicit, computer-readable format, these relations were at first extracted manually from databases. Manual curation was later replaced with automatic or semi-automatic tools with natural language processing capabilities. The current challenge is the development of information extraction procedures that can directly infer more complex relational structures, such as gene regulatory networks.

Results: We develop a computational approach for the extraction of gene regulatory networks from textual data. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. With this method we successfully extracted the sporulation gene regulation network in the bacterium Bacillus subtilis for the information extraction challenge at the BioNLP 2013 conference. To enable extraction of distant relations using first-order models, we transform the data into skip-mention sequences. We infer multiple models, each of which is able to extract different relationship types. Following the shared task, we conducted additional analysis using different system settings, which reduced the reconstruction error of the bacterial sporulation network from 0.73 to 0.68, measured as the slot error rate between the predicted and the reference network. We observe that all relation extraction sieves contribute to the predictive performance of the proposed approach. In addition, features constructed from mention words and their prefixes and suffixes contribute most to extraction accuracy. Analysis of distances between different mention types in the text shows that our choice of transforming data into skip-mention sequences is appropriate for detecting relations between distant mentions.
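Two ideas in the Results paragraph can be illustrated with a short Python sketch. The first function shows one common reading of skip-mention sequences: for a skip number k, the ordered mentions are split into k + 1 sequences, each keeping every (k + 1)-th mention, so that mentions that were k + 1 positions apart become neighbours and fall within reach of a first-order model. The second function computes a slot error rate as (substitutions + deletions + insertions) divided by the number of reference relations. The function names, the edge representation, and the rule that a substitution is "same gene pair, different relation type" are assumptions made for this illustration, not the authors' implementation or the official shared-task scorer.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str]  # hypothetical representation: (regulator, target)


def skip_mention_sequences(mentions: Sequence[str], skip: int) -> List[List[str]]:
    """Split ordered mentions into skip + 1 sequences of every (skip + 1)-th mention."""
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    # Offsets beyond the end of the mention list would give empty sequences.
    return [list(mentions[offset::step]) for offset in range(min(step, len(mentions)))]


def slot_error_rate(reference: Dict[Edge, str], predicted: Dict[Edge, str]) -> float:
    """Compute (S + D + I) / N, where N is the number of reference relations.

    Simplifying assumption: at most one relation type per ordered gene pair.
    """
    if not reference:
        raise ValueError("reference network must contain at least one relation")
    substitutions = sum(
        1 for edge, rel in predicted.items()
        if edge in reference and reference[edge] != rel
    )
    deletions = sum(1 for edge in reference if edge not in predicted)
    insertions = sum(1 for edge in predicted if edge not in reference)
    return (substitutions + deletions + insertions) / len(reference)


if __name__ == "__main__":
    # Toy mention list; with skip = 1, mentions two positions apart become adjacent.
    print(skip_mention_sequences(["sigK", "spoIIID", "gerE", "sigE"], skip=1))

    # Toy networks with one substitution, one deletion, and one insertion.
    ref = {("a", "b"): "Activation", ("b", "c"): "Inhibition"}
    pred = {("a", "b"): "Inhibition", ("c", "d"): "Binding"}
    print(slot_error_rate(ref, pred))
```

Because insertions count as errors, the slot error rate can exceed 1 when a system predicts many spurious relations; lower values indicate a better reconstruction.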

Conclusions: Linear-chain conditional random fields, along with appropriate data transformations, can be efficiently used to extract relations. The sieve-based architecture simplifies the system, as new sieves can be easily added or removed and each sieve can utilize the results of previous ones. Furthermore, sieves with conditional random fields can be trained on arbitrary text data and hence are applicable to a broad range of relation extraction tasks and data domains.
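The sieve-based design described in the Conclusions can be sketched as a list of functions applied in order, each receiving the relations found so far and returning an updated set. The sketch below is a minimal illustration under stated assumptions: the `Relation` tuple, the `run_sieves` driver, the regular-expression rule sieve, and the self-loop cleaning sieve are all hypothetical stand-ins, and the toy sentence is invented. The published system uses conditional random field models and its own rules, which are not reproduced here.

```python
import re
from typing import Callable, List, Set, Tuple

# Hypothetical representation: (source mention, relation type, target mention)
Relation = Tuple[str, str, str]
Sieve = Callable[[str, Set[Relation]], Set[Relation]]


def rule_sieve(text: str, relations: Set[Relation]) -> Set[Relation]:
    """Toy rule-based sieve: add an Activation relation for 'X activates Y'."""
    found = {
        (source, "Activation", target)
        for source, target in re.findall(r"(\w+) activates (\w+)", text)
    }
    return relations | found


def cleaning_sieve(text: str, relations: Set[Relation]) -> Set[Relation]:
    """Toy cleaning sieve: use earlier results and drop self-relations."""
    return {rel for rel in relations if rel[0] != rel[2]}


def run_sieves(text: str, sieves: List[Sieve]) -> Set[Relation]:
    """Apply sieves in order; each one sees the output of the previous ones."""
    relations: Set[Relation] = set()
    for sieve in sieves:
        relations = sieve(text, relations)
    return relations


if __name__ == "__main__":
    # Invented toy text, used only to exercise the pipeline.
    toy_text = "sigE activates spoIIID. gerE activates gerE."
    print(run_sieves(toy_text, [rule_sieve, cleaning_sieve]))
```

Adding or removing a sieve amounts to editing the list passed to `run_sieves`, which is the property the Conclusions paragraph highlights.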

Citing Articles

Curation, inference, and assessment of a globally reconstructed gene regulatory network for Streptomyces coelicolor.

Zorro-Aranda A, Escorcia-Rodriguez J, Gonzalez-Kise J, Freyre-Gonzalez J Sci Rep. 2022; 12(1):2840.

PMID: 35181703 PMC: 8857197. DOI: 10.1038/s41598-022-06658-x.

Identification of conclusive association entities in biomedical articles.

Liu R J Biomed Semantics. 2019; 10(1):1.

PMID: 30616688 PMC: 6322258. DOI: 10.1186/s13326-018-0194-9.

Bridging semantics and syntax with graph algorithms-state-of-the-art of extracting biomedical relations.

Luo Y, Uzuner O, Szolovits P Brief Bioinform. 2016; 18(1):160-178.

PMID: 26851224 PMC: 5221425. DOI: 10.1093/bib/bbw001.
