Classifying Protein-protein Interaction Articles Using Word and Syntactic Features

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2011 Dec 14

PMID 22151252

Citations 14

Authors

Sun Kim

W John Wilbur

Abstract

Background: Identifying protein-protein interactions (PPIs) from literature is an important step in mining the function of individual proteins as well as their biological network. Since it is known that PPIs have distinctive patterns in text, machine learning approaches have been successfully applied to mine these patterns. However, the complex nature of PPI description makes the extraction process difficult.

Results: Our approach uses both word and syntactic features to capture PPI patterns in biomedical literature. The proposed method automatically identifies gene names with a Priority Model, then extracts grammatical relations using a dependency parser. A large margin classifier with a Huber loss function learns from the extracted features, and this data-driven model is then used to classify unseen articles. In the BioCreative III ACT evaluation, our official runs ranked among the top positions, achieving a maximum of 89.15% accuracy, 61.42% F1 score, 0.55306 MCC score, and 67.98% AUC iP/R score.
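To make the loss and three of the reported metrics concrete, here is a minimal Python sketch. It assumes that "Huber loss" refers to the modified Huber loss commonly used for large margin classification; the abstract does not spell out the exact form. All function names are hypothetical and the code is not taken from the authors' system. AUC iP/R is omitted because its definition is not given here.

```python
import math


def modified_huber_loss(y, score):
    """Modified Huber loss for one example (an assumed form, see lead-in).

    y is the label in {-1, +1}; score is the raw classifier output.
    """
    z = y * score
    if z >= 1.0:
        return 0.0             # correct with enough margin: no penalty
    if z >= -1.0:
        return (1.0 - z) ** 2  # quadratic region near the margin
    return -4.0 * z            # linear region limits the effect of outliers


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 1 (PPI) and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    counts = confusion_counts([1, 1, 0, 0], [1, 0, 0, 0])
    print(accuracy(*counts), f1_score(*counts), mcc(*counts))
```

On a class-imbalanced task, accuracy can stay high while F1 and MCC remain modest, since a classifier that mostly predicts the majority class is still right most of the time.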

Conclusions: Even though problems remain, utilizing syntactic information for article-level filtering helps improve PPI ranking performance. The proposed system is a revision of algorithms previously developed in our group for the ACT evaluation. Our approach is valuable in showing how to use grammatical relations for PPI article filtering, particularly with a limited training corpus. While current performance is far from satisfactory for an annotation tool, it is already useful for a PPI article search engine, since users focus mainly on highly ranked results.

Citing Articles

Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach.

Qu J, Steppi A, Zhong D, Hao J, Wang J, Lung P. BMC Genomics. 2020; 21(1):773.

PMID: 33167858 PMC: 7654050. DOI: 10.1186/s12864-020-07185-7.

The BioGRID interaction database: 2019 update.

Oughtred R, Stark C, Breitkreutz B, Rust J, Boucher L, Chang C. Nucleic Acids Res. 2018; 47(D1):D529-D541.

PMID: 30476227 PMC: 6324058. DOI: 10.1093/nar/gky1079.

Document triage for identifying protein-protein interactions affected by mutations: a neural network ensemble approach.

Luo L, Yang Z, Lin H, Wang J. Database (Oxford). 2018; 2018.

PMID: 30295718 PMC: 6147215. DOI: 10.1093/database/bay097.

Protein-Protein Interaction Article Classification Using a Convolutional Recurrent Neural Network with Pre-trained Word Embeddings.

Matos S, Antunes R. J Integr Bioinform. 2017; 14(4).

PMID: 29236678 PMC: 6042813. DOI: 10.1515/jib-2017-0055.

Text Mining for Protein Docking.

Badal V, Kundrotas P, Vakser I. PLoS Comput Biol. 2015; 11(12):e1004630.

PMID: 26650466 PMC: 4674139. DOI: 10.1371/journal.pcbi.1004630.

References

Salwinski L, Miller C, Smith A, Pettit F, Bowie J, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2003; 32(Database issue):D449-51. PMC: 308820. DOI: 10.1093/nar/gkh086.

Miyao Y, Sagae K, Saetre R, Matsuzaki T, Tsujii J. Evaluating contributions of natural language parsers to protein-protein interaction extraction. Bioinformatics. 2008; 25(3):394-400. PMC: 2639072. DOI: 10.1093/bioinformatics/btn631.

Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C. The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2009; 38(Database issue):D525-31. PMC: 2808934. DOI: 10.1093/nar/gkp878.

Ceol A, Aryamontri A, Licata L, Peluso D, Briganti L, Perfetto L. MINT, the molecular interaction database: 2009 update. Nucleic Acids Res. 2009; 38(Database issue):D532-9. PMC: 2808973. DOI: 10.1093/nar/gkp983.

Huang M, Ding S, Wang H, Zhu X. Mining physical protein-protein interactions from the literature. Genome Biol. 2008; 9 Suppl 2:S12. PMC: 2559983. DOI: 10.1186/gb-2008-9-s2-s12.

Niu Y, Otasek D, Jurisica I. Evaluation of linguistic features useful in extraction of interactions from PubMed; application to annotating known, high-throughput and predicted interactions in I2D. Bioinformatics. 2009; 26(1):111-9. PMC: 2796811. DOI: 10.1093/bioinformatics/btp602.

Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J. Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol. 2008; 9 Suppl 2:S1. PMC: 2559980. DOI: 10.1186/gb-2008-9-s2-s1.

Donaldson I, Martin J, De Bruijn B, Wolting C, Lay V, Tuekam B. PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics. 2003; 4:11. PMC: 153503. DOI: 10.1186/1471-2105-4-11.

Smith L, Wilbur W. Finding related sentence pairs in MEDLINE. Inf Retr Boston. 2010; 13(6):601-617. PMC: 2992462. DOI: 10.1007/s10791-010-9126-8.

Rebholz-Schuhmann D, Jimeno-Yepes A, Arregui M, Kirsch H. Measuring prediction capacity of individual verbs for the identification of protein interactions. J Biomed Inform. 2009; 43(2):200-7. DOI: 10.1016/j.jbi.2009.09.007.

Lowe H, Barnett G. Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. JAMA. 1994; 271(14):1103-8.

Bader G, Donaldson I, Wolting C, Ouellette B, Pawson T, Hogue C. BIND--The Biomolecular Interaction Network Database. Nucleic Acids Res. 2000; 29(1):242-5. PMC: 29820. DOI: 10.1093/nar/29.1.242.

Blaschke C, Leon E, Krallinger M, Valencia A. Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics. 2005; 6 Suppl 1:S16. PMC: 1869008. DOI: 10.1186/1471-2105-6-S1-S16.

Bjorne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T. Complex event extraction at PubMed scale. Bioinformatics. 2010; 26(12):i382-90. PMC: 2881365. DOI: 10.1093/bioinformatics/btq180.

Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000; 16(5):412-24. DOI: 10.1093/bioinformatics/16.5.412.

Jang H, Lim J, Lim J, Park S, Lee K, Park S. Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics. 2006; 22(14):e220-6. DOI: 10.1093/bioinformatics/btl203.

Vapnik V. An overview of statistical learning theory. IEEE Trans Neural Netw. 2008; 10(5):988-99. DOI: 10.1109/72.788640.

Kim S, Shin S, Lee I, Kim S, Sriram R, Zhang B. PIE: an online prediction system for protein-protein interactions from text. Nucleic Acids Res. 2008; 36(Web Server issue):W411-5. PMC: 2447724. DOI: 10.1093/nar/gkn281.