Classifying Protein-protein Interaction Articles Using Word and Syntactic Features
Background: Identifying protein-protein interactions (PPIs) from the literature is an important step in mining the function of individual proteins as well as their biological networks. Because PPIs are known to follow distinctive patterns in text, machine learning approaches have been successfully applied to mine these patterns. However, the complex nature of PPI descriptions makes the extraction process difficult.
Results: Our approach uses both word and syntactic features to capture PPI patterns in biomedical literature. The proposed method automatically identifies gene names with a Priority Model, then extracts grammatical relations using a dependency parser. A large margin classifier with a Huber loss function learns from the extracted features, and this data-driven model is then used to classify unseen articles. In the BioCreative III ACT evaluation, our official runs ranked in the top positions, achieving a maximum of 89.15% accuracy, 61.42% F1 score, 0.55306 MCC score, and 67.98% AUC iP/R score.
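The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the loss, so the sketch assumes the modified Huber loss often paired with large margin linear classifiers, trained by plain stochastic gradient descent. The feature names (`word=...`, `dep=...`), the toy articles, and all hyperparameters are hypothetical, and gene tagging and dependency parsing are assumed to have been done already.

```python
import random

def modified_huber_loss(z):
    """Modified Huber loss on the margin z = label * score (assumed form)."""
    if z >= 1.0:
        return 0.0
    if z >= -1.0:
        return (1.0 - z) ** 2
    return -4.0 * z

def modified_huber_grad(z):
    """Derivative of the loss with respect to the margin z."""
    if z >= 1.0:
        return 0.0
    if z >= -1.0:
        return -2.0 * (1.0 - z)
    return -4.0

def score(weights, bias, features):
    """Linear score over a sparse feature dictionary."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in features.items())

def train(examples, epochs=50, lr=0.1, l2=0.001, seed=0):
    """Stochastic gradient descent over (features, label) pairs, label in {+1, -1}."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            margin = label * score(weights, bias, features)
            g = modified_huber_grad(margin) * label  # gradient w.r.t. the score
            for f, v in features.items():
                w = weights.get(f, 0.0)
                # L2 penalty is applied only to features active in this example
                weights[f] = w - lr * (g * v + l2 * w)
            bias -= lr * g
    return weights, bias

# Hypothetical articles: word features plus dependency features around a
# gene placeholder (GENE), as a tagger and parser might produce them.
TOY_ARTICLES = [
    ({"word=interacts": 1.0, "dep=nsubj(interacts,GENE)": 1.0}, +1),
    ({"word=binds": 1.0, "dep=dobj(binds,GENE)": 1.0}, +1),
    ({"word=expression": 1.0, "word=patients": 1.0}, -1),
    ({"word=survey": 1.0, "word=patients": 1.0}, -1),
]

weights, bias = train(TOY_ARTICLES)
# Rank articles by score, highest first, as a PPI article search engine would.
ranked = sorted(TOY_ARTICLES, key=lambda ex: score(weights, bias, ex[0]), reverse=True)
```

Because the output is a real-valued score rather than only a label, the same model serves both classification (thresholding at zero) and ranking, which is what the AUC iP/R measure evaluates.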
Conclusions: Although problems remain, using syntactic information for article-level filtering helps improve PPI ranking performance. The proposed system is a revision, made for the ACT evaluation, of algorithms previously developed in our group. Our approach is valuable in showing how to use grammatical relations for PPI article filtering, particularly with a limited training corpus. While current performance is far from satisfactory for an annotation tool, it is already useful for a PPI article search engine, since users focus mainly on highly ranked results.