An Integrated Machine Learning System to Computationally Screen Protein Databases for Protein Binding Peptide Ligands
Overview
Cell Biology
Molecular Biology
A fairly large set of protein interactions is mediated by families of peptide-binding domains, such as Src homology 2 (SH2), SH3, PDZ, and the major histocompatibility complex. Identifying their ligands by experimental screening is not only labor-intensive but also nearly futile for low-abundance species, because they are suppressed by high-abundance species. An ideal way to study these protein-protein interactions is to screen protein sequence databases with high-throughput computational approaches, directing validation experiments toward the most promising peptides. Predictors that merely perform well in cross-validation are not good enough to screen protein databases. In the current study, we built integrated machine learning systems using three novel coding methods and screened the Swiss-Prot and GenBank protein databases for potential ligands of 10 SH3 and three PDZ domains. A large fraction of the predictions has already been experimentally confirmed by independent research groups, indicating satisfactory generalization capability for future applications in identifying protein interactions.
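To make the screening idea concrete, the following is a minimal sketch of how a sequence database might be scanned with a sliding window and scored by several predictors whose outputs are combined. It is not the study's implementation: the abstract does not describe the three coding methods, so the three scorers here (`has_pxxp`, `proline_fraction`, `basic_fraction`) are toy stand-ins loosely motivated by the proline-rich motifs that SH3 domains commonly bind, and the window length, threshold, averaging rule, and toy sequences are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def sliding_peptides(sequence: str, length: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, peptide) for every window of `length` residues."""
    for start in range(len(sequence) - length + 1):
        yield start, sequence[start:start + length]


# Toy stand-ins for trained predictors (assumptions, not the paper's coding methods).
def has_pxxp(peptide: str) -> float:
    """Return 1.0 if the peptide contains a PxxP pattern, else 0.0."""
    found = any(peptide[i] == "P" and peptide[i + 3] == "P"
                for i in range(len(peptide) - 3))
    return 1.0 if found else 0.0


def proline_fraction(peptide: str) -> float:
    """Fraction of residues that are proline."""
    return peptide.count("P") / len(peptide)


def basic_fraction(peptide: str) -> float:
    """Fraction of residues that are arginine or lysine."""
    return (peptide.count("R") + peptide.count("K")) / len(peptide)


def screen(database: Dict[str, str],
           scorers: List[Callable[[str], float]],
           length: int = 10,
           threshold: float = 0.4) -> List[Tuple[str, int, str, float]]:
    """Score every window in every protein; keep windows whose mean score passes."""
    hits = []
    for protein_id, sequence in database.items():
        for start, peptide in sliding_peptides(sequence, length):
            # Skip windows containing non-standard residue codes (e.g. X).
            if not set(peptide) <= AMINO_ACIDS:
                continue
            score = sum(scorer(peptide) for scorer in scorers) / len(scorers)
            if score >= threshold:
                hits.append((protein_id, start, peptide, score))
    # Rank the most promising peptides first.
    hits.sort(key=lambda hit: hit[3], reverse=True)
    return hits


# Hypothetical toy sequences, not real database entries.
toy_database = {
    "toy1": "MSGRPLPPLPAKDE",
    "toy2": "MAGSTEEDDLLVVA",
}
hits = screen(toy_database, [has_pxxp, proline_fraction, basic_fraction])
for protein_id, start, peptide, score in hits:
    print(protein_id, start, peptide, round(score, 3))
```

Averaging the scorers is only one way to integrate predictors; a stricter scheme could require every predictor to agree, which would likely trade recall for fewer false positives when scanning a large database.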