A Model Stacking Framework for Identifying DNA Binding Proteins by Orchestrating Multi-View Features and Classifiers

Overview

Journal Genes (Basel)

Publisher MDPI

Date 2018 Aug 4

PMID 30071697

Citations 11

Authors

Xiu-Juan Liu

Xiu-Jun Gong

Hua Yu

Jia-Hui Xu

Abstract

Various machine learning approaches that use sequence information alone have been proposed for identifying DNA-binding proteins, which are crucial to many cellular processes, such as DNA replication, DNA repair and DNA modification. In these methods, building a meaningful feature representation of the sequences and choosing an appropriate classifier are the most critical tasks. Revealing the significance and contribution of different feature spaces and classifiers to the final prediction is of the utmost importance, not only for prediction performance but also for the practical clues it offers to the design of biological experiments. In this study, we propose a model stacking framework that orchestrates multi-view features and classifiers (MSFBinder) to investigate how to integrate and evaluate loosely-coupled models for predicting DNA-binding proteins. The framework integrates multi-view features, including Local_DPP, 188D, Position-Specific Scoring Matrix (PSSM)_DWT and autocross-covariance of secondary structures (AC_Struc), which were extracted based on evolutionary information, sequence composition, physicochemical properties and predicted structural information. These features are fed into various loosely-coupled classifiers, such as SVM and random forest. A logistic regression model is then applied to evaluate the contributions of these individual classifiers and to make the final prediction. On the training dataset PDB1075, the proposed method achieves an accuracy of 83.53%. On the independent dataset PDB186, it achieves an accuracy of 81.72%, which outperforms many existing methods. These results suggest that the framework can flexibly orchestrate various prediction models with good performance.
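
The stacking idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The two feature views are synthetic toy data rather than Local_DPP, 188D, PSSM_DWT or AC_Struc, and a nearest-centroid scorer stands in for the SVM and random forest base classifiers. All function names (`fit_centroid_scorer`, `fit_logistic`, `fit_stack`, `predict_stack`, `make_toy`) are hypothetical. What the sketch does share with the described framework is the overall shape: one base learner per view, out-of-fold base scores as second-level features, and a logistic regression on top whose weights give a rough indication of each view's contribution.

```python
import math
import random


def sigmoid(z):
    """Logistic function, clamped so math.exp cannot overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_centroid_scorer(X, y):
    """Stand-in base learner (NOT the paper's SVM or random forest).

    Returns a function mapping a feature vector to a signed score in (-1, 1);
    positive values mean the vector is closer to the positive-class centroid.
    """
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    mu_pos = [sum(col) / len(pos) for col in zip(*pos)]
    mu_neg = [sum(col) / len(neg) for col in zip(*neg)]

    def score(x):
        d_pos = sum((a - m) ** 2 for a, m in zip(x, mu_pos))
        d_neg = sum((a - m) ** 2 for a, m in zip(x, mu_neg))
        return 2.0 * sigmoid(d_neg - d_pos) - 1.0

    return score


def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Second-level logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            grad_b += err
            for j, zj in enumerate(z):
                grad_w[j] += err * zj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fit_stack(views, y, k=5):
    """Train one base learner per view and a logistic meta-learner on top."""
    n = len(y)
    Z = [[0.0] * len(views) for _ in range(n)]
    # Out-of-fold scores: each sample is scored by base learners that never saw it.
    for fold in range(k):
        held = set(range(fold, n, k))
        train = [i for i in range(n) if i not in held]
        for v, X in enumerate(views):
            score = fit_centroid_scorer([X[i] for i in train], [y[i] for i in train])
            for i in held:
                Z[i][v] = score(X[i])
    w, b = fit_logistic(Z, y)
    # Refit each base learner on all training data for use at prediction time.
    base = [fit_centroid_scorer(X, y) for X in views]
    return base, w, b


def predict_stack(model, sample_views):
    """Predict 0/1 for one sample given its feature vector in each view."""
    base, w, b = model
    z = [score(x) for score, x in zip(base, sample_views)]
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b >= 0.0 else 0


def make_toy(n, seed=0):
    """Synthetic data: one informative view and one pure-noise view."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    informative = [[rng.gauss(1.5 * t, 1.0) for _ in range(3)] for t in y]
    noise = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in y]
    return [informative, noise], y


views, y = make_toy(300)
train_views = [X[:200] for X in views]
test_views = [X[200:] for X in views]
model = fit_stack(train_views, y[:200])
preds = [predict_stack(model, [X[i] for X in test_views]) for i in range(100)]
accuracy = sum(p == t for p, t in zip(preds, y[200:])) / 100
print("meta-learner weights per view:", model[1])
print("held-out accuracy:", accuracy)
```

Training the second-level model on out-of-fold scores, rather than on scores from base learners that have already seen the same samples, is a common way to keep the meta-learner from rewarding overfitted base models. Whether MSFBinder uses exactly this scheme is not stated in the abstract.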

Citing Articles

SNARER: new molecular descriptors for SNARE proteins classification.

Auriemma Citarella A, Di Biasi L, Risi M, Tortora G. BMC Bioinformatics. 2022; 23(1):148.

PMID: 35462533 PMC: 9035248. DOI: 10.1186/s12859-022-04677-z.

The Characterization of Structure and Prediction for Aquaporin in Tumour Progression by Machine Learning.

Chen Z, Jiao S, Zhao D, Zou Q, Xu L, Zhang L. Front Cell Dev Biol. 2022; 10:845622.

PMID: 35178393 PMC: 8844512. DOI: 10.3389/fcell.2022.845622.

Identify DNA-Binding Proteins Through the Extreme Gradient Boosting Algorithm.

Zhao Z, Yang W, Zhai Y, Liang Y, Zhao Y. Front Genet. 2022; 12:821996.

PMID: 35154264 PMC: 8837382. DOI: 10.3389/fgene.2021.821996.

A sequence-based multiple kernel model for identifying DNA-binding proteins.

Qian Y, Jiang L, Ding Y, Tang J, Guo F. BMC Bioinformatics. 2021; 22(Suppl 3):291.

PMID: 34058979 PMC: 8167993. DOI: 10.1186/s12859-020-03875-x.

Prediction of DNA binding proteins using local features and long-term dependencies with primary sequences based on deep learning.

Li G, Du X, Li X, Zou L, Zhang G, Wu Z. PeerJ. 2021; 9:e11262.

PMID: 33986992 PMC: 8101451. DOI: 10.7717/peerj.11262.
