
Computational Prediction and Interpretation of Both General and Specific Types of Promoters in Escherichia Coli by Exploiting a Stacked Ensemble-learning Framework

Overview
Journal: Brief Bioinform
Specialty: Biology
Date: 2020 May 5
PMID: 32363397
Citations: 33
Abstract

Promoters are short consensus DNA sequences that are responsible for the transcriptional activation or repression of all genes. Bacteria have many types of promoters, which play important roles in initiating gene transcription. Solving promoter-identification problems therefore has important implications for understanding their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for identifying promoters and classifying them by type. SELECTOR combines four feature encodings (the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features) and uses five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests on benchmark datasets and independent tests on a newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in predicting both general and specific types of promoters in Escherichia coli. Furthermore, the framework uses the Shapley Additive exPlanation algorithm to provide interpretations that help explain the model's success. These interpretations highlight the features most important for predicting both general and specific types of promoters, and they overcome a limitation of existing 'black-box' approaches, which are unable to reveal causal relationships from large amounts of initially encoded features.
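As an illustration of one of the encodings named in the abstract, the composition of k-spaced nucleic acid pairs can be sketched as follows. This is a minimal sketch under stated assumptions, not SELECTOR's actual implementation: the function name `cksnap`, the `max_gap` parameter, and the example sequence are hypothetical, and the encoding follows the common definition in which, for each gap k, the frequency of every nucleotide pair separated by k positions is computed over all such pairs in the sequence.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# The 16 ordered nucleotide pairs: AA, AC, AG, ..., TT
PAIRS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def cksnap(sequence, max_gap=2):
    """Return k-spaced nucleic acid pair frequencies for gaps 0..max_gap.

    Hypothetical helper for illustration; the output has
    16 * (max_gap + 1) values, ordered by gap and then by pair.
    """
    seq = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # Pair the base at position i with the base gap + 1 positions later.
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # skip pairs with ambiguous bases such as N
                counts[pair] += 1
                total += 1
        # Normalize counts to frequencies; all zeros if no valid pair exists.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


# Example sequence chosen only for illustration.
vector = cksnap("TTGACATATAAT", max_gap=2)
print(len(vector))  # 3 gaps x 16 pairs = 48 features
```

A fixed-length vector of this kind could then be concatenated with other encodings and passed to tree-based learners, which is the general pattern a stacked model builds on.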

Citing Articles

AlzGenPred - CatBoost-based gene classifier for predicting Alzheimer's disease using high-throughput sequencing data.

Shukla R, Singh T. Sci Rep. 2024; 14(1):30294.

PMID: 39639110 PMC: 11621786. DOI: 10.1038/s41598-024-82208-x.


A stacking ensemble model for predicting the occurrence of carotid atherosclerosis.

Zhang X, Tang C, Wang S, Liu W, Yang W, Wang D. Front Endocrinol (Lausanne). 2024; 15:1390352.

PMID: 39109079 PMC: 11300245. DOI: 10.3389/fendo.2024.1390352.


iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model.

Peng B, Sun G, Fan Y. BMC Bioinformatics. 2024; 25(1):224.

PMID: 38918692 PMC: 11201334. DOI: 10.1186/s12859-024-05849-9.


GP-HTNLoc: A graph prototype head-tail network-based model for multi-label subcellular localization prediction of ncRNAs.

Han S, Liu L. Comput Struct Biotechnol J. 2024; 23:2034-2048.

PMID: 38765609 PMC: 11101938. DOI: 10.1016/j.csbj.2024.04.052.


Recognition of cyanobacteria promoters via Siamese network-based contrastive learning under novel non-promoter generation.

Yang G, Li J, Hu J, Shi J. Brief Bioinform. 2024; 25(3).

PMID: 38701419 PMC: 11066903. DOI: 10.1093/bib/bbae193.

