Computational Prediction and Interpretation of Both General and Specific Types of Promoters in Escherichia coli by Exploiting a Stacked Ensemble-learning Framework

Overview

Journal Brief Bioinform

Publisher Oxford University Press

Specialty Biology

Date 2020 May 5

PMID 32363397

Citations 33

Authors

Fuyi Li

Jinxiang Chen

Zongyuan Ge

Ya Wen

Yanwei Yue

Morihiro Hayashida

Abdelkader Baggag

Halima Bensmail

Jiangning Song

Abstract

Promoters are short consensus DNA sequences that are responsible for the transcriptional activation or repression of all genes. Bacteria have many types of promoters, which play important roles in initiating gene transcription. Solving promoter-identification problems therefore has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for identifying promoters and classifying them by type. SELECTOR combines the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and uses five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests on benchmark datasets and independent tests on a newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in predicting both general and specific types of promoters in Escherichia coli. Furthermore, the framework provides essential interpretations that aid understanding of model success by leveraging the Shapley Additive exPlanation (SHAP) algorithm, thereby highlighting the most important features for predicting both general and specific types of promoters and overcoming the limitations of existing 'black-box' approaches, which are unable to reveal causal relationships from large amounts of initially encoded features.
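To make the first feature type named in the abstract concrete, the following is a minimal sketch of a composition of k-spaced nucleic acid pairs encoding. It counts, for each gap size, how often every ordered nucleotide pair occurs with that many positions between the two bases, and normalizes by the number of valid pairs. The function name `cksnap`, the `max_gap` parameter, its default value, and the choice to skip pairs containing non-ACGT characters are assumptions made for illustration; this is not the SELECTOR implementation.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# The 16 ordered nucleotide pairs, in a fixed order: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def cksnap(sequence, max_gap=2):
    """Return k-spaced nucleotide pair frequencies for gaps 0..max_gap.

    The result is a flat list of 16 * (max_gap + 1) values; each block of
    16 values corresponds to one gap size and follows the order of PAIRS.
    """
    seq = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # Pair the base at position i with the base gap + 1 positions later.
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # assumption: ignore pairs with N or other symbols
                counts[pair] += 1
                total += 1
        # Normalize by the number of valid pairs; all zeros if there are none.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features
```

For example, `cksnap("ACGT", max_gap=1)` returns 32 values: the first 16 describe adjacent pairs (AC, CG, and GT each occur once out of three pairs), and the next 16 describe pairs separated by one base (AG and CT). A vector of this kind could be concatenated with the other encodings before being passed to the base learners of a stacked model.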

Citing Articles

AlzGenPred - CatBoost-based gene classifier for predicting Alzheimer's disease using high-throughput sequencing data.

Shukla R, Singh T. Sci Rep. 2024; 14(1):30294.

PMID: 39639110 PMC: 11621786. DOI: 10.1038/s41598-024-82208-x.

A stacking ensemble model for predicting the occurrence of carotid atherosclerosis.

Zhang X, Tang C, Wang S, Liu W, Yang W, Wang D. Front Endocrinol (Lausanne). 2024; 15:1390352.

PMID: 39109079 PMC: 11300245. DOI: 10.3389/fendo.2024.1390352.

iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model.

Peng B, Sun G, Fan Y. BMC Bioinformatics. 2024; 25(1):224.

PMID: 38918692 PMC: 11201334. DOI: 10.1186/s12859-024-05849-9.

GP-HTNLoc: A graph prototype head-tail network-based model for multi-label subcellular localization prediction of ncRNAs.

Han S, Liu L. Comput Struct Biotechnol J. 2024; 23:2034-2048.

PMID: 38765609 PMC: 11101938. DOI: 10.1016/j.csbj.2024.04.052.

Recognition of cyanobacteria promoters via Siamese network-based contrastive learning under novel non-promoter generation.

Yang G, Li J, Hu J, Shi J. Brief Bioinform. 2024; 25(3).

PMID: 38701419 PMC: 11066903. DOI: 10.1093/bib/bbae193.
