» Articles » PMID: 23865838

A Fast Weak Motif-finding Algorithm Based on Community Detection in Graphs

Overview
Publisher Biomed Central
Specialty Biology
Date 2013 Jul 20
PMID 23865838
Citations 3
Authors
Affiliations
Soon will be listed here.
Abstract

Background: Identification of transcription factor binding sites (also called 'motif discovery') in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application.

Results: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal.

Conclusions: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.

Citing Articles

A Comparative Analysis Between k-Mers and Community Detection-Based Features for the Task of Protein Classification.

Tangirala K, Herndon N, Caragea D IEEE Trans Nanobioscience. 2016; 15(2):84-92.

PMID: 26863669 PMC: 6245644. DOI: 10.1109/TNB.2016.2523501.


MoTeX-II: structured MoTif eXtraction from large-scale datasets.

Pissis S BMC Bioinformatics. 2014; 15:235.

PMID: 25004797 PMC: 4227134. DOI: 10.1186/1471-2105-15-235.


Practical guidelines for the comprehensive analysis of ChIP-seq data.

Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, Liu T PLoS Comput Biol. 2013; 9(11):e1003326.

PMID: 24244136 PMC: 3828144. DOI: 10.1371/journal.pcbi.1003326.

References
1.
Elnitski L, Jin V, Farnham P, Jones S . Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Res. 2006; 16(12):1455-64. DOI: 10.1101/gr.4140006. View

2.
Plumbridge J . DNA binding sites for the Mlc and NagC proteins: regulation of nagE, encoding the N-acetylglucosamine-specific transporter in Escherichia coli. Nucleic Acids Res. 2001; 29(2):506-14. PMC: 29661. DOI: 10.1093/nar/29.2.506. View

3.
Eskin E, Pevzner P . Finding composite regulatory patterns in DNA sequences. Bioinformatics. 2002; 18 Suppl 1:S354-63. DOI: 10.1093/bioinformatics/18.suppl_1.s354. View

4.
Samitt C, Hansen F, Miller J, Schaechter M . In vivo studies of DnaA binding to the origin of replication of Escherichia coli. EMBO J. 1989; 8(3):989-93. PMC: 400901. DOI: 10.1002/j.1460-2075.1989.tb03462.x. View

5.
Hengen P, Bartram S, Stewart L, Schneider T . Information analysis of Fis binding sites. Nucleic Acids Res. 1998; 25(24):4994-5002. PMC: 147151. DOI: 10.1093/nar/25.24.4994. View