
Efficient Exact Motif Discovery

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2009 May 30
PMID: 19478010
Citations: 16
Abstract

Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but it still has not been solved satisfactorily in an exact and efficient manner. This is partly due to the many possible ways of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that are not guaranteed to find the best motif.

Results: We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called a probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first based on the total number of motif occurrences, the second based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing that our method outperforms other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs for Mycobacterium tuberculosis.
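To make the first score in (ii) concrete, the following is a minimal sketch of how a p-value for the total number of occurrences could be evaluated under a compound Poisson null. It is not the authors' Java implementation. It assumes the clump rate `lam` and the clump size distribution `clump_pmf` are already known (in the paper the latter is obtained with a PAA, which is not reproduced here), and it uses the standard Panjer recursion for compound Poisson distributions. All function and parameter names are hypothetical.

```python
import math


def compound_poisson_pmf(lam, clump_pmf, n_max):
    """Return [P(N=0), ..., P(N=n_max)] for N = C_1 + ... + C_K,
    where K ~ Poisson(lam) is the number of clumps and the clump
    sizes C_i are i.i.d. with P(C_i = k) = clump_pmf[k - 1], k >= 1.

    Uses the Panjer recursion for the compound Poisson case:
        P(N=n) = (lam / n) * sum_{k=1..n} k * q_k * P(N=n-k)
    """
    pmf = [math.exp(-lam)]  # no clumps means no occurrences
    for n in range(1, n_max + 1):
        total = 0.0
        for k in range(1, min(n, len(clump_pmf)) + 1):
            total += k * clump_pmf[k - 1] * pmf[n - k]
        pmf.append(lam / n * total)
    return pmf


def occurrence_p_value(lam, clump_pmf, observed):
    """P(N >= observed) under the compound Poisson null (assumed inputs)."""
    if observed <= 0:
        return 1.0
    below = compound_poisson_pmf(lam, clump_pmf, observed - 1)
    # Complement of the lower tail; clamp against rounding below zero.
    return max(0.0, 1.0 - sum(below))


# Hypothetical inputs: 1.5 expected clumps, clumps of size 1 or 2.
p = occurrence_p_value(1.5, [0.8, 0.2], observed=6)
```

Computing the upper tail as a complement loses precision for very small p-values; a production implementation would likely sum the upper tail directly or work in log space. With `clump_pmf = [1.0]` (every clump has exactly one occurrence) the sketch reduces to an ordinary Poisson tail, which is a convenient sanity check.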

Availability and Implementation: The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/.

Citing Articles

Proxi-RIMS-seq2 applied to native microbiomes uncovers hundreds of known and novel C methyltransferase specificities.

Yang W, Luyten Y, Reister E, Mangelson H, Sisson Z, Auch B. bioRxiv. 2024.

PMID: 39071437 PMC: 11275837. DOI: 10.1101/2024.07.15.603628.


A Survey of Archaeal Restriction-Modification Systems.

Anton B, Roberts R. Microorganisms. 2023; 11(10).

PMID: 37894082 PMC: 10609329. DOI: 10.3390/microorganisms11102424.


Fast and exact quantification of motif occurrences in biological sequences.

Prosperi M, Marini S, Boucher C. BMC Bioinformatics. 2021; 22(1):445.

PMID: 34537012 PMC: 8449872. DOI: 10.1186/s12859-021-04355-6.


Rapid identification of methylase specificity (RIMS-seq) jointly identifies methylated motifs and generates shotgun sequencing of bacterial genomes.

Baum C, Lin Y, Fomenkov A, Anton B, Chen L, Yan B. Nucleic Acids Res. 2021; 49(19):e113.

PMID: 34417598 PMC: 8565308. DOI: 10.1093/nar/gkab705.


DiNAMO: highly sensitive DNA motif discovery in high-throughput sequencing data.

Saad C, Noe L, Richard H, Leclerc J, Buisine M, Touzet H. BMC Bioinformatics. 2018; 19(1):223.

PMID: 29890948 PMC: 5996464. DOI: 10.1186/s12859-018-2215-1.

