
Fast and Exact Quantification of Motif Occurrences in Biological Sequences

Overview
Publisher Biomed Central
Specialty Biology
Date 2021 Sep 19
PMID 34537012
Citations 3
Abstract

Background: Identifying motifs and quantifying their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to use on large motif sets. Approximate formulae, e.g. based on the compound Poisson distribution, are faster, but reliable p-value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and positioning it to replace currently employed heuristics.
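To make the notion of an exact motif count distribution concrete, the following is a minimal sketch of a textbook automaton-based dynamic program. It is not the formula implemented in motif_prob: it assumes a zero-order (independent bases) background model rather than a Markov chain, counts overlapping occurrences on a single strand, and is only practical for small inputs. The function names `kmp_automaton`, `count_distribution`, and `p_value` are hypothetical and chosen for illustration.

```python
def kmp_automaton(motif, alphabet="ACGT"):
    """Build transitions delta[state][char] -> next state, where a state is
    the number of motif characters currently matched (overlaps allowed)."""
    m = len(motif)
    # fail[i] = length of the longest proper border of motif[:i]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k and motif[i] != motif[k]:
            k = fail[k]
        if motif[i] == motif[k]:
            k += 1
        fail[i + 1] = k
    delta = []
    for state in range(m + 1):
        row = {}
        for c in alphabet:
            s = fail[state] if state == m else state
            while s and motif[s] != c:
                s = fail[s]
            if motif[s] == c:
                s += 1
            row[c] = s
        delta.append(row)
    return delta


def count_distribution(motif, n, probs, max_count):
    """Return a list d where d[k] = P(count == k) for k < max_count and
    d[max_count] = P(count >= max_count), for a random sequence of length n
    whose bases are drawn independently with probabilities `probs`."""
    m = len(motif)
    delta = kmp_automaton(motif, alphabet="".join(probs))
    # dist[state][k]: probability of being in `state` with k occurrences seen
    dist = [[0.0] * (max_count + 1) for _ in range(m + 1)]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (max_count + 1) for _ in range(m + 1)]
        for s in range(m + 1):
            for k in range(max_count + 1):
                p = dist[s][k]
                if p == 0.0:
                    continue
                for c, pc in probs.items():
                    t = delta[s][c]
                    # reaching state m means one more occurrence just ended
                    k2 = min(k + (t == m), max_count)
                    new[t][k2] += p * pc
        dist = new
    return [sum(dist[s][k] for s in range(m + 1)) for k in range(max_count + 1)]


def p_value(motif, n, probs, observed):
    """Upper-tail probability P(count >= observed) under the same model."""
    return count_distribution(motif, n, probs, observed)[-1]
```

The cost of this sketch grows with the product of sequence length, motif length, and the count cap, which illustrates why naive exact computation becomes impractical for millions of motifs on genome-scale sequences.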

Results: We implement motif_prob in both Perl and C++, using an efficient error-bounded iterative process for the exact formula, and provide a comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software can process a million motifs (13-31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude shorter than those of MoSDi (50-1000× faster), even when MoSDi uses its fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then show how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.
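The dependence on GC content can be illustrated with a small sketch of the expected motif count under a simple background model. This is an assumption-laden illustration, not the computation performed by motif_prob: it assumes independent bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, a single strand, and the helper names `base_probs` and `expected_count` are hypothetical.

```python
def base_probs(gc):
    """Base probabilities for a genome with GC fraction `gc` (0 < gc < 1),
    assuming G and C are equally likely, as are A and T."""
    at = 1.0 - gc
    return {"A": at / 2, "T": at / 2, "G": gc / 2, "C": gc / 2}


def expected_count(motif, n, gc):
    """Expected number of (possibly overlapping) occurrences of `motif`
    in a random sequence of length n with independent bases."""
    probs = base_probs(gc)
    p = 1.0
    for c in motif:
        p *= probs[c]
    # there are n - len(motif) + 1 possible start positions
    return max(n - len(motif) + 1, 0) * p


if __name__ == "__main__":
    motif = "GCGCGCGCGCGCG"  # hypothetical 13-base GC-rich motif
    for gc in (0.3, 0.5, 0.7):
        print(gc, expected_count(motif, 5_000_000, gc))
```

Under these assumptions a GC-rich motif is expected far more often in a GC-rich genome than in an AT-rich one, so the same raw count can be unremarkable in one organism and strongly enriched in another. This is why a p-value, rather than a raw count, is the appropriate basis for comparing enrichment across bacteria.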

Conclusions: The motif_prob software is a multi-platform, efficient, open-source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools to quantify enrichment and deviation from expected frequency ranges with exact p-values, without loss of data processing efficiency.

Citing Articles

Optimizing resource utilization for large scale problems through architecture aware scheduling.

Elsawwaf A, Aly G, Faheem H, Fayez M Sci Rep. 2024; 14(1):26356.

PMID: 39487166 PMC: 11530424. DOI: 10.1038/s41598-024-75711-8.


An average-case efficient two-stage algorithm for enumerating all longest common substrings of minimum length between genome pairs.

Prosperi M, Marini S, Boucher C Proc (IEEE Int Conf Healthc Inform). 2024; 2024:93-102.

PMID: 39308639 PMC: 11412151. DOI: 10.1109/ichi61247.2024.00020.


OCTOPUS: Disk-based, Multiplatform, Mobile-friendly Metagenomics Classifier.

Marini S, Barquero A, Wadhwani A, Bian J, Ruiz J, Boucher C bioRxiv. 2024.

PMID: 38559026 PMC: 10979967. DOI: 10.1101/2024.03.15.585215.
