» Articles » PMID: 21596775

ESPRIT-Tree: Hierarchical Clustering Analysis of Millions of 16S RRNA Pyrosequences in Quasilinear Computational Time

Overview
Specialty Biochemistry
Date 2011 May 21
PMID 21596775
Citations 63
Authors
Affiliations
Soon will be listed here.
Abstract

Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.

Citing Articles

Accurately clustering biological sequences in linear time by relatedness sorting.

Wright E Nat Commun. 2024; 15(1):3047.

PMID: 38589369 PMC: 11001989. DOI: 10.1038/s41467-024-47371-9.


Alignment-free comparison of metagenomics sequences via approximate string matching.

Chen J, Yang L, Li L, Goodison S, Sun Y Bioinform Adv. 2022; 2(1):vbac077.

PMID: 36388153 PMC: 9645238. DOI: 10.1093/bioadv/vbac077.


High-throughput proteomics: a methodological mini-review.

Cui M, Cheng C, Zhang L Lab Invest. 2022; 102(11):1170-1181.

PMID: 35922478 PMC: 9362039. DOI: 10.1038/s41374-022-00830-7.


Machine Learning Advances in Microbiology: A Review of Methods and Applications.

Jiang Y, Luo J, Huang D, Liu Y, Li D Front Microbiol. 2022; 13:925454.

PMID: 35711777 PMC: 9196628. DOI: 10.3389/fmicb.2022.925454.


Machine Learning as a Tool in Investigating the Possible Role of Microbiome in Development and Treatment of Cancer.

Hajeebu S, Ngembus N, Bandi P, Panigrahy P, Heindl S Cureus. 2021; 13(8):e17415.

PMID: 34589326 PMC: 8459918. DOI: 10.7759/cureus.17415.


References
1.
Schloss P, Handelsman J . Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol. 2005; 71(3):1501-6. PMC: 1065144. DOI: 10.1128/AEM.71.3.1501-1506.2005. View

2.
Franti P, Virmajoki O, Hautamaki V . Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans Pattern Anal Mach Intell. 2006; 28(11):1875-81. DOI: 10.1109/TPAMI.2006.227. View

3.
Schloss P, Westcott S, Ryabin T, Hall J, Hartmann M, Hollister E . Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009; 75(23):7537-41. PMC: 2786419. DOI: 10.1128/AEM.01541-09. View

4.
Dethlefsen L, Huse S, Sogin M, Relman D . The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biol. 2008; 6(11):e280. PMC: 2586385. DOI: 10.1371/journal.pbio.0060280. View

5.
Edgar R . Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010; 26(19):2460-1. DOI: 10.1093/bioinformatics/btq461. View