ESPRIT-Tree: Hierarchical Clustering Analysis of Millions of 16S rRNA Pyrosequences in Quasilinear Computational Time

Overview

Journal: Nucleic Acids Res

Publisher: Oxford University Press

Specialty: Biochemistry

Date: 2011 May 21

PMID: 21596775

Citations: 63

Authors

Yunpeng Cai

Yijun Sun


Abstract

Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree built with a pseudometric, then recursively refine the clustering structure within these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits quasilinear time and space complexity comparable to that of greedy heuristic clustering algorithms, while achieving accuracy similar to that of the standard hierarchical clustering algorithm.
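The idea of representing a cluster as a probabilistic sequence can be illustrated with a minimal Python sketch. It is not the ESPRIT-Tree implementation. It assumes the sequences are already aligned to a common length (the alignment operations mentioned in the abstract are skipped), and the names `make_profile`, `merge_profiles`, and `profile_distance`, the five-symbol alphabet, and the expected-mismatch distance are all hypothetical choices made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT-"  # assumed alphabet: four bases plus a gap symbol


def make_profile(aligned_seqs):
    """Build per-column symbol probabilities from equal-length aligned sequences."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to the same length")
    if any(ch not in ALPHABET for s in aligned_seqs for ch in s):
        raise ValueError("unexpected symbol outside the assumed alphabet")
    n = len(aligned_seqs)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        columns.append({a: counts.get(a, 0) / n for a in ALPHABET})
    return {"size": n, "columns": columns}


def merge_profiles(p, q):
    """Merge two clusters as a size-weighted average of their column distributions."""
    if len(p["columns"]) != len(q["columns"]):
        raise ValueError("profiles must have the same number of columns")
    total = p["size"] + q["size"]
    columns = [
        {a: (p["size"] * cp[a] + q["size"] * cq[a]) / total for a in ALPHABET}
        for cp, cq in zip(p["columns"], q["columns"])
    ]
    return {"size": total, "columns": columns}


def profile_distance(p, q):
    """Average per-column probability that symbols drawn from p and q differ.

    This is an illustrative measure, not the genetic distance defined in the paper.
    """
    if len(p["columns"]) != len(q["columns"]):
        raise ValueError("profiles must have the same number of columns")
    mismatch = 0.0
    for cp, cq in zip(p["columns"], q["columns"]):
        match = sum(cp[a] * cq[a] for a in ALPHABET)
        mismatch += 1.0 - match
    return mismatch / len(p["columns"])


if __name__ == "__main__":
    a = make_profile(["ACGT", "ACGA"])
    b = make_profile(["ACGT"])
    print(profile_distance(a, b))
    print(merge_profiles(a, b)["size"])
```

The point of the sketch is that a merged cluster is summarized by one fixed-size table, so comparing two clusters costs time proportional to the sequence length rather than to the product of the cluster sizes. Note that this particular measure is not zero when a heterogeneous profile is compared with itself, which is one reason to read it only as an illustration.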

Citing Articles

Accurately clustering biological sequences in linear time by relatedness sorting.

Wright E. Nat Commun. 2024; 15(1):3047.

PMID: 38589369 PMC: 11001989. DOI: 10.1038/s41467-024-47371-9.

Alignment-free comparison of metagenomics sequences via approximate string matching.

Chen J, Yang L, Li L, Goodison S, Sun Y. Bioinform Adv. 2022; 2(1):vbac077.

PMID: 36388153 PMC: 9645238. DOI: 10.1093/bioadv/vbac077.

High-throughput proteomics: a methodological mini-review.

Cui M, Cheng C, Zhang L. Lab Invest. 2022; 102(11):1170-1181.

PMID: 35922478 PMC: 9362039. DOI: 10.1038/s41374-022-00830-7.

Machine Learning Advances in Microbiology: A Review of Methods and Applications.

Jiang Y, Luo J, Huang D, Liu Y, Li D. Front Microbiol. 2022; 13:925454.

PMID: 35711777 PMC: 9196628. DOI: 10.3389/fmicb.2022.925454.

Machine Learning as a Tool in Investigating the Possible Role of Microbiome in Development and Treatment of Cancer.

Hajeebu S, Ngembus N, Bandi P, Panigrahy P, Heindl S. Cureus. 2021; 13(8):e17415.

PMID: 34589326 PMC: 8459918. DOI: 10.7759/cureus.17415.
