
Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of K-mer and Site-Based Methods for Inferring the Genetic Distances Among Tens of Thousands of Salmonella Samples

Overview
Journal PLoS One
Date 2016 Nov 11
PMID 27832109
Citations 10
Abstract

The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to quickly interrogate them to accurately determine the genetic distances among a set of samples. Doing so is challenging because of both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues. We evaluated seven measures of genetic distance, estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). When analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates), features of the data (e.g., genomic, assembly, and contamination) cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared to distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to perform them may be challenging when analyzing large databases.
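The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: every function name and the toy sequences are hypothetical, k-mers are counted exactly and without reverse-complement canonicalization, and the Mash distance is computed from an exact Jaccard index rather than from a MinHash sketch as the Mash tool does. The site-based function is a simplified stand-in for alignment-based comparison (it is not NUCmer or an MLST scheme); it assumes pre-aligned sequences in which `-` and `N` mark missing data.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq (no reverse complements)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| on k-mer presence/absence."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def jaccard_distance(a, b):
    return 1.0 - jaccard_index(a, b)


def manhattan_distance(a, b):
    """Sum of absolute count differences; a missing k-mer counts as 0."""
    return sum(abs(a[x] - b[x]) for x in set(a) | set(b))


def euclidean_distance(a, b):
    """Square root of summed squared count differences."""
    return math.sqrt(sum((a[x] - b[x]) ** 2 for x in set(a) | set(b)))


def mash_distance(j, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); defined as 1 when j = 0."""
    if j <= 0.0:
        return 1.0
    return -math.log(2.0 * j / (1.0 + j)) / k


def site_distance(aligned_a, aligned_b, missing="-N"):
    """Count differing sites, skipping positions where either is missing."""
    return sum(
        1
        for x, y in zip(aligned_a, aligned_b)
        if x not in missing and y not in missing and x != y
    )


# Toy example (hypothetical data): the second sample is the same sequence
# with its last three bases absent, e.g. because of an incomplete assembly.
K = 3
full = "ACGTACGGTCA"
partial = "ACGTACGG"

p_full = kmer_profile(full, K)
p_partial = kmer_profile(partial, K)

print("Jaccard distance:  ", jaccard_distance(p_full, p_partial))
print("Manhattan distance:", manhattan_distance(p_full, p_partial))
print("Euclidean distance:", euclidean_distance(p_full, p_partial))
print("Mash distance:     ", mash_distance(jaccard_index(p_full, p_partial), K))
print("Site differences:  ", site_distance(full, partial + "---"))
```

In this toy case no observed nucleotide differs between the two samples, so the site-based count is zero, while all of the k-mer measures report a nonzero distance because the k-mers lost with the missing bases are treated as real differences.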

Citing Articles

Assessment of plasmids for relating the 2020 Salmonella enterica serovar Newport onion outbreak to farms implicated by the outbreak investigation.

Commichaux S, Rand H, Javkar K, Molloy E, Pettengill J, Pightling A. BMC Genomics. 2023; 24(1):165.

PMID: 37016310 PMC: 10074901. DOI: 10.1186/s12864-023-09245-0.


Polyphyly in widespread serovars and using genomic proximity to choose the best reference genome for bioinformatics analyses.

Cherchame E, Ilango G, Noel V, Cadel-Six S. Front Public Health. 2022; 10:963188.

PMID: 36159272 PMC: 9493441. DOI: 10.3389/fpubh.2022.963188.


Evaluation of various distance computation methods for construction of haplotype-based phylogenies from large MLST datasets.

Jacobson D, Zheng Y, Plucinski M, Qvarnstrom Y, Barratt J. Mol Phylogenet Evol. 2022; 177:107608.

PMID: 35963590 PMC: 10127246. DOI: 10.1016/j.ympev.2022.107608.


Using Evolutionary Analyses to Refine Whole-Genome Sequence Match Criteria.

Pightling A, Rand H, Pettengill J. Front Microbiol. 2022; 13:797997.

PMID: 35875579 PMC: 9301902. DOI: 10.3389/fmicb.2022.797997.


K-mer based prediction of relatedness and ribotypes.

Moore M, Wilcox M, Walker A, Eyre D. Microb Genom. 2022; 8(4).

PMID: 35384833 PMC: 9453075. DOI: 10.1099/mgen.0.000804.

