SPRISS: Approximating Frequent K-mers by Sampling Reads, and Applications

Overview
Journal Bioinformatics
Specialty Biology
Date 2022 May 18
PMID 35583271
Abstract

Motivation: The extraction of k-mers is a fundamental component of many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. Extracting all k-mers and their frequencies is extremely demanding in running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, several applications require only the frequent k-mers, that is, k-mers appearing in a relatively high proportion of the data.

Results: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful read-sampling scheme to extract a representative subset of the dataset. This subset can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required to analyze the whole dataset, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required to analyze the whole dataset.
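The general idea of sampling reads and then running a k-mer counter on the sample can be illustrated with a minimal Python sketch. This is not the SPRISS implementation: the function names, the uniform sampling with replacement, the `sample_size` parameter, and the definition of frequency as occurrences divided by the total number of k-mer positions are all assumptions made for illustration. The abstract does not specify how SPRISS chooses its sample or its sample size.

```python
import random
from collections import Counter


def kmer_frequencies(reads, k):
    """Map each k-mer to (its occurrences) / (total k-mer positions in reads).

    Assumed definition of frequency, chosen for illustration only.
    """
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute no k-mers (the range is empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def approx_frequent_kmers(reads, k, theta, sample_size, seed=0):
    """Estimate the k-mers with frequency >= theta from a random sample of reads.

    Hypothetical helper: draws `sample_size` reads uniformly at random with
    replacement, then counts k-mers on the sample only.
    """
    rng = random.Random(seed)
    sample = [rng.choice(reads) for _ in range(sample_size)]
    freqs = kmer_frequencies(sample, k)
    return {kmer: f for kmer, f in freqs.items() if f >= theta}
```

Calling `approx_frequent_kmers(reads, k=31, theta=1e-6, sample_size=len(reads) // 10)` would count k-mers on roughly a tenth of the reads. How close the estimates come to the exact frequencies depends on the sample size, which this sketch leaves to the caller.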

Availability and Implementation: SPRISS is available at https://github.com/VandinLab/SPRISS. A preliminary version of this work (Santoro et al., 2021) was presented at RECOMB 2021.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

SPRISS: approximating frequent k-mers by sampling reads, and applications.

Santoro D, Pellegrina L, Comin M, Vandin F. Bioinformatics. 2022; 38(13):3343-3350.

PMID: 35583271 PMC: 9237683. DOI: 10.1093/bioinformatics/btac180.
