
kWIP: The k-mer Weighted Inner Product, a De Novo Estimator of Genetic Similarity

Overview
Specialty: Biology
Date: 2017 Sep 6
PMID: 28873405
Citations: 27
Abstract

Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or "samples") in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn from mislabelled or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly- and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes, and its applications include establishing sample identity and detecting sample mix-ups, non-obvious genomic variation, and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets, we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip.
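
The idea of a weighted inner product over k-mer counts can be illustrated with a small Python sketch. This is not the kWIP implementation: the abstract does not give the weighting formula, so the entropy-based weights and the normalisation step below are assumptions made for illustration. Exact dictionaries stand in for the probabilistic data structure, and all function names are hypothetical.

```python
from collections import Counter
from math import log2, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in one sequence (exact, not probabilistic)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def entropy_weights(samples):
    """Assumed weighting: Shannon entropy of each k-mer's presence frequency.

    k-mers found in every sample get weight 0, and k-mers found in about
    half of the samples get the largest weight.
    """
    n = len(samples)
    presence = Counter()
    for counts in samples:
        presence.update(counts.keys())
    weights = {}
    for kmer, seen in presence.items():
        p = seen / n
        weights[kmer] = 0.0 if p == 1.0 else -(p * log2(p) + (1 - p) * log2(1 - p))
    return weights


def weighted_inner_product(a, b, weights):
    """Sum of weight * count_a * count_b over the k-mers shared by a and b."""
    return sum(weights[kmer] * a[kmer] * b[kmer] for kmer in a.keys() & b.keys())


def distance_matrix(samples, weights):
    """Turn pairwise similarities into distances (assumed normalisation)."""
    n = len(samples)
    self_sim = [weighted_inner_product(s, s, weights) for s in samples]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            norm = sqrt(self_sim[i] * self_sim[j])
            sim = weighted_inner_product(samples[i], samples[j], weights) / norm if norm else 0.0
            # Clamp at zero to absorb floating-point rounding for identical samples.
            dist[i][j] = dist[j][i] = sqrt(max(0.0, 2.0 - 2.0 * sim))
    return dist
```

Under these assumptions, two identical samples end up at a distance of roughly zero, and two samples that share no k-mers end up at the maximum distance of sqrt(2). The resulting matrix can be passed to any standard clustering or ordination tool.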

Citing Articles

Local Genomic Surveillance of Invasive in Eastern North Carolina (ENC) in 2022-2023.

Huang W, Markantonis J, Yin C, Pozdol J, Briley K, Fallon J Int J Mol Sci. 2024; 25(15).

PMID: 39125755 PMC: 11311789. DOI: 10.3390/ijms25158179.


Comparison of k-mer-based comparative metagenomic tools and approaches.

Ponsero A, Miller M, Hurwitz B Microbiome Res Rep. 2023; 2(4):27.

PMID: 38058765 PMC: 10696585. DOI: 10.20517/mrr.2023.26.


Whole genome sequencing of human Borrelia burgdorferi isolates reveals linked blocks of accessory genome elements located on plasmids and associated with human dissemination.

Lemieux J, Huang W, Hill N, Cerar T, Freimark L, Hernandez S PLoS Pathog. 2023; 19(8):e1011243.

PMID: 37651316 PMC: 10470944. DOI: 10.1371/journal.ppat.1011243.


Whole genome sequencing of isolates reveals linked clusters of plasmid-borne accessory genome elements associated with virulence.

Lemieux J, Huang W, Hill N, Cerar T, Freimark L, Hernandez S bioRxiv. 2023.

PMID: 36909473 PMC: 10002713. DOI: 10.1101/2023.02.26.530159.


Feature extraction based on microstate sequences for EEG-based emotion recognition.

Chen J, Zhao Z, Shu Q, Cai G Front Psychol. 2023; 13:1065196.

PMID: 36619090 PMC: 9816384. DOI: 10.3389/fpsyg.2022.1065196.

