» Articles » PMID: 31510689

Efficient Haplotype Matching Between a Query and a Panel for Genealogical Search

Overview
Journal Bioinformatics
Specialty Biology
Date 2019 Sep 13
PMID 31510689
Citations 6
Authors
Affiliations
Soon will be listed here.
Abstract

Motivation: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin's Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple 'good enough' matches are desired.

Results: In this work, we developed two algorithmic extensions of Durbin's Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce 'virtual insertion' of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search.

Availability And Implementation: genome.ucf.edu/pbwt-query.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

Dynamic -PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data.

Shakya P, Sanaullah A, Zhi D, Zhang S bioRxiv. 2025; .

PMID: 39975111 PMC: 11838588. DOI: 10.1101/2025.02.04.636479.


Haplotype Matching with GBWT for Pangenome Graphs.

Sanaullah A, Villalobos S, Zhi D, Zhang S bioRxiv. 2025; .

PMID: 39975036 PMC: 11838520. DOI: 10.1101/2025.02.03.634410.


Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank.

Naseri A, Zhi D, Zhang S Elife. 2024; 13.

PMID: 38905121 PMC: 11249732. DOI: 10.7554/eLife.81698.


GRAPE: genomic relatedness detection pipeline.

Medvedev A, Lebedev M, Ponomarev A, Kosaretskiy M, Osipenko D, Tischenko A F1000Res. 2023; 11:589.

PMID: 37224332 PMC: 10182380. DOI: 10.12688/f1000research.111658.2.


Syllable-PBWT for space-efficient haplotype long-match query.

Wang V, Naseri A, Zhang S, Zhi D Bioinformatics. 2022; 39(1).

PMID: 36440908 PMC: 9805553. DOI: 10.1093/bioinformatics/btac734.


References
1.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D . PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007; 81(3):559-75. PMC: 1950838. DOI: 10.1086/519795. View

2.
Gusev A, Lowe J, Stoffel M, Daly M, Altshuler D, Breslow J . Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2008; 19(2):318-26. PMC: 2652213. DOI: 10.1101/gr.081398.108. View

3.
Chen G, Marjoram P, Wall J . Fast and flexible simulation of DNA sequence data. Genome Res. 2008; 19(1):136-42. PMC: 2612967. DOI: 10.1101/gr.083634.108. View

4.
Manichaikul A, Mychaleckyj J, Rich S, Daly K, Sale M, Chen W . Robust relationship inference in genome-wide association studies. Bioinformatics. 2010; 26(22):2867-73. PMC: 3025716. DOI: 10.1093/bioinformatics/btq559. View

5.
Browning B, Browning S . A fast, powerful method for detecting identity by descent. Am J Hum Genet. 2011; 88(2):173-82. PMC: 3035716. DOI: 10.1016/j.ajhg.2011.01.010. View