Efficient Haplotype Matching Between a Query and a Panel for Genealogical Search

Overview

Journal Bioinformatics

Publisher Oxford University Press

Specialty Biology

Date 2019 Sep 13

PMID 31510689

Citations 6

Authors

Ardalan Naseri

Erwin Holzhauser

Degui Zhi

Shaojie Zhang


Abstract

Motivation: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant-time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin's Algorithm 5) can only identify set-maximal matches, that is, the longest matches ending at any location in a panel, whereas real genealogical search scenarios call for multiple 'good enough' matches.
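As a rough illustration of the index mentioned above, the following Python sketch builds the positional prefix and divergence arrays one site at a time, in the spirit of the construction described by Durbin (2014). The function name `pbwt_prefix_arrays` and the list-of-lists panel layout are assumptions made for this example and are not taken from the published software.

```python
from typing import List, Tuple


def pbwt_prefix_arrays(panel: List[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """For each site k, return (a, d) computed after processing sites 0..k.

    a: haplotype indices sorted by their reversed prefix panel[i][k::-1].
    d: d[j] is the start of the longest match, ending at site k inclusive,
       between haplotype a[j] and its predecessor a[j-1] in that order.
    Assumes a biallelic panel encoded as 0/1 (an assumption of this sketch).
    """
    m = len(panel)
    n = len(panel[0]) if m else 0
    a = list(range(m))   # initial order: input order
    d = [0] * m          # initial divergence values
    out = []
    for k in range(n):
        a0, a1, d0, d1 = [], [], [], []
        p = q = k + 1    # k + 1 means "no match with the predecessor yet"
        for i, div in zip(a, d):
            p = max(p, div)
            q = max(q, div)
            if panel[i][k] == 0:
                a0.append(i)
                d0.append(p)
                p = 0
            else:
                a1.append(i)
                d1.append(q)
                q = 0
        # Stable partition: haplotypes carrying 0 at site k come first.
        a, d = a0 + a1, d0 + d1
        out.append((a, d))
    return out
```

Because haplotypes that share a long suffix up to site k end up adjacent in `a`, matches against the panel can be located by looking at neighbours in this order rather than by scanning every haplotype.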

Results: In this work, we developed two algorithmic extensions of Durbin's Algorithm 5 that can find all L-long matches, that is, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce 'virtual insertion' of the query into the PBWT matrix of the panel and then scan up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using simulated data and UK Biobank data. Our results show that the proposed algorithms can efficiently detect related individuals for a given query in very large cohorts, which enables fast online query search.
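To make the notion of an L-long match concrete, here is a deliberately naive Python reference that compares the query against every haplotype in turn. It is not PBWT-Query or L-PBWT-Query, and it runs in time proportional to the panel size times the number of sites, but it produces the kind of output those algorithms are described as finding. The function name `l_long_matches` is hypothetical, and measuring length in number of sites is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def l_long_matches(
    query: Sequence[int], panel: Sequence[Sequence[int]], L: int
) -> List[Tuple[int, int, int]]:
    """Return (haplotype_index, start, end) for every maximal run with
    query[start:end] == panel[i][start:end] and end - start >= L.

    Brute-force baseline for illustration only; it uses no PBWT index.
    """
    n = len(query)
    matches = []
    for i, hap in enumerate(panel):
        start = 0  # start of the current run of identical sites
        for k in range(n + 1):
            # A run ends at a mismatch or at the end of the sequence.
            if k == n or hap[k] != query[k]:
                if k - start >= L:
                    matches.append((i, start, k))
                start = k + 1
    return matches


panel = [
    [0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
]
query = [0, 1, 0, 1, 1, 0]
print(l_long_matches(query, panel, 3))
# prints [(0, 0, 6), (1, 3, 6), (2, 1, 5)]
```

Unlike a set-maximal search, which would report only haplotype 0 as the longest match in this toy panel, the L-long definition also reports the shorter shared segments with haplotypes 1 and 2.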

Availability And Implementation: genome.ucf.edu/pbwt-query.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

Dynamic μ-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data.

Shakya P, Sanaullah A, Zhi D, Zhang S. bioRxiv. 2025.

PMID: 39975111 PMC: 11838588. DOI: 10.1101/2025.02.04.636479.

Haplotype Matching with GBWT for Pangenome Graphs.

Sanaullah A, Villalobos S, Zhi D, Zhang S. bioRxiv. 2025.

PMID: 39975036 PMC: 11838520. DOI: 10.1101/2025.02.03.634410.

Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank.

Naseri A, Zhi D, Zhang S. eLife. 2024; 13.

PMID: 38905121 PMC: 11249732. DOI: 10.7554/eLife.81698.

GRAPE: genomic relatedness detection pipeline.

Medvedev A, Lebedev M, Ponomarev A, Kosaretskiy M, Osipenko D, Tischenko A. F1000Res. 2023; 11:589.

PMID: 37224332 PMC: 10182380. DOI: 10.12688/f1000research.111658.2.

Syllable-PBWT for space-efficient haplotype long-match query.

Wang V, Naseri A, Zhang S, Zhi D. Bioinformatics. 2022; 39(1).

PMID: 36440908 PMC: 9805553. DOI: 10.1093/bioinformatics/btac734.

References

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007; 81(3):559-75. PMC: 1950838. DOI: 10.1086/519795.

Gusev A, Lowe J, Stoffel M, Daly M, Altshuler D, Breslow J. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2008; 19(2):318-26. PMC: 2652213. DOI: 10.1101/gr.081398.108.

Chen G, Marjoram P, Wall J. Fast and flexible simulation of DNA sequence data. Genome Res. 2008; 19(1):136-42. PMC: 2612967. DOI: 10.1101/gr.083634.108.

Manichaikul A, Mychaleckyj J, Rich S, Daly K, Sale M, Chen W. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010; 26(22):2867-73. PMC: 3025716. DOI: 10.1093/bioinformatics/btq559.

Browning B, Browning S. A fast, powerful method for detecting identity by descent. Am J Hum Genet. 2011; 88(2):173-82. PMC: 3035716. DOI: 10.1016/j.ajhg.2011.01.010.

Browning B, Browning S. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. 2013; 194(2):459-71. PMC: 3664855. DOI: 10.1534/genetics.113.150029.

Thompson E. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics. 2013; 194(2):301-26. PMC: 3664843. DOI: 10.1534/genetics.112.148825.

Browning B, Browning S. Detecting identity by descent and estimating genotype error rates in sequence data. Am J Hum Genet. 2013; 93(5):840-51. PMC: 3824133. DOI: 10.1016/j.ajhg.2013.09.014.

Durbin R. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics. 2014; 30(9):1266-72. PMC: 3998136. DOI: 10.1093/bioinformatics/btu014.

Rodriguez J, Bercovici S, Huang L, Frostig R, Batzoglou S. Parente2: a fast and accurate method for detecting identity by descent. Genome Res. 2014; 25(2):280-9. PMC: 4315301. DOI: 10.1101/gr.173641.114.

Campbell N, Harmon S, Narum S. Genotyping-in-Thousands by sequencing (GT-seq): A cost effective SNP genotyping method based on custom amplicon sequencing. Mol Ecol Resour. 2014; 15(4):855-67. DOI: 10.1111/1755-0998.12357.

Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015; 12(3):e1001779. PMC: 4380465. DOI: 10.1371/journal.pmed.1001779.

Jiang Z, Wang H, Michal J, Zhou X, Liu B, Solberg Woods L. Genome Wide Sampling Sequencing for SNP Genotyping: Methods, Challenges and Future Development. Int J Biol Sci. 2016; 12(1):100-8. PMC: 4679402. DOI: 10.7150/ijbs.13498.

Loh P, Palamara P, Price A. Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet. 2016; 48(7):811-6. PMC: 4925291. DOI: 10.1038/ng.3571.

Loh P, Danecek P, Palamara P, Fuchsberger C, Reshef Y, Finucane H. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 2016; 48(11):1443-1448. PMC: 5096458. DOI: 10.1038/ng.3679.

Khan R, Mittelman D. Consumer genomics will change your life, whether you get tested or not. Genome Biol. 2018; 19(1):120. PMC: 6100720. DOI: 10.1186/s13059-018-1506-1.

Bycroft C, Freeman C, Petkova D, Band G, Elliott L, Sharp K. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018; 562(7726):203-209. PMC: 6786975. DOI: 10.1038/s41586-018-0579-z.

Erlich Y, Shor T, Peer I, Carmi S. Identity inference of genomic data using long-range familial searches. Science. 2018; 362(6415):690-694. PMC: 7549546. DOI: 10.1126/science.aau4832.