
Accurate, Scalable and Integrative Haplotype Estimation

Overview
Journal: Nat Commun
Specialty: Biology
Date: 2019 Nov 30
PMID: 31780650
Citations: 249
Abstract

The number of human genomes being genotyped or sequenced is increasing exponentially, and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods for processing large genotype and high-coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open-source format and demonstrate its accuracy and running times on two gold-standard datasets: the UK Biobank data and the Genome in a Bottle data.

Citing Articles

Machine Learning Methods for Classifying Multiple Sclerosis and Alzheimer's Disease Using Genomic Data.

Arnal Segura M, Bini G, Krithara A, Paliouras G, Tartaglia G. Int J Mol Sci. 2025; 26(5).

PMID: 40076709 PMC: 11900513. DOI: 10.3390/ijms26052085.


Improving genetic variant identification for quantitative traits using ensemble learning-based approaches.

Sharma J, Jangale V, Shekhawat R, Yadav P. BMC Genomics. 2025; 26(1):237.

PMID: 40075256 PMC: 11899862. DOI: 10.1186/s12864-025-11443-x.


Evaluation of genomic selection models using whole genome sequence data and functional annotation in Belgian Blue cattle.

Yuan C, Gillon A, Gualdron Duarte J, Takeda H, Coppieters W, Georges M. Genet Sel Evol. 2025; 57(1):10.

PMID: 40038647 PMC: 11881496. DOI: 10.1186/s12711-025-00955-5.


ralphi: a deep reinforcement learning framework for haplotype assembly.

Battistella E, Maheshwari A, Ekim B, Berger B, Popic V. bioRxiv. 2025.

PMID: 40027721 PMC: 11870604. DOI: 10.1101/2025.02.17.638151.


Dynamic μ-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data.

Shakya P, Sanaullah A, Zhi D, Zhang S. bioRxiv. 2025.

PMID: 39975111 PMC: 11838588. DOI: 10.1101/2025.02.04.636479.

