PMID: 39424793

The GIAB Genomic Stratifications Resource for Human Reference Genomes

Abstract

Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of "stratifications," BED files that define distinct contexts throughout the genome. We define these for GRCh37 and GRCh38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions that are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare benchmarking performance across the references and show the performance penalty introduced by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a Snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate that this resource will be useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly used reference genomes.
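To make the idea of a stratification concrete, the following is a minimal Python sketch of how a BED file can restrict benchmarking metrics to one genomic context. It is not the method used by the GIAB pipeline or by any particular benchmarking tool: the function names (`parse_bed`, `merge`, `contains`, `stratified_metrics`) and the toy data are hypothetical, and it assumes each variant call has already been labeled TP, FP, or FN and can be treated as a single 1-based position. Real benchmarking tools also handle variant representation and multi-base alleles, which this sketch ignores.

```python
from bisect import bisect_right
from collections import defaultdict


def merge(sorted_ivs):
    """Merge overlapping or touching (start, end) intervals that are already sorted."""
    merged = []
    for start, end in sorted_ivs:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def parse_bed(lines):
    """Parse BED lines into {chrom: merged, sorted [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    return {chrom: merge(sorted(ivs)) for chrom, ivs in intervals.items()}


def contains(intervals, chrom, pos0):
    """Return True if the 0-based position lies inside any interval on chrom."""
    ivs = intervals.get(chrom, [])
    # Index of the last interval whose start is <= pos0.
    i = bisect_right(ivs, (pos0, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= pos0 < ivs[i][1]


def stratified_metrics(calls, intervals):
    """Compute precision and recall over calls that fall inside the stratification.

    calls: iterable of (chrom, pos1, label), where pos1 is a 1-based position
    (as in VCF) and label is one of "TP", "FP", "FN" (an assumed input format).
    """
    counts = {"TP": 0, "FP": 0, "FN": 0}
    for chrom, pos1, label in calls:
        if contains(intervals, chrom, pos1 - 1):  # convert 1-based to 0-based
            counts[label] += 1
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return {"counts": counts, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy stratification and toy labeled calls, invented for illustration only.
    bed = ["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t50"]
    calls = [
        ("chr1", 101, "TP"),
        ("chr1", 300, "TP"),
        ("chr1", 301, "FP"),  # just outside [100, 300)
        ("chr2", 10, "FN"),
        ("chr1", 250, "FP"),
        ("chr3", 5, "TP"),    # chromosome absent from the BED
    ]
    print(stratified_metrics(calls, parse_bed(bed)))
```

Running the same labeled calls against several BED files (for example, a hard-to-map stratification and a GC-rich one) would yield one precision/recall pair per context, which is the kind of comparison the abstract describes.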

Citing Articles

Structural variation, selection, and diversification of the gene family from the human pangenome.

Dishuck P, Munson K, Lewis A, Dougherty M, Underwood J, Harvey W. bioRxiv. 2025.

PMID: 39975192 PMC: 11838601. DOI: 10.1101/2025.02.04.636496.


A personalized multi-platform assessment of somatic mosaicism in the human frontal cortex.

Zhou W, Mumm C, Gan Y, Switzenberg J, Wang J, De Oliveira P. bioRxiv. 2025.

PMID: 39763954 PMC: 11702624. DOI: 10.1101/2024.12.18.629274.


Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation.

Gustafson J, Gibson S, Damaraju N, Zalusky M, Hoekzema K, Twesigomwe D. medRxiv. 2024.

PMID: 38496498 PMC: 10942501. DOI: 10.1101/2024.03.05.24303792.


Nanopore Long-Read Sequencing Unveils Genomic Disruptions in Alzheimer's Disease.

Ramirez P, Sun W, Dehkordi S, Zare H, Pascarella G, Carninci P. bioRxiv. 2024.

PMID: 38370753 PMC: 10871260. DOI: 10.1101/2024.02.01.578450.
