Computational Graph Pangenomics: a Tutorial on Data Structures and Their Applications

Overview

Journal: Nat Comput

Publisher: Springer

Specialty: Biology

Date: 2023 Mar 27

PMID: 36969737

Authors

Jasmijn A Baaijens

Paola Bonizzoni

Christina Boucher

Gianluca Della Vedova

Yuri Pirola

Raffaella Rizzi

Jouni Siren

Abstract

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

Citing Articles

Haplotype Matching with GBWT for Pangenome Graphs.

Sanaullah A, Villalobos S, Zhi D, Zhang S. bioRxiv. 2025.

PMID: 39975036 PMC: 11838520. DOI: 10.1101/2025.02.03.634410.

Differential quantification of alternative splicing events on spliced pangenome graphs.

Ciccolella S, Cozzi D, Della Vedova G, Kuria S, Bonizzoni P, Denti L. PLoS Comput Biol. 2024; 20(12):e1012665.

PMID: 39652592 PMC: 11658704. DOI: 10.1371/journal.pcbi.1012665.

PangeBlocks: customized construction of pangenome graphs via maximal blocks.

Avila Cartes J, Bonizzoni P, Ciccolella S, Della Vedova G, Denti L. BMC Bioinformatics. 2024; 25(1):344.

PMID: 39497039 PMC: 11533710. DOI: 10.1186/s12859-024-05958-5.

Constructing and personalizing population pangenome graphs.

Chikhi R, Dufresne Y, Medvedev P. Nat Methods. 2024; 21(11):1980-1981.

PMID: 39433877 DOI: 10.1038/s41592-024-02402-7.

When less is more: sketching with minimizers in genomics.

Ndiaye M, Prieto-Banos S, Fitzgerald L, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C. Genome Biol. 2024; 25(1):270.

PMID: 39402664 PMC: 11472564. DOI: 10.1186/s13059-024-03414-4.

References

Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017; 33(17):2759-2761. DOI: 10.1093/bioinformatics/btx304.

Kaplinski L, Lepamets M, Remm M. GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists. Gigascience. 2015; 4:58. PMC: 4669650. DOI: 10.1186/s13742-015-0097-y.

Gagie T, Manzini G, Siren J. Wheeler graphs: A framework for BWT-based data structures. Theor Comput Sci. 2017; 698:67-78. PMC: 5727778. DOI: 10.1016/j.tcs.2017.06.016.

Li W, Malhotra R, Wu S, Jha M, Rodrigo A, Poss M. ViPRA-Haplo: De Novo Reconstruction of Viral Populations Using Paired End Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform. 2024; 21(3):492-500. DOI: 10.1109/TCBB.2024.3374595.

Schneider V, Graves-Lindsay T, Howe K, Bouk N, Chen H, Kitts P. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017; 27(5):849-864. PMC: 5411779. DOI: 10.1101/gr.213611.116.

Sibbesen J, Maretty L, Krogh A. Accurate genotyping across variant classes and lengths using variant graphs. Nat Genet. 2018; 50(7):1054-1059. DOI: 10.1038/s41588-018-0145-5.

Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016; 32(12):i201-i208. PMC: 4908363. DOI: 10.1093/bioinformatics/btw279.

Boucher C, Cvacho O, Gagie T, Holub J, Manzini G, Navarro G. PFP Compressed Suffix Trees. Proc Worksh Algorithm Eng Exp. 2022; 2021:60-72. PMC: 8963198. DOI: 10.1137/1.9781611976472.5.

Rakocevic G, Semenyuk V, Lee W, Spencer J, Browning J, Johnson I. Fast and accurate genomic analyses using genome graphs. Nat Genet. 2019; 51(2):354-362. DOI: 10.1038/s41588-018-0316-4.

Siren J, Valimaki N, Makinen V. Indexing Graphs for Path Queries with Applications in Genome Research. IEEE/ACM Trans Comput Biol Bioinform. 2015; 11(2):375-88. DOI: 10.1109/TCBB.2013.2297101.

Logsdon G, Vollger M, Eichler E. Long-read human genome sequencing and its applications. Nat Rev Genet. 2020; 21(10):597-614. PMC: 7877196. DOI: 10.1038/s41576-020-0236-x.

Makinen V, Navarro G, Siren J, Valimaki N. Storage and retrieval of highly repetitive sequence collections. J Comput Biol. 2010; 17(3):281-308. DOI: 10.1089/cmb.2009.0169.

Berlin K, Koren S, Chin C, Drake J, Landolin J, Phillippy A. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol. 2015; 33(6):623-30. DOI: 10.1038/nbt.3238.

Novak A, Garrison E, Paten B. A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms Mol Biol. 2017; 12:18. PMC: 5505026. DOI: 10.1186/s13015-017-0109-9.

Sun S, Zhou Y, Chen J, Shi J, Zhao H, Zhao H. Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes. Nat Genet. 2018; 50(9):1289-1295. DOI: 10.1038/s41588-018-0182-0.

Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G, Mun T. Prefix-free parsing for building big BWTs. Algorithms Mol Biol. 2019; 14:13. PMC: 6534911. DOI: 10.1186/s13015-019-0148-5.

Myers E. The fragment assembly string graph. Bioinformatics. 2005; 21 Suppl 2:ii79-85. DOI: 10.1093/bioinformatics/bti1114.

Sibbesen J, Eizenga J, Novak A, Siren J, Chang X, Garrison E. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nat Methods. 2023; 20(2):239-247. DOI: 10.1038/s41592-022-01731-9.

Naseri A, Zhi D, Zhang S. Multi-allelic positional Burrows-Wheeler transform. BMC Bioinformatics. 2019; 20(Suppl 11):279. PMC: 6551244. DOI: 10.1186/s12859-019-2821-6.

Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2016; 19(1):118-135. PMC: 5862344. DOI: 10.1093/bib/bbw089.