Label-guided Seed-chain-extend Alignment on Annotated De Bruijn Graphs

Overview

Journal Bioinformatics

Publisher Oxford University Press

Specialty Biology

Date 2024 Jun 28

PMID 38940164

Authors

Harun Mustafa

Mikhail Karasikov

Nika Mansouri Ghiasi

Gunnar Rätsch

André Kahles

Abstract

Motivation: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, which makes it harder to find the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs produce shorter alignments because of this fragmentation, leading to low recall for long reads. While some aligners (e.g. label-free ones) partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.

Results: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations. To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model, adapting recent chaining improvements to assembly graphs, and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, and finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
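The idea of chaining anchors across labels while charging for a label switch can be illustrated with a small dynamic program. The sketch below is not the MLA implementation: the `Anchor` type, the `label_change_cost` formula (a penalty that shrinks as a pairwise sample similarity grows), the linear gap penalty, and all parameter values are assumptions made for illustration. It also ignores graph coordinates and the 'Node Length Change' operation, and only orders anchors along the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    q_start: int   # inclusive start on the query
    q_end: int     # exclusive end on the query
    label: str     # sample ID annotating the matched nodes
    score: float   # score of the seed match


def label_change_cost(a, b, similarity, scale=10.0):
    """Hypothetical penalty: zero within a label, smaller for similar samples."""
    if a == b:
        return 0.0
    sim = similarity.get(frozenset((a, b)), 0.0)  # unknown pairs count as dissimilar
    return scale * (1.0 - sim)


def chain_multi_label(anchors, similarity, gap_per_base=0.1):
    """Quadratic-time chaining over query-ordered anchors; returns (score, chain)."""
    order = sorted(anchors, key=lambda a: (a.q_start, a.q_end))
    if not order:
        return 0.0, []
    best = [a.score for a in order]   # best chain score ending at each anchor
    prev = [-1] * len(order)          # backpointers for traceback
    for i, cur in enumerate(order):
        for j in range(i):
            pre = order[j]
            if pre.q_end > cur.q_start:
                continue  # overlapping anchors cannot follow one another
            gap = cur.q_start - pre.q_end
            cand = (best[j] + cur.score - gap_per_base * gap
                    - label_change_cost(pre.label, cur.label, similarity))
            if cand > best[i]:
                best[i], prev[i] = cand, j
    end = max(range(len(order)), key=best.__getitem__)
    total = best[end]
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    chain.reverse()
    return total, chain


# Toy input: s1 is fragmented; s2 is similar to s1, s3 is not.
similarity = {frozenset(("s1", "s2")): 0.9, frozenset(("s1", "s3")): 0.1}
anchors = [
    Anchor(0, 10, "s1", 10.0),
    Anchor(12, 22, "s2", 10.0),
    Anchor(12, 22, "s3", 10.0),
    Anchor(25, 35, "s1", 10.0),
]
total, chain = chain_multi_label(anchors, similarity)
print([a.label for a in chain], round(total, 2))
```

In this toy input the chain bridges the gap in `s1` through the similar sample `s2` rather than through `s3`, which is the behaviour a similarity-aware label penalty is meant to encourage.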

Availability and Implementation: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
