
Label-guided Seed-chain-extend Alignment on Annotated De Bruijn Graphs

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2024 Jun 28
PMID: 38940164
Abstract

Motivation: Exponential growth in sequencing databases has motivated scalable de Bruijn graph (DBG)-based indexing for searching these data, with annotations that label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, which makes it hard to find the long contiguous walks that alignment queries require. Aligners that target single-labelled subgraphs produce shorter alignments because of this fragmentation, leading to low recall for long reads. Some aligners (e.g. label-free ones) partially overcome fragmentation by combining information from multiple samples, but biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
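The fragmentation problem can be illustrated with a toy example. The sketch below is not taken from the paper's implementation: the function names, the dictionary-based graph, and the example sequences are all assumptions made for illustration. It maps each k-mer (node) to the set of sample labels that contain it, then compares the longest contiguous run of query k-mers found within one label against the run found when labels are ignored.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return the k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_annotated_dbg(samples, k):
    """Map each k-mer (node) to the set of sample labels that contain it.

    samples: dict of label -> list of sequences (hypothetical toy input).
    """
    annotation = defaultdict(set)
    for label, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                annotation[kmer].add(label)
    return annotation

def longest_run(query, k, annotation, labels=None):
    """Length of the longest run of consecutive query k-mers in the graph.

    If labels is given, only nodes carrying at least one of those labels
    count as present (a single-label or restricted subgraph).
    """
    best = run = 0
    for kmer in kmers(query, k):
        node_labels = annotation.get(kmer, set())
        if labels is None:
            present = bool(node_labels)
        else:
            present = bool(node_labels & labels)
        run = run + 1 if present else 0
        best = max(best, run)
    return best

# Two low-depth samples that each cover only part of the query.
samples = {"A": ["ACGTAC"], "B": ["TACGGTT"]}
annotation = build_annotated_dbg(samples, k=4)
query = "ACGTACGGTT"

run_a = longest_run(query, 4, annotation, {"A"})   # restricted to sample A
run_b = longest_run(query, 4, annotation, {"B"})   # restricted to sample B
run_any = longest_run(query, 4, annotation)        # labels ignored
```

In this toy case each single-label subgraph supports only part of the query walk, while the label-free view covers all of it. The label-free view says nothing about whether combining A and B is biologically sensible, which is the gap the abstract goes on to address.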

Results: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations. To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model that adapts recent chaining improvements to assembly graphs, and it provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, and finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
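As a rough illustration of how a label-change cost might steer chaining, the sketch below runs a simple quadratic-time dynamic program over anchors that each carry a sample label. It is not the MLC algorithm or the MLA scoring model from the paper. The Jaccard similarity on k-mer sets, the penalty form `change_scale * (1 - similarity)`, the linear gap cost, and all names and parameter values are assumptions chosen only to show the idea that switching between similar samples should cost less than switching between dissimilar ones.

```python
def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (assumed proxy for sample similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def chain_anchors(anchors, sample_kmers, change_scale=10.0, gap_cost=0.1):
    """Best-scoring chain over labelled anchors (illustrative sketch only).

    anchors: list of (query_start, query_end, label, score), end exclusive.
    sample_kmers: dict of label -> set of k-mers for that sample.
    Returns (best_score, chain), where chain is a list of anchor indices.
    """
    if not anchors:
        return 0.0, []
    order = sorted(range(len(anchors)), key=lambda i: anchors[i][0])
    best = {}   # anchor index -> best chain score ending at that anchor
    prev = {}   # anchor index -> predecessor in that chain, or None
    for pos, i in enumerate(order):
        q_start, _, label, score = anchors[i]
        best[i], prev[i] = score, None
        for j in order[:pos]:
            _, p_end, p_label, _ = anchors[j]
            if p_end > q_start:
                continue  # anchors overlap in the query; not chainable here
            penalty = gap_cost * (q_start - p_end)
            if p_label != label:
                # Hypothetical label-change cost: lower for similar samples.
                sim = jaccard(sample_kmers[p_label], sample_kmers[label])
                penalty += change_scale * (1.0 - sim)
            candidate = best[j] + score - penalty
            if candidate > best[i]:
                best[i], prev[i] = candidate, j
    end = max(best, key=best.get)
    top_score = best[end]
    chain = []
    node = end
    while node is not None:   # backtrack through predecessors
        chain.append(node)
        node = prev[node]
    return top_score, chain[::-1]

# Toy input: sample B shares k-mers with A, sample C shares none.
sample_kmers = {
    "A": {"ACGT", "CGTA", "GTAC", "TACG"},
    "B": {"GTAC", "TACG", "ACGG", "CGGT"},
    "C": {"TTTT", "TTTA"},
}
anchors = [
    (0, 10, "A", 10.0),
    (10, 20, "B", 10.0),   # lower local score, similar sample
    (10, 20, "C", 12.0),   # higher local score, dissimilar sample
]
score, chain = chain_anchors(anchors, sample_kmers)
```

With these toy values the chain continues from the A anchor into the B anchor rather than the C anchor, even though the C anchor has the higher local score, because the assumed penalty for switching to a dissimilar sample outweighs the difference.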

Availability And Implementation: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
