Efficient Detection and Assembly of Non-reference DNA Sequences with Synthetic Long Reads
Overview
Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. The lion's share of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-read (also known as linked-read) technologies have great potential for the characterization of NRSs. While synthetic long reads require less input DNA than long-read sequencing, they are algorithmically more challenging to use. Except for computationally expensive whole-genome assembly methods, there is no synthetic long-read method for NRS detection. We propose a novel integrated alignment-based and local assembly-based algorithm, Novel-X, that uses the barcode information encoded in synthetic long reads to improve the detection of NRS insertions without a whole-genome de novo assembly. Our evaluations demonstrate that Novel-X finds many non-reference sequences that cannot be found by state-of-the-art short-read methods. We applied Novel-X to a diverse set of 68 samples from the Polaris HiSeq 4000 PGx cohort. Novel-X discovered 16 691 NRS insertions of size > 300 bp (total length 18.2 Mb). Many of them are population-specific or may have a functional impact.
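The abstract does not spell out how barcode information is used, so the following is only a minimal sketch of one way barcodes can restrict assembly to a local region, not Novel-X's actual implementation. It assumes reads are plain dictionaries with a `barcode` field, that a candidate insertion site is already known, and that a 10 kb window is a reasonable neighborhood; the function names `barcodes_near_site` and `recruit_reads`, the window size, and the toy data are all hypothetical.

```python
def barcodes_near_site(aligned_reads, chrom, pos, window=10_000):
    """Collect barcodes of reads aligned within `window` bp of a candidate site.

    Hypothetical helper: the window size is an illustrative assumption.
    """
    return {
        r["barcode"]
        for r in aligned_reads
        if r["chrom"] == chrom and abs(r["pos"] - pos) <= window
    }


def recruit_reads(unaligned_reads, barcodes):
    """Keep unaligned reads whose barcode was also seen near the site.

    These reads likely come from the same long DNA fragments as the
    reads aligned nearby, so they are candidates for local assembly.
    """
    return [r for r in unaligned_reads if r["barcode"] in barcodes]


# Toy data (made up for illustration only).
aligned = [
    {"name": "a1", "barcode": "BX1", "chrom": "chr1", "pos": 100_500},
    {"name": "a2", "barcode": "BX2", "chrom": "chr1", "pos": 104_000},
    {"name": "a3", "barcode": "BX3", "chrom": "chr2", "pos": 100_000},
    {"name": "a4", "barcode": "BX4", "chrom": "chr1", "pos": 900_000},
]
unaligned = [
    {"name": "u1", "barcode": "BX1", "seq": "ACGT"},
    {"name": "u2", "barcode": "BX3", "seq": "GGCC"},
    {"name": "u3", "barcode": "BX2", "seq": "TTAA"},
]

site_barcodes = barcodes_near_site(aligned, "chr1", 100_000)
local_reads = recruit_reads(unaligned, site_barcodes)
print(sorted(site_barcodes))
print([r["name"] for r in local_reads])
```

In this sketch, only the recruited reads would be passed to a local assembler, which is what keeps the problem far smaller than a whole-genome de novo assembly.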