Efficient Detection and Assembly of Non-reference DNA Sequences with Synthetic Long Reads

Overview
Specialty: Biochemistry
Date: 2022 Aug 4
PMID: 35924489
Abstract

Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. Most of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-read (also known as linked-read) technologies have great potential for characterizing NRSs. Synthetic long reads require less input DNA than long reads, but they are algorithmically more challenging to use. Apart from computationally expensive whole-genome assembly methods, there is no synthetic long-read method for NRS detection. We propose Novel-X, a novel algorithm that integrates alignment-based and local assembly-based approaches and uses the barcode information encoded in synthetic long reads to improve the detection of NRS insertions without a whole-genome de novo assembly. Our evaluations demonstrate that Novel-X finds many NRSs that cannot be found by state-of-the-art short-read methods. We applied Novel-X to a diverse set of 68 samples from the Polaris HiSeq 4000 PGx cohort. Novel-X discovered 16,691 NRS insertions longer than 300 bp (total length 18.2 Mb). Many of them are population-specific or may have a functional impact.
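The abstract does not spell out how barcodes are used, so the following is only a minimal sketch of the general idea of barcode-guided read recruitment for local assembly, not the Novel-X implementation. The `Read` record, the function names, the 50 kb window, and the toy data are all assumptions made for illustration: barcodes are collected from reads mapped near a candidate insertion breakpoint, and unmapped reads sharing those barcodes are then gathered as input for a local assembler.

```python
from collections import namedtuple

# Hypothetical read record (an assumption, not a Novel-X data structure).
# chrom and pos are None for an unmapped read.
Read = namedtuple("Read", ["name", "barcode", "chrom", "pos"])


def barcodes_near(reads, chrom, breakpoint, window=50_000):
    """Return barcodes of reads mapped within `window` bp of a candidate breakpoint."""
    return {
        r.barcode
        for r in reads
        if r.pos is not None
        and r.chrom == chrom
        and abs(r.pos - breakpoint) <= window
    }


def recruit_unmapped(reads, barcodes):
    """Return unmapped reads whose barcode was observed near the breakpoint."""
    return [r for r in reads if r.pos is None and r.barcode in barcodes]


# Toy data: three barcodes, each with one mapped and one unmapped read.
reads = [
    Read("r1", "BX01", "chr1", 100_200),    # mapped close to the breakpoint
    Read("r2", "BX01", None, None),         # unmapped, shares barcode BX01
    Read("r3", "BX02", "chr1", 5_000_000),  # same chromosome, far away
    Read("r4", "BX02", None, None),
    Read("r5", "BX03", "chr2", 100_100),    # different chromosome
    Read("r6", "BX03", None, None),
]

near = barcodes_near(reads, "chr1", 100_000)
recruited = recruit_unmapped(reads, near)
print(sorted(near), [r.name for r in recruited])
```

In this toy example only barcode `BX01` is seen near the breakpoint, so only read `r2` is recruited. Restricting assembly to reads that share barcodes with a locus is what lets a method of this kind avoid assembling the whole genome.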

Citing Articles

Blackbird: structural variant detection using synthetic and low-coverage long-reads.

Meleshko D, Yang R, Maharjan S, Danko D, Korobeynikov A, Hajirasouliha I. bioRxiv. 2024.

PMID: 39605582 PMC: 11601376. DOI: 10.1101/2024.11.17.624011.


Technology-enabled great leap in deciphering plant genomes.

Xie L, Gong X, Yang K, Huang Y, Zhang S, Shen L. Nat Plants. 2024; 10(4):551-566.

PMID: 38509222 DOI: 10.1038/s41477-024-01655-6.


Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles.

Wu Z, Li T, Jiang Z, Zheng J, Gu Y, Liu Y. Nucleic Acids Res. 2024; 52(5):2212-2230.

PMID: 38364871 PMC: 10954445. DOI: 10.1093/nar/gkae086.


BLR: a flexible pipeline for haplotype analysis of multiple linked-read technologies.

Hojer P, Frick T, Siga H, Pourbozorgi P, Aghelpasand H, Martin M. Nucleic Acids Res. 2023; 51(22):e114.

PMID: 37941142 PMC: 10711428. DOI: 10.1093/nar/gkad1010.
