
Mapinsights: Deep Exploration of Quality Issues and Error Profiles in High-throughput Sequence Data

Overview
Specialty Biochemistry
Date 2023 Jun 28
PMID 37378434
Abstract

High-throughput sequencing (HTS) has revolutionized science by enabling rapid detection of genomic variants at base-pair resolution. Consequently, it poses the challenging problem of identifying technical artifacts, i.e. hidden non-random error patterns. Understanding the properties of sequencing artifacts is key to separating true variants from false positives. Here, we develop Mapinsights, a toolkit that performs quality control (QC) analysis of sequence alignment files and can detect outliers based on sequencing artifacts of HTS data at a deeper resolution than existing methods. Mapinsights performs a cluster analysis based on novel and existing QC features derived from the sequence alignment for outlier detection. We applied Mapinsights to community-standard open-source datasets and identified various quality issues, including technical errors related to sequencing cycles, sequencing chemistry, sequencing libraries and across various orthogonal sequencing platforms. Mapinsights also enables identification of anomalies related to sequencing depth. A logistic regression-based model built on the features of Mapinsights shows high accuracy in detecting 'low-confidence' variant sites. Quantitative estimates and probabilistic arguments provided by Mapinsights can be used to identify errors, bias and outlier samples, and can also help improve the authenticity of variant calls.
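The abstract mentions a logistic regression-based model for flagging 'low-confidence' variant sites but does not describe its features or training data. The following is only a minimal sketch of that general idea, not Mapinsights' actual model. The two features (`strand_bias`, fraction of low-quality bases), the toy values, and the helper names `fit_logistic` and `predict_proba` are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this site: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(w, b, x):
    """Return the modelled probability that a site is 'low-confidence'."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical per-site features, both scaled to [0, 1]:
#   (strand_bias, fraction_of_low_quality_bases)
# Label 1 = 'low-confidence' site, 0 = reliable site.
# These are toy values for illustration, not real sequencing data.
X = [
    (0.05, 0.02), (0.10, 0.05), (0.08, 0.10), (0.15, 0.04),
    (0.80, 0.60), (0.70, 0.75), (0.90, 0.55), (0.65, 0.80),
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(X, y)
p_clean = predict_proba(w, b, (0.07, 0.03))
p_suspect = predict_proba(w, b, (0.85, 0.70))
print(f"clean-looking site:   P(low-confidence) = {p_clean:.3f}")
print(f"suspect-looking site: P(low-confidence) = {p_suspect:.3f}")
```

In practice the features would come from the QC metrics a tool derives from the alignment, and the labels from a trusted truth set. A library implementation with regularization and held-out validation would be preferable to this hand-rolled optimizer.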

Citing Articles

Harnessing the Power of Next-Generation Sequencing in Wastewater-Based Epidemiology and Global Disease Surveillance.

Farkas K, Williams R, Hillary L, Garcia-Delgado A, Jameson E, Kevill J Food Environ Virol. 2024; 17(1):5.

PMID: 39614945 PMC: 11608212. DOI: 10.1007/s12560-024-09616-0.
