
Predicting the Number of Bases to Attain Sufficient Coverage in High-Throughput Sequencing Experiments

Overview
Journal J Comput Biol
Date 2019 Nov 15
PMID 31725321
Citations 2
Abstract

For many types of high-throughput sequencing experiments, success in downstream analysis depends on attaining sufficient coverage for individual positions in the genome. For example, when identifying single-nucleotide variants de novo, the number of reads supporting a particular variant call determines our confidence in that variant call. If sequenced reads are distributed uniformly along the genome, the coverage of a nucleotide position is easily approximated by a Poisson distribution, with rate equal to average sequencing depth. Unfortunately, as has become well known, high-throughput sequencing data are never uniform. The numerous factors contributing to variation in coverage have resisted attempts at direct modeling and change along with minor adjustments in the underlying technology. We propose a new nonparametric method to predict the portion of a genome that will attain some specified minimum coverage, as a function of sequencing effort, using information from a shallow sequencing experiment from the same library. Simulations show our approach performs well under an array of distributional assumptions that deviate from uniformity. We applied this approach to estimate coverage at varying depths in single-cell whole-genome sequencing data from multiple protocols. The resulting predictions were highly accurate, demonstrating the effectiveness of our approach in analyzing the complexity of sequencing libraries and optimizing the design of sequencing experiments.
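
The uniform-coverage baseline mentioned in the abstract can be made concrete with a short sketch. It is not the nonparametric method the authors propose; it only shows the idealized Poisson approximation that real data are said to violate. The function names `fraction_covered` and `bases_needed` and all parameters are hypothetical, chosen for illustration, and the example assumes average depth equals sequenced bases divided by genome size.

```python
import math


def fraction_covered(mean_depth: float, min_coverage: int) -> float:
    """P(X >= min_coverage) for X ~ Poisson(mean_depth).

    Under the idealized uniform-read assumption this is the expected
    fraction of genome positions covered at least min_coverage times.
    """
    if mean_depth < 0 or min_coverage < 0:
        raise ValueError("mean_depth and min_coverage must be non-negative")
    term = math.exp(-mean_depth)  # P(X = 0)
    below = 0.0                   # running P(X < min_coverage)
    for k in range(min_coverage):
        below += term
        term *= mean_depth / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - below)


def bases_needed(genome_size: int, min_coverage: int, target_fraction: float) -> float:
    """Approximate number of sequenced bases at which the Poisson model
    predicts at least target_fraction of positions reach min_coverage.

    Assumption (not from the source): mean depth = bases / genome_size.
    """
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")
    lo, hi = 0.0, 1.0
    while fraction_covered(hi, min_coverage) < target_fraction:
        hi *= 2.0  # expand the upper bound until the target is bracketed
    for _ in range(100):  # bisection: fraction_covered is monotone in depth
        mid = (lo + hi) / 2.0
        if fraction_covered(mid, min_coverage) < target_fraction:
            lo = mid
        else:
            hi = mid
    return hi * genome_size
```

For instance, `bases_needed(3_000_000_000, 10, 0.95)` would give the sequencing effort this idealized model suggests for covering 95% of a 3 Gb genome at least 10 times. Because real coverage is not uniform, such a figure is best read as an optimistic lower bound, which is the gap the abstract's method is meant to address.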

Citing Articles

Tissue-specific features of the T cell repertoire after allogeneic hematopoietic cell transplantation in human and mouse.

DeWolf S, Elhanati Y, Nichols K, Waters N, Nguyen C, Slingerland J Sci Transl Med. 2023; 15(706):eabq0476.

PMID: 37494469 PMC: 10758167. DOI: 10.1126/scitranslmed.abq0476.


Performance comparisons between clustering models for reconstructing NGS results from technical replicates.

Zhai Y, Bardel C, Vallee M, Iwaz J, Roy P Front Genet. 2023; 14:1148147.

PMID: 37007945 PMC: 10060969. DOI: 10.3389/fgene.2023.1148147.
