Statistical Comparison of Methods to Estimate the Error Probability in Short-read Illumina Sequencing
Overview
As at the beginning of the sequencing era, the new generation of short-read sequencing technologies still requires both accurate data processing methods and reliable measures of that accuracy. Inspired by the classic of the genre, the Phred method, we generalize its findings to the calibration of base quality values. We introduce a simple, statistically grounded way to measure the performance of a calibrator and to find an optimal way to assess its reliability. We illustrate the method by assessing the performance of several calibrators/predictors for Illumina Genome Analyser 2 (GA2) data. The best predictor is chosen by optimizing validity, discriminative ability and discrimination power over several candidate predictors. We applied the method to data from one experimental run on the genome of phage φX and found the best predictor out of ten candidates to be 'Purity', a statistic derived from corrected cluster intensities. The source code for the comparison of the predictors is available from the authors on request.
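To make the notion of calibration concrete, the following minimal Python sketch shows the standard Phred relationship between a quality value Q and an error probability p (Q = -10 log10 p), and one simple way to compare predicted quality values against observed error rates. It is an illustration only, not the authors' method: the function names and the per-quality-value binning are assumptions made for this example, and it does not implement the validity or discrimination criteria named above.

```python
import math
from collections import defaultdict


def phred_to_prob(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality value."""
    return -10 * math.log10(p)


def empirical_quality(predicted_q, is_error):
    """Tabulate observed error rates per predicted quality value.

    predicted_q: iterable of predicted quality values, one per base call.
    is_error:    iterable of booleans, True where the base call was wrong.
    Returns {q: (n_bases, n_errors, empirical_q)}; empirical_q is None
    when no errors were observed in that bin (the rate cannot be converted).
    Hypothetical helper for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])
    for q, err in zip(predicted_q, is_error):
        counts[q][0] += 1
        counts[q][1] += int(err)

    table = {}
    for q in sorted(counts):
        n, e = counts[q]
        table[q] = (n, e, prob_to_phred(e / n) if e else None)
    return table
```

For a well-calibrated predictor, the empirical quality in each bin should lie close to the predicted quality; large gaps indicate over- or under-confident quality values.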