Estimation of Sequencing Error Rates in Short Reads

Overview

Journal: BMC Bioinformatics

Publisher: BioMed Central

Specialty: Biology

Date: 2012 Aug 1

PMID: 22846331

Citations: 34

Authors

Xin Victoria Wang

Natalie Blades

Jie Ding

Razvan Sultana

Giovanni Parmigiani

Abstract

Background: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of these data can be affected by several factors, and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments.

Results: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html.
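The idea of regressing the number of near-identical erroneous reads on the copy number of each abundant read can be illustrated with a minimal Python sketch. This is not the authors' R package. The function names, the `min_count` threshold, the use of ordinary least squares, and the transformation from slope to error rate are all assumptions made for illustration. In particular, the per-read rate `slope / (1 + slope)` assumes that every erroneous copy of a read lies within two substitutions of it, and the per-base rate assumes independent, uniform errors across positions.

```python
from collections import Counter
from itertools import combinations, product


def neighbors(read, max_mismatches=2, alphabet="ACGT"):
    """Return all sequences that differ from `read` by 1..max_mismatches substitutions."""
    found = set()
    for k in range(1, max_mismatches + 1):
        for positions in combinations(range(len(read)), k):
            # At each chosen position, try every base other than the original one.
            choices = [[b for b in alphabet if b != read[p]] for p in positions]
            for subs in product(*choices):
                seq = list(read)
                for p, b in zip(positions, subs):
                    seq[p] = b
                found.add("".join(seq))
    return found


def shadow_counts(reads, min_count=50, max_mismatches=2):
    """For each abundant read, pair its copy number with the total count of its neighbors."""
    counts = Counter(reads)
    pairs = []
    for read, n in counts.items():
        if n < min_count:  # hypothetical threshold: only abundant reads are used
            continue
        shadows = sum(counts.get(s, 0) for s in neighbors(read, max_mismatches))
        pairs.append((n, shadows))
    return pairs


def ols_slope(pairs):
    """Ordinary least-squares slope of shadow count on copy number (with intercept)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx


def error_rates(slope, read_length):
    """Assumed transformation of the slope into per-read and per-base error rates."""
    per_read = slope / (1 + slope)
    per_base = 1 - (1 - per_read) ** (1 / read_length)
    return per_read, per_base


# Toy data: three well-separated reads, each with 10% as many one-mismatch copies.
toy_reads = (
    ["AAAAAAAA"] * 100 + ["AAAAAAAT"] * 10
    + ["CCCCCCCC"] * 200 + ["CCCCCCCT"] * 20
    + ["GGGGGGGG"] * 300 + ["GGGGGGGT"] * 30
)
toy_pairs = shadow_counts(toy_reads)
toy_slope = ols_slope(toy_pairs)
toy_per_read, toy_per_base = error_rates(toy_slope, read_length=8)
print(toy_slope, toy_per_read, toy_per_base)
```

On real data, enumerating all neighbors grows quickly with read length, and a robust regression may be preferable to ordinary least squares; both choices here are simplifications for readability.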

Conclusions: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.

Citing Articles

GCphase: an SNP phasing method using a graph partition and error correction algorithm.

Luo J, Wang J, Zhai H, Wang J. BMC Bioinformatics. 2024; 25(1):267.

PMID: 39160480 PMC: 11331634. DOI: 10.1186/s12859-024-05901-8.

Genetic identification of avian samples recovered from solar energy installations.

Gruppi C, Sanzenbacher P, Balekjian K, Hagar R, Hagen S, Rayne C. PLoS One. 2023; 18(9):e0289949.

PMID: 37672506 PMC: 10482291. DOI: 10.1371/journal.pone.0289949.

Methods to improve the accuracy of next-generation sequencing.

Cheng C, Fei Z, Xiao P. Front Bioeng Biotechnol. 2023; 11:982111.

PMID: 36741756 PMC: 9895957. DOI: 10.3389/fbioe.2023.982111.

Effect of Periodontal Interventions on Characteristics of the Periodontal Microbial Profile: A Systematic Review and Meta-Analysis.

Nath S, Pulikkotil S, Weyrich L, Zilm P, Kapellas K, Jamieson L. Microorganisms. 2022; 10(8).

PMID: 36014000 PMC: 9416518. DOI: 10.3390/microorganisms10081582.

Evaluation of the correctable decoding sequencing as a new powerful strategy for DNA sequencing.

Cheng C, Xiao P. Life Sci Alliance. 2022; 5(8).

PMID: 35422436 PMC: 9012935. DOI: 10.26508/lsa.202101294.
