Probabilistic Base Calling of Solexa Sequencing Data

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2008 Oct 15

PMID 18851737

Citations 44

Authors

Jacques Rougemont

Arnaud Amzallag

Christian Iseli

Laurent Farinelli

Ioannis Xenarios

Felix Naef


Abstract

Background: Solexa/Illumina short-read, ultra-high-throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently, a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology.

Results: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.
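The two ideas in the Results paragraph can be illustrated with a minimal Python sketch. It is not the authors' R package or their exact scoring scheme. It assumes that per-base probabilities are already available for each cycle (for example, from a clustering model), and the function names, the 0.9 probability threshold, and the 1.0-bit cutoff are all hypothetical choices made for illustration. Ambiguous positions are coded with standard IUPAC symbols, and the sub-tag is chosen as the contiguous window with the largest summed information content above the cutoff.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def call_base(probs, threshold=0.9):
    """Return the IUPAC symbol for the smallest set of most probable bases
    whose cumulative probability reaches `threshold` (an assumed rule)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        total += p
        if total >= threshold:
            break
    return IUPAC[frozenset(chosen)]

def information(probs):
    """Information content in bits: 2 minus the Shannon entropy of the position."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy

def best_subtag(read_probs, cutoff=1.0):
    """Return (start, end) of the contiguous window maximizing the sum of
    (information - cutoff). This is a maximum-sum-subarray scan, used here
    as a stand-in for the paper's score, which is not given in the abstract."""
    best_score, best_start, best_end = 0.0, 0, 0
    run_score, run_start = 0.0, 0
    for i, probs in enumerate(read_probs):
        if run_score <= 0:          # a non-positive prefix never helps; restart here
            run_score, run_start = 0.0, i
        run_score += information(probs) - cutoff
        if run_score > best_score:
            best_score, best_start, best_end = run_score, run_start, i + 1
    return best_start, best_end

def confident(base):
    """Hypothetical helper: a position where one base carries 97% of the mass."""
    return {b: (0.97 if b == base else 0.01) for b in "ACGT"}

uniform = {b: 0.25 for b in "ACGT"}
read = [
    confident("A"),
    confident("C"),
    {"A": 0.50, "G": 0.45, "C": 0.03, "T": 0.02},  # ambiguous purine
    confident("G"),
    uniform,
    uniform,
]

calls = "".join(call_base(p) for p in read)
start, end = best_subtag(read)
print(calls)             # ACRGNN
print(calls[start:end])  # ACRG
```

In this toy read, the third position is coded as R because A and G together are needed to reach the threshold, and the two uninformative trailing cycles are trimmed from the sub-tag.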

Conclusion: We show that, compared with Solexa's data processing pipeline, the method improves genome coverage and the number of usable tags by an average of 15%. An R package is provided that allows fast and accurate base calling of Solexa's fluorescence intensity files and produces informative diagnostic plots.

Citing Articles

Optocoder: computational decoding of spatially indexed bead arrays.

Senel E, Rajewsky N, Karaiskos N. NAR Genom Bioinform. 2022; 4(2):lqac042.

PMID: 35685220 PMC: 9172073. DOI: 10.1093/nargab/lqac042.

Tumor DNA as a Cancer Biomarker through the Lens of Colorectal Neoplasia.

Cohen J, Diergaarde B, Papadopoulos N, Kinzler K, Schoen R. Cancer Epidemiol Biomarkers Prev. 2020; 29(12):2441-2453.

PMID: 33033144 PMC: 7710619. DOI: 10.1158/1055-9965.EPI-20-0549.

How does inflammation drive mutagenesis in colorectal cancer?

Hsu C, Sowers M, Hsu W, Eyzaguirre E, Qiu S, Chao C. Trends Cancer Res. 2018; 12:111-132.

PMID: 30147278 PMC: 6107301.

From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data.

Mysara M, Njima M, Leys N, Raes J, Monsieurs P. Gigascience. 2017; 6(2):1-10.

PMID: 28369460 PMC: 5466709. DOI: 10.1093/gigascience/giw017.

The ChIP-Seq tools and web server: a resource for analyzing ChIP-seq and other types of genomic data.

Ambrosini G, Dreos R, Kumar S, Bucher P. BMC Genomics. 2016; 17(1):938.

PMID: 27863463 PMC: 5116162. DOI: 10.1186/s12864-016-3288-8.
