Using Quality Scores and Longer Reads Improves Accuracy of Solexa Read Mapping

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2008 Mar 1

PMID 18307793

Citations 110

Authors

Andrew D Smith

Zhenyu Xuan

Michael Q Zhang

Abstract

Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g., ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from approximately 25 to 50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.

Results: To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths, and varying criteria for use of quality scores. RMAP is freely available for downloading at http://rulai.cshl.edu/rmap/.
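The idea of letting quality scores decide which read positions matter can be illustrated with a minimal sketch. This is not RMAP's implementation: the function names, the brute-force scan, and the `min_qual` cutoff of 10 are all assumptions made for illustration. Positions whose base-call quality falls below the cutoff are simply ignored when counting mismatches, so a low-confidence error at the 3' end does not disqualify an otherwise perfect match.

```python
from typing import List, Sequence


def quality_aware_mismatches(read: str, quals: Sequence[int],
                             ref_window: str, min_qual: int = 10) -> int:
    """Count mismatches only at positions with quality >= min_qual.

    Hypothetical helper; min_qual=10 is an arbitrary illustrative cutoff.
    """
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, quals and ref_window must have equal length")
    return sum(1 for base, q, ref in zip(read, quals, ref_window)
               if q >= min_qual and base != ref)


def map_read(read: str, quals: Sequence[int], reference: str,
             max_mismatches: int = 2, min_qual: int = 10) -> List[int]:
    """Return start positions of the best-scoring windows.

    Brute-force scan of every window (illustrative only); a window is
    reported if it has the fewest quality-aware mismatches among all
    windows and that count does not exceed max_mismatches.
    """
    n = len(read)
    best = max_mismatches + 1  # sentinel: nothing acceptable found yet
    best_positions: List[int] = []
    for start in range(len(reference) - n + 1):
        mm = quality_aware_mismatches(read, quals,
                                      reference[start:start + n], min_qual)
        if mm < best:
            best, best_positions = mm, [start]
        elif mm == best and mm <= max_mismatches:
            best_positions.append(start)
    return best_positions


if __name__ == "__main__":
    reference = "GGGACGTACGTTTT"
    read = "ACGTACGA"            # last base is a sequencing error (A, not T)
    quals = [30] * 7 + [5]       # the erroneous 3' base has low quality
    # Using quality scores, the low-quality mismatch is ignored.
    print(map_read(read, quals, reference, max_mismatches=0))
    # With min_qual=0 every position counts, so the exact-match search fails.
    print(map_read(read, quals, reference, max_mismatches=0, min_qual=0))
```

In this toy example the read maps to position 3 only when the low-quality final base is excluded from the mismatch count, which mirrors the abstract's point that quality scores can recover mappings that a strict mismatch limit would reject.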

Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.

Citing Articles

BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis.

Firtina C, Park J, Alser M, Kim J, Cali D, Shahroodi T. NAR Genom Bioinform. 2023; 5(1):lqad004.

PMID: 36685727 PMC: 9853099. DOI: 10.1093/nargab/lqad004.

Bioinformatics and Machine Learning Approaches to Understand the Regulation of Mobile Genetic Elements.

Giassa I, Alexiou P. Biology (Basel). 2021; 10(9).

PMID: 34571773 PMC: 8465862. DOI: 10.3390/biology10090896.

Boosting the power of transcriptomics by developing an efficient gene expression profiling approach.

Wang J, Xu J, Yang X, Xu S, Zhang M, Lu F. Plant Biotechnol J. 2021; 20(1):201-210.

PMID: 34510693 PMC: 8710826. DOI: 10.1111/pbi.13706.

Technology dictates algorithms: recent developments in read alignment.

Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal P. Genome Biol. 2021; 22(1):249.

PMID: 34446078 PMC: 8390189. DOI: 10.1186/s13059-021-02443-7.

Levenshtein Distance, Sequence Comparison and Biological Database Search.

Berger B, Waterman M, Yu Y. IEEE Trans Inf Theory. 2021; 67(6):3287-3294.

PMID: 34257466 PMC: 8274556. DOI: 10.1109/tit.2020.2996543.
