Using Quality Scores and Longer Reads Improves Accuracy of Solexa Read Mapping
Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for applications such as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g., ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform the associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from approximately 25 to 50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.
Results: To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to data re-sequenced from two human BAC regions, across varying read lengths and varying criteria for the use of quality scores. RMAP is freely available for download at http://rulai.cshl.edu/rmap/.
Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.
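The idea of letting quality scores decide which read positions matter can be illustrated with a minimal Python sketch. This is not RMAP's implementation: the function names (`count_mismatches`, `map_read`), the parameters `min_qual` and `max_mismatches`, the brute-force scan, and the toy sequences are all assumptions made for illustration. The sketch assumes one simple criterion, namely that positions with a base-call quality below a cutoff are ignored when counting mismatches.

```python
def count_mismatches(read, quals, ref_window, min_qual=10):
    """Count mismatches, skipping positions whose quality is below min_qual.

    Assumption for illustration: low-quality positions never count against
    the read. Other criteria for using quality scores are possible.
    """
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, quals and ref_window must have equal length")
    return sum(
        1
        for base, q, ref_base in zip(read, quals, ref_window)
        if q >= min_qual and base != ref_base
    )


def map_read(read, quals, reference, max_mismatches=1, min_qual=10):
    """Return (position, mismatches) for every placement within the limit.

    A brute-force scan over all reference offsets, chosen for clarity
    rather than speed.
    """
    n = len(read)
    hits = []
    for pos in range(len(reference) - n + 1):
        mm = count_mismatches(read, quals, reference[pos:pos + n], min_qual)
        if mm <= max_mismatches:
            hits.append((pos, mm))
    return hits


# Toy data (hypothetical): the read matches the reference at offset 3,
# except for two low-quality miscalls at its 3' end.
reference = "TTGACGTACCGGATTACA"
read = "ACGTACCGCT"
quals = [30] * 8 + [5, 5]

hits_quality_aware = map_read(read, quals, reference, max_mismatches=1, min_qual=10)
hits_quality_blind = map_read(read, quals, reference, max_mismatches=1, min_qual=0)
```

In this toy case the quality-aware call places the read at offset 3, while the quality-blind call (`min_qual=0`) finds no placement within one mismatch, because the two 3' miscalls count in full.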