SeqEntropy: Genome-wide Assessment of Repeats for Short Read Sequencing

Overview

Journal PLoS One

Specialties General Medicine
Science

Date 2013 Apr 2

PMID 23544073

Citations 1

Authors

Hsueh-Ting Chu

William W L Hsiao

Theresa T H Tsao

D Frank Hsu

Chaur-Chin Chen

Sheng-An Lee

Cheng-Yan Kao

Abstract

Background: Recent studies on genome assembly from short-read sequencing data have reported that this technology is limited in its ability to reconstruct an entire genome, even at very high depth of coverage. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths.

Methodology/Principal Findings: We define a metric H(k) as the entropy of sequencing reads at read length k, and use the relative loss of entropy ΔH(k) to measure the impact of repeats on the reconstruction of a whole genome from sequences of length k. In our experiments, we found that entropy loss correlates well with the de novo assembly coverage of a genome, and that a score of ΔH(k) > 1% indicates a severe loss in genome reconstruction fidelity. The minimal read length needed to achieve ΔH(k) < 1% differs among organisms and is independent of genome size. For example, to meet the threshold of ΔH(k) < 1%, a read length of 60 bp is needed for sequencing the human genome (3.2×10^9 bp) and 320 bp for sequencing the fruit fly genome (1.8×10^8 bp). We also calculated ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and that the entropy of sequencing reads provides a measure of their repeat structures.
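The abstract does not give the formulas for H(k) or ΔH(k), so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that H(k) is the Shannon entropy of the distribution of all error-free length-k reads taken from a single strand, and that ΔH(k) is the shortfall of H(k) from its maximum (reached when every read is unique), expressed as a fraction of that maximum. The function names `read_entropy` and `relative_entropy_loss` are hypothetical.

```python
import math
from collections import Counter


def read_entropy(genome: str, k: int) -> float:
    """Shannon entropy (in bits) of all length-k substrings of `genome`.

    Assumption: one idealized read starts at every position of a single
    strand; reverse complements and circular genomes are ignored.
    """
    n_reads = len(genome) - k + 1
    if k <= 0 or n_reads <= 0:
        raise ValueError("k must be between 1 and the genome length")
    counts = Counter(genome[i:i + k] for i in range(n_reads))
    return -sum((c / n_reads) * math.log2(c / n_reads) for c in counts.values())


def relative_entropy_loss(genome: str, k: int) -> float:
    """Assumed form of the relative entropy loss: (H_max - H(k)) / H_max.

    H_max = log2(number of reads) is the entropy when no read is repeated.
    """
    n_reads = len(genome) - k + 1
    if k <= 0 or n_reads <= 0:
        raise ValueError("k must be between 1 and the genome length")
    h_max = math.log2(n_reads)
    if h_max == 0.0:
        return 0.0  # a single read carries no repeat information
    return (h_max - read_entropy(genome, k)) / h_max


if __name__ == "__main__":
    toy = "ACGAACGT"  # "ACG" occurs twice, so 3-mers are not all unique
    for k in (3, 4):
        print(k, relative_entropy_loss(toy, k))
```

On the toy sequence, the loss is positive at k = 3 because one read is repeated, and it vanishes (up to floating-point rounding) at k = 4, where every read is unique. This mirrors the idea in the abstract that longer reads reduce the entropy lost to repeats. A dictionary of k-mer counts is used here for brevity; it would not scale to large genomes without a more memory-efficient counting scheme.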

Conclusions/Significance: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation of the limitations of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful for tuning sequencing parameters to achieve better genome assemblies when a closely related genome is already available.

Citing Articles

Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.

Li W, Freudenberg J, Miramontes P. BMC Bioinformatics. 2014; 15:2.

PMID: 24386976 PMC: 3927684. DOI: 10.1186/1471-2105-15-2.
