PMID: 21478889

A Framework for Variation Discovery and Genotyping Using Next-generation DNA Sequencing Data

Abstract

Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
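As a rough illustration of step (iii), the idea behind base quality score recalibration can be sketched as comparing the quality a sequencer reported with the error rate actually observed. The sketch below is a simplification under stated assumptions and is not the Genome Analysis Toolkit implementation. It bins bases by reported quality only, the function name `empirical_qualities` and the input format are hypothetical, and the add-one smoothing is an assumption chosen to avoid taking the logarithm of zero. The caller decides which bases count as mismatches against the reference.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Estimate an empirical Phred quality for each reported quality bin.

    observations: iterable of (reported_quality, is_mismatch) pairs, one per
    sequenced base (hypothetical input format for this sketch).
    Returns a dict mapping reported quality -> empirical quality.
    """
    # reported quality -> [mismatch count, total count]
    counts = defaultdict(lambda: [0, 0])
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1

    table = {}
    for reported_q, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption of this sketch) keeps the
        # estimated error rate strictly between 0 and 1.
        error_rate = (mismatches + 1) / (total + 2)
        # Phred scale: Q = -10 * log10(probability of error).
        table[reported_q] = -10 * math.log10(error_rate)
    return table
```

For example, if bases reported at Q30 mismatch the reference about 1% of the time, the table maps 30 to roughly 20, which suggests the reported qualities in that bin were overconfident.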

Citing Articles

Primary exploration of cell-free DNA in the plasma of patients with parathyroid neoplasms using next-generation sequencing.

Zheng Q, Cui M, Wang O, Chang X, Xiao J, Chen T. Cancer Cell Int. 2025; 25(1):86.

PMID: 40075389 PMC: 11905564. DOI: 10.1186/s12935-025-03699-w.


Single-cell eQTL mapping in yeast reveals a tradeoff between growth and reproduction.

Boocock J, Alexander N, Alamo Tapia L, Walter-McNeill L, Patel S, Munugala C. Elife. 2025; 13.

PMID: 40073070 PMC: 11903034. DOI: 10.7554/eLife.95566.


Genetic dissection of flowering time and fine mapping of qFT.A02-1 in rapeseed (Brassica napus L.).

Li Y, Li X, Du D, Ma Q, Zhao Z, Wang L. Theor Appl Genet. 2025; 138(4):70.

PMID: 40069358 DOI: 10.1007/s00122-025-04845-8.


T and NK cell functionality in a patient harboring heterozygous novel BCL11B p.Asp632fsAla∗91 and STX11 p.R129P mutations.

Erra L, Colado A, Brunello F, Prieto E, Goris V, Villa M. Heliyon. 2025; 11(4):e42636.

PMID: 40066033 PMC: 11891720. DOI: 10.1016/j.heliyon.2025.e42636.


UVA-light-induced mutagenesis in the exome of human nucleotide excision repair-deficient cells.

Quintero-Ruiz N, Corradi C, Moreno N, de Souza T, Menck C. Photochem Photobiol Sci. 2025; .

PMID: 40063310 DOI: 10.1007/s43630-025-00697-9.

