
PhredEM: a Phred-score-informed Genotype-calling Approach for Next-generation Sequencing Studies

Overview
Journal: Genet Epidemiol
Specialties: Genetics, Public Health
Date: 2017 Jun 1
PMID: 28560825
Citations: 14
Abstract

A fundamental challenge in analyzing next-generation sequencing (NGS) data is determining an individual's genotype accurately, because downstream analyses depend on the accuracy of the inferred genotype. Accurate genotype calls in turn require a correct estimate of the base-calling error rate. The Phred scores that accompany each base call can be used to decide which calls are reliable. Some genotype callers, such as GATK and SAMtools, calculate base-calling error rates directly from Phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. Filtering out reads with low Phred scores is also a common quality control procedure. However, choosing an appropriate Phred score threshold is problematic: a threshold that is too high may discard usable data, while one that is too low may introduce errors. We propose a new likelihood-based genotype-calling approach that uses all reads and estimates per-base error rates by incorporating Phred scores through a logistic regression model. The approach, which we call PhredEM, uses the expectation-maximization (EM) algorithm to obtain consistent estimates of the genotype frequencies and the logistic regression parameters. It also includes a simple, computationally efficient screening algorithm that identifies loci estimated to be monomorphic, so that the EM algorithm needs to be applied only to loci estimated to be nonmonomorphic. Like GATK, PhredEM can be combined with a linkage-disequilibrium-based method such as Beagle, which can further improve genotype calls as a refinement step. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project and the 1000 Genomes project. The results demonstrate that PhredEM performs better than either GATK or SeqEM, and that it is an improved, robust, and widely applicable genotype-calling approach for NGS studies. The relevant software is freely available.
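
To make the ideas in the abstract concrete, the following Python sketch shows how a Phred score maps to a nominal error probability (Q = -10 log10 p), how a logistic function of the Phred score could stand in for the per-base error rate, and how an EM iteration could estimate genotype frequencies at one biallelic locus. It is not the PhredEM implementation: the function names, the read representation as (is_alt, phred) pairs, the symmetric two-allele error model, and the choice to hold the error model fixed instead of estimating the logistic parameters inside the EM are all assumptions made for illustration.

```python
import math


def phred_to_error(q):
    """Nominal error probability implied by a Phred score: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def logistic_error(q, b0, b1):
    """Hypothetical logistic error model: logit(error) = b0 + b1 * Q.

    b0 and b1 are illustrative parameters, not values from the paper.
    """
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * q)))


def genotype_likelihoods(reads, error_fn):
    """Likelihood of one individual's reads under g = 0, 1, 2 alternate alleles.

    reads: list of (is_alt, phred) pairs, one per read covering the locus.
    Assumes a biallelic locus and independent reads.
    """
    liks = []
    for g in (0, 1, 2):
        lik = 1.0
        for is_alt, q in reads:
            e = error_fn(q)
            # Probability that the read shows the alternate allele given g.
            p_alt = (g / 2.0) * (1.0 - e) + (1.0 - g / 2.0) * e
            lik *= p_alt if is_alt else 1.0 - p_alt
        liks.append(lik)
    return liks


def em_genotype_frequencies(samples, error_fn, n_iter=50):
    """EM over genotype frequencies with the error model held fixed.

    samples: list of per-individual read lists. n_iter must be at least 1.
    Returns (frequencies, posteriors), each indexed by genotype 0, 1, 2.
    """
    liks = [genotype_likelihoods(reads, error_fn) for reads in samples]
    freqs = [1.0 / 3.0] * 3
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities for each individual.
        posteriors = []
        for lik in liks:
            weights = [f * x for f, x in zip(freqs, lik)]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: new frequencies are the average posterior probabilities.
        freqs = [sum(p[g] for p in posteriors) / len(posteriors) for g in range(3)]
    return freqs, posteriors


# Toy data: three individuals, four reads each, all with Phred score 30.
samples = [
    [(False, 30)] * 4,               # only reference reads
    [(True, 30), (False, 30)] * 2,   # mixed reads
    [(True, 30)] * 4,                # only alternate reads
]
freqs, posteriors = em_genotype_frequencies(samples, phred_to_error)
calls = [max(range(3), key=lambda g: p[g]) for p in posteriors]
```

Here `calls` holds the most probable genotype for each individual. Passing `lambda q: logistic_error(q, b0, b1)` in place of `phred_to_error` would swap in the logistic error model; a fuller treatment would also update `b0` and `b1` in the M-step, which this sketch leaves out.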

Citing Articles

PNNGS, a multi-convolutional parallel neural network for genomic selection.

Xie Z, Weng L, He J, Feng X, Xu X, Ma Y Front Plant Sci. 2024; 15:1410596.

PMID: 39290743 PMC: 11405342. DOI: 10.3389/fpls.2024.1410596.


Using nanopore sequencing to identify bacterial infection in joint replacements: a preliminary study.

Wilkinson H, McDonald J, McCarthy H, Perry J, Wright K, Hulme C Brief Funct Genomics. 2024; 23(5):509-516.

PMID: 38555497 PMC: 11428152. DOI: 10.1093/bfgp/elae008.


GRHL2-controlled gene expression networks in luminal breast cancer.

Wang Z, Coban B, Wu H, Chouaref J, Daxinger L, Paulsen M Cell Commun Signal. 2023; 21(1):15.

PMID: 36691073 PMC: 9869538. DOI: 10.1186/s12964-022-01029-5.


Genomic divergence, local adaptation, and complex demographic history may inform management of a popular sportfish species complex.

Gunn J, Berkman L, Koppelman J, Taylor A, Brewer S, Long J Ecol Evol. 2022; 12(10):e9370.

PMID: 36225830 PMC: 9534746. DOI: 10.1002/ece3.9370.


Oral microbiome research - A Beginner's glossary.

Deo P, Deshmukh R J Oral Maxillofac Pathol. 2022; 26(1):87-92.

PMID: 35571306 PMC: 9106258. DOI: 10.4103/jomfp.jomfp_455_21.

