An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values
This paper specifies, and provides an initial validation of, a framework for comparing lossy compressors of genome sequencing quality values. The goal is to define the reference data, test sets, tools and metrics to be used when evaluating the impact of lossy quality value compression on human genome variant calling. The framework is validated using two state-of-the-art genomic compressors. This work was spurred by current activity within the ISO/IEC SC29/WG11 technical committee (a.k.a. MPEG), which is investigating the possibility of starting a standardization activity for genomic information representation.
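To make the idea of measuring the impact on variant calling concrete, the following is a minimal sketch of how a call set obtained after lossy compression might be scored against a reference call set. The abstract does not name the metrics, so precision, recall and F-score are assumed here as common choices; the names `Variant`, `CallMetrics` and `compare_calls`, as well as the toy variants, are hypothetical and not taken from the paper.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float
    f_score: float


def compare_calls(truth: Set[Variant], called: Set[Variant]) -> CallMetrics:
    """Score a set of called variants against a reference ("truth") set.

    Assumption: variants match only when all four fields are identical.
    Real benchmarking tools also normalize equivalent representations.
    """
    tp = len(truth & called)   # called and present in the reference set
    fp = len(called - truth)   # called but absent from the reference set
    fn = len(truth - called)   # present in the reference set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return CallMetrics(tp, fp, fn, precision, recall, f_score)


# Toy data (made up for illustration only).
truth_set = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr2", 3000, "T", "C"),
}
calls_after_lossy = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr3", 500, "A", "T"),   # not in the reference set
}

metrics = compare_calls(truth_set, calls_after_lossy)
print(metrics)
```

Running the same comparison on the calls obtained from the uncompressed quality values would give a baseline, so that two compressors could be compared by how far each moves these figures away from it.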