
Identifying Mislabeled and Contaminated DNA Methylation Microarray Data: an Extended Quality Control Toolset with Examples from GEO

Overview
Publisher: BioMed Central
Specialty: Genetics
Date: 2018 Jun 9
PMID: 29881472
Citations: 70
Abstract

Background: Mislabeled, contaminated or poorly performing samples can threaten power in methylation microarray analyses or even result in spurious associations. We describe a set of quality checks for the popular Illumina 450K and EPIC microarrays to identify problematic samples and demonstrate their application in publicly available datasets.

Methods: Quality checks implemented here include 17 control metrics defined by the manufacturer, a sex check to detect mislabeled sex-discordant samples, and both an identity check for fingerprinting sample donors and a measure of sample contamination based on probes querying high-frequency SNPs. These checks were tested on 80 datasets comprising 8327 samples run on the 450K microarray from the GEO repository.
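The sex check described above can be illustrated with a minimal sketch. It assumes that total probe intensities (methylated plus unmethylated signal) and chromosome labels are already available, and it relies on the general observation that X-chromosome intensity relative to autosomes is higher in females while Y-chromosome intensity is near background. The function names and the cutoffs `x_cut` and `y_cut` are hypothetical and chosen only for illustration; they are not taken from the authors' software package.

```python
from statistics import mean


def normalized_xy(intensity, chrom):
    """Return mean X and Y probe intensity, each divided by the autosomal mean."""
    auto = mean(v for v, c in zip(intensity, chrom) if c not in ("X", "Y"))
    x = mean(v for v, c in zip(intensity, chrom) if c == "X")
    y = mean(v for v, c in zip(intensity, chrom) if c == "Y")
    return x / auto, y / auto


def infer_sex(x_norm, y_norm, x_cut=0.8, y_cut=0.4):
    """Classify a sample from normalized intensities (illustrative cutoffs).

    Returns "F", "M", or None when the sample falls in an ambiguous region.
    """
    if x_norm >= x_cut and y_norm < y_cut:
        return "F"
    if x_norm < x_cut and y_norm >= y_cut:
        return "M"
    return None


def flag_sex_mismatch(recorded, intensity, chrom):
    """True when the inferred sex is unambiguous and contradicts the record."""
    x_norm, y_norm = normalized_xy(intensity, chrom)
    inferred = infer_sex(x_norm, y_norm)
    return inferred is not None and inferred != recorded
```

In practice the cutoffs would have to be chosen per dataset, for example by inspecting a scatter plot of the two normalized intensities, and ambiguous samples would be reviewed rather than silently accepted.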

Results: Nine hundred forty samples were flagged by at least one control metric, and 133 samples from 20 datasets were assigned the wrong sex. In a dataset in which a subset of samples appears to be contaminated with a single source of DNA, we demonstrate that our measure based on outliers among SNP probes was strongly correlated (correlation > 0.95) with another independent measure of contamination.
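The idea behind a contamination measure based on outliers among SNP probes can be sketched as follows. In an uncontaminated sample, the methylation-like signal of a SNP probe is expected to cluster around three values that correspond to the three genotypes, and a mixture of DNA from two donors pushes values in between. The sketch below is a deliberately simplified proxy, not the measure used in the paper: it counts the fraction of SNP probes lying farther than a tolerance from any expected cluster. The function name, the idealized cluster centers, and the tolerance are all assumptions made for illustration.

```python
def snp_outlier_fraction(betas, centers=(0.0, 0.5, 1.0), tol=0.1):
    """Fraction of SNP probe values farther than `tol` from every cluster center.

    `betas` holds one value in [0, 1] per SNP probe for a single sample.
    `centers` are idealized genotype clusters; real arrays usually show
    clusters shifted away from exactly 0 and 1, so these should be tuned.
    """
    if not betas:
        raise ValueError("at least one SNP probe value is required")
    outliers = sum(
        1 for b in betas if min(abs(b - c) for c in centers) > tol
    )
    return outliers / len(betas)


# A clean-looking sample versus one with many intermediate values.
clean = [0.02, 0.05, 0.48, 0.52, 0.97, 0.95]
mixed = [0.02, 0.20, 0.35, 0.52, 0.75, 0.95]
print(snp_outlier_fraction(clean), snp_outlier_fraction(mixed))
```

A high fraction only suggests that a sample deserves a closer look; poor probe performance can produce intermediate values as well, so the score is best read alongside the control metrics.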

Conclusions: A more complete examination of samples that may be mislabeled, contaminated, or poorly performing because of technical problems will improve downstream analyses and the replication of findings. We demonstrate that quality control problems are prevalent in a public repository of DNA methylation data. We advocate for a more thorough quality control workflow in epigenome-wide association studies and provide a software package to perform the checks described in this work. Reproducible code and supplementary material are available at DOI 10.5281/zenodo.1172730.

Citing Articles

Maternal epigenetic index links early neglect to later neglectful care and other psychopathological, cognitive, and bonding effects.

Leon I, Gongora D, Rodrigo M, Herrero-Roldan S, Lopez Rodriguez M, Mitchell C. Clin Epigenetics. 2025; 17(1):46.

PMID: 40057810 PMC: 11890505. DOI: 10.1186/s13148-025-01839-7.


Epigenetic signatures of intergenerational exposure to violence in three generations of Syrian refugees.

Mulligan C, Quinn E, Hamadmad D, Dutton C, Nevell L, Binder A. Sci Rep. 2025; 15(1):5945.

PMID: 40016245 PMC: 11868390. DOI: 10.1038/s41598-025-89818-z.


Epigenome-wide association study of cerebrospinal fluid-based biomarkers of Alzheimer's disease in cognitively normal individuals.

Huls A, Liu J, Konwar C, Conneely K, Levey A, Lah J. medRxiv. 2025.

PMID: 39974053 PMC: 11838696. DOI: 10.1101/2025.02.04.25321657.


Comprehensive guide for epigenetics and transcriptomics data quality control.

Comendul A, Ruf-Zamojski F, Ford C, Agarwal P, Zaslavsky E, Nudelman G. STAR Protoc. 2025; 6(1):103607.

PMID: 39869481 PMC: 11799959. DOI: 10.1016/j.xpro.2025.103607.


Placental and immune cell DNA methylation reference panel for bulk tissue cell composition estimation in epidemiological studies.

Campbell K, Colacino J, Dou J, Dolinoy D, Park S, Loch-Caruso R. Epigenetics. 2024; 19(1):2437275.

PMID: 39648517 PMC: 11633140. DOI: 10.1080/15592294.2024.2437275.

