Svaseq: Removing Batch Effects and Other Unwanted Noise from Sequencing Data

Overview
Specialty: Biochemistry
Date: 2014 Oct 9
PMID: 25294822
Citations: 283
Abstract

It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of that subset of the data matrix. The resulting estimates of artifacts can be used as adjustment factors in subsequent analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments, based on an appropriate data transformation. I also describe the addition of supervised sva (ssva), which uses control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package, and I have made a fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.
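The two steps named in the abstract can be illustrated with a minimal NumPy sketch of the supervised variant. This is not the sva package's implementation: the function name `supervised_surrogate_variables`, the `log(x + 1)` transformation, and the per-gene centering are assumptions chosen for illustration. The sketch assumes that a set of control genes affected only by artifacts is already known.

```python
import numpy as np


def supervised_surrogate_variables(counts, control_mask, n_sv=1):
    """Estimate surrogate variables from negative-control genes (sketch).

    counts       : (genes x samples) array of non-negative counts or FPKMs
    control_mask : boolean array of length `genes`, True for control genes
    n_sv         : number of surrogate variables to return
    """
    counts = np.asarray(counts, dtype=float)
    control_mask = np.asarray(control_mask, dtype=bool)

    # Transform the data so that a linear decomposition is reasonable.
    # log(x + 1) is an assumed choice for this sketch.
    logged = np.log(counts + 1.0)

    # Step (i): keep only the rows assumed to be affected by artifacts alone.
    controls = logged[control_mask]

    # Remove each gene's mean so the SVD captures variation across samples.
    centered = controls - controls.mean(axis=1, keepdims=True)

    # Step (ii): the leading right singular vectors (one entry per sample)
    # serve as the artifact estimates.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_sv].T  # shape: (samples, n_sv)
```

The returned columns could then be appended to the design matrix of a downstream model as adjustment factors. The unsupervised approach differs in that the artifact-only subset has to be estimated from the data rather than supplied.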

Citing Articles

Accurate identification of medulloblastoma subtypes from diverse data sources with severe batch effects by RaMBat.

Sun M, Wang J, Wan S. bioRxiv. 2025.

PMID: 40060540 PMC: 11888263. DOI: 10.1101/2025.02.24.640010.


Focal adhesion in the tumour metastasis: from molecular mechanisms to therapeutic targets.

Liu Z, Zhang X, Ben T, Li M, Jin Y, Wang T. Biomark Res. 2025; 13(1):38.

PMID: 40045379 PMC: 11884212. DOI: 10.1186/s40364-025-00745-7.


Transcriptomic analysis of iPSC-derived endothelium reveals adaptations to high altitude hypoxia in energy metabolism and inflammation.

Gray O, Witonsky D, Jousma J, Sobreira D, Van Alstyne A, Huang R. PLoS Genet. 2025; 21(2):e1011570.

PMID: 39928692 PMC: 11809796. DOI: 10.1371/journal.pgen.1011570.


Identification of the immune infiltration and biomarkers in ulcerative colitis based on liquid-liquid phase separation-related genes.

Hong Z, Fang S, Nie H, Zhou J, Hong Y, Liu L. Sci Rep. 2025; 15(1):4484.

PMID: 39915583 PMC: 11802798. DOI: 10.1038/s41598-025-89252-1.


Highly effective batch effect correction method for RNA-seq count data.

Zhang X. Comput Struct Biotechnol J. 2025; 27:58-64.

PMID: 39802213 PMC: 11718288. DOI: 10.1016/j.csbj.2024.12.010.

