Semi-supervised Nonnegative Matrix Factorization for Gene Expression Deconvolution: a Case Study

Overview

Journal: Infect Genet Evol

Specialties: Biology, Genetics, Infectious Diseases

Date: 2011 Sep 21

PMID: 21930246

Citations: 68

Authors

Renaud Gaujoux

Cathal Seoighe

Abstract

Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious disease or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics, where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently by the guided versions. We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.
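
The abstract does not spell out how the marker genes enter the factorization, so the following is only a minimal sketch of one plausible approach, not the authors' implementation. It assumes the expression matrix X (genes x samples) is approximated as W @ H, with W holding cell type signatures and H holding mixing coefficients, and that each marker gene is forced to zero expression in every component other than its own cell type. The function name `marker_guided_nmf`, the `markers` dictionary format, and the synthetic data are all hypothetical choices made for illustration. The updates are the standard multiplicative rules for the Frobenius objective, which keep zero entries at zero.

```python
import numpy as np


def marker_guided_nmf(X, markers, n_iter=200, seed=0, eps=1e-9):
    """Factorize X (genes x samples) as W @ H under marker-gene constraints.

    markers maps a component index (0..k-1) to the row indices of its marker
    genes. This input format is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    k = len(markers)

    # Strictly positive random initialization.
    W = rng.uniform(0.1, 1.0, size=(n_genes, k))
    H = rng.uniform(0.1, 1.0, size=(k, n_samples))

    # Mask: a marker gene of cell type j may only be expressed in component j.
    mask = np.ones((n_genes, k))
    for j, genes in markers.items():
        mask[genes, :] = 0.0
        mask[genes, j] = 1.0
    W *= mask

    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius (Euclidean) objective.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Zeros already persist under multiplicative updates; re-applying
        # the mask just makes the constraint explicit.
        W *= mask

    # Rescale each sample's coefficients to sum to one (proportion-like).
    proportions = H / (H.sum(axis=0, keepdims=True) + eps)
    return W, H, proportions


# Synthetic example (made-up data): 3 cell types, 3 marker genes each.
rng = np.random.default_rng(42)
n_genes, n_samples, k = 60, 12, 3
markers = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8]}

W_true = rng.uniform(0.5, 2.0, size=(n_genes, k))
for j, genes in markers.items():
    others = [c for c in range(k) if c != j]
    W_true[np.ix_(genes, others)] = 0.0  # markers are silent elsewhere

P_true = rng.dirichlet(np.ones(k), size=n_samples).T  # columns sum to 1
X = W_true @ P_true

W, H, P_est = marker_guided_nmf(X, markers)
```

Because the markers tie each component to a named cell type, row j of `P_est` can be compared directly with row j of `P_true`, for example with a Pearson correlation as in the abstract. Note that NMF leaves the scale of each component undetermined, so the column-normalized coefficients match true proportions only up to a per-component scaling unless the signatures are normalized as well.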
