Mining Gene Expression Data by Interpreting Principal Components

Overview

Journal: BMC Bioinformatics

Publisher: BioMed Central

Specialty: Biology

Date: 2006 Apr 8

PMID: 16600052

Citations: 26

Authors

Joseph C Roden

Brandon W King

Diane Trout

Ali Mortazavi

Barbara J Wold

Christopher E Hart

Abstract

Background: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis.

Results: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information-theoretic metrics. To make the method easy to use, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets and highlighted the sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, or samples). Statistically significant associations for the highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation.

Conclusion: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from the major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. It has proven especially valuable in instances where there are many diverse conditions (tens to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other domains, such as multi-spectral imaging datasets.
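The general idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' package or their exact procedure: the toy expression matrix, the gene and condition names, the annotation set `hypothetical_term`, and all function names are assumptions made for illustration, and the information-theoretic step is not reproduced. The sketch finds the first principal component of a genes-by-conditions matrix, picks the genes with the most extreme scores on it, reports which conditions carry the largest loadings, and scores annotation enrichment with a hypergeometric tail probability. It uses only the standard library; a routine such as `numpy.linalg.svd` would normally replace the power iteration.

```python
import math


def center_columns(matrix):
    """Subtract each condition's mean so every column has mean zero."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def first_component(centered, iterations=200):
    """Leading eigenvector of the condition covariance, by power iteration."""
    n_rows, n_cols = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n_rows - 1)
            for j in range(n_cols)] for i in range(n_cols)]
    v = [1.0] * n_cols
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The sign of a component is arbitrary; make the largest loading positive.
    if max(v, key=abs) < 0:
        v = [-x for x in v]
    return v


def gene_scores(centered, loadings):
    """Project every gene (row) onto the component."""
    return [sum(x * w for x, w in zip(row, loadings)) for row in centered]


def hypergeom_tail(population, annotated, drawn, hits):
    """P(at least `hits` annotated genes among `drawn` genes picked at random)."""
    total = math.comb(population, drawn)
    upper = min(annotated, drawn)
    return sum(math.comb(annotated, i) * math.comb(population - annotated, drawn - i)
               for i in range(hits, upper + 1)) / total


# Toy data (hypothetical): rows are genes, columns are conditions.
genes = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
conditions = ["cond0", "cond1", "cond2", "cond3"]
expression = [
    [1.0, 1.0, 5.0, 5.2],    # up in cond2 and cond3
    [1.1, 0.9, 4.8, 5.0],    # up in cond2 and cond3
    [1.0, 1.1, 1.0, 0.9],    # flat
    [0.9, 1.0, 1.1, 1.0],    # flat
    [1.0, 1.0, -3.0, -3.1],  # down in cond2 and cond3
    [1.0, 0.9, 1.0, 1.1],    # flat
]

centered = center_columns(expression)
loadings = first_component(centered)
scores = gene_scores(centered, loadings)

ranked = sorted(zip(scores, genes))          # ascending by score
down_genes = [g for _, g in ranked[:1]]      # most negative extreme
up_genes = [g for _, g in ranked[-2:]]       # most positive extremes

# Conditions with the largest absolute loadings drive this component.
by_loading = sorted(zip(loadings, conditions), key=lambda p: abs(p[0]), reverse=True)
driving = [c for _, c in by_loading[:2]]

# Enrichment of a made-up annotation term among the "up" genes.
hypothetical_term = {"geneA", "geneB"}
overlap = len(hypothetical_term & set(up_genes))
p_value = hypergeom_tail(len(genes), len(hypothetical_term), len(up_genes), overlap)

print("up:", up_genes, "down:", down_genes)
print("driving conditions:", driving)
print("enrichment p-value:", p_value)
```

On real data the number of extreme genes per component and the significance threshold would need to be chosen with care, and enrichment p-values would need correction for the many terms tested.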

Citing Articles

Erythropoiesis and Gene Expression Analysis in Erythroid Progenitor Cells Derived from Patients with Hemoglobin H/Constant Spring Disease.

Wongkhammul N, Khamphikham P, Tongjai S, Tantiworawit A, Fanhchaksai K, Wongpalee S. Int J Mol Sci. 2024; 25(20).

PMID: 39457028 PMC: 11508986. DOI: 10.3390/ijms252011246.

Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models.

Nicol P, Miller J. bioRxiv. 2023.

PMID: 37162914 PMC: 10168202. DOI: 10.1101/2023.04.21.537881.

Analysis of Dormancy-Associated Transcriptional Networks Reveals a Shared Quiescence Signature in Lung and Colorectal Cancer.

Cuccu A, Francescangeli F, De Angelis M, Bruselles A, Giuliani A, Zeuner A. Int J Mol Sci. 2022; 23(17).

PMID: 36077264 PMC: 9456317. DOI: 10.3390/ijms23179869.

A multivariate statistical test for differential expression analysis.

Tumminello M, Bertolazzi G, Sottile G, Sciaraffa N, Arancio W, Coronnello C. Sci Rep. 2022; 12(1):8265.

PMID: 35585166 PMC: 9117296. DOI: 10.1038/s41598-022-12246-w.

Islet sympathetic innervation and islet neuropathology in patients with type 1 diabetes.

Campbell-Thompson M, Butterworth E, Boatwright J, Nair M, Nasif L, Nasif K. Sci Rep. 2021; 11(1):6562.

PMID: 33753784 PMC: 7985489. DOI: 10.1038/s41598-021-85659-8.
