
How Large a Training Set is Needed to Develop a Classifier for Microarray Data?

Overview
Journal: Clin Cancer Res
Specialty: Oncology
Date: 2008 Jan 4
PMID: 18172259
Citations: 48
Abstract

Purpose: A common goal of gene expression microarray studies is the development of a classifier that can be used to divide patients into groups with different prognoses or different expected responses to a therapy. Such classifiers are developed on a training set, the set of samples used to fit the classifier. Determining how many samples the training set needs to produce a good classifier from high-dimensional microarray data is challenging.

Experimental Design: We present a model-based approach to determining the sample size required to adequately train a classifier.

Results: It is shown that sample size can be determined from three quantities: standardized fold change, class prevalence, and number of genes or features on the arrays. Numerous examples and important experimental design issues are discussed. The method is adapted to address ex post facto determination of whether the size of a training set used to develop a classifier was adequate. An interactive web site for performing the sample size calculations is provided.

Conclusion: We showed that sample size calculations for classifier development from high-dimensional microarray data are feasible, discussed numerous important considerations, and presented examples.

Citing Articles

Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach.

Qi Y, Wang X, Qin L Brief Bioinform. 2025; 26(2).

PMID: 40072846 PMC: 11899567. DOI: 10.1093/bib/bbaf097.


Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach.

Qi Y, Wang X, Qin L ArXiv. 2024.

PMID: 39314504 PMC: 11419172.


An approach for developing a blood-based screening panel for lung cancer based on clonal hematopoietic mutations.

Anandakrishnan R, Shahidi R, Dai A, Antony V, Zyvoloski I PLoS One. 2024; 19(8):e0307232.

PMID: 39172974 PMC: 11341013. DOI: 10.1371/journal.pone.0307232.


Small Non-Coding RNAs and Their Role in Locoregional Metastasis and Outcomes in Early-Stage Breast Cancer Patients.

Escuin D, Bell O, Garcia-Valdecasas B, Clos M, Larranaga I, Lopez-Vilaro L Int J Mol Sci. 2024; 25(7).

PMID: 38612790 PMC: 11011815. DOI: 10.3390/ijms25073982.


Revisiting Concurrent Radiation Therapy, Temozolomide, and the Histone Deacetylase Inhibitor Valproic Acid for Patients with Glioblastoma-Proteomic Alteration and Comparison Analysis with the Standard-of-Care Chemoirradiation.

Krauze A, Zhao Y, Li M, Shih J, Jiang W, Tasci E Biomolecules. 2023; 13(10).

PMID: 37892181 PMC: 10604983. DOI: 10.3390/biom13101499.