
Logistic Regression for Disease Classification Using Microarray Data: Model Selection in a Large P and Small N Case

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2007 Jun 2
PMID: 17540680
Citations: 42
Abstract

Motivation: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling because the number of genes is large and the number of subjects is small. Model selection for this two-step approach requires new statistical tools because prediction error estimates that ignore the feature selection step can be severely biased downward. Generic methods such as cross-validation and non-parametric bootstrap can be very ineffective because the resulting prediction error estimates are highly variable.
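The downward bias described in the Motivation can be illustrated with a small simulation. The sketch below is not the paper's parametric bootstrap and does not use the GeneLogit library; it is a minimal illustration under stated assumptions: pure-noise expression data, a simple t-like score for gene ranking, a ridge penalty fitted by plain gradient ascent, and hypothetical helper names (`t_scores`, `top_genes`, `fit_ridge_logistic`, `cv_error`). It compares cross-validation with genes chosen once on all samples (selection step ignored) against cross-validation that repeats the selection inside every fold.

```python
import math
import random

def t_scores(X, y, rows):
    """Absolute two-sample t-like score per gene, computed on the given rows only."""
    scores = []
    for j in range(len(X[0])):
        g0 = [X[i][j] for i in rows if y[i] == 0]
        g1 = [X[i][j] for i in rows if y[i] == 1]
        m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
        v0 = sum((v - m0) ** 2 for v in g0) / (len(g0) - 1)
        v1 = sum((v - m1) ** 2 for v in g1) / (len(g1) - 1)
        scores.append(abs(m1 - m0) / math.sqrt(v0 / len(g0) + v1 / len(g1) + 1e-12))
    return scores

def top_genes(X, y, rows, k):
    """Indices of the k genes with the largest scores."""
    s = t_scores(X, y, rows)
    return sorted(range(len(s)), key=lambda j: -s[j])[:k]

def fit_ridge_logistic(X, y, rows, genes, lam, steps=200, lr=0.1):
    """Gradient ascent on the average log-likelihood minus (lam / 2) * ||b||^2.
    The intercept b0 is not penalized."""
    b0, b = 0.0, [0.0] * len(genes)
    n = len(rows)
    for _ in range(steps):
        g0, g = 0.0, [-lam * bj for bj in b]
        for i in rows:
            eta = b0 + sum(bj * X[i][j] for bj, j in zip(b, genes))
            r = y[i] - 1.0 / (1.0 + math.exp(-eta))  # residual y - p
            g0 += r / n
            for a, j in enumerate(genes):
                g[a] += r * X[i][j] / n
        b0 += lr * g0
        b = [bj + lr * gj for bj, gj in zip(b, g)]
    return b0, b

def cv_error(X, y, k, lam, folds, select_inside):
    """K-fold misclassification rate. With select_inside=False the genes are
    chosen once using all samples, so the selection step is ignored."""
    rows = list(range(len(y)))
    fixed = None if select_inside else top_genes(X, y, rows, k)
    wrong = 0
    for f in range(folds):
        test = [i for i in rows if i % folds == f]
        train = [i for i in rows if i % folds != f]
        genes = top_genes(X, y, train, k) if select_inside else fixed
        b0, b = fit_ridge_logistic(X, y, train, genes, lam)
        for i in test:
            eta = b0 + sum(bj * X[i][j] for bj, j in zip(b, genes))
            wrong += int((1 if eta > 0 else 0) != y[i])
    return wrong / len(y)

# Assumed toy setting: 40 subjects, 1000 genes, labels unrelated to expression,
# so the true error rate of any classifier is 50%.
rng = random.Random(0)
n, p = 40, 1000
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

naive_err = cv_error(X, y, k=10, lam=0.1, folds=5, select_inside=False)
honest_err = cv_error(X, y, k=10, lam=0.1, folds=5, select_inside=True)
print("CV error, selection ignored: ", naive_err)
print("CV error, selection in folds:", honest_err)
```

Because the labels carry no signal, an estimate far below 0.5 from the first call reflects selection bias rather than predictive ability. The second estimate is unbiased in this respect but can still vary considerably from one dataset to the next, which is the variability problem the abstract attributes to generic resampling methods.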

Results: We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to the microarray data by borrowing from the extensive research in identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models.
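For orientation, one common form of penalized logistic regression on a selected gene set can be written as follows. The abstract does not state the exact penalty, so the quadratic (ridge) penalty and all symbols here are assumptions for illustration: $y_i \in \{0, 1\}$ is the class label of subject $i$, $x_{ij}$ the expression of gene $j$, $S$ the set of selected genes, and $\lambda \ge 0$ the shrinkage parameter.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\max_{\beta_0,\, \beta}
  \sum_{i=1}^{n} \Bigl[ y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr]
  - \frac{\lambda}{2} \sum_{j \in S} \beta_j^2,
\qquad
\eta_i = \beta_0 + \sum_{j \in S} x_{ij} \beta_j .
```

In this notation, the two model-selection questions named in the Results correspond to choosing the size of $S$ and the value of $\lambda$.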

Availability: R library GeneLogit at http://geocities.com/jg_liao

Citing Articles

Machine learning uncovers novel sex-specific dementia biomarkers linked to autism and eye diseases.

Khan A, Ghasemi A, Ingram K, Ay A. J Alzheimers Dis Rep. 2025; 9:25424823251317177.

PMID: 40034518 PMC: 11864256. DOI: 10.1177/25424823251317177.


TabDEG: Classifying differentially expressed genes from RNA-seq data based on feature extraction and deep learning framework.

Feng S, Wang Z, Jin Y, Xu S. PLoS One. 2024; 19(7):e0305857.

PMID: 39037985 PMC: 11262683. DOI: 10.1371/journal.pone.0305857.


An Explainable Deep Learning Classifier of Bovine Mastitis Based on Whole-Genome Sequence Data-Circumventing the p >> n Problem.

Kotlarz K, Mielczarek M, Biecek P, Wojdak-Maksymiec K, Suchocki T, Topolski P. Int J Mol Sci. 2024; 25(9).

PMID: 38731932 PMC: 11083318. DOI: 10.3390/ijms25094715.


Machine learning models for predicting the onset of chronic kidney disease after surgery in patients with renal cell carcinoma.

Oh S, Byun S, Kim J, Jeong C, Kwak C, Hwang E. BMC Med Inform Decis Mak. 2024; 24(1):85.

PMID: 38519947 PMC: 10960396. DOI: 10.1186/s12911-024-02473-8.


Systematic Characterization of p53-Regulated Long Noncoding RNAs across Human Cancers Reveals Remarkable Heterogeneity among Different Tumor Types.

Regunath K, Fomin V, Liu Z, Wang P, Hoque M, Tian B. Mol Cancer Res. 2024; 22(6):555-571.

PMID: 38393317 PMC: 11703046. DOI: 10.1158/1541-7786.MCR-23-0295.