Machine Learning-based Receiver Operating Characteristic (ROC) Curves for Crisp and Fuzzy Classification of DNA Microarrays in Cancer Research

Overview

Journal Int J Approx Reason

Date 2008 Dec 17

PMID 19079753

Citations 10

Authors

Leif E Peterson

Matthew A Coleman

Abstract

Receiver operating characteristic (ROC) curves were generated to obtain classification area under the curve (AUC) as a function of feature standardization, fuzzification, and sample size from nine large sets of cancer-related DNA microarrays. Classifiers used included k-nearest neighbor (kNN), naïve Bayes classifier (NBC), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), learning vector quantization (LVQ1), logistic regression (LOG), polytomous logistic regression (PLOG), artificial neural networks (ANN), particle swarm optimization (PSO), constricted particle swarm optimization (CPSO), kernel regression (RBF), radial basis function networks (RBFN), gradient descent support vector machines (SVMGD), and least squares support vector machines (SVMLS). For each data set, AUC was determined for a number of combinations of sample size and total sum[-log(p)] of feature t-tests, with and without feature standardization and with (fuzzy) and without (crisp) fuzzification of features. Altogether, a total of 2,123,530 classification runs were made. At the largest sample size, ANN resulted in the highest fitted AUC of 90%, while PSO resulted in the lowest fitted AUC of 72.1%. AUC values derived from 4NN were the most dependent on sample size, while those from PSO were the least dependent. ANN depended the most on the total statistical significance of the features used, based on sum[-log(p)], whereas PSO was the least dependent. Standardization of features increased AUC by 8.1% for PSO and reduced AUC by 0.2% for QDA, while fuzzification increased AUC by 9.4% for PSO and reduced AUC by 3.8% for QDA. AUC determination in planned microarray experiments without standardization and fuzzification of features will benefit the most if CPSO is used for lower levels of feature significance (i.e., sum[-log(p)] ~ 50) and ANN is used for greater levels of significance (i.e., sum[-log(p)] ~ 500).
When only standardization of features is performed, studies are likely to benefit most by using CPSO for low levels of feature statistical significance and LVQ1 for greater levels of significance. Studies involving only fuzzification of features should employ LVQ1 because of the substantial gain in AUC observed and low expense of LVQ1. Lastly, PSO resulted in significantly greater levels of AUC (89.5% average) when feature standardization and fuzzification were performed. In consideration of the data sets used and factors influencing AUC which were investigated, if low-expense computation is desired then LVQ1 is recommended. However, if computational expense is of less concern, then PSO or CPSO is recommended.
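
The quantities named in the abstract (feature standardization, AUC, and the total feature significance sum[-log(p)]) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy data, the use of z-scores for standardization, and the choice of the natural logarithm are all assumptions made for illustration only.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Z-score one feature (assumed form of standardization):
    subtract the mean, divide by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def total_significance(p_values):
    """sum[-log(p)] over the selected features.
    The natural logarithm is an assumption; the abstract does not state the base."""
    return sum(-math.log(p) for p in p_values)


# Toy data, made up for illustration: one feature measured on six samples.
expression = [2.1, 3.4, 1.9, 5.2, 3.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]

z = standardize(expression)
print("AUC:", auc(z, labels))
print("sum[-log(p)]:", total_significance([0.01, 0.05]))
```

When a single feature is used directly as the score, as here, standardization is a monotone transform and leaves the AUC unchanged. Any effect of standardization would arise only once several features are combined inside a classifier, for example through the distances used by kNN.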

Citing Articles

Identifying Candidate Gene-Disease Associations via Graph Neural Networks.

Cinaglia P, Cannataro M Entropy (Basel). 2023; 25(6).

PMID: 37372253 PMC: 10296901. DOI: 10.3390/e25060909.

Computation of the distribution of model accuracy statistics in machine learning: Comparison between analytically derived distributions and simulation-based methods.

Huang A, Huang S Health Sci Rep. 2023; 6(4):e1214.

PMID: 37091362 PMC: 10119581. DOI: 10.1002/hsr2.1214.

A comprehensive survey on computational learning methods for analysis of gene expression data.

Bhandari N, Walambe R, Kotecha K, Khare S Front Mol Biosci. 2022; 9:907150.

PMID: 36458095 PMC: 9706412. DOI: 10.3389/fmolb.2022.907150.

Predictions from algorithmic modeling result in better decisions than from data modeling for soybean iron deficiency chlorosis.

Xu Z, Kurek A, Cannon S, Beavis W PLoS One. 2021; 16(7):e0240948.

PMID: 34242220 PMC: 8270216. DOI: 10.1371/journal.pone.0240948.

3-Dimensional facial expression recognition in human using multi-points warping.

Agbolade O, Nazri A, Yaakob R, Ghani A, Cheah Y BMC Bioinformatics. 2019; 20(1):619.

PMID: 31791234 PMC: 6889223. DOI: 10.1186/s12859-019-3153-2.
