Beyond Size and Class Balance: Alpha As a New Dataset Quality Metric for Deep Learning

Overview
Journal ArXiv
Date 2025 Jan 20
PMID 39830079
Abstract

In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice, maximizing dataset size and class balance, does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but "big alpha," a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these measures explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size plus one big-alpha measure (79%), which outperformed size-plus-class-balance (74%). Subsets with the largest big-alpha values performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing big alpha as a way to improve deep learning performance in medical imaging.
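To make the idea of an "effective number" that accounts for similarity concrete, here is a minimal sketch of a similarity-sensitive diversity measure of the kind used in ecology (the Leinster-Cobbold form of Hill numbers). This is an illustrative assumption, not the paper's definition of big alpha, which is computed over image-class pairs and is not spelled out in the abstract. The function name `similarity_sensitive_diversity`, the weight vector `p`, the similarity matrix `Z`, and the order `q` are all hypothetical names chosen for this example.

```python
import numpy as np


def similarity_sensitive_diversity(p, Z, q):
    """Effective number of distinct items, given similarities (illustrative).

    p : relative weights of the items (normalized to sum to 1 here)
    Z : square similarity matrix with entries in [0, 1] and ones on the diagonal
    q : order of the measure (0 weights rare items most, larger q weights common ones)
    """
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    p = p / p.sum()
    Zp = Z @ p          # how "ordinary" each item is, given everything similar to it
    mask = p > 0        # items with zero weight do not contribute
    if np.isclose(q, 1.0):
        # Limit q -> 1: exponential of a similarity-aware Shannon entropy
        return float(np.exp(-np.sum(p[mask] * np.log(Zp[mask]))))
    if np.isinf(q):
        return float(1.0 / np.max(Zp[mask]))
    return float(np.sum(p[mask] * Zp[mask] ** (q - 1.0)) ** (1.0 / (1.0 - q)))


# Four equally weighted images.
p = np.full(4, 0.25)

# Case 1: all images completely dissimilar (identity similarity matrix).
Z_distinct = np.eye(4)

# Case 2: two pairs of exact duplicates.
Z_duplicates = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

d_distinct = similarity_sensitive_diversity(p, Z_distinct, q=1)
d_duplicates = similarity_sensitive_diversity(p, Z_duplicates, q=1)
print(d_distinct, d_duplicates)
```

With the identity matrix, the measure at order 1 reduces to the exponential of Shannon entropy, and at order 0 to a simple count, so the four distinct images count as four. With duplicated pairs, the same four images have an effective number of two. This is the sense in which raw dataset size can overstate diversity.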
