Subsumption Reduces Dataset Dimensionality Without Decreasing Performance of a Machine Learning Classifier

Overview

Journal Annu Int Conf IEEE Eng Med Biol Soc

Specialty Biomedical Engineering

Date 2021 Dec 11

PMID 34891595

Citations 1

Authors

Donald C Wunsch

Daniel B Hier


Abstract

When features in a high-dimensional dataset are organized hierarchically, there is an inherent opportunity to reduce dimensionality. Because more specific concepts are subsumed by more general concepts, subsumption can be applied successively to reduce dimensionality. We tested whether subsumption could reduce the dimensionality of a disease dataset without impairing classification accuracy. We started with a dataset of 168 neurological patients, 14 diagnoses, and 293 unique features. We applied subsumption repeatedly to create eight successively smaller datasets, ranging from 293 dimensions in the largest dataset to 11 dimensions in the smallest. We tested an MLP classifier on all eight datasets. Precision, recall, accuracy, and validation declined only at the lowest dimensionality. Our preliminary results suggest that when features in a high-dimensional dataset are derived from a hierarchical ontology, subsumption is a viable strategy to reduce dimensionality.

Clinical relevance: Datasets derived from electronic health records are often of high dimensionality. If features in the dataset are based on concepts from a hierarchical ontology, subsumption can reduce dimensionality.
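The abstract does not spell out how subsumption was implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy child-to-parent ontology (`PARENT`), hypothetical feature names, and a simple rule in which every feature climbs a fixed number of levels toward the root; features that land on the same ancestor merge, which is what lowers the dimensionality.

```python
# Hypothetical child -> parent map for a toy ontology (not taken from the paper).
PARENT = {
    "left arm weakness": "arm weakness",
    "right arm weakness": "arm weakness",
    "arm weakness": "weakness",
    "leg weakness": "weakness",
    "weakness": "motor sign",
    "tremor": "motor sign",
}


def subsume(feature, levels=1):
    """Climb up to `levels` steps toward the root, stopping if a root is reached."""
    for _ in range(levels):
        if feature not in PARENT:
            break
        feature = PARENT[feature]
    return feature


def reduce_dataset(patients, levels=1):
    """Replace each patient's features with their ancestors; duplicates merge."""
    return [sorted({subsume(f, levels) for f in feats}) for feats in patients]


def dimensionality(patients):
    """Count the distinct features used across all patients."""
    return len({f for feats in patients for f in feats})


# Three illustrative patients, each described by a list of features.
patients = [
    ["left arm weakness", "tremor"],
    ["right arm weakness", "leg weakness"],
    ["arm weakness"],
]

# Dimensionality after 0, 1, 2, and 3 rounds of subsumption.
sizes = [dimensionality(reduce_dataset(patients, k)) for k in range(4)]
print(sizes)
```

In this toy example the feature count shrinks with each round until every patient is described by the single root concept, at which point the features no longer distinguish patients. That is consistent with the abstract's observation that performance declined only at the lowest dimensionality, although the sketch itself does not train or evaluate a classifier.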

Citing Articles

The visualization of Orphadata neurology phenotypes.

Hier D, Yelugam R, Carrithers M, Wunsch 3rd D. Front Digit Health. 2023; 5:1064936.

PMID: 36778102 PMC: 9911440. DOI: 10.3389/fdgth.2023.1064936.