Preventing Dataset Shift from Breaking Machine-learning Biomarkers

Overview
Journal: Gigascience
Specialties: Biology, Genetics
Date: 2021 Sep 29
PMID: 34585237
Citations: 23
Abstract

Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g., because of recruitment biases. When a dataset shift occurs, standard machine-learning techniques do not suffice to extract and validate biomarkers. This article provides an overview of when and how dataset shifts break machine-learning-extracted biomarkers, as well as detection and correction strategies.
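
To make the recruitment-bias problem concrete, the sketch below shows how a cohort that over-represents one group can overstate a biomarker's accuracy, and how reweighting individuals toward the target population's composition changes the estimate. The abstract does not name a specific correction, so importance weighting is used here only as one common, illustrative choice. The strata, cohort sizes, per-group accuracies, and target proportions are all hypothetical, and the sketch assumes the biomarker performs the same within each stratum in the cohort and in the target population.

```python
from collections import Counter


def stratum_weights(cohort_strata, target_proportions):
    """Importance weight per stratum: target share divided by cohort share."""
    counts = Counter(cohort_strata)
    n = len(cohort_strata)
    return {s: target_proportions[s] / (counts[s] / n) for s in counts}


def weighted_accuracy(correct, strata, weights):
    """Accuracy in which each individual counts according to their stratum weight."""
    total = sum(weights[s] for s in strata)
    hits = sum(weights[s] for c, s in zip(correct, strata) if c)
    return hits / total


# Hypothetical cohort of 100 individuals: recruitment favoured the young.
strata = ["young"] * 80 + ["old"] * 20

# Hypothetical biomarker outcomes: correct for 72 of 80 young (90%)
# and for 12 of 20 old (60%).
correct = [True] * 72 + [False] * 8 + [True] * 12 + [False] * 8

# Hypothetical composition of the population the biomarker will be applied to.
target = {"young": 0.3, "old": 0.7}

naive = sum(correct) / len(correct)
weights = stratum_weights(strata, target)
reweighted = weighted_accuracy(correct, strata, weights)

print(f"accuracy measured on the cohort: {naive:.2f}")
print(f"accuracy reweighted to the target population: {reweighted:.2f}")
```

With these made-up numbers the cohort reports 0.84 while the reweighted estimate is 0.69, because the group on which the biomarker works less well is rare in the cohort but common in the target population. Reweighting of this kind only helps when the variable driving the shift is measured and every target stratum is present in the cohort.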

Citing Articles

Transferability of Single- and Cross-Tissue Transcriptome Imputation Models Across Ancestry Groups.

Pagnuco I, Eyre S, Rattray M, Morris A Genet Epidemiol. 2025; 49(1):e22611.

PMID: 39812501 PMC: 11734644. DOI: 10.1002/gepi.22611.


Magnetic Resonance Imaging Liver Segmentation Protocol Enables More Consistent and Robust Annotations, Paving the Way for Advanced Computer-Assisted Analysis.

Jeltsch P, Monnin K, Jreige M, Fernandes-Mendes L, Girardet R, Dromain C Diagnostics (Basel). 2025; 14(24.

PMID: 39767146 PMC: 11726866. DOI: 10.3390/diagnostics14242785.


Power and reproducibility in the external validation of brain-phenotype predictions.

Rosenblatt M, Tejavibulya L, Sun H, Camp C, Khaitova M, Adkinson B Nat Hum Behav. 2024; 8(10):2018-2033.

PMID: 39085406 DOI: 10.1038/s41562-024-01931-7.


Considerations for Quality Control Monitoring of Machine Learning Models in Clinical Practice.

Faust L, Wilson P, Asai S, Fu S, Liu H, Ruan X JMIR Med Inform. 2024; 12:e50437.

PMID: 38941140 PMC: 11245651. DOI: 10.2196/50437.


Pathophysiological Features in Electronic Medical Records Sustain Model Performance under Temporal Dataset Shift.

Brosula R, Corbin C, Chen J AMIA Jt Summits Transl Sci Proc. 2024; 2024:95-104.

PMID: 38827052 PMC: 11141811.


References
1. Dockes J, Varoquaux G, Poline J. Preventing dataset shift from breaking machine-learning biomarkers. Gigascience. 2021; 10(9). PMC: 8478611. DOI: 10.1093/gigascience/giab055.

2. Henrich J, Heine S, Norenzayan A. Most people are not WEIRD. Nature. 2010; 466(7302):29. DOI: 10.1038/466029a.

3. Faust O, Hagiwara Y, Hong T, Lih O, Rajendra Acharya U. Deep learning for healthcare applications based on physiological signals: A review. Comput Methods Programs Biomed. 2018; 161:1-13. DOI: 10.1016/j.cmpb.2018.04.005.

4. Hernan M, Hernandez-Diaz S, Robins J. A structural approach to selection bias. Epidemiology. 2004; 15(5):615-25. DOI: 10.1097/01.ede.0000135174.63482.43.

5. Adamson A, Smith A. Machine Learning and Health Care Disparities in Dermatology. JAMA Dermatol. 2018; 154(11):1247-1248. DOI: 10.1001/jamadermatol.2018.2348.