Imbalanced Target Prediction with Pattern Discovery on Clinical Data Repositories

Overview

Journal: BMC Med Inform Decis Mak

Publisher: BioMed Central

Specialty: Medical Informatics

Date: 2017 Apr 22

PMID: 28427384

Citations: 4

Authors

Tak-Ming Chan

Yuxi Li

Choo-Chiap Chiau

Jane Zhu

Jie Jiang

Yong Huo


Abstract

Background: Clinical data repositories (CDRs) have great potential to improve outcome prediction and risk modeling. However, most clinical studies require careful study design, dedicated data collection, and sophisticated modeling techniques before a hypothesis can be tested. We aim to bridge this gap so that clinical domain users can run first-hand predictions on existing repository data without complicated handling, and obtain insightful patterns for imbalanced targets before a formal study is conducted. We specifically target interpretability, so that domain users can conveniently explain the model and apply it in clinical practice.

Methods: We propose an interpretable pattern model that tolerates noise (missing values) in practice data. To address the challenge of imbalanced targets of interest in clinical research, e.g., deaths accounting for less than a few percent of cases, we adopt the geometric mean of sensitivity and specificity (G-mean) as the optimization criterion and develop a simple but effective heuristic algorithm for it.
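To illustrate why G-mean suits imbalanced targets, here is a minimal sketch that computes it from confusion-matrix counts. It is not the authors' heuristic algorithm. The function names and the 1000-patient cohort are hypothetical, chosen only to mirror a 9.1% event rate.

```python
import math


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical cohort (assumption): 1000 patients, 91 deaths (9.1%).
# A classifier that always predicts "survives" never finds a death.
always_negative = dict(tp=0, fn=91, tn=909, fp=0)

# A hypothetical classifier that trades some specificity for sensitivity.
balanced = dict(tp=70, fn=21, tn=700, fp=209)

print(accuracy(**always_negative))  # 0.909
print(g_mean(**always_negative))    # 0.0
print(accuracy(**balanced), g_mean(**balanced))
```

In this sketch the trivial classifier scores high on accuracy but zero on G-mean, because G-mean collapses whenever either class is entirely missed.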

Results: We compared pattern discovery with clinically interpretable methods on two retrospective clinical datasets, which contain 14.9% deaths within 1 year in the thoracic dataset and 9.1% deaths in the cardiac dataset, respectively. Although the imbalance poses a challenge for the other methods, pattern discovery consistently shows competitive cross-validated prediction performance. Compared with logistic regression, Naïve Bayes, and decision tree, pattern discovery achieves more favorable averaged testing G-means and F1-scores (harmonic mean of precision and sensitivity), and the differences are statistically significant (p-values < 0.01, Wilcoxon signed-rank test). Without requiring sophisticated technical processing of data or tweaking, the prediction performance of pattern discovery is consistently comparable to the best achievable performance.
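The F1-score mentioned above can be sketched in the same way. The function name and the counts below are hypothetical and are not taken from the study's results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + sensitivity == 0:
        return 0.0  # no true positives: define F1 as 0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts (assumption): 91 deaths, of which 70 are detected,
# at the cost of 209 false alarms.
print(f1_score(tp=70, fp=209, fn=21))

# A classifier that never predicts the minority class gets F1 = 0.
print(f1_score(tp=0, fp=0, fn=91))  # 0.0
```

Unlike G-mean, F1 ignores true negatives, so the two metrics can rank classifiers differently when false alarms are frequent.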

Conclusions: Pattern discovery has been demonstrated to be robust and valuable for target prediction on existing clinical data repositories with imbalance and noise. The prediction results and interpretable patterns can provide insights for potential formal studies in an agile and inexpensive way.

Citing Articles

Identifying Modifiable Predictors of COVID-19 Vaccine Side Effects: A Machine Learning Approach.

Abbaspour S, Robbins G, Blumenthal K, Hashimoto D, Hopcia K, Mukerji S. Vaccines (Basel). 2022; 10(10).

PMID: 36298612 PMC: 9608090. DOI: 10.3390/vaccines10101747.

A pattern-discovery-based outcome predictive tool integrated with clinical data repository: design and a case study on contrast related acute kidney injury.

Li Y, Chan T, Feng J, Tao L, Jiang J, Zheng B. BMC Med Inform Decis Mak. 2022; 22(1):103.

PMID: 35428291 PMC: 9013021. DOI: 10.1186/s12911-022-01841-6.

Pattern discovery and disentanglement on relational datasets.

Wong A, Zhou P, Butt Z. Sci Rep. 2021; 11(1):5688.

PMID: 33707478 PMC: 7952710. DOI: 10.1038/s41598-021-84869-4.

Explanation and prediction of clinical data with imbalanced class distribution based on pattern discovery and disentanglement.

Zhou P, Wong A. BMC Med Inform Decis Mak. 2021; 21(1):16.

PMID: 33422088 PMC: 7796578. DOI: 10.1186/s12911-020-01356-y.
