COMMUTE: Communication-efficient Transfer Learning for Multi-site Risk Prediction

Overview

Journal J Biomed Inform

Publisher Elsevier

Specialty Medical Informatics

Date 2022 Nov 20

PMID 36403757

Authors

Tian Gu

Phil H Lee

Rui Duan


Abstract

Objectives: We propose a communication-efficient transfer learning approach (COMMUTE) that effectively incorporates multi-site healthcare data for training a risk prediction model in a target population of interest, accounting for challenges including population heterogeneity and data sharing constraints across sites.

Methods: We first train population-specific source models locally within each site. Using data from a given target population, COMMUTE learns a calibration term for each source model, which adjusts for potential data heterogeneity through flexible distance-based regularizations. In a centralized setting where multi-site data can be directly pooled, all data are combined to train the target model after calibration. When individual-level data cannot be shared by some sites, COMMUTE requests only the locally trained models from those sites and uses them to generate heterogeneity-adjusted synthetic data for training the target model. We evaluate COMMUTE via extensive simulation studies and an application to multi-site data from the electronic Medical Records and Genomics (eMERGE) Network to predict extreme obesity.
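The workflow described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact estimator, so the sketch assumes logistic regression, uses a squared L2 penalty as one possible distance-based regularization, and resamples target covariates to build the synthetic data. The function names (`fit_logistic`, `calibrate`, `synthetic_data`) and all tuning values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, offset=None, l2=0.0, lr=0.1, n_iter=2000):
    """Logistic regression by gradient descent, with an optional fixed
    offset and an optional squared-L2 penalty on the coefficients."""
    n, p = X.shape
    offset = np.zeros(n) if offset is None else offset
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta + offset) - y) / n + 2.0 * l2 * beta
        beta -= lr * grad
    return beta

def calibrate(w_source, X_t, y_t, lam):
    """Learn a calibration term delta on target data so that
    w_source + delta fits the target; the penalty shrinks delta toward 0.
    (Assumption: squared L2 distance stands in for the paper's penalty.)"""
    delta = fit_logistic(X_t, y_t, offset=X_t @ w_source, l2=lam)
    return w_source + delta

def synthetic_data(w_cal, X_t, m, rng):
    """Assumption: resample target covariates and draw outcomes from the
    calibrated source model, so no individual-level source data is needed."""
    idx = rng.integers(0, X_t.shape[0], size=m)
    X_syn = X_t[idx]
    y_syn = rng.binomial(1, sigmoid(X_syn @ w_cal))
    return X_syn, y_syn

# Toy data: one source site whose population differs from the target.
rng = np.random.default_rng(0)
p = 3
beta_target = np.array([1.0, -1.0, 0.5])
beta_source = np.array([2.0, -1.0, 0.5])

X_s = rng.normal(size=(2000, p))
y_s = rng.binomial(1, sigmoid(X_s @ beta_source))
X_t = rng.normal(size=(200, p))
y_t = rng.binomial(1, sigmoid(X_t @ beta_target))

w_source = fit_logistic(X_s, y_s)                  # trained locally at the source site
w_cal = calibrate(w_source, X_t, y_t, lam=0.05)    # calibrated with target data
X_syn, y_syn = synthetic_data(w_cal, X_t, m=1000, rng=rng)

# Pool target data with synthetic data and train the target model.
X_pool = np.vstack([X_t, X_syn])
y_pool = np.concatenate([y_t, y_syn])
w_target = fit_logistic(X_pool, y_pool)
```

In this sketch only `w_source` leaves the source site, and it is sent once. A larger `lam` keeps the calibrated model closer to the source model, while a smaller `lam` lets the target data override it.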

Results: Simulation studies show that COMMUTE outperforms methods that do not adjust for population heterogeneity and methods trained in a single population over a broad spectrum of settings. Using eMERGE data, COMMUTE achieves an area under the receiver operating characteristic curve (AUC) of around 0.80, outperforming benchmark methods whose AUCs range from 0.51 to 0.70.
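For readers less familiar with the metric reported above, the AUC equals the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc` and the toy labels and scores are illustrative and are not taken from the eMERGE analysis.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the fraction of (case, non-case) pairs in which the case
    has the higher score; tied pairs contribute one half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all pairwise score differences
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Toy example: 3 of the 4 case/non-case pairs are ranked correctly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```

An AUC of 0.5 corresponds to ranking no better than chance, so the benchmark value of 0.51 quoted above is close to uninformative.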

Conclusion: COMMUTE improves risk prediction in a target population with limited samples and safeguards against negative transfer when some source populations differ substantially from the target. In a federated setting, it is highly communication-efficient: each site shares its model parameter estimates only once, and no iterative communication or higher-order terms are needed.

Citing Articles

Multi-Task Learning with Summary Statistics.

Knight P, Duan R. Adv Neural Inf Process Syst. 2024; 36:54020-54031.

PMID: 39351341 PMC: 11440483.

Learning across diverse biomedical data modalities and cohorts: Challenges and opportunities for innovation.

Rajendran S, Pan W, Sabuncu M, Chen Y, Zhou J, Wang F. Patterns (N Y). 2024; 5(2):100913.

PMID: 38370129 PMC: 10873158. DOI: 10.1016/j.patter.2023.100913.

A synthetic data integration framework to leverage external summary-level information from heterogeneous populations.

Gu T, Taylor J, Mukherjee B. Biometrics. 2023; 79(4):3831-3845.

PMID: 36876883 PMC: 10480346. DOI: 10.1111/biom.13852.
