
A Robust Transfer Learning Approach for High-dimensional Linear Regression to Support Integration of Multi-source Gene Expression Data

Overview
Date 2025 Jan 10
PMID 39792955
Abstract

Transfer learning aims to integrate useful information from multi-source datasets to improve learning performance on target data. This applies naturally in genomics: when learning gene associations in a target tissue, data from other tissues can be integrated. However, heavy-tailed distributions and outliers are common in genomics data, which challenges the effectiveness of current transfer learning approaches. In this paper, we study the transfer learning problem under high-dimensional linear models with t-distributed error (Trans-PtLR), which aims to improve estimation and prediction for target data by borrowing information from useful source data while offering robustness to complex data with heavy tails and outliers. In the oracle case with known transferable source datasets, we establish a transfer learning algorithm based on penalized maximum likelihood and the expectation-maximization algorithm. To avoid including non-informative sources, we propose selecting the transferable sources by cross-validation. Extensive simulation experiments and an application show that, when heavy tails and outliers are present, Trans-PtLR is more robust and achieves better estimation and prediction performance than transfer learning for linear regression with normally distributed errors.

Keywords: Data integration; Variable selection; t distribution; Expectation-maximization algorithm; Genotype-Tissue Expression; Cross-validation.
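To make the core idea concrete, the following is a minimal sketch of penalized maximum likelihood for a linear model with t-distributed errors, fitted by an expectation-maximization loop on a single dataset. It is not the Trans-PtLR implementation: the function names (`t_lasso_em`, `weighted_lasso`), the fixed degrees of freedom `nu`, the L1 penalty, the coordinate-descent solver, and the choice to absorb the error scale into the penalty level `lam` are all assumptions made for illustration, and the transfer step and the cross-validated source selection are omitted.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator used by the lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso(X, y, w, lam, beta, n_sweeps=20):
    """Coordinate descent for (1/2n) * sum_i w_i * (y_i - x_i' beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = beta.copy()
    r = y - X @ beta                      # current residuals
    for _ in range(n_sweeps):
        for j in range(p):
            xj = X[:, j]
            denom = np.sum(w * xj * xj) / n
            if denom == 0.0:
                continue
            r += xj * beta[j]             # partial residual without coordinate j
            rho = np.sum(w * xj * r) / n
            beta[j] = soft_threshold(rho, lam) / denom
            r -= xj * beta[j]
    return beta

def t_lasso_em(X, y, lam=0.05, nu=3.0, n_iter=30):
    """EM sketch for L1-penalized regression with t errors (nu held fixed by assumption)."""
    n, p = X.shape
    beta = np.zeros(p)
    sigma2 = max(np.var(y), 1e-12)
    w = np.ones(n)
    for _ in range(n_iter):
        r = y - X @ beta
        # E-step: expected latent precision weights; large residuals get small weights.
        w = (nu + 1.0) / (nu + r ** 2 / sigma2)
        # M-step: weighted lasso for beta (scale absorbed into lam, a simplification),
        # then the weighted residual variance for sigma2.
        beta = weighted_lasso(X, y, w, lam, beta)
        r = y - X @ beta
        sigma2 = max(np.sum(w * r ** 2) / n, 1e-12)
    return beta, sigma2, w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 10))
    beta_true = np.array([1.5, -1.0] + [0.0] * 8)
    y = X @ beta_true + rng.standard_t(3, size=100)
    beta_hat, sigma2_hat, weights = t_lasso_em(X, y)
    print(np.round(beta_hat, 2))
```

The weights from the E-step are what give the robustness: an observation with a very large residual contributes little to the weighted lasso fit, whereas under a normal error model every observation would carry equal weight.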
