
Prediction of Human O-linked Glycosylation Sites Using Stacked Generalization and Embeddings from Pre-trained Protein Language Model

Overview
Journal Bioinformatics
Specialty Biology
Date 2024 Oct 24
PMID 39447059
Abstract

Motivation: O-linked glycosylation, an essential post-translational modification in Homo sapiens, attaches sugar moieties to the oxygen atoms of serine and/or threonine residues and influences a wide range of biological and cellular functions. Every serine or threonine residue in a protein sequence is a potential O-linked glycosylation site, but not all of them undergo this modification, which makes it important to characterize where it occurs. This study presents a novel approach for predicting intracellular and extracellular O-linked glycosylation events on proteins, which are crucial for comprehending cellular processes. Within a stacked generalization framework, two base multi-layer perceptron models were trained, one on ProtT5 and one on Ankh site-specific embeddings of candidate O-linked glycosylation sites, and their combined predictions were used to train a meta multi-layer perceptron model. Trained on extensive O-linked glycosylation datasets, the stacked-generalization model demonstrated high predictive performance on independent test datasets. Furthermore, the study emphasizes the distinction between nucleocytoplasmic and extracellular O-linked glycosylation, offering insights into their functional implications that were overlooked in previous studies. By integrating protein language model embeddings with stacked generalization, this approach enhances the predictive accuracy of O-linked glycosylation events and illuminates the intricate roles of O-linked glycosylation in proteomics, potentially accelerating the discovery of novel glycosylation sites.
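The stacking scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embeddings are synthetic random vectors standing in for real ProtT5 and Ankh output, the dimensions and layer sizes are arbitrary, and the use of out-of-fold base predictions to train the meta-model is an assumption about how "combined predictions" might be produced without leakage.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-site embeddings (NOT real ProtT5/Ankh vectors).
# Dimensions 32 and 24 are arbitrary choices for this sketch.
n_sites = 400
labels = rng.integers(0, 2, size=n_sites)  # 1 = glycosylated S/T, 0 = not
views = {
    "prott5_like": rng.normal(size=(n_sites, 32)) + 0.8 * labels[:, None],
    "ankh_like": rng.normal(size=(n_sites, 24)) + 0.6 * labels[:, None],
}

# Hold out an independent test split of candidate sites.
idx_train, idx_test = train_test_split(
    np.arange(n_sites), test_size=0.25, random_state=0, stratify=labels
)
y_train = labels[idx_train]


def make_mlp():
    """Small MLP; the hidden size is a hypothetical choice."""
    return MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)


base_models = {}
oof_columns = []
for name, X in views.items():
    # Out-of-fold probabilities: each training site is scored by a model
    # that never saw it, so the meta-model is not trained on leaked labels.
    oof = cross_val_predict(
        make_mlp(), X[idx_train], y_train, cv=5, method="predict_proba"
    )[:, 1]
    oof_columns.append(oof)
    # Refit each base model on the full training split for test-time use.
    base_models[name] = make_mlp().fit(X[idx_train], y_train)

# Meta-features: one column of base-model probabilities per embedding view.
meta_train = np.column_stack(oof_columns)
meta_model = make_mlp().fit(meta_train, y_train)

# At test time, base predictions feed the meta-model.
meta_test = np.column_stack(
    [base_models[name].predict_proba(views[name][idx_test])[:, 1] for name in views]
)
test_probs = meta_model.predict_proba(meta_test)[:, 1]
test_pred = (test_probs >= 0.5).astype(int)
```

The meta-model here sees only two numbers per site, one probability from each base model, which is what lets it learn how much to trust each embedding view.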

Results: Stack-OglyPred-PLM achieves a sensitivity, specificity, Matthews correlation coefficient, and accuracy of 90.50%, 89.60%, 0.464, and 89.70%, respectively, on the benchmark NetOGlyc-4.0 independent test dataset. These results demonstrate that Stack-OglyPred-PLM is a robust computational tool for predicting O-linked glycosylation sites in proteins.
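For readers unfamiliar with these metrics, the sketch below computes all four from a binary confusion matrix. The counts are hypothetical and are not taken from the NetOGlyc-4.0 test set; they were chosen only to show that sensitivity, specificity, and accuracy near 90% can coexist with a moderate Matthews correlation coefficient when positive sites are much rarer than negative ones, which is consistent with the pattern in the reported figures.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)  # fraction of true sites recovered
    specificity = tn / (tn + fp)  # fraction of non-sites correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced test set: 100 positive sites, 2000 negative sites.
metrics = binary_metrics(tp=90, fn=10, tn=1800, fp=200)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it penalizes the many false positives an imbalanced negative class produces, even when the per-class rates look high.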

Availability and Implementation: The developed tool, programs, and training and test datasets are available at https://github.com/PakhrinLab/Stack-OglyPred-PLM.

Citing Articles

Enhanced O-glycosylation site prediction using explainable machine learning technique with spatial local environment.

Hong S, Chattaraj K, Guo J, Trout B, Braatz R Bioinformatics. 2025; 41(2).

PMID: 39878910 PMC: 11814488. DOI: 10.1093/bioinformatics/btaf034.


TargetCLP: clathrin proteins prediction combining transformed and evolutionary scale modeling-based multi-view features via weighted feature integration approach.

Ullah M, Akbar S, Raza A, Khan K, Zou Q Brief Bioinform. 2025; 26(1).

PMID: 39844339 PMC: 11753890. DOI: 10.1093/bib/bbaf026.
