Genome-scale Annotation of Protein Binding Sites Via Language Model and Geometric Deep Learning

Overview
Journal eLife
Specialty Biology
Date 2024 Apr 17
PMID 38630609
Abstract

Revealing protein binding sites with other molecules, such as nucleic acids, peptides, or small ligands, sheds light on disease mechanism elucidation and novel drug design. With the explosive growth of proteins in sequence databases, how to accurately and efficiently identify these binding sites from sequences becomes essential. However, current methods mostly rely on expensive multiple sequence alignments or experimental protein structures, limiting their genome-scale applications. Besides, these methods haven't fully explored the geometry of the protein structures. Here, we propose GPSite, a multi-task network for simultaneously predicting binding residues of DNA, RNA, peptide, protein, ATP, HEM, and metal ions on proteins. GPSite was trained on informative sequence embeddings and predicted structures from protein language models, while comprehensively extracting residual and relational geometric contexts in an end-to-end manner. Experiments demonstrate that GPSite substantially surpasses state-of-the-art sequence-based and structure-based approaches on various benchmark datasets, even when the structures are not well-predicted. The low computational cost of GPSite enables rapid genome-scale binding residue annotations for over 568,000 sequences, providing opportunities to unveil unexplored associations of binding sites with molecular functions, biological processes, and genetic variants. The GPSite webserver and annotation database can be freely accessed at https://bio-web1.nscc-gz.cn/app/GPSite.

Citing Articles

Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences.

Basu S, Yu J, Kihara D, Kurgan L Brief Bioinform. 2025; 26(1).

PMID: 39833102 PMC: 11745544. DOI: 10.1093/bib/bbaf016.


Accurately predicting optimal conditions for microorganism proteins through geometric graph learning and language model.

Zhu M, Song Y, Yuan Q, Yang Y Commun Biol. 2024; 7(1):1709.

PMID: 39739114 PMC: 11683147. DOI: 10.1038/s42003-024-07436-3.


Geometric deep learning improves generalizability of MHC-bound peptide predictions.

Marzella D, Crocioni G, Radusinovic T, Lepikhov D, Severin H, Bodor D Commun Biol. 2024; 7(1):1661.

PMID: 39702482 PMC: 11659464. DOI: 10.1038/s42003-024-07292-1.


The ubiquitous pyridoxal 5'-phosphate-binding protein is also an RNA-binding protein.

Graziani C, Barile A, Parroni A, di Salvo M, De Cecio I, Colombo T Protein Sci. 2024; 33(12):e5242.

PMID: 39604152 PMC: 11602438. DOI: 10.1002/pro.5242.


AI-integrated network for RNA complex structure and dynamic prediction.

Liu H, Zhuo C, Gao J, Zeng C, Zhao Y Biophys Rev (Melville). 2024; 5(4):041304.

PMID: 39512332 PMC: 11540444. DOI: 10.1063/5.0237319.

