
Contrastive Learning on Protein Embeddings Enlightens Midnight Zone

Overview
Specialty Biology
Date 2022 Jun 15
PMID 35702380
Abstract

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), which transfers information from a protein with known annotation to a query without any annotation. A recent alternative expands HBI from sequence-distance lookup to embedding-based annotation transfer (EAT), where the embeddings are derived from protein Language Models (pLMs). Here, we introduce contrastive learning on single-protein representations from pLMs. This learning procedure creates a new set of embeddings that optimizes constraints captured by the hierarchical classification of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. These embeddings thus allow sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
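The two ideas in the abstract — contrastive learning that pulls embeddings of structurally related proteins together, and EAT, which transfers an annotation from the nearest database embedding — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy 2-dimensional vectors and the CATH-like labels are invented for the example (real pLM embeddings have on the order of 1024 dimensions), and the triplet margin loss shown here is one common contrastive objective assumed for illustration.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Contrastive (triplet) objective: same-class embeddings are pulled
    together, different-class embeddings pushed apart by >= `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor vs. same CATH class
    d_neg = np.linalg.norm(anchor - negative)  # anchor vs. different class
    return max(0.0, d_pos - d_neg + margin)

def eat_transfer(query_emb, db_embs, db_labels):
    """EAT: annotate the query with the label of its nearest
    database embedding (Euclidean distance, no alignment needed)."""
    dists = np.linalg.norm(db_embs - query_emb, axis=1)
    return db_labels[int(np.argmin(dists))]

# Toy per-protein embeddings; labels are hypothetical CATH codes.
anchor   = np.array([1.0, 0.0])   # query protein
positive = np.array([0.9, 0.1])   # same structural class as anchor
negative = np.array([-1.0, 0.0])  # different structural class

loss = triplet_margin_loss(anchor, positive, negative)  # 0.0: triplet satisfied
label = eat_transfer(anchor, np.stack([positive, negative]), ["1.10", "2.40"])
```

Because the lookup is a plain nearest-neighbor search over fixed-length vectors rather than an alignment, this is where the speed advantage quoted in the abstract comes from.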

Citing Articles

Learning maximally spanning representations improves protein function annotation.

Luo J, Luo Y. bioRxiv. 2025.

PMID: 40027840 PMC: 11870436. DOI: 10.1101/2025.02.13.638156.


PIPENN-EMB ensemble net and protein embeddings generalise protein interface prediction beyond homology.

Thomas D, Garcia Fernandez C, Haydarlou R, Feenstra K. Sci Rep. 2025; 15(1):4391.

PMID: 39910126 PMC: 11799512. DOI: 10.1038/s41598-025-88445-y.


Accurate Predictions of Molecular Properties of Proteins via Graph Neural Networks and Transfer Learning.

Wozniak S, Janson G, Feig M. bioRxiv. 2024.

PMID: 39713395 PMC: 11661272. DOI: 10.1101/2024.12.10.627714.


Bilingual language model for protein sequence and structure.

Heinzinger M, Weissenow K, Sanchez J, Henkel A, Mirdita M, Steinegger M. NAR Genom Bioinform. 2024; 6(4):lqae150.

PMID: 39633723 PMC: 11616678. DOI: 10.1093/nargab/lqae150.


AlphaFold Meets De Novo Drug Design: Leveraging Structural Protein Information in Multitarget Molecular Generative Models.

Bernatavicius A, Sicho M, Janssen A, Hassen A, Preuss M, van Westen G. J Chem Inf Model. 2024; 64(21):8113-8122.

PMID: 39475544 PMC: 11558674. DOI: 10.1021/acs.jcim.4c00309.

