
Contrastive Learning on Protein Embeddings Enlightens Midnight Zone

Overview
Specialty Biology
Date 2022 Jun 15
PMID 35702380
Abstract

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), which transfers information from a protein with known annotation to a query without any annotation. A recent alternative expands HBI from sequence-distance lookup to embedding-based annotation transfer (EAT), where the embeddings are derived from protein Language Models (pLMs). Here, we introduce contrastive learning on single-protein representations from pLMs. This learning procedure creates a new set of embeddings that optimizes constraints captured by the hierarchical classification of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. These embeddings thus allow sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
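The two ideas in the abstract — contrastive learning that pulls embeddings of structurally related proteins together, and EAT, which transfers an annotation from the nearest database embedding — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy 2-dimensional vectors and the CATH-like labels are invented for the example (real pLM embeddings have on the order of 1024 dimensions), and the triplet margin loss shown here is one common contrastive objective assumed for illustration.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Contrastive (triplet) objective: same-class embeddings are pulled
    together, different-class embeddings pushed apart by >= `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor vs. same CATH class
    d_neg = np.linalg.norm(anchor - negative)  # anchor vs. different class
    return max(0.0, d_pos - d_neg + margin)

def eat_transfer(query_emb, db_embs, db_labels):
    """EAT: annotate the query with the label of its nearest
    database embedding (Euclidean distance, no alignment needed)."""
    dists = np.linalg.norm(db_embs - query_emb, axis=1)
    return db_labels[int(np.argmin(dists))]

# Toy per-protein embeddings; labels are hypothetical CATH codes.
anchor   = np.array([1.0, 0.0])   # query protein
positive = np.array([0.9, 0.1])   # same structural class as anchor
negative = np.array([-1.0, 0.0])  # different structural class

loss = triplet_margin_loss(anchor, positive, negative)  # 0.0: triplet satisfied
label = eat_transfer(anchor, np.stack([positive, negative]), ["1.10", "2.40"])
```

Because the lookup is a plain nearest-neighbor search over fixed-length vectors rather than an alignment, this is where the speed advantage quoted in the abstract comes from.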

Citing Articles

Learning maximally spanning representations improves protein function annotation.

Luo J, Luo Y. bioRxiv. 2025.

PMID: 40027840 PMC: 11870436. DOI: 10.1101/2025.02.13.638156.


PIPENN-EMB ensemble net and protein embeddings generalise protein interface prediction beyond homology.

Thomas D, Garcia Fernandez C, Haydarlou R, Feenstra K. Sci Rep. 2025; 15(1):4391.

PMID: 39910126 PMC: 11799512. DOI: 10.1038/s41598-025-88445-y.


Accurate Predictions of Molecular Properties of Proteins via Graph Neural Networks and Transfer Learning.

Wozniak S, Janson G, Feig M. bioRxiv. 2024.

PMID: 39713395 PMC: 11661272. DOI: 10.1101/2024.12.10.627714.


Bilingual language model for protein sequence and structure.

Heinzinger M, Weissenow K, Sanchez J, Henkel A, Mirdita M, Steinegger M. NAR Genom Bioinform. 2024; 6(4):lqae150.

PMID: 39633723 PMC: 11616678. DOI: 10.1093/nargab/lqae150.


AlphaFold Meets De Novo Drug Design: Leveraging Structural Protein Information in Multitarget Molecular Generative Models.

Bernatavicius A, Sicho M, Janssen A, Hassen A, Preuss M, van Westen G. J Chem Inf Model. 2024; 64(21):8113-8122.

PMID: 39475544 PMC: 11558674. DOI: 10.1021/acs.jcim.4c00309.

