Pairing Interacting Protein Sequences Using Masked Language Modeling

Overview

Journal Proc Natl Acad Sci U S A

Specialty Science

Date 2024 Jun 24

PMID 38913900

Authors

Umberto Lupo

Damiano Sgarbossa

Anne-Florence Bitbol


Abstract

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. DiffPALM also performs competitively with orthology-based pairing.
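To illustrate what pairing paralogs "in a differentiable way" can mean, here is a minimal sketch. The abstract does not say how the relaxation is carried out; the sketch assumes one common choice, Sinkhorn normalization, which turns a matrix of pairing scores into a soft (approximately doubly stochastic) permutation. The names `sinkhorn`, `best_permutation`, and `toy_scores` are hypothetical, and the scores are made-up numbers standing in for a masked-language-modeling objective, not MSA Transformer output.

```python
import math
from itertools import permutations


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, temperature=1.0, n_iters=50):
    """Relax a square score matrix into a soft permutation matrix.

    Rows and columns are alternately normalized in log space, so the
    result is approximately doubly stochastic. Lower temperatures push
    it closer to a hard permutation.
    """
    n = len(scores)
    log_p = [[s / temperature for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        log_p = [[v - logsumexp(row) for v in row] for row in log_p]
        # Column normalization: each column sums to 1.
        col_lse = [logsumexp([log_p[i][j] for i in range(n)]) for j in range(n)]
        log_p = [[log_p[i][j] - col_lse[j] for j in range(n)] for i in range(n)]
    return [[math.exp(v) for v in row] for row in log_p]


def best_permutation(scores):
    """Brute-force the hard pairing with the highest total score (small n only)."""
    n = len(scores)
    return max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )


# Toy example: entry [i][j] scores pairing paralog i of family A
# with paralog j of family B (illustrative numbers only).
toy_scores = [
    [2.0, 0.1, 0.3],
    [0.2, 0.1, 1.5],
    [0.4, 1.8, 0.2],
]

soft = sinkhorn(toy_scores, temperature=0.1)
hard = best_permutation(toy_scores)
soft_argmax = tuple(max(range(3), key=lambda j: row[j]) for row in soft)
```

Because every step of `sinkhorn` is smooth, a loss computed from the soft pairing could in principle be optimized by gradient descent with respect to the scores. The brute-force `best_permutation` serves only to check the relaxed result on this toy case.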

Citing Articles

Decoding the functional impact of the cancer genome through protein-protein interactions.

Fu H, Mo X, Ivanov A Nat Rev Cancer. 2025; 25(3):189-208.

PMID: 39810024 DOI: 10.1038/s41568-024-00784-6.

The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction.

Zhang C, Wang Q, Li Y, Teng A, Hu G, Wuyun Q Biomolecules. 2025; 14(12).

PMID: 39766238 PMC: 11673352. DOI: 10.3390/biom14121531.

DiffPaSS-high-performance differentiable pairing of protein sequences using soft scores.

Lupo U, Sgarbossa D, Milighetti M, Bitbol A Bioinformatics. 2024; 41(1).

PMID: 39672677 PMC: 11676329. DOI: 10.1093/bioinformatics/btae738.

Machine learning meets physics: A two-way street.

Levine H, Tu Y Proc Natl Acad Sci U S A. 2024; 121(27):e2403580121.

PMID: 38913898 PMC: 11228530. DOI: 10.1073/pnas.2403580121.

Genomic language model predicts protein co-regulation and function.

Hwang Y, Cornman A, Kellogg E, Ovchinnikov S, Girguis P Nat Commun. 2024; 15(1):2880.

PMID: 38570504 PMC: 10991518. DOI: 10.1038/s41467-024-46947-9.
