
Pairing Interacting Protein Sequences Using Masked Language Modeling

Overview
Specialty Science
Date 2024 Jun 24
PMID 38913900
Abstract

Predicting which proteins interact from their amino acid sequences is an important task. We develop a method to pair interacting protein sequences that leverages protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce Differentiable Pairing using Alignment-based Language Models (DiffPALM), a method that solves this problem by exploiting the ability of MSA Transformer to fill in masked amino acids in MSAs using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow MSAs extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods that predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves AlphaFold-Multimer's structure prediction for some eukaryotic protein complexes. DiffPALM also performs competitively with orthology-based pairing.
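
The abstract does not say how the pairing problem is made differentiable. As a rough illustration only, the sketch below uses one standard relaxation, Sinkhorn normalization, which turns a square matrix of real-valued scores into an approximately doubly stochastic "soft permutation" that gradients can flow through. This is an assumed stand-in, not a description of the DiffPALM implementation: the function names, the `temperature` parameter, and the toy score matrix are all hypothetical. In a method of the kind described, the scores would be adjusted by gradient descent so that the paired alignment lowers a masked language modeling loss; here they are simply fixed by hand.

```python
import math


def _logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Relax a square score matrix into a near doubly stochastic matrix.

    Alternates row and column normalization in log space. Lower
    temperatures push the result closer to a hard permutation.
    (Illustrative assumption, not the published method.)
    """
    n = len(scores)
    log_p = [[s / temperature for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize every row so that it sums to 1.
        for i in range(n):
            lse = _logsumexp(log_p[i])
            log_p[i] = [v - lse for v in log_p[i]]
        # Normalize every column so that it sums to 1.
        for j in range(n):
            lse = _logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= lse
    return [[math.exp(v) for v in row] for row in log_p]


def greedy_pairing(soft):
    """Round a soft permutation to a one-to-one pairing {row: column}.

    Takes entries from largest to smallest, skipping any row or column
    that has already been assigned.
    """
    n = len(soft)
    entries = sorted(
        ((soft[i][j], i, j) for i in range(n) for j in range(n)),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), {}
    for _, i, j in entries:
        if i not in used_rows and j not in used_cols:
            pairs[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return pairs


# Toy scores for three paralogs of family A (rows) against three
# paralogs of family B (columns) within one species. Made-up values.
toy_scores = [
    [5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 5.0, 0.0],
]
soft = sinkhorn(toy_scores, temperature=0.5)
pairs = greedy_pairing(soft)
```

In this toy case, `pairs` maps each row index (a paralog from the first family) to the column index (a paralog from the second family) with the highest relaxed score. Greedy rounding is used only to keep the sketch dependency-free; an exact assignment solver would be the more usual choice.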

Citing Articles

Decoding the functional impact of the cancer genome through protein-protein interactions.

Fu H, Mo X, Ivanov A. Nat Rev Cancer. 2025; 25(3):189-208.

PMID: 39810024 DOI: 10.1038/s41568-024-00784-6.


The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction.

Zhang C, Wang Q, Li Y, Teng A, Hu G, Wuyun Q. Biomolecules. 2025; 14(12).

PMID: 39766238 PMC: 11673352. DOI: 10.3390/biom14121531.


DiffPaSS-high-performance differentiable pairing of protein sequences using soft scores.

Lupo U, Sgarbossa D, Milighetti M, Bitbol A. Bioinformatics. 2024; 41(1).

PMID: 39672677 PMC: 11676329. DOI: 10.1093/bioinformatics/btae738.


Machine learning meets physics: A two-way street.

Levine H, Tu Y. Proc Natl Acad Sci U S A. 2024; 121(27):e2403580121.

PMID: 38913898 PMC: 11228530. DOI: 10.1073/pnas.2403580121.


Genomic language model predicts protein co-regulation and function.

Hwang Y, Cornman A, Kellogg E, Ovchinnikov S, Girguis P. Nat Commun. 2024; 15(1):2880.

PMID: 38570504 PMC: 10991518. DOI: 10.1038/s41467-024-46947-9.
