
Biophysics-based Protein Language Models for Protein Engineering

Overview
Journal: bioRxiv
Date: 2024 Apr 1
PMID: 38559182
Abstract

Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
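The abstract mentions position extrapolation without defining it. The sketch below illustrates one common reading, assumed here rather than taken from the text: the model is tested on variants whose mutated positions never appear in the training set. The variant notation (`A23C,T40S`), the function names, and the 80/20 position split are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def parse_positions(variant):
    """Return the set of mutated positions in a variant such as 'A23C,T40S'.

    Assumed notation: wild-type residue, position, mutant residue.
    """
    return {int(mutation[1:-1]) for mutation in variant.split(",")}


def position_extrapolation_split(variants, train_fraction=0.8, seed=0):
    """Split variants so train and test share no mutated positions.

    Positions are randomly assigned to a train pool or a test pool.
    A variant goes to train if all of its positions are in the train
    pool, to test if all are in the test pool, and is discarded if it
    mixes positions from both pools.
    """
    positions = sorted(set().union(*(parse_positions(v) for v in variants)))
    rng = random.Random(seed)
    rng.shuffle(positions)
    cutoff = int(len(positions) * train_fraction)
    train_positions = set(positions[:cutoff])

    train, test = [], []
    for variant in variants:
        mutated = parse_positions(variant)
        if mutated <= train_positions:
            train.append(variant)
        elif mutated.isdisjoint(train_positions):
            test.append(variant)
    return train, test


if __name__ == "__main__":
    # Hypothetical variants, not data from the paper.
    example = ["A1C", "A1D,T2S", "T2S", "G3A", "G3A,K4R", "K4R,A1C"]
    train_set, test_set = position_extrapolation_split(example, train_fraction=0.5)
    print("train:", train_set)
    print("test:", test_set)
```

Discarding variants that mix train and test positions keeps the two sets strictly disjoint by position, at the cost of losing some data; a real benchmark may handle such variants differently.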
