FLIGHTED: Inferring Fitness Landscapes from Noisy High-Throughput Experimental Data

Overview
Journal bioRxiv
Date 2024 Apr 8
PMID 38586054
Abstract

Machine learning (ML) for protein design requires large protein fitness datasets generated by high-throughput experiments for training, fine-tuning, and benchmarking models. However, most models do not account for experimental noise inherent in these datasets, harming model performance and changing model rankings in benchmarking studies. Here we develop FLIGHTED, a Bayesian method of accounting for uncertainty by generating probabilistic fitness landscapes from noisy high-throughput experiments. We demonstrate how FLIGHTED can improve model performance on two categories of experiments: single-step selection assays, such as phage display and SELEX, and a novel high-throughput assay called DHARMA that ties activity to base editing. We then compare the performance of standard machine-learning models on fitness landscapes generated with and without FLIGHTED. Accounting for noise significantly improves model performance, especially of CNN architectures, and changes relative rankings on numerous common benchmarks. Based on our new benchmarking with FLIGHTED, data size, not model scale, currently appears to be limiting the performance of protein fitness models, and the choice of top model architecture matters more than the protein language model embedding. Collectively, our results indicate that FLIGHTED can be applied to any high-throughput assay and any machine learning model, making it straightforward for protein designers to account for experimental noise when modeling protein fitness.
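To make concrete why noise in a single-step selection assay depends on sequencing depth, here is a minimal sketch. It is not the FLIGHTED model, which the abstract describes only as Bayesian. It assumes Poisson-distributed read counts, treats the library totals as fixed, and uses the standard delta-method approximation Var[log n] ≈ 1/n. The function name, the pseudocount, and the example counts are all hypothetical.

```python
import math


def log_enrichment(pre_count, post_count, pre_total, post_total, pseudocount=0.5):
    """Return (log enrichment ratio, approximate standard deviation).

    Illustrative only: assumes Poisson read counts and fixed library totals.
    """
    # A pseudocount keeps the logarithm finite when a variant has zero reads.
    pre = pre_count + pseudocount
    post = post_count + pseudocount
    # Log ratio of the variant's frequency after selection to its frequency before.
    log_ratio = math.log(post / post_total) - math.log(pre / pre_total)
    # Delta method: Var[log n] is roughly 1/n for a Poisson count n,
    # and the two sequencing runs are treated as independent.
    std = math.sqrt(1.0 / pre + 1.0 / post)
    return log_ratio, std


# Two hypothetical variants with the same twofold enrichment but different read depth.
low_depth = log_enrichment(5, 10, 1_000_000, 1_000_000)
high_depth = log_enrichment(500, 1000, 1_000_000, 1_000_000)

print(f"low depth:  log ratio = {low_depth[0]:.3f}, std = {low_depth[1]:.3f}")
print(f"high depth: log ratio = {high_depth[0]:.3f}, std = {high_depth[1]:.3f}")
```

Both variants show roughly the same enrichment, but the low-depth estimate carries a much larger standard deviation. A model trained on the point estimates alone would treat the two measurements as equally reliable, which is the kind of problem a probabilistic fitness landscape is meant to address.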
