Genotype Sampling for Deep-learning Assisted Experimental Mapping of a Combinatorially Complete Fitness Landscape

Overview
Journal Bioinformatics
Specialty Biology
Date 2024 May 15
PMID 38745436
Abstract

Motivation: Experimental characterization of fitness landscapes, which map genotypes onto fitness, is important for both evolutionary biology and protein engineering. It faces a fundamental obstacle in the astronomical number of genotypes whose fitness needs to be measured for any one protein. Deep learning may help to predict the fitness of many genotypes from a smaller neural network training sample of genotypes with experimentally measured fitness. Here I use a recently published experimentally mapped fitness landscape of more than 260 000 protein genotypes to ask how such sampling is best performed.

Results: I show that multilayer perceptrons, recurrent neural networks, convolutional networks, and transformers can explain more than 90% of fitness variance in the data. In addition, 90% of this performance is reached with a training sample comprising merely ≈10^3 sequences. Generalization to unseen test data is best when training data is sampled randomly and uniformly, or sampled to minimize the number of synonymous sequences. In contrast, sampling to maximize sequence diversity or codon usage bias reduces performance substantially. These observations hold for more than one network architecture. Simple sampling strategies may perform best when training deep learning neural networks to map fitness landscapes from experimental data.
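The two best-performing sampling strategies named above can be illustrated with a minimal Python sketch. It is not taken from the author's repository. It assumes that genotypes are nucleotide strings whose length is a multiple of three and that they are translated with the standard genetic code. The function names (`translate`, `sample_uniform`, `sample_few_synonymous`) are hypothetical, and the greedy de-duplication is only one possible reading of "minimize the number of synonymous sequences".

```python
import itertools
import random

# Standard genetic code in the usual TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide string (length a multiple of 3) into amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def sample_uniform(genotypes, n, rng):
    """Draw n genotypes uniformly at random, without replacement."""
    return rng.sample(genotypes, n)


def sample_few_synonymous(genotypes, n, rng):
    """Assumed heuristic: visit genotypes in random order and prefer those
    whose translation has not been seen yet; synonymous sequences are used
    only to fill the sample if distinct translations run out."""
    shuffled = rng.sample(genotypes, len(genotypes))
    seen, chosen, leftovers = set(), [], []
    for genotype in shuffled:
        protein = translate(genotype)
        if protein in seen:
            leftovers.append(genotype)
        else:
            seen.add(protein)
            chosen.append(genotype)
    return (chosen + leftovers)[:n]
```

Either function takes a list of genotype strings, a sample size, and a `random.Random` instance, so that a training sample is reproducible for a fixed seed. The remaining genotypes can then serve as held-out test data.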

Availability And Implementation: The fitness landscape data analyzed here is publicly available as described previously (Papkou et al. 2023). All code used to analyze this landscape is publicly available at https://github.com/andreas-wagner-uzh/fitness_landscape_sampling.

Citing Articles

Massively parallel experimental interrogation of natural variants in ancient signaling pathways reveals both purifying selection and local adaptation.

Aguilar-Rodriguez J, Vila J, Chen S, Razo-Mejia M, Ghosh O, Fraser H. bioRxiv. 2024.

PMID: 39553990 PMC: 11565963. DOI: 10.1101/2024.10.30.621178.
