Improving the Generalization of Protein Expression Models with Mechanistic Sequence Information

Overview
Date 2025 Jan 28
PMID 39873269
Abstract

The growing demand for biological products drives many efforts to maximize expression of heterologous proteins. Advances in high-throughput sequencing can produce data suitable for building sequence-to-expression models with machine learning. The most accurate models have been trained on one-hot encodings, a mechanism-agnostic representation of nucleotide sequences. Moreover, studies have consistently shown that training on mechanistic sequence features leads to much poorer predictions, even with features that are known to correlate with expression, such as DNA sequence motifs, codon usage, or properties of mRNA secondary structures. However, despite their excellent local accuracy, current sequence-to-expression models can fail to generalize to sequences far from the training data. Through a comparative study across datasets in Escherichia coli and Saccharomyces cerevisiae, here we show that mechanistic sequence features can provide gains in model generalization, and thus improve the utility of these models for predictive sequence design. We explore several strategies to integrate one-hot encodings and mechanistic features into a single predictive model: feature stacking, ensemble model stacking, and geometric stacking, a novel architecture based on graph convolutional neural networks. Our work casts new light on mechanistic sequence features, underscoring the importance of domain knowledge and feature engineering for accurate prediction of protein expression levels.
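As a rough illustration of the simplest integration strategy named in the abstract, feature stacking can be read as concatenating a one-hot encoding with mechanistic features into a single input vector. The sketch below is not the authors' implementation: the function names are hypothetical, and GC content stands in as an assumed example of a mechanistic feature, since the abstract does not list the exact features used.

```python
# Illustrative sketch only. The function names and the choice of GC content
# as the mechanistic feature are assumptions, not the paper's feature set.

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Flattened one-hot encoding: four binary values per nucleotide.

    Bases outside A/C/G/T (e.g. N) are encoded as all zeros.
    """
    encoding = []
    for base in seq.upper():
        encoding.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return encoding


def gc_content(seq):
    """Fraction of G or C bases, used here as a simple mechanistic feature."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def stacked_features(seq):
    """Feature stacking: one-hot encoding followed by mechanistic features."""
    return one_hot(seq) + [gc_content(seq)]


features = stacked_features("ATGC")
print(len(features))  # 4 bases * 4 channels + 1 mechanistic feature = 17
```

The resulting vector could be passed to any regression model. Ensemble model stacking would instead train separate models on each representation and combine their predictions, which this sketch does not cover.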
