
Integrating Genetic Algorithms and Language Models for Enhanced Enzyme Design

Overview
Journal Brief Bioinform
Date 2025 Jan 9
PMID 39780486
Abstract

Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Designing them is challenging because of the complexity of the protein space and the intricate relationships between sequence, structure, and function. Recently, large language models (LLMs) have emerged as powerful tools for modeling and analyzing biological sequences, but their application to protein design is limited by the high cardinality of the protein space. This study introduces a framework that combines LLMs with genetic algorithms (GAs) to optimize enzymes. LLMs are trained on a large dataset of protein sequences to learn relationships between amino acid residues linked to structure and function. This knowledge is then leveraged by GAs to efficiently search for sequences with improved catalytic performance. We focused on two optimization tasks: improving the feasibility of biochemical reactions and increasing their turnover rate. Systematic evaluations on 105 biocatalytic reactions demonstrated that the LLM-GA framework generated mutants outperforming the wild-type enzymes in terms of feasibility in 90% of the instances. A further in-depth evaluation of seven reactions showed that the methodology captures "the best of both worlds", creating mutants with structural features and flexibility comparable to those of the wild types. Our approach advances the state of the art in the computational design of biocatalysts, ultimately opening opportunities for more sustainable chemical processes.
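
The search loop the abstract describes can be illustrated with a minimal genetic algorithm over amino acid sequences. This is only a sketch under stated assumptions, not the authors' implementation: the abstract gives no details on operators, population sizes, or how the LLM is queried. The names `evolve`, `mutate`, `crossover`, and `toy_score` are hypothetical, and `toy_score` (fraction of hydrophobic residues) is a placeholder standing in for whatever LLM-derived feasibility or turnover score the framework actually uses.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(seq, rate, rng):
    """Substitute each residue independently with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def crossover(a, b, rng):
    """Single-point crossover of two equal-length sequences."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(wild_type, score, generations=30, pop_size=40, elite=4,
           rate=0.05, seed=0):
    """Toy GA: `score` is any callable mapping a sequence to a float.

    In an LLM-GA setup, `score` would wrap a trained model (assumption);
    here it is left pluggable so the loop itself stays self-contained.
    """
    rng = random.Random(seed)
    # Start from the wild type plus randomly mutated copies of it.
    population = [wild_type] + [
        mutate(wild_type, rate, rng) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   rate, rng)
            for _ in range(pop_size - elite)
        ]
        # Elitism: the top `elite` sequences survive unchanged.
        population = ranked[:elite] + children
    return max(population, key=score)


def toy_score(seq):
    """Placeholder fitness (NOT from the paper): hydrophobic fraction."""
    return sum(aa in "AILMFVW" for aa in seq) / len(seq)


wild_type = "MKTAYIAKQR"  # arbitrary example sequence, not a real enzyme
best = evolve(wild_type, toy_score)
```

Because the wild type is part of the initial population and the top-ranked sequences are carried over unchanged each generation, the returned sequence never scores below the wild type under the chosen `score`. Swapping `toy_score` for a model-based scorer is the only change needed to turn the loop toward a learned objective.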
