PMID: 38951026

CodonBERT Large Language Model for mRNA Vaccines

Abstract

mRNA-based vaccines and therapeutics are increasingly used across a wide range of conditions. A critical issue in designing such mRNAs is sequence optimization: even small proteins or peptides can be encoded by an enormous number of distinct mRNAs. The choice of mRNA sequence can strongly affect several properties, including expression, stability, and immunogenicity. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained on more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts and can be extended to predict various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.
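The size of the design space and the idea of codon-level input can be illustrated with a minimal sketch. It assumes the standard genetic code and simple non-overlapping codon splitting. The function names `count_synonymous_mrnas` and `codon_tokens` are hypothetical and are not taken from the CodonBERT code base, whose actual tokenizer (special tokens, vocabulary handling) is not described in this abstract.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, laid out in the conventional U, C, A, G order
# for the first, second and third base ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons available for each amino acid.
DEGENERACY = Counter(CODON_TABLE.values())


def count_synonymous_mrnas(peptide: str) -> int:
    """Count the coding sequences that translate to `peptide` (stop codon excluded)."""
    return prod(DEGENERACY[aa] for aa in peptide)


def codon_tokens(mrna: str) -> list[str]:
    """Split a coding sequence into non-overlapping codon tokens (illustrative only)."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    if len(mrna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [mrna[i:i + 3] for i in range(0, len(mrna), 3)]


if __name__ == "__main__":
    print(count_synonymous_mrnas("MLSR"))   # Met has 1 codon; Leu, Ser, Arg have 6 each
    print(codon_tokens("AUGGCCUAA"))
```

Because the count is a product of per-residue degeneracies, it grows exponentially with peptide length, which is why exhaustive enumeration is impractical and a learned model over codon tokens is attractive.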

Citing Articles

PlasGO: enhancing GO-based function prediction for plasmid-encoded proteins based on genetic structure.

Ji Y, Shang J, Guan J, Zou W, Liao H, Tang X. Gigascience. 2024; 13.

PMID: 39704702 PMC: 11659980. DOI: 10.1093/gigascience/giae104.


Advanced technologies for the development of infectious disease vaccines.

Gupta A, Rudra A, Reed K, Langer R, Anderson D. Nat Rev Drug Discov. 2024; 23(12):914-938.

PMID: 39433939 DOI: 10.1038/s41573-024-01041-z.


A Suite of Foundation Models Captures the Contextual Interplay Between Codons.

Naghipourfar M, Chen S, Howard M, Macdonald C, Saberi A, Hagen T. bioRxiv. 2024.

PMID: 39416097 PMC: 11482952. DOI: 10.1101/2024.10.10.617568.


Are genomic language models all you need? Exploring genomic language models on protein downstream tasks.

Boshar S, Trop E, de Almeida B, Copoiu L, Pierrot T. Bioinformatics. 2024; 40(9).

PMID: 39212609 PMC: 11399231. DOI: 10.1093/bioinformatics/btae529.


Predicting the translation efficiency of messenger RNA in mammalian cells.

Zheng D, Persyn L, Wang J, Liu Y, Montoya F, Cenik C. bioRxiv. 2024.

PMID: 39149337 PMC: 11326250. DOI: 10.1101/2024.08.11.607362.
