ProGen2: Exploring the Boundaries of Protein Language Models
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant to artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data contribute to effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open-sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
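To illustrate what predicting fitness "without additional fine-tuning" can look like in practice, here is a minimal sketch. It assumes that fitness is approximated by the log-likelihood an autoregressive model assigns to a sequence, which is a common zero-shot approach; the abstract does not spell out the scoring method. The BigramModel class, the rank_variants helper, and the example sequences are all hypothetical stand-ins for illustration. They are not ProGen2 or its API, and a bigram counter is used only so the example runs without any model weights.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


class BigramModel:
    """Toy autoregressive model: P(residue | previous residue).

    Hypothetical stand-in for a large protein language model.
    """

    def __init__(self, sequences, pseudocount=1.0):
        self.pseudocount = pseudocount
        self.counts = defaultdict(Counter)
        for seq in sequences:
            prev = "^"  # start-of-sequence marker
            for aa in seq:
                self.counts[prev][aa] += 1
                prev = aa

    def log_prob(self, prev, aa):
        # Additive smoothing so unseen transitions keep nonzero probability.
        ctx = self.counts[prev]
        total = sum(ctx.values()) + self.pseudocount * len(AMINO_ACIDS)
        return math.log((ctx[aa] + self.pseudocount) / total)

    def log_likelihood(self, seq):
        # Sum of per-residue conditional log-probabilities, left to right.
        prev, total = "^", 0.0
        for aa in seq:
            total += self.log_prob(prev, aa)
            prev = aa
        return total


def rank_variants(model, variants):
    """Order variants from most to least likely under the model."""
    return sorted(variants, key=model.log_likelihood, reverse=True)


# Made-up "evolutionary" sequences used only to fit the toy model.
training = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQR"]
model = BigramModel(training)

# Candidate variants to score; no labels or fine-tuning are involved.
variants = ["MKTAYIAKQR", "MKTWYIAKQR", "WWWWWWWWWW"]
ranked = rank_variants(model, variants)
```

In this sketch the variant that resembles the training sequences ranks first and the all-tryptophan sequence ranks last. With a real model, the per-residue log-probabilities would come from the network's output distribution rather than from transition counts.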
Seq2Topt: a sequence-based deep learning predictor of enzyme optimal temperature.
Qiu S, Hu B, Zhao J, Xu W, Yang A Brief Bioinform. 2025; 26(2).
PMID: 40079266 PMC: 11904407. DOI: 10.1093/bib/bbaf114.
Generation of antigen-specific paired chain antibody sequences using large language models.
Wasdin P, Johnson N, Janke A, Held S, Marinov T, Jordaan G bioRxiv. 2025; .
PMID: 40027781 PMC: 11870394. DOI: 10.1101/2024.12.20.629482.
GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies.
Zhou Z, Riley R, Kautsar S, Wu W, Egan R, Hofmeyr S bioRxiv. 2025; .
PMID: 39975405 PMC: 11838515. DOI: 10.1101/2025.01.30.635558.
Leveraging large language models for peptide antibiotic design.
Guan C, Fernandes F, Franco O, de la Fuente-Nunez C Cell Rep Phys Sci. 2025; 6(1).
PMID: 39949833 PMC: 11823563. DOI: 10.1016/j.xcrp.2024.102359.
Chen J, Wang J, Hu Y, Li X, Qian Y, Song C Front Bioeng Biotechnol. 2025; 13:1506508.
PMID: 39906415 PMC: 11790633. DOI: 10.3389/fbioe.2025.1506508.