
Large Scale Paired Antibody Language Models

Overview
Specialty: Biology
Date: 2024 Dec 6
PMID: 39642174
Abstract

Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best-performing antibody-specific language models developed to date, which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large-scale datasets and high-performance computing for enhancing antibody design for therapeutic development.

Citing Articles

A curriculum learning approach to training antibody language models.

Burbach S, Briney B. bioRxiv. 2025.

PMID: 40060663 PMC: 11888446. DOI: 10.1101/2025.02.27.640641.


Quantifying antibody binding: techniques and therapeutic implications.

Lodge J, Kajtar L, Duxbury R, Hall D, Burley G, Cordy J. MAbs. 2025; 17(1):2459795.

PMID: 39957177 PMC: 11834528. DOI: 10.1080/19420862.2025.2459795.


Self-supervised machine learning methods for protein design improve sampling but not the identification of high-fitness variants.

Ertelt M, Moretti R, Meiler J, Schoeder C. Sci Adv. 2025; 11(7):eadr7338.

PMID: 39937901 PMC: 11817935. DOI: 10.1126/sciadv.adr7338.


Multi-Modal CLIP-Informed Protein Editing.

Yin M, Zhou H, Zhu Y, Lin M, Wu Y, Wu J. Health Data Sci. 2024; 4:0211.

PMID: 39703565 PMC: 11658819. DOI: 10.34133/hds.0211.


Large scale paired antibody language models.

Kenlay H, Dreyer F, Kovaltsuk A, Miketa D, Pires D, Deane C. PLoS Comput Biol. 2024; 20(12):e1012646.

PMID: 39642174 PMC: 11654935. DOI: 10.1371/journal.pcbi.1012646.
