AbLang: an Antibody Language Model for Completing Antibody Sequences

Overview

Journal Bioinform Adv

Publisher Oxford University Press

Specialty Biology

Date 2023 Jan 26

PMID 36699403

Authors

Tobias H Olsen

Iain H Moal

Charlotte M Deane

Abstract

Motivation: General protein language models have been shown to summarize the semantics of protein sequences into representations that are useful for state-of-the-art predictive methods. However, for antibody-specific problems, such as restoring residues lost due to sequencing errors, a model trained solely on antibodies may be more powerful. Antibodies are one of the few protein types for which the volume of sequence data needed for such language models is available, e.g. in the Observed Antibody Space (OAS) database.

Results: Here, we introduce AbLang, a language model trained on the antibody sequences in the OAS database. We demonstrate the power of AbLang by using it to restore missing residues in antibody sequence data, a key issue in B-cell receptor repertoire sequencing: for example, over 40% of OAS sequences are missing the first 15 amino acids. AbLang restores the missing residues of antibody sequences better than using IMGT germlines or the general protein language model ESM-1b. Further, AbLang does not require knowledge of the germline of the antibody and is seven times faster than ESM-1b.
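To make the restoration task concrete, here is a minimal sketch of a template-fill baseline, similar in spirit to the germline comparison mentioned above. It is not AbLang and not the authors' benchmark code. The `*` mask character, the function names, and the two short sequences are assumptions made for illustration only; the sequences are toy strings, not curated IMGT germlines, and they are assumed to be pre-aligned to equal length.

```python
MASK = "*"  # assumed placeholder for an unknown residue (hypothetical convention)


def mask_n_terminus(seq: str, n: int = 15) -> str:
    """Simulate a read that is missing its first n residues."""
    n = min(n, len(seq))
    return MASK * n + seq[n:]


def restore_from_template(masked: str, template: str) -> str:
    """Fill each masked position with the template residue at the same index."""
    if len(masked) != len(template):
        raise ValueError("sequence and template must be pre-aligned to equal length")
    return "".join(t if m == MASK else m for m, t in zip(masked, template))


def restoration_accuracy(restored: str, truth: str, masked: str) -> float:
    """Fraction of originally masked positions that were restored correctly."""
    positions = [i for i, ch in enumerate(masked) if ch == MASK]
    if not positions:
        return 1.0
    correct = sum(restored[i] == truth[i] for i in positions)
    return correct / len(positions)


# Toy example: both strings are illustrative, not real database entries.
truth = "EVQLVESGGGLVQPGGSLRLSCAAS"
template = "QVQLVESGGGVVQPGRSLRLSCAAS"

masked = mask_n_terminus(truth, n=15)
restored = restore_from_template(masked, template)
accuracy = restoration_accuracy(restored, truth, masked)

print(masked)
print(restored)
print(f"accuracy on masked positions: {accuracy:.3f}")
```

A template fill of this kind needs a matching, aligned reference for every sequence. The abstract's point is that a language model can instead predict the masked residues from the sequence context alone, without knowing the germline.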

Availability and Implementation: AbLang is a Python package available at https://github.com/oxpig/AbLang.

Supplementary Information: Supplementary data are available at Bioinformatics Advances online.

Citing Articles

NanoAbLLaMA: construction of nanobody libraries with protein large language models.

Wang X, Chen H, Chen B, Liang L, Mei F, Huang B Front Chem. 2025; 13:1545136.

PMID: 40070407 PMC: 11893428. DOI: 10.3389/fchem.2025.1545136.

Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery.

Holt C, Janke A, Amlashi P, Jamieson P, Marinov T, Georgiev I bioRxiv. 2025.

PMID: 40060439 PMC: 11888244. DOI: 10.1101/2025.02.25.640114.

An antibody developability triaging pipeline exploiting protein language models.

Sweet-Jones J, Martin A MAbs. 2025; 17(1):2472009.

PMID: 40038849 PMC: 11901365. DOI: 10.1080/19420862.2025.2472009.

Deep learning-based design and experimental validation of a medicine-like human antibody library.

Rajagopal N, Choudhary U, Tsang K, Martin K, Karadag M, Chen H Brief Bioinform. 2025; 26(1).

PMID: 39851074 PMC: 11757908. DOI: 10.1093/bib/bbaf023.

Learning the language of antibody hypervariability.

Singh R, Im C, Qiu Y, Mackness B, Gupta A, Joren T Proc Natl Acad Sci U S A. 2025; 122(1):e2418918121.

PMID: 39793083 PMC: 11725859. DOI: 10.1073/pnas.2418918121.
