Dataset Size and Composition Impact the Reliability of Performance Benchmarks for Peptide-MHC Binding Predictions

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2014 Jul 15

PMID 25017736

Citations 38

Authors

Yohan Kim

John Sidney

Soren Buus

Alessandro Sette

Morten Nielsen

Bjoern Peters

Abstract

Background: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set.
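The two evaluation approaches described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method used in the study: the function names (`k_fold_indices`, `cross_validated_auc`, `blind_auc`, `fit_toy`) are hypothetical, the toy scorer is a stand-in for a real predictor, and the choice of AUC as the performance metric is an assumption, since the abstract does not name one.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k disjoint test folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def auc(scores, labels):
    """Fraction of (binder, non-binder) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one binder and one non-binder")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_toy(train):
    """Hypothetical toy scorer (an assumption, not a real prediction method):
    per-position residue frequencies among the training binders."""
    counts, n_binders = {}, 0
    for pep, is_binder in train:
        if is_binder:
            n_binders += 1
            for pos, aa in enumerate(pep):
                counts[(pos, aa)] = counts.get((pos, aa), 0) + 1
    denom = max(n_binders, 1)
    return lambda pep: sum(counts.get((pos, aa), 0)
                           for pos, aa in enumerate(pep)) / denom

def cross_validated_auc(data, fit, k=5, seed=0):
    """Each fold is scored by a model trained on the other k-1 folds;
    held-out scores are pooled before computing the AUC."""
    scores, labels = [], []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        score = fit(train)
        for i in test_idx:
            pep, is_binder = data[i]
            scores.append(score(pep))
            labels.append(is_binder)
    return auc(scores, labels)

def blind_auc(train, blind, fit):
    """Train once on all benchmark data, then score a separately generated set."""
    score = fit(train)
    return auc([score(p) for p, _ in blind], [y for _, y in blind])
```

Both functions take a list of `(peptide, is_binder)` pairs. The only difference is where the test peptides come from: `cross_validated_auc` draws them from the benchmark itself, whereas `blind_auc` uses data generated independently of the training set, which is the comparison the study makes.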

Results: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences between cross-validated and blind prediction performances. We use these findings to derive quantitative rules for how large and diverse datasets need to be to provide generalizable performance estimates.

Conclusion: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. Here we identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and are thus less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark datasets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.

Citing Articles

Energy landscapes of peptide-MHC binding.

Collesano L, Luksza M, Lassig M PLoS Comput Biol. 2024; 20(9):e1012380.

PMID: 39226310 PMC: 11398667. DOI: 10.1371/journal.pcbi.1012380.

GIHP: Graph convolutional neural network based interpretable pan-specific HLA-peptide binding affinity prediction.

Su L, Yan Y, Ma B, Zhao S, Cui Z Front Genet. 2024; 15:1405032.

PMID: 39050251 PMC: 11266168. DOI: 10.3389/fgene.2024.1405032.

NeoMUST: an accurate and efficient multi-task learning model for neoantigen presentation.

Ma W, Zhang J, Yao H Life Sci Alliance. 2024; 7(4).

PMID: 38290755 PMC: 10828515. DOI: 10.26508/lsa.202302255.

Accurate TCR-pMHC interaction prediction using a BERT-based transfer learning method.

Zhang J, Ma W, Yao H Brief Bioinform. 2023; 25(1).

PMID: 38040492 PMC: 10783865. DOI: 10.1093/bib/bbad436.

epitopepredict: a tool for integrated MHC binding prediction.

Farrell D GigaByte. 2023; 2021:gigabyte13.

PMID: 36824339 PMC: 9631954. DOI: 10.46471/gigabyte.13.
