
External Validation: A Simulation Study to Compare Cross-validation Versus Holdout or External Testing to Assess the Performance of Clinical Prediction Models Using PET Data from DLBCL Patients

Overview
Journal EJNMMI Res
Date 2022 Sep 11
PMID 36089634
Abstract

Aim: Clinical prediction models need to be validated. In this study, we used simulated data to compare various internal and external validation approaches.

Methods: Data of 500 patients were simulated using the distributions of metabolic tumor volume, standardized uptake value, the maximal distance between the largest lesion and another lesion, WHO performance status and age from 296 diffuse large B cell lymphoma patients. These data were used to predict progression after 2 years with an existing logistic regression model. Using the simulated data, we applied cross-validation, bootstrapping and holdout (n = 100). We simulated new external datasets (n = 100, n = 200, n = 500) and, in addition, (1) simulated stage-specific external datasets, (2) varied the cut-off for high-risk patients, (3) varied the false positive and false negative rates, and (4) simulated a dataset with EARL2 characteristics. All internal and external simulations were repeated 100 times. Model performance was expressed as the cross-validated area under the curve (CV-AUC ± SD) and the calibration slope.
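The abstract does not include the authors' simulation code. As a rough illustration of the internal validation comparison described above, the following Python sketch contrasts repeated cross-validation with a repeated holdout split (n = 100) on synthetic data; the feature distributions, true coefficients and sample generator below are assumptions chosen for demonstration, not the study's actual DLBCL-derived distributions.

```python
# Minimal sketch (not the authors' code): compare repeated cross-validation
# with a repeated holdout split on simulated data. The five features are
# hypothetical stand-ins for MTV, SUV, Dmax, WHO performance status and age;
# their distributions and the true coefficients are assumed for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 500  # simulated patients, as in the study

X = rng.normal(size=(n, 5))
true_logit = 0.8 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2] - 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

model = LogisticRegression()

# Repeated 5-fold cross-validation on the full dataset (100 repeats).
cv_aucs = []
for seed in range(100):
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cv_aucs.append(cross_val_score(model, X, y, cv=folds, scoring="roc_auc").mean())

# Holdout: set aside 100 patients for testing, repeated 100 times.
ho_aucs = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=100, random_state=seed)
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    ho_aucs.append(roc_auc_score(y_te, p))

print(f"CV-AUC  {np.mean(cv_aucs):.2f} +/- {np.std(cv_aucs):.2f}")
print(f"Holdout {np.mean(ho_aucs):.2f} +/- {np.std(ho_aucs):.2f}")
```

On such synthetic data the two point estimates are typically close, while the spread of the holdout AUCs is noticeably wider, mirroring the pattern reported in the Results.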

Results: Cross-validation (0.71 ± 0.06) and holdout (0.70 ± 0.07) resulted in comparable model performance, but the holdout estimate carried greater uncertainty. Bootstrapping resulted in a CV-AUC of 0.67 ± 0.02. The calibration slope was comparable across these internal validation approaches. Increasing the size of the test set resulted in more precise CV-AUC estimates and a smaller SD for the calibration slope. For test datasets restricted to specific stages, the CV-AUC increased with higher Ann Arbor stage. As expected, changing the cut-off for high risk and the false positive and false negative rates influenced model performance, which is clearly shown by the low calibration slope. The EARL2 dataset resulted in similar model performance and precision, but the calibration slope indicated overfitting.
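The calibration slope referenced throughout the Results is commonly estimated by regressing the observed outcomes on the log-odds of the predicted risks in a univariable logistic model: a slope near 1 suggests good calibration, whereas a slope well below 1 indicates predictions that are too extreme (overfitting), as the abstract reports for the EARL2 dataset. Below is a minimal sketch of this standard estimator; it is not the authors' implementation, and `y_test` and `p_test` are hypothetical held-out outcomes and predicted probabilities.

```python
# Standard calibration-slope estimator (a sketch, not the authors' code):
# fit logit(y) ~ a + b * log-odds(p); b is the calibration slope.
import numpy as np
import statsmodels.api as sm

def calibration_slope(y_test, p_test, eps=1e-6):
    p = np.clip(p_test, eps, 1 - eps)      # guard against log(0)
    lp = np.log(p / (1 - p))               # linear predictor (log-odds)
    design = sm.add_constant(lp)           # intercept + slope
    fit = sm.Logit(y_test, design).fit(disp=0)
    return fit.params[1]                   # slope on the linear predictor
```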

Conclusion: For small datasets, it is not advisable to use a holdout set or a very small external dataset with similar characteristics, since a single small test dataset suffers from large uncertainty. Repeated CV using the full training dataset is preferred instead. Our simulations also demonstrated that it is important to consider the impact of differences in patient population between training and test data, which may call for adjustment or stratification of relevant variables.

Citing Articles

Pre-trained convolutional neural networks identify Parkinson's disease from spectrogram images of voice samples.

Rahmatallah Y, Kemp A, Iyer A, Pillai L, Larson-Prior L, Virmani T. Sci Rep. 2025; 15(1):7337.

PMID: 40025201 PMC: 11873116. DOI: 10.1038/s41598-025-92105-6.


Discriminative ability, responsiveness, and interpretability of smoothness index of gait in people with multiple sclerosis.

Castiglia S, Dal Farra F, Trabassi D, Turolla A, Serrao M, Nocentini U. Arch Physiother. 2025; 15:9-18.

PMID: 39906096 PMC: 11791763. DOI: 10.33393/aop.2025.3289.


Automatic Recognition of Motor Skills in Triathlon: A Novel Tool for Measuring Movement Cadence and Cycling Tasks.

Chesher S, Martinotti C, Chapman D, Rosalie S, Charlton P, Netto K. J Funct Morphol Kinesiol. 2024; 9(4).

PMID: 39728253 PMC: 11676696. DOI: 10.3390/jfmk9040269.


The Aachen ACLF ICU score predicts ICU mortality in critically ill patients with acute-on-chronic liver failure.

Pollmanns M, Kister B, Abu Jhaisha S, Adams J, Kabak E, Brozat J. Sci Rep. 2024; 14(1):30497.

PMID: 39681633 PMC: 11649908. DOI: 10.1038/s41598-024-82178-0.


Predictive Model Building for Aggregation Kinetics Based on Molecular Dynamics Simulations of an Antibody Fragment.

Wang Y, Williams H, Dikicioglu D, Dalby P. Mol Pharm. 2024; 21(11):5827-5841.

PMID: 39348223 PMC: 11539058. DOI: 10.1021/acs.molpharmaceut.4c00859.

