
Inconsistency in the Use of the Term "validation" in Studies Reporting the Performance of Deep Learning Algorithms in Providing Diagnosis from Medical Imaging

Overview
Journal: PLoS One
Date: 2020 Sep 11
PMID: 32915901
Citations: 10
Abstract

Background: The development of deep learning (DL) algorithms is a three-step process: training, tuning, and testing. Studies are inconsistent in their use of the term "validation", with some using it to refer to tuning and others to testing, which hinders accurate delivery of information and may inadvertently exaggerate the performance of DL algorithms. We investigated the extent of inconsistency in the usage of the term "validation" in studies on the accuracy of DL algorithms in providing diagnosis from medical imaging.

Methods And Findings: We analyzed the full texts of research papers cited in two recent systematic reviews. The papers were categorized according to whether the term "validation" was used to refer to tuning alone, both tuning and testing, or testing alone. We analyzed whether paper characteristics (i.e., journal category, field of study, year of print publication, journal impact factor [JIF], and nature of test data) were associated with the terminology usage, using multivariable logistic regression analysis with generalized estimating equations. Of 201 papers published in 125 journals, 118 (58.7%), 9 (4.5%), and 74 (36.8%) used the term to refer to tuning alone, both tuning and testing, and testing alone, respectively. A weak association was noted between higher JIF and using the term to refer to testing (i.e., testing alone or both tuning and testing) instead of tuning alone (vs. JIF <5; JIF 5 to 10: adjusted odds ratio 2.11, P = 0.042; JIF >10: adjusted odds ratio 2.41, P = 0.089). Journal category, field of study, year of print publication, and nature of test data were not significantly associated with the terminology usage.

Conclusions: Existing literature has a significant degree of inconsistency in using the term "validation" when referring to the steps in DL algorithm development. Efforts are needed to improve the accuracy and clarity in the terminology usage.
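The three-step development process described in the Background can be sketched as a disjoint three-way data partition, in which the tuning split is what many papers loosely call the "validation" set. The following is a minimal illustrative sketch, not the authors' protocol; the function name, split fractions, and seed are assumptions chosen for the example:

```python
import random

def three_way_split(items, tune_frac=0.15, test_frac=0.15, seed=42):
    """Partition a dataset into disjoint training, tuning, and test subsets.

    The tuning subset (often loosely called the "validation" set) is used
    for hyperparameter selection; reported performance should come only
    from the held-out test subset.
    """
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_tune = int(n * tune_frac)
    test = shuffled[:n_test]                    # held out for final evaluation
    tuning = shuffled[n_test:n_test + n_tune]   # hyperparameter tuning
    training = shuffled[n_test + n_tune:]       # model fitting
    return training, tuning, test

train, tune, test = three_way_split(list(range(1000)))
print(len(train), len(tune), len(test))  # prints: 700 150 150
```

Keeping the three subsets disjoint is what makes the distinction matter: reporting performance on the tuning split as if it were the test split is exactly the conflation the paper documents.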

Citing Articles

Performance of Radiomics-based machine learning and deep learning-based methods in the prediction of tumor grade in meningioma: a systematic review and meta-analysis.

Tavanaei R, Akhlaghpasand M, Alikhani A, Hajikarimloo B, Ansari A, Yong R Neurosurg Rev. 2025; 48(1):78.

PMID: 39849257 DOI: 10.1007/s10143-025-03236-3.


Differences in technical and clinical perspectives on AI validation in cancer imaging: mind the gap!

Chouvarda I, Colantonio S, Verde A, Jimenez-Pastor A, Cerda-Alberich L, Metz Y Eur Radiol Exp. 2025; 9(1):7.

PMID: 39812924 PMC: 11735720. DOI: 10.1186/s41747-024-00543-0.


The Role of Machine Learning in the Detection of Cardiac Fibrosis in Electrocardiograms: Scoping Review.

Handra J, James H, Mbilinyi A, Moller-Hansen A, ORiley C, Andrade J JMIR Cardio. 2025; 8:e60697.

PMID: 39753213 PMC: 11730231. DOI: 10.2196/60697.


Reporting Guidelines for Artificial Intelligence Studies in Healthcare (for Both Conventional and Large Language Models): What's New in 2024.

Park S, Suh C Korean J Radiol. 2024; 25(8):687-690.

PMID: 39028011 PMC: 11306008. DOI: 10.3348/kjr.2024.0598.


Position Statements of the Emerging Trends Committee of the Asian Oceanian Society of Radiology on the Adoption and Implementation of Artificial Intelligence for Radiology.

Wee N, Git K, Lee W, Raval G, Pattokhov A, Ming Ho E Korean J Radiol. 2024; 25(7):603-612.

PMID: 38942454 PMC: 11214917. DOI: 10.3348/kjr.2024.0419.

