Self-Distillation: Towards Efficient and Compact Neural Networks

Overview

Journal: IEEE Trans Pattern Anal Mach Intell

Specialties: Biomedical Engineering, Medical Informatics

Date: 2021 Mar 18

PMID: 33735074

Citations: 10

Authors

Linfeng Zhang

Chenglong Bao

Kaisheng Ma

Abstract

Deep neural networks have achieved remarkable results in the last several years. However, each breakthrough in neural network accuracy is accompanied by explosive growth in computation and parameters, which severely limits model deployment. In this paper, we propose a novel knowledge distillation technique named self-distillation to address this problem. Self-distillation attaches several attention modules and shallow classifiers at different depths of a neural network and distills knowledge from the deepest classifier to the shallower classifiers. Unlike conventional knowledge distillation methods, where the knowledge of a teacher model is transferred to a separate student model, self-distillation can be considered knowledge transfer within the same model, from the deeper layers to the shallow layers. Moreover, the additional classifiers in self-distillation allow the neural network to work in a dynamic manner, which leads to much higher acceleration. Experiments demonstrate that self-distillation is consistently and significantly effective across various neural networks and datasets. On average, accuracy boosts of 3.49 and 2.32 percent are observed on CIFAR100 and ImageNet, respectively. In addition, experiments show that self-distillation can be combined with other model compression methods, including knowledge distillation, pruning, and lightweight model design.
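The two ideas in the abstract, distilling from the deepest classifier into the shallower ones and using those extra classifiers for dynamic inference, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the weighting `alpha`, the `temperature`, the confidence `threshold`, and the choice of a softened KL-divergence term as the distillation signal. The sketch only computes loss values in plain Python; a real training setup would use an autodiff framework and would typically stop gradients through the teacher outputs.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one logit vector against an integer class label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(exit_logits, label, alpha=0.5, temperature=3.0):
    """Loss for one sample, given logits from every classifier.

    exit_logits is ordered shallow -> deep; the last entry plays the role
    of the teacher. alpha and temperature are hypothetical hyperparameters.
    """
    teacher = softmax(exit_logits[-1], temperature)
    # The deepest classifier is trained on the label alone.
    loss = cross_entropy(exit_logits[-1], label)
    for logits in exit_logits[:-1]:
        student = softmax(logits, temperature)
        hard = cross_entropy(logits, label)  # supervision from the label
        soft = kl_divergence(teacher, student) * temperature ** 2  # from the teacher
        loss += (1 - alpha) * hard + alpha * soft
    return loss


def early_exit_predict(exit_logits, threshold=0.9):
    """Return (predicted class, exit index) from the first confident classifier.

    Falls back to the deepest classifier when no exit reaches the threshold.
    """
    last = len(exit_logits) - 1
    for depth, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold or depth == last:
            return probs.index(confidence), depth


# Three classifiers (shallow, middle, deepest) scoring three classes.
sample = [[0.1, 0.2, 0.0], [0.0, 6.0, 0.0], [0.0, 9.0, 0.0]]
print(self_distillation_loss(sample, label=1))
print(early_exit_predict(sample, threshold=0.9))
```

In this sketch an easy input leaves the network at a shallow classifier and skips the remaining layers, which is one way to read the abstract's claim about dynamic acceleration; the threshold trades accuracy against how often the early exits are taken.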

Citing Articles

An improved ShuffleNetV2 method based on ensemble self-distillation for tomato leaf diseases recognition.

Ni S, Jia Y, Zhu M, Zhang Y, Wang W, Liu S. Front Plant Sci. 2025; 15:1521008.

PMID: 39906224 PMC: 11790667. DOI: 10.3389/fpls.2024.1521008.

A Comprehensive Survey of Deep Learning Approaches in Image Processing.

Trigka M, Dritsas E. Sensors (Basel). 2025; 25(2).

PMID: 39860903 PMC: 11769216. DOI: 10.3390/s25020531.

Masked autoencoder of multi-scale convolution strategy combined with knowledge distillation for facial beauty prediction.

Gan J, Xiong J. Sci Rep. 2025; 15(1):2784.

PMID: 39843525 PMC: 11754610. DOI: 10.1038/s41598-025-86831-0.

Uncertainty-aware genomic deep learning with knowledge distillation.

Zhou J, Rizzo K, Tang Z, Koo P. bioRxiv. 2024.

PMID: 39605624 PMC: 11601481. DOI: 10.1101/2024.11.13.623485.

Graph masked self-distillation learning for prediction of mutation impact on protein-protein interactions.

Zhang Y, Dong M, Deng J, Wu J, Zhao Q, Gao X. Commun Biol. 2024; 7(1):1400.

PMID: 39462102 PMC: 11513059. DOI: 10.1038/s42003-024-07066-9.