Survey on knowledge distillation and its application
Vol. 28, Issue 9, Pages 2817-2832 (2023)
Received: 28 March 2022 / Revised: 22 July 2022 / Published: 16 September 2023
DOI: 10.11834/jig.220273
As deep learning methods continue to develop, their storage and computational costs keep growing, which poses challenges for deployment on resource-constrained platforms. To address this challenge, researchers have proposed a series of neural network compression methods, among which knowledge distillation is a simple yet effective approach that has become a research hotspot. Knowledge distillation is characterized by its "teacher-student" architecture: a large network guides the training of a small network to improve the small network's performance in the target application, thereby indirectly achieving network compression. At the same time, knowledge distillation leaves the network architecture unchanged, which makes it highly extensible. This paper first introduces the origin and development of knowledge distillation, then divides its improved variants into two categories according to their optimization objective, namely performance-oriented knowledge distillation and compression-oriented knowledge distillation, and systematically analyzes and summarizes both classic and recent methods. Finally, several typical application scenarios of knowledge distillation are presented to deepen the understanding of the principles and applications of the various methods. Although knowledge distillation has achieved good results to date, each class of methods still has shortcomings; this paper therefore also summarizes the drawbacks of the different knowledge distillation methods and, based on the analysis of network performance and network compression, offers a summary of and outlook on knowledge distillation research.
Deep learning is an effective method for various tasks, including image classification, object detection, and semantic segmentation. Various deep neural network (DNN) architectures, such as the Visual Geometry Group network (VGGNet), residual network (ResNet), and GoogLeNet, have been proposed in recent years, all of which incur high computational and storage costs. The effectiveness of DNNs mainly comes from their high capacity and architectural complexity, which allow them to learn sufficient knowledge from datasets and generalize well to real-world scenes. However, high capacity and architectural complexity also cause a drastic increase in storage and computational costs, complicating the deployment of deep learning methods on devices with limited resources. Given the increasing demand for deep learning on portable devices such as mobile phones, the cost of DNNs must urgently be reduced. Researchers have developed a series of methods, collectively called model compression, to solve this problem. These methods fall into four main categories: network pruning, weight quantization, weight decomposition, and knowledge distillation. Knowledge distillation is a relatively new method, first introduced in 2014. It transfers the knowledge learned by a cumbersome network (the teacher network) to a lightweight network (the student network) so that the student network performs similarly to the teacher network; compression is then achieved by using the student network for inference. Traditional knowledge distillation provides softened labels to the student network as the training target instead of having the student network learn from the ground truth alone. From these softened labels, the student network can learn the correlations among classes in a classification problem, which serve as extra supervision during training.
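The soft-target mechanism described above can be sketched as follows. This is a minimal NumPy illustration rather than any paper's code; the temperature value T=4 is a common choice, not one prescribed by the survey.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a larger T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the softened teacher and student outputs.

    The T**2 factor keeps gradient magnitudes comparable across
    temperatures, as in Hinton et al. (2015).
    """
    p = softmax(teacher_logits, T)     # soft target from the teacher
    q = softmax(student_logits, T)
    return (T ** 2) * float(np.sum(p * (np.log(p) - np.log(q))))
```

In practice this term is combined with the ordinary cross-entropy on the ground-truth labels via a weighting hyperparameter, so the student learns both the hard labels and the inter-class correlations encoded in the soft targets.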
Ideally, a student network trained by knowledge distillation should approximate the performance of the teacher network, so that the computational and storage costs of the compressed network are reduced with only minor degradation relative to the uncompressed network. In practice, this ideal is rarely attainable when the compression rate is large enough to be comparable with that of other model compression methods. Knowledge distillation can instead be viewed as a means of enhancing the performance of a deep learning model, enabling it to outperform other models of similar size, in addition to serving as a model compression method. In this study, we review the knowledge distillation methods developed in recent years from a new perspective. We sort the existing methods according to their target, dividing them into performance-oriented methods and compression-oriented methods. Performance-oriented methods emphasize improving the performance of the student network, whereas compression-oriented methods focus on the relationship between the size of the student network and its performance. We further divide these two categories into specific ideas. For performance-oriented methods, we describe the state of the art in two aspects: the representation of knowledge and the way knowledge is learned. The representation of knowledge has been widely studied in recent years: researchers attempt to derive knowledge from the teacher network beyond its output vectors in order to enrich the knowledge available during training. Other forms of knowledge include middle-layer feature maps, representations extracted from the middle layers, and structural knowledge. By combining this extra knowledge with the soft target of traditional knowledge distillation, the student network can learn about the teacher network's behavior during forward propagation and thus act similarly to the teacher network.
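To make the idea of middle-layer knowledge concrete, the following sketch shows a FitNets-style hint loss. The feature sizes (a 64-d student feature against a 128-d teacher feature) and the linear regressor bridging the dimension gap are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def hint_loss(f_student, f_teacher, W):
    """FitNets-style hint loss: project the student's middle-layer
    feature with a regressor W to the teacher's dimension, then
    penalize the squared distance between the two representations."""
    projected = f_student @ W              # align feature dimensions
    return float(np.mean((projected - f_teacher) ** 2))

rng = np.random.default_rng(0)
f_s = rng.standard_normal(64)              # hypothetical student feature
f_t = rng.standard_normal(128)             # hypothetical teacher feature
W = 0.1 * rng.standard_normal((64, 128))   # regressor, trained jointly in practice
loss = hint_loss(f_s, f_t, W)
```

During training, the regressor's parameters are learned together with the student, and the hint term is added to the soft-target loss rather than replacing it.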
Studies on the way knowledge is learned explore distillation architectures beyond the basic teacher-student setup, including online distillation, self-distillation, multi-teacher distillation, progressive knowledge distillation, and generative adversarial network (GAN)-based knowledge distillation. These architectures address the effectiveness of distillation and different use cases; for example, online distillation and self-distillation can be applied when a high-capacity teacher network is unavailable. In compression-oriented knowledge distillation, researchers combine neural architecture search (NAS) with knowledge distillation to balance the performance and the size of the student network. Many studies also examine how the size difference between the teacher network and the student network affects distillation performance; they conclude that a wide gap between the teacher and the student can cause performance degradation and propose bridging this gap with several middle-sized networks. We also formalize the different kinds of knowledge distillation methods and present uniform figures to help researchers grasp the basic ideas comprehensively and learn about recent work on knowledge distillation. One of the most notable characteristics of knowledge distillation is that the architectures of the teacher network and the student network stay intact during training, so methods for other tasks can be incorporated easily. We introduce recent work on applying knowledge distillation to different tasks, including object detection, face recognition, and natural language processing. Finally, we summarize the knowledge distillation methods discussed and propose several possible research directions.
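The online-distillation idea, in which peers teach each other without a pretrained teacher, can be sketched in the style of deep mutual learning. The snippet below is an illustrative assumption, not any paper's implementation; the peers' cross-entropy values are passed in as plain numbers to keep the sketch self-contained.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    z -= z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence D(p || q) between dense probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def mutual_losses(logits_a, logits_b, ce_a, ce_b):
    """Deep-mutual-learning objective: each peer minimizes its own
    cross-entropy plus a KL term pulling it toward the other peer's
    current prediction, so no pretrained teacher is needed."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return ce_a + kl(pb, pa), ce_b + kl(pa, pb)
```

Both networks are updated in the same training loop, each serving as the other's teacher; this is why such schemes remain usable when no large pretrained model is available.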
Recent research on knowledge distillation has mainly focused on enhancing the performance of the student network, where the major problem lies in finding a feasible source of knowledge in the teacher network. Compression-oriented knowledge distillation, in turn, suffers from the difficulty of designing the search space when NAS adjusts the network architecture. On the basis of this analysis, we propose three possible directions for future research: 1) obtaining knowledge from various tasks and architectures through knowledge distillation, 2) developing a search space for NAS combined with knowledge distillation, and adjusting the teacher network while searching for the student network, and 3) developing a metric for knowledge distillation and other model compression methods that evaluates both task performance and compression performance.