Survey on knowledge distillation and its application

Si Zhaofeng, Qi Honggang (School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract
As deep learning methods keep developing, their storage and computational costs keep growing, which poses a challenge to their application on resource-constrained platforms. To address this challenge, researchers have proposed a series of neural network compression methods, among which knowledge distillation is a simple yet effective one and has become a research hotspot. Knowledge distillation is characterized by its teacher-student architecture: a large network guides the training of a small network to improve the small network's performance in the target application, thereby achieving network compression indirectly. At the same time, knowledge distillation leaves the network architecture unchanged and therefore offers good extensibility. This paper first introduces the origin and development of knowledge distillation, then divides the improved knowledge distillation methods into two categories according to their optimization target, namely performance-oriented knowledge distillation and compression-oriented knowledge distillation, and systematically analyzes and summarizes both classical and state-of-the-art methods. Finally, several typical application scenarios of knowledge distillation are listed to deepen the understanding of the principles and applications of the various methods. Although knowledge distillation has achieved good results so far, the existing methods still have shortcomings; this paper also summarizes the drawbacks of different knowledge distillation methods and, based on the analysis of network performance and network compression, gives a summary of and outlook on knowledge distillation research.
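For context, the classical soft-target objective that the teacher-student framework described above builds on can be written as (the notation here is illustrative; the abstract itself does not fix symbols):

L_{\mathrm{KD}} = (1-\alpha)\,\mathrm{CE}\!\left(y, \sigma(z_s)\right) + \alpha\, T^{2}\,\mathrm{KL}\!\left(\sigma(z_t/T)\,\middle\|\,\sigma(z_s/T)\right)

where z_s and z_t denote the student and teacher logits, \sigma is the softmax function, T is the temperature that softens the output distribution, y is the ground-truth label, and \alpha weights the soft-label term against the ordinary cross-entropy term.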
Deep learning is an effective approach to various tasks, including image classification, object detection, and semantic segmentation. Various architectures of deep neural networks (DNNs), such as the Visual Geometry Group network (VGGNet), residual network (ResNet), and GoogLeNet, have been proposed in recent years, all of which have high computational and storage costs. The effectiveness of DNNs mainly comes from their high capacity and architectural complexity, which allow them to learn sufficient knowledge from datasets and generalize well to real-world scenes. However, high capacity and architectural complexity also cause a drastic increase in storage and computational costs, which complicates the deployment of deep learning methods on devices with limited resources. Given the increasing demand for deep learning on portable devices such as mobile phones, the cost of DNNs must be reduced. Researchers have developed a series of methods, collectively called model compression, to solve this problem. These methods fall into four main categories: network pruning, weight quantization, weight decomposition, and knowledge distillation.

Knowledge distillation is a comparatively new method, first introduced in 2014. It transfers the knowledge learned by a cumbersome network (the teacher network) to a lightweight network (the student network) so that the student network performs similarly to the teacher network; compression is then achieved by using the student network for inference. Traditional knowledge distillation provides softened labels to the student network as its training target instead of having the student learn the ground truth directly. By learning from softened labels, the student network can capture the correlation among classes in a classification problem, which acts as extra supervision during training. Ideally, a student network trained by knowledge distillation approximates the performance of the teacher network, so that the computational and storage costs of the compressed network are reduced with only minor degradation relative to the uncompressed network. In practice, however, this is almost unreachable when the compression rate is large enough to be comparable with those of other model compression methods. Knowledge distillation can instead be regarded as a means of enhancing the performance of a deep learning model so that it outperforms other models of similar size; in this sense, it serves both as a performance-enhancement technique and, indirectly, as a model compression method.
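As a concrete illustration of the soft-target training described above, the following is a minimal sketch of the classical distillation loss, assuming PyTorch; the temperature T and weight alpha are illustrative hyperparameters, not values prescribed by the survey.

# Minimal sketch of the classical soft-target distillation loss (assumes PyTorch).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Hard-label term: ordinary cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened teacher
    # and student distributions; the T**2 factor keeps gradient magnitudes
    # comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - alpha) * hard + alpha * soft

During training, only the student's parameters are typically updated; the teacher's logits are computed with gradients disabled.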
In this study, we review the knowledge distillation methods developed in recent years from a new perspective. We sort the existing methods according to their target, dividing them into performance-oriented methods and compression-oriented methods. Performance-oriented methods emphasize improving the performance of the student network, whereas compression-oriented methods focus on the relationship between the size of the student network and its performance. We further divide these two categories into specific ideas. For performance-oriented methods, we describe state-of-the-art approaches in two respects: the representation of knowledge and the way of learning knowledge. The representation of knowledge has been widely studied in recent years; researchers attempt to derive knowledge from the teacher network beyond its output vectors so as to enrich the knowledge available during training. Other forms of knowledge include middle-layer feature maps, representations extracted from middle layers, and structural knowledge. By combining this extra knowledge with the soft target of traditional knowledge distillation, the student network learns how the teacher network behaves during forward propagation and thus acts more like the teacher network.

Studies on the way of learning knowledge explore distillation architectures beyond the basic teacher-student setting, including online distillation, self-distillation, multi-teacher distillation, progressive knowledge distillation, and generative adversarial network (GAN)-based knowledge distillation. These architectures address the effectiveness of distillation and different use cases; for example, online distillation and self-distillation can be applied when a high-capacity teacher network is unavailable.

In compression-oriented knowledge distillation, researchers combine neural architecture search (NAS) methods with knowledge distillation to balance the performance and the size of the student network. Many studies also examine how the size gap between the teacher network and the student network affects distillation performance; they conclude that a wide gap between the teacher and the student can cause performance degradation and propose bridging this gap with several middle-sized networks.

We also formalize the different kinds of knowledge distillation methods and present the corresponding figures in a uniform way to help researchers grasp the basic ideas comprehensively and learn about recent work on knowledge distillation. One of the most notable characteristics of knowledge distillation is that the architectures of the teacher network and the student network stay intact during training, so methods for other tasks can be incorporated easily. We therefore introduce recent work applying knowledge distillation to different tasks, including object detection, face recognition, and natural language processing.

Finally, we summarize the knowledge distillation methods discussed above and propose several possible research directions. Recent research on knowledge distillation has mainly focused on enhancing the performance of the student network, where the major problem lies in finding a feasible source of knowledge in the teacher network. Compression-oriented knowledge distillation, in turn, suffers from the problem of designing the search space when NAS adjusts the network architecture. On the basis of this analysis, we propose three directions for future study: 1) obtaining knowledge from various tasks and architectures in the form of knowledge distillation; 2) developing a search space for NAS combined with knowledge distillation and adjusting the teacher network while searching for the student network; and 3) developing a metric for knowledge distillation and other model compression methods that evaluates both task performance and compression performance.
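To make the feature-based form of knowledge mentioned above more concrete, the following is a minimal sketch of a hint-style feature-matching loss in which the student mimics a middle-layer feature map of the teacher; the 1x1 adaptation convolution and the choice of layer are illustrative assumptions rather than a specific method from the survey (PyTorch is assumed).

# Minimal sketch of feature-based (hint-style) distillation, assuming PyTorch.
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convolution that projects the student feature map
        # to the teacher's channel dimension.
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # L2 distance between the adapted student feature and the
        # (detached) teacher feature of matching spatial size.
        return F.mse_loss(self.adapt(student_feat), teacher_feat.detach())

In practice, such a feature loss is usually added to the soft-target loss sketched earlier, with the teacher network kept frozen during training.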
Keywords
