
Survey on knowledge distillation and its application

Si Zhaofeng, Qi Honggang (School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)

Abstract

Deep learning is an effective method for various tasks, including image classification, object detection, and semantic segmentation. Various deep neural network (DNN) architectures, such as the Visual Geometry Group network (VGGNet), residual network (ResNet), and GoogLeNet, have been proposed in recent years, all of which incur high computational and storage costs. The effectiveness of DNNs mainly comes from their high capacity and architectural complexity, which allow them to learn sufficient knowledge from datasets and generalize well to real-world scenes. However, high capacity and architectural complexity also cause a drastic increase in storage and computational costs, complicating the deployment of deep learning methods on devices with limited resources. Given the increasing demand for deep learning on portable devices such as mobile phones, the cost of DNNs must be urgently reduced. Researchers have developed a series of methods, collectively called model compression, to solve this problem. These methods fall into four main categories: network pruning, weight quantization, weight decomposition, and knowledge distillation.

Knowledge distillation is a comparatively new method, first introduced in 2014. It attempts to transfer the knowledge learned by a cumbersome network (the teacher network) to a lightweight network (the student network), allowing the student network to perform similarly to the teacher network. Compression is then achieved by using the student network for inference. Traditional knowledge distillation works by providing softened labels to the student network as the training target instead of having the student network learn from the ground truth directly. By learning from softened labels, the student network captures the correlation among classes in a classification problem; this can be viewed as extra supervision during training. The student network trained by knowledge distillation should ideally approximate the performance of the teacher network. In this way, the computational and storage costs of the compressed network are reduced with only minor degradation compared with the uncompressed network. However, this ideal is almost unreachable when the compression rate is large enough to be comparable with that of other model compression methods. Knowledge distillation can instead be taken as a means of enhancing the performance of a deep learning model, enabling it to outperform other models of similar size, in addition to serving as a model compression method.

In this study, we review the knowledge distillation methods developed in recent years from a new perspective. We sort the existing methods according to their target, dividing them into performance-oriented and compression-oriented methods. Performance-oriented methods emphasize improving the performance of the student network, whereas compression-oriented methods focus on the relationship between the size of the student network and its performance. We further divide these two categories into specific ideas. For performance-oriented methods, we describe state-of-the-art work in two aspects: the representation of knowledge and the way of learning knowledge. The representation of knowledge has been widely studied in recent years; researchers attempt to derive knowledge from the teacher network beyond its output vectors to enrich the knowledge available during training. Other forms of knowledge include middle-layer feature maps, representations extracted from middle layers, and structural knowledge. By combining this extra knowledge with the soft target of traditional knowledge distillation, the student network can learn how the teacher network behaves during forward propagation and thus act similarly to it. Studies on the way of learning knowledge explore distillation architectures beyond the basic teacher-student architecture, including online distillation, self-distillation, multi-teacher distillation, progressive knowledge distillation, and generative adversarial network (GAN)-based knowledge distillation. These architectures address the effectiveness of distillation and different use cases; for example, online distillation and self-distillation can be applied when a high-capacity teacher network is unavailable.

In compression-oriented knowledge distillation, researchers combine neural architecture search (NAS) methods with knowledge distillation to balance the performance and the size of the student network. Many studies also examine how the size difference between the teacher network and the student network affects distillation performance; they conclude that a wide gap between the teacher and the student can cause performance degradation and propose bridging the gap with several middle-sized networks. We also formalize the different kinds of knowledge distillation methods and present the corresponding figures uniformly to help researchers grasp the basic ideas comprehensively and learn about recent work on knowledge distillation.

One of the most notable characteristics of knowledge distillation is that the architectures of the teacher network and the student network remain intact during training, so methods for other tasks can be incorporated easily. We therefore introduce recent work on knowledge distillation for different tasks, including object detection, face recognition, and natural language processing. Finally, we summarize the knowledge distillation methods discussed and propose several possible research directions. Recent research on knowledge distillation has mainly focused on enhancing the performance of the student network, and the major problem lies in finding a feasible source of knowledge in the teacher network. Moreover, compression-oriented knowledge distillation suffers from the problem of designing the search space when NAS adjusts the network architecture. On the basis of this analysis, we propose three possible directions for future study: 1) obtaining knowledge from various tasks and architectures in the form of knowledge distillation, 2) developing a search space for NAS when combined with knowledge distillation and adjusting the teacher network while searching for the student network, and 3) developing a metric for knowledge distillation and other model compression methods that evaluates both task performance and compression performance.
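The traditional soft-target distillation summarized above can be sketched in a few lines. The following is a minimal NumPy illustration of the distillation loss: a KL-divergence term pulling the student toward the teacher's temperature-softened outputs, combined with ordinary cross-entropy on the ground truth. The function names, the temperature `T = 4.0`, and the weight `alpha = 0.7` are illustrative assumptions, not values prescribed by any particular method surveyed here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature yields a
    softer distribution that exposes inter-class correlations."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of a soft-target term (KL divergence between the
    teacher's and student's softened outputs) and a hard-target
    cross-entropy term; `temperature` and `alpha` are illustrative."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student); the T^2 factor keeps the gradient scale
    # comparable across temperatures, as in Hinton et al. (2015).
    soft = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                               - np.log(p_student + 1e-12)), axis=-1)
    soft_loss = (temperature ** 2) * soft.mean()
    # Standard cross-entropy against the ground-truth labels.
    p_hard = softmax(student_logits)
    hard_loss = -np.log(
        p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Note that when the student's logits match the teacher's, the soft-target term vanishes and only the weighted hard-label cross-entropy remains, which is consistent with the view of the soft target as extra supervision layered on top of ordinary training.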