A Survey of Model Distillation Methods in High-Level Semantic Analysis

Sun Ruoyu, Xiong Hongkai (Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)

Abstract
The goal of computer vision is to build computational models that approximate the human visual system. With the development of deep neural networks (DNNs), the analysis and understanding of high-level semantics in computer vision has become a research focus. High-level semantics in computer vision are typically human-interpretable, expressible descriptors that characterize the content of media signals such as images and videos; typical high-level semantic analysis tasks include image classification, object detection, instance segmentation, semantic segmentation, video scene recognition, and object tracking. Algorithms based on deep neural networks have steadily improved the performance of computer vision tasks, but growing model size and decreasing computational efficiency have come with this progress. Model distillation is a model compression approach based on transfer learning. Such schemes typically use a pre-trained model as the teacher, extract its effective representations, such as model outputs, hidden-layer features, or inter-feature similarities, and treat these representations as additional supervision signals for training a smaller student model with faster inference, so that the improved small model can replace the large one. Because model distillation offers a good trade-off between model performance and computational complexity, it is increasingly used in deep-learning-based high-level semantic analysis. Since the concept of model distillation was introduced in 2014, researchers have developed a large number of distillation methods for high-level semantic analysis, with the most extensive applications in image classification, object detection, and semantic segmentation. This paper surveys and summarizes representative model distillation schemes for these typical tasks, organized by vision task. We first start from distillation methods for classification, the most mature and widely applied setting, introduce their different design ideas and application scenarios, present comparisons of selected experimental results, and point out how the conditions for applying distillation differ between classification and detection or segmentation tasks. We then introduce several typical distillation methods specially designed for object detection and semantic segmentation, explain their design goals and ideas in light of the model structures, and provide comparisons and analyses of selected experimental results. Finally, we summarize and analyze the current state of model distillation methods in high-level semantic analysis, point out remaining difficulties and limitations, and envision possible directions for future exploration.
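To make the teacher-as-extra-supervision idea above concrete, the following is a minimal sketch of a soft-label distillation loss, assuming PyTorch. The temperature, the loss weighting, and the function name are illustrative assumptions, not details taken from any specific surveyed method.

```python
# Minimal sketch of soft-label distillation (teacher outputs as extra supervision).
# `temperature` and `alpha` are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Combine softened teacher predictions with the hard ground truth."""
    # Soft targets: teacher probabilities at a raised temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    # KL divergence between the softened distributions, scaled by T^2 so its
    # gradient magnitude stays comparable to the hard-label term.
    kd_term = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Conventional cross-entropy with the one-hot ground truth.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In practice the weighted sum replaces the plain cross-entropy term in the student's training loop, while the teacher is kept frozen.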
Keywords
Model distillation for high-level semantic understanding: a survey

Sun Ruoyu, Xiong Hongkai (Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)

Abstract
Computer vision aims to construct computational models that approximate the functions of the human visual system. Current deep learning models keep raising the performance upper bounds of multiple computer vision tasks, especially for the analysis and understanding of high-level semantics, i.e., human-interpretable descriptors of multimedia content. Typical high-level semantic understanding tasks include image classification, object detection, instance segmentation, semantic segmentation, and video recognition and tracking. With the development of convolutional neural networks (CNNs), deep-learning-based high-level semantic understanding has benefited from increasingly deep and cumbersome models, which in turn raises challenges of storage and computational cost. To obtain lighter structures and higher computational efficiency, many model compression strategies have been proposed, e.g., pruning, weight quantization, and low-rank factorization, but these often alter the network structure or cause severe performance drops when deployed on computer vision tasks. Model distillation is a typical compression method that applies transfer learning to model compression. In general, model distillation uses a large, complicated pre-trained model as the "teacher" and extracts its effective representations, e.g., model outputs, hidden-layer features, or similarities between feature maps. These representations serve as extra supervision signals, together with the original ground truth, for training a lighter and faster model, called the "student". Because model distillation provides a favorable balance between performance and efficiency, it is being rapidly explored on different computer vision tasks. This paper surveys the progress of model distillation methods since their introduction in 2014 and reviews popular distillation strategies and current algorithms deployed for image classification, object detection, and semantic segmentation. We first introduce distillation methods for image classification, where model distillation is most mature. The fundamental approach uses the teacher classifier's output logits as soft labels, which give the student inter-category structural information that is unavailable in conventional one-hot ground truth. Furthermore, hint learning exploits the hierarchical structure of neural networks and takes feature maps from hidden layers as additional teacher representations; most later distillation strategies are designed around or derived from these two approaches. In terms of framework design and application scenarios, the paper then introduces several typical distillation strategies for classification models. Some methods focus on novel supervision-signal designs that differ from conventional soft labels or plain feature maps; the new targets the student is asked to mimic are usually computed from attention or similarity maps of different layers, from data augmentations, or from sampled images. Other methods add noise or perturbation to the teacher classifier's outputs, or use probabilistic inference to narrow the gap between teacher and student models.
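As a companion to the hint-learning idea described above, here is a rough sketch of a feature-imitation ("hint") loss, assuming PyTorch. The 1x1-convolution regressor and the L2 objective follow the common FitNets-style recipe; all names, channel counts, and shapes are illustrative assumptions.

```python
# Rough sketch of hint learning: the student mimics an intermediate teacher
# feature map through a small regressor that aligns channel dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convolution mapping student features into the teacher's channel space.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        aligned = self.regressor(student_feat)
        # Align spatial size if the two backbones downsample differently.
        if aligned.shape[-2:] != teacher_feat.shape[-2:]:
            aligned = F.interpolate(aligned, size=teacher_feat.shape[-2:],
                                    mode="bilinear", align_corners=False)
        # L2 imitation loss on the hidden-layer representation; teacher is frozen.
        return F.mse_loss(aligned, teacher_feat.detach())
```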
Such specially designed features or logits aim to represent the teacher's knowledge more faithfully than plain features taken directly from intermediate layer outputs. In other methods, the distillation procedure itself is altered, and more elaborate schemes are introduced to transfer the teacher's knowledge instead of simply training the student with generated labels or features. Also, as generative adversarial networks (GANs) achieve promising performance in image synthesis, some distillation methods introduce adversarial mechanisms into classifier distillation, where the teacher's features are regarded as "real" samples and the student is expected to "generate" similar features. In many practical settings such as model compression, self-training, and parallel computing, classifier distillation is also combined with the specific workflow, e.g., fine-tuning quantized networks with full-precision teachers, distilling a student from its own earlier versions during training, or using models from different computing nodes as teachers. After introducing these approaches, we summarize the performance of popular strategies in a table, comparing the top-1 accuracy improvements they bring on several typical classification datasets. The second part of the paper focuses on distillation methods specially developed for computer vision tasks more complicated than classification, e.g., object detection, instance segmentation, and semantic segmentation. Unlike classifiers, models for these tasks contain more redundant structures and heterogeneous outputs, so recent work on distilling detectors and segmentation models remains relatively scarce compared with classifier distillation. The paper describes the current challenges in designing distillation frameworks for detection and segmentation, and then introduces typical distillation methods for detectors and segmentation models according to their tasks and multi-part structures. Since few works target instance segmentation distillation specifically, similar distillation methods for object detectors are introduced at the beginning of this part. For detectors, localization requires special attention to local information around foreground objects, and images in detection datasets generally contain more complicated scenes in which large numbers of different objects may appear; hence, directly borrowing distillation strategies designed for classifiers may degrade detection performance, and the more complex structure of detectors makes many earlier distillation methods inapplicable. As the "backbone with task heads" structure is widely used in modern computer vision models, researchers develop novel distillation methods mainly around this framework. The detector distillation strategies introduced here address the issues above and mainly focus on how specific output logits are obtained and how loss functions are designed for the different parts of a detector. To highlight foreground regions before distillation, feature maps from the backbone are often selected through regions of interest (RoIs) using masking operations, while various output logits are taken from the teacher's task heads to guide the training of the student's task heads under specific matching and imitation schemes.
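The RoI-masking idea mentioned above can be illustrated with a short, hedged sketch: only feature-map locations covered by ground-truth (or RoI) boxes contribute to the imitation loss. The box format (x1, y1, x2, y2 in image pixels), the stride parameter, and the function name are assumptions made for illustration, not the formulation of any particular surveyed detector-distillation method.

```python
# Sketch of foreground-masked feature imitation for detector distillation.
import torch

def masked_imitation_loss(student_feat, teacher_feat, boxes, stride):
    """student_feat, teacher_feat: (N, C, H, W); boxes: list of (M_i, 4) tensors."""
    n, _, h, w = teacher_feat.shape
    mask = torch.zeros((n, 1, h, w), device=teacher_feat.device)
    for i, img_boxes in enumerate(boxes):
        for x1, y1, x2, y2 in img_boxes:
            # Project each box from image coordinates onto the feature-map grid.
            gx1, gy1 = int(x1 // stride), int(y1 // stride)
            gx2, gy2 = int(x2 // stride) + 1, int(y2 // stride) + 1
            mask[i, :, max(gy1, 0):min(gy2, h), max(gx1, 0):min(gx2, w)] = 1.0
    # Average the squared error over foreground locations only.
    diff = (student_feat - teacher_feat.detach()) ** 2 * mask
    return diff.sum() / mask.sum().clamp(min=1.0)
```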
Semantic segmentation requires more global information than object detection or instance segmentation, since it performs pixel-wise classification over the whole image, and one of the critical factors for classifying pixels correctly is the analysis of inter-pixel relationships. Hence, distillation methods for semantic segmentation exploit pixel-level information in both the output masks and the feature maps of hidden layers. The strategies introduced in the paper mainly apply hierarchical distillation to different parts of the model, e.g., imitation of the full output classification mask, imitation of full feature maps, computation of similarity matrices, and auxiliary imitation with conditional GANs (cGANs). The former two are fundamental practices in model distillation; in contrast, to make the segmentation model's pixel-wise knowledge more compact after compression, some methods compute similarities with the student from compressed features instead of the original ones. When cGANs are used to push the student segmentation model toward the teacher's features, researchers introduce the Wasserstein distance as a better metric for adversarial training. In the final part of the paper, previous work on model distillation for high-level semantic understanding is summarized; we review the obstacles and unsolved problems in the current development of model distillation and discuss likely directions for future research.
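The similarity-matrix idea mentioned above can be sketched as follows, assuming PyTorch: teacher and student are compared through inter-pixel (pairwise) affinity matrices rather than raw features. The cosine-similarity choice and the tensor shapes are illustrative assumptions, not the exact formulation of any single surveyed method.

```python
# Sketch of pairwise-similarity (structured) distillation for semantic segmentation.
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(student_feat, teacher_feat):
    """student_feat, teacher_feat: (N, C, H, W) with matching spatial size."""
    def similarity_matrix(feat):
        n, c, h, w = feat.shape
        flat = feat.view(n, c, h * w)                 # (N, C, HW)
        flat = F.normalize(flat, p=2, dim=1)          # unit-norm per spatial location
        return torch.bmm(flat.transpose(1, 2), flat)  # (N, HW, HW) cosine similarities
    s_sim = similarity_matrix(student_feat)
    t_sim = similarity_matrix(teacher_feat.detach())
    # Mean squared difference between the two pixel-affinity matrices.
    return F.mse_loss(s_sim, t_sim)
```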
Keywords
