Combining attention mechanism and knowledge distillation for Siamese network compression

Geng Zengmin1, Yu Mengqiao2, Liu Xiabi2, Lyu Chao1 (1. Division of Basic Courses, Beijing Institute of Fashion Technology, Beijing 100029, China; 2. School of Computer, Beijing Institute of Technology, Beijing 100081, China)

Abstract
Objective Image co-segmentation refers to segmenting the common objects (foregrounds) from a group of images that contain the same or similar objects. Deep neural networks are widely used for this task owing to their excellent segmentation results. The end-to-end Siamese network is one of the most effective networks for image co-segmentation. However, this network incurs a huge computational cost, which greatly limits its applications. Therefore, network compression is required. Although various network compression methods have been presented in the literature, they are mainly designed for single-branch networks and do not consider the characteristics of a Siamese network. To this end, we propose a novel network compression method specifically for Siamese networks.

Method The proposed method transfers the important knowledge of a large network to a compressed small network in three steps. First, we acquire the important knowledge of the large network. To do so, we develop a binary attention mechanism that is applied to each stage of the encoder module of the Siamese network. This mechanism retains the features of the objects common to the two images and suppresses the features of non-common objects. As a result, the response of each stage of the Siamese network is represented as a matrix with sparse channels. We map this sparse response matrix to a dense matrix with a smaller channel dimension through a convolution layer with 1×1 kernels. This dense matrix represents the important knowledge of the large network. Second, we build the small network structure. As described in the first step, the number of channels used to represent the knowledge in each stage of the large network can be reduced; accordingly, the number of channels in each convolution and normalization layer within each stage can also be reduced. We therefore reconstruct each stage of the large network according to the channel dimensions of the dense matrices obtained in the first step, which determines the final small network structure. Third, we transfer the knowledge from the large network to the compressed small network through a two-step knowledge distillation method. In the first step, the output of each stage/deconvolutional layer of the large network is used as the supervision information: the Euclidean distance between the middle-layer outputs of the large and small networks serves as the loss function that guides the training of the small network, ensuring that the middle-layer outputs of the two networks are as similar as possible at the end of this training stage. In the second step, we compute the Dice loss between the network output and the ground-truth label to guide the final refining of the small network and further improve the segmentation accuracy.
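The abstract describes the binary attention mechanism only at a high level, so the following is a minimal single-branch PyTorch sketch of the channel-reduction idea, not the authors' implementation: in the actual method the attention compares the two Siamese branches to decide which channels encode the common objects, whereas here a simple learned gate stands in for that decision. The class name, the mean-pooled scoring head, and the threshold tau are all assumptions.

```python
import torch
import torch.nn as nn

class BinaryChannelAttention(nn.Module):
    """Hypothetical sketch: binary channel gating followed by 1x1 reduction."""

    def __init__(self, channels: int, reduced: int, tau: float = 0.5):
        super().__init__()
        # Per-channel importance scores in (0, 1); the scoring function is an
        # assumption -- the abstract only says that the attention keeps
        # common-object features and suppresses non-common ones.
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.tau = tau  # binarization threshold (assumed)
        # 1x1 convolution mapping the sparse response to a dense matrix
        # with a smaller channel dimension, as described in Method.
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.score(x)              # (N, C, 1, 1) soft channel scores
        mask = (s > self.tau).float()  # hard 0/1 mask; non-differentiable, so
                                       # training would need e.g. a
                                       # straight-through estimator
        sparse = x * mask              # sparse-channel response
        return self.reduce(sparse)     # dense matrix with fewer channels

# Hypothetical channel counts: compress a 512-channel stage response to 160.
stage_out = torch.randn(1, 512, 28, 28)
dense = BinaryChannelAttention(512, 160)(stage_out)
print(dense.shape)  # torch.Size([1, 160, 28, 28])
```

In this reading, the channel dimension of the dense output (here 160) is exactly what determines the layer widths of the rebuilt small network in the second step.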
Result We perform two groups of experiments on three datasets, namely, MLMR-COS, Internet, and iCoseg. MLMR-COS contains a large number of images with pixel-wise ground truth, so an ablation study is performed on this dataset to verify the rationality of the proposed method. Meanwhile, although Internet and iCoseg are commonly used datasets for co-segmentation, they are too small to serve as training sets for deep learning methods. Therefore, we train our network on a training set generated from Pascal VOC 2012 and MSRC before testing it on Internet and iCoseg to verify its effectiveness. Experimental results show that the proposed method compresses the original Siamese network to 1/3.3 of its original size, thereby significantly reducing the required amount of computation, while the segmentation accuracy of the compressed network on the three datasets remains close to the state of the art. On the MLMR-COS dataset, the compressed small network obtains an average Jaccard index that is 0.07% higher than that of the original large network. On the Internet and iCoseg datasets, we compare the compressed network with 12 traditional supervised/unsupervised image co-segmentation methods and 3 deep learning-based co-segmentation methods. On the Internet dataset, the compressed network achieves an average Jaccard index that is 5% higher than the best result of the traditional image segmentation methods and matches the best result of the existing deep learning-based co-segmentation methods. On the iCoseg dataset, whose images are relatively complex, the segmentation accuracy of the compressed small network is only slightly lower than the best results of the other methods.

Conclusion We propose a network compression method that combines a binary attention mechanism with knowledge distillation and apply it to a Siamese network for image co-segmentation. The compressed network significantly reduces the amount of computation and the number of parameters of the Siamese network while achieving co-segmentation performance close to that of state-of-the-art methods.
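To make the two-stage training procedure concrete, here is a minimal sketch of the two loss functions named in Method, under the assumption that the teacher's supervision signals are its dense post-1×1 stage outputs, so that teacher and student feature shapes match. Function names and tensor shapes are illustrative, not taken from the paper.

```python
import torch

def middle_layer_loss(teacher_feats, student_feats):
    # Stage 1: Euclidean distance between corresponding middle-layer outputs;
    # shapes match because the student's stages were rebuilt with the channel
    # dimensions of the teacher's dense (post-1x1) outputs.
    return sum(torch.norm(t - s, p=2) for t, s in zip(teacher_feats, student_feats))

def dice_loss(pred, target, eps: float = 1e-6):
    # Stage 2: soft Dice loss between the student's foreground probabilities
    # and the ground-truth mask.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Hypothetical shapes: three stage outputs plus a final segmentation mask.
t_feats = [torch.randn(1, c, 28, 28) for c in (64, 96, 160)]
s_feats = [f + 0.1 * torch.randn_like(f) for f in t_feats]
print(middle_layer_loss(t_feats, s_feats))

pred = torch.rand(1, 1, 224, 224)                    # student probabilities
target = (torch.rand(1, 1, 224, 224) > 0.5).float()  # ground-truth mask
print(dice_loss(pred, target))
```

Per the Method description, the first loss would be minimized first so that the student's middle-layer outputs approach the teacher's, after which the Dice loss refines the student against the real labels.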