Two-path semantic segmentation combining an attention mechanism

Zhai Pengbo1,2, Yang Hao1, Song Tingting1, Yu Kang1, Ma Longxiang1,2, Huang Xiangsheng3 (1. Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)

Abstract
Objective Existing semantic segmentation algorithms suffer from several problems: pooling operations reduce resolution and thereby degrade the segmentation result; the differences among the channel and position features of the feature map are ignored; and feature maps are fused with overly simple methods that do not account for the differences among features with different receptive fields. To address these problems, this paper designs a semantic segmentation algorithm based on dilated convolution and an attention mechanism. Method The algorithm consists of two paths. The spatial-information path uses dilated convolution with a small downsampling factor to preserve image resolution and capture image detail. The semantic-information path uses ResNet (residual network) to extract features with a large receptive field, and an attention-mechanism module assigns weights to different parts of the feature map to reduce the loss of accuracy. A feature fusion module is designed to assign weights to the feature maps of the two paths, which have different receptive fields, and to fuse them into the final segmentation result. Result The method was validated on the CamVid and Cityscapes datasets using mean intersection over union (MIoU) and precision as metrics. On CamVid, MIoU and precision reached 69.47% and 92.32%, respectively 1.3% and 3.09% higher than the second-best model. On Cityscapes, MIoU and precision reached 78.48% and 93.83%, respectively 1.16% and 3.60% higher than the second-best model. Conclusion By combining dilated convolution with an attention-mechanism module, the proposed method preserves the receptive field while improving resolution, compensates for the accuracy loss caused by downsampling, and better guides model learning; the proposed feature fusion module fuses features with different receptive fields more effectively.
Keywords
Two-path semantic segmentation algorithm combining attention mechanism

Zhai Pengbo1,2, Yang Hao1, Song Tingting1, Yu Kang1, Ma Longxiang1,2, Huang Xiangsheng3(1.Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China;2.University of Chinese Academy of Sciences, Beijing 100049, China;3.Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)

Abstract
Objective Semantic segmentation is a fundamental problem in computer vision in which every pixel of an image must be assigned a category label. Traditional approaches rely on hand-crafted features or edge information and feed the extracted features to machine learning algorithms to produce the final segmentation map. With the rise of deep convolutional networks, convolutional neural networks have been applied to semantic segmentation and have substantially improved its accuracy. However, existing semantic segmentation algorithms still face several problems. First, there is a tension between receptive field and resolution: obtaining a large receptive field usually reduces resolution, which degrades the segmentation result. Second, each channel of a feature map represents a distinct feature, and objects at different positions should attend to different features, but existing algorithms often ignore this difference. Third, existing algorithms typically fuse feature maps of different receptive fields by simple concatenation or addition, so features of different receptive fields are not weighted according to their importance for objects of different sizes. Method To address these problems, this paper designs a semantic segmentation algorithm based on dilated convolution and an attention mechanism. Two paths are used to collect features. The spatial-information path uses dilated convolution. It first applies convolution kernels with a stride of 2 for fast downsampling and, based on experiments, keeps a feature map with a downsampling factor of 4. Dilated convolutions with dilation rates of 1, 2, and 4 are then applied, and their outputs are concatenated as the final feature map, preserving resolution and yielding high-resolution feature maps.
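The key property of the spatial-information path, enlarging the receptive field without shrinking the feature map further, can be illustrated with a minimal sketch. This is a hypothetical 1D simplification (the paper operates on 2D feature maps with learned kernels); the averaging kernel and signal below are illustrative only.

```python
# Sketch: dilated ("atrous") convolution enlarges the receptive field
# without pooling or striding, so spatial resolution is preserved.
# Hypothetical 1D simplification of the paper's 2D dilated convolutions
# with dilation rates 1, 2, and 4.

def dilated_conv1d(x, kernel, rate):
    """Valid-mode 1D convolution with the given dilation rate.

    A kernel of size k with dilation rate r covers an effective
    receptive field of (k - 1) * r + 1 input samples.
    """
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective receptive field
    return [
        sum(kernel[j] * x[i + j * rate] for j in range(k))
        for i in range(len(x) - span + 1)
    ]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
avg3 = [1 / 3] * 3  # 3-tap averaging kernel (stand-in for a learned kernel)

out_r1 = dilated_conv1d(signal, avg3, rate=1)  # receptive field 3
out_r2 = dilated_conv1d(signal, avg3, rate=2)  # receptive field 5
out_r4 = dilated_conv1d(signal, avg3, rate=4)  # receptive field 9
```

In the proposed path, the outputs of the three dilation rates would be concatenated channel-wise to form the final high-resolution feature map.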
Meanwhile, the semantic-information path uses ResNet to collect features and obtain a large receptive field. ResNet's residual structure integrates shallow and deep features well while avoiding convergence difficulties during training. Feature maps downsampled by factors of 16 and 32 are extracted from ResNet to enlarge the receptive field. An attention-mechanism module is then applied to assign weights to each part of these feature maps to reflect their specificity. The module combines spatial, channel, and pyramid attention mechanisms. The spatial attention mechanism derives a weight for each position on the feature map from the information of the different channels: global max pooling and global average pooling are first applied along the channel dimension, synthesizing the information of all channels at each position into initial attention weights; convolutional layers and batch normalization then refine these into the final spatial attention weights. The channel attention mechanism derives a weight for each channel from the information of that channel: global max pooling and global average pooling are applied over the spatial dimensions, extracting the information of each channel while reducing the number of parameters; the pooled output serves as the initial attention weight, and a 3×3 convolution layer further learns the weights, enhancing salient features and suppressing irrelevant ones. The pyramid attention mechanism uses convolutions with different receptive fields to obtain a weight for each position in each channel.
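The channel-attention idea described above can be sketched as follows. This is a simplified illustration, not the paper's exact module: here the pooled statistics are gated directly by a sigmoid, whereas the paper additionally learns a convolution over them; the feature map, the gating function, and the function name are assumptions for the sketch.

```python
import math

# Sketch of channel attention: global average pooling and global max
# pooling summarize each channel, a sigmoid turns the summary into a
# per-channel weight, and the weight rescales that channel. (The paper's
# module also learns a convolution over the pooled statistics.)

def channel_attention(fmap):
    """fmap: list of channels, each a 2D list (H x W).

    Returns (reweighted feature map, per-channel weights).
    """
    weights = []
    for ch in fmap:
        flat = [v for row in ch for v in row]
        avg = sum(flat) / len(flat)   # global average pooling
        mx = max(flat)                # global max pooling
        weights.append(1 / (1 + math.exp(-(avg + mx))))  # sigmoid gate
    reweighted = [
        [[w * v for v in row] for row in ch]
        for ch, w in zip(fmap, weights)
    ]
    return reweighted, weights

# A channel with strong activations receives a weight close to 1,
# so its features are preserved; a flat channel is attenuated.
fmap = [[[0.0, 0.0], [0.0, 0.0]],   # inactive channel
        [[1.0, 1.0], [1.0, 1.0]]]   # active channel
out, w = channel_attention(fmap)
```

Spatial attention follows the same pattern with the pooling axis swapped: pooling across channels yields one weight per spatial position instead of one per channel.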
The initial feature map is convolved with kernels of sizes 1×1, 3×3, and 5×5 to fully extract features at different levels, and the outputs serve as initial attention weights. These weights are then multiplied with the feature maps to obtain the final state. Assigning a weight to the channel at each position on the feature map lets salient features receive more attention while irrelevant features are suppressed. The three sets of weights obtained are then merged into the final attention weight. A feature fusion module is designed to fuse the features of the two paths, whose receptive fields differ, so that features of different receptive fields can be given different degrees of importance. In this module, the feature maps obtained from the two paths are upsampled to the same size using bilinear interpolation. Because the features of different receptive fields have varying importance, the feature maps are first concatenated, global pooling is applied to obtain per-channel feature weights, and weights are adaptively assigned to each feature map. The result is then upsampled to the original image size to obtain the final segmentation. Result The results were validated on the CamVid and Cityscapes datasets with mean intersection over union (MIoU) and precision as metrics. The proposed model achieved 69.47% MIoU and 92.32% precision on CamVid, which were 1.3% and 3.09% higher than those of the second-best model, and 78.48% MIoU and 93.83% precision on Cityscapes, which were 1.16% and 3.60% higher than those of the second-best model. The proposed algorithm also outperforms the other algorithms in visual quality.
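The adaptive fusion step in the Method description can be sketched in miniature. This is a hypothetical simplification: the paper upsamples with bilinear interpolation and learns the weighting from concatenated, globally pooled features, whereas here two already-aligned single-channel maps are weighted by a softmax over their global averages.

```python
import math

# Sketch of adaptive feature fusion: each path's (already upsampled and
# aligned) feature map gets a weight derived from global pooling, and the
# weighted maps are summed. The softmax-over-global-averages weighting is
# an illustrative stand-in for the paper's learned weighting.

def fuse(map_a, map_b):
    """map_a, map_b: 2D lists of equal shape. Returns the fused map."""
    def gap(m):  # global average pooling over a 2D map
        return sum(sum(row) for row in m) / (len(m) * len(m[0]))

    ga, gb = gap(map_a), gap(map_b)
    ea, eb = math.exp(ga), math.exp(gb)
    wa, wb = ea / (ea + eb), eb / (ea + eb)  # softmax fusion weights
    return [
        [wa * a + wb * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]
```

Unlike plain addition or concatenation, the map with the stronger global response dominates the fused output, which is the intent of weighting features of different receptive fields by importance.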
Conclusion This paper designs a new semantic segmentation algorithm that uses dilated convolution and ResNet to collect image features, obtaining complementary features at small and large receptive fields. Dilated convolution improves resolution while preserving the receptive field. An attention-mechanism module assigns weights to each feature map to facilitate model learning, and a feature fusion module fuses the features of different receptive fields while reflecting their specificity. Experiments show that the proposed algorithm outperforms the other algorithms in accuracy and has practical application value.
Keywords
