Two-path semantic segmentation algorithm combining attention mechanism
2020, Vol. 25, No. 8, pp. 1627-1636
Received: 2019-10-17
Revised: 2020-01-20
Accepted: 2020-01-27
Published in print: 2020-08-16
DOI: 10.11834/jig.190533
Objective
Existing semantic segmentation algorithms suffer from several problems: pooling operations reduce resolution and thus degrade the segmentation results; the differences among the channels and positions of a feature map are ignored; and feature maps are fused with overly simple methods that do not account for the differences among features of different receptive fields. To address these problems, this paper designs a semantic segmentation algorithm based on dilated convolution and an attention mechanism.
Method
The algorithm comprises two paths. The spatial information path uses dilated convolution with a small downsampling factor to preserve image resolution and capture fine detail. The semantic information path uses ResNet (residual network) to extract features with a large receptive field, and an attention mechanism module assigns weights to the different parts of the feature maps to reduce the loss of accuracy. A feature fusion module assigns weights to the feature maps of different receptive fields produced by the two paths and fuses them to obtain the final segmentation result.
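As an illustration of the attention idea in the semantic path, below is a minimal numpy sketch of channel attention. The shapes, the additive combination of the two pooled descriptors, and the sigmoid gate are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def channel_attention(fmap: np.ndarray) -> np.ndarray:
    """Weight each channel of a (C, H, W) feature map.

    Global average- and max-pooling summarize each channel, the two
    summaries are added, and a sigmoid squashes them into (0, 1)
    per-channel weights that reweight the input.
    """
    avg = fmap.mean(axis=(1, 2))                 # (C,) average-pooled descriptor
    mx = fmap.max(axis=(1, 2))                   # (C,) max-pooled descriptor
    weights = 1.0 / (1.0 + np.exp(-(avg + mx)))  # sigmoid gate per channel
    return fmap * weights[:, None, None]         # broadcast over H and W

fmap = np.random.rand(8, 16, 16)
out = channel_attention(fmap)
assert out.shape == fmap.shape
```

A trained version would replace the fixed sigmoid gate with learned layers, but the pooling-then-gating structure is the same.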
Result
To verify the effectiveness of the method, experiments were conducted on the Camvid and Cityscapes datasets, with mean intersection over union (MIoU) and precision as the metrics. On Camvid, the model achieves 69.47% MIoU and 92.32% precision, 1.3% and 3.09% higher than the second-best model. On Cityscapes, it achieves 78.48% MIoU and 93.83% precision, 1.16% and 3.60% higher than the second-best model.
Conclusion
The dilated convolution and attention mechanism modules preserve the receptive field while raising resolution, compensate for the accuracy loss caused by downsampling, and better guide model learning. The proposed feature fusion module fuses features of different receptive fields more effectively.
Objective
Semantic segmentation is a fundamental problem in computer vision in which each pixel of an image must be assigned a category label. Traditional semantic segmentation extracts hand-crafted image features or edge information and feeds them to machine learning algorithms to obtain the final segmentation map. With the rise of deep convolutional networks, researchers have applied convolutional neural networks to semantic segmentation to improve its accuracy. However, existing algorithms still face several problems. First, there is a contradiction between receptive field and resolution: obtaining a large receptive field usually reduces the resolution, which degrades the segmentation results. Second, each channel of a feature map represents a feature, and objects at different positions should attend to different features, but existing algorithms often ignore this difference. Third, existing algorithms typically fuse feature maps of different receptive fields by simple concatenation or addition, so for objects of different sizes, the features of the different receptive fields are not given appropriately different importance.
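The first problem motivates dilated convolution: spacing the taps of a k-tap kernel by a dilation rate d enlarges the effective kernel to d(k-1)+1 without any pooling, so the receptive field grows while the resolution is preserved. A small 1D sketch (illustrative, not the paper's code):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1D dilated convolution: the kernel taps are spaced
    `dilation` samples apart, enlarging the receptive field while the
    output keeps the input's length (no pooling, no resolution loss)."""
    k = len(kernel)
    span = dilation * (k - 1)       # distance covered by the outermost taps
    pad = span // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        taps = xp[i : i + span + 1 : dilation]   # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

x = np.arange(8, dtype=float)
avg = np.array([1/3, 1/3, 1/3])
# Dilation 1 behaves like an ordinary 3-tap convolution; dilation 2
# averages samples two apart (effective kernel size 5). Both outputs
# have the same length as the input.
assert len(dilated_conv1d(x, avg, 1)) == len(x)
assert len(dilated_conv1d(x, avg, 2)) == len(x)
```

The 2D case used in segmentation networks works the same way along both spatial axes.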
Method
To address the aforementioned problems, this paper designs a semantic segmentation algorithm based on dilated convolution and an attention mechanism. Two paths are used to collect features. The first, the spatial information path, uses dilated convolution to collect spatial information. It first applies a convolution with a stride of 2 for fast downsampling and, based on experiments, keeps the feature map at a downsampling factor of 4. Dilated convolutions with dilation rates of 1, 2, and 4 are then applied, and their outputs are concatenated as the final feature map, maintaining resolution to obtain high-resolution feature maps.

Meanwhile, the second path uses ResNet to collect features and obtain a large receptive field. ResNet's residual structure integrates shallow features well with deep ones while avoiding convergence difficulties during training. Feature maps at 16 and 32 times downsampling are extracted from ResNet to enlarge the receptive field. An attention mechanism module is then applied to assign weights to each part of the feature maps to reflect their specificity. The attention mechanism comprises spatial, channel, and pyramid attention. The spatial attention mechanism obtains a weight for each position of the feature map by aggregating the information of the different channels: global max pooling and global average pooling are first applied along the channel dimension, synthesizing the feature information of all channels at each position into initial attention weights, which are then processed by convolutional layers and batch normalization to further learn the feature information and produce the final spatial attention weights. The channel attention mechanism obtains a weight for each channel by summarizing that channel's information: global max and average pooling are first applied over the spatial dimensions, extracting the feature information of each channel while reducing the number of parameters; the output is treated as the initial attention weight, and a 3×3 convolution layer further learns the attention weights, enhancing the salient features and suppressing the irrelevant ones. The pyramid attention mechanism uses convolutions of different receptive fields to obtain a weight for each position in each channel: the initial feature map is convolved with 1×1, 3×3, and 5×5 kernels to fully extract features at the different levels of the feature map, the output is treated as the initial attention weight, and the feature maps are multiplied to obtain the final state, so that every channel at every position is weighted, salient features receive more attention, and irrelevant features are suppressed. The three weights obtained are then merged into the final attention weight.

A feature fusion module is designed to fuse the features of the two paths' different receptive fields and to ascribe varying degrees of importance to them. In the feature fusion module, the feature maps obtained from the two paths are upsampled to the same size by bilinear interpolation. Because features of different receptive fields have varying importance, the obtained feature maps are first concatenated, global pooling is applied to obtain the feature weight of each channel, and weights are adaptively assigned to each feature map. The result is then upsampled to the original image size to obtain the final segmentation result.
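A minimal numpy sketch of such a fusion step follows. Nearest-neighbour upsampling stands in for the bilinear interpolation described above, and a softmax over globally pooled channel descriptors is an illustrative stand-in for the module's learned adaptive weighting:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def fuse(detail: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Fuse a high-resolution 'detail' map (C, H, W) with a
    low-resolution 'context' map (C, h, w): upsample the context map
    to (H, W), concatenate along the channel axis, derive per-channel
    weights by global average pooling + softmax, and reweight."""
    C, H, W = detail.shape
    fh, fw = H // context.shape[1], W // context.shape[2]
    up = context.repeat(fh, axis=1).repeat(fw, axis=2)  # nearest-neighbour
    both = np.concatenate([detail, up], axis=0)         # (2C, H, W)
    weights = softmax(both.mean(axis=(1, 2)))           # one weight per channel
    return both * weights[:, None, None]

detail = np.random.rand(4, 16, 16)
context = np.random.rand(4, 4, 4)
fused = fuse(detail, context)
assert fused.shape == (8, 16, 16)
```

In the actual module the weighting would be learned end to end, but the concatenate / global-pool / reweight flow is the structure described above.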
Result
The results were validated on the Camvid and Cityscapes datasets with mean intersection over union (MIoU) and precision as metrics. The proposed model achieved 69.47% MIoU and 92.32% precision on Camvid, 1.3% and 3.09% higher than those of the model with the second-highest performance. On Cityscapes, it achieved 78.48% MIoU and 93.83% precision, 1.16% and 3.60% higher than those of the second-best model. The proposed algorithm also outperforms the other algorithms in terms of visual quality.
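MIoU, the primary metric above, averages the per-class intersection-over-union computed from a confusion matrix; a self-contained sketch of its computation (not the paper's evaluation code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, n_classes: int) -> float:
    """Mean intersection-over-union from flat label arrays."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        conf[g, p] += 1                     # rows: ground truth, cols: prediction
    inter = np.diag(conf).astype(float)     # per-class true positives
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    ious = inter[union > 0] / union[union > 0]   # skip classes absent from both
    return float(ious.mean())

gt   = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
# class 0: intersection 1, union 2 -> 0.5; class 1: intersection 2, union 3 -> 2/3
assert np.isclose(mean_iou(pred, gt, 2), (0.5 + 2/3) / 2)
```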
Conclusion
This paper designs a new semantic segmentation algorithm that uses dilated convolution and ResNet to collect image features, obtaining features of the image under both small and large receptive fields. Dilated convolution improves resolution while preserving the receptive field. An attention mechanism module assigns weights to each feature map to facilitate model learning, and a feature fusion module fuses the features of different receptive fields while reflecting their specificity. Experiments show that the proposed algorithm outperforms the other algorithms in accuracy and has practical application value.