Two-path semantic segmentation algorithm combining attention mechanism
2020, Vol. 25, No. 8, pp. 1627-1636
Received: 2019-10-17
Revised: 2020-01-20
Accepted: 2020-01-27
Published in print: 2020-08-16
DOI: 10.11834/jig.190533
Objective
Existing semantic segmentation algorithms suffer from several problems: pooling operations reduce resolution and thus degrade the segmentation results; the differences among the channels and positions of a feature map are ignored; and feature maps are fused with overly simple methods that do not account for the differences among features of different receptive fields. To address these problems, this paper designs a semantic segmentation algorithm based on dilated convolution and an attention mechanism.
Method
The algorithm comprises two paths. The spatial information path uses dilated convolution with a small downsampling factor to preserve image resolution and capture fine detail. The semantic information path uses ResNet (residual network) to extract features with a large receptive field, and an attention mechanism module assigns weights to the different parts of the feature maps to reduce the loss of accuracy. A feature fusion module assigns weights to the feature maps of different receptive fields produced by the two paths and fuses them to obtain the final segmentation result.
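As an illustration of the attention idea in the semantic path, below is a minimal numpy sketch of channel attention. The shapes, the additive combination of the two pooled descriptors, and the sigmoid gate are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def channel_attention(fmap: np.ndarray) -> np.ndarray:
    """Weight each channel of a (C, H, W) feature map.

    Global average- and max-pooling summarize each channel, the two
    summaries are added, and a sigmoid squashes them into (0, 1)
    per-channel weights that reweight the input.
    """
    avg = fmap.mean(axis=(1, 2))                 # (C,) average-pooled descriptor
    mx = fmap.max(axis=(1, 2))                   # (C,) max-pooled descriptor
    weights = 1.0 / (1.0 + np.exp(-(avg + mx)))  # sigmoid gate per channel
    return fmap * weights[:, None, None]         # broadcast over H and W

fmap = np.random.rand(8, 16, 16)
out = channel_attention(fmap)
assert out.shape == fmap.shape
```

A trained version would replace the fixed sigmoid gate with learned layers, but the pooling-then-gating structure is the same.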
Result
To verify the effectiveness of the method, experiments were conducted on the Camvid and Cityscapes datasets, with mean intersection over union (MIoU) and precision as the metrics. On Camvid, the model achieves 69.47% MIoU and 92.32% precision, 1.3% and 3.09% higher than the second-best model. On Cityscapes, it achieves 78.48% MIoU and 93.83% precision, 1.16% and 3.60% higher than the second-best model.
Conclusion
The dilated convolution and attention mechanism modules preserve the receptive field while raising resolution, compensate for the accuracy loss caused by downsampling, and better guide model learning. The proposed feature fusion module fuses features of different receptive fields more effectively.
Objective
Semantic segmentation is a fundamental problem in computer vision in which each pixel of an image must be assigned a category label. Traditional semantic segmentation extracts hand-crafted image features or edge information and feeds them to machine learning algorithms to obtain the final segmentation map. With the rise of deep convolutional networks, researchers have applied convolutional neural networks to semantic segmentation to improve its accuracy. However, existing algorithms still face several problems. First, there is a contradiction between receptive field and resolution: obtaining a large receptive field usually reduces the resolution, which degrades the segmentation results. Second, each channel of a feature map represents a feature, and objects at different positions should attend to different features, but existing algorithms often ignore this difference. Third, existing algorithms typically fuse feature maps of different receptive fields by simple concatenation or addition, so for objects of different sizes, the features of the different receptive fields are not given appropriately different importance.
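The first problem motivates dilated convolution: spacing the taps of a k-tap kernel by a dilation rate d enlarges the effective kernel to d(k-1)+1 without any pooling, so the receptive field grows while the resolution is preserved. A small 1D sketch (illustrative, not the paper's code):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1D dilated convolution: the kernel taps are spaced
    `dilation` samples apart, enlarging the receptive field while the
    output keeps the input's length (no pooling, no resolution loss)."""
    k = len(kernel)
    span = dilation * (k - 1)       # distance covered by the outermost taps
    pad = span // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        taps = xp[i : i + span + 1 : dilation]   # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

x = np.arange(8, dtype=float)
avg = np.array([1/3, 1/3, 1/3])
# Dilation 1 behaves like an ordinary 3-tap convolution; dilation 2
# averages samples two apart (effective kernel size 5). Both outputs
# have the same length as the input.
assert len(dilated_conv1d(x, avg, 1)) == len(x)
assert len(dilated_conv1d(x, avg, 2)) == len(x)
```

The 2D case used in segmentation networks works the same way along both spatial axes.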
Method
To address the aforementioned problems, this paper designs a semantic segmentation algorithm based on dilated convolution and an attention mechanism. Two paths are used to collect features. The first, the spatial information path, uses dilated convolution to collect spatial information. It first applies a convolution with a stride of 2 for fast downsampling and, based on experiments, keeps the feature map at a downsampling factor of 4. Dilated convolutions with dilation rates of 1, 2, and 4 are then applied, and their outputs are concatenated as the final feature map, maintaining resolution to obtain high-resolution feature maps.

Meanwhile, the second path uses ResNet to collect features and obtain a large receptive field. ResNet's residual structure integrates shallow features well with deep ones while avoiding convergence difficulties during training. Feature maps at 16 and 32 times downsampling are extracted from ResNet to enlarge the receptive field. An attention mechanism module is then applied to assign weights to each part of the feature maps to reflect their specificity. The attention mechanism comprises spatial, channel, and pyramid attention. The spatial attention mechanism obtains a weight for each position of the feature map by aggregating the information of the different channels: global max pooling and global average pooling are first applied along the channel dimension, synthesizing the feature information of all channels at each position into initial attention weights, which are then processed by convolutional layers and batch normalization to further learn the feature information and produce the final spatial attention weights. The channel attention mechanism obtains a weight for each channel by summarizing that channel's information: global max and average pooling are first applied over the spatial dimensions, extracting the feature information of each channel while reducing the number of parameters; the output is treated as the initial attention weight, and a 3×3 convolution layer further learns the attention weights, enhancing the salient features and suppressing the irrelevant ones. The pyramid attention mechanism uses convolutions of different receptive fields to obtain a weight for each position in each channel: the initial feature map is convolved with 1×1, 3×3, and 5×5 kernels to fully extract features at the different levels of the feature map, the output is treated as the initial attention weight, and the feature maps are multiplied to obtain the final state, so that every channel at every position is weighted, salient features receive more attention, and irrelevant features are suppressed. The three weights obtained are then merged into the final attention weight.

A feature fusion module is designed to fuse the features of the two paths' different receptive fields and to ascribe varying degrees of importance to them. In the feature fusion module, the feature maps obtained from the two paths are upsampled to the same size by bilinear interpolation. Because features of different receptive fields have varying importance, the obtained feature maps are first concatenated, global pooling is applied to obtain the feature weight of each channel, and weights are adaptively assigned to each feature map. The result is then upsampled to the original image size to obtain the final segmentation result.
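A minimal numpy sketch of such a fusion step follows. Nearest-neighbour upsampling stands in for the bilinear interpolation described above, and a softmax over globally pooled channel descriptors is an illustrative stand-in for the module's learned adaptive weighting:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def fuse(detail: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Fuse a high-resolution 'detail' map (C, H, W) with a
    low-resolution 'context' map (C, h, w): upsample the context map
    to (H, W), concatenate along the channel axis, derive per-channel
    weights by global average pooling + softmax, and reweight."""
    C, H, W = detail.shape
    fh, fw = H // context.shape[1], W // context.shape[2]
    up = context.repeat(fh, axis=1).repeat(fw, axis=2)  # nearest-neighbour
    both = np.concatenate([detail, up], axis=0)         # (2C, H, W)
    weights = softmax(both.mean(axis=(1, 2)))           # one weight per channel
    return both * weights[:, None, None]

detail = np.random.rand(4, 16, 16)
context = np.random.rand(4, 4, 4)
fused = fuse(detail, context)
assert fused.shape == (8, 16, 16)
```

In the actual module the weighting would be learned end to end, but the concatenate / global-pool / reweight flow is the structure described above.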
Result
The results were validated on the Camvid and Cityscapes datasets with mean intersection over union (MIoU) and precision as metrics. The proposed model achieved 69.47% MIoU and 92.32% precision on Camvid, 1.3% and 3.09% higher than those of the model with the second-highest performance. On Cityscapes, it achieved 78.48% MIoU and 93.83% precision, 1.16% and 3.60% higher than those of the second-best model. The proposed algorithm also outperforms the other algorithms in terms of visual quality.
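MIoU, the primary metric above, averages the per-class intersection-over-union computed from a confusion matrix; a self-contained sketch of its computation (not the paper's evaluation code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, n_classes: int) -> float:
    """Mean intersection-over-union from flat label arrays."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        conf[g, p] += 1                     # rows: ground truth, cols: prediction
    inter = np.diag(conf).astype(float)     # per-class true positives
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    ious = inter[union > 0] / union[union > 0]   # skip classes absent from both
    return float(ious.mean())

gt   = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
# class 0: intersection 1, union 2 -> 0.5; class 1: intersection 2, union 3 -> 2/3
assert np.isclose(mean_iou(pred, gt, 2), (0.5 + 2/3) / 2)
```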
Conclusion
This paper designs a new semantic segmentation algorithm that uses dilated convolution and ResNet to collect image features, obtaining features of the image under both small and large receptive fields. Dilated convolution improves resolution while preserving the receptive field. An attention mechanism module assigns weights to each feature map to facilitate model learning, and a feature fusion module fuses the features of different receptive fields while reflecting their specificity. Experiments show that the proposed algorithm outperforms the other algorithms in accuracy and has practical application value.