Semantic image segmentation by using multi-scale strip pooling and channel attention
2022, Vol. 27, No. 12, Pages 3530-3541
Print publication date: 2022-12-16
Accepted: 2021-12-24
DOI: 10.11834/jig.210359
Jiquan Ma, Shumin Zhao, Fanhui Kong. Semantic image segmentation by using multi-scale strip pooling and channel attention[J]. Journal of Image and Graphics, 2022,27(12):3530-3541.
Objective
Semantic segmentation of natural-scene images is easily affected by the diversity of object shapes, viewing distance, and illumination. To address this problem, this paper proposes a new dual-branch semantic segmentation network based on strip pooling and a channel attention mechanism (strip pooling and channel attention net, SPCANet).
Method
SPCANet extracts image features from both the spatial and the content perspective. First, the spatial-perception sub-net introduces one-dimensional dilated convolution and a multi-scale design to improve the strip pooling technique, further enlarging the horizontal and vertical receptive fields in the encoding stage. Second, to strengthen the model's content-perception ability, a VGG16 (Visual Geometry Group 16-layer network) pre-trained on ImageNet serves as the content-perception sub-net; it assists the spatial-perception sub-net in optimizing the embedded features for semantic segmentation and compensates for the image detail lost by the spatial-perception sub-net. In addition, second-order channel attention further refines feature selection in the middle and high layers of the network and, to some extent, alleviates the influence of illumination-induced chromatic aberration on the segmentation results.
Result
Using Cityscapes as the experimental data, the proposed method is compared with other deep-neural-network-based segmentation methods and analyzed in terms of both visual quality and evaluation metrics. SPCANet improves the segmentation metric mIoU (mean intersection over union) by 1.2%.
Conclusion
The proposed dual-branch semantic segmentation network optimizes image semantic segmentation with the improved strip pooling technique, a content-perception auxiliary network, and a channel attention mechanism, all of which contribute positively to the experimental results.
Objective
Semantic segmentation of real-scene images is easily affected by the diversity of object shapes, ranges, and illumination. Current semantic segmentation methods classify pedestrians, buildings, road signs, and other objects inaccurately because of their small scale or wide range. At the same time, existing methods discriminate poorly among objects with chromatic aberration: a single object with internal color variation is easily divided into different objects, and different objects with similar colors are easily segmented into the same class. To improve the performance of semantic image segmentation, we propose a new dual-branch semantic segmentation network based on strip pooling and a channel attention mechanism, called strip pooling and channel attention net (SPCANet).
Method
SPCANet extracts image features through spatial and content perception. First, we employ the spatial-perception sub-net to enlarge the receptive field in the horizontal and vertical directions at the down-sampling stage by combining dilated convolution with multi-scale strip pooling. Specifically, on the basis of the strip pooling module (built on pooling operations whose kernel size is n × 1 or 1 × n), we add four parallel one-dimensional dilated convolutions with different dilation rates to the horizontal and vertical branches, which enhances the perception of large-scale objects in the image, as sketched below.
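A minimal PyTorch sketch of such a multi-scale strip pooling module follows. The dilation rates (1, 2, 4, 8), the kernel length of 3, and the sigmoid-gated fusion are our illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleStripPooling(nn.Module):
    """Strip pooling with four parallel 1-D dilated convolutions per branch."""
    def __init__(self, channels, rates=(1, 2, 4, 8)):  # rates are assumed values
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # N x C x H x 1 (vertical strip)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # N x C x 1 x W (horizontal strip)
        # 1-D dilated convolutions with kernel n x 1 (here n = 3) and 1 x n
        self.convs_h = nn.ModuleList([
            nn.Conv2d(channels, channels, (3, 1), padding=(r, 0), dilation=(r, 1))
            for r in rates])
        self.convs_w = nn.ModuleList([
            nn.Conv2d(channels, channels, (1, 3), padding=(0, r), dilation=(1, r))
            for r in rates])
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        sh = sum(conv(self.pool_h(x)) for conv in self.convs_h)  # multi-scale along H
        sw = sum(conv(self.pool_w(x)) for conv in self.convs_w)  # multi-scale along W
        # Broadcast both strips over the full map and gate the input with them
        gate = torch.sigmoid(self.fuse(sh.expand(n, c, h, w) + sw.expand(n, c, h, w)))
        return x * gate
```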
Next, to improve the content-perception ability of the model, we use a VGG16 (Visual Geometry Group 16-layer network) pre-trained on the ImageNet dataset as the content-perception sub-net, which assists the spatial-perception sub-net in optimizing the embedded features for semantic segmentation. The content sub-net strengthens the feature representation in combination with the spatial-perception sub-net; a sketch of this branch follows.
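A minimal sketch of how such a content-perception branch could be built with torchvision (the weights API here requires torchvision ≥ 0.13); whether the backbone is frozen is an assumption, since the paper does not state it at this level of detail.

```python
import torch.nn as nn
from torchvision import models

def build_content_subnet(trainable: bool = False) -> nn.Module:
    """Return the convolutional stages of an ImageNet-pre-trained VGG16."""
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    features = vgg.features          # conv/pool stages only, no classifier head
    for p in features.parameters():  # optionally freeze the backbone (assumption)
        p.requires_grad = trainable
    return features
```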
In addition, second-order channel attention is used to further optimize the feature assignment between the middle and high-level layers of the network. During training, target information is emphasized and assigned larger weights, while irrelevant information is suppressed and assigned smaller weights; in this way, correlations in the embedded features are activated. To enhance the expression of channel information, we realize the second-order channel attention with a covariance matrix and a gating mechanism, as in the sketch below.
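A minimal sketch of covariance-based (second-order) channel attention with SE-style gating; the reduction ratio and the way the covariance matrix is pooled into a per-channel descriptor are our assumptions.

```python
import torch
import torch.nn as nn

class SecondOrderChannelAttention(nn.Module):
    """Channel gating driven by second-order (covariance) statistics."""
    def __init__(self, channels, reduction=16):  # reduction ratio is assumed
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        n, c, h, w = x.shape
        feat = x.reshape(n, c, h * w)
        feat = feat - feat.mean(dim=2, keepdim=True)
        cov = torch.bmm(feat, feat.transpose(1, 2)) / (h * w - 1)  # N x C x C covariance
        stats = cov.mean(dim=2)          # pool covariance into a channel descriptor
        weights = self.gate(stats)       # gating: per-channel weights in (0, 1)
        return x * weights.view(n, c, 1, 1)
```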
Our model runs sequentially: 1) a three-channel color image is taken as input; 2) the spatial-perception and content-perception sub-nets encode features in the embedded space; 3) the two sets of features are fused by concatenation; and 4) the fused features are sent to a prediction module (head) for classification and segmentation.
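These four steps can be summarized in a small sketch; the channel counts, the single-convolution head, and the bilinear upsampling are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPCANetSketch(nn.Module):
    """Dual-branch encoder, concatenation fusion, prediction head."""
    def __init__(self, spatial_net, content_net, fused_ch, num_classes):
        super().__init__()
        self.spatial_net = spatial_net   # spatial-perception sub-net
        self.content_net = content_net   # content-perception sub-net (e.g. VGG16)
        self.head = nn.Conv2d(fused_ch, num_classes, 1)  # prediction module (assumed)

    def forward(self, x):                # x: N x 3 x H x W color image
        fs = self.spatial_net(x)
        fc = self.content_net(x)
        fc = F.interpolate(fc, size=fs.shape[2:], mode='bilinear',
                           align_corners=False)
        fused = torch.cat([fs, fc], dim=1)               # fusion by concatenation
        logits = self.head(fused)
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear',
                             align_corners=False)
```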
Result
We use the popular Cityscapes benchmark as the test data and compare our results with those of other deep-neural-network-based methods, including networks published on the Cityscapes official website and networks reproduced locally from GitHub code. Performance is evaluated both qualitatively and quantitatively: the qualitative analysis is carried out visually, and the quantitative analysis uses widely adopted public metrics. 1) In terms of the visualization of the segmentation results, the proposed method perceives wide-range objects in the image well, and the overall segmentation quality improves noticeably. 2) The segmentation metrics reflect the same trend: commonly used metrics such as accuracy (Acc) and mean intersection over union (mIoU, computed as in the sketch below) improve significantly, with mIoU increased by 1.2% and Acc increased by 0.7%.
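For reference, a standard confusion-matrix computation of mIoU; this is the usual definition of the metric, not the authors' evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """mIoU from integer label maps `pred` and `target` of the same shape."""
    mask = (target >= 0) & (target < num_classes)   # ignore out-of-range labels
    hist = np.bincount(num_classes * target[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes,
                                                           num_classes)
    inter = np.diag(hist)                            # true positives per class
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)               # avoid division by zero
    return float(iou.mean())
```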
Ablation studies validate the effectiveness of each module. Among them, the improved strip pooling module contributes the most obvious improvement: under the same experimental settings on the Cityscapes training set with an input size of 512 × 512 × 3, it improves mIoU by 4%, and when the input size is changed to 768 under otherwise identical conditions, it improves mIoU by 5%. The second-order channel attention makes the model more sensitive to chromatic aberration in the image during training; in the visualization results on the Cityscapes training set, classes such as pedestrians improve obviously, although the stability of other classes still needs strengthening. For the content-perception sub-net, we consider three networks pre-trained on ImageNet as candidates, namely VGG16, ResNet101, and DenseNet101; the pre-trained VGG16 achieves the best performance as the content-perception sub-net, and its supplementary use enhances the representation ability of the feature maps.
Conclusion
We develop an image semantic segmentation algorithm built on an attention mechanism, multi-scale strip pooling, and feature fusion. Image semantic segmentation is optimized by an improved strip pooling technique (which enlarges the receptive field without adding parameters), second-order channel attention (which exploits inter-channel information), and a content-perception auxiliary network. The model alleviates the inaccurate segmentation caused by the multiple scales of objects, and the joint use of enlarged receptive fields and channel information benefits semantic segmentation of real-scene images. To reduce the labor cost of data labeling, the approach can be further extended toward more generalizable semantic segmentation networks through weakly supervised or unsupervised learning.
Keywords: image segmentation; attention; strip pooling; atrous convolution; receptive field