Real-time semantic segmentation based on dilated separable convolution and attention mechanism
2022, Vol. 27, No. 4, pp. 1216-1225
Received: 2020-12-14; Revised: 2021-02-18; Accepted: 2021-02-25; Published in print: 2022-04-16
DOI: 10.11834/jig.200729

Objective
To meet the accuracy and real-time requirements of semantic segmentation algorithms, a real-time semantic segmentation method based on a dilated separable convolution module and an attention mechanism is proposed.
Method
Depthwise separable convolution is combined with dilated convolutions of different dilation rates to design a dilated separable convolution module, which reduces the model's computation while extracting features more efficiently. A channel attention module and a spatial attention module are added at the network output to strengthen the channel and spatial information of the features, and their outputs are fused with the original features to further improve representational power. The fused features are upsampled to the original image size to predict per-pixel categories and produce the segmentation.
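The dilated separable module described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the channel split, the per-branch depthwise 3×3 convolutions with different dilation rates, and the 1×1 pointwise mix are shown with hypothetical helper names and toy shapes.

```python
import numpy as np

def depthwise_dilated_conv(x, w, dilation=1):
    """Depthwise 3x3 convolution with dilation (one filter per channel).
    x: (C, H, W) input, w: (C, 3, 3) per-channel kernels.
    'Same' padding, so spatial size is preserved."""
    C, H, W = x.shape
    pad = dilation  # radius of a dilated 3x3 kernel
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    for c in range(C):
        for i in range(3):
            for j in range(3):
                di, dj = i * dilation, j * dilation
                out[c] += w[c, i, j] * xp[c, di:di + H, dj:dj + W]
    return out

def pointwise_conv(x, w):
    """1x1 convolution mixing channels. x: (C, H, W), w: (Cout, C)."""
    return np.tensordot(w, x, axes=([1], [0]))

def dilated_separable_block(x, w_dw_a, w_dw_b, w_pw, d_a=1, d_b=2):
    """Split channels in half, run each branch through a depthwise conv
    with a different dilation rate, concatenate, then mix with a 1x1 conv."""
    C = x.shape[0]
    xa, xb = x[:C // 2], x[C // 2:]
    ya = depthwise_dilated_conv(xa, w_dw_a, dilation=d_a)
    yb = depthwise_dilated_conv(xb, w_dw_b, dilation=d_b)
    y = np.concatenate([ya, yb], axis=0)
    return pointwise_conv(y, w_pw)
```

Because each branch convolves only its own half of the channels with one filter per channel, the parameter count stays far below that of a standard 3×3 convolution, while the two dilation rates give the branches receptive fields of different sizes.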
Result
Experiments on the Cityscapes and CamVid datasets achieve segmentation accuracies of 70.4% and 67.8% respectively, at 71 frames/s, with only 0.66 M model parameters. Without affecting speed, accuracy improves by 1.2% on each dataset over the original method, verifying the effectiveness of the approach. The method also compares favorably with recent real-time semantic segmentation methods.
Conclusion
By adopting a dilated separable convolution module and attention modules, the proposed method extracts features more efficiently while reducing the model's computation, and improves segmentation accuracy while preserving real-time performance, striking an effective balance between accuracy and speed.
Objective
Image semantic segmentation is an essential task in computer vision, with applications in autonomous driving, scene recognition, medical image analysis, and unmanned aerial vehicles (UAVs). To capture global information more effectively, current semantic segmentation models aggregate the context of different regions with a pyramid pooling module. Multi-scale feature extraction based on dilated convolution enlarges the receptive field at different rates without adding parameters or reducing spatial resolution. A feature pyramid network can likewise extract features, and a multi-scale pyramid structure can be used to construct networks. Both approaches improve segmentation accuracy, but practical applications are constrained by network size and inference speed. Designing a compact, fast, and efficient real-time semantic segmentation network therefore remains challenging. To meet the accuracy and real-time requirements of semantic segmentation algorithms, a real-time semantic segmentation method is presented based on a dilated separable convolution module and an attention mechanism.
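The receptive-field gain of dilated convolution can be made concrete: a k×k kernel with dilation d covers d(k−1)+1 positions while keeping the same k² weights. A small illustrative helper (not from the paper) computes the receptive field of a stride-1 stack:

```python
def dilated_kernel_extent(k, d):
    """Effective spatial extent of a k x k kernel with dilation d."""
    return d * (k - 1) + 1

def stacked_receptive_field(layers):
    """Receptive field of a stack of stride-1 conv layers,
    each given as a (kernel_size, dilation) pair."""
    rf = 1
    for k, d in layers:
        rf += dilated_kernel_extent(k, d) - 1
    return rf
```

For example, three 3×3 convolutions with dilations 1, 2, and 4 have extents 3, 5, and 9, giving a 15×15 receptive field from only 27 weights per channel, whereas undilated 3×3 layers would need seven stacked layers to match it.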
Method
First, depthwise separable convolution is combined with dilated convolutions of different rates to design a dilated separable convolution module. Second, a channel attention module and a spatial attention module are added at the end of the network to enhance the representation of the channel and spatial information of the features, and their outputs are fused with the original features to further improve representational capability. Finally, the fused features are upsampled to the size of the original image to predict per-pixel categories and achieve semantic segmentation. The method consists of a feature extraction stage and a feature enhancement stage. In the feature extraction stage, the input image passes through the dilated separable convolution module for dense feature extraction. The module first applies a channel-split operation that divides the channels in half into two branches. In each branch, standard convolution is replaced by depthwise separable convolution to extract features more efficiently and reduce the number of model parameters. Meanwhile, dilated convolutions with different rates are used in the convolution layers of the two branches to enlarge the receptive field and capture multi-scale context effectively. In the feature enhancement stage, the extracted features are re-weighted and fused to strengthen the feature representation, as follows. First, a channel attention branch and a spatial attention branch enhance the expression of the channel and spatial information of the features. Second, a global average pooling branch contributes global context to further improve segmentation performance. Finally, the branch outputs are all fused and upsampled to match the resolution of the input image.
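The enhancement stage can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the paper's modules: the channel branch is a squeeze-and-excitation-style gate, the spatial branch is a parameter-free sigmoid over channel-pooled maps, and the fusion is a plain sum; the weight matrix `w_c` and all shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w_c):
    """Gate each channel: global average pool to a channel descriptor,
    apply a learned linear map, squash with sigmoid, rescale channels.
    x: (C, H, W), w_c: (C, C)."""
    s = x.mean(axis=(1, 2))        # (C,) channel descriptor
    g = sigmoid(w_c @ s)           # (C,) gates in (0, 1)
    return x * g[:, None, None]

def spatial_attention(x):
    """Gate each spatial position: pool across channels (mean and max),
    squash with sigmoid, rescale every position."""
    m = sigmoid(x.mean(axis=0) + x.max(axis=0))   # (H, W) gates
    return x * m[None, :, :]

def enhance(x, w_c):
    """Fuse the two attention branches and a global-average-pool
    (context) branch with the original features by summation."""
    ctx = x.mean(axis=(1, 2), keepdims=True)  # global context, broadcast
    return x + channel_attention(x, w_c) + spatial_attention(x) + ctx
```

After `enhance`, the fused map would be bilinearly upsampled to the input resolution and passed to the per-pixel classifier; those steps are omitted here.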
Result
The method is evaluated on the Cityscapes and CamVid datasets to verify its effectiveness. It achieves segmentation accuracies of 70.4% on Cityscapes and 67.8% on CamVid, at a running speed of 71 frames/s, with only 0.66 M model parameters. Compared with the original method, accuracy improves by 1.2% on each dataset with no loss of speed.
Conclusion
To meet the accuracy and real-time requirements of semantic segmentation algorithms, a real-time semantic segmentation method is proposed based on a dilated separable convolution module and an attention mechanism. The redesigned module combines efficient depthwise separable convolution with dilated convolution, applying a different dilation rate in each separable branch to obtain receptive fields of different sizes, and channel attention and spatial attention modules are incorporated at the output. The method reduces the number of model parameters while still learning feature information effectively, achieving qualified real-time semantic segmentation without resorting to a deeper network model or a heavy context aggregation module.