Semantic segmentation of encoder-decoder structure
2020, Vol. 25, Issue 2, Pages 255-266
Published in print: 2020-02-16
Accepted: 2019-07-29
DOI: 10.11834/jig.190212
Huihui Han, Weitao Li, Jianping Wang, Dian Jiao, Baishun Sun. Semantic segmentation of encoder-decoder structure[J]. Journal of Image and Graphics, 2020,25(2):255-266.
Objective
Semantic segmentation is a challenging task in computer vision whose core is to assign a corresponding semantic class label to each pixel in an image. In this task, however, a lack of rich multiscale information and of sufficient spatial information severely degrades segmentation results. To further improve segmentation quality, starting from the extraction of rich multiscale information and sufficient spatial information, this paper proposes a semantic segmentation model based on an encoder-decoder structure.
Method
The ResNet-101 network serves as the backbone of the model to extract feature maps, and a multiscale information fusion module is attached to the end of the backbone to extract feature maps with strong discriminative power and rich multiscale information in the deep stage of the network. In the shallow stage, a spatial information capturing module is introduced to extract rich spatial information. The feature maps with rich spatial information captured by the spatial information capturing module and the discriminative, multiscale feature maps extracted by the multiscale information fusion module are fused into a new, information-rich set of feature maps; after refinement by a multikernel convolution block, a data-dependent upsampling (DUpsampling) operation finally produces the segmentation result.
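To make the pipeline concrete, the following minimal PyTorch sketch composes the stages described above. Every layer is a simple placeholder standing in for the corresponding module of the paper, and all channel widths and strides are illustrative assumptions rather than the authors' settings; the individual modules are sketched in more detail in the Method section below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationPipelineSketch(nn.Module):
    """Placeholder composition of the proposed pipeline (shapes assumed)."""
    def __init__(self, num_classes=21, ratio=8):
        super().__init__()
        self.shallow = nn.Conv2d(3, 256, 3, stride=4, padding=1)    # shallow backbone stage
        self.deep = nn.Conv2d(256, 2048, 3, stride=2, padding=1)    # deep backbone stage
        self.mif = nn.Conv2d(2048, 512, 3, padding=2, dilation=2)   # stands in for the multiscale information fusion module
        self.sic = nn.Conv2d(256, 128, 1)                           # stands in for the spatial information capturing module
        self.refine = nn.Conv2d(512 + 128, 256, 3, padding=1)       # stands in for the multikernel convolution block
        self.dup = nn.Conv2d(256, num_classes * ratio ** 2, 1)      # DUpsampling projection
        self.ratio = ratio

    def forward(self, x):
        low = self.shallow(x)                 # rich spatial detail
        high = self.deep(low)                 # rich semantics
        m = self.mif(high)                    # multiscale, discriminative features
        s = F.interpolate(self.sic(low), size=m.shape[-2:],
                          mode="bilinear", align_corners=False)
        fused = self.refine(torch.cat([m, s], dim=1))
        return F.pixel_shuffle(self.dup(fused), self.ratio)  # full-resolution logits

logits = SegmentationPipelineSketch()(torch.randn(1, 3, 512, 512))
print(logits.shape)  # torch.Size([1, 21, 512, 512])
```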
Result
Extensive experiments on two public datasets (Cityscapes and PASCAL VOC 2012) verify the effectiveness of each designed module and of the whole model. The new model is compared with ten recent methods. On the Cityscapes dataset, the mean intersection-over-union (mIoU) improves by 0.52%, 3.72%, and 4.42% over the RefineNet, DeepLabv2-CRF, and LRR (Laplacian reconstruction and refinement) models, respectively; on the PASCAL VOC 2012 dataset, the mIoU improves by 6.23%, 7.43%, and 8.33% over the Piecewise, DPN (deep parsing network), and GCRF (Gaussian conditional random field network) models, respectively.
Conclusion
The proposed semantic segmentation model extracts richer multiscale and spatial information, which makes the segmentation results more accurate. The model can be applied in fields such as medical image analysis, autonomous driving, and unmanned aerial vehicles.
Objective
Semantic segmentation, a challenging task in computer vision, aims to assign a corresponding semantic class label to every pixel in an image. It is widely applied in many fields, such as autonomous driving, obstacle detection, medical image analysis, 3D geometry, environment modeling, indoor environment reconstruction, and 3D semantic segmentation. Despite the many achievements in semantic segmentation, two challenges remain: 1) the lack of rich multiscale information and 2) the loss of spatial information. Starting from capturing rich multiscale information and extracting affluent spatial information, a new semantic segmentation model is proposed, which can greatly improve segmentation results.
Method
The new model is built on an encoder-decoder structure, which effectively promotes the fusion of high-level semantic information and low-level spatial information. The details of the entire architecture are elaborated as follows. First, in the encoder part, the ResNet-101 network is used as the backbone to capture feature maps.
In the ResNet-101 network, the last two blocks utilize atrous convolutions with rates of 2 and 4, which reduces the loss of spatial resolution.
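As an illustration, the following sketch (assuming torchvision's ResNet implementation, which exposes a replace_stride_with_dilation switch) shows how dilating the last two stages with rates 2 and 4 keeps the output stride at 8 instead of 32:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# Replacing the strides of the last two stages with dilations gives atrous
# rates of 2 and 4, so spatial resolution is only reduced by a factor of 8.
net = resnet101(replace_stride_with_dilation=[False, True, True])
features = nn.Sequential(*list(net.children())[:-2])  # drop avgpool and fc

x = torch.randn(1, 3, 512, 512)
print(features(x).shape)  # torch.Size([1, 2048, 64, 64]), i.e., stride 8
```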
A multiscale information fusion module is designed in the encoder part to capture feature maps with rich multiscale and discriminative information in the deep stage of the network. In this module, by applying an expansion-and-stacking principle, Kronecker convolutions are arranged in a parallel structure to expand the receptive field and extract multiscale information.
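The following is a minimal sketch of a Kronecker convolution in PyTorch, assuming the expansion pattern of Wu et al. (2018b); the rates r1 and r2 and the channel widths are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KroneckerConv2d(nn.Module):
    """Sketch of a Kronecker convolution: every kernel element is expanded
    into an r1 x r1 block whose top-left r2 x r2 entries share that element,
    enlarging the receptive field while still reusing some nearby pixels."""
    def __init__(self, in_ch, out_ch, k=3, r1=4, r2=2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_normal_(self.weight)
        pattern = torch.zeros(r1, r1)
        pattern[:r2, :r2] = 1.0            # expansion-and-stacking pattern
        self.register_buffer("pattern", pattern)

    def forward(self, x):
        # Kronecker product expands the spatial dims: (k, k) -> (k*r1, k*r1);
        # padding="same" requires PyTorch >= 1.9.
        w = torch.kron(self.weight, self.pattern)
        return F.conv2d(x, w, padding="same")

# Parallel branches with different expansion rates gather multiscale context.
branches = nn.ModuleList(KroneckerConv2d(256, 64, r1=r) for r in (2, 4, 8))
x = torch.randn(1, 256, 64, 64)
multiscale = torch.cat([b(x) for b in branches], dim=1)  # 1 x 192 x 64 x 64
```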
A global attention module is then applied to selectively highlight discriminative information in the feature maps captured by the Kronecker convolutions.
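The abstract does not detail the internal design of the global attention module; the sketch below assumes an SE-style formulation (global pooling followed by channel reweighting, after the cited Hu et al. 2017) purely for illustration.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Hypothetical sketch: global context gates each channel, so channels
    carrying discriminative information are emphasized."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global context
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # weights in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight channels by their global response

y = GlobalAttention(192)(torch.randn(1, 192, 64, 64))
```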
Subsequently, a spatial information capturing module is introduced as the decoder part in the shallow stage of the network. This module supplements affluent spatial information, compensating for the spatial resolution loss caused by the repeated combination of max-pooling and striding at consecutive layers in ResNet-101.
Moreover, the spatial information capturing module plays an important role in effectively enhancing the relationships between widely separated spatial regions. The feature maps with rich multiscale and discriminative information captured by the multiscale information fusion module in the deep stage and the feature maps with affluent spatial information captured by the spatial information capturing module are then fused into a new feature map set that is full of effective information.
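A minimal sketch of this fusion step, with assumed shapes; as discussed below for DUpsampling, the shallow features can safely be brought down to the deeper features' resolution before concatenation.

```python
import torch
import torch.nn.functional as F

deep = torch.randn(1, 192, 64, 64)       # from the multiscale information fusion module
shallow = torch.randn(1, 128, 128, 128)  # from the spatial information capturing module

# Downsample the shallow features to the lowest feature-map resolution,
# then concatenate the two sets into one information-rich feature map set.
shallow_down = F.interpolate(shallow, size=deep.shape[-2:],
                             mode="bilinear", align_corners=False)
fused = torch.cat([deep, shallow_down], dim=1)  # 1 x 320 x 64 x 64
```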
Afterward, a multikernel convolution block is utilized to refine these feature maps; in this block, two convolutions with kernel sizes of 3×3 and 5×5 run in parallel. The refined feature maps are then fed to a data-dependent upsampling (DUpsampling) operator to obtain the final prediction. DUpsampling replaces upsampling by bilinear interpolation because it not only exploits the redundancy in the segmentation label space but also effectively recovers the pixel-wise prediction; it also allows arbitrary low-level feature maps to be safely downsampled to the lowest feature-map resolution and fused there to produce the final prediction.
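The sketch below illustrates both the multikernel convolution block and a DUpsampling head. Summing the two convolution branches and the channel sizes are assumptions; the DUpsampling projection follows the formulation of the cited Tian et al. (2019), implemented here as a learned 1×1 projection plus a pixel-shuffle rearrangement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiKernelBlock(nn.Module):
    """Two parallel convolutions with 3x3 and 5x5 kernels refine the fused
    features; how the branches are combined is not stated, so they are
    summed here as an assumption."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, 5, padding=2)

    def forward(self, x):
        return F.relu(self.branch3(x) + self.branch5(x))

class DUpsampling(nn.Module):
    """Data-dependent upsampling: a learned linear projection per pixel,
    rearranged to full resolution in place of bilinear interpolation,
    exploiting redundancy in the segmentation label space."""
    def __init__(self, in_ch, num_classes, ratio):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Conv2d(in_ch, num_classes * ratio ** 2, 1)

    def forward(self, x):
        return F.pixel_shuffle(self.proj(x), self.ratio)

refined = MultiKernelBlock(320, 256)(torch.randn(1, 320, 64, 64))
logits = DUpsampling(256, num_classes=19, ratio=8)(refined)
print(logits.shape)  # torch.Size([1, 19, 512, 512])
```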
Result
To prove the effectiveness of the proposals, extensive experiments are conducted on two public datasets: PASCAL VOC 2012 and Cityscapes. We first conduct several ablation studies on the PASCAL VOC 2012 dataset to evaluate the effectiveness of each module and then perform several comparison experiments on the PASCAL VOC 2012 and Cityscapes datasets against existing approaches, such as the FCN (fully convolutional network), FRRN (full-resolution residual networks), DeepLabv2, CRF-RNN (conditional random fields as recurrent neural networks), DeconvNet, GCRF (Gaussian conditional random field network), DeepLabv2-CRF, Piecewise, Dilation10, DPN (deep parsing network), LRR (Laplacian reconstruction and refinement), and RefineNet models, to verify the effectiveness of the entire architecture. On the Cityscapes dataset, our model achieves mean intersection-over-union (mIoU) improvements of 0.52%, 3.72%, and 4.42% over the RefineNet, DeepLabv2-CRF, and LRR models, respectively. On the PASCAL VOC 2012 dataset, our model achieves mIoU improvements of 6.23%, 7.43%, and 8.33% over the Piecewise, DPN, and GCRF models, respectively. Several visualization results of our model on the Cityscapes and PASCAL VOC 2012 datasets further demonstrate the superiority of the proposals.
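For reference, the mIoU figures quoted above are the per-class intersection-over-union averaged across classes; a minimal sketch of the metric:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class IoU = TP / (TP + FP + FN), averaged over the
    classes that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example with the 19 evaluated Cityscapes classes.
pred = np.random.randint(0, 19, size=(512, 1024))
gt = np.random.randint(0, 19, size=(512, 1024))
print(mean_iou(pred, gt, num_classes=19))
```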
Conclusion
Experimental results show that our model outperforms several state-of-the-art approaches and can dramatically improve semantic segmentation results. The model has great application value in many fields, such as medical image analysis, autonomous driving, and unmanned aerial vehicles.
semantic segmentation; Kronecker convolution; multiscale information; spatial information; attention mechanism; encoder-decoder structure; Cityscapes dataset; PASCAL VOC 2012 dataset
Badrinarayanan V, Kendall A and Cipolla R. 2017. SegNet:a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481-2495[DOI:10.1109/TPAMI.2016.2644615]
Byeon W, Breuel T M, Raue F and Liwicki M. 2015. Scene labeling with LSTM recurrent neural networks//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 3547-3555[DOI:10.1109/CVPR.2015.7298977]
Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L. 2018a. Deeplab:semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848[DOI:10.1109/TPAMI.2017.2699184]
Chen L C, Papandreou G, Schroff F and Adam H. 2017. Rethinking atrous convolution for semantic image segmentation[EB/OL].[2017-12-05].https://arxiv.org/pdf/1706.05587.pdf
Chen L C, Yang Y, Wang J, Xu W and Yuille A L. 2016. Attention to scale: scale-aware semantic image segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 3640-3649[DOI:10.1109/CVPR.2016.396]
Chen L C, Zhu Y K, Papandreou G, Schroff F and Adam H. 2018b. Encoder-decoder with atrous separable convolution for semantic image segmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 833-851[DOI:10.1007/978-3-030-01234-2_49]
Fu J, Liu J, Tian H J, Li Y, Bao Y J, Fang Z W and Lu H Q. 2018. Dual attention network for scene segmentation[EB/OL].[2019-04-21].https://arxiv.org/pdf/1809.02983.pdf
Ghiasi G and Fowlkes C C. 2016. Laplacian pyramid reconstruction and refinement for semantic segmentation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, Netherlands: Springer: 519-534[DOI:10.1007/978-3-319-46487-9_32]
Glorot X, Bordes A and Bengio Y. 2017. Deep sparse rectifier neural networks[EB/OL].[2017-05-15].http://www.doc88.com/p-3187616467050.html
Hu J, Shen L, Albanie S, Sun G and Wu E H. 2017. Squeeze-and-excitation networks[EB/OL].[2018-10-25].https://arxiv.org/pdf/1709.01507.pdf
Huang Z L, Wang X G, Huang L C, Huang C, Wei Y C and Liu W Y. 2018. CCNet: criss-cross attention for semantic segmentation[EB/OL].[2018-11-28].https://arxiv.org/pdf/1811.11721v1.pdf
Ioffe S and Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift[EB/OL].[2015-03-02].https://arxiv.org/pdf/1502.03167.pdf
Li H C, Xiong P F, An J and Wang L X. 2018. Pyramid attention network for semantic segmentation[EB/OL].[2018-11-25].https://arxiv.org/pdf/1805.10180.pdf
Lin G S, Milan A, Shen C H and Reid I. 2017. RefineNet: multi-path refinement networks for high-resolution semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 1-5[DOI:10.1109/CVPR.2017.549]
Lin G S, Shen C H, Van Den Hengel A and Reid I. 2016. Efficient piecewise training of deep structured models for semantic segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 3194-3203[DOI:10.1109/CVPR.2016.348]
Liu S F, De Mello S, Gu J W, Zhong G Y, Yang M H and Kautz J. 2017. Learning affinity via spatial propagation networks[EB/OL].[2017-10-03].https://arxiv.org/pdf/1710.01020v1.pdf
Liu Z W, Li X X, Luo P, Loy C C and Tang X O. 2015. Semantic image segmentation via deep parsing network//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1377-1385[DOI:10.1109/ICCV.2015.162]
Noh H, Hong S and Han B. 2015. Learning deconvolution network for semantic segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1520-1528[DOI:10.1109/ICCV.2015.178]
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z M, Desmaison A, Antiga L and Lerer A. 2017. Automatic differentiation in PyTorch[EB/OL].[2017-10-29].https://openreview.net/pdf?id=BJJsrmfCZ
Peng C, Zhang X Y, Yu G, Luo G M and Sun J. 2017. Large kernel matters-improve semantic segmentation by global convolutional network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 1743-1751[DOI:10.1109/CVPR.2017.189]
Ronneberger O, Fischer P and Brox T. 2015. U-net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241[DOI:10.1007/978-3-319-24574-4_28]
Shelhamer E, Long J and Darrell T. 2014. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640-651[DOI:10.1109/TPAMI.2016.2552683]
Shuai B, Zuo Z, Wang B and Wang G. 2018. Scene segmentation with DAG-recurrent neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1480-1493[DOI:10.1109/TPAMI.2017.2712691]
Tian Z, He T, Shen C H and Yan Y L. 2019. Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation[EB/OL].[2019-04-05].https://arxiv.org/pdf/1903.02120.pdf
Vemulapalli R, Tuzel O, Liu M Y and Chellappa R. 2016. Gaussian conditional random field network for semantic segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 3224-3233[DOI:10.1109/CVPR.2016.351]
Wang F, Jiang M Q, Qian C, Yang S, Li C, Zhang H G, Wang X G and Tang X O. 2017. Residual attention network for image classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 3156-3164[DOI:10.1109/CVPR.2017.683]
Wang P Q, Chen P F, Yuan Y, Liu D, Huang Z H, Hou X D and Cottrell G. 2018. Understanding convolution for semantic segmentation//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, NV, USA: IEEE: 1451-1460[DOI:10.1109/WACV.2018.00163]
Wu T Y, Tang S, Zhang R and Zhang Y D. 2018a. CGNet: a light-weight context guided network for semantic segmentation[EB/OL].[2019-04-12].https://arxiv.org/pdf/1811.08201.pdf
Wu T Y, Tang S, Zhang R, Cao J and Li J T. 2018b. Tree-structured Kronecker convolutional network for semantic segmentation[EB/OL].[2018-12-15].https://arxiv.org/pdf/1812.04945.pdf
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention[EB/OL].[2016-04-19].https://arxiv.org/pdf/1502.03044.pdf
Yang M K, Yu K, Zhang C, Li Z W and Yang K Y. 2018. DenseASPP for semantic segmentation in street scenes//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 3684-3692[DOI:10.1109/CVPR.2018.00388]
Yu C Q, Wang J B, Peng C, Gao C X, Yu G and Sang N. 2018. Learning a discriminative feature network for semantic segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 1857-1866[DOI:10.1109/CVPR.2018.00199]
Yu F and Koltun V. 2015. Multi-scale context aggregation by dilated convolutions[EB/OL].[2016-04-30].https://arxiv.org/pdf/1511.07122.pdf
Yuan Y H and Wang J D. 2018. OCNet: object context network for scene parsing[EB/OL].[2019-01-22].https://arxiv.org/pdf/1809.00916.pdf
Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 6230-6239[DOI:10.1109/CVPR.2017.660]
Zhao H S, Zhang Y, Liu S, Shi J P, Loy C C, Lin D H and Jia J Y. 2018. PSANet: point-wise spatial attention network for scene parsing//Proceedings of 2018 European Conference on Computer Vision. Munich, Germany: Springer: 270-286
Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z Z, Du D L, Huang C and Torr P H S. 2015. Conditional random fields as recurrent neural networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1529-1537[DOI:10.1109/ICCV.2015.179]