Semantic segmentation of encoder-decoder structure
2020, Vol. 25, Issue 2, Pages 255-266
Published in print: 2020-02-16
Accepted: 2019-07-29
DOI: 10.11834/jig.190212
Huihui Han, Weitao Li, Jianping Wang, Dian Jiao, Baishun Sun. Semantic segmentation of encoder-decoder structure[J]. Journal of Image and Graphics, 2020,25(2):255-266.
Objective
Semantic segmentation is a challenging task in computer vision whose core is to assign a corresponding semantic class label to each pixel in an image. In this task, however, a lack of rich multiscale information and of sufficient spatial information severely degrades segmentation results. To further improve segmentation quality, starting from the extraction of rich multiscale information and sufficient spatial information, this paper proposes a semantic segmentation model based on an encoder-decoder structure.
Method
The ResNet-101 network serves as the backbone of the model to extract feature maps, and a multiscale information fusion module is attached to the end of the backbone to extract feature maps with strong discriminative power and rich multiscale information in the deep stage of the network. In the shallow stage, a spatial information capturing module is introduced to extract rich spatial information. The feature maps with rich spatial information captured by the spatial information capturing module and the discriminative, multiscale feature maps extracted by the multiscale information fusion module are fused into a new, information-rich set of feature maps; after refinement by a multikernel convolution block, a data-dependent upsampling (DUpsampling) operation finally produces the segmentation result.
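To make the pipeline concrete, the following minimal PyTorch sketch composes the stages described above. Every layer is a simple placeholder standing in for the corresponding module of the paper, and all channel widths and strides are illustrative assumptions rather than the authors' settings; the individual modules are sketched in more detail in the Method section below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationPipelineSketch(nn.Module):
    """Placeholder composition of the proposed pipeline (shapes assumed)."""
    def __init__(self, num_classes=21, ratio=8):
        super().__init__()
        self.shallow = nn.Conv2d(3, 256, 3, stride=4, padding=1)    # shallow backbone stage
        self.deep = nn.Conv2d(256, 2048, 3, stride=2, padding=1)    # deep backbone stage
        self.mif = nn.Conv2d(2048, 512, 3, padding=2, dilation=2)   # stands in for the multiscale information fusion module
        self.sic = nn.Conv2d(256, 128, 1)                           # stands in for the spatial information capturing module
        self.refine = nn.Conv2d(512 + 128, 256, 3, padding=1)       # stands in for the multikernel convolution block
        self.dup = nn.Conv2d(256, num_classes * ratio ** 2, 1)      # DUpsampling projection
        self.ratio = ratio

    def forward(self, x):
        low = self.shallow(x)                 # rich spatial detail
        high = self.deep(low)                 # rich semantics
        m = self.mif(high)                    # multiscale, discriminative features
        s = F.interpolate(self.sic(low), size=m.shape[-2:],
                          mode="bilinear", align_corners=False)
        fused = self.refine(torch.cat([m, s], dim=1))
        return F.pixel_shuffle(self.dup(fused), self.ratio)  # full-resolution logits

logits = SegmentationPipelineSketch()(torch.randn(1, 3, 512, 512))
print(logits.shape)  # torch.Size([1, 21, 512, 512])
```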
Result
Extensive experiments on two public datasets (Cityscapes and PASCAL VOC 2012) verify the effectiveness of each designed module and of the whole model. The new model is compared with ten recent methods. On the Cityscapes dataset, the mean intersection-over-union (mIoU) improves by 0.52%, 3.72%, and 4.42% over the RefineNet, DeepLabv2-CRF, and LRR (Laplacian reconstruction and refinement) models, respectively; on the PASCAL VOC 2012 dataset, the mIoU improves by 6.23%, 7.43%, and 8.33% over the Piecewise, DPN (deep parsing network), and GCRF (Gaussian conditional random field network) models, respectively.
Conclusion
The proposed semantic segmentation model extracts richer multiscale and spatial information, which makes the segmentation results more accurate. The model can be applied in fields such as medical image analysis, autonomous driving, and unmanned aerial vehicles.
Objective
Semantic segmentation, a challenging task in computer vision, aims to assign a corresponding semantic class label to every pixel in an image. It is widely applied in many fields, such as autonomous driving, obstacle detection, medical image analysis, 3D geometry, environment modeling, indoor environment reconstruction, and 3D semantic segmentation. Despite the many achievements in semantic segmentation, two challenges remain: 1) the lack of rich multiscale information and 2) the loss of spatial information. Starting from capturing rich multiscale information and extracting affluent spatial information, a new semantic segmentation model is proposed, which can greatly improve segmentation results.
Method
The new model is built on an encoder-decoder structure, which effectively promotes the fusion of high-level semantic information and low-level spatial information. The details of the entire architecture are elaborated as follows. First, in the encoder part, the ResNet-101 network is used as the backbone to capture feature maps.
In the ResNet-101 network, the last two blocks utilize atrous convolutions with rates of 2 and 4, which reduces the loss of spatial resolution.
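As an illustration, the following sketch (assuming torchvision's ResNet implementation, which exposes a replace_stride_with_dilation switch) shows how dilating the last two stages with rates 2 and 4 keeps the output stride at 8 instead of 32:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# Replacing the strides of the last two stages with dilations gives atrous
# rates of 2 and 4, so spatial resolution is only reduced by a factor of 8.
net = resnet101(replace_stride_with_dilation=[False, True, True])
features = nn.Sequential(*list(net.children())[:-2])  # drop avgpool and fc

x = torch.randn(1, 3, 512, 512)
print(features(x).shape)  # torch.Size([1, 2048, 64, 64]), i.e., stride 8
```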
A multiscale information fusion module is designed in the encoder part to capture feature maps with rich multiscale and discriminative information in the deep stage of the network. In this module, by applying an expansion-and-stacking principle, Kronecker convolutions are arranged in a parallel structure to expand the receptive field and extract multiscale information.
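The following is a minimal sketch of a Kronecker convolution in PyTorch, assuming the expansion pattern of Wu et al. (2018b); the rates r1 and r2 and the channel widths are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KroneckerConv2d(nn.Module):
    """Sketch of a Kronecker convolution: every kernel element is expanded
    into an r1 x r1 block whose top-left r2 x r2 entries share that element,
    enlarging the receptive field while still reusing some nearby pixels."""
    def __init__(self, in_ch, out_ch, k=3, r1=4, r2=2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_normal_(self.weight)
        pattern = torch.zeros(r1, r1)
        pattern[:r2, :r2] = 1.0            # expansion-and-stacking pattern
        self.register_buffer("pattern", pattern)

    def forward(self, x):
        # Kronecker product expands the spatial dims: (k, k) -> (k*r1, k*r1);
        # padding="same" requires PyTorch >= 1.9.
        w = torch.kron(self.weight, self.pattern)
        return F.conv2d(x, w, padding="same")

# Parallel branches with different expansion rates gather multiscale context.
branches = nn.ModuleList(KroneckerConv2d(256, 64, r1=r) for r in (2, 4, 8))
x = torch.randn(1, 256, 64, 64)
multiscale = torch.cat([b(x) for b in branches], dim=1)  # 1 x 192 x 64 x 64
```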
A global attention module is then applied to selectively highlight discriminative information in the feature maps captured by the Kronecker convolutions.
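The abstract does not detail the internal design of the global attention module; the sketch below assumes an SE-style formulation (global pooling followed by channel reweighting, after the cited Hu et al. 2017) purely for illustration.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Hypothetical sketch: global context gates each channel, so channels
    carrying discriminative information are emphasized."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global context
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # weights in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight channels by their global response

y = GlobalAttention(192)(torch.randn(1, 192, 64, 64))
```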
Subsequently, a spatial information capturing module is introduced as the decoder part in the shallow stage of the network. This module supplements affluent spatial information, compensating for the spatial resolution loss caused by the repeated combination of max-pooling and striding at consecutive layers in ResNet-101.
Moreover, the spatial information capturing module plays an important role in effectively enhancing the relationships between widely separated spatial regions. The feature maps with rich multiscale and discriminative information captured by the multiscale information fusion module in the deep stage and the feature maps with affluent spatial information captured by the spatial information capturing module are then fused into a new feature map set that is full of effective information.
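A minimal sketch of this fusion step, with assumed shapes; as discussed below for DUpsampling, the shallow features can safely be brought down to the deeper features' resolution before concatenation.

```python
import torch
import torch.nn.functional as F

deep = torch.randn(1, 192, 64, 64)       # from the multiscale information fusion module
shallow = torch.randn(1, 128, 128, 128)  # from the spatial information capturing module

# Downsample the shallow features to the lowest feature-map resolution,
# then concatenate the two sets into one information-rich feature map set.
shallow_down = F.interpolate(shallow, size=deep.shape[-2:],
                             mode="bilinear", align_corners=False)
fused = torch.cat([deep, shallow_down], dim=1)  # 1 x 320 x 64 x 64
```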
Afterward, a multikernel convolution block is utilized to refine these feature maps; in this block, two convolutions with kernel sizes of 3×3 and 5×5 run in parallel. The refined feature maps are then fed to a data-dependent upsampling (DUpsampling) operator to obtain the final prediction. DUpsampling replaces upsampling by bilinear interpolation because it not only exploits the redundancy in the segmentation label space but also effectively recovers the pixel-wise prediction; it also allows arbitrary low-level feature maps to be safely downsampled to the lowest feature-map resolution and fused there to produce the final prediction.
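The sketch below illustrates both the multikernel convolution block and a DUpsampling head. Summing the two convolution branches and the channel sizes are assumptions; the DUpsampling projection follows the formulation of the cited Tian et al. (2019), implemented here as a learned 1×1 projection plus a pixel-shuffle rearrangement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiKernelBlock(nn.Module):
    """Two parallel convolutions with 3x3 and 5x5 kernels refine the fused
    features; how the branches are combined is not stated, so they are
    summed here as an assumption."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, 5, padding=2)

    def forward(self, x):
        return F.relu(self.branch3(x) + self.branch5(x))

class DUpsampling(nn.Module):
    """Data-dependent upsampling: a learned linear projection per pixel,
    rearranged to full resolution in place of bilinear interpolation,
    exploiting redundancy in the segmentation label space."""
    def __init__(self, in_ch, num_classes, ratio):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Conv2d(in_ch, num_classes * ratio ** 2, 1)

    def forward(self, x):
        return F.pixel_shuffle(self.proj(x), self.ratio)

refined = MultiKernelBlock(320, 256)(torch.randn(1, 320, 64, 64))
logits = DUpsampling(256, num_classes=19, ratio=8)(refined)
print(logits.shape)  # torch.Size([1, 19, 512, 512])
```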
Result
To prove the effectiveness of the proposals, extensive experiments are conducted on two public datasets: PASCAL VOC 2012 and Cityscapes. We first conduct several ablation studies on the PASCAL VOC 2012 dataset to evaluate the effectiveness of each module and then perform several comparison experiments on the PASCAL VOC 2012 and Cityscapes datasets against existing approaches, such as the FCN (fully convolutional network), FRRN (full-resolution residual networks), DeepLabv2, CRF-RNN (conditional random fields as recurrent neural networks), DeconvNet, GCRF (Gaussian conditional random field network), DeepLabv2-CRF, Piecewise, Dilation10, DPN (deep parsing network), LRR (Laplacian reconstruction and refinement), and RefineNet models, to verify the effectiveness of the entire architecture. On the Cityscapes dataset, our model achieves mean intersection-over-union (mIoU) improvements of 0.52%, 3.72%, and 4.42% over the RefineNet, DeepLabv2-CRF, and LRR models, respectively. On the PASCAL VOC 2012 dataset, our model achieves mIoU improvements of 6.23%, 7.43%, and 8.33% over the Piecewise, DPN, and GCRF models, respectively. Several visualization results of our model on the Cityscapes and PASCAL VOC 2012 datasets further demonstrate the superiority of the proposals.
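For reference, the mIoU figures quoted above are the per-class intersection-over-union averaged across classes; a minimal sketch of the metric:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class IoU = TP / (TP + FP + FN), averaged over the
    classes that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example with the 19 evaluated Cityscapes classes.
pred = np.random.randint(0, 19, size=(512, 1024))
gt = np.random.randint(0, 19, size=(512, 1024))
print(mean_iou(pred, gt, num_classes=19))
```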
Conclusion
Experimental results show that our model outperforms several state-of-the-art approaches and can dramatically improve semantic segmentation results. The model has great application value in many fields, such as medical image analysis, autonomous driving, and unmanned aerial vehicles.
semantic segmentation; Kronecker convolution; multiscale information; spatial information; attention mechanism; encoder-decoder structure; Cityscapes dataset; PASCAL VOC 2012 dataset
Badrinarayanan V, Kendall A and Cipolla R. 2017. SegNet:a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481-2495[DOI:10.1109/TPAMI.2016.2644615]
Byeon W, Breuel T M, Raue F and Liwicki M. 2015. Scene labeling with LSTM recurrent neural networks//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 3547-3555[DOI:10.1109/CVPR.2015.7298977]
Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L. 2018a. Deeplab:semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848[DOI:10.1109/TPAMI.2017.2699184]
Chen L C, Papandreou G, Schroff F and Adam H. 2017. Rethinking atrous convolution for semantic image segmentation[EB/OL].[2017-12-05].https://arxiv.org/pdf/1706.05587.pdf
Chen L C, Yang Y, Wang J, Xu W and Yuille A L. 2016. Attention to scale: scale-aware semantic image segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 3640-3649[DOI:10.1109/CVPR.2016.396]
Chen L C, Zhu Y K, Papandreou G, Schroff F and Adam H. 2018b. Encoder-decoder with atrous separable convolution for semantic image segmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 833-851[DOI:10.1007/978-3-030-01234-2_49]
Fu J, Liu J, Tian H J, Li Y, Bao Y J, Fang Z W and Lu H Q. 2018. Dual attention network for scene segmentation[EB/OL].[2019-04-21].https://arxiv.org/pdf/1809.02983.pdf
Ghiasi G and Fowlkes C C. 2016. Laplacian pyramid reconstruction and refinement for semantic segmentation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, Netherlands: Springer: 519-534[DOI:10.1007/978-3-319-46487-9_32]
Glorot X, Bordes A and Bengio Y. 2017. Deep sparse rectifier neural networks[EB/OL].[2017-05-15].http://www.doc88.com/p-3187616467050.html
Hu J, Shen L, Albanie S, Sun G and Wu E H. 2017. Squeeze-and-excitation networks[EB/OL].[2018-10-25].https://arxiv.org/pdf/1709.01507.pdf
Huang Z L, Wang X G, Huang L C, Huang C, Wei Y C and Liu W Y. 2018. CCNet: criss-cross attention for semantic segmentation[EB/OL].[2018-11-28].https://arxiv.org/pdf/1811.11721v1.pdf
Ioffe S and Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift[EB/OL].[2015-03-02].https://arxiv.org/pdf/1502.03167.pdf
Li H C, Xiong P F, An J and Wang L X. 2018. Pyramid attention network for semantic segmentation[EB/OL].[2018-11-25].https://arxiv.org/pdf/1805.10180.pdf
Lin G S, Milan A, Shen C H and Reid I. 2017. RefineNet: multi-path refinement networks for high-resolution semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 1-5[DOI:10.1109/CVPR.2017.549]
Lin G S, Shen C H, Van Den Hengel A and Reid I. 2016. Efficient piecewise training of deep structured models for semantic segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 3194-3203[DOI:10.1109/CVPR.2016.348]
Liu S F, De Mello S, Gu J W, Zhong G Y, Yang M H and Kautz J. 2017. Learning affinity via spatial propagation networks[EB/OL].[2017-10-03].https://arxiv.org/pdf/1710.01020v1.pdf
Liu Z W, Li X X, Luo P, Loy C C and Tang X O. 2015. Semantic image segmentation via deep parsing network//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1377-1385[DOI:10.1109/ICCV.2015.162]
Noh H, Hong S and Han B. 2015. Learning deconvolution network for semantic segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1520-1528[DOI:10.1109/ICCV.2015.178]
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z M, Desmaison A, Antiga L and Lerer A. 2017. Automatic differentiation in PyTorch[EB/OL].[2017-10-29].https://openreview.net/pdf?id=BJJsrmfCZ
Peng C, Zhang X Y, Yu G, Luo G M and Sun J. 2017. Large kernel matters-improve semantic segmentation by global convolutional network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 1743-1751[DOI:10.1109/CVPR.2017.189]
Ronneberger O, Fischer P and Brox T. 2015. U-net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241[DOI:10.1007/978-3-319-24574-4_28]
Shelhamer E, Long J and Darrell T. 2014. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640-651[DOI:10.1109/TPAMI.2016.2552683]
Shuai B, Zuo Z, Wang B and Wang G. 2018. Scene segmentation with DAG-recurrent neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1480-1493[DOI:10.1109/TPAMI.2017.2712691]
Tian Z, He T, Shen C H and Yan Y L. 2019. Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation[EB/OL].[2019-04-05].https://arxiv.org/pdf/1903.02120.pdf
Vemulapalli R, Tuzel O, Liu M Y and Chellappa R. 2016. Gaussian conditional random field network for semantic segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 3224-3233[DOI:10.1109/CVPR.2016.351]
Wang F, Jiang M Q, Qian C, Yang S, Li C, Zhang H G, Wang X G and Tang X O. 2017. Residual attention network for image classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 3156-3164[DOI:10.1109/CVPR.2017.683]
Wang P Q, Chen P F, Yuan Y, Liu D, Huang Z H, Hou X D and Cottrell G. 2018. Understanding convolution for semantic segmentation//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, NV, USA: IEEE: 1451-1460[DOI:10.1109/WACV.2018.00163]
Wu T Y, Tang S, Zhang R and Zhang Y D. 2018a. CGNet: a light-weight context guided network for semantic segmentation[EB/OL].[2019-04-12].https://arxiv.org/pdf/1811.08201.pdf
Wu T Y, Tang S, Zhang R, Cao J and Li J T. 2018b. Tree-structured Kronecker convolutional network for semantic segmentation[EB/OL].[2018-12-15].https://arxiv.org/pdf/1812.04945.pdf
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention[EB/OL].[2016-04-19].https://arxiv.org/pdf/1502.03044.pdf
Yang M K, Yu K, Zhang C, Li Z W and Yang K Y. 2018. DenseASPP for semantic segmentation in street scenes//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 3684-3692[DOI:10.1109/CVPR.2018.00388]
Yu C Q, Wang J B, Peng C, Gao C X, Yu G and Sang N. 2018. Learning a discriminative feature network for semantic segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 1857-1866[DOI:10.1109/CVPR.2018.00199]
Yu F and Koltun V. 2015. Multi-scale context aggregation by dilated convolutions[EB/OL].[2016-04-30].https://arxiv.org/pdf/1511.07122.pdf
Yuan Y H and Wang J D. 2018. OCNet: object context network for scene parsing[EB/OL].[2019-01-22].https://arxiv.org/pdf/1809.00916.pdf
Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 6230-6239[DOI:10.1109/CVPR.2017.660]
Zhao H S, Zhang Y, Liu S, Shi J P, Loy C C, Lin D H and Jia J Y. 2018. PSANet: point-wise spatial attention network for scene parsing//Proceedings of 2018 European Conference on Computer Vision. Munich, Germany: Springer: 270-286
Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z Z, Du D L, Huang C and Torr P H S. 2015. Conditional random fields as recurrent neural networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1529-1537[DOI:10.1109/ICCV.2015.179]