Feature map slice for semantic segmentation
2019, Vol. 24, No. 3, Pages 464-473
Received: 2018-07-04; Revised: 2018-09-04; Published in print: 2019-03-16
DOI: 10.11834/jig.180402
Objective
Deep convolutional neural networks have recently shown outstanding performance in object recognition and have become the first choice for dense prediction problems such as semantic segmentation. Fully convolutional network (FCN) based methods have become the main research direction in the field of image semantic segmentation. However, the repeated downsampling operations in these methods, such as pooling or convolution striding, lead to a significant decrease in the initial image resolution; for example, five successive 2x downsampling stages shrink the feature maps by a factor of 32 in each dimension. This loss of resolution results in poor object delineation, the loss of small targets, and weak segmentation output. Although some studies have addressed this problem in recent years, how to handle it effectively remains an open question and deserves further attention. This study proposes a feature map slice module for semantic segmentation to alleviate this problem.
Method
The proposed method comprises two parts: a middle-layer feature map slice module and a corresponding feature extraction network. The slice module operates on a middle-layer feature map, cutting it into several equal sub-regions and then upsampling each sub-region to the resolution of the original feature map, which enlarges the objects in that local area. Each slice therefore corresponds to one sub-region of the original feature map, and after upsampling, the small objects it contains behave like relatively large objects, whereas they are difficult to detect when the entire feature map is processed at once. Feature extraction can thus concentrate on the small targets within each sub-region. To this end, a weight-shared feature extraction network is designed for the sliced feature maps; it applies multiple convolution operations with different kernel sizes to extract feature information at different scales. A minimal sketch of the slicing step is given below.
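To make the slicing step concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation (which is Caffe-based, per the reference list); the function name, the 2x2 default grid, and bilinear interpolation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def slice_and_upsample(x: torch.Tensor, grid: int = 2):
    """Cut a feature map (N, C, H, W) into grid x grid equal sub-regions
    and upsample each one back to (H, W), enlarging its contents."""
    n, c, h, w = x.shape
    gh, gw = h // grid, w // grid                  # size of one sub-region
    patches = []
    for i in range(grid):
        for j in range(grid):
            patch = x[:, :, i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            # After upsampling, a small object occupies grid*grid times
            # more feature-map area than in the original map.
            patches.append(F.interpolate(patch, size=(h, w),
                                         mode='bilinear', align_corners=False))
    return patches  # each patch is fed to the weight-shared extractor
```

Each returned patch is then processed by the shared extraction network described next.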
For each input of this network, the channel dimension is reduced by half to save memory, and dilated convolution is adopted to enlarge the receptive field. The feature maps obtained from the different convolution operations are then concatenated, and a channel-attention operation is applied. By combining multi-scale convolution with an attention mechanism, the feature extraction network extracts the semantic category information of each sub-region and effectively provides the contextual, global, and discriminative information of every slice. The network can therefore focus on small objects in local areas and improve their discriminability. After every slice has passed through the shared network, the extracted features are assembled at their corresponding positions to form a mosaic feature map. The original network output is upsampled and fused with this mosaic by an element-wise max operation, so that middle-layer features are reused efficiently. To exploit middle-layer information further, the module is introduced at multiple scales, which enhances the extraction of small-target characteristics and of spatial information in local areas, utilizes the semantic information available at different scales, and yields clear improvements in extracting small-target features, refining segmentation edges, and enhancing the network's discriminative power. A sketch of the shared extraction block and the fusion step follows.
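Continuing the sketch above, the block below shows one plausible form of the weight-shared extraction network and of the mosaic assembly and fusion. The paper specifies channel halving, multi-scale convolutions, dilated convolution, channel attention, and element-wise max fusion, but not exact hyperparameters; the kernel sizes (3 and 5), the dilation rate of 2, and the squeeze-and-excitation style attention used here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExtractor(nn.Module):
    """Hypothetical weight-shared block: channel reduction, parallel
    multi-scale (plain and dilated) convolutions, concatenation, and an
    SE-style channel attention."""
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2                        # halve channels to save memory
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)
        self.branch3 = nn.Conv2d(mid, mid, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(mid, mid, kernel_size=5, padding=2)
        self.dilated = nn.Conv2d(mid, mid, kernel_size=3, padding=2, dilation=2)
        self.project = nn.Conv2d(3 * mid, channels, kernel_size=1)
        self.attn = nn.Sequential(                 # channel attention (SE-style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        x = F.relu(self.reduce(x))
        y = torch.cat([F.relu(self.branch3(x)),    # multi-scale features
                       F.relu(self.branch5(x)),
                       F.relu(self.dilated(x))], dim=1)
        y = self.project(y)
        return y * self.attn(y)                    # re-weight channels

def slice_module(x, extractor, grid=2):
    """Slice -> shared extraction -> mosaic -> element-wise max fusion.
    Reuses slice_and_upsample from the previous sketch."""
    n, c, h, w = x.shape
    feats = [extractor(p) for p in slice_and_upsample(x, grid)]
    rows = [torch.cat(feats[i * grid:(i + 1) * grid], dim=3)
            for i in range(grid)]
    mosaic = torch.cat(rows, dim=2)                # (N, C, grid*H, grid*W)
    up = F.interpolate(x, size=mosaic.shape[2:],
                       mode='bilinear', align_corners=False)
    return torch.max(up, mosaic)                   # element-wise max fusion
```

For instance, a (1, 256, 64, 64) middle-layer map with a 2x2 grid yields a (1, 256, 128, 128) fused output; introducing the module at several network resolutions simply repeats this with feature maps of different sizes.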
Result
The proposed method is verified on two urban scene-understanding datasets, CamVid and GATECH. Both datasets contain many common urban scene objects, such as buildings, cars, and cyclists. Several ablation experiments are conducted on the two datasets, and strong performance is achieved: mean intersection-over-union scores of 66.3% and 52.6% are obtained on CamVid and GATECH, respectively.
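For reference, the metric reported above is the standard mean intersection-over-union (mIoU), whose definition is not restated in the abstract; in the usual confusion-matrix notation,

$$\mathrm{mIoU}=\frac{1}{K}\sum_{k=1}^{K}\frac{n_{kk}}{\sum_{j=1}^{K} n_{kj}+\sum_{j=1}^{K} n_{jk}-n_{kk}}$$

where $n_{ij}$ is the number of pixels of ground-truth class $i$ predicted as class $j$ and $K$ is the number of classes.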
Conclusion
The proposed method makes better use of the spatial distribution information of images, enhances the network's ability to determine the semantic categories of different spatial locations, pays greater attention to small target objects, and provides effective context and global information. Because different resolutions supply rich scale information, the module is extended to several resolutions of the network. In this way, middle-layer feature information is fully utilized, the network's ability to discriminate small target objects is improved, and its overall segmentation performance is enhanced.
References
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3431-3440. [DOI: 10.1109/CVPR.2015.7298965]
Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 1-9. [DOI: 10.1109/CVPR.2015.7298594]
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. 2015-04-10[2018-05-01]. https://arxiv.org/pdf/1409.1556.pdf
Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[EB/OL]. 2016-06-07[2018-05-01]. https://arxiv.org/pdf/1412.7062.pdf
Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions[EB/OL]. 2016-03-04[2018-05-01]. https://arxiv.org/pdf/1511.07122v2.pdf
Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1520-1528. [DOI: 10.1109/ICCV.2015.178]
Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. [DOI: 10.1109/TPAMI.2016.2644615]
Wang Y, Liu J, Yan J, et al. Objectness-aware semantic segmentation[C]//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM, 2016: 307-311. [DOI: 10.1145/2964284.2967232]
Lin G S, Milan A, Shen C H, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 1925-1934. [DOI: 10.1109/CVPR.2017.549]
Jégou S, Drozdzal M, Vazquez D, et al. The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, HI, USA: IEEE, 2017: 1175-1183. [DOI: 10.1109/CVPRW.2017.156]
Fu J, Liu J, Wang Y H, et al. Densely connected deconvolutional network for semantic segmentation[C]//Proceedings of 2017 IEEE International Conference on Image Processing. Beijing, China: IEEE, 2017: 3085-3089. [DOI: 10.1109/ICIP.2017.8296850]
Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 2261-2269. [DOI: 10.1109/CVPR.2017.243]
Everingham M, Eslami S, van Gool L, et al. The PASCAL visual object classes challenge: a retrospective[J]. International Journal of Computer Vision, 2015, 111(1): 98-136. [DOI: 10.1007/s11263-014-0733-5]
Cordts M, Omran M, Ramos S, et al. The Cityscapes dataset for semantic urban scene understanding[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 3213-3223. [DOI: 10.1109/CVPR.2016.350]
Brostow G J, Fauqueur J, Cipolla R. Semantic object classes in video: a high-definition ground truth database[J]. Pattern Recognition Letters, 2009, 30(2): 88-97. [DOI: 10.1016/j.patrec.2008.04.005]
Raza S H, Grundmann M, Essa I. Geometric context from videos[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 3081-3088. [DOI: 10.1109/CVPR.2013.396]
Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015: 234-241. [DOI: 10.1007/978-3-319-24574-4_28]
Ghiasi G, Fowlkes C C. Laplacian pyramid reconstruction and refinement for semantic segmentation[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 519-534. [DOI: 10.1007/978-3-319-46487-9_32]
Peng C, Zhang X Y, Yu G, et al. Large kernel matters: improve semantic segmentation by global convolutional network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 4353-4361. [DOI: 10.1109/CVPR.2017.189]
Chen L C, Zhu Y K, Papandreou G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018: 833-851. [DOI: 10.1007/978-3-030-01234-2_49]
He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778. [DOI: 10.1109/CVPR.2016.90]
Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, FL, USA: ACM, 2014: 675-678. [DOI: 10.1145/2647868.2654889]
Kendall A, Badrinarayanan V, Cipolla R. Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding[C]//Proceedings of the 2017 British Machine Vision Conference. London, UK: BMVA Press, 2017.
Kundu A, Vineet V, Koltun V. Feature space optimization for semantic video segmentation[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 3168-3175. [DOI: 10.1109/CVPR.2016.345]
Wang Y H, Liu J, Li Y, et al. Hierarchically supervised deconvolutional network for semantic video segmentation[J]. Pattern Recognition, 2017, 64: 437-445. [DOI: 10.1016/j.patcog.2016.09.046]
Tran D, Bourdev L, Fergus R, et al. Deep end2end voxel2voxel prediction[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Las Vegas, NV, USA: IEEE, 2016: 402-409. [DOI: 10.1109/CVPRW.2016.57]