Published: 2019-03-16 · DOI: 10.11834/jig.180402 · 2019, Volume 24, Number 3 · ChinaMM 2018

Received: 2018-07-04; Revised: 2018-09-04. Supported by: National Natural Science Foundation of China (61472422). First author: Cao Fengmei, born in 1970, female, associate professor, Ph.D.; her research interests include image processing and image display technology. E-mail: a1831974703@163.com. Tian Haijie, male, master's student; his research interest is image semantic segmentation. E-mail: 15911022582@163.com. Fu Jun, male, Ph.D. student; his research interests include image semantic segmentation and image understanding. E-mail: jun.fu@nlpr.ia.ac.cn. Liu Jing, female, professor, Ph.D.; her research interests include multimedia analysis, understanding and retrieval, mobile media computing and applications, and deep learning algorithms. E-mail: jliu@nlpr.ia.ac.cn. CLC number: TP309. Document code: A. Article ID: 1006-8961(2019)03-0464-10


Feature map slice for semantic segmentation
Cao Fengmei1, Tian Haijie1, Fu Jun2, Liu Jing2
1. School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China;
2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Supported by: National Natural Science Foundation of China (61472422)

Abstract

Objective Deep convolutional neural networks have recently shown outstanding performance in object recognition and have become the first choice for dense classification problems such as semantic segmentation. Fully convolutional network-based methods have become the main research direction in image semantic segmentation. However, the repeated downsampling operations in these methods, such as pooling or strided convolution, sharply reduce the initial image resolution, which results in poor object delineation, loss of small targets, and weak segmentation output. Although several studies have addressed this problem in recent years, how to handle it effectively remains an open question and deserves further attention. This study proposes a feature map slice module for semantic segmentation to solve this problem. Method The proposed method consists of two main parts: a middle-layer feature map slicing step and a corresponding feature extraction network. The slice module operates on a middle-layer feature map: the map is sliced into several small cubes, and each cube is upsampled to the resolution of the original feature map, which enlarges the content of its local area. Each cube corresponds to a subregion of the original feature map, so after upsampling, the objects inside it become relatively large. Small objects that are hard to detect when the entire feature map is processed at once can therefore receive focused attention during feature extraction. A weight-shared feature extraction network is designed for the sliced feature maps.
The feature extraction network applies multiple convolution operations with different kernel sizes to extract feature information at different scales. For each input, the channel dimension is halved to save memory, and dilated convolution is adopted to enlarge the receptive field. The feature maps obtained by the different convolutions are then concatenated, and a channel-attention operation is added. By combining multi-scale convolution with an attention mechanism, the network extracts the semantic category information of each subregion while providing contextual, global, and discriminative information for each slice. Accordingly, the method focuses on small objects in local areas and improves their discriminability. After every cube passes through the feature extraction network, the extracted features are assembled in their corresponding positions to form a mosaic feature map. The network's original output is upsampled and fused with the mosaic feature map by an element-wise max operation, so the middle-layer features are reused efficiently. To fully exploit middle-layer information, the module is introduced at multiple scales, which strengthens the extraction of small-target characteristics and local spatial information, exploits semantic information at different scales, and yields a clear improvement in extracting small-target features, refining segmentation edges, and enhancing the discriminative power of the network. Result The proposed method is verified on two urban scene-understanding datasets, CamVid and GATECH. Both datasets contain common urban scene objects such as buildings, cars, and cyclists. Several ablation experiments on the two datasets show strong performance: mean intersection-over-union scores of 66.3% and 52.6% are achieved on CamVid and GATECH, respectively. Conclusion The proposed method exploits the spatial distribution information of images, strengthens the network's ability to determine semantic categories at different spatial locations, pays close attention to small target objects, and provides effective context and global information. The module is applied at multiple network resolutions, since different resolutions provide rich scale information. In this way, middle-layer feature information is reused, the network's ability to discriminate small objects is improved, and overall segmentation performance is enhanced.

Key words

deep learning; fully convolutional neural networks; semantic segmentation; scene parsing; feature slice; multiple scales; feature reuse

0 Introduction

1) Improving the convolution operation to reduce the number of downsampling steps, preserving the spatial resolution of the image while enlarging the network's receptive field. A typical approach is dilated convolution, also known as convolution "with holes" (atrous convolution) [4-5]. This approach removes some of the downsampling operations in the network, which increases the resolution of the output feature map, and replaces ordinary convolutions with dilated convolutions so that the network still obtains a large receptive field.
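As a concrete illustration of this trade-off, the following PyTorch sketch (the framework is chosen here only for illustration; the experiments in this paper are implemented in Caffe [22]) compares a stride-2 convolution, which halves the resolution, with a dilation-2 convolution, which keeps the resolution while enlarging the effective kernel from 3×3 to 5×5:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # a toy feature map

# Ordinary downsampling convolution: halves the spatial resolution.
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

# Dilated convolution: stride 1 with dilation 2 preserves resolution
# while covering the same 5x5 effective window as a larger kernel.
dilated = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=2, dilation=2)

print(strided(x).shape)  # torch.Size([1, 64, 16, 16]) -- resolution halved
print(dilated(x).shape)  # torch.Size([1, 64, 32, 32]) -- resolution preserved
```

Stacking several such dilated layers grows the receptive field quickly without any further loss of resolution, which is exactly the property the methods in [4-5] rely on.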

2) Reusing middle-layer features. Because middle-layer features contain much of the image detail, fusing them can recover edge detail. References [6-7] use encoder-decoder structures to compensate for the loss of spatial information: the encoder and decoder are symmetric, and deconvolution restores the image resolution layer by layer. Wang et al. [8] used VGGNet as the pre-trained model, giving the encoder-decoder network better initial weights, easier convergence, and a clear performance gain. Lin et al. [9] proposed a multi-level feature-integration segmentation algorithm, arguing that features at every resolution in the network can improve the final result; they designed a layer-by-layer fusion structure that upsamples starting from the lowest-resolution feature map and fuses it with the higher-resolution map from the lower layer, finally producing a feature map at the original image size. Jégou et al. [10] and Fu et al. [11] exploited the strong feature reusability of the DenseNet [12] structure and proposed DenseNet-based segmentation algorithms. Although current feature-reuse methods exploit middle-layer information and achieve certain gains, they generally convolve the entire feature map directly. In real images, however, small objects often occupy only a small local region; whole-map feature extraction pays little attention to such objects, so the network's ability to discriminate small objects remains poor and their segmentation does not improve.
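The feature map slice idea outlined in the abstract can be sketched as follows. This is a simplified PyTorch illustration under stated assumptions: the class name FeatureSlice and the three-layer stand-in extractor are ours, while the paper's actual extraction network combines multi-scale convolutions with channel attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSlice(nn.Module):
    """Sketch of the feature map slice module: split an H x W middle-layer
    feature map into n x n sub-regions, upsample each back to H x W so that
    small objects occupy more pixels, pass every slice through one
    weight-shared extractor, tile the results into a mosaic, and fuse the
    mosaic with the upsampled original map by element-wise max."""

    def __init__(self, channels, n=2):
        super().__init__()
        self.n = n
        # Weight-shared extractor (a simplified stand-in): halve the channel
        # dimension to save memory, apply a dilated convolution to enlarge
        # the receptive field, then restore the channel dimension.
        self.extract = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1),
            nn.Conv2d(channels // 2, channels // 2, 3, padding=2, dilation=2),
            nn.Conv2d(channels // 2, channels, kernel_size=1),
        )

    def forward(self, x):
        _, _, h, w = x.shape
        n, sh, sw = self.n, x.shape[2] // self.n, x.shape[3] // self.n
        rows = []
        for i in range(n):
            cols = []
            for j in range(n):
                cube = x[:, :, i * sh:(i + 1) * sh, j * sw:(j + 1) * sw]
                # Enlarge the sub-region to the full feature-map resolution.
                cube = F.interpolate(cube, size=(h, w), mode='bilinear',
                                     align_corners=False)
                cols.append(self.extract(cube))
            rows.append(torch.cat(cols, dim=3))
        # Mosaic of slice features, n*h x n*w in total.
        mosaic = torch.cat(rows, dim=2)
        # Upsample the original map and fuse by element-wise max.
        up = F.interpolate(x, size=(n * h, n * w), mode='bilinear',
                           align_corners=False)
        return torch.max(up, mosaic)
```

Because the extractor weights are shared across all n² slices, the module adds far fewer parameters than n² independent branches would, while still letting each sub-region be analyzed at an enlarged scale.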

3.1 Implementation

 $lr = lr \times {\left( {1 - \frac{{{N_{\rm{c}}}}}{{\max N}}} \right)^p}$ (1)
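Equation (1) is the standard "poly" learning-rate policy, where $N_{\rm c}$ is understood as the current iteration and $\max N$ as the total number of iterations. A minimal sketch (the power value $p = 0.9$ below is a common default, assumed here for illustration):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate policy: lr = base_lr * (1 - cur_iter / max_iter)^power.
    The rate equals base_lr at iteration 0 and decays smoothly to 0 at max_iter."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

print(poly_lr(0.01, 0, 100))    # 0.01 -- full rate at the start of training
print(poly_lr(0.01, 100, 100))  # 0.0  -- fully decayed at the final iteration
```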

3.2 Results on CamVid

Table 1 Experimental results of different segmentation ratios of feature map slice modules

 $n$          baseline   $n$=1   $n$=2   $n$=3   $n$=4
 mean IoU/%   58.5       64.8    68.5    68.8    63.5

Table 2 Experimental results of feature extraction network

 Method                                                               mean IoU/%
 baseline                                                             58.5
 baseline + 1/16 direct fusion                                        62.5
 baseline + 1/16 + feature extraction network                         65.5
 baseline + 1/16 + feature slice ($n$=2, no upsampling) + feature extraction network   66.7
 baseline + 1/16 + 2× upsampling + feature extraction network         67.8
 baseline + 1/16 + feature slice ($n$=2) + feature extraction network 68.5

Table 3 Experimental results on CamVid test set

 Method                   mean IoU/%   global Avg/%
 ResNet50-baseline        56.7         89.7
 ResNet50+DA              58.5         90.4
 ResNet50+DA+FS-A         68.5         93.7
 ResNet50+DA+FS-B         70.7         94.2
 ResNet50+DA+FS-C         71.8         94.6
 ResNet101+DA             60.0         90.6
 ResNet101+DA+FS-C        72.9         95.3
 ResNet101+DA+FS-C+MS     73.8         95.9
 Note: bold numbers indicate the best results.

Table 4 Results on CamVid test set

 Method                mean IoU/%   global Avg/%
 SegNet[7]             55.6         88.5
 BayesianSegNet[23]    63.1         86.9
 DeconvNet[6]          48.9         85.9
 DeepLab-LFOV[4]       61.6         -
 Dilation8[5]          65.3         79.0
 Dilation8-FSO[24]     66.1         88.3
 HDCNN-448+TL[25]      65.9         90.9
 FC-DenseNet56[10]     65.8         90.8
 FC-DenseNet103[10]    66.9         91.5
 Ours                  66.3         91.0
 Note: bold numbers indicate the best results; "-" means the metric was not reported in the cited work.

3.3 Results on GATECH

Table 5 Results on GATECH test set

 Method                Uses temporal info   mean IoU/%   global Avg/%
 3D-V2V-scratch[26]    Yes                  -            66.7
 3D-V2V-finetune[26]   Yes                  -            76.0
 FC-DenseNet103[10]    No                   -            79.4
 HDCNN-448+TL[25]      Yes                  48.2         82.1
 Ours                  No                   52.6         84.2
 Note: bold numbers indicate the best results; "-" means the metric was not reported in the cited work.

References

• [1] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3431-3440.[DOI: 10.1109/CVPR.2015.7298965]
• [2] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 1-9.[DOI: 10.1109/CVPR.2015.7298594]
• [3] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. 2015-04-10[2018-05-01]. https://arxiv.org/pdf/1409.1556.pdf.
• [4] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[EB/OL]. 2016-06-07[2018-05-01]. https://arxiv.org/pdf/1412.7062.pdf.
• [5] Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions[EB/OL]. 2016-03-04[2018-05-01]. https://arxiv.org/pdf/1511.07122v2.pdf.
• [6] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1520-1528.[DOI: 10.1109/ICCV.2015.178]
• [7] Badrinarayanan V, Kendall A, Cipolla R. SegNet:a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481–2495. [DOI:10.1109/TPAMI.2016.2644615]
• [8] Wang Y, Liu J, Yan J, et al. Objectness-aware semantic segmentation[C]//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, Netherlands: ACM, 2016: 307-311.[DOI: 10.1145/2964284.2967232]
• [9] Lin G S, Milan A, Shen C H, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 1925-1934.[DOI: 10.1109/CVPR.2017.549]
• [10] Jégou S, Drozdzal M, Vazquez D, et al. The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, HI, USA: IEEE, 2017: 1175-1183.[DOI: 10.1109/CVPRW.2017.156]
• [11] Fu J, Liu J, Wang Y H, et al. Densely connected deconvolutional network for semantic segmentation[C]//Proceedings of 2017 IEEE International Conference on Image Processing. Beijing, China: IEEE, 2017: 3085-3089.[DOI: 10.1109/ICIP.2017.8296850]
• [12] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 2261-2269.[DOI: 10.1109/CVPR.2017.243]
• [13] Everingham M, Eslami S, van Gool L, et al. The PASCAL visual object classes challenge:a retrospective[J]. International Journal of Computer Vision, 2015, 111(1): 98–136. [DOI:10.1007/s11263-014-0733-5]
• [14] Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 3213-3223.[DOI: 10.1109/CVPR.2016.350]
• [15] Brostow G J, Fauqueur J, Cipolla R. Semantic object classes in video:a high-definition ground truth database[J]. Pattern Recognition Letters, 2009, 30(2): 88–97. [DOI:10.1016/j.patrec.2008.04.005]
• [16] Raza S H, Grundmann M, Essa I. Geometric context from videos[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 3081-3088.[DOI: 10.1109/CVPR.2013.396]
• [17] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015: 234-241.[DOI: 10.1007/978-3-319-24574-4_28]
• [18] Ghiasi G, Fowlkes C C. Laplacian pyramid reconstruction and refinement for semantic segmentation[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 519-534.[DOI: 10.1007/978-3-319-46487-9_32]
• [19] Peng C, Zhang X Y, Yu G, et al. Large kernel matters--improve semantic segmentation by global convolutional network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 4353-4361.[DOI: 10.1109/CVPR.2017.189]
• [20] Chen L C, Zhu Y K, Papandreou G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018: 833-851.[DOI: 10.1007/978-3-030-01234-2_49]
• [21] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.[DOI: 10.1109/CVPR.2016.90]
• [22] Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, Florida, USA: ACM, 2014: 675-678.[DOI: 10.1145/2647868.2654889]
• [23] Kendall A, Badrinarayanan V, Cipolla R. Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding[C]//Proceedings of 2017 British Machine Vision Conference. London, UK: BMVA Press, 2017.
• [24] Kundu A, Vineet V, Koltun V. Feature space optimization for semantic video segmentation[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 3168-3175.[DOI: 10.1109/CVPR.2016.345]
• [25] Wang Y H, Liu J, Li Y, et al. Hierarchically supervised deconvolutional network for semantic video segmentation[J]. Pattern Recognition, 2017, 64: 437–445. [DOI:10.1016/j.patcog.2016.09.046]
• [26] Tran D, Bourdev L, Fergus R, et al. Deep end2end voxel2voxel prediction[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Las Vegas, NV, USA: IEEE, 2016: 402-409.[DOI: 10.1109/CVPRW.2016.57]