Semantic segmentation method combining context features with CNN multi-layer features
2019, Vol. 24, No. 12: 2200-2209
Received: 2019-04-01; Revised: 2019-06-22; Accepted: 2019-06-29; Published in print: 2019-12-16
DOI: 10.11834/jig.190087

Objective
Region-based semantic segmentation methods tend to miss detail information, which makes the segmentation results coarse and inaccurate. To address this problem, a semantic segmentation method that combines context features with convolutional neural network (CNN) multi-layer feature fusion is proposed.
Method
First, the selective search method is used to generate candidate regions of different scales from the image, yielding region feature masks. Second, a convolutional neural network is used to extract the features of each region, and high-level and low-level features are fused in parallel; because the feature maps extracted by different layers differ in size, the RefineNet model is used to fuse feature maps of different resolutions. Finally, the region feature masks and the fused feature maps are fed into a free-form region-of-interest pooling layer, and pixel-level classification labels for the image are obtained through a softmax classification layer.
Result
With the fusion of context features and CNN multi-layer features as its basic framework, the algorithm achieves good performance. The experiments mainly analyze CNN multi-layer feature fusion, the combination of background information with fused features, and the influence of the dropout value on the results. Tested on the SiftFlow dataset, the method reaches a pixel accuracy of 82.3% and an average accuracy of 63.1%. Compared with the current region-based end-to-end semantic segmentation model, pixel accuracy is improved by 10.6% and average accuracy by 0.6%.
Conclusion
The proposed algorithm combines the foreground and context information of regions, making full use of regional context. Dropout is used to reduce the number of network parameters and avoid over-fitting, and the RefineNet model is used to fuse CNN multi-layer features, so the multi-layer detail information of the image is effectively used for segmentation. This enhances the model's ability to discriminate small objects within regions and gives good segmentation results for images with occlusion and complex backgrounds.
Objective
Semantic segmentation plays an increasingly important role in visual analysis. It combines image classification, object detection, and image segmentation, classifying the pixels of an image. Semantic segmentation divides an image into regions with certain semantic meanings and identifies the semantic category of each region, realizing a semantic inference process from low to high levels and producing a segmented image with pixel-by-pixel semantic annotation. Semantic segmentation methods based on candidate regions extract free-form regions from the image, describe their features, classify them on a per-region basis, and convert the region-based predictions into pixel-level predictions. Although candidate-region-based models have contributed to the development of semantic segmentation, they need to generate many candidate regions, a process that requires a large amount of time and memory. In addition, the quality of the candidate regions extracted by different algorithms and the lack of spatial information in the candidate regions, especially the loss of information about small objects, directly affect the final segmentation. To solve the problem of rough segmentation results and low accuracy of region-based semantic segmentation methods caused by the lack of detailed information, a semantic segmentation method that fuses context features with the multi-layer features of convolutional neural networks is proposed in this study.
Method
First, candidate regions of different scales are generated from an image with the selective search method. Each candidate region consists of three parts, namely, a square bounding box, a foreground mask, and the foreground size. The foreground mask is a binary mask that covers the foreground of the region within the candidate area. Multiplying the square region features on each channel by the corresponding foreground mask yields the foreground features of the region. Selective search uses graph-based image segmentation to generate several sub-regions, iteratively merges them according to the similarity between sub-regions (i.e., color, texture, size, and spatial overlap), and outputs all possible regions of the target.
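The iterative merging at the core of selective search can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses only a color-histogram similarity (the full method also weighs texture, size, and spatial overlap), and the integer region ids and unweighted histogram averaging are assumptions made for the sketch.

```python
import numpy as np

def hist_similarity(h1, h2):
    # Histogram intersection: sum of element-wise minima of two
    # normalized color histograms (in [0, 1]; higher = more similar).
    return float(np.minimum(h1, h2).sum())

def merge_regions(histograms):
    """Greedy region merging in the spirit of selective search.

    `histograms` maps region id -> normalized color histogram.
    Every region produced along the way is kept as a proposal,
    so proposals exist at multiple scales.
    """
    regions = dict(histograms)
    proposals = list(regions.keys())
    next_id = max(regions) + 1
    while len(regions) > 1:
        # Find the most similar pair of remaining regions.
        ids = list(regions)
        a, b = max(
            ((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
            key=lambda p: hist_similarity(regions[p[0]], regions[p[1]]),
        )
        # Merge the pair; the real algorithm uses a size-weighted
        # histogram average, an unweighted one is used here.
        regions[next_id] = (regions.pop(a) + regions.pop(b)) / 2.0
        proposals.append(next_id)
        next_id += 1
    return proposals

# Four toy sub-regions with 8-bin normalized color histograms.
rng = np.random.default_rng(0)
hists = {i: (h := rng.random(8)) / h.sum() for i in range(4)}
print(len(merge_regions(hists)))  # 4 initial regions + 3 merges = 7 proposals
```

In the real algorithm the similarity would be recomputed only for pairs adjacent in the image; the all-pairs search here keeps the sketch short.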
Second, a convolutional neural network is used to extract the features of each region, and the high- and low-level features are fused in parallel. Parallel fusion combines features of the same data according to a fixed rule, and the feature dimensions must match before combination. Because the feature maps extracted from different layers have different sizes, the features obtained by each convolutional layer are reduced in dimension with linear discriminant analysis (LDA). By selecting a projection hyperplane in the multi-dimensional space, LDA makes the projections of samples from the same category onto the hyperplane likely to be closer than projections of samples from different categories. The target dimension of LDA depends only on the number of categories and is independent of the dimension of the data. The image dataset used in this work contains 33 categories, so LDA is used to reduce the feature dimension to 32, which decreases the number of network parameters. Moreover, LDA, as a supervised algorithm, can make good use of prior knowledge about the classes. Experimental results show that dimension reduction may lose some feature information but does not affect the segmentation result. After dimension reduction, the distance between different categories may increase and the distance within a category may decrease, which makes the classification task easier.
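The LDA reduction described above can be illustrated with scikit-learn's `LinearDiscriminantAnalysis`; the random features below merely stand in for CNN region features, and the sample counts are made up for the sketch:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, n_features = 33, 256                    # 33 SiftFlow-style categories
X = rng.normal(size=(n_classes * 10, n_features))  # stand-in CNN features
y = np.repeat(np.arange(n_classes), 10)            # 10 samples per class

# LDA can project onto at most (n_classes - 1) directions, independent
# of the input feature dimension -- hence 32 dimensions for 33 categories.
lda = LinearDiscriminantAnalysis(n_components=32)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)  # (330, 32)
```

Requesting `n_components=33` here would raise an error, which is exactly the "depends only on the number of categories" constraint stated above.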
The RefineNet model is used to fuse feature maps of different resolutions; five feature-map resolutions are fused in this work. The RefineNet network consists of three main components, namely, adaptive convolution, multi-resolution fusion, and chained residual pooling. The multi-resolution fusion component adapts the input feature maps with a convolution layer, upsamples them, and performs pixel-level addition. Its main task is to fuse multiple resolutions so as to counter the information loss caused by downsampling and to allow the image features extracted by each layer to contribute to the final segmentation network.
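The multi-resolution fusion step can be sketched in NumPy as follows; the shapes and channel counts are made up, a 1×1 convolution stands in for the adaptive convolution, and nearest-neighbour upsampling stands in for the upsampling layer:

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in). A 1x1 convolution is just a
    # channel-mixing matrix applied at every spatial position.
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    # Nearest-neighbour upsampling by a factor of 2 in H and W.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(high_res, low_res, w_high, w_low):
    """Simplified RefineNet-style multi-resolution fusion: adapt both
    inputs to a common channel count, upsample the smaller map to the
    larger spatial size, then add pixel-wise."""
    a = conv1x1(high_res, w_high)            # (32, 16, 16)
    b = upsample2x(conv1x1(low_res, w_low))  # (32, 8, 8) -> (32, 16, 16)
    return a + b

rng = np.random.default_rng(0)
high = rng.normal(size=(64, 16, 16))  # higher-resolution feature map
low = rng.normal(size=(128, 8, 8))    # lower-resolution feature map
w_h = rng.normal(size=(32, 64))       # adapt 64 -> 32 channels
w_l = rng.normal(size=(32, 128))      # adapt 128 -> 32 channels
print(fuse(high, low, w_h, w_l).shape)  # (32, 16, 16)
```

Because the sum happens at the higher resolution, spatial detail from the shallow map survives alongside the semantics of the deep map.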
Finally, the region feature mask and the fused feature map are fed into the free-form region-of-interest pooling layer, and the pixel-level classification labels of the image are obtained through the softmax classification layer.
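The final masking-and-classification step can be sketched as follows. All shapes are made up, and the mean pooling plus single linear classifier are simplifications standing in for the free-form RoI pooling and softmax layers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def region_probs(fused, box, mask, w):
    """Mask a region's features and classify it (simplified sketch).

    fused: (C, H, W) fused feature map; box: (r0, r1, c0, c1) square
    bounding box; mask: binary foreground mask of the box's size;
    w: (n_classes, C) classifier weights.
    """
    r0, r1, c0, c1 = box
    crop = fused[:, r0:r1, c0:c1]  # square region features
    fg = crop * mask               # per-channel foreground masking
    pooled = fg.mean(axis=(1, 2))  # pool the region to a (C,) vector
    return softmax(w @ pooled)     # class distribution for the region

rng = np.random.default_rng(0)
fused = rng.normal(size=(32, 16, 16))
box = (2, 10, 4, 12)                              # an 8x8 candidate box
mask = (rng.random((8, 8)) > 0.5).astype(fused.dtype)
w = rng.normal(size=(33, 32))                     # 33 semantic categories
probs = region_probs(fused, box, mask, w)
print(probs.shape, round(float(probs.sum()), 6))  # (33,) 1.0
```

The per-channel multiplication by the binary mask is exactly the "foreground features" operation described for the candidate regions above.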
Result
Context features and convolutional neural network (CNN) multi-layer features are used for semantic segmentation, and the resulting framework exhibits good performance. The experiments mainly cover CNN multi-layer feature fusion, the combination of background information and fused features, and the influence of the dropout value on the results. The trained model is tested on the SiftFlow dataset, achieving a pixel accuracy of 82.3% and an average accuracy of 63.1%. Compared with the current region-based, end-to-end semantic segmentation model, the pixel accuracy is increased by 10.6% and the average accuracy by 0.6%.
Conclusion
A semantic segmentation algorithm that combines context features with CNN multi-layer features is proposed in this study. The method combines the foreground and context information of each region to make full use of regional context. Dropout is employed to reduce the number of network parameters and avoid over-fitting, and the RefineNet model is used to fuse the multi-layer features of the CNN. By effectively using the multi-layer detail information of the image for segmentation, the model's ability to discriminate small objects within regions is enhanced, and segmentation is improved for images with occlusion and complex backgrounds. Experimental results show that the proposed method achieves better segmentation quality and higher robustness than several state-of-the-art methods.
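The dropout regularization mentioned above can be illustrated with a minimal inverted-dropout sketch; this shows the general technique, not the paper's exact layer:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: during training, zero each activation with
    probability p and scale the survivors by 1/(1-p) so the expected
    activation is unchanged; at test time, return x unchanged."""
    if not training:
        return x
    rng = rng or np.random.default_rng()
    keep = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * keep / (1.0 - p)

x = np.ones((4, 8))
train_out = dropout(x, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(x, p=0.5, training=False)
print(np.array_equal(eval_out, x))  # True: dropout is disabled at test time
```

Randomly silencing units during training prevents co-adaptation of features, which is how dropout combats the over-fitting discussed in the conclusion.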
Badrinarayanan V, Kendall A and Cipolla R. 2015. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Caesar H, Uijlings J and Ferrari V. 2015. Joint calibration for semantic segmentation//Proceedings of British Machine Vision Conference. Swansea, UK: BMVA Press [DOI: 10.5244/C.29.29]
Cao F M, Tian H J, Fu J and Liu J. 2019. Feature map slice for semantic segmentation. Journal of Image and Graphics, 24(3): 464-473
(曹峰梅, 田海杰, 付君, 刘静. 2019. 结合特征图切分的图像语义分割. 中国图象图形学报, 24(3): 464-473) [DOI: 10.11834/jig.180402]
Caesar H, Uijlings J and Ferrari V. 2016. Region-based semantic segmentation with end-to-end training//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 381-397 [DOI: 10.1007/978-3-319-46448-0_23]
Carreira J and Sminchisescu C. 2012. CPMC: automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7): 1312-1328 [DOI: 10.1109/TPAMI.2011.231]
Carreira J, Caseiro R, Batista J and Sminchisescu C. 2012. Semantic segmentation with second-order pooling//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer, 430-443 [DOI: 10.1007/978-3-642-33786-4_32]
Eigen D and Fergus R. 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2650-2658 [DOI: 10.1109/ICCV.2015.304]
Farabet C, Couprie C, Najman L and Lecun Y. 2013. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1915-1929 [DOI: 10.1109/TPAMI.2012.231]
Felzenszwalb P F and Huttenlocher D P. 2004. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2): 167-181 [DOI: 10.1023/b:visi.0000022288.19776.77]
George M. 2015. Image parsing with a wide range of classes and scene-level context//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 3622-3630 [DOI: 10.1109/CVPR.2015.7298985]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 1440-1448 [DOI: 10.1109/ICCV.2015.169]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 580-587 [DOI: 10.1109/CVPR.2014.81]
Hu H X, Deng Z W, Zhou G T, Sha F and Mori G. 2017. LabelBank: revisiting global perspectives for semantic segmentation [EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1703.09891.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 770-778 [DOI: 10.1109/CVPR.2016.90]
Jiang F, Gu Q, Hao H Z, Li N, Guo Y W and Chen D X. 2017. Survey on content-based image segmentation methods. Journal of Software, 28(1): 160-183
(姜枫, 顾庆, 郝慧珍, 李娜, 郭延文, 陈道蓄. 2017. 基于内容的图像分割方法综述. 软件学报, 28(1): 160-183) [DOI: 10.13328/j.cnki.jos.005136]
Jiang Z Y, Yuan Y and Wang Q. 2018. Contour-aware network for semantic segmentation via adaptive depth. Neurocomputing, 284: 27-35 [DOI: 10.1016/j.neucom.2018.01.022]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 1097-1105
Li H C, Xiong P F, An J and Wang L. 2018. Pyramid attention network for semantic segmentation [EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1805.10180.pdf
Lin G S, Milan A, Shen C H and Reid I. 2016. RefineNet: multi-path refinement networks for high-resolution semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 5168-5177 [DOI: 10.1109/CVPR.2017.549]
Ning Q Q, Zhu J K and Chen C. 2017. Very fast semantic image segmentation using hierarchical dilation and feature refining. Cognitive Computation, 10(1): 62-72 [DOI: 10.1007/s12559-017-9530-0]
Ren S, He K, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Sharma A, Tuzel O and Jacobs D W. 2015. Deep hierarchical parsing for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 530-538 [DOI: 10.1109/CVPR.2015.7298651]
Sharma A, Tuzel O and Liu M Y. 2014. Recursive context propagation network for semantic scene labeling. Advances in Neural Information Processing Systems, 3: 2447-2455.
Shelhamer E, Long J and Darrell T. 2014. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4): 640-651 [DOI: 10.1109/TPAMI.2016.2572683]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1409.1556.pdf
Srivastava N, Hinton G, Krizhevsky A, Sutskever I and Salakhutdinov R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929-1958
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2014. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Uijlings J R R, van de Sande K E A, Gevers T and Smeulders A W M. 2013. Selective search for object recognition. International Journal of Computer Vision, 104(2): 154-171 [DOI: 10.1007/s11263-013-0620-5]
Xiao F, Rui T, Ren T W and Wang D. 2019. Full convolutional network for semantic segmentation and object detection. Journal of Image and Graphics, 24(3): 474-482
(肖锋, 芮挺, 任桐炜, 王东. 2019. 全卷积语义分割与物体检测网络. 中国图象图形学报, 24(3): 474-482) [DOI: 10.11834/jig.180406]
Yang J M, Price B, Cohen S and Yang M H. 2014. Context driven scene parsing with attention to rare classes//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 3294-3301 [DOI: 10.1109/CVPR.2014.415]