Semantic segmentation method combining context features with CNN multi-layer features
2019, Vol. 24, No. 12: 2200-2209
Received: 2019-04-01; Revised: 2019-06-22; Accepted: 2019-06-29; Published in print: 2019-12-16
DOI: 10.11834/jig.190087

Objective
Region-based semantic segmentation methods tend to miss detail information, which makes the segmentation results coarse and inaccurate. To address this problem, a semantic segmentation method that combines context features with convolutional neural network (CNN) multi-layer feature fusion is proposed.
Method
First, the selective search method is used to generate candidate regions of different scales from the image, yielding region feature masks. Second, a convolutional neural network is used to extract the features of each region, and high-level and low-level features are fused in parallel; because the feature maps extracted by different layers differ in size, the RefineNet model is used to fuse feature maps of different resolutions. Finally, the region feature masks and the fused feature maps are fed into a free-form region-of-interest pooling layer, and pixel-level classification labels for the image are obtained through a softmax classification layer.
Result
With the fusion of context features and CNN multi-layer features as its basic framework, the algorithm achieves good performance. The experiments mainly analyze CNN multi-layer feature fusion, the combination of background information with fused features, and the influence of the dropout value on the results. Tested on the SiftFlow dataset, the method reaches a pixel accuracy of 82.3% and an average accuracy of 63.1%. Compared with the current region-based end-to-end semantic segmentation model, pixel accuracy is improved by 10.6% and average accuracy by 0.6%.
Conclusion
The proposed algorithm combines the foreground and context information of regions, making full use of regional context. Dropout is used to reduce the number of network parameters and avoid over-fitting, and the RefineNet model is used to fuse CNN multi-layer features, so the multi-layer detail information of the image is effectively used for segmentation. This enhances the model's ability to discriminate small objects within regions and gives good segmentation results for images with occlusion and complex backgrounds.
Objective
Semantic segmentation plays an increasingly important role in visual analysis. It combines image classification, object detection, and image segmentation, classifying the pixels of an image. Semantic segmentation divides an image into regions with certain semantic meanings and identifies the semantic category of each region, realizing a semantic inference process from low to high levels and producing a segmented image with pixel-by-pixel semantic annotation. Semantic segmentation methods based on candidate regions extract free-form regions from the image, describe their features, classify them on a per-region basis, and convert the region-based predictions into pixel-level predictions. Although candidate-region-based models have contributed to the development of semantic segmentation, they need to generate many candidate regions, a process that requires a large amount of time and memory. In addition, the quality of the candidate regions extracted by different algorithms and the lack of spatial information in the candidate regions, especially the loss of information about small objects, directly affect the final segmentation. To solve the problem of rough segmentation results and low accuracy of region-based semantic segmentation methods caused by the lack of detailed information, a semantic segmentation method that fuses context features with the multi-layer features of convolutional neural networks is proposed in this study.
Method
First, candidate regions of different scales are generated from an image with the selective search method. Each candidate region consists of three parts, namely, a square bounding box, a foreground mask, and the foreground size. The foreground mask is a binary mask that covers the foreground of the region within the candidate area. Multiplying the square region features on each channel by the corresponding foreground mask yields the foreground features of the region. Selective search uses graph-based image segmentation to generate several sub-regions, iteratively merges them according to the similarity between sub-regions (i.e., color, texture, size, and spatial overlap), and outputs all possible regions of the target.
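The iterative merging at the core of selective search can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses only a color-histogram similarity (the full method also weighs texture, size, and spatial overlap), and the integer region ids and unweighted histogram averaging are assumptions made for the sketch.

```python
import numpy as np

def hist_similarity(h1, h2):
    # Histogram intersection: sum of element-wise minima of two
    # normalized color histograms (in [0, 1]; higher = more similar).
    return float(np.minimum(h1, h2).sum())

def merge_regions(histograms):
    """Greedy region merging in the spirit of selective search.

    `histograms` maps region id -> normalized color histogram.
    Every region produced along the way is kept as a proposal,
    so proposals exist at multiple scales.
    """
    regions = dict(histograms)
    proposals = list(regions.keys())
    next_id = max(regions) + 1
    while len(regions) > 1:
        # Find the most similar pair of remaining regions.
        ids = list(regions)
        a, b = max(
            ((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
            key=lambda p: hist_similarity(regions[p[0]], regions[p[1]]),
        )
        # Merge the pair; the real algorithm uses a size-weighted
        # histogram average, an unweighted one is used here.
        regions[next_id] = (regions.pop(a) + regions.pop(b)) / 2.0
        proposals.append(next_id)
        next_id += 1
    return proposals

# Four toy sub-regions with 8-bin normalized color histograms.
rng = np.random.default_rng(0)
hists = {i: (h := rng.random(8)) / h.sum() for i in range(4)}
print(len(merge_regions(hists)))  # 4 initial regions + 3 merges = 7 proposals
```

In the real algorithm the similarity would be recomputed only for pairs adjacent in the image; the all-pairs search here keeps the sketch short.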
Second, a convolutional neural network is used to extract the features of each region, and the high- and low-level features are fused in parallel. Parallel fusion combines features of the same data according to a fixed rule, and the feature dimensions must match before combination. Because the feature maps extracted from different layers have different sizes, the features obtained by each convolutional layer are reduced in dimension with linear discriminant analysis (LDA). By selecting a projection hyperplane in the multi-dimensional space, LDA makes the projections of samples from the same category onto the hyperplane likely to be closer than projections of samples from different categories. The target dimension of LDA depends only on the number of categories and is independent of the dimension of the data. The image dataset used in this work contains 33 categories, so LDA is used to reduce the feature dimension to 32, which decreases the number of network parameters. Moreover, LDA, as a supervised algorithm, can make good use of prior knowledge about the classes. Experimental results show that dimension reduction may lose some feature information but does not affect the segmentation result. After dimension reduction, the distance between different categories may increase and the distance within a category may decrease, which makes the classification task easier.
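The LDA reduction described above can be illustrated with scikit-learn's `LinearDiscriminantAnalysis`; the random features below merely stand in for CNN region features, and the sample counts are made up for the sketch:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, n_features = 33, 256                    # 33 SiftFlow-style categories
X = rng.normal(size=(n_classes * 10, n_features))  # stand-in CNN features
y = np.repeat(np.arange(n_classes), 10)            # 10 samples per class

# LDA can project onto at most (n_classes - 1) directions, independent
# of the input feature dimension -- hence 32 dimensions for 33 categories.
lda = LinearDiscriminantAnalysis(n_components=32)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)  # (330, 32)
```

Requesting `n_components=33` here would raise an error, which is exactly the "depends only on the number of categories" constraint stated above.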
The RefineNet model is used to fuse feature maps of different resolutions; five feature-map resolutions are fused in this work. The RefineNet network consists of three main components, namely, adaptive convolution, multi-resolution fusion, and chained residual pooling. The multi-resolution fusion component adapts the input feature maps with a convolution layer, upsamples them, and performs pixel-level addition. Its main task is to fuse multiple resolutions so as to counter the information loss caused by downsampling and to allow the image features extracted by each layer to contribute to the final segmentation network.
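The multi-resolution fusion step can be sketched in NumPy as follows; the shapes and channel counts are made up, a 1×1 convolution stands in for the adaptive convolution, and nearest-neighbour upsampling stands in for the upsampling layer:

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in). A 1x1 convolution is just a
    # channel-mixing matrix applied at every spatial position.
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    # Nearest-neighbour upsampling by a factor of 2 in H and W.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(high_res, low_res, w_high, w_low):
    """Simplified RefineNet-style multi-resolution fusion: adapt both
    inputs to a common channel count, upsample the smaller map to the
    larger spatial size, then add pixel-wise."""
    a = conv1x1(high_res, w_high)            # (32, 16, 16)
    b = upsample2x(conv1x1(low_res, w_low))  # (32, 8, 8) -> (32, 16, 16)
    return a + b

rng = np.random.default_rng(0)
high = rng.normal(size=(64, 16, 16))  # higher-resolution feature map
low = rng.normal(size=(128, 8, 8))    # lower-resolution feature map
w_h = rng.normal(size=(32, 64))       # adapt 64 -> 32 channels
w_l = rng.normal(size=(32, 128))      # adapt 128 -> 32 channels
print(fuse(high, low, w_h, w_l).shape)  # (32, 16, 16)
```

Because the sum happens at the higher resolution, spatial detail from the shallow map survives alongside the semantics of the deep map.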
Finally, the region feature mask and the fused feature map are fed into the free-form region-of-interest pooling layer, and the pixel-level classification labels of the image are obtained through the softmax classification layer.
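The final masking-and-classification step can be sketched as follows. All shapes are made up, and the mean pooling plus single linear classifier are simplifications standing in for the free-form RoI pooling and softmax layers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def region_probs(fused, box, mask, w):
    """Mask a region's features and classify it (simplified sketch).

    fused: (C, H, W) fused feature map; box: (r0, r1, c0, c1) square
    bounding box; mask: binary foreground mask of the box's size;
    w: (n_classes, C) classifier weights.
    """
    r0, r1, c0, c1 = box
    crop = fused[:, r0:r1, c0:c1]  # square region features
    fg = crop * mask               # per-channel foreground masking
    pooled = fg.mean(axis=(1, 2))  # pool the region to a (C,) vector
    return softmax(w @ pooled)     # class distribution for the region

rng = np.random.default_rng(0)
fused = rng.normal(size=(32, 16, 16))
box = (2, 10, 4, 12)                              # an 8x8 candidate box
mask = (rng.random((8, 8)) > 0.5).astype(fused.dtype)
w = rng.normal(size=(33, 32))                     # 33 semantic categories
probs = region_probs(fused, box, mask, w)
print(probs.shape, round(float(probs.sum()), 6))  # (33,) 1.0
```

The per-channel multiplication by the binary mask is exactly the "foreground features" operation described for the candidate regions above.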
Result
Context features and convolutional neural network (CNN) multi-layer features are used for semantic segmentation, and the resulting framework exhibits good performance. The experiments mainly cover CNN multi-layer feature fusion, the combination of background information and fused features, and the influence of the dropout value on the results. The trained model is tested on the SiftFlow dataset, achieving a pixel accuracy of 82.3% and an average accuracy of 63.1%. Compared with the current region-based, end-to-end semantic segmentation model, the pixel accuracy is increased by 10.6% and the average accuracy by 0.6%.
Conclusion
A semantic segmentation algorithm that combines context features with CNN multi-layer features is proposed in this study. The method combines the foreground and context information of each region to make full use of regional context. Dropout is employed to reduce the number of network parameters and avoid over-fitting, and the RefineNet model is used to fuse the multi-layer features of the CNN. By effectively using the multi-layer detail information of the image for segmentation, the model's ability to discriminate small objects within regions is enhanced, and segmentation is improved for images with occlusion and complex backgrounds. Experimental results show that the proposed method achieves better segmentation quality and higher robustness than several state-of-the-art methods.
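The dropout regularization mentioned above can be illustrated with a minimal inverted-dropout sketch; this shows the general technique, not the paper's exact layer:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: during training, zero each activation with
    probability p and scale the survivors by 1/(1-p) so the expected
    activation is unchanged; at test time, return x unchanged."""
    if not training:
        return x
    rng = rng or np.random.default_rng()
    keep = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * keep / (1.0 - p)

x = np.ones((4, 8))
train_out = dropout(x, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(x, p=0.5, training=False)
print(np.array_equal(eval_out, x))  # True: dropout is disabled at test time
```

Randomly silencing units during training prevents co-adaptation of features, which is how dropout combats the over-fitting discussed in the conclusion.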
Badrinarayanan V, Kendall A and Cipolla R. 2015. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Caesar H, Uijlings J and Ferrari V. 2015. Joint calibration for semantic segmentation//Proceedings of British Machine Vision Conference. Swansea, UK: BMVA Press [DOI: 10.5244/C.29.29]
Cao F M, Tian H J, Fu J and Liu J. 2019. Feature map slice for semantic segmentation. Journal of Image and Graphics, 24(3): 464-473
(曹峰梅, 田海杰, 付君, 刘静. 2019. 结合特征图切分的图像语义分割. 中国图象图形学报, 24(3): 464-473) [DOI: 10.11834/jig.180402]
Caesar H, Uijlings J and Ferrari V. 2016. Region-based semantic segmentation with end-to-end training//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 381-397 [DOI: 10.1007/978-3-319-46448-0_23]
Carreira J and Sminchisescu C. 2012. CPMC: automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7): 1312-1328 [DOI: 10.1109/TPAMI.2011.231]
Carreira J, Caseiro R, Batista J and Sminchisescu C. 2012. Semantic segmentation with second-order pooling//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer, 430-443 [DOI: 10.1007/978-3-642-33786-4_32]
Eigen D and Fergus R. 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2650-2658 [DOI: 10.1109/ICCV.2015.304]
Farabet C, Couprie C, Najman L and Lecun Y. 2013. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1915-1929 [DOI: 10.1109/TPAMI.2012.231]
Felzenszwalb P F and Huttenlocher D P. 2004. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2): 167-181 [DOI: 10.1023/b:visi.0000022288.19776.77]
George M. 2015. Image parsing with a wide range of classes and scene-level context//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 3622-3630 [DOI: 10.1109/CVPR.2015.7298985]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 1440-1448 [DOI: 10.1109/ICCV.2015.169]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 580-587 [DOI: 10.1109/CVPR.2014.81]
Hu H X, Deng Z W, Zhou G T, Sha F and Mori G. 2017. LabelBank: revisiting global perspectives for semantic segmentation [EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1703.09891.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 770-778 [DOI: 10.1109/CVPR.2016.90]
Jiang F, Gu Q, Hao H Z, Li N, Guo Y W and Chen D X. 2017. Survey on content-based image segmentation methods. Journal of Software, 28(1): 160-183
(姜枫, 顾庆, 郝慧珍, 李娜, 郭延文, 陈道蓄. 2017. 基于内容的图像分割方法综述. 软件学报, 28(1): 160-183) [DOI: 10.13328/j.cnki.jos.005136]
Jiang Z Y, Yuan Y and Wang Q. 2018. Contour-aware network for semantic segmentation via adaptive depth. Neurocomputing, 284: 27-35 [DOI: 10.1016/j.neucom.2018.01.022]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 1097-1105
Li H C, Xiong P F, An J and Wang L. 2018. Pyramid attention network for semantic segmentation [EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1805.10180.pdf
Lin G S, Milan A, Shen C H and Reid I. 2016. RefineNet: multi-path refinement networks for high-resolution semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 5168-5177 [DOI: 10.1109/CVPR.2017.549]
Ning Q Q, Zhu J K and Chen C. 2017. Very fast semantic image segmentation using hierarchical dilation and feature refining. Cognitive Computation, 10(1): 62-72 [DOI: 10.1007/s12559-017-9530-0]
Ren S, He K, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Sharma A, Tuzel O and Jacobs D W. 2015. Deep hierarchical parsing for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 530-538 [DOI: 10.1109/CVPR.2015.7298651]
Sharma A, Tuzel O and Liu M Y. 2014. Recursive context propagation network for semantic scene labeling. Advances in Neural Information Processing Systems, 3: 2447-2455.
Shelhamer E, Long J and Darrell T. 2014. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4): 640-651 [DOI: 10.1109/TPAMI.2016.2572683]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1409.1556.pdf
Srivastava N, Hinton G, Krizhevsky A, Sutskever I and Salakhutdinov R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929-1958
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2014. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Uijlings J R R, van de Sande K E A, Gevers T and Smeulders A W M. 2013. Selective search for object recognition. International Journal of Computer Vision, 104(2): 154-171 [DOI: 10.1007/s11263-013-0620-5]
Xiao F, Rui T, Ren T W and Wang D. 2019. Full convolutional network for semantic segmentation and object detection. Journal of Image and Graphics, 24(3): 474-482
(肖锋, 芮挺, 任桐炜, 王东. 2019. 全卷积语义分割与物体检测网络. 中国图象图形学报, 24(3): 474-482) [DOI: 10.11834/jig.180406]
Yang J M, Price B, Cohen S and Yang M H. 2014. Context driven scene parsing with attention to rare classes//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 3294-3301 [DOI: 10.1109/CVPR.2014.415]