Multi-image object semantic segmentation by fusing segmentation priors
2019, Vol. 24, No. 6: 890-901
Received: 2018-09-26; Revised: 2018-11-03; Published in print: 2019-06-16
DOI: 10.11834/jig.180568
Objective
In object segmentation from sequence images or multi-view images, traditional co-segmentation algorithms are not robust for complex multi-image segmentation, and existing deep learning algorithms tend to produce segmentation errors and inconsistent segmentations when considerable ambiguity exists between foreground and background. To address this, we propose a multi-image segmentation algorithm based on deep features that fuses segmentation priors.
Method
First, to make the model better learn the detail features of multi-view images in complex scenes, the PSPNet-50 network model is improved by fusing the high-resolution detail features of shallow layers, reducing the effect of the spatial-information loss caused by network deepening on segmentation edge details. Then, segmentation priors for one or two images are obtained with an interactive segmentation algorithm, and this small number of priors is fused into the new model; re-training the network resolves the foreground/background segmentation ambiguity and enforces segmentation consistency across images. Finally, a fully connected conditional random field (CRF) model is constructed to couple the recognition ability of the deep convolutional neural network with the localization accuracy of fully connected CRF optimization, which better handles boundary localization.
Result
We tested the algorithm on multi-image sets from public datasets. Experimental results show that the algorithm not only segments object classes pre-trained on large amounts of data more accurately but also effectively avoids ambiguous region segmentation for object classes that were never pre-trained. For both simple image sets with clearly distinguishable foreground and background and more complex image sets whose foreground and background colors are similar, the average pixel accuracy (PA) and intersection over union (IOU) both exceed 95%.
Conclusion
The proposed algorithm is robust for multi-image segmentation in various scenes. By fusing a small number of priors, the model distinguishes objects from the background more effectively and achieves consistent segmentation of the target.
Objective
Object segmentation from multiple images involves locating the positions and extents of common target objects in a scene, as presented in a sequence image set or in multi-view images. This process is applied to various computer vision tasks, such as object detection and tracking, scene understanding, and 3D reconstruction. Early approaches treat object segmentation as histogram matching of color values and apply only to pair-wise images containing the same or similar objects. Later, object co-segmentation methods were introduced. Most of these methods take the MRF model as the basic framework and establish a cost function that consists of the energy within each image and the energy between images, using features computed from the gray or color values of pixels; the cost function is minimized to obtain consistent segmentation. However, when the foreground and background colors in the images are similar, co-segmentation cannot easily produce object segmentations with consistent regions. In recent years, with the development of deep learning, methods based on various deep learning models have been proposed. Some methods, such as the fully convolutional network, adopt convolutional neural networks to extract high-level semantic features of images and attain end-to-end classification at the pixel level. These algorithms obtain better precision than traditional methods, and they learn appropriate features automatically for individual classes without manual selection and adjustment of features. Exactly segmenting a single image requires combining multi-level spatial-domain information. Hence, multi-image segmentation not only demands fine-grained accuracy in local regions, as in single-image segmentation, but also requires balancing local and global information among multiple images. When ambiguous regions around the foreground and background are involved, or when sufficient prior information about the objects is not given, most deep learning methods tend to generate errors and inconsistent segmentations on sequential image sets or multi-view images.
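As a concrete illustration of the co-segmentation formulation described above, a generic two-image energy can be written as intra-image terms plus an inter-image consistency term; the following is a sketch of the general idea, not the exact cost function of any single cited method.

```latex
% Generic two-image co-segmentation energy (illustrative sketch):
% unary and pairwise terms within each image k, plus a penalty on the
% distance between the two foreground feature histograms.
E(\mathbf{x}^{1},\mathbf{x}^{2})
  = \sum_{k=1}^{2}\Big(\sum_{p}\phi^{k}_{p}(x^{k}_{p})
  + \sum_{(p,q)\in\mathcal{N}^{k}}\psi^{k}_{pq}(x^{k}_{p},x^{k}_{q})\Big)
  + \lambda\, d\big(h(\mathbf{x}^{1}),\, h(\mathbf{x}^{2})\big)
```

Here \(x^{k}_{p}\) is the foreground/background label of pixel \(p\) in image \(k\), \(h(\cdot)\) is the foreground color histogram, and \(d\) is a histogram distance; minimizing \(E\) encourages smooth per-image segmentations whose foregrounds agree across images.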
Method
In this study, we propose a multi-image segmentation method based on deep feature extraction. The method builds on the PSPNet-50 network model, in which a residual network extracts features in the first 50 layers. The extracted features are fed into a pyramid pooling module that uses pooling layers with differently sized pooling filters, and the features of different levels are then fused. After a convolutional layer and an up-convolutional layer, the initial end-to-end outputs are obtained. To make the model comprehensively learn detail features from multi-view images of complex scenes, we join the output features of the first and fifth parts of the network. The PSPNet-50 network model is thus improved by integrating the high-resolution details of the shallow layers, which also reduces the effect of spatial-information loss on segmentation edge details as the network deepens; a sketch of these two ideas follows.
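The following PyTorch sketch makes the two architectural ideas concrete: a pyramid pooling module over the deep features, and a skip connection that fuses high-resolution shallow features before the classifier. The backbone split points, channel widths, and bin sizes are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool at several bin sizes, project,
    upsample, and concatenate with the input features."""
    def __init__(self, in_ch, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bin_sizes)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for b in bin_sizes])

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), (h, w), mode='bilinear',
                                align_corners=False) for s in self.stages]
        return torch.cat([x] + pooled, dim=1)  # deep + multi-scale context

class FusedPSPNet(nn.Module):
    """ResNet-50 backbone + pyramid pooling, with shallow (stage-1)
    features fused back in to preserve edge detail (assumed layout)."""
    def __init__(self, n_classes):
        super().__init__()
        r = resnet50(weights=None)
        self.stage1 = nn.Sequential(r.conv1, r.bn1, r.relu)  # 1/2 res, 64 ch
        self.pool = r.maxpool
        self.deep = nn.Sequential(r.layer1, r.layer2, r.layer3, r.layer4)
        self.ppm = PyramidPooling(2048)          # 2048 -> 4096 channels
        self.reduce = nn.Sequential(
            nn.Conv2d(4096, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.skip = nn.Conv2d(64, 64, 1)         # project shallow features
        self.classifier = nn.Conv2d(256 + 64, n_classes, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        shallow = self.stage1(x)                 # high-res detail features
        deep = self.deep(self.pool(shallow))     # low-res semantic features
        deep = self.reduce(self.ppm(deep))
        deep = F.interpolate(deep, shallow.shape[2:], mode='bilinear',
                             align_corners=False)
        fused = torch.cat([deep, self.skip(shallow)], dim=1)
        logits = self.classifier(fused)
        return F.interpolate(logits, (h, w), mode='bilinear',
                             align_corners=False)
```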
In the training phase, the improved network model is first pre-trained on the ADE20k dataset, so that after this large-scale training it achieves strong robustness and generalization. Afterward, prior segmentations of the object in one or two images are obtained with an interactive segmentation approach. These few segmentation priors are fused into the new model, and the network is re-trained to resolve the ambiguous segmentation between foreground and background and the inconsistent segmentation among the multiple images. We analyze the relationship between the number of re-training iterations and segmentation accuracy over a large number of experiments to determine the optimal number of iterations; a sketch of this re-training step is given below.
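A minimal sketch of the re-training step, assuming a PyTorch model and one or two interactively labeled prior masks. The optimizer, learning rate, and iteration budget are placeholders, since the paper determines the iteration count experimentally.

```python
import torch
import torch.nn.functional as F

def retrain_on_priors(model, prior_images, prior_masks,
                      iterations=200, lr=1e-4):
    """Fine-tune a pre-trained segmentation model on a few prior masks.
    prior_images: list of (3, H, W) float tensors;
    prior_masks:  list of (H, W) long tensors of class indices."""
    model.train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(iterations):          # small budget: only 1-2 samples
        for img, mask in zip(prior_images, prior_masks):
            logits = model(img.unsqueeze(0))          # (1, C, H, W)
            loss = F.cross_entropy(logits, mask.unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()
    return model
```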
Finally, by constructing a fully connected conditional random field (CRF), the recognition ability of the deep convolutional neural network and the accurate localization ability of the fully connected CRF are coupled: the object region is effectively located, and the object edge is clearly delineated.
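The fully connected CRF of Krähenbühl and Koltun has a widely used open-source implementation, pydensecrf; the sketch below shows how the network's softmax outputs could be refined with it. The kernel parameters are common defaults, not values reported in the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, softmax_probs, n_iters=5):
    """image: (H, W, 3) uint8; softmax_probs: (C, H, W) float32
    per-class probabilities from the network."""
    c, h, w = softmax_probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))
    # smoothness kernel: nearby pixels prefer the same label
    d.addPairwiseGaussian(sxy=3, compat=3)
    # appearance kernel: nearby, similarly colored pixels prefer the same label
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)
    return np.argmax(np.array(q).reshape(c, h, w), axis=0)
```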
Result
We evaluate our method on multi-image sets from various public datasets showing outdoor buildings and indoor objects, and we compare our results with those of other deep learning methods, such as fully convolutional networks (FCN) and the pyramid scene parsing network (PSPNet). Experiments on the multi-view "Valbonne" and "Box" datasets show that our algorithm exactly segments the object region for re-trained classes while effectively avoiding ambiguous region segmentation for untrained object classes. To evaluate our algorithm quantitatively, we compute two commonly used accuracy measures, the average pixel accuracy (PA) and the intersection over union (IOU), sketched below, and use them to assess segmentation accuracy.
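For reference, both measures are straightforward to compute; a small NumPy sketch for the binary foreground/background case:

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels labeled correctly; pred, gt: (H, W) arrays in {0, 1}."""
    return np.mean(pred == gt)

def intersection_over_union(pred, gt):
    """IOU of the foreground class."""
    inter = np.logical_and(pred == 1, gt == 1).sum()
    union = np.logical_or(pred == 1, gt == 1).sum()
    return inter / union if union > 0 else 1.0
```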
Results show that our algorithm attains satisfactory scores not only on complex scene image sets with similar foreground and background contexts but also on simple image sets with obvious differences between foreground and background. For example, on the "Valbonne" set, the PA and IOU values of our result are 0.9683 and 0.9469, respectively, whereas those of FCN are 0.7027 and 0.6942, and those of PSPNet are 0.8509 and 0.8240; our method thus scores more than 20 percentage points higher than FCN and more than 10 percentage points higher than PSPNet. On the "Box" set, our method achieves a PA of 0.9946 and an IOU of 0.9577, whereas FCN and PSPNet cannot find the real region of the object because the "Box" class is not among their training classes. Similar improvements are observed on the other datasets. The average PA and IOU scores of our method exceed 0.95.
Conclusion
Experimental results demonstrate that our algorithm is robust in various scenes and achieves consistent segmentation in multi-view images. A small amount of fused prior information helps accurately predict pixel-level object regions and enables the model to effectively distinguish object regions from the background. The proposed approach consistently outperforms competing methods on both contained and non-contained object classes.
Li F X, Kim T, Humayun A, et al. Video segmentation by tracking many figure-ground segments[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013: 2192-2199. [DOI: 10.1109/ICCV.2013.273]
Zhang G M, Chen B B, Xu K, et al. New CV model combining fractional differential and image local information[J]. Journal of Image and Graphics, 2018, 23(8): 1131-1143. [DOI: 10.11834/jig.170580]
Martinović A, Knopp J, Riemenschneider H, et al. 3D all the way: semantic segmentation of urban scenes from start to end in 3D[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 4456-4465. [DOI: 10.1109/CVPR.2015.7299075]
Rother C, Minka T, Blake A, et al. Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs[C]//Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, NY, USA: IEEE, 2006: 993-1000. [DOI: 10.1109/CVPR.2006.91]
Mukherjee L, Singh V, Dyer C R. Half-integrality based algorithms for cosegmentation of images[C]//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009: 2028-2035. [DOI: 10.1109/CVPR.2009.5206652]
Hochbaum D S, Singh V. An efficient algorithm for co-segmentation[C]//Proceedings of 2009 IEEE International Conference on Computer Vision. Kyoto, Japan: IEEE, 2009: 269-276. [DOI: 10.1109/ICCV.2009.5459261]
Vicente S, Kolmogorov V, Rother C. Cosegmentation revisited: models and optimization[C]//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Crete, Greece: Springer, 2010: 465-479. [DOI: 10.1007/978-3-642-15552-9_34]
Joulin A, Bach F, Ponce J. Discriminative clustering for image co-segmentation[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 1943-1950. [DOI: 10.1109/CVPR.2010.5539868]
Vicente S, Rother C, Kolmogorov V. Object cosegmentation[C]//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA: IEEE, 2011: 2217-2224. [DOI: 10.1109/CVPR.2011.5995530]
Rubio J C, Serrat J, López A, et al. Unsupervised co-segmentation through region matching[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012: 749-756. [DOI: 10.1109/CVPR.2012.6247745]
Collins M D, Xu J, Grady L, et al. Random walks based multi-image segmentation: quasiconvexity results and GPU-based solutions[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012: 1656-1663. [DOI: 10.1109/CVPR.2012.6247859]
Dong X P, Shen J B, Shao L, et al. Interactive cosegmentation using global and local energy optimization[J]. IEEE Transactions on Image Processing, 2015, 24(11): 3966-3977. [DOI: 10.1109/TIP.2015.2456636]
Zhu Y F, Zhang Y J. Transductive co-segmentation of multi-view images[J]. Journal of Electronics & Information Technology, 2011, 33(4): 763-768. [DOI: 10.3724/SP.J.1146.2010.00839]
Djelouah A, Franco J S, Boyer E, et al. Multi-view object segmentation in space and time[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013: 2640-2647. [DOI: 10.1109/ICCV.2013.328]
Nguyen T N A, Cai J F, Zheng J M, et al. Interactive object segmentation from multi-view images[J]. Journal of Visual Communication and Image Representation, 2013, 24(4): 477-485. [DOI: 10.1016/j.jvcir.2013.02.012]
Shotton J, Johnson M, Cipolla R. Semantic texton forests for image categorization and segmentation[C]//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE, 2008: 1-8. [DOI: 10.1109/CVPR.2008.4587503]
Shotton J, Fitzgibbon A, Cook M, et al. Real-time human pose recognition in parts from single depth images[C]//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA: IEEE, 2011: 1297-1304. [DOI: 10.1109/CVPR.2011.5995316]
Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651. [DOI: 10.1109/TPAMI.2016.2572683]
Badrinarayanan V, Handa A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling[J]. arXiv preprint arXiv:1505.07293, 2015.
Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA: IEEE, 2017: 6230-6239. [DOI: 10.1109/CVPR.2017.660]
Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[J]. arXiv preprint arXiv:1412.7062, 2014.
Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. [DOI: 10.1109/TPAMI.2017.2699184]
He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778. [DOI: 10.1109/CVPR.2016.90]
Zhou B L, Zhao H, Puig X, et al. Semantic understanding of scenes through the ADE20K dataset[J]. arXiv preprint arXiv:1608.05442, 2016.
Gulshan V, Rother C, Criminisi A, et al. Geodesic star convexity for interactive image segmentation[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 3129-3136. [DOI: 10.1109/CVPR.2010.5540073]
Lecun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324. [DOI: 10.1109/5.726791]
Krähenbühl P, Koltun V. Efficient inference in fully connected CRFs with Gaussian edge potentials[C]//Proceedings of the International Conference on Neural Information Processing Systems. Granada, Spain: Curran Associates Inc., 2011: 109-117.
Visual Geometry Group. Multi-view and Oxford Colleges building reconstruction[EB/OL]. [2018-07-05]. http://www.robots.ox.ac.uk/vgg/data/data-mview.html
Kim H, Xiao H, Max N. Piecewise planar scene reconstruction and optimization for multi-view stereo[C]//Proceedings of the 11th Asian Conference on Computer Vision. Daejeon, Korea: Springer, 2012: 191-204. [DOI: 10.1007/978-3-642-37447-0_15]
Kowdle A, Sinha S N, Szeliski R. Multiple view object cosegmentation using appearance and stereo cues[C]//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer, 2012: 789-803. [DOI: 10.1007/978-3-642-33715-4_57]