Semantic-apparent feature-fusion-based unsupervised foreground segmentation method
2021, Vol. 26, No. 10, Pages: 2503-2513
Received: 2020-08-21; Revised: 2020-11-19; Accepted: 2020-11-26; Published in print: 2021-09-16
DOI: 10.11834/jig.200442
Objective
Foreground segmentation is an important task in image understanding. Under unsupervised conditions, different images and different instances often take highly variable forms of expression, which makes it difficult for methods based on fixed rules or a single type of feature to guarantee stable segmentation performance. To address this problem, this paper proposes an unsupervised foreground segmentation method based on semantic-apparent feature fusion (SAFF).
Method
Semantic features respond precisely to the key regions of foreground objects, but the resulting foreground segmentation tends to focus only on those key regions and lacks a complete expression of the object. Apparent features, represented by saliency and edges, provide richer detail information, yet rules based on appearance alone cannot cope with different instances and imaging patterns. To fuse the advantages of apparent and semantic features, we build an encoding method that fuses semantic and apparent information into unary region features and binary context features, achieving a comprehensive description of both types of feature expression. Next, an intra-image adaptive parameter learning method is designed to compute the most suitable feature weights and generate a foreground confidence score map. Further, a segmentation network is used to learn the common foreground features shared across instances.
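The abstract does not spell out how the unary region features are computed; below is a minimal sketch, assuming SLIC superpixels (Achanta et al., 2012, cited in the references) and mean pooling of the semantic (CAM) and apparent (saliency) maps over each region, with random placeholder maps standing in for the real feature extractors.

    import numpy as np
    from skimage.data import astronaut
    from skimage.segmentation import slic

    image = astronaut()                                  # placeholder test image
    labels = slic(image, n_segments=200, start_label=0)  # superpixel partition
    n_sp = labels.max() + 1

    # Random stand-ins for the semantic (CAM) heat map and apparent (saliency) map.
    rng = np.random.default_rng(0)
    semantic_map = rng.random(labels.shape)
    saliency_map = rng.random(labels.shape)

    # Unary features: mean response of each cue inside each superpixel.
    unary_sem = np.array([semantic_map[labels == i].mean() for i in range(n_sp)])
    unary_app = np.array([saliency_map[labels == i].mean() for i in range(n_sp)])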
Result
By fusing semantic and apparent features and learning common semantics across images, the proposed method clearly outperforms class activation mapping (CAM) and the discriminative regional feature integration (DRFI) method on the PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) 2012 training and validation sets, improving the F-measure by 3.5% and 3.4%, respectively.
Conclusion
The proposed method can take any semantic feature module and any apparent feature foreground computation module as its basic units, fuse and optimize the two strategies, and achieve better foreground segmentation performance.
Objective
Foreground segmentation is an essential task in the field of image understanding and a pre-processing step for salient object detection, semantic segmentation, and various pixel-level learning tasks. Given an image, the task aims to assign each pixel a foreground or background label. Fully supervised methods can achieve satisfactory results via multi-instance learning. Under unsupervised conditions, however, achieving stable segmentation performance with fixed rules or a single type of feature is difficult because different images and instances always have variable expressions. Moreover, we find that different types of methods have complementary strengths and weaknesses. On the one hand, semantic feature-based learning methods can extract the key regions of foregrounds accurately but cannot generate complete object regions with detailed edges. On the other hand, an apparent feature-based framework yields richer detailed expression but cannot adapt to the wide variety of cases.
Method
Based on these observations, we propose an unsupervised foreground segmentation method based on semantic-apparent feature fusion. First, given a sample, we encode it as semantic and apparent feature maps. We use a class activation mapping model pretrained on ImageNet to generate the semantic heat map, and select saliency and edge maps to express the apparent feature. Any kind of semantic or apparent feature can be used, and the established framework adapts readily to each case.
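As an illustration of the semantic branch, a minimal CAM implementation (Zhou et al., 2016) with an ImageNet-pretrained backbone is sketched below; the choice of ResNet-18 and the exact normalization are assumptions, since the abstract does not name the backbone.

    import torch
    import torch.nn.functional as F
    from torchvision import models, transforms

    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def cam_heatmap(img):
        """Return a normalized 224x224 class activation map for a PIL image."""
        x = preprocess(img).unsqueeze(0)
        feats = {}
        # Capture the last conv feature map (before global average pooling).
        hook = model.layer4.register_forward_hook(lambda m, i, o: feats.update(out=o))
        with torch.no_grad():
            logits = model(x)
        hook.remove()
        cls = logits.argmax(dim=1)                 # most confident ImageNet class
        w = model.fc.weight[cls]                   # (1, C) classifier weights
        cam = torch.einsum('oc,bchw->bohw', w, feats['out'])[0, 0]
        cam = F.relu(cam)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return F.interpolate(cam[None, None], size=(224, 224), mode='bilinear')[0, 0]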
Second, to combine the advantages of the two types of features, we split the image into superpixels and encode each with four elements: unary and binary semantic and apparent features, which together give a comprehensive description of the two types of expression. Specifically, we build two binary relation matrices that measure the similarity of each pair of superpixels, one based on the apparent feature and one on the semantic feature. To generate the binary semantic feature of each superpixel, we use the apparent feature-based similarity measure as the aggregation weight, while the semantic feature-based similarity measure is used in the same way to compute the binary apparent feature. Through this crossed view of feature encoding, the two types of information are fused for the first time.
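A minimal sketch of this crossed encoding follows, assuming a Gaussian kernel on the distance between unary responses as the similarity measure; the abstract does not give the actual similarity function or bandwidth.

    import numpy as np

    def binary_features(unary_sem, unary_app, sigma=0.1):
        """Cross-encode binary features: each cue is aggregated over all
        superpixels weighted by the similarity matrix of the *other* cue."""
        d_sem = np.abs(unary_sem[:, None] - unary_sem[None, :])
        d_app = np.abs(unary_app[:, None] - unary_app[None, :])
        S_sem = np.exp(-(d_sem / sigma) ** 2)   # semantic similarity matrix
        S_app = np.exp(-(d_app / sigma) ** 2)   # apparent similarity matrix
        bin_sem = (S_app @ unary_sem) / S_app.sum(axis=1)  # apparent weights semantic
        bin_app = (S_sem @ unary_app) / S_sem.sum(axis=1)  # semantic weights apparent
        return bin_sem, bin_app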
Then, we propose an adaptive parameter learning method to calculate the most suitable feature weights and generate the foreground confidence score map. Based on the four elements, we express each superpixel's foreground confidence score as a linear combination fitted by least squares. For an image, we first select superpixels whose unary semantic and apparent features give highly confident foreground or background scores. We then learn the weights of the four elements and the bias of the linear combination by least squares estimation. With these adaptive parameters, a better confidence score can be inferred for each superpixel individually.
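A sketch of the least-squares step under these definitions; the seed-selection thresholds (0.8 / 0.2) are illustrative assumptions, not values from the paper.

    import numpy as np

    def fit_and_score(unary_sem, unary_app, bin_sem, bin_app):
        # Design matrix: the four features plus a bias column.
        X = np.stack([unary_sem, unary_app, bin_sem, bin_app,
                      np.ones_like(unary_sem)], axis=1)
        fg = (unary_sem > 0.8) & (unary_app > 0.8)   # confident foreground seeds
        bg = (unary_sem < 0.2) & (unary_app < 0.2)   # confident background seeds
        seeds = fg | bg
        # Fit per-image weights on the seeds, then score every superpixel.
        w, *_ = np.linalg.lstsq(X[seeds], fg[seeds].astype(float), rcond=None)
        return X @ w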
Third, we use a segmentation network to learn common foreground features across different instances. In weakly supervised semantic segmentation, a fully supervised framework is often used to improve the pseudo annotations of training data and to provide inference results. Inspired by this idea, we use a convolutional network to mine common foreground features from different instances. The trained model can then optimize the quality of foreground segmentation both for the images used in training and for new data directly. A better performance is achieved by fusing semantic and apparent features and by cascading the intra-image adaptive feature weight learning module with the inter-image common feature learning module.
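A toy version of this stage is sketched below, training a small stand-in convolutional net on binarized confidence maps as pseudo-labels; the paper's actual architecture is not given in the abstract.

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, 1),                           # per-pixel foreground logit
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    images = torch.rand(8, 3, 64, 64)                  # placeholder image batch
    pseudo = (torch.rand(8, 1, 64, 64) > 0.5).float()  # binarized confidence maps

    for step in range(10):                             # toy training loop
        opt.zero_grad()
        loss = loss_fn(net(images), pseudo)
        loss.backward()
        opt.step()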
Result
We test our method on the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) 2012 training and validation sets, which include 10 582 and 1 449 samples, respectively. Precision-recall curves and the F-measure are used to evaluate the experimental results. Compared with typical semantic and apparent feature-based foreground segmentation methods, the proposed framework achieves clear improvements over its baselines: the F-measure increases by 3.5% on the PASCAL VOC 2012 training set and by 3.4% on the validation set. We also analyze visualized results to illustrate the advantages of the fusion framework. The comparison shows that the adaptive feature fusion operation produces accurate, detailed segmentations, while remaining incorrect cases are further corrected by the multi-instance learning framework.
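For reference, the F-measure combines precision and recall as

    F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\,\mathrm{Precision}+\mathrm{Recall}}

where beta^2 = 0.3 is the usual choice in salient object detection benchmarks; the abstract does not state which beta the paper adopts.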
Conclusion
In this study, we propose a semantic-apparent feature fusion method for unsupervised foreground segmentation. Given an image as input, we first compute the unary semantic and apparent features of each superpixel. Then we integrate the two types of features through the crossed use of apparent and semantic similarity measures, establishing a context relationship for each pair of superpixels to compute the binary feature of each region. Further, we establish an adaptive weight learning strategy: by automatically adjusting the influence of each feature dimension on the foreground estimate in each specific image instance, we obtain the weighting parameters for optimal foreground segmentation and the foreground confidence of the image. Finally, we build a foreground segmentation network model to learn the common foreground features shared across instances and samples; with the trained network, an image can be re-inferred to obtain more accurate foreground segmentation results. Experiments on the PASCAL VOC 2012 training and validation sets demonstrate the effectiveness and generalization ability of the algorithm. Moreover, the proposed method can take other foreground segmentation methods as a baseline and can be widely applied to improve tasks such as foreground segmentation and weakly supervised semantic segmentation. We also believe that introducing more types of semantic and apparent features into the fusion, and adopting alternate iterations to mine both the internal spatial context of an image and the common expression features across instances, is a feasible way to further improve foreground segmentation performance and an important idea for semantic segmentation tasks.
Achanta R, Shaji A, Smith K, Lucchi A, Fua P and Süsstrunk S. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11): 2274-2282 [DOI: 10.1109/TPAMI.2012.120]
Ahn J and Kwak S. 2018. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4981-4990 [DOI: 10.1109/CVPR.2018.00523]
Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L. 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI: 10.1109/TPAMI.2017.2699184]
Chen L C, Papandreou G, Schroff F and Adam H. 2017. Rethinking atrous convolution for semantic image segmentation [EB/OL]. [2020-08-07]. https://arxiv.org/pdf/1706.05587.pdf
Chen X Z, Kundu K, Zhang Z Y, Ma H M, Fidler S and Urtasun R. 2016. Monocular 3D object detection for autonomous driving//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2147-2156 [DOI: 10.1109/CVPR.2016.236]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [DOI: 10.1109/CVPR.2009.5206848]
Everingham M, Van Gool L, Williams C K I, Winn J and Zisserman A. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303-338 [DOI: 10.1007/s11263-009-0275-4]
Hariharan B, Arbeláez P, Bourdev L, Maji S and Malik J. 2011. Semantic contours from inverse detectors//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 991-998 [DOI: 10.1109/ICCV.2011.6126343]
Huang Z L, Wang X G, Wang J S, Liu W Y and Wang J D. 2018. Weakly-supervised semantic segmentation network with deep seeded region growing//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7014-7023 [DOI: 10.1109/CVPR.2018.00733]
Jiang H Z, Wang J D, Yuan Z J, Wu Y, Zheng N N and Li S P. 2013. Salient object detection: a discriminative regional feature integration approach//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 2083-2090 [DOI: 10.1109/CVPR.2013.271]
Li X, Ma H M and Luo X. 2020. Weaklier supervised semantic segmentation with only one image level annotation per category. IEEE Transactions on Image Processing, 29: 128-141 [DOI: 10.1109/TIP.2019.2930874]
Li X, Ma H M and Wang X. 2018a. Feature proposal model on multidimensional data clustering and its application. Pattern Recognition Letters, 112: 41-48 [DOI: 10.1016/j.patrec.2018.05.025]
Li X, Ma H M and Wang X. 2018b. Region proposal ranking via fusion feature for object detection//Proceedings of the 25th IEEE International Conference on Image Processing (ICIP). Athens, Greece: IEEE: 1298-1302 [DOI: 10.1109/ICIP.2018.8451326]
Li X, Ma H M, Wang X H and Zhang K. 2018c. Saliency detection via alternative optimization adaptive influence matrix model. Pattern Recognition Letters, 101: 29-36 [DOI: 10.1016/j.patrec.2017.11.006]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2020-08-07]. https://arxiv.org/pdf/1409.1556.pdf
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Wang P Q, Chen P F, Yuan Y, Liu D, Huang Z H, Hou X D and Cottrell G. 2018a. Understanding convolution for semantic segmentation//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Tahoe, USA: IEEE: 1451-1460 [DOI: 10.1109/WACV.2018.00163]
Wang X, Ma H M, Chen X Z and You S D. 2018b. Edge preserving and multi-scale contextual neural network for salient object detection. IEEE Transactions on Image Processing, 27(1): 121-134 [DOI: 10.1109/TIP.2017.2756825]
Wang X, You S D, Li X and Ma H M. 2018c. Weakly-supervised semantic segmentation by iteratively mining common object features//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1354-1362 [DOI: 10.1109/CVPR.2018.00147]
Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z Z, Du D L, Huang C and Torr P H S. 2015. Conditional random fields as recurrent neural networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1529-1537 [DOI: 10.1109/ICCV.2015.179]
Zhou B L, Khosla A, Lapedriza A, Oliva A and Torralba A. 2016. Learning deep features for discriminative localization//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2921-2929 [DOI: 10.1109/CVPR.2016.319]
Zitnick C L and Dollár P. 2014. Edge boxes: locating object proposals from edges//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 391-405 [DOI: 10.1007/978-3-319-10602-1_26]