超像素条件随机场下的RGB-D视频显著性检测
RGB-D video saliency detection via superpixel-level conditional random field
2021年26卷第4期，页码：872-882
纸质出版日期：2021-04-16
录用日期: 2020-10-10
DOI: 10.11834/jig.200122
李贝, 杨铀, 刘琼. 超像素条件随机场下的RGB-D视频显著性检测[J]. 中国图象图形学报, 2021,26(4):872-882.
Bei Li, You Yang, Qiong Liu. RGB-D video saliency detection via superpixel-level conditional random field[J]. Journal of Image and Graphics, 2021,26(4):872-882.
目的
视觉显著性在众多视觉驱动的应用中具有重要作用,这些应用领域出现了从2维视觉到3维视觉的转换,从而基于RGB-D数据的显著性模型引起了广泛关注。与2维图像的显著性不同,RGB-D显著性包含了许多不同模态的线索。多模态线索之间存在互补和竞争关系,如何有效地利用和融合这些线索仍是一个挑战。传统的融合模型很难充分利用多模态线索之间的优势,因此研究了RGB-D显著性形成过程中多模态线索融合的问题。
方法
提出了一种基于超像素下条件随机场的RGB-D显著性检测模型。提取不同模态的显著性线索,包括平面线索、深度线索和运动线索等。以超像素为单位建立条件随机场模型,联合多模态线索的影响和图像邻域显著值平滑约束,设计了一个全局能量函数作为模型的优化目标,刻画了多模态线索之间的相互作用机制。其中,多模态线索在能量函数中的权重因子由卷积神经网络学习得到。
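作为示意，下面给出与上述描述一致的全局能量函数的一种可能写法（仅为说明性的草图，λ、α、β 等符号为此处引入的假设记号，并非原文的精确公式）：

$$E(S)=\sum_{i\in\mathcal{V}}\sum_{m\in\{\text{平面},\,\text{深度},\,\text{运动}\}} w_{m,i}\bigl(S_i-S_i^{m}\bigr)^{2}+\lambda\sum_{(i,j)\in\mathcal{E}}\theta_{ij}\bigl(S_i-S_j\bigr)^{2},\qquad \theta_{ij}=\exp\!\bigl(-\alpha\lVert\mathbf{c}_i-\mathbf{c}_j\rVert^{2}-\beta\,(d_i-d_j)^{2}\bigr)$$

其中第1项为数据项，第2项为平滑项；$S_i$ 为超像素 $i$ 的融合显著值，$S_i^{m}$ 为各模态显著图在该超像素处的取值，$w_{m,i}$ 为卷积神经网络学习得到的权重因子，$\mathbf{c}_i$、$d_i$ 分别为超像素的RGB与深度特征，RGB与深度差异越小则 $\theta_{ij}$ 越大、平滑约束越强。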
结果
实验在两个公开的RGB-D视频显著性数据集上与6种显著性检测方法进行了比较，所提模型在所有相关数据集和评价指标上都优于当前最先进的模型。相比于第2高的指标，所提模型的AUC（area under curve）、sAUC（shuffled AUC）、SIM（similarity）、PCC（Pearson correlation coefficient）和NSS（normalized scanpath saliency）指标在IRCCyN数据集上分别提升了2.3%、2.3%、18.9%、21.6%和56.2%；在DML-iTrack-3D数据集上分别提升了2.0%、1.4%、29.1%、10.6%和23.3%。此外还进行了模型内部的比较，验证了所提融合方法优于其他传统融合方法。
结论
本文提出的RGB-D显著性检测模型中的条件随机场和卷积神经网络充分利用了不同模态线索的优势,将它们有效融合,提升了显著性检测模型的性能,能在视觉驱动的应用领域发挥一定作用。
Objective
Visual saliency detection aims to identify the most attractive objects or regions in an image and plays a fundamental role in many vision-based applications, such as target detection and tracking, visual content analysis, scene classification, image/video compression, image quality evaluation, and pedestrian detection. In recent years, the paradigm shift from 2D to 3D vision has triggered many interesting functionalities for these applications, but traditional RGB saliency detection models cannot produce satisfactory results in them. Thus, visual saliency detection models based on RGB-D data, which involve visual cues of different modalities, have attracted a large amount of research interest. Existing RGB-D saliency detection models usually consist of two stages. In the first stage, multimodality visual cues, including spatial, depth, and motion cues, are extracted from the color map and the depth map. In the second stage, these cues are fused to obtain the final saliency map via various fusion methods, such as linear weighted summation, the Bayesian framework, and the conditional random field (CRF). In recent years, learning-based fusion methods, such as support vector machines, AdaBoost, random forests, and deep neural networks, have also been widely studied. Several of the above fusion methods have achieved good results in RGB saliency models. However, unlike in traditional RGB saliency detection, the involved multimodality visual cues, and especially their individual saliency results, are in most cases substantially different from one another. This difference reveals the rivalry among multimodality saliency cues and brings difficulties to the fusion stage of RGB-D saliency models. Therefore, under the two-stage framework, a new challenge arises: how suitable features can be designed for the saliency maps of the corresponding multimodality visual cues to increase the probability of mutual fusion in the first stage, and how these saliency maps can be fused to obtain the final RGB-D visual saliency map in the second stage.
Method
An RGB-D saliency detection model based on a superpixel-level CRF is proposed, in which 3D scenes are represented by the video format of RGB maps and corresponding depth maps. The predicted saliency map is obtained in two stages: multimodality saliency cue extraction and final fusion. Multimodality saliency cues, including spatial, depth, and motion cues, are considered, and three independent saliency maps for these cues are computed. A saliency fusion algorithm is then proposed based on the superpixel-level CRF model. The graph structure of the CRF model is constructed by taking the superpixels as graph nodes, and each superpixel is connected to its adjacent superpixels. Based on this graph, a global energy function is designed to jointly consider the influence of the involved multimodality saliency cues and the smoothing constraint between neighboring superpixels. The global energy function consists of a data term and a smooth term. The data term describes the effects of the multimodality saliency maps on the final fused saliency map. Because multimodality saliency maps play different roles in various scenarios, three weighting maps of the multimodality saliency cues are learned via a convolutional neural network (CNN) and added to the data term. The smooth term constrains the difference between the saliency values of adjacent superpixels, and the constraint intensity is controlled by the RGB and depth differences between them: the smaller the differences of the RGB and depth vectors between two adjacent superpixels, the more likely these two superpixels are to have similar saliency values. The final predicted saliency map is obtained by optimizing the global energy function.
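As an illustration of this fusion stage, the following Python sketch minimizes a quadratic CRF energy of the form described above by coordinate descent over superpixels; the function name, argument shapes, and update rule are assumptions made here for exposition, not the authors' released implementation.

import numpy as np

def fuse_superpixel_saliency(cue_maps, cue_weights, edges, theta, lam=1.0, n_iters=100):
    # Hypothetical sketch of the fusion step (our reading of the described CRF energy).
    # cue_maps   : (M, N) per-cue saliency of N superpixels (spatial, depth, motion)
    # cue_weights: (M, N) CNN-learned weighting maps (shape assumed here)
    # edges      : list of (i, j) index pairs of adjacent superpixels
    # theta      : (len(edges),) smoothing strengths from RGB/depth similarity
    M, N = cue_maps.shape
    w_sum = cue_weights.sum(axis=0)
    data_num = (cue_weights * cue_maps).sum(axis=0)
    s = data_num / np.maximum(w_sum, 1e-8)           # init with the weighted mean of the cues
    nbrs = [[] for _ in range(N)]                    # adjacency list with smoothing strengths
    for k, (i, j) in enumerate(edges):
        nbrs[i].append((j, theta[k]))
        nbrs[j].append((i, theta[k]))
    # Coordinate descent on the quadratic energy
    #   E(s) = sum_i sum_m w[m,i] * (s_i - cue[m,i])^2 + lam * sum_(i,j) theta_ij * (s_i - s_j)^2;
    # each update below is the exact minimizer of E in s_i with the other values fixed.
    for _ in range(n_iters):
        for i in range(N):
            num, den = data_num[i], w_sum[i]
            for j, t in nbrs[i]:
                num += lam * t * s[j]
                den += lam * t
            s[i] = num / max(den, 1e-8)
    return np.clip(s, 0.0, 1.0)                      # fused per-superpixel saliency in [0, 1]

Because the energy is quadratic, each per-superpixel update solves its local subproblem in closed form, so the iteration converges quickly; a sparse linear solver would be an equivalent design choice.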
Result
In the experiments, the proposed model is compared with six state-of-the-art saliency detection models on two public RGB-D video saliency datasets, namely, IRCCyN and DML-iTrack-3D. Five popular quantitative metrics are used to evaluate the proposed model, including the area under curve (AUC), shuffled AUC (sAUC), similarity (SIM), Pearson correlation coefficient (PCC), and normalized scanpath saliency (NSS). Experimental results show that the proposed model outperforms state-of-the-art models on all involved datasets and evaluation metrics. Compared with the second highest scores, the AUC, sAUC, SIM, PCC, and NSS of our model increase by 2.3%, 2.3%, 18.9%, 21.6%, and 56.2%, respectively, on the IRCCyN dataset, and by 2.0%, 1.4%, 29.1%, 10.6%, and 23.3%, respectively, on the DML-iTrack-3D dataset. Moreover, comparisons with the saliency maps of the individual visual cues and with traditional fusion methods show that the proposed model achieves the best performance and that the proposed fusion method effectively takes advantage of the different visual cues. To verify the benefit of the proposed CNN-based weight-learning network, the weights of the multimodality saliency maps are set to the same value; the experimental results show that performance decreases after the weight-learning network is removed.
Conclusion
In this study, an RGB-D saliency detection model based on a superpixel-level CRF is proposed. The multimodality visual cues are first extracted and then fused by utilizing the CRF model with a global energy function. The fusion stage jointly considers the effects of the multimodality visual cues and the smoothing constraint on the saliency values of adjacent superpixels. Therefore, the proposed model makes full use of the advantages of the multimodality visual cues and avoids the conflicts caused by the competition among them, thus achieving better fusion results. The experimental results show that the five evaluation metrics of the proposed model are better than those of other state-of-the-art models on two RGB-D video saliency datasets. Thus, the proposed model can use the correlation among multimodality visual cues to effectively detect salient objects or regions in 3D dynamic scenes, which is believed to be helpful for 3D vision-based applications. In addition, the proposed model is a simple, intuitive combination of a traditional method and a deep learning method, and this combination can still be improved greatly. Future studies will focus on how to combine traditional methods and deep learning methods more effectively.
RGB-D显著性；显著性融合；条件随机场（CRF）；全局能量函数；卷积神经网络（CNN）
RGB-D saliency; saliency fusion; conditional random field (CRF); global energy function; convolutional neural network (CNN)