A review of visual salient object detection in the context of deep learning

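Wang Ziquan1, Zhang Yongsheng1, Yu Ying1, Min Jie1, Tian Hao2 (1. College of Geospatial Information, Information Engineering University, Zhengzhou 450001; 2. 31434 Troops, Shenyang 110000)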

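Abstract
Visual salient object detection simulates the human visual and cognitive system, while deep learning simulates the way the human brain computes; combining the two can effectively advance computer vision. The task of visual salient object detection is to locate and extract salient object instances with well-defined contours from an image. With the development of deep learning, both the accuracy and the efficiency of visual salient object detection have improved greatly, but major challenges remain, such as improving the performance of mainstream algorithms and reducing the dependence on pixel-level annotated samples. To address these challenges, this paper classifies and summarizes the related work from the perspective of how salient object detection ideas are integrated with deep learning methods. 1) We analyze the insights offered by traditional salient object detection methods and their shortcomings, and point out that the core idea of visual salient object detection is the extraction, fusion, and refinement of multi-level features. 2) We analyze performance improvements for three mainstream families of algorithms (recurrent convolutional neural networks, fully convolutional networks, and generative adversarial networks) from the perspectives of improving feature encoding and information transfer structures, improving edge localization accuracy, improving attention mechanisms, improving training stability, and controlling noise; we also analyze methods for reducing the dependence on pixel-level annotated samples from the perspective of optimizing the modules that process weakly supervised samples. 3) We introduce co-saliency object detection, salient object detection in multi-category images, and future research problems and directions, and suggest possible solutions.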
Keywords
Review of deep learning based salient object detection

Wang Ziquan1, Zhang Yongsheng1, Yu Ying1, Min Jie1, Tian Hao2(1.College of Geospatial Information, Information Engineering University, Zhengzhou 450001, China;2.31434 Troops, Shenyang 110000, China)

Abstract
Salient object detection (SOD) simulates the human visual and cognitive system, and deep learning is a computational simulation of the human brain. Traditional SOD methods require complicated hand-crafted designs to extract multi-level features, which are then fused and refined using machine learning or other methods. With deep learning, each of these steps can be internalized in a neural network model, and a variety of algorithms have been built on this idea. To provide a reference for intelligent SOD methods, we sort out recent SOD work in detail from the perspective of principles, basic ideas, and algorithms. First, we briefly review the classic framework of traditional SOD methods and distill its central idea: the extraction, fusion, and refinement of multi-level features. The shortcomings of these methods are time-consuming preprocessing, complex feature design, and a lack of robustness. Traditional methods are based on contrast features, which tend to highlight the boundary of an object and the noise inside it. However, the SOD task is more concerned with the full extent of the object, whereas contrast features suppress its homogeneous interior. In addition, features extracted in the spatial domain are disturbed by illumination and complex backgrounds and are difficult to integrate effectively with other features, which makes the results unstable. Next, we analyze the fully supervised architectures for deep learning based SOD: the early fusion model, the family of recurrent convolutional neural network (CNN) architectures, the family of fully convolutional network architectures, and attention mechanisms that enhance feature extraction and fusion. The internal mechanisms of these methods and the connections between them are discussed. The "early fusion model" refers to the strategy, used in 2015, of fusing traditional features and deep learning features according to hand-designed rules (such as vector concatenation).
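The early fusion idea, in which hand-crafted and deep features are joined by a fixed rule such as vector concatenation (stitching), can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the reviewed methods: the function names, the L2 normalization step, the linear scorer, and the toy feature values are all hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def early_fusion(handcrafted: Sequence[float], deep: Sequence[float]) -> List[float]:
    """Fixed fusion rule: normalize each feature vector, then concatenate them."""
    return l2_normalize(handcrafted) + l2_normalize(deep)


def saliency_score(fused: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear scorer with a logistic squashing to (0, 1)."""
    z = sum(f * w for f, w in zip(fused, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def score_superpixels(handcrafted_feats, deep_feats, weights, bias):
    """Visit every superpixel in turn and score its fused feature vector."""
    return [
        saliency_score(early_fusion(h, d), weights, bias)
        for h, d in zip(handcrafted_feats, deep_feats)
    ]


# Toy values, made up for illustration only.
handcrafted_feats = [[3.0, 4.0], [1.0, 0.0]]      # e.g. contrast cues per superpixel
deep_feats = [[0.0, 6.0, 8.0], [0.0, 0.0, 2.0]]   # e.g. CNN activations per superpixel
weights = [1.0, 1.0, 0.0, 1.0, 1.0]               # hypothetical learned weights
scores = score_superpixels(handcrafted_feats, deep_feats, weights, bias=-2.8)
print(scores)
```

The fusion rule itself has no learnable part: the two feature sources only meet in the scorer, and the loop runs once per superpixel.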
These hand-designed fusion rules lack a theoretical basis and a principled mechanism. The CNN is used only to extract high-level features, and every superpixel has to be traversed and fed into the network, which is time-consuming. A recurrent CNN introduces a forgetting mechanism and continually updates its recognition result to identify the target. As a result, multi-level features aggregate naturally with fewer parameters, which achieves better results than a single feed-forward network. The fully convolutional network (FCN) supports end-to-end, pixel-level multi-class classification in complex backgrounds and has greatly enhanced detection capability. To aggregate multi-level features, current research builds on the FCN and has produced a large number of customized models based on multiple strategies, including improving the fusion method within the network and compensating for the network's limited accuracy in extracting boundary information. Recently, attention modules have become a useful supplement to neural network models. The attention mechanism not only improves the precision of SOD but also frees the idea of "saliency" from the assumption that it applies only to pixel-level classification, so the concept can also be used in object detection and lightweight models. Third, weakly supervised and multi-task strategies are promising because training fully supervised SOD methods requires expensive pixel-level annotation. A CNN trained for classification can attend to and locate objects using only image-level tag semantics, which provides an initial saliency map to be refined in detail. However, weakly supervised samples alone are insufficient for refining the initial saliency map, so a sample updating method needs to be designed.
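One way to picture how a classification network with only image-level tags can yield an initial saliency map is a class activation map: a weighted sum of feature channels using the classifier weights of the tagged class. The sketch below is an illustrative choice and is not claimed to be the method of any specific paper in the review. The function names, the two thresholds, and the toy feature maps are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def class_activation_map(feature_maps: List[Grid], class_weights: List[float]) -> Grid:
    """Weighted sum of feature channels, using the tagged class's classifier weights."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * cols for _ in range(rows)]
    for fmap, w in zip(feature_maps, class_weights):
        for r in range(rows):
            for c in range(cols):
                cam[r][c] += w * fmap[r][c]
    return cam


def min_max_normalize(grid: Grid) -> Grid:
    """Rescale values to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * len(row) for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def pseudo_labels(saliency: Grid, low: float = 0.3, high: float = 0.7) -> List[List[int]]:
    """1 = confident foreground, 0 = confident background, -1 = uncertain.

    The thresholds are hypothetical; uncertain pixels are the ones a sample
    updating step would need to resolve in later training rounds.
    """
    def label(v: float) -> int:
        if v >= high:
            return 1
        if v <= low:
            return 0
        return -1

    return [[label(v) for v in row] for row in saliency]


# Toy 3x3 feature maps with two channels, made up for illustration only.
feature_maps = [
    [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]],  # fires on the object
    [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]],  # fires on the background
]
class_weights = [1.0, -1.0]  # hypothetical classifier weights for the tagged class

coarse = min_max_normalize(class_activation_map(feature_maps, class_weights))
for row in pseudo_labels(coarse):
    print(row)
```

The coarse map locates the object but leaves some pixels undecided, which is the gap that refinement and sample updating are meant to close.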
Finally, we introduce the application and development of generative adversarial networks (GAN) and graph neural networks (GNN) in SOD, both of which have been applied to the task as neural network theory has developed. Based on a summary of the existing methods, we outline future directions for SOD, such as improving feature fusion, collaborative (co-saliency) object detection, weakly supervised and multi-task strategies, and saliency detection in multi-category images. In these scenarios, the data follow a more complex or fuzzy distribution (for example, they no longer lie in a Euclidean space), so solutions need a stronger capability to describe features.
Keywords
