Image saliency detection using a dense weak attention mechanism

Xiang Shengkai, Cao Tieyong, Fang Zheng, Hong Shizhan (Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China)

Abstract
Objective Research on salient object detection (SOD) based on fully convolutional networks (FCNs) has generally held that a larger decoding network achieves better detection than a small one, which leads to a huge number of parameters in the decoding stage. The visual attention mechanism alleviates this oversized-model problem to some extent. This paper divides attention mechanisms into two kinds, strong and weak: strong attention provides a stronger prior for decoding but carries a high risk; conversely, weak attention is less risky but provides a weaker prior. On this basis, we propose and verify the view that a small network architecture using weak attention can also reach the detection accuracy of a large network. Method We design two stages, global saliency prediction and edge optimization based on the weak attention mechanism, whose core is the proposed dense weak attention module. The module remedies the shortcoming of weak attention: with only a few additional parameters, it provides prior information no weaker than that of strong attention. Result Under the same experimental environment, the proposed model achieves overall better detection on five datasets. Meanwhile, the proposed method keeps the model size at 69.5 MB and reaches a real-time detection speed of 32 frames per second. The experimental results show that, compared with detection methods using strong attention, the proposed dense weak attention module gives the detection model better generalization. Conclusion Our goal is to use the weak attention mechanism to improve detection effectiveness, and we therefore designed a weak attention module that balances efficiency and risk. The weak attention mechanism improves the efficiency of decoding features, which compresses the model size, accelerates detection, and shows better generalization on existing test sets.
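To make the strong/weak distinction above concrete, here is a minimal PyTorch sketch (illustrative only; the tensor shapes are assumptions, not the paper's configuration) contrasting the two operations:

import torch

# Toy decoder features and a normalized attention mask with values in [0, 1].
feats = torch.randn(1, 64, 32, 32)   # N x C x H x W feature map
mask = torch.rand(1, 1, 32, 32)      # pre-estimated saliency/attention mask

# Strong attention: multiplicative. Where the mask is 0, the feature
# response is wiped out irreversibly, which is the overfitting risk
# described above.
strong = feats * mask

# Weak attention: additive. The mask only shifts features in feature space
# and preserves their distribution, but later convolutions can smooth the
# added prior away.
weak = feats + mask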
Dense weak attention model for salient object detection

Xiang Shengkai, Cao Tieyong, Fang Zheng, Hong Shizhan(Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China)

Abstract
Objective Salient object detection, also called saliency detection, aims to localize and segment the most conspicuous and eye-attracting objects or regions in an image. Many applications benefit from saliency detection, such as image and video compression, context-aware image retargeting, scene parsing, image resizing, object detection, and segmentation. The detection process consists of feature extraction and the mapping of features to saliency values. Most state-of-the-art salient object detection models extract features with a convolutional network pre-trained for classification. Related work has shown that models based on fully convolutional networks (FCNs) can encode semantically rich features, thereby improving the robustness and accuracy of saliency detection. The intuitive view is that a large, complex network performs better than a small, simple one; as a result, many current methods lack efficiency and require substantial storage resources. In recent years, the attention mechanism has been employed in many visual tasks to reduce decoding difficulty and produce lightweight networks. More specifically, an attention mechanism uses a pre-estimated attention mask to provide useful prior knowledge to the decoding process. This mechanism eases the mapping from features to saliency values and eliminates the need to design a large, complex decoding network. However, the widely used strong attention applies a multiplicative operation between the attention mask and the features. When the attention mask is normalized, that is, its values range from 0 to 1, a value of 0 irreversibly wipes out the distribution of certain features; using strong attention therefore carries an overfitting risk. By contrast, weak attention applies an additive operation and is less risky but also less efficient: it shifts the features in the feature space without destroying their distribution, yet the added information can be smoothed away by subsequent convolutional operations. The longer the sequence of convolutional layers is, the weaker the effect the attention mask exerts on the decoding features. This work contributes in three aspects: 1) we analyze the visual attention mechanism by dividing it into strong and weak attention and qualitatively explain how the attention mechanism improves decoding efficiency; 2) we discuss the principles of the two types of attention mechanism; and 3) we propose a dense weak attention module that uses features more efficiently than existing methods. Method Instead of applying weak attention only before the first convolutional layer, we apply it repeatedly and consecutively, that is, before every decoding convolutional layer. The proposed module is called the dense weak attention module (DWAM), and it yields an end-to-end detection model called the dense weak attention network. The proposed method inherits an FCN-like architecture, which consists of a sequence of convolutional, pooling, and activation layers. We fine-tune the VGG-16 backbone and divide the decoding network into two parts: global saliency detection and edge optimization using DWAM. A rough saliency map is predicted in the deepest branch of the network. The saliency map is then treated as an attention mask and concatenated to shallow features to predict a saliency map of higher resolution.
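The dense application of weak attention described above can be sketched as follows. This is a hypothetical PyTorch reconstruction, not the authors' implementation: the class name DenseWeakAttentionBlock, the channel counts, and the choice of concatenation as the non-multiplicative injection are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseWeakAttentionBlock(nn.Module):
    # Re-injects the coarse saliency map before *every* decoding convolution,
    # so the prior cannot be smoothed away by a long sequence of conv layers.
    def __init__(self, in_ch, out_ch, n_layers=3):
        super().__init__()
        # One extra input channel per conv for the concatenated attention mask.
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch + 1, out_ch, 3, padding=1)] +
            [nn.Conv2d(out_ch + 1, out_ch, 3, padding=1)
             for _ in range(n_layers - 1)])

    def forward(self, feats, coarse_sal):
        # Resize the coarse saliency map to the resolution of these features.
        mask = F.interpolate(coarse_sal, size=feats.shape[2:],
                             mode='bilinear', align_corners=False)
        x = feats
        for conv in self.convs:
            x = F.relu(conv(torch.cat([x, mask], dim=1)))
        return x

# Usage: shallow features (64 channels, 64 x 64) refined with a coarse
# 16 x 16 saliency map predicted by the deepest branch.
block = DenseWeakAttentionBlock(64, 64)
out = block(torch.randn(1, 64, 64, 64), torch.rand(1, 1, 16, 16))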
To output side saliency maps, we add a cross-entropy layer after each side output, a practice known as deep supervision, to optimize the network. We find that weak attention plays an important role in refining the detection result by providing effective prior information: with few additional parameters, it improves both detection accuracy and detection speed. To make the prediction more robust, atrous spatial pyramid pooling is used to enhance the detection of multiscale targets. Result We compared the proposed method with seven FCN-based state-of-the-art techniques on five widely used benchmarks, using three evaluation criteria: mean absolute error (MAE), F-measure, and the precision-recall curve. Under the same conditions, the proposed model produces more competitive results than the other state-of-the-art methods. The MAE of the proposed method is generally better than that of the other methods, which indicates that DWAM produces more accurate results at the pixel level. The F-measure of DWAM is approximately 2% to 6% higher than that of most state-of-the-art methods. In addition, the precision-recall curves show that DWAM has a slight advantage and a better balance between precision and recall. Meanwhile, the model size of the proposed method is only 69.5 MB, and its real-time detection speed reaches 32 frames per second. Conclusion In this study, we proposed an efficient fully convolutional salient object detection model that improves the efficiency of feature decoding and enhances generalization through the weak attention mechanism and deep-supervision training. Compared with existing methods, the proposed method yields more competitive results and faster detection while keeping the model small.
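The deep supervision mentioned above can be illustrated with a generic PyTorch loss over the side outputs. This is a sketch under assumptions (an unweighted sum of losses, nearest-neighbor resizing of the ground truth), not the paper's actual training code:

import torch
import torch.nn.functional as F

def deep_supervision_loss(side_outputs, gt):
    # side_outputs: list of N x 1 x h x w logit maps at different scales;
    # gt: N x 1 x H x W binary ground-truth saliency mask.
    loss = 0.0
    for logits in side_outputs:
        # Match the ground truth to each side output's resolution.
        target = F.interpolate(gt, size=logits.shape[2:], mode='nearest')
        loss = loss + F.binary_cross_entropy_with_logits(logits, target)
    return loss

# Usage with dummy side outputs at three scales:
gt = (torch.rand(1, 1, 64, 64) > 0.5).float()
sides = [torch.randn(1, 1, s, s) for s in (16, 32, 64)]
print(deep_supervision_loss(sides, gt))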
