Salient object detection with an attention-guided network

He Wei, Pan Chen (College of Information Engineering, China Jiliang University, Hangzhou 310018)

Abstract
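Objective Salient object detection with fully convolutional models mostly relies on aggregating features from different levels, and how to better extract and aggregate these features remains a difficult research problem. Common multi-level feature fusion strategies are addition and concatenation, but they ignore the differences in receptive field size among convolutional layers and the differences in how much each resulting feature map contributes to the final saliency map. To address this, we combine channel attention and spatial attention to selectively and progressively aggregate deep and shallow feature information, so that features at different levels are passed on and aggregated more effectively. We propose a new saliency detection model, AGNet (attention-guided network), which uses several attention mechanisms to weight different feature information. Method The network consists mainly of a feature extraction module (FEM), a channel-spatial attention aggregation module (C-SAAM), and an attention residual refinement module (ARRM), and it is trained by minimizing the pixel position aware (PPA) loss. C-SAAM selectively aggregates shallow edge information and deep, abstract semantic features, using channel attention and spatial attention to keep redundant background information from affecting the saliency map during fusion. ARRM further refines the fused output and enhances the input to the next stage. Result Experiments on five public datasets show that AGNet achieves the best performance on several evaluation metrics. In particular, on the DUT-OMRON (Dalian University of Technology-OMRON) dataset, the F-measure is 1.9% higher than that of the second-ranked saliency detection model, and the MAE (mean absolute error) is 1.9% lower. The network is also fast enough to run in real time. Conclusion The proposed saliency detection model accurately segments salient object regions and provides clear local details.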
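The idea of weighting features by channel and by spatial position before fusing them can be illustrated with a small, dependency-free sketch. This is not the authors' C-SAAM: the abstract does not specify its layers, so the function names (`channel_attention`, `spatial_attention`, `attention_fuse`), the use of plain averages followed by a sigmoid, and the final summation are all assumptions chosen only to show the general mechanism. A real module would typically use learned convolutions or fully connected layers in place of the raw averages.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Return one weight per channel: sigmoid of that channel's global average.

    feat is a nested list with shape [C][H][W].
    """
    weights = []
    for channel in feat:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        weights.append(sigmoid(total / count))
    return weights


def spatial_attention(feat):
    """Return one weight per pixel: sigmoid of the mean across channels."""
    n_ch = len(feat)
    height, width = len(feat[0]), len(feat[0][0])
    return [
        [sigmoid(sum(feat[c][i][j] for c in range(n_ch)) / n_ch)
         for j in range(width)]
        for i in range(height)
    ]


def attention_fuse(shallow, deep):
    """Fuse two feature maps of identical shape [C][H][W] (illustrative only).

    Assumption: the deep features are reweighted per channel, the shallow
    features are gated per pixel by a spatial map computed from the deep
    features, and the two weighted maps are summed.
    """
    ch_w = channel_attention(deep)
    sp_w = spatial_attention(deep)
    height, width = len(deep[0]), len(deep[0][0])
    return [
        [[ch_w[c] * deep[c][i][j] + sp_w[i][j] * shallow[c][i][j]
          for j in range(width)]
         for i in range(height)]
        for c in range(len(deep))
    ]


# Toy inputs: one channel, 2x2 pixels.
shallow = [[[2.0, 2.0], [2.0, 2.0]]]
deep = [[[0.0, 0.0], [0.0, 0.0]]]
print(attention_fuse(shallow, deep))
```

Compared with plain addition or concatenation, the weights let the fusion suppress channels or positions that carry little saliency evidence, which is the motivation the abstract gives for using attention.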
Keywords
Salient object detection based on an attention-guided network

He Wei, Pan Chen (Department of Information Engineering, China Jiliang University, Hangzhou 310018, China)

Abstract
Objective Salient object detection aims to locate the most visually prominent part of an image and to segment the shape of the salient objects. Selective attention allows humans to allocate the brain's limited resources to the most important information in a visual scene, which gives the visual system its high efficiency and precision. Salient object detection simulates this attention mechanism of the human brain and is commonly applied in image editing, visual tracking, and robot navigation. Existing methods widely rely on low-level visual features such as brightness, color, and motion to detect salient objects. The lack of high-level semantic information constrains their ability to detect salient objects in complex scenes. The pyramid structure of deep convolutional neural networks (DCNNs) extracts both low-level information and high-level semantic information through repeated convolution and pooling operations. The feature extraction capabilities of convolutional neural networks have been widely applied in computer vision, and fully convolutional networks (FCNs) have been adopted for salient object detection. Common multi-level feature fusion strategies include addition and concatenation. However, these strategies often ignore the differences in how much different features contribute to salient objects, which leads to sub-optimal solutions. Low-level features and the fuzzy boundaries of high-level features can both reduce saliency detection accuracy. Hence, we design a new model for salient object detection. Our model assigns different weights to different features, and a variety of attention mechanisms guide the fusion of feature information block by block. Method We build a feature aggregation network based on attention mechanisms for salient object detection. 
The proposed network uses a variety of attention mechanisms to assign different weights to the information in different feature maps, which enables effective aggregation of deep and shallow features. The network is mainly composed of a feature extraction module (FEM), a channel-spatial attention aggregation module (C-SAAM), and an attention residual refinement module (ARRM). It is trained by minimizing the pixel position aware (PPA) loss. FEM obtains rich context information through multi-scale feature extraction. C-SAAM selectively aggregates the edge information of shallow features and the semantic information of deep features. Unlike addition and concatenation, C-SAAM uses channel attention and spatial attention to aggregate multi-layer features and to avoid fusing redundant information. ARRM further refines the fused output and enhances the input to the next stage. We use ResNet-50 as the backbone of the encoder and initialize it through transfer learning with parameters pretrained on ImageNet. The network is trained on the DUTS-TR dataset. During training, the input images and ground-truth masks are resized to 288×288 pixels, and an NVIDIA RTX 2080 Ti GPU is used. Mini-batch stochastic gradient descent (SGD) is used to optimize the network. The learning rate is set to 0.05, the momentum to 0.9, the weight decay to 5E-4, and the batch size to 24. Without a validation set, the model is trained for 30 epochs, and the whole training process takes 3 hours. At test time, inference on a 320×320 pixel image takes 0.02 s (50 frames/s), which meets real-time requirements. Result We compare our model with 13 other models on five public datasets. 
To comprehensively evaluate the effectiveness of the proposed model, we use the precision-recall (PR) curve, the F-measure score and curve, the mean absolute error (MAE), and the E-measure. On the challenging DUT-OMRON dataset, the F-measure is increased by 1.9% and the MAE is reduced by 1.9% compared with the second-best model. In addition, we plot the PR curves and F-measure curves on the five datasets to evaluate the segmented salient objects. Compared with other methods, our F-measure curve is the highest under different thresholds, which demonstrates the effectiveness of the proposed model. The visual examples show that our model predicts high-quality saliency maps and filters out non-salient areas. Conclusion Our aggregation network, guided by channel-spatial attention, effectively aggregates the high-level and low-level features extracted from the input image.
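As a reading aid for the two headline metrics, here is a minimal sketch of MAE and a single-threshold F-measure in plain Python. The abstract does not give the formulas, so the definitions below follow common practice in saliency evaluation and are assumptions: `beta2=0.3` is the conventional weighting of precision over recall, the fixed `threshold=0.5` stands in for the threshold sweep used to draw an F-measure curve, and the function names `mae` and `f_measure` are illustrative.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask.

    pred and gt are nested lists with the same [H][W] shape.
    """
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure of the saliency map binarized at one threshold.

    beta2 = 0.3 is an assumed, conventional value; it is not stated in the text.
    """
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            predicted_salient = p >= threshold
            if predicted_salient and g == 1:
                tp += 1
            elif predicted_salient:
                fp += 1
            elif g == 1:
                fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


# Toy 2x2 prediction and ground-truth mask.
pred = [[0.9, 0.2], [0.6, 0.1]]
gt = [[1, 0], [0, 1]]
print(mae(pred, gt), f_measure(pred, gt))
```

Calling `f_measure` over a range of thresholds and plotting the results would give an F-measure curve of the kind described above; a lower MAE and a higher F-measure both indicate a better saliency map.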
Keywords
