Published: 2020-01-16
DOI: 10.11834/jig.190187
2020 | Volume 25 | Number 1

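Image Understanding and Computer Vision

Image saliency detection using a dense weak attention mechanism

Xiang Shengkai, Cao Tieyong, Fang Zheng, Hong Shizhan

Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China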

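Received: 2019-05-15; Revised: 2019-07-01

Supported by: National Natural Science Foundation of China (61471394); Natural Science Foundation of Jiangsu Province for Excellent Young Scholars (BK20180080)

First author: Xiang Shengkai, born in 1994, male, master's student; his research interests are deep learning and saliency detection. E-mail: kelvinqaq@outlook.com;
Fang Zheng, male, Ph.D. candidate; his research interests are artificial intelligence and image processing. E-mail: 542050417@qq.com;
Hong Shizhan, male, master's student; his research interest is image processing. E-mail: 674081036@qq.com.

CLC number: TP391

Document code: A

Article ID: 1006-8961(2020)01-0136-12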

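Abstract

Objective Research on salient object detection (SOD) based on fully convolutional network (FCN) models generally holds that a larger decoding network detects better than a smaller one, which leads to a very large number of parameters in the decoding stage. The visual attention mechanism alleviates the problem of oversized models to some extent. This paper divides attention mechanisms into strong and weak attention: strong attention provides a stronger prior for decoding but carries a high risk, whereas weak attention carries less risk but provides a weaker prior. On this basis, we propose and verify the view that a small network architecture using weak attention can also reach the detection accuracy of large networks. Method We design two stages, global saliency prediction and edge optimization based on the weak attention mechanism, whose core is the proposed dense weak attention module. The module compensates for the shortcoming of weak attention and, with only a few extra parameters, provides prior information no weaker than that of strong attention. Result Under the same experimental environment, the proposed model achieves better overall detection results on five datasets. The proposed method also keeps the parameter size at 69.5 MB, and the detection speed reaches a real-time 32 frames/s. The experimental results show that, compared with detection methods using strong attention, the proposed dense weak attention module gives the detection model better generalization ability. Conclusion This paper aims to use the weak attention mechanism to improve detection performance, and to this end designs a weak attention module that balances efficiency and risk. The weak attention mechanism improves the efficiency of feature decoding, thereby compressing the model size and speeding up detection, and shows better generalization ability on the existing test sets.

Key words

salient object detection; visual attention mechanism; encoder-decoder; fully convolutional network; real-time detection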

Dense weak attention model for salient object detection

Xiang Shengkai, Cao Tieyong, Fang Zheng, Hong Shizhan

Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China

Supported by: National Natural Science Foundation of China (61471394); Natural Science Foundation of Jiangsu Province for Excellent Young Scholars (BK20180080)

Abstract

Objective Salient object detection, also called saliency detection, aims to localize and segment the most conspicuous and eye-attracting objects or regions in an image. Several applications have benefited from saliency detection, such as image and video compression, context-aware image retargeting, scene parsing, image resizing, object detection, and segmentation. The detection process includes feature extraction and mapping to the saliency value. Most state-of-the-art salient object detection models use features extracted from a pre-trained classification convolutional network. Related works have shown that models based on fully convolutional networks (FCNs) can encode semantic-rich features, thereby improving the robustness and accuracy of saliency detection. An intuitive opinion holds that a large, complex network performs better than a small, simple one. As a result, many current methods lack efficiency and require large storage resources. In the past few years, the attention mechanism has been employed to aid many visual tasks by reducing the decoding difficulty and producing lightweight networks. More specifically, the attention mechanism uses a pre-estimated attention mask to provide useful prior knowledge to the decoding process. This mechanism eases the mapping from features to the saliency value and thus removes the need to design a large and complex decoding network. However, the widely used strong attention applies a multiplicative operation between the attention mask and the features. When the attention mask is normalized, that is, when its values range from 0 to 1, a value of 0 irreversibly wipes out the distribution of the corresponding features. Thus, using strong attention may cause overfitting risks. In contrast, weak attention applies an additive operation, which is less risky but also less efficient. Weak attention shifts the features in the feature space and does not destroy the distribution.
However, the added information can be smoothed out by the subsequent convolutional operations. The longer the sequence of convolutional layers is, the less effect the attention mask exerts on the decoding features. This work contributes in three aspects: 1) We analyze the visual attention mechanism by dividing it into strong and weak attention and qualitatively explain how the attention mechanism improves the decoding efficiency. 2) We discuss the principles of the two types of attention mechanism. 3) We propose a dense weak attention module that improves the efficiency of utilizing the features compared with existing methods. Method Instead of applying the weak attention only before the first convolutional layer, we apply it repeatedly and consecutively (i.e., before every decoding convolutional layer). The proposed module is called the dense weak attention module (DWAM), and it leads to an end-to-end detection model called the dense weak attention network (DWAN). The proposed method inherits an FCN-like architecture, which consists of a sequence of convolutional, pooling, and activation layers. Building on a fine-tuned VGG-16 network, we divide the decoding network into two parts: global saliency detection and edge optimization using DWAM. A rough saliency map is predicted in the deepest branch of the network. Then, the saliency map is treated as an attention mask and concatenated to shallow features to predict a saliency map with increased resolution. To optimize the network, we add a cross-entropy loss layer after each side output, a process known as deep supervision. We find that weak attention plays an important role in optimizing the detection result by providing effective prior information. With few additional parameters, we achieve improved detection results and detection speed.
To achieve a more robust prediction, atrous spatial pyramid pooling is used to enhance the ability to detect multiscale targets. Result We compared the proposed method with seven FCN-based state-of-the-art techniques on five widely accepted benchmarks, using three indicators as evaluation criteria: mean absolute error (MAE), F measure, and precision-recall curve. Under the same conditions, the proposed model demonstrated more competitive results than the other state-of-the-art methods. The MAE of the proposed method is generally better than that of the other methods, which means that DWAM produces more accurate pixel-level results. DWAM's F measure is higher by approximately 2% to 6% than that of most of the state-of-the-art methods. In addition, the precision-recall curve shows that DWAM has a slight advantage and a better balance between precision and recall than the other techniques. Meanwhile, the model size of the proposed method is only 69.5 MB, and the detection speed reaches a real-time 32 frames per second. Conclusion In this study, we proposed an efficient fully convolutional salient object detection model that improves the efficiency of feature decoding and enhances the generalization ability through the weak attention mechanism and deep supervision training. Compared with existing methods, the results of the proposed method are more competitive and the detection speed is faster, even though the model remains small.

Key words

salient object detection (SOD); visual attention mechanism; encoder-decoder; fully convolutional networks (FCNs); real-time detection

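0 Introduction

The goal of salient object detection (SOD) is to imitate the human visual system and naturally separate the main objects of a scene from the rest of the image. It removes a large amount of meaningless information before subsequent image processing and is an effective image preprocessing technique. Saliency detection is widely applied in many fields, such as object retargeting (Sun and Ling, 2011; Vinyals et al., 2015), image retrieval (Cheng et al., 2017; Gao et al., 2012), image or video compression (Itti, 2004), and object tracking (Borji et al., 2012; Yang et al., 2018).

With the rise of deep learning and the emergence of convolutional neural networks (CNNs), most mainstream saliency detection algorithms are now based on fully convolutional networks (FCNs) (Shelhamer et al., 2017). Compared with traditional detection algorithms, FCN models unify the two stages of feature extraction and saliency computation and are optimized through supervised learning. In recent years, FCN-based models have greatly surpassed most traditional algorithms in detection performance. The features extracted by FCNs have proven to be more expressive and more robust than hand-crafted features.

Most FCN-based models follow an encoder-decoder or UNet (Ronneberger et al., 2015) design, as shown in Fig. 1. Both designs divide the network into two stages. The encoding stage, built on a deep convolutional network, gradually reduces the spatial resolution of the features and increases their dimensionality. The encoded features are highly abstract and contain rich semantic information. The decoding stage, built mainly from convolutional and deconvolutional layers, progressively interprets the encoded features and finally maps them to the solution space of the task.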

Fig. 1 Different network architectures ((a) normal CNN; (b) encoder-decoder architecture; (c) UNet-like architecture)

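Mainstream models design their decoding networks on top of existing image classification networks, most typically AlexNet (Krizhevsky et al., 2017), VGG Net (Simonyan and Zisserman, 2014), and ResNet (He et al., 2016).

The main difference between detection models lies in the design of the decoding network, which determines both detection performance and detection efficiency. Fig. 2 is a bubble chart of the parameter size, detection speed, and detection performance of detection models that use the VGG-16 backbone. The horizontal and vertical axes represent parameter size and detection speed, respectively; the size and color of each bubble represent the F measure, which reflects detection performance; the model labeled in red is the proposed one. The other mainstream models have two shortcomings. First, the decoding stage has a huge number of parameters, which is unacceptable for deployment in practical applications. Second, detection is slow and cannot meet the real-time requirement of 30 frames/s. This paper aims to greatly reduce the number of model parameters and speed up detection while maintaining detection performance.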

Fig. 2 Bubble chart comparing parameter size, detection speed, and detection performance of detection models with the same VGG-16 backbone

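In searching for a more reasonable design, and inspired by the visual attention mechanism, we further divide attention mechanisms into strong and weak types and use weak attention to fuse multiscale information, achieving efficient detection with a small network architecture.

1 Attention mechanism

In recent years, attention mechanisms have been used frequently in computer vision, natural language processing, and speech processing. The definition of attention differs across fields. In computer vision, it refers to focusing on a certain part of the image during processing; this part can be a local 2D spatial region of the image or an entire dimension. Attention mechanisms play an important role in tasks such as image and video classification (He et al., 2019; Peng et al., 2019), retrieval (Li et al., 2017), and semantic segmentation (Chen et al., 2016b).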

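The core of the visual attention mechanism is the attention map, a mask with the same dimensionality and/or resolution as the data, $\mathit{\boldsymbol{M}} \buildrel \Delta \over = \left[ {{m_i}} \right], i \in \mathit{\boldsymbol{ \boldsymbol{\varOmega} }}$, where $\mathit{\boldsymbol{ \boldsymbol{\varOmega} }}$ is the set of spatial coordinates of the mask and the value ${m_i} \in [0, 1]$ at a position is its attention strength; a larger value indicates that the point is more important. Taking a multi-channel feature map in a convolutional neural network as an example, $\mathit{\boldsymbol{D}} \in {\mathit{\boldsymbol{R}}^{H \times W \times C}}$ has three dimensions, namely height, width, and number of channels, whereas its spatial attention mask is two-dimensional, i.e., $\mathit{\boldsymbol{M}} \in {\mathit{\boldsymbol{I}}^{H \times W}}, \mathit{\boldsymbol{I}} \buildrel \Delta \over = \left[ {0, 1} \right]$. According to the role the attention mask plays in the computation, this paper divides attention mechanisms into two types:

1) Strong attention. $\mathit{\boldsymbol{M}}$ is multiplied with the data $\mathit{\boldsymbol{D}}$ as a weight, i.e., the output ${\mathit{\boldsymbol{\bar D}}}$ under strong attention is

$ {\mathit{\boldsymbol{\bar D}} _{(h, w, c)}} = {\mathit{\boldsymbol{M}}_{(h, w)}} \cdot {\mathit{\boldsymbol{D}}_{(h, w, c)}} $

(1)

where the subscripts $h, w, c$ are the indices along the height, width, and channel dimensions, respectively.

2) Weak attention. Let the number of output channels of the convolutional layer under weak attention be $C'$, and let $\mathit{\boldsymbol{w}}$ and $\mathit{\boldsymbol{v}}$ be the convolution kernels acting on the features and on the mask, respectively, both of size $S \times S$, i.e., $\mathit{\boldsymbol{w}} \in {\mathit{\boldsymbol{R}}^{S \times S \times C \times C'}}$ and $\mathit{\boldsymbol{v}} \in {\mathit{\boldsymbol{R}}^{S \times S \times 1 \times C'}}$. Then channel $c'$ of the output, ${{\mathit{\boldsymbol{\bar D}}}_{\left({c'} \right)}}$, is

$ {{\mathit{\boldsymbol{\bar D}}}_{\left({c'} \right)}} = {\mathit{\boldsymbol{v}}_{\left({c'} \right)}}*\mathit{\boldsymbol{M}} + \sum\limits_{i = 1}^C {{\mathit{\boldsymbol{w}}_{\left({i, c'} \right)}}} *{\mathit{\boldsymbol{D}}_{(i)}} + {b_{\left({c'} \right)}} $

(2)

where $*$ is the binary convolution operator, with the convolution template (kernel) on its left and the feature map on its right, $i$ and $c'$ are the indices of the input and output channels, respectively, and ${b_{\left({c'} \right)}}$ is the bias term.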

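From the viewpoint of the data space, the strong attention operation is equivalent to a linear scaling about the origin with a scaling factor in [0, 1], which has a considerable effect on the data distribution, whereas the weak attention operation is equivalent to a local translation, which has little effect on the data distribution. The reverse attention proposed by Chen et al. (2018) uses an inverted attention mask that sets features with sufficiently high saliency to zero. The benefit is that the feature distribution of the salient part concentrates near the origin of the feature space with very small variance, while the non-salient part has larger variance, so the network naturally tends to focus on regions whose saliency is not obvious.

However, the multiplication in strong attention also carries a large risk, especially where the mask value is 0: the multiplication sets all the corresponding features to zero, which amounts to erasing the distribution of those features. Strong attention therefore depends heavily on the quality of the mask estimate. Once the mask $\mathit{\boldsymbol{M}}$ is estimated inaccurately and no extra measures are taken, the information contained in those features cannot be recovered. In practice, this shows up as a tendency to overfit.

Weak attention is additive, so it never completely erases the feature distribution and avoids the risk of strong attention. Its drawback is that convolution attenuates it severely: the weak attention mask has a single channel and accounts for only a small share of the information relative to the multi-channel features, and convolution tends to smooth features. The more convolutional layers there are, the more the information provided by weak attention is smoothed, and it may even be filtered out entirely as noise. To solve this problem, Section 3 proposes adding weak attention masks at multiple layers to compensate for the signal attenuation that convolution may cause.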
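The contrast between Eqs. (1) and (2) can be illustrated with a minimal sketch at a single spatial location. It assumes 1×1 kernels (S = 1), so each convolution reduces to a dot product over channels; the feature values and weights are hypothetical and chosen only for illustration.

```python
# Sketch of Eqs. (1) and (2) at one spatial location, assuming 1x1 kernels
# so that each convolution reduces to a dot product over channels.

def strong_attention(features, m):
    """Eq. (1): scale every channel of the feature vector by the mask value m."""
    return [m * d for d in features]

def weak_attention(features, m, w, v, b):
    """Eq. (2) with S = 1.

    w[c_out][c_in] acts on the features, v[c_out] acts on the mask,
    and b[c_out] is the bias of output channel c_out.
    """
    return [
        v[c] * m + sum(w[c][i] * d for i, d in enumerate(features)) + b[c]
        for c in range(len(w))
    ]

features = [0.8, -1.2, 0.5]      # hypothetical feature vector, C = 3
w = [[1.0, 0.0, 0.0],            # hypothetical weights, C' = 2
     [0.0, 1.0, 1.0]]
v = [0.5, -0.5]
b = [0.0, 0.0]

strong_zero = strong_attention(features, 0.0)      # all feature information is gone
weak_zero = weak_attention(features, 0.0, w, v, b)  # features pass through unchanged
weak_one = weak_attention(features, 1.0, w, v, b)   # same features, shifted by v
```

With a mask value of 0, the strong-attention output is identically zero regardless of the features, whereas the weak-attention output still depends on them; changing the mask value only translates the weak-attention output by `v`.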

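2 Multiscale fusion

Multiscale fusion is an effective means of improving detection performance. Mainstream deep models exploit features from different levels of the backbone network, draw out multiple sub-network branches, and fuse the information from these branches for prediction. Fig. 3 shows three fusion paradigms, which differ in the stage at which fusion takes place.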

Fig. 3 Illustration of three common forms of merging multi-level information ((a) merge features; (b) merge predictions; (c) merge deeper prediction and shallow features)

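Again, take models that use the VGG-16 network as the encoder as examples. Feature fusion is performed in the feature extraction stage before prediction. For instance, the Amulet model of Zhang et al. (2017a) introduces a resolution-based feature fusion method so that each sub-network, working at its own resolution, obtains information from all levels.

Prediction fusion combines the predictions of multiple branches in the final prediction stage. For instance, the deep contrast learning model (DCL) of Li and Yu (2016a) downsamples the feature maps of each layer of the encoding network to a uniform resolution of 41×41 pixels, lets each detection branch predict independently, and then fuses all results. To optimize the predictions of each branch and make the model easier to train, the deeply supervised saliency model (DSS) of Hou et al. (2019) outputs saliency detection results at different levels. Observing that deep features localize objects accurately but yield very blurry edges, whereas shallow outputs contain richer details, DSS adopts dense short connections that concatenate the deep detection results to the shallow ones and fuses the outputs of all levels as the final detection result.

Liu and Han (2016) argue that simply concatenating multi-layer predictions easily lets errors from the shallow layers interfere, and propose concatenating the deep result into the decoding process of the shallow layers as prior information for decoding. Luo et al. (2017) apply local centering during feature decoding, pass the upsampled result to the adjacent shallower layer, and finally fuse the global and local results. The reverse attention saliency model (RAS) of Chen et al. (2018) multiplies the coarse detection result with shallower features to provide efficient prior information for the feature decoding of the shallow branches. The attention mechanism also provides stronger prior information than feature concatenation does.

Recently, some researchers have combined several fusion schemes and further improved detection performance. For example, Zhang et al. (2018) add a fusion scheme similar to Fig. 3(c) on top of feature fusion, and Hu et al. (2018) append multiscale fused features (MLIF) after every convolutional layer of the network branches.

The first fusion scheme (feature fusion) operates on high-dimensional features, which adds considerable storage and computation overhead. To reduce this overhead, the proposed model incorporates the weak attention mechanism: once the detection result of a deeper layer is obtained, it is used as an attention mask through the third scheme (fusing the deeper prediction with shallow features), which makes prediction in the shallow branches easier.

In addition, deep supervision lets every branch be supervised directly by the training target during training, which markedly improves the detection performance of multi-branch models and greatly reduces training difficulty.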

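3 Model

The proposed model is named the dense weak attention network (DWAN); its overall structure is shown in Fig. 4. VGG-16 serves as the encoding network, and the activations of $M$ layers, selected in order from shallow to deep, are taken as outputs, denoted ${\mathit{\boldsymbol{I}}^{(m)}}, m = 1, 2, \cdots, M$. The decoding network contains five branches (that is, $M = 5$), which receive the five outputs of the encoding network and output the saliency predictions of the corresponding branches, denoted ${{\mathit{\boldsymbol{\bar y}}}^{(m)}}, m = 1, 2, \cdots, M$.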

Fig. 4 The overall architecture of the proposed network

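Let $\mathit{\boldsymbol{T}} = \left\{ {\left({{\mathit{\boldsymbol{x}}_n}, {\mathit{\boldsymbol{y}}_n}} \right)|n = 1, 2, \cdots, N} \right\}$ be the training set, where $\left\{ {{\mathit{\boldsymbol{x}}_1}, {\mathit{\boldsymbol{x}}_2}, \cdots, {\mathit{\boldsymbol{x}}_N}} \right\}$ is the set of input images, denoted $\mathit{\boldsymbol{X}}$, and $\left\{ {{\mathit{\boldsymbol{y}}_1}, {\mathit{\boldsymbol{y}}_2}, \cdots, {\mathit{\boldsymbol{y}}_N}} \right\}$ is the set of annotated salient-object maps corresponding to the input images, denoted $\mathit{\boldsymbol{Y}}$.

3.1 Global saliency prediction

Because ${\mathit{\boldsymbol{I}}^{\left(M \right)}}$ is the deepest output of the encoding network, it contains the richest semantic information and makes it easier to predict the strongly salient regions of the image. The decoding network therefore first uses ${\mathit{\boldsymbol{I}}^{\left(M \right)}}$ to predict a global saliency map. For an input image ${\mathit{\boldsymbol{x}}_n}$, the global saliency detection branch first applies $L$ consecutive convolutional layers to extract more targeted features and compress the number of feature channels, giving the compressed decoded features

$ \begin{array}{l} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\mathit{\boldsymbol{A}}_n^{(M)} = \mathit{\boldsymbol{A}}_{n, L}^{(M)}\\ \mathit{\boldsymbol{A}}_{n, l}^{(M)} = \left\{ {\begin{array}{*{20}{l}} {\sigma \left({\mathit{\boldsymbol{w}}_l^{(M)}*\mathit{\boldsymbol{A}}_{n, l - 1}^{(M)} + \mathit{\boldsymbol{b}}_l^{(M)}} \right)}&{\quad l > 1}\\ {\sigma \left({\mathit{\boldsymbol{w}}_l^{(M)}*\mathit{\boldsymbol{I}}_n^{(M)} + \mathit{\boldsymbol{b}}_l^{(M)}} \right)}&{\quad l = 1} \end{array}} \right. \end{array} $

(3)

where $\sigma $ is the rectified linear unit (ReLU), $\sigma (x) = \max \left\{ {0, x} \right\}$, $\mathit{\boldsymbol{w}}$ is the convolution kernel, $\mathit{\boldsymbol{b}}$ is the convolution bias, and the subscript $1 \le l \le L$ indexes the consecutive convolutional layers.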

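To predict the saliency value at each spatial position, common methods use pointwise convolution with a 1×1 kernel. However, the fixed receptive field makes it hard for the network to detect objects of different sizes, and our experiments also showed that it makes the network harder to train. DeepLab (Chen et al., 2016a) from Google introduced the atrous spatial pyramid pooling (ASPP) module, which uses $K$ atrous convolution kernels with different atrous rates ${{r_j}}$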

$ \left\{ {\left({\mathit{\boldsymbol{w}}_j^{(m)};{r_j}} \right)} \right\}, j = 1, 2, \cdots, K $

(4)

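where the parameter ${r_j} \in {\mathit{\boldsymbol{Z}}^ + }$ is the atrous rate, i.e., the distance between two adjacent points of the rectangular kernel; when $r = 1$, the operation reduces to ordinary convolution. A larger atrous rate lets the kernel sample a wider range. Compared with an ordinary convolution of the same kernel size, atrous convolution multiplies the receptive field of the kernel without increasing the number of parameters or the amount of computation. The receptive field $V$, kernel size $S$, and atrous rate $r$ are related by

$ V = r \times (S - 1) + 1 $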

(5)
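As a quick check on how the atrous rate widens the sampled range, the following sketch counts the input positions covered by a one-dimensional atrous kernel directly from its tap offsets rather than from a closed form. The function names are hypothetical and used only for illustration.

```python
def dilated_taps(kernel_size, rate):
    """Offsets (along one axis) sampled by a kernel with the given atrous rate."""
    return [k * rate for k in range(kernel_size)]

def span(kernel_size, rate):
    """Number of input positions between the first and last tap, inclusive."""
    taps = dilated_taps(kernel_size, rate)
    return taps[-1] - taps[0] + 1

for rate in (1, 2, 3):
    # The number of taps (and hence parameters) stays at kernel_size,
    # while the covered span grows with the rate.
    print(rate, dilated_taps(3, rate), span(3, rate))
```

For a kernel of size 3, the covered span grows from 3 positions at rate 1 to 5 at rate 2 and 7 at rate 3, while the number of taps, and therefore the parameter count, stays at 3.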

Finally, the $K$ predictions are combined by a weighted sum, and the final prediction is

$ \bar y_n^{(M)} = \rho \left({\sum\limits_{j = 1}^K {{\lambda _j}} A_n^{(M)} + {b_j}} \right) $

(6)

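where ${\lambda _j}\left({j = 1, 2, \cdots, K} \right)$ are the weights, which can be implemented by a 1×1 pointwise convolution with one output channel, and ${{b_j}}$ is the $j$-th component of the convolution bias. $\rho $ is the sigmoid function, which normalizes the output, i.e.,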

$ \rho (x) = \frac{1}{{1 + {{\rm{e}}^{ - x}}}}, x \in (- \infty, + \infty) $

(7)
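The weighted fusion followed by the sigmoid can be sketched for a single pixel as below. This is a simplified illustration that assumes the $K$ atrous branches have already produced one score each and that a single scalar bias is added after the weighted sum; the scores and weights are hypothetical.

```python
import math

def sigmoid(x):
    """Eq. (7), written in a form that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fuse_scores(scores, weights, bias=0.0):
    """Weighted sum of K per-pixel scores followed by the sigmoid."""
    z = sum(w * s for w, s in zip(weights, scores)) + bias
    return sigmoid(z)

# Hypothetical scores from K = 3 atrous branches at one pixel.
saliency = fuse_scores([2.0, -1.0, 0.5], [0.5, 0.3, 0.2])
```

The sigmoid maps any weighted sum into the interval (0, 1), so the fused value can be read directly as a saliency value for that pixel.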

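To compute the loss, the prediction map must have the same size as the annotation map. The model uses a deconvolutional upsampling layer to enlarge the prediction map to the original size. The loss function of the global saliency map is

$ {L_G} = - \frac{1}{N}\sum\limits_{n = 1}^N {\sum\limits_{i \in \mathit{\Omega }} {\left[ {{y_{n, i}} \cdot \ln \left({\bar y_{n, i}^{(M)}} \right) + \left({1 - {y_{n, i}}} \right) \cdot \ln \left({1 - \bar y_{n, i}^{(M)}} \right)} \right]} } $

(8)

where $\mathit{\Omega }$ is the set of spatial coordinates and ${y_{n, i}}$ is the annotated value of pixel $i$ in image $n$.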
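The pixel-wise cross-entropy loss can be sketched as follows, assuming each prediction and annotation map is flattened into a list of pixel values. The clamping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is not part of the source formulation.

```python
import math

def bce_loss(preds, labels, eps=1e-7):
    """Sum of pixel-wise cross-entropy per image, averaged over the N images.

    preds[n][i] is the predicted saliency in [0, 1]; labels[n][i] is 0 or 1.
    """
    total = 0.0
    for pred, label in zip(preds, labels):
        for p, y in zip(pred, label):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(preds)

# One image with two pixels: an uninformative prediction of 0.5 everywhere.
loss_uninformative = bce_loss([[0.5, 0.5]], [[1, 0]])
# A prediction that matches the annotation gives a loss close to zero.
loss_perfect = bce_loss([[1.0, 0.0]], [[1, 0]])
```

An uninformative prediction of 0.5 at every pixel costs ln 2 per pixel, whereas a prediction that matches the annotation drives the loss toward zero.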

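3.2 Edge optimization guided by weak attention

To recover the object edge details lost in the global saliency map, the proposed dense weak attention module uses the global saliency map to provide prior information for the shallow branches and help them localize objects accurately; at the same time, the details contained in the high-resolution shallow features help recover the edge information lost in the global saliency map, yielding higher-quality detection results. As shown in Fig. 5, the proposed dense weak attention module also uses $L$ convolutional layers to extract targeted features when decoding the shallow features. Adding the deeper prediction makes the salient regions easier to predict.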

Fig. 5 Dense weak attention module (DWAM) for feature decoding

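For an input image ${\mathit{\boldsymbol{x}}_n}$, the inputs of branch $m$ are the encoder output $\mathit{\boldsymbol{I}}_n^{(m)}$ and the saliency prediction map $\mathit{\boldsymbol{\bar y}}_n^{(m + 1)}$ of the deeper branch $m + 1$. For brevity, the prediction map is assumed here to have already been upsampled by a factor of 2 so that it can be concatenated. The activations after $L$ layers of weak-attention-compensated convolution are

$ \begin{array}{l} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\mathit{\boldsymbol{A}}_n^{(m)} = \mathit{\boldsymbol{A}}_{n, L}^{(m)}\\ \mathit{\boldsymbol{A}}_{n, l}^{(m)} = \left\{ {\begin{array}{*{20}{l}} {\sigma \left({\mathit{\boldsymbol{w}}_l^{(m)}*\mathit{cat}\left({\mathit{\boldsymbol{A}}_{n, l - 1}^{(m)}, \mathit{\boldsymbol{\bar y}}_n^{(m + 1)}} \right) + \mathit{\boldsymbol{b}}_l^{(m)}} \right)}&{l > 1}\\ {\sigma \left({\mathit{\boldsymbol{w}}_l^{(m)}*\mathit{cat}\left({\mathit{\boldsymbol{I}}_n^{(m)}, \mathit{\boldsymbol{\bar y}}_n^{(m + 1)}} \right) + \mathit{\boldsymbol{b}}_l^{(m)}} \right)}&{l = 1} \end{array}} \right. \end{array} $

(9)

where $\mathit{cat}\left(\cdot \right)$ is the concatenation operation, i.e., stacking along the channel dimension. The weak attention mask has only one channel and therefore provides only a small fraction of the information compared with the feature channels, so the prediction map is concatenated before each of the $L$ convolutional layers in the decoding stage to compensate for the attenuation of the prior information after convolution.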
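The dense concatenation in Eq. (9) can be sketched at a single pixel, where a 1×1 convolution reduces to a matrix-vector product. This is a minimal illustration under that simplifying assumption; the layer weights and feature values are hypothetical, and the actual kernel sizes of the model are not specified here.

```python
def relu(x):
    """Element-wise ReLU on a list of channel values."""
    return [max(0.0, v) for v in x]

def linear(weights, bias, x):
    """A 1x1 convolution at one pixel: a matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def dense_weak_attention(feature, mask_value, layers):
    """Concatenate the mask value before every layer, then apply conv + ReLU."""
    a = feature
    for weights, bias in layers:
        a = relu(linear(weights, bias, a + [mask_value]))  # cat(A, y) along channels
    return a

# Two hypothetical layers; each weight row has one extra column for the mask channel.
layers = [
    ([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]], [0.0, 0.0]),
    ([[0.5, 0.5, 1.0], [1.0, -1.0, 0.0]], [0.0, 0.0]),
]
feature = [1.0, 2.0]

low = dense_weak_attention(feature, 0.0, layers)   # [1.5, 0.0]
high = dense_weak_attention(feature, 1.0, layers)  # [3.0, 0.0]
```

Because the mask value re-enters at every layer, its influence on the output does not rely solely on surviving the earlier convolutions, which is the compensation the dense design aims for.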

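The prediction stage uses ASPP multiscale prediction, as in global saliency prediction, and the prediction of branch $m$ serves as the weak attention mask input of the next shallower branch $m - 1$. The edge optimization stage contains $M - 1$ branch outputs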

$ \bar y_n^{(m)} = \rho \left({\sum\limits_{j = 1}^K {{\lambda _j}} \mathit{\boldsymbol{A}}_n^{(m)} + {b_j}} \right), m = 1, 2, \cdots, M - 1 $

(10)

The corresponding loss function is

$ L_R^{(m)} = - \frac{1}{N}\sum\limits_{n = 1}^N {\sum\limits_{i \in \mathit{\Omega }} {\left[ {{y_{n, i}} \cdot \ln \left({\bar y_{n, i}^{(m)}} \right) + \left({1 - {y_{n, i}}} \right) \cdot \ln \left({1 - \bar y_{n, i}^{(m)}} \right)} \right]} } $

(11)

The overall loss function of the model is

$ L = {L_G} + \sum\limits_{m = 1}^{M - 1} {{\lambda _m}} L_R^{(m)} $

(12)

where ${{\lambda _m}}$ is the loss weight of edge optimization branch $m$; in our experiments ${\lambda _m}{\rm{ = }}1$. The final output of the model is the branch output ${\bar y_n^{(1)}}$ at $m = 1$.
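The deeply supervised objective in Eq. (12) is simply the global loss plus a weighted sum of the branch losses. A minimal sketch, with hypothetical loss values and the branch weights defaulting to 1 as in the experiments:

```python
def total_loss(global_loss, branch_losses, branch_weights=None):
    """Eq. (12): global loss plus the weighted edge-optimization branch losses."""
    if branch_weights is None:
        branch_weights = [1.0] * len(branch_losses)  # lambda_m = 1
    return global_loss + sum(w * l for w, l in zip(branch_weights, branch_losses))

# Hypothetical values for one global branch and M - 1 = 4 refinement branches.
loss = total_loss(0.5, [0.25, 0.25, 0.5, 0.5])
```

Every branch contributes its own term, so each one receives a gradient directly from the training target rather than only through the final output.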

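4 Experiments and analysis

The experiments use several datasets of different difficulty and three widely accepted metrics to evaluate detection performance. They are implemented with the Caffe deep learning framework on hardware comprising an Intel Xeon E5 2680v3 CPU, 128 GB of memory, and a single NVIDIA TITAN X graphics accelerator. To ensure fairness, all tests were carried out in this environment.

4.1 Datasets and training parameters

The model is trained on the MSRA-B dataset (Liu et al., 2007), which contains 5 000 natural images; most images contain only one salient object, and the resolution does not exceed 400×400 pixels. Following the experience of other researchers, the training images are flipped horizontally, doubling the number of training images, to reduce overfitting. For a fair comparison of the detection performance of different algorithms, all models were retrained on this training set until convergence. Testing is carried out on the following five datasets. DUT-OMRON (Yang et al., 2013) contains 5 172 more challenging images, in which a single image can contain multiple salient objects. ECSSD (Shi et al., 2016) contains 1 000 images manually selected from the Internet and annotated. THUR15K (Cheng et al., 2014) contains 6 232 annotated images in five categories. MSRA10K (Cheng et al., 2015) is the most similar to MSRA-B and is an extension of it, containing 10 000 images in total. HKU-IS (Li and Yu, 2016b) contains 4 447 selected images in which objects are occluded or touch the image boundary and the overall contrast is lower, making it fairly challenging.

The network parameters are updated with stochastic gradient descent (SGD) with momentum and the backpropagation (BP) algorithm. The model is trained for 20 000 steps in total. The initial learning rate is set to 10^-8 and is multiplied by 0.1 every 7 500 steps; the momentum is 0.9, iter_size is set to 10, and weight_decay is 0.000 5.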
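The step schedule described above can be written out as a small function. This is a sketch of the stated rule (multiply by 0.1 every 7 500 steps, starting from 10^-8), not the authors' solver configuration; the function and parameter names are hypothetical.

```python
def learning_rate(step, base_lr=1e-8, gamma=0.1, step_size=7500):
    """Learning rate at a given training step under a step-decay schedule."""
    return base_lr * gamma ** (step // step_size)

# Over 20 000 training steps the rate takes three values.
schedule = [learning_rate(s) for s in (0, 7500, 15000)]
```

Over the 20 000 training steps the rate therefore takes three values: 10^-8 for steps 0 to 7 499, 10^-9 for steps 7 500 to 14 999, and 10^-10 from step 15 000 onward.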

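4.2 Evaluation metrics

Three metrics are used to evaluate detection performance objectively: the precision-recall (P-R) curve, the ${F_\beta }$ value, and the mean absolute error (MAE).

The P-R curve plots precision against recall, both computed between the prediction and the annotation; the closer the curve lies to the upper right corner, the better the detection. The ${F_\beta }$ value is the weighted harmonic mean of precision and recall, computed as

$ {F_\beta } = \frac{{\left({1 + {\beta ^2}} \right)P \times R}}{{{\beta ^2}P + R}} $

(13)

where ${{\beta ^2}}$ is the weighting factor, which is set to 0.3 as in previous work for ease of comparison, placing more emphasis on precision. $P$ and $R$ denote precision and recall, respectively, and are computed as

$ \begin{array}{l} P = \frac{1}{N}\sum\limits_{n = 1}^N {\frac{{\sum\limits_{i \in \mathit{\Omega }} \tau \left({\bar y_{n, i}^{(m)};g} \right) \cdot {y_{n, i}}}}{{\sum\limits_{i \in \mathit{\Omega }} \tau \left({\bar y_{n, i}^{(m)};g} \right)}}} \\ R = \frac{1}{N}\sum\limits_{n = 1}^N {\frac{{\sum\limits_{i \in \mathit{\Omega }} \tau \left({\bar y_{n, i}^{(m)};g} \right) \cdot {y_{n, i}}}}{{\sum\limits_{i \in \mathit{\Omega }} {{y_{n, i}}} }}} \\ \tau (x;g) = \left\{ {\begin{array}{*{20}{l}} 1&{x > g}\\ 0&{x \le g} \end{array}, g \in (0, 1)} \right. \end{array} $

(14)

where $\tau (x; g)$ is a binarization function that converts the model output $\bar y_{n, i}^{(m)} \in [0, 1]$ to $\left\{ {0, 1} \right\}$ so that precision and recall can be computed.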
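The metric definitions in Eqs. (13) and (14) can be sketched for a single flattened image as follows; averaging over the $N$ test images is omitted for brevity. The zero-denominator fallbacks are an assumption added here, and the example values are hypothetical.

```python
def binarize(pred, g):
    """tau(x; g): 1 where the prediction exceeds the threshold g, else 0."""
    return [1 if p > g else 0 for p in pred]

def precision_recall(pred, gt, g):
    """Precision and recall of one binarized prediction against a 0/1 annotation."""
    b = binarize(pred, g)
    tp = sum(bi * yi for bi, yi in zip(b, gt))
    n_pred = sum(b)
    n_gt = sum(gt)
    precision = tp / n_pred if n_pred else 0.0  # fallback when nothing is detected
    recall = tp / n_gt if n_gt else 0.0         # fallback when nothing is annotated
    return precision, recall

def f_beta(precision, recall, beta2=0.3):
    """Eq. (13) with beta^2 = 0.3, which weights precision more heavily."""
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0

pred = [0.9, 0.8, 0.3, 0.1]   # hypothetical saliency values
gt = [1, 0, 1, 0]             # hypothetical annotation
p, r = precision_recall(pred, gt, g=0.5)
f = f_beta(p, r)
```

In this example two pixels exceed the threshold and one of them is truly salient, so precision and recall are both 0.5 and the resulting ${F_\beta }$ is also 0.5. Sweeping `g` over (0, 1) and recording each precision-recall pair traces out the P-R curve.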

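The mean absolute error measures the average per-pixel deviation between the detection result and the ground-truth annotation and reflects the overall quality of the detection result. It is computed as

$ MAE = \frac{1}{N}\sum\limits_{n = 1}^N {\frac{1}{{\left| \mathit{\Omega } \right|}}\sum\limits_{i \in \mathit{\Omega }} {\left| {\bar y_{n, i}^{(m)} - {y_{n, i}}} \right|} } $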

(15)
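A minimal sketch of the MAE computation, assuming the absolute error is averaged over the pixels of each image before averaging over images, as the description of an average per-pixel deviation suggests. The example values are hypothetical.

```python
def mae(preds, labels):
    """Mean absolute error: per-pixel average within each image, then over images."""
    per_image = [
        sum(abs(p - y) for p, y in zip(pred, label)) / len(pred)
        for pred, label in zip(preds, labels)
    ]
    return sum(per_image) / len(per_image)

# Two hypothetical two-pixel images: one predicted exactly, one off by 0.5 everywhere.
error = mae([[1.0, 0.0], [0.5, 0.5]], [[1, 0], [1, 0]])
```

Unlike ${F_\beta }$, this metric needs no threshold, so it also penalizes predictions that are correct after binarization but not confident.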

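4.3 Experimental results and comparison

Several typical FCN-based saliency detection models were selected for comparison: the reverse attention saliency model (RAS) (Chen et al., 2018), the non-local deep features model (NLDF) (Luo et al., 2017), the deeply supervised saliency model (DSS) (Hou et al., 2019), the deep hierarchical saliency network (DHS) (Liu and Han, 2016), the uncertain convolutional features model (UCF) (Zhang et al., 2017b), and the multi-level convolutional feature aggregation model (Amulet) (Zhang et al., 2017a). To find the optimal structure and achieve the best performance, the experiments used several groups of hyperparameters.

The baseline uses the ASPP module with one decoding convolutional layer of 64 channels. As Table 1 shows, detection performance improves as the number of convolutional layers increases, but declines once the number of layers exceeds 3. On the one hand, as layers are added, the complexity and hence the mapping capacity of the network increase; with too many layers, however, the network becomes so complex that a tiny perturbation of the input can produce a completely different output, which reduces robustness. On the other hand, greater depth enlarges the receptive field (RF) of the last convolutional layer. The receptive field determines how large a spatial region of the input the result of a convolution depends on; a larger receptive field receives more information and can improve prediction accuracy to some extent, but an overly large receptive field takes in too much useless information and harms the preservation of local details. This shows that a deeper decoding network is not always better.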
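The growth of the receptive field with depth can be reproduced with a short calculation. The sketch assumes 3×3 kernels with stride 1 and no dilation in the decoding layers; the kernel size is not stated in the text and is inferred only from the RF column of Table 1.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of stride-1, undilated convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1  # each layer widens the field by kernel_size - 1
    return rf

rf_by_depth = [stacked_receptive_field(n) for n in range(1, 6)]
```

Under this assumption, one to five layers give receptive fields of 3, 5, 7, 9, and 11, which matches the RF column of Table 1.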

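Next, the number of convolutional layers is fixed at 3 and the number of output channels is varied. The results in Table 2 show that the relationship between the number of channels and detection performance is not linear either. A suitable output dimensionality is therefore also important for model design. Based on these experiments, the final structure is 3 convolutional layers with 64 output channels.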

Table 1 Effect of the number of decoding convolutional layers on detection results on the HKU-IS dataset

Number of convolutional layers	RF	${F_\beta }$	MAE
1	3	0.863	0.048
2	5	0.867	0.046
3	7	0.875	0.043
4	9	0.863	0.048
5	11	0.855	0.045
Note: the best value of each metric is obtained with 3 layers.

Table 2 Effect of the number of convolutional channels on detection results on the HKU-IS dataset

Number of convolutional channels	${F_\beta }$	MAE
16	0.870	0.046
32	0.868	0.046
64	0.875	0.043
128	0.870	0.046
256	0.866	0.047
Note: the best value of each metric is obtained with 64 channels.

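Judging from the ${F_\beta }$ values and MAE in Table 3 and the P-R curves in Fig. 6, the proposed model achieves the best overall results across the datasets. On ECSSD and HKU-IS, where the contrast between object and background is high, the proposed DWAN achieves the best detection performance. On the more difficult DUT-OMRON and THUR15K datasets, it achieves the best MAE, but its ${F_\beta }$ is slightly lower than that of DSS+. This is because ${F_\beta }$ itself favors precision, and DSS+ extends DSS with a conditional random field (CRF) (Hou et al., 2019) to smooth the detection result, which filters out some gray regions and raises precision at the expense of recall. The comparison between DSS and DSS+ shows that CRF post-processing markedly improves detection performance, but it adds 0.2 s to the detection time of a single image and greatly lowers the detection speed, as shown in Table 4.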

Table 3 Quantitative comparison with state-of-the-art methods on five benchmark datasets

No.	Method	DUT-OMRON ${F_\beta }$	DUT-OMRON MAE	ECSSD ${F_\beta }$	ECSSD MAE	HKU-IS ${F_\beta }$	HKU-IS MAE	MSRA10K ${F_\beta }$	MSRA10K MAE	THUR15K ${F_\beta }$	THUR15K MAE
1	DHS (Liu and Han, 2016)	0.677	0.076	0.855	0.078	0.837	0.064	0.926	0.037	0.687	0.082
2	NLDF (Luo et al., 2017)	0.684	0.080	0.878	0.063	0.874	0.048	0.906	0.049	0.697	0.081
3	Amulet (Zhang et al., 2017a)	0.647	0.089	0.841	0.078	0.821	0.062	0.911	0.043	0.649	0.094
4	UCF (Zhang et al., 2017b)	0.610	0.123	0.821	0.076	0.811	0.069	0.898	0.057	0.670	0.104
5	RAS (Chen et al., 2018)	0.691	0.070	0.850	0.075	0.849	0.057	0.919	0.042	0.689	0.080
6	DSS (Hou et al., 2019)	0.671	0.075	0.839	0.080	0.829	0.064	0.901	0.048	0.677	0.084
7	DSS+ (Hou et al., 2019)	0.709	0.070	0.861	0.074	0.859	0.057	0.912	0.043	0.710	0.079
8	DWAN	0.706	0.066	0.884	0.055	0.875	0.043	0.918	0.038	0.706	0.075
9	DWAN w/o ASPP	0.696	0.068	0.872	0.056	0.869	0.046	0.914	0.040	0.691	0.078
Note: w/o ASPP denotes the model without the ASPP multiscale prediction module. DWAN has the best value in every column except the ${F_\beta }$ on DUT-OMRON and THUR15K (DSS+) and both metrics on MSRA10K (DHS).

Fig. 6 Comparison of precision-recall curves on five different datasets with state-of-the-art methods ((a) DUT-OMRON; (b) ECSSD; (c) HKU-IS; (d) MSRA10K; (e) THUR15K)

Table 4 Comparison with state-of-the-art methods on detection speed and model size

Model	Input size/pixels	Platform	Detection speed/(frames/s)	Model size/MB
RAS	Original size	Caffe	26.3	77.2
NLDF	Fixed 352	TensorFlow	12.3	425.9
DSS	Original size	Caffe	9.9	249
DSS+	Original size	Caffe, C++	3.2	249
DHS	Fixed 224	PyTorch	38.5	375.1
UCF	Original size	Caffe	4.1	112.5
Amulet	Original size	Caffe	8.9	126.5
DWAN	Original size	Caffe	32.2	69.5
Note: DHS has the highest detection speed and DWAN has the smallest model size.

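At an input resolution of about 400×400 pixels, the proposed model detects in real time at 32 frames/s. The fastest model is DHS, at 38.5 frames/s. Apart from platform differences, DHS requires the input image to be resized to a fixed 224×224 pixels, which needs far less computation than processing the original image and gives it a large advantage in detection speed.

Both RAS and the proposed DWAN use an attention mechanism. Compared with the other methods, both have model sizes below 80 MB while achieving detection performance comparable to that of large networks, which confirms that the attention mechanism improves efficiency in the decoding stage. The difference is that RAS uses strong attention, whereas DWAN uses dense weak attention. As Table 3 shows, even without the gain from ASPP, the two obtain very similar results on MSRA10K, the dataset most similar to the training set, but DWAN is clearly better than RAS on the other four datasets. This indicates that weak attention gives the model stronger generalization ability and makes it less prone to overfitting. DHS also uses weak attention; the difference is that its recurrent convolutional layer adds weak attention only at the start of the recurrence, whereas DWAN applies weak attention densely to prevent its effect from weakening as the network deepens. The results show that DHS achieves the best result on MSRA10K but has no particular advantage on the other datasets, indicating fairly severe overfitting, whereas the proposed DWAN generalizes markedly better.

Fig. 7 gives a visual comparison on several typical examples, from which the advantages of the proposed DWAN can be seen. The objects in rows 1 and 4 consist of several fairly uniform color patches; most other methods miss part of the object, whereas the proposed method detects it fairly completely. Row 2 demonstrates the model's good resistance to interference. Rows 3 and 5 show that the proposed model still detects objects close to the image boundary well and preserves details better. Rows 6 and 7 show that the edges of the detection results are sharper under both high and low contrast.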

Fig. 7 Visual comparison with state-of-the-art methods

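5 Conclusion

To address the large parameter redundancy and slow detection of the decoding networks in mainstream models, this paper proposes a new decoding network design based on the visual attention mechanism. We analyze in depth the roles of strong and weak attention in the decoding stage and conclude that the multiplicative operation of strong attention carries a considerable risk of overfitting. We design the dense weak attention module to exploit the potential of weak attention. The module retains the advantage of attention in making feature decoding more efficient, greatly reduces the model size, and to some extent mitigates the overfitting risk of strong attention models.

The proposed end-to-end saliency detection model DWAN is based on FCN and follows the encoder-decoder structure; it obtains the final saliency map through two successive processes, global saliency prediction and edge optimization guided by weak attention. Experimental results on several public datasets show that, compared with mainstream detection models that use VGG-16 as the backbone, the proposed model achieves the best overall results on the three metrics MAE, ${F_\beta }$, and P-R curve. The model also has the fewest parameters among the compared methods and reaches a real-time detection speed of 32.2 frames/s on high-resolution input.

The proposed model also has shortcomings. Its training converges noticeably more slowly than that of models using strong attention (such as RAS), and deep supervision must be used to guide the generation of the attention mask in each branch; otherwise the model hardly converges. We conjecture that this is because weak attention provides insufficient prior information, which makes decoding in the shallow layers harder than with strong attention. The real cause remains to be verified in future work. How deep supervision helps such hard-to-converge models converge faster also needs further study.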

References

Borji A, Frintrop S and Sihite D N. 2012. Adaptive object tracking by learning background context//Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence, RI, USA: IEEE: 23-30[DOI:10.1109/CVPRW.2012.6239191]

Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. 2016a. DeepLab:semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI:10.1109/TPAMI.2017.2699184]

Chen L C, Yang Y, Wang J, Xu W and Yuille A L. 2016b. Attention to scale: scale-aware semantic image segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 3640-3649[DOI:10.1109/CVPR.2016.396]

Chen S H, Tan X L and Wang B. 2018. Reverse attention for salient object detection//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 236-252[DOI:10.1007/978-3-030-01240-3_15]

Cheng M M, Mitra N J, Huang X L, Torr P, Hu S M. 2015. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 569-582 [DOI:10.1109/TPAMI.2014.2345401]

Cheng M M, Mitra N J, Huang X L, Hu S M. 2014. SalientShape:group saliency in image collections. The Visual Computer, 30(4): 443-453 [DOI:10.1007/s00371-013-0867-4]

Cheng M M, Hou Q B, Zhang S H, Rosin P L. 2017. Intelligent visual media processing:when graphics meets vision. Journal of Computer Science and Technology, 32(1): 110-121 [DOI:10.1007/s11390-017-1681-7]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 770-778[DOI:10.1109/cvpr.2016.90]

He X T, Peng Y X, Zhao J J. 2019. Which and how many regions to gaze:focus discriminative regions for fine-grained visual categorization. International Journal of Computer Vision, 127(9): 1235-1255 [DOI:10.1007/s11263-019-01176-2]

Hou Q B, Cheng M M, Hu X W, Borji A, Tu Z, Torr P. 2019. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4): 815-828 [DOI:10.1109/TPAMI.2018.2815688]

Hu X W, Zhu L, Qin J, Fu C W and Heng P A. 2018. Recurrently aggregating deep features for salient object detection//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Louisiana: 6943-6950

Krizhevsky A, Sutskever I, Hinton G E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90 [DOI:10.1145/3065386]

Itti L. 2004. Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing, 13(10): 1304-1318 [DOI:10.1109/tip.2004.834657]

Li G B and Yu Y Z. 2016a. Deep contrast learning for salient object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 478-487[DOI:10.1109/cvpr.2016.58]

Li G B, Yu Y Z. 2016b. Visual saliency detection based on multiscale deep CNN features. IEEE Transactions on Image Processing, 25(11): 5012-5024 [DOI:10.1109/tip.2016.2602079]

Li J, Lyu S H, Chen F, Yang G G, Dou Y. 2017. Image retrieval by combining recurrent neural network and visual attention mechanism. Journal of Image and Graphics, 22(2): 241-248 (李军, 吕绍和, 陈飞, 阳国贵, 窦勇. 2017. 结合视觉注意机制与递归神经网络的图像检索. 中国图象图形学报, 22(2): 241-248) [DOI:10.11834/jig.20170212]

Liu N and Han J W. 2016. DHSNet: deep hierarchical saliency network for salient object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 678-686[DOI:10.1109/cvpr.2016.80]

Liu T, Yuan Z, Sun J, Wang J, Zheng N N, Tang X and Shum H Y. 2007. Learning to detect a salient object//Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA: IEEE: 353-367[DOI:10.1109/cvpr.2007.383047]

Shelhamer E, Long J, Darrell T. 2017. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4): 640-651 [DOI:10.1109/TPAMI.2016.2572683]

Luo Z M, Mishra A, Achkar A, Eichel J, Li S and Jodoin P M. 2017. Non-local deep features for salient object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 6609-6617[DOI:10.1109/cvpr.2017.698]

Peng Y X, Zhao Y Z, Zhang J C. 2019. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology, 29(3): 773-786 [DOI:10.1109/tcsvt.2018.2808685]

Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241[DOI:10.1007/978-3-319-24574-4_28]

Shi J P, Yan Q, Xu L, Jia J. 2016. Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4): 717-729 [DOI:10.1109/tpami.2015.2465960]

Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition[EB/OL].[2015-04-10] https://arxiv.org/pdf/1409.1556.pdf

Sun J and Ling H B. 2011. Scale and object aware image retargeting for thumbnail browsing//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 1511-1518[DOI:10.1109/iccv.2011.6126409]

Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 3156-3164[DOI:10.1109/cvpr.2015.7298935]

Yang C, Zhang L H, Lu H C, Ruan X and Yang M H. 2013. Saliency detection via graph-based manifold ranking//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE: 3166-3173[DOI:10.1109/cvpr.2013.407]

Yang Y, Yan J H, Jing Q F. 2018. Deformation object tracking based on the fusion of invariant scalable key point matching and image saliency. Journal of Image and Graphics, 23(3): 384-398 (杨勇, 闫钧华, 井庆丰. 2018. 融合图像显著性与特征点匹配的形变目标跟踪. 中国图象图形学报, 23(3): 384-398) [DOI:10.11834/jig.170339]

Gao Y, Wang M, Tao D C, Ji R R, Dai Q H. 2012. 3-D object retrieval and recognition with hypergraph analysis. IEEE Transactions on Image Processing, 21(9): 4290-4303 [DOI:10.1109/tip.2012.2199502]

Zhang P P, Wang D, Lu H C, Wang H and Ruan X. 2017a. Amulet: aggregating multi-level convolutional features for salient object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 202-211[DOI:10.1109/iccv.2017.31]

Zhang P P, Wang D, Lu H C, Wang H and Yin B. 2017b. Learning uncertain convolutional features for accurate saliency detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 212-221[DOI:10.1109/iccv.2017.32]

Zhang P P, Wang L Y, Wang D, Lu H and Shen C. 2018. Agile amulet: real-time salient object detection with contextual attention[EB/OL].[2019-05-01].https://arxiv.org/pdf/1802.06960.pdf