Published: 2018-12-16
DOI: 10.11834/jig.180073
2018 | Volume 23 | Number 12

Saliency detection via fusion of deep model and traditional model
Fang Zheng, Cao Tieyong, Hong Shizhan, Xiang Shengkai
Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China

Abstract

Objective Saliency detection is a fundamental problem in the fields of image processing and computer vision. Traditional models preserve the boundaries of salient objects well, but their confidence in the salient objects is low, which leads to low recall; deep learning models are highly confident about salient objects, but their results have coarse boundaries and therefore lower precision. To exploit the respective strengths of the two kinds of models and suppress their weaknesses, a composite saliency model is proposed. Method First, the recently proposed densely connected convolutional network is modified and a fully convolutional network (FCN) saliency model based on it is trained; meanwhile, an existing superpixel-based saliency regression model is selected. After the saliency maps of the two models are obtained, a fusion algorithm is proposed to combine the two results into the final optimized result. The algorithm fuses the FCN result with the result of the traditional model through the Hadamard product of the saliency maps and a one-to-one nonlinear mapping of pixel-wise saliency values. Result Experiments on four datasets compare the proposed model with ten state-of-the-art methods. On HKU-IS, the F-measure is 2.6% higher than that of the second-best model; on MSRA, the F-measure is 2.2% higher and the MAE is 5.6% lower than those of the second-best model; on DUT-OMRON, the F-measure is 5.6% higher and the MAE is 17.4% lower than those of the second-best model. Comparative experiments on MSRA verify the effectiveness of the fusion algorithm and show that it improves saliency detection. Conclusion The proposed saliency model combines the advantages of traditional and deep learning models and yields more accurate saliency detection results.

Key words

saliency detection; dense convolutional network; fully convolutional network; fusion algorithm; Hadamard product

Saliency detection via fusion of deep model and traditional model
Fang Zheng, Cao Tieyong, Hong Shizhan, Xiang Shengkai
Institute of Command and Control Engineering, Army Engineering University, Nanjing 210001, China
Supported by: National Natural Science Foundation of China (61471394, 61402519)

Abstract

Objective Saliency detection is a fundamental problem in computer vision and image processing, which aims to identify the most conspicuous objects or regions in an image. Saliency detection has been widely used in several visual applications, including object retargeting, scene classification, visual tracking, image retrieval, and semantic segmentation. In most traditional approaches, salient objects are derived from features extracted from pixels or regions, and the final saliency maps consist of these regions with their saliency scores. The performance of these models relies on the segmentation method and the selection of features. These approaches cannot produce satisfactory results for images with multiple salient objects or low-contrast content. Traditional approaches preserve boundaries well but assign insufficient confidence to salient objects, which yields low recall rates. Convolutional neural networks (CNNs) have been introduced into pixel-wise prediction problems, such as saliency detection, because of their outstanding performance in image classification tasks. CNNs redefine the saliency problem as a labeling problem in which the feature selection between salient and non-salient objects is performed automatically through gradient descent. A CNN cannot be used directly to train a saliency model; instead, it can be utilized in saliency detection by extracting a square patch around each pixel and using the patch to predict the center pixel's class. Patches are frequently obtained from different resolutions of the input image to capture global information. Another method is the addition of up-sampling layers to the CNN. The modified CNN is called a fully convolutional network (FCN), which was first proposed for semantic segmentation. Most CNN-based saliency detection models use an FCN to capture rich global and local information. The FCN adapts the CNN to dense prediction problems by replacing the softmax and fully connected layers of the CNN with convolution and deconvolution layers. Compared with traditional methods, FCNs can accurately locate salient objects and assign them high confidence. However, the boundaries of salient objects are coarse and the precision rates are lower than those of traditional approaches because of the down-sampling structure in FCNs. To deal with the limitations of these 2 kinds of saliency models, we propose a composite saliency model that combines their advantages and restrains their drawbacks. Method In this study, a new FCN based on the dense convolutional network (DenseNet) is built. For saliency detection, we replace the fully connected layer and the final pooling layer with a 1×1 convolution layer and a deconvolution layer, and a sigmoid layer is applied to obtain the saliency maps. In the training process, the saliency network ends with a squared Euclidean loss layer for saliency regression. We fine-tune the pre-trained DenseNet-161 to train our saliency model. Our training set consists of 3 900 images randomly selected from 5 public saliency datasets, namely, ECSSD, SOD, HKU-IS, MSRA, and ICOSEG. Our saliency network is implemented in the Caffe toolbox. The input images and ground-truth maps are resized to 500×500 for training, the momentum parameter is set to 0.99, the learning rate is set to 10^-10, and the weight decay is 0.0005. The SGD learning procedure is accelerated on an NVIDIA GTX TITAN X GPU and takes approximately one day for 200 000 iterations.
Then, we use a traditional saliency model. The selected model adopts multi-level segmentation to produce several segmentations of an image, where each superpixel is represented by a feature vector that contains different kinds of image features. A random forest is trained on these feature vectors to derive saliency maps. On the basis of the 2 models, we propose a fusion algorithm that combines the advantages of traditional approaches and deep learning methods. For an image, 15 segmentations are produced, and the saliency maps of all segmentations are derived by the random forest. Then, we use the FCN to produce another type of saliency map of the image. The fusion algorithm applies the Hadamard product to the 2 types of saliency maps, and the initial fusion result is obtained by averaging the Hadamard products. Then, an adaptive threshold is used to fuse the initial fusion result and the FCN result through a pixel-to-pixel mapping to obtain the final fusion result. Result We compared our model with 10 state-of-the-art saliency models, including traditional approaches and deep learning methods, on 4 public datasets, namely, DUT-OMRON, ECSSD, HKU-IS, and MSRA. The quantitative evaluation metrics are the F-measure, the mean absolute error (MAE), and PR curves, and we provide several saliency maps of each method for comparison. The experimental results show that our model outperforms all other methods on the HKU-IS, MSRA, and DUT-OMRON datasets, and the saliency maps show that our model produces refined results. We also compared the performance of the random forest, the FCN, and the final fusion results to verify the effectiveness of our fusion algorithm; the comparative experiments demonstrate that the fusion algorithm improves saliency detection. Compared with the random forest results on ECSSD, HKU-IS, MSRA, and DUT-OMRON, the F-measure (higher is better) increases by 6.2%, 15.6%, 5.7%, and 16.6%, and MAE (lower is better) decreases by 17.4%, 43.9%, 33.3%, and 24.5%, respectively. Compared with the FCN results on ECSSD, HKU-IS, MSRA, and DUT-OMRON, the F-measure increases by 2.2%, 4.1%, 5.7%, and 11.3%, respectively, and MAE decreases by 0.6%, 10.7%, and 18.4% on ECSSD, MSRA, and DUT-OMRON, respectively. In addition, we conducted a series of comparative experiments on MSRA to clearly show the effectiveness of the different steps of the fusion algorithm. Conclusion In this study, we proposed a composite saliency model that combines an FCN and a traditional model through a fusion algorithm that fuses the 2 kinds of saliency maps. The experimental results show that our model outperforms several state-of-the-art saliency approaches and that the fusion algorithm improves performance.

Key words

saliency detection; dense convolutional network; fully convolutional network; fusion algorithm; Hadamard product

0 Introduction

Saliency detection is a fundamental research area in image processing and computer vision; its goal is to find the most salient regions of an image. It is widely used in various image processing and recognition tasks, including object retargeting [1], scene classification [2], visual tracking [3], image retrieval [4], and semantic segmentation [5].

Traditional saliency detection models [6-9] generally first segment an image into superpixels and then extract various features, such as color, texture, and histograms, from each superpixel; these features are used to determine the salient regions of the image. The performance of traditional methods depends heavily on the feature selection and on the superpixel segmentation method. The hand-crafted features used in these methods achieve good results on simple images, but when the background is cluttered or the contrast between foreground and background is low, such models cannot detect the salient objects well.

With the great success of convolutional neural networks (CNNs) [10-11] in image recognition, CNNs have also been introduced into dense labeling tasks. There are two ways to use a CNN for pixel-level labeling. One is to extract square patches from the image [12] and use each patch to predict the class of its center pixel. Some saliency detection models use this approach: Zhao et al. [13] fused the global and local information of an image in a multi-context network, where the global information determines the salient region of the whole image and the local information estimates the boundaries of salient objects; Li et al. [14] used features extracted from three CNNs to determine the saliency value of an image patch.

The other way to use CNNs in dense labeling tasks is the fully convolutional network (FCN), first proposed by Long et al. [15] to adapt CNNs to semantic segmentation. Most saliency detection models are also built on FCNs [16-18]. Li et al. [16] proposed a convolutional network with both pixel-level and region-level training tasks and refined the result with a conditional random field. Luo et al. [17] proposed an end-to-end FCN for saliency detection and used a boundary loss to improve the boundary accuracy of the saliency results. Zhang et al. [18] extracted the features of every layer of a CNN and proposed an aggregation module that fuses all the features into the saliency map.

Compared with traditional models, deep-learning-based saliency models are more robust. By exploiting the rich semantic information in CNNs, they locate salient objects accurately and with high confidence, and they can find salient objects even in cluttered scenes and low-contrast images. However, because CNNs were originally designed for image classification, they contain multiple down-sampling layers, and information is inevitably lost during down-sampling. As a result, the maps produced by deep models are generally coarse, mainly in the sense that the boundaries of salient objects are not detected accurately. Traditional models, in contrast, lack semantic information and are not confident enough about salient objects, but because they segment the original image into superpixels, they preserve the boundary information of the image well. Fig. 1 shows the results of an FCN and a traditional model on the same image. The FCN result locates the salient object well, but its boundary is so coarse that the shape of the object cannot be recognized, whereas the superpixel-based traditional method preserves the boundary of the salient object well but is not confident enough about it: the object is poorly localized and its saliency scores are low.

Fig. 1 Results of different methods ((a) original image; (b) ground-truth map; (c) result of FCN; (d) result of traditional method)

To exploit the respective strengths and weaknesses of the two kinds of models, this paper proposes a composite saliency model that contains an FCN and a random forest. First, the recently proposed densely connected convolutional network [19] is modified and a dense FCN for saliency detection is trained, and an existing pre-trained traditional model (a random forest) is selected to predict the saliency values of superpixels. Finally, a fusion algorithm is proposed to fuse the maps of the two models into the final result.

Experiments on four datasets compare the proposed model with ten state-of-the-art methods, including both traditional and deep learning approaches, and comparative experiments verify the effectiveness of the fusion algorithm. In addition, the influence of the parameter choice in the fusion algorithm and of its intermediate steps on model performance is analyzed. The results show that the proposed model performs better on three of the datasets, and the comparative experiments confirm that the proposed fusion algorithm improves saliency detection. The structure of the composite model is sketched in Fig. 2.

Fig. 2 The composite saliency model proposed in this paper

1 Construction of the traditional and deep models

First, the recently proposed densely connected convolutional network is modified and a dense FCN is trained as the deep model. The densely connected convolutional network (DenseNet) [19] is a recently proposed CNN that is more efficient than VGG-Net [10] and ResNet [11]. By connecting the output of each layer to the input of every subsequent layer, it reuses features and improves feature efficiency, and because each layer outputs only a few feature maps, the number of network parameters is greatly reduced; it also outperforms earlier networks on image classification. The structure of the modified dense fully convolutional network is shown in Fig. 3, where the dense convolutional part follows DenseNet-161 [19]. The biggest difference between this network and earlier CNNs is the dense block: a dense block contains several convolution layers, and the output of each convolution layer is connected directly to the inputs of all subsequent convolution layers (the dense block in Fig. 3 is only an illustration of this dense connectivity, in which the circles denote convolution layers; each actual dense block contains more than four convolution layers). The detailed parameter settings of the dense convolutional network can be found in [19]. DenseNet is chosen here for two main reasons: to reduce the number of model parameters and to exploit its more discriminative features for identifying salient objects. As shown in Fig. 3, converting DenseNet into an FCN requires removing the original classification layer and adding a convolution layer and a deconvolution layer. Because pixel saliency is represented by a value in [0, 1], a sigmoid activation layer follows the deconvolution layer to produce the saliency map.

Fig. 3 Dense fully convolutional network
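The modification can be sketched in a few lines of code. The following is a minimal sketch in PyTorch with torchvision's DenseNet-161, not the authors' Caffe implementation; the class and attribute names are ours, and ceil-mode pooling is enabled so that a 500×500 input yields the 16×16 feature map referred to in the next paragraph.

```python
import torch
import torch.nn as nn
from torchvision import models

class DenseFCN(nn.Module):
    """Sketch of the dense FCN: DenseNet-161 features + 1x1 conv + deconv + sigmoid."""
    def __init__(self):
        super().__init__()
        backbone = models.densenet161()       # in practice, initialize from ImageNet weights
        self.features = backbone.features     # keep the convolutional part, drop the classifier
        # Caffe-style (ceil-mode) pooling so a 500x500 input gives a 16x16 feature map
        for m in self.features.modules():
            if isinstance(m, (nn.MaxPool2d, nn.AvgPool2d)):
                m.ceil_mode = True
        self.score = nn.Conv2d(2208, 1, kernel_size=1)   # 1x1 convolution -> saliency score
        # deconvolution parameters from the text: kernel 63, stride 31, pad 14
        self.up = nn.ConvTranspose2d(1, 1, kernel_size=63, stride=31, padding=14)

    def forward(self, x):                     # x: (N, 3, 500, 500)
        f = torch.relu(self.features(x))      # (N, 2208, 16, 16)
        s = self.up(self.score(f))            # back to (N, 1, 500, 500)
        return torch.sigmoid(s)               # saliency values in [0, 1]
```

In practice the backbone would be initialized from ImageNet-pretrained DenseNet-161 weights and the whole network fine-tuned under the Euclidean loss defined below.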

During training, the input image size of the network is 500×500 pixels. The deconvolution layer uses stride = 31, pad = 14, and kernel size = 63 to resize the 16×16 result map back to 500×500 pixels, and its activation is a sigmoid function that maps the result to [0, 1]. The training loss is the Euclidean loss, denoted by $J(\cdot)$:

$J\left( \boldsymbol{Z} \right) = \frac{1}{N}\sum\limits_{i = 1}^{N} \left\| \boldsymbol{I}_i - f\left( \boldsymbol{Z}_i \right) \right\|_{\rm F}^{2}$ (1)

where $\boldsymbol{Z} = \{ \boldsymbol{Z}_i \}\ (i = 1, \cdots, N)$ are the training images and $\boldsymbol{I}_i\ (i = 1, \cdots, N)$ are the corresponding manually annotated ground-truth maps. At test time, an image is first resized to 500×500 pixels by bilinear interpolation and passed through the network, and the result map is then resized back to the original image size, again by bilinear interpolation.
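As a quick check of the deconvolution settings above, the standard transposed-convolution output-size relation $H_{\rm out} = s\left( H_{\rm in} - 1 \right) + k - 2p$ recovers the 500×500 resolution from the 16×16 feature map: $H_{\rm out} = 31 \times (16 - 1) + 63 - 2 \times 14 = 465 + 63 - 28 = 500$.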

The traditional model is the random forest proposed by Jiang et al. [7]. It is trained from a large number of feature vectors of superpixels: the input is a feature vector containing the color, texture, histogram, and position features of a superpixel, and the output is the saliency score of that superpixel. After an image is segmented into superpixels, it can be fed into this model to obtain a saliency map.
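As a toy illustration of this regression setup (not the DRFI implementation or its pre-trained forest), the following sketch fits a scikit-learn random forest regressor on per-superpixel descriptors; the feature dimension and the data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder training data: one fixed-length descriptor per superpixel
# (DRFI-style color/texture/histogram/position features), with a saliency
# score in [0, 1] as the regression target.
X_train = rng.random((5000, 93))
y_train = rng.random(5000)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# At test time, every superpixel of a segmentation gets a predicted score;
# painting the scores back onto the pixels of each superpixel yields its saliency map.
X_test = rng.random((300, 93))
superpixel_scores = forest.predict(X_test)   # shape: (300,)
```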

2 Fusion of saliency results

As discussed above and illustrated in Fig. 1, the FCN exploits the robust semantic information in the CNN, so it locates salient objects accurately and with high confidence (i.e., it correctly identifies salient objects and assigns them high saliency values while giving the background saliency values close to zero, which shows up in the experiments as high recall). However, because of the structure of the FCN, the final saliency map is obtained by deconvolving a very small feature map, so precise boundary information cannot be preserved (which shows up as low precision). The traditional method first segments the image into superpixels and then judges the saliency of each superpixel from various low-level features; limited by these features, it is not confident enough about salient objects (large gray regions appear in the saliency map in Fig. 1, which shows up as low recall in the experiments). Its superpixel segmentation, however, is performed at the original image resolution, without any down- or up-sampling, so no boundary information is lost (which shows up as high precision); this is why the boundary of the traditional result in Fig. 1 is more precise and smooth.

To combine the advantages of both and suppress their respective drawbacks, this paper proposes a fusion algorithm, illustrated in Fig. 2. The procedure is as follows (the multi-level superpixel segmentation method comes from [7]); a code sketch of steps 4)-6) is given after the list:

1) Segment the image $\boldsymbol{I}$ with the multi-level superpixel segmentation algorithm to obtain 15 superpixel segmentations $\boldsymbol{S}_1, \boldsymbol{S}_2, \cdots, \boldsymbol{S}_{15}$.

2) Pass all superpixel segmentations through the random forest to obtain the saliency map of each segmentation, $\boldsymbol{Sal}_1, \boldsymbol{Sal}_2, \cdots, \boldsymbol{Sal}_{15}$.

3) Pass the original image through the dense fully convolutional network to obtain the result map $\boldsymbol{F}_{\rm map}$.

4) Take the matrix Hadamard product of each of $\boldsymbol{Sal}_1, \boldsymbol{Sal}_2, \cdots, \boldsymbol{Sal}_{15}$ with the fully convolutional result map $\boldsymbol{F}_{\rm map}$; denote the results by $\boldsymbol{Sal}_{\rm m1}, \boldsymbol{Sal}_{\rm m2}, \cdots, \boldsymbol{Sal}_{\rm m15}$.

5) Average $\boldsymbol{Sal}_{\rm m1}, \boldsymbol{Sal}_{\rm m2}, \cdots, \boldsymbol{Sal}_{\rm m15}$ to obtain the initial fusion result $\boldsymbol{A}_{\rm sal}$.

6) Find the pixel with the largest saliency value in $\boldsymbol{A}_{\rm sal}$ and denote that value by $max$; find all pixel positions in $\boldsymbol{A}_{\rm sal}$ whose saliency value is greater than $max$/5 and replace the saliency values at these positions with the values of the corresponding pixels in $\boldsymbol{F}_{\rm map}$. This gives the final fusion result $\boldsymbol{S}_{\rm map}$.
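The following is a minimal NumPy sketch of steps 4)-6); the function and variable names are ours, and the inputs are assumed to be arrays of identical shape with values in [0, 1].

```python
import numpy as np

def fuse_saliency(sal_maps, f_map, ratio=5):
    """Fuse superpixel saliency maps with the FCN map (steps 4)-6) above).

    sal_maps : list of the 15 random-forest maps Sal_1..Sal_15, values in [0, 1]
    f_map    : dense-FCN map F_map of the same shape, values in [0, 1]
    ratio    : the adaptive threshold is max(A_sal) / ratio (the paper uses 5)
    """
    # step 4): Hadamard (element-wise) product of every Sal_k with F_map
    products = [sal * f_map for sal in sal_maps]
    # step 5): average the products to obtain the initial fusion result A_sal
    a_sal = np.mean(products, axis=0)
    # step 6): pixels above max(A_sal)/ratio take the FCN value, the rest keep A_sal
    threshold = a_sal.max() / ratio
    return np.where(a_sal > threshold, f_map, a_sal)
```

Calling fuse_saliency with the 15 random-forest maps and the resized FCN output returns the final map $\boldsymbol{S}_{\rm map}$.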

Fig. 4 gives an intuitive illustration of the fusion process.

Fig. 4 An example of the fusion process ((a) original image; (b) ground-truth map; (c) saliency map of a superpixel segmentation; (d) result of FCN; (e) initial fusion result; (f) final fusion result)

The Hadamard product multiplies the elements of two matrices at the same positions, so taking the Hadamard product of two saliency maps tends to preserve the lower saliency value at each pixel. As Fig. 4 shows, after the FCN result (Fig. 4(d)) and the superpixel saliency maps (Fig. 4(c)) are multiplied element-wise and averaged, the non-salient parts of Fig. 4(c) are removed and the boundary of the salient object is fully preserved, but the saliency values on the object itself change little and remain similar to those in Fig. 4(c).

The initial fusion map greatly improves precision, but it does not yet incorporate the high confidence of the FCN result about salient objects. Therefore, after the initial fusion result is obtained, an adaptive threshold is designed to map saliency values and improve recall: its purpose is to replace the values of low-saliency pixels in the initial fusion result with the values of the corresponding high-saliency pixels in the FCN result. In terms of the two results, the object saliency values in the initial fusion map are determined from low-level image features and therefore carry low confidence, whereas the corresponding values in the FCN result are higher and closer to the ground-truth map, so replacing the values of these pixels with the corresponding FCN values raises recall considerably. Visually, the brighter parts of the FCN result (Fig. 4(d)) are used to fill in the darker parts of the initial fusion result (the gray parts in Fig. 4(e)). The threshold is chosen as one fifth of the maximum saliency value in the initial fusion map; the choice of threshold also affects performance, as analyzed in the comparative experiments. The final result, shown in Fig. 4(f), preserves both the boundary information of the salient object and the high confidence in its saliency.

3 Experimental results and analysis

The training set of the dense fully convolutional network contains 3 900 images from five public saliency detection datasets: ECSSD [20], SOD [21], HKU-IS [14], MSRA [22], and ICOSEG [23]. The network is trained on the Caffe platform with a learning rate of $10^{-10}$, and the final model is trained for 200 000 iterations.

Experiments compare ten state-of-the-art saliency detection models on four datasets: ECSSD, HKU-IS, MSRA, and DUT-OMRON [24]. They include six traditional saliency models, DRFI [7], IDRFI [25], DW [26], HDCT [9], RRWR [27], and CGVS [28], and four deep learning models, MC [13], MDF [14], AMU [18], and UCF [29]. DRFI is the model used in this paper to compute the saliency maps of the superpixel segmentations; comparing DRFI and the dense FCN (denoted denseFCN) with the final result serves as the comparative experiment that verifies the effectiveness of the fusion algorithm. Because the training sets of UCF and AMU include all images of MSRA, these two models are excluded from the MSRA results. The last part of the experiments analyzes, through a series of comparative experiments on the MSRA dataset, the specific role of each step of the fusion algorithm and the influence of the adaptive threshold on the results. The evaluation metrics are the F-measure, the mean absolute error (MAE), and PR curves. Table 1 lists the F-measure and MAE of all compared methods, Fig. 5 shows the PR curves of the models, and Fig. 6 compares some saliency maps.
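For reference, the two scalar metrics can be computed as in the NumPy sketch below. The F-measure weight $\beta^2 = 0.3$ and the adaptive binarization threshold of twice the mean saliency are common conventions in the saliency literature, not values stated in this paper, so they are assumptions here.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and the ground truth, both in [0, 1]."""
    return np.abs(sal.astype(float) - gt.astype(float)).mean()

def f_measure(sal, gt, beta2=0.3, threshold=None):
    """Weighted F-measure; the binarization threshold defaults to twice the mean saliency."""
    if threshold is None:
        threshold = 2.0 * sal.mean()
    binary = sal >= threshold
    gt = gt >= 0.5
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```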

Table 1 F-measure and MAE of different methods

Dataset Metric CGVS HDCT RRWR AMU UCF MC MDF DW IDRFI DRFI denseFCN Ours
ECSSD F-measure 0.644 0.688 0.716 0.838 0.823 0.739 0.775 0.647 0.755 0.75 0.78 0.797
 MAE 0.228 0.206 0.197 0.128 0.132 0.167 0.169 0.221 0.18 0.19 0.158 0.157
HKU-IS F-measure 0.582 0.681 0.683 0.851 0.827 0.721 0.831 0.654 0.747 0.755 0.839 0.873
 MAE 0.175 0.115 0.126 0.044 0.053 0.092 0.058 0.132 0.093 0.098 0.047 0.055
MSRA F-measure 0.734 0.784 0.829 / / 0.838 0.877 0.798 0.847 0.848 0.848 0.896
 MAE 0.117 0.088 0.087 / / 0.061 0.053 0.1 0.071 0.075 0.056 0.05
DUT-OMRON F-measure 0.479 0.587 0.593 0.706 0.681 0.63 0.697 0.608 0.622 0.64 0.67 0.746
 MAE 0.213 0.109 0.138 0.088 0.103 0.097 0.086 0.114 0.105 0.094 0.087 0.071
Note: bold numbers indicate the best performance values; AMU and UCF are not evaluated on the MSRA dataset, which is indicated by "/" in the table.
Fig. 5 PR curves of different models ((a) ECSSD; (b) HKU-IS; (c) DUT-OMRON; (d) MSRA)

As the results show, the proposed model outperforms the other compared models on HKU-IS, MSRA, and DUT-OMRON. On HKU-IS, the F-measure is 2.6% higher than that of the second-best model, AMU; on MSRA, the F-measure is 2.2% higher and the MAE is 5.6% lower than those of the second-best model, MDF; on DUT-OMRON, the F-measure is 5.6% higher and the MAE is 17.4% lower than those of the second-best model, AMU.

Compared with DRFI, the F-measure of the proposed algorithm is higher by 6.2%, 15.6%, 5.7%, and 16.6% on the four test datasets, and the MAE is lower by 17.4%, 43.9%, 33.3%, and 24.5%, respectively. Compared with denseFCN, after fusion the F-measure is higher by 2.2%, 4.1%, 5.7%, and 11.3% on the four datasets, and the MAE is lower by 0.6%, 10.7%, and 18.4% on ECSSD, MSRA, and DUT-OMRON, respectively. This demonstrates that the proposed algorithm combines the advantages of the two models and improves saliency detection. The saliency maps in Fig. 6 also show that the proposed algorithm preserves the boundaries of salient objects well while retaining high confidence in the salient objects.

Fig. 6 Saliency maps of different models ((a) original images; (b) ground-truth maps; (c) ours; (d) denseFCN; (e) DRFI; (f) AMU; (g) UCF; (h) CGVS; (i) DW; (j) HDCT; (k) IDRFI; (l) MC; (m) MDF; (n) RRWR)

To show the role of each step of the fusion algorithm, a further comparative experiment was conducted on 794 images of the MSRA dataset, recording how the precision, recall, and F-measure of the saliency results change during the fusion; the results are listed in Table 2. Table 2 clearly shows that the traditional method has high precision whereas the denseFCN result has high recall; the Hadamard product greatly increases precision but also lowers recall, and the final pixel-wise nonlinear mapping with the adaptive threshold raises recall substantially.

Table 2 Changes of performance in the fusion algorithm

Step Precision Recall F-measure
DRFI alone 0.8563 0.8240 0.8486
denseFCN alone 0.8338 0.9032 0.8488
Averaged Hadamard product 0.952 0.731 0.890
Final fusion result 0.922 0.822 0.895

In the proposed algorithm, the setting of the adaptive threshold also affects the performance of the final fusion result. If the threshold is too small, the final fusion map differs too little from the denseFCN result and the boundary information of the traditional model is not exploited; if it is too large, the high saliency confidence of the denseFCN result cannot be exploited. Therefore, a comparative experiment was also conducted on the 794 MSRA images, with the results listed in Table 3, where $M$ denotes the maximum saliency value in the initial fusion result.

Table 3 Influence of different thresholds on the results

Mapping threshold Precision Recall F-measure
$M$/2 0.9530 0.6693 0.8681
$M$/3 0.9426 0.7415 0.8871
$M$/4 0.9220 0.8220 0.8950
$M$/5 0.9215 0.8221 0.8965
$M$/6 0.9140 0.8389 0.8955
$M$/7 0.9063 0.8534 0.8935
$M$/8 0.9013 0.8613 0.8918
$M$/9 0.8954 0.8689 0.8891
Note: bold numbers indicate the best performance values.

Table 3 shows that the larger the threshold, the higher the precision and the lower the recall, and vice versa. When the mapping threshold is set to $M$/5, precision and recall are balanced and the F-measure is highest, so $M$/5 is finally chosen as the mapping threshold in the proposed algorithm.

4 Conclusion

This paper proposes a composite saliency detection model: a dense fully convolutional network is trained and an existing traditional model is selected to produce their respective saliency maps, and a fusion algorithm is then proposed to fuse the two results into the final saliency map. The proposed algorithm outperforms current state-of-the-art saliency detection algorithms on three datasets, and the comparative experiments confirm that the proposed fusion algorithm effectively exploits the advantages of both models. The algorithm can be used to improve the results of most traditional or deep saliency models, and the model itself can serve as a saliency model in tasks such as image retrieval, image recognition, and segmentation.

Although the proposed algorithm performs well in saliency detection, it splits detection into separate steps and uses two models to obtain the final result, which increases computation time; the model also has many parameters, which hinders its deployment and portability. Future work can focus on simplifying the model, for example by designing an end-to-end detection model, reducing the number of parameters, and improving computation speed.

References

  • [1] Ding Y Y, Xiao J, Yu J Y. Importance filtering for image retargeting[C]//Proceedings of CVPR 2011. Colorado Springs: IEEE, 2011: 89-96.[DOI: 10.1109/CVPR.2011.5995445]
  • [2] Siagian C, Itti L. Rapid biologically-inspired scene classification using features shared with visual attention[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(2): 300–312. [DOI:10.1109/TPAMI.2007.40]
  • [3] Borji A, Frintrop S, Sihite D N, et al. Adaptive object tracking by learning background context[C]//Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence: IEEE, 2012: 23-30.[DOI: 10.1109/CVPRW.2012.6239191]
  • [4] He J F, Feng J Y, Liu X L, et al. Mobile product search with bag of hash bits and boundary reranking[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence: IEEE, 2012: 3005-3012.[DOI: 10.1109/CVPR.2012.6248030]
  • [5] Donoser M, Urschler M, Hirzer M, et al. Saliency driven total variation segmentation[C]//Proceedings of the 2009 IEEE 12th International Conference on Computer Vision. Kyoto: IEEE, 2009: 817-824.[DOI: 10.1109/ICCV.2009.5459296]
  • [6] Perazzi F, Krähenbühl P, Pritch Y, et al. Saliency filters: contrast based filtering for salient region detection[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence: IEEE, 2012: 733-740.[DOI: 10.1109/CVPR.2012.6247743]
  • [7] Jiang H Z, Wang J D, Yuan Z J, et al. Salient object detection: a discriminative regional feature integration approach[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland: IEEE, 2013: 2083-2090.[DOI: 10.1109/CVPR.2013.271]
  • [8] Tong N, Lu H C, Zhang L H, et al. Saliency detection with multi-scale superpixels[J]. IEEE Signal Processing Letters, 2014, 21(9): 1035–1039. [DOI:10.1109/LSP.2014.2323407]
  • [9] Kim J, Han D, Tai Y W, et al. Salient region detection via high-dimensional color transform[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 883-890.[DOI: 10.1109/CVPR.2014.118]
  • [10] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv: 1409.1556, 2014.
  • [11] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.[DOI: 10.1109/CVPR.2016.90]
  • [12] Farabet C, Couprie C, Najman L, et al. Learning hierarchical features for scene labeling[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1915–1929. [DOI:10.1109/TPAMI.2012.231]
  • [13] Zhao R, Ouyang W L, Li H S, et al. Saliency detection by multi-context deep learning[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1265-1274.[DOI: 10.1109/CVPR.2015.7298731]
  • [14] Li G, Yu Y. Visual saliency detection based on multiscale deep CNN features[J]. IEEE Transactions on Image Processing, 2016, 25(11): 5012–5024. [DOI:10.1109/TIP.2016.2602079]
  • [15] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3431-3440.[DOI: 10.1109/CVPR.2015.7298965]
  • [16] Li G B, Yu Y Z. Deep contrast learning for salient object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 478-487.[DOI: 10.1109/CVPR.2016.58]
  • [17] Luo Z M, Mishra A, Achkar A, et al. Non-local deep features for salient object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6593-6601.[DOI: 10.1109/CVPR.2017.698]
  • [18] Zhang P P, Wang D, Lu H C, et al. Amulet: aggregating multi-level convolutional features for salient object detection[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 202-211.[DOI: 10.1109/ICCV.2017.31]
  • [19] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 2261-2269.[DOI: 10.1109/CVPR.2017.243]
  • [20] Shi J P, Yan Q, Xu L, et al. Hierarchical image saliency detection on extended CSSD[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(4): 717-729. [DOI:10.1109/TPAMI.2015.2465960]
  • [21] Movahedi V, Elder J H. Design and perceptual validation of performance measures for salient object segmentation[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. San Francisco: IEEE, 2010: 49-56.[DOI: 10.1109/CVPRW.2010.5543739]
  • [22] Liu T, Yuan Z J, Sun J, et al. Learning to detect a salient object[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(2): 353–367. [DOI:10.1109/TPAMI.2010.70]
  • [23] Batra D, Kowdle A, Parikh D, et al. iCoseg: interactive co-segmentation with intelligent scribble guidance[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco: IEEE, 2010: 3169-3176.[DOI: 10.1109/CVPR.2010.5540080]
  • [24] Yan Q, Xu L, Shi J P, et al. Hierarchical saliency detection[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland: IEEE, 2013: 1155-1162.[DOI: 10.1109/CVPR.2013.153]
  • [25] Zhou X F, Liu Z, Sun G L, et al. Improving saliency detection via multiple kernel boosting and adaptive fusion[J]. IEEE Signal Processing Letters, 2016, 23(4): 517–521. [DOI:10.1109/LSP.2016.2536743]
  • [26] Li H Y, Lu H C, Lin Z, et al. Inner and inter label propagation:salient object detection in the wild[J]. IEEE Transactions on Image Processing, 2015, 24(10): 3176–3186. [DOI:10.1109/TIP.2015.2440174]
  • [27] Li C Y, Yuan Y C, Cai W D, et al. Robust saliency detection via regularized random walks ranking[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 2710-2717.[DOI: 10.1109/CVPR.2015.7298887]
  • [28] Yang K F, Li H, Li C Y, et al. A unified framework for salient structure detection by contour-guided visual search[J]. IEEE Transactions on Image Processing, 2016, 25(8): 3475–3488. [DOI:10.1109/TIP.2016.2572600]
  • [29] Zhang P P, Wang D, Lu H C, et al. Learning uncertain convolutional features for accurate saliency detection[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 212-221.[DOI: 10.1109/ICCV.2017.32]