Published: 2019-07-16
DOI: 10.11834/jig.180224
2019 | Volume 24 | Number 7

Image Analysis and Recognition

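Salient object detection via deep features and multiple kernel boosting learning

Zhang Qing¹, Li Yun¹, Li Wenju¹, Lin Jiajun², Xiao Mang¹, Chen Feiyun¹

1. School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China;

2. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China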

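Received: 2018-04-09; Revised: 2019-01-07

Supported by: National Natural Science Foundation of China (61401281, 61806126, 41671402); Development Fund for Young and Middle-aged Scientific and Technological Talents of Shanghai Institute of Technology (ZQ2018-23)

First author biography: Zhang Qing, born in 1983, female, associate professor; her main research interest is computational models of visual perception. E-mail: zhangqing@sit.edu.cn;
Li Yun, male, master's student; his main research interests are image processing and computer vision. E-mail: 370793890@qq.com;
Li Wenju, male, professor; his main research interests are image processing and computer vision. E-mail: wjli@sit.edu.cn;
Lin Jiajun, male, professor; his main research interests are image/video analysis and understanding, and information security. E-mail: jjlin@ecust.edu.cn;
Xiao Mang, male, lecturer; his main research interests are image processing and machine learning. E-mail: mangxiao@sit.edu.cn;
Chen Feiyun, female, associate professor; her main research interest is computer applications. E-mail: cfy@sit.edu.cn.

CLC number: TP391

Document code: A

Article ID: 1006-8961(2019)07-1096-10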

Abstract

Objective Salient object detection identifies the most conspicuous, eye-attracting objects or regions in an image. Results are usually expressed as saliency maps, in which the intensity of each pixel represents the probability that the pixel belongs to a salient region. Saliency detection serves as a pre-processing step for a wide range of vision applications, including image and video compression, image retargeting, visual tracking, and robot navigation. Although performance has improved dramatically in the last few years, the task remains challenging. Most existing methods rely on handcrafted features and on prior knowledge such as contrast, center, background, and objectness priors. Recently, convolutional neural network (CNN)-based approaches, especially end-to-end ones, have been shown to overcome the limits of handcrafted features: they capture high-level information about objects and their cluttered surroundings and have greatly improved detection performance. Handcrafted feature-based algorithms, by contrast, cannot effectively suppress irrelevant background while uniformly highlighting the entire salient object on complicated images with large salient objects, cluttered backgrounds, or multiple salient objects. To overcome this drawback, we propose a salient object detection scheme based on deep semantic information and multiple kernel boosting learning.

Method First, we segment the input image into multiscale superpixels and obtain weak saliency maps through graph-based manifold ranking. Second, we extract deep features carrying semantic information from the multiscale images with a classic pre-trained CNN, select reliable training samples from the multiscale images with the help of the weak saliency maps, and learn a strong salient object detection model through multiple kernel boosting. Then, the strong model is applied to all test samples of the multiscale images, and the per-scale results are fused by linear weighting into a region-level strong saliency map. Finally, the strong saliency map is refined at the pixel level according to the color and position of pixels to further improve accuracy.

Result The proposed model is compared with 11 state-of-the-art methods on three popular public datasets, MSRA5K, ECSSD, and SOD, in terms of precision, recall, F-measure, precision-recall (PR) curves, weighted F-measure, overlapping ratio (OR), mean absolute error (MAE), and visual effect. Compared with the second-best model, a non-end-to-end deep learning model, on MSRA5K, ECSSD, and SOD respectively the F-measure of our algorithm improves by 0.7%, 2.0%, and 2.1%; the weighted F-measure by 18.9%, 27.6%, and 19.8%; the OR by 2.9%, 6.8%, and 7.2%; and the MAE by 34.5%, 26.9%, and 7.5%. Averaged over the three datasets, the improvements are 1.6%, 22.1%, 5.6%, and 22.9%. The visual comparison shows that our method performs well on various complex images, such as images in which salient objects and background share a similar appearance, images with multiple salient objects, salient objects with complex texture and structure, and cluttered backgrounds. The proposed approach uniformly highlights entire salient objects and preserves their contours under various scenarios. We also evaluate each component of the algorithm on the three datasets in terms of PR curves, and report the average running time of our algorithm and of the compared methods. Running times are measured on the ECSSD dataset, where most test images have a resolution of 300×400 pixels, with implementations in MATLAB or C. An efficient C/C++ implementation with parallelized components would reduce the computation time and make the model feasible for real-world applications.

Conclusion Compared with salient object detection methods based on handcrafted features, the proposed model, which extracts semantic information with a classic CNN and learns a strong classifier from four single-kernel support vector machines (SVMs), performs well on complicated images. Further improvement on datasets with complex and confusing backgrounds can be expected. In future research, we plan to use additional CNN features and construct an end-to-end model, which would improve performance and reduce computation cost. We will also study the detection of small salient objects in video.

Key words

salient object detection; saliency detection; deep feature; multiple kernel boosting learning; multiscale detection

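0 Introduction

Salient object detection aims to mimic the perceptual ability of the human visual system by automatically detecting the regions or objects in an image that attract the most attention, so that the main content of a scene can be grasped quickly. Its results are widely used in image analysis and understanding tasks, including person re-identification^[1], scene classification^[2], content-based image retrieval^[3], and adaptive image/video compression^[4].

Traditional saliency detection models compute saliency mainly from handcrafted features. Typically, the image is first segmented into regions, and low-level features of each region, such as edges, texture, color, and shape, are then either used directly to compute saliency^[5-6] or fused, as features or feature maps, by a shallow learning algorithm^[7-8] to obtain the final saliency map. Researchers have introduced various priors into the computation to improve performance. The contrast prior^[5] is the most common and has achieved some success, but contrast-based methods share several problems: the object region is not highlighted uniformly, and high-frequency content in the background is falsely detected.

To address these problems, the center prior^[9-10], the background boundary prior^[6, 11-14], and the objectness prior^[13, 15] have successively been introduced into salient object detection models. Yang et al.^[16] treated superpixels on the image boundary as background, built a sparsely connected graph over the image with superpixels as the smallest computational unit, learned a full affinity matrix from the different ways in which nodes are connected, and finally diffused saliency from the background labels to obtain the saliency map. The method in [17] uses an objectness sampling model to roughly estimate where salient objects lie and combines it with an absorbing Markov chain for saliency detection. Qin et al.^[18] proposed a saliency propagation mechanism based on cellular automata, which explores the relation between similar regions through interaction between neighboring regions.

Handcrafted-feature methods achieve satisfactory results on simple images but may fail on complex scenes. In particular, when foreground and background share similar attributes, handcrafted features often cannot separate the foreground object from background noise. The main reason is that such models cannot capture the semantic information implicit in the image, which deep features provide.

A convolutional neural network (CNN) is a neural network with a convolutional structure; "deep" means that, compared with a traditional artificial neural network, it has more layers and more nodes. CNNs avoid manual feature selection and learn task-relevant features directly from large amounts of data. The features produced by their convolutional layers contain not only low- and mid-level cues such as color, edges, and shape but also rich semantic information. Over the past three years, CNN-based saliency detection has attracted considerable attention and has substantially improved detection accuracy over handcrafted-feature methods^[19-20].

Existing methods fall into two classes according to whether an end-to-end CNN is used for pixel-level saliency prediction: 1) Methods that take segmented regions as the smallest unit and use a CNN to extract deep features of each region for saliency detection^[21-23]. For example, Wang et al.^[21] use one deep neural network, DNN-L, to learn local features and another, DNN-G, to learn global features of the image, and generate the final saliency map by a weighted sum over regions. 2) Methods that build an end-to-end CNN directly for saliency computation^[24-26]. Liu et al.^[24] proposed a hierarchical recurrent convolutional network that uses the VGG16 network and several recurrent convolutional layers to incorporate global information. Li et al.^[25] proposed a fully convolutional network framework based on multi-task learning for salient object detection.

This paper proposes a new salient object detection algorithm that combines deep features with multiple kernel boosting learning. Training samples are selected directly from the input image for multiple kernel boosting, which removes the need for tedious offline training or manually labeled samples, and a pre-trained CNN extracts deep features from the multiscale images. On complex images, the algorithm effectively suppresses cluttered backgrounds and yields salient objects with clear contours.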

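1 Algorithm based on deep features and multiple kernel learning

The overall pipeline of the algorithm is shown in Fig. 1: 1) segment the input image into multiscale superpixels; 2) diffuse saliency with the manifold ranking algorithm to build weak saliency maps; 3) extract deep features of the multiscale images with a CNN; 4) obtain positive and negative training samples from the weak saliency maps and learn a strong classifier by multiple kernel boosting; 5) apply the strong classifier to all test samples of each scale; 6) fuse the resulting multiscale saliency maps by linear weighting into a strong saliency map; 7) refine the superpixel-level strong saliency map at the pixel level to obtain the final saliency map.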

Fig. 1 Overall pipeline of our proposed algorithm based on deep features and multiple kernel boosting learning

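1.1 Multiscale superpixel segmentation

When an image is segmented into superpixels, the superpixel scale affects the detection result. To handle this, we adopt a simple multiscale approach: the simple linear iterative clustering (SLIC) algorithm^[27] segments the input image at 5 scales, with the number of superpixels set to 100, 150, 200, 250, and 300.

1.2 Weak saliency detection model

At each superpixel scale, the simple yet effective graph-based manifold ranking algorithm^[16] produces an initial saliency map $\mathit{\boldsymbol{S}}_{0}$, which is then refined by graph cut.

For the input image $\mathit{\boldsymbol{I}}$, an undirected graph $\mathit{\boldsymbol{G}}=(\mathit{\boldsymbol{V}}, \mathit{\boldsymbol{E}}, \mathit{\boldsymbol{T}})$ is constructed, where $\mathit{\boldsymbol{E}}$ is the set of undirected edges connecting the pixel nodes $\mathit{\boldsymbol{V}}$, and $\mathit{\boldsymbol{T}}$ is the set consisting of the weights to the foreground node $\left\{ {{\mathit{\boldsymbol{T}}^{\rm{f}}}(p)} \right\}$ and the weights to the virtual background node $\left\{ {{\mathit{\boldsymbol{T}}^{\rm{b}}}(p)} \right\}$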

$ {\mathit{\boldsymbol{T}}^{\rm{f}}}(p) = {\mathit{\boldsymbol{S}}_0}(p) $

(1)

$ {\mathit{\boldsymbol{T}}^{\rm{b}}}(p) = 1 - {\mathit{\boldsymbol{S}}_0}(p) $

(2)

The max-flow method^[28] is then used to obtain the saliency map ${\mathit{\boldsymbol{S}}_1}$, which is combined with ${\mathit{\boldsymbol{S}}_0}$ to give the final weak saliency map ${\mathit{\boldsymbol{S}}_{\rm{w}}}$

$ {\mathit{\boldsymbol{S}}_{\rm{w}}} = \frac{{{\mathit{\boldsymbol{S}}_0} + {\mathit{\boldsymbol{S}}_1}}}{2} $

(3)

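1.3 Strong saliency detection model

1.3.1 Deep feature extraction

After multiscale superpixel segmentation of the input image, each superpixel region is fed into a pre-trained CNN to extract deep features carrying semantic information. AlexNet is used for this purpose; it consists of 5 convolutional layers and 3 fully connected layers and has a relatively simple structure. Studies show that although AlexNet was originally trained for visual recognition, the deep features it yields generalize to other computer vision tasks^[26].

Superpixel regions are irregular in shape, whereas a CNN requires a rectangular input. Following [29], the minimal bounding rectangle of each superpixel is used as the AlexNet input, and pixels inside the rectangle but outside the superpixel are replaced by the mean value at the same position in the ImageNet training set. This substitution does not affect the subsequent computation, because AlexNet subtracts the ImageNet training-set mean at the same position from its input before processing, so the replaced pixels become zero.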
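The effect of the mean substitution can be illustrated with a minimal sketch. It assumes a single-channel patch and made-up mean values rather than the real ImageNet mean image, and the helper names `fill_outside_with_mean` and `subtract_mean` are hypothetical; it is not the authors' implementation.

```python
def fill_outside_with_mean(patch, mask, mean_patch):
    """Replace pixels outside the superpixel (mask False) by the mean at the same position."""
    return [
        [pix if inside else mu for pix, inside, mu in zip(p_row, m_row, mu_row)]
        for p_row, m_row, mu_row in zip(patch, mask, mean_patch)
    ]


def subtract_mean(patch, mean_patch):
    """Mean subtraction applied to the network input before feature extraction."""
    return [
        [pix - mu for pix, mu in zip(p_row, mu_row)]
        for p_row, mu_row in zip(patch, mean_patch)
    ]


# Bounding-box crop of one superpixel (illustrative grey values).
patch = [[120.0, 130.0], [90.0, 200.0]]
# True marks pixels that belong to the superpixel.
mask = [[True, False], [True, True]]
# Assumed per-position training-set mean (made-up numbers).
mean_patch = [[104.0, 117.0], [123.0, 110.0]]

filled = fill_outside_with_mean(patch, mask, mean_patch)
centered = subtract_mean(filled, mean_patch)
# The substituted pixel at row 0, column 1 is exactly zero after mean subtraction.
```

Only the pixels that belong to the superpixel keep a non-zero value after centering, which is why the filled-in positions carry no signal into the network.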

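1.3.2 Training set acquisition

The weak saliency map supplies reliable training samples for learning the strong saliency model. Samples are selected with a high and a low threshold: a superpixel whose mean saliency value is above the high threshold is taken as a positive sample and labeled +1, and one whose mean saliency value is below the low threshold is taken as a negative sample and labeled -1. In our experiments, the high threshold is set to twice the mean saliency value of the weak saliency map ${\mathit{\boldsymbol{S}}_{\rm{w}}}$, and the low threshold is set to 0.05.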
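The selection rule can be sketched as follows. The function name `select_training_samples`, the label 0 for superpixels that are left out of training, and the example values are assumptions made for illustration; the thresholds follow the text (twice the mean of the weak saliency map, and 0.05).

```python
def select_training_samples(sp_saliency, map_mean, low_threshold=0.05):
    """Label each superpixel from its mean weak-saliency value.

    sp_saliency: mean weak saliency of each superpixel, in [0, 1].
    map_mean:    mean saliency value of the whole weak saliency map.
    Returns +1 (positive), -1 (negative) or 0 (unused, an assumed convention).
    """
    high_threshold = 2.0 * map_mean
    labels = []
    for value in sp_saliency:
        if value > high_threshold:
            labels.append(1)
        elif value < low_threshold:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


# Five superpixels and a weak-map mean of 0.25, so the high threshold is 0.5.
example_labels = select_training_samples([0.9, 0.02, 0.3, 0.01, 0.6], map_mean=0.25)
```

Superpixels between the two thresholds are ambiguous and are simply not used for training in this sketch.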

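1.3.3 Multiple kernel SVM boosting

The main problem with support vector machine (SVM) classifiers is that it is hard to choose a single kernel function that suits different data. This paper adopts the multiple kernel boosting method^[7], which takes several single-kernel SVM classifiers as weak classifiers and learns a strong classifier through the iterations of AdaBoost.

Multiple kernel learning is a kernel-based learning model, and multiple kernel boosting is an improvement on it. For a single feature, a single kernel function gives a single-kernel SVM, whose decision function is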

$ \mathit{\boldsymbol{Y}}(\mathit{\boldsymbol{r}}) = \sum\limits_{i = 1}^N {{\alpha _i}} {\mathit{\boldsymbol{l}}_i}{\mathit{\boldsymbol{k}}_m}\left( {\mathit{\boldsymbol{r}}, {\mathit{\boldsymbol{r}}_i}} \right) + b $

(4)

where $\left\{ {{\mathit{\boldsymbol{r}}_i}, {\mathit{\boldsymbol{l}}_i}} \right\}_{i = 1}^N$ are the training samples, $α_{i}$ is a Lagrange multiplier, ${{\mathit{\boldsymbol{r}}_i}}$ is the $i$-th sample, $l_{i}$ is its binary label, $N$ is the number of samples, ${\mathit{\boldsymbol{k}}_m}\left({\mathit{\boldsymbol{r}}, {\mathit{\boldsymbol{r}}_i}} \right) = {\mathit{\boldsymbol{\varphi }}_m}\left({{\mathit{\boldsymbol{r}}_i}} \right) \cdot {\mathit{\boldsymbol{\varphi }}_m}(\mathit{\boldsymbol{r}})$ is the inner product of the samples in the reproducing kernel Hilbert space, and $b$ is the bias constant.
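As a small illustration of Eq. (4), the sketch below evaluates the decision function for a query feature vector. The RBF and linear kernels, the parameter `gamma`, and all numeric values are assumptions chosen for the example; the text does not state which kernels or parameters are used.

```python
import math


def linear_kernel(x, y):
    """k(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def svm_decision(r, samples, labels, alphas, b, kernel):
    """Eq. (4): Y(r) = sum_i alpha_i * l_i * k(r, r_i) + b."""
    return sum(
        alpha * label * kernel(r, r_i)
        for alpha, label, r_i in zip(alphas, labels, samples)
    ) + b


# Two made-up training samples with labels +1 and -1.
samples = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, -1]
alphas = [0.5, 0.5]

y_linear = svm_decision([2.0, 0.0], samples, labels, alphas, 0.0, linear_kernel)
y_rbf = svm_decision([1.0, 0.0], samples, labels, alphas, 0.0, rbf_kernel)
```

A positive output assigns the query to the salient class and a negative one to the background class; swapping the `kernel` argument is all that distinguishes one single-kernel SVM from another in this sketch.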

Combining several base classifiers through multiple kernel learning can outperform a single-kernel model. Eq. (4) then becomes

$ \mathit{\boldsymbol{Y}}(\mathit{\boldsymbol{r}}) = \sum\limits_{m = 1}^M {{\beta _m}} \sum\limits_{i = 1}^N {{\alpha _i}} {\mathit{\boldsymbol{l}}_i}{\mathit{\boldsymbol{k}}_m}\left( {\mathit{\boldsymbol{r}}, {\mathit{\boldsymbol{r}}_i}} \right) + b $

(5)

where $β_{m}$ is the weight of the $m$-th kernel function, $\sum\limits_{m = 1}^M {{\beta _m}} = 1$, and $M$ is the number of kernel functions.

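The multiple kernel boosting algorithm uses AdaBoost in place of the simple combination of single-kernel SVMs in classical multiple kernel learning. Eq. (5) becomes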

$ \mathit{\boldsymbol{Y}}(\mathit{\boldsymbol{r}}) = \sum\limits_{j = 1}^J {{\beta _j}} {\mathit{\boldsymbol{z}}_j}(\mathit{\boldsymbol{r}}) $

(6)

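where ${\mathit{\boldsymbol{z}}_j}(\mathit{\boldsymbol{r}}) = {\mathit{\boldsymbol{\alpha }}^{\rm{T}}}{\mathit{\boldsymbol{k}}_j}(\mathit{\boldsymbol{r}}) + {b_j}$ is the decision function of a single-kernel SVM and $J$ is the number of boosting iterations. Each SVM can be regarded as a weak classifier, and the weighted combination of all weak classifiers gives the final strong classifier.

After $J$ iterations, $J$ pairs of $β_{j}$ and ${\mathit{\boldsymbol{z}}_j}(\mathit{\boldsymbol{r}})$ are obtained, which yield a strong classifier, i.e., the strong saliency model. This classifier can be applied directly to all superpixels of the current image to obtain a superpixel-level saliency map.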
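The boosting step of Eq. (6) can be sketched with the standard discrete AdaBoost update. This is an illustration under assumptions, not the authors' implementation: the weak SVMs are taken as already trained and are represented only by their decision values on the training samples, the weight rule $\beta_j = \frac{1}{2}\ln\frac{1-\varepsilon_j}{\varepsilon_j}$ is the textbook AdaBoost choice, and the names `boost` and `strong_score` and all numbers are hypothetical.

```python
import math


def sign(value):
    """Map a real-valued decision output to a class label in {+1, -1}."""
    return 1 if value >= 0 else -1


def boost(decision_values, labels, n_rounds):
    """Select and weight weak classifiers with discrete AdaBoost.

    decision_values[k][i] is the output z_k(r_i) of candidate SVM k on sample i.
    Returns a list of (beta_j, k_j) pairs, one per boosting round.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    chosen = []
    for _ in range(n_rounds):
        # Weighted classification error of every candidate weak classifier.
        errors = [
            sum(w for w, z, l in zip(weights, outputs, labels) if sign(z) != l)
            for outputs in decision_values
        ]
        k = min(range(len(errors)), key=errors.__getitem__)
        eps = max(errors[k], 1e-10)
        if eps >= 0.5:
            break  # no candidate is better than chance
        beta = 0.5 * math.log((1.0 - eps) / eps)
        chosen.append((beta, k))
        # Increase the weight of misclassified samples, then renormalise.
        weights = [
            w * math.exp(-beta * l * sign(z))
            for w, z, l in zip(weights, decision_values[k], labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen


def strong_score(chosen, decision_values, i):
    """Eq. (6): Y(r_i) = sum_j beta_j * z_j(r_i)."""
    return sum(beta * decision_values[k][i] for beta, k in chosen)


# Four training samples and three candidate weak SVMs, each wrong on one sample.
train_labels = [1, 1, -1, -1]
pool = [
    [0.8, -0.2, -0.6, -0.9],
    [0.5, 0.7, 0.3, -0.4],
    [-0.3, 0.6, -0.5, -0.8],
]
chosen = boost(pool, train_labels, n_rounds=3)
predictions = [sign(strong_score(chosen, pool, i)) for i in range(4)]
```

In this toy pool no single candidate classifies all four samples correctly, while the weighted combination does, which is the motivation for boosting several single-kernel SVMs.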

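The algorithm applies the trained strong saliency model to each scale of the input image to obtain a series of superpixel-level saliency maps, which are then fused by linear weighting into the final strong saliency map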

$ {\mathit{\boldsymbol{S}}_{\rm{S}}} = \mathit{norm}\left( {\sum\limits_{i = 1}^N {{\mathit{\boldsymbol{S}}_i}} /N} \right) $

(7)

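where $norm(·)$ denotes normalization, $N$ is the number of scales in the multiscale superpixel segmentation ($N=5$ in our experiments), and ${\mathit{\boldsymbol{S}}_i}$ is the saliency map obtained by strong saliency detection at the $i$-th scale.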
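Eq. (7) amounts to averaging the per-scale maps and rescaling the result. The sketch below assumes that $norm(·)$ is min-max normalization to [0, 1], which the text does not specify, and stores each map as a flat list of pixel values; the function name `fuse_scales` is hypothetical.

```python
def fuse_scales(saliency_maps):
    """Average N same-sized saliency maps and min-max normalise the result."""
    n = len(saliency_maps)
    mean_map = [sum(values) / n for values in zip(*saliency_maps)]
    lo, hi = min(mean_map), max(mean_map)
    if hi == lo:
        # A constant map carries no ranking information.
        return [0.0] * len(mean_map)
    return [(value - lo) / (hi - lo) for value in mean_map]


# Two scales of a three-pixel image (illustrative values).
fused = fuse_scales([[0.2, 0.8, 0.5], [0.4, 1.0, 0.5]])
```

With five scales, as in the experiments, the call is the same with a list of five maps.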

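1.4 Pixel-level refinement of the strong saliency map

The strong saliency map ${\mathit{\boldsymbol{S}}_{\rm{S}}}$ is superpixel-level, i.e., all pixels within the same superpixel share the same saliency value. To strengthen the spatial coherence between pixels, the algorithm uses a fully connected conditional random field (CRF)^[30] model to refine saliency at the pixel level. The final saliency value of each pixel is determined from the color difference and positional distance between pixels. The fully connected CRF model casts the pixel-level refinement of the saliency map as a binary labeling problem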

$ E(L) = - \sum\limits_i {\lg } P\left( {{l_i}} \right) + \sum\limits_{i, j} {{\theta _{ij}}} \left( {{l_i}, {l_j}} \right) $

(8)

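where $L$ denotes the binary labels of all pixels and $P(l_{i})$ is the probability that pixel $x_{i}$ is assigned label $l_{i}$, i.e., the saliency probability of pixel $x_{i}$. At initialization, $P(1)=Sal_{i}$ and $P(0)=1-Sal_{i}$, where $Sal_{i}$ is the saliency value of pixel $x_{i}$ in the strong saliency map ${\mathit{\boldsymbol{S}}_{\rm{S}}}$. $θ_{ij}(l_{i}, l_{j})$ is defined as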

$ \begin{array}{l} {\theta _{ij}} = \mu \left( {{l_i}, {l_j}} \right)\left[ {{\omega _1}\exp \left( { - \frac{{{{\left\| {{p_i} - {p_j}} \right\|}^2}}}{{2\sigma _\alpha ^2}} - \frac{{{{\left\| {{I_i} - {I_j}} \right\|}^2}}}{{2\sigma _\beta ^2}}} \right) + } \right.\\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\left. {{\omega _2}\exp \left( { - \frac{{{{\left\| {{p_i} - {p_j}} \right\|}^2}}}{{2\sigma _\gamma ^2}}} \right)} \right] \end{array} $

(9)

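where $μ(l_{i}, l_{j})=1$ when $l_{i}≠l_{j}$ and 0 otherwise, $p_{i}$ is the position of pixel $x_{i}$, and $I_{i}$ is its color. The parameters $σ_{α}$ and $σ_{β}$ control the sensitivity of the algorithm to spatial distance and color similarity. The first exponential term of Eq. (9) encourages pixels that are close in color and position to take similar saliency values, and the second exponential term removes small isolated regions; the parameter $σ_{γ}$ controls the scale of the Gaussian kernel. In our experiments, the parameters are set to $ω_{1}=3$, $ω_{2}=5$, $σ_{α}=3$, $σ_{β}=50$, and $σ_{γ}=3$.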
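The pairwise term of Eq. (9) can be evaluated for a single pixel pair as in the following sketch, using the parameter values given above as defaults. The function name `pairwise_potential` and the example pixels are hypothetical, and the sketch computes only the potential; it does not perform CRF inference.

```python
import math


def pairwise_potential(l_i, l_j, p_i, p_j, c_i, c_j,
                       w1=3.0, w2=5.0,
                       sigma_alpha=3.0, sigma_beta=50.0, sigma_gamma=3.0):
    """Eq. (9) for one pixel pair; zero when both pixels carry the same label."""
    if l_i == l_j:
        return 0.0
    d_pos = sum((a - b) ** 2 for a, b in zip(p_i, p_j))  # ||p_i - p_j||^2
    d_col = sum((a - b) ** 2 for a, b in zip(c_i, c_j))  # ||I_i - I_j||^2
    appearance = w1 * math.exp(-d_pos / (2 * sigma_alpha ** 2)
                               - d_col / (2 * sigma_beta ** 2))
    smoothness = w2 * math.exp(-d_pos / (2 * sigma_gamma ** 2))
    return appearance + smoothness


# Neighbouring pixels with similar colours but different labels: high penalty.
near_similar = pairwise_potential(1, 0, (10, 10), (10, 11),
                                  (200, 30, 30), (198, 32, 29))
# Distant pixels with very different colours: almost no penalty.
far_different = pairwise_potential(1, 0, (10, 10), (60, 80),
                                   (200, 30, 30), (20, 220, 240))
```

The penalty is largest for nearby, similarly colored pixels that receive different labels, which is what pushes such pixels toward the same saliency value.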

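Fig. 2 compares the saliency maps generated at each stage of the algorithm. The weak saliency map (Fig. 2(c)) roughly locates the salient object and the background regions. The strong saliency map (Fig. 2(d)) learns from the object and background information provided by the weak saliency map and further highlights the foreground object while suppressing the background. Because the strong saliency map assigns saliency per superpixel, it falls short in highlighting the object uniformly; the final saliency map obtained by pixel-level refinement (Fig. 2(e)) therefore performs better.

Fig. 2 Saliency detection process diagram ((a) input image; (b) ground truth; (c) weak saliency map; (d) strong saliency map; (e) refined pixel-level saliency map)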

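2 Experimental results and analysis

The algorithm runs under Windows 7 on a PC with an Intel Core i7-4790 CPU (3.6 GHz) and 16 GB of memory, and its code is implemented in MATLAB R2016b. Performance is compared and analyzed on the public benchmark datasets MSRA5K, ECSSD, and SOD, all of which provide manually labeled pixel-level salient objects. MSRA5K is relatively large and contains 5 000 images. ECSSD contains 1 000 natural images with complex structure and texture. SOD contains 300 images, some of which include multiple salient objects.

Five measures are used to evaluate performance: the precision-recall (PR) curve, the mean absolute error (MAE), the F-measure, the weighted F-measure (wF-measure), and the overlapping ratio (OR). The PR curve shows the precision-recall performance of the saliency maps under different thresholds. The F-measure is an overall indicator that combines precision (P) and recall (R). The measures are computed as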

$ {f_{\rm{P}}} = \frac{{\sum {{\mathit{\boldsymbol{S}}_{\rm{g}}}} \times {\mathit{\boldsymbol{S}}_{\rm{d}}}}}{{\sum {{\mathit{\boldsymbol{S}}_{\rm{d}}}} }} $

(10)

$ {f_{\rm{R}}} = \frac{{\sum {{\mathit{\boldsymbol{S}}_{\rm{g}}}} \times {\mathit{\boldsymbol{S}}_{\rm{d}}}}}{{\sum {{\mathit{\boldsymbol{S}}_{\rm{g}}}} }} $

(11)

$ {f_{\rm{F}}} = \frac{{(1 + \alpha ) \times {f_{\rm{P}}} \times {f_{\rm{R}}}}}{{\alpha \times {f_{\rm{P}}} + {f_{\rm{R}}}}} $

(12)

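where ${\mathit{\boldsymbol{S}}_{\rm{d}}}$ is the saliency map detected by the algorithm and ${\mathit{\boldsymbol{S}}_{\rm{g}}}$ is the ground truth. To emphasize precision, $\alpha$ (often written ${\beta ^2}$ in the literature) is usually set to 0.3.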
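Eqs. (10) to (12) can be checked on a toy example. The sketch assumes a binarized detection map and a binary ground truth, both stored as flat lists; the function name `precision_recall_f` and the data are illustrative.

```python
def precision_recall_f(detected, ground_truth, alpha=0.3):
    """Eqs. (10)-(12): precision, recall and F-measure of one saliency map."""
    overlap = sum(d * g for d, g in zip(detected, ground_truth))
    sum_d = sum(detected)
    sum_g = sum(ground_truth)
    precision = overlap / sum_d if sum_d else 0.0
    recall = overlap / sum_g if sum_g else 0.0
    denom = alpha * precision + recall
    f_measure = (1 + alpha) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Four pixels detected as salient, two of them truly salient.
detected = [1, 1, 1, 1, 0, 0]
ground_truth = [1, 1, 0, 0, 0, 0]
p, r, f = precision_recall_f(detected, ground_truth)
```

Here the precision is 0.5 and the recall is 1.0; with $\alpha = 0.3$ the F-measure is about 0.565, closer to the precision than a plain harmonic mean (about 0.667) would be, which is the sense in which precision is emphasized.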

The wF-measure^[31] is an improvement on the traditional F-measure. It is computed analogously, as the weighted harmonic mean of $f_{\rm{P}}^\omega $ and $f_{\rm{R}}^\omega $

$ f_{\rm{F}}^\omega = \frac{{(1 + \alpha ) \times f_{\rm{P}}^\omega \times f_{\rm{R}}^\omega }}{{\alpha \times f_{\rm{P}}^\omega + f_{\rm{R}}^\omega }} $

(13)

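The mean absolute error (MAE) is defined as the average absolute difference between the pixel values of the saliency map and the ground truth, for an image of height $H$ and width $W$

$ {f_{{\rm{MAE}}}} = \frac{1}{{H \times W}}\sum\limits_{i = 1}^H {\sum\limits_{j = 1}^W {\left| {{\mathit{\boldsymbol{S}}_{\rm{d}}}(i, j) - {\mathit{\boldsymbol{S}}_{\rm{g}}}(i, j)} \right|} } $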

(14)

The overlapping ratio (OR) is defined as

$ {f_{{\rm{OR}}}} = \frac{{\left| {\mathit{\boldsymbol{M}} \cap {\mathit{\boldsymbol{S}}_{\rm{g}}}} \right|}}{{\left| {\mathit{\boldsymbol{M}} \cup {\mathit{\boldsymbol{S}}_{\rm{g}}}} \right|}} $

(15)

where $\mathit{\boldsymbol{M}}$ is the binary map of ${\mathit{\boldsymbol{S}}_{\rm{d}}}$, obtained by thresholding at twice the mean value of the saliency map.
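Eqs. (14) and (15) can be sketched in the same way. Maps are stored as flat lists with values in [0, 1]; treating a pixel whose value equals the threshold as salient (`>=`) is an assumption, and the function names and data are illustrative.

```python
def mean_absolute_error(detected, ground_truth):
    """Eq. (14): mean absolute pixel difference between the two maps."""
    return sum(abs(d - g) for d, g in zip(detected, ground_truth)) / len(detected)


def overlap_ratio(detected, ground_truth):
    """Eq. (15): |M intersect S_g| / |M union S_g|, M thresholded at twice the map mean."""
    threshold = 2.0 * sum(detected) / len(detected)
    mask = [1 if d >= threshold else 0 for d in detected]
    inter = sum(1 for m, g in zip(mask, ground_truth) if m and g)
    union = sum(1 for m, g in zip(mask, ground_truth) if m or g)
    return inter / union if union else 0.0


# Six-pixel toy maps: the detector misses one of the three salient pixels.
detected_map = [0.9, 0.8, 0.1, 0.0, 0.0, 0.0]
truth_map = [1, 1, 1, 0, 0, 0]
mae = mean_absolute_error(detected_map, truth_map)
o_r = overlap_ratio(detected_map, truth_map)
```

In this example the mean of the detected map is 0.3, so the threshold is 0.6 and two pixels survive binarization, giving an OR of 2/3 and an MAE of 0.2.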

Eleven algorithms that are representative of recent years or related to the proposed algorithm are chosen for comparison: MR (manifold ranking)^[16], PCA (principal component analysis)^[28], MSS (multi-scale superpixels)^[10], RBD (robust background detection)^[12], BL (bootstrap learning)^[7], LPS (label propagation saliency)^[13], RR (regularized random walks ranking)^[6], BSCA (background-based map optimized via single-layer cellular automata)^[18], SMD (structured matrix decomposition)^[14], ELD (encoded low level distance)^[23], and NLDF (non-local deep features)^[20]. For all of them, the code or the saliency maps are made public by their authors. ELD and NLDF are well-performing deep neural network-based salient object detection algorithms from the past three years.

2.1 Objective quantitative evaluation of detection performance

The proposed algorithm and the 11 representative algorithms are evaluated on the three datasets. Fig. 3 compares the F-measure, wF-measure, and OR values of the 12 algorithms on the three datasets. Fig. 4 compares their MAE values, and Fig. 5 compares their PR curves.

Fig. 3 Quantitative comparison of saliency maps on three datasets in terms of F-measure, wF-measure and OR ((a) ECSSD; (b) MSRA5K; (c) SOD)

Fig. 4 Quantitative comparison of saliency maps on MSRA5K, ECSSD and SOD in terms of MAE scores

Fig. 5 Quantitative comparison of saliency maps on three datasets in terms of PR curves ((a) ECSSD; (b) MSRA5K; (c) SOD)

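Figs. 3 to 5 show that the proposed algorithm substantially outperforms traditional handcrafted-feature saliency detection algorithms in terms of PR curves, F-measure, wF-measure, OR, and MAE. This is mainly because the computation of the strong saliency map introduces deep features from an existing CNN, which better describe the semantic information of the image. The proposed algorithm still falls short of representative CNN-trained detection algorithms of the last two years, for two main reasons: 1) To limit computation time, the structurally simple AlexNet is used for feature extraction. A deeper and wider CNN, such as VGGNet or ResNet, could replace AlexNet to improve performance further. 2) The model does not train a new CNN for the saliency detection task; it only uses AlexNet, a CNN for object recognition, to extract deep features, whereas both ELD and NLDF design new network structures on top of existing CNNs and train them for saliency detection. Future work will study the design of new networks to improve detection performance.

2.2 Subjective visual comparison

The proposed algorithm is compared visually with MR, PCA, MSS, RBD, BL, LPS, RR, BSCA, and SMD in Fig. 6. For images with textured backgrounds (row 4 of Fig. 6) or cluttered backgrounds (rows 2 and 3 of Fig. 6), the proposed algorithm suppresses the background effectively and highlights the salient object more completely than the other algorithms. For images with large salient objects (row 1 of Fig. 6) or multiple salient objects (row 5 of Fig. 6), it also detects well. For images with low contrast between foreground and background (row 6 of Fig. 6), it remains robust.

Fig. 6 Visual comparisons of saliency maps produced by different approaches ((a) input images; (b) ground truth; (c) MR; (d) PCA; (e) MSS; (f) RBD; (g) BL; (h) LPS; (i) RR; (j) BSCA; (k) SMD; (l) ours)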

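2.3 Analysis of the algorithm components

Fig. 7 shows the PR curves, on the three datasets, of the weak saliency map, the superpixel-level strong saliency map, and the refined pixel-level saliency map generated by the proposed algorithm. The weak saliency map performs worst and the refined saliency map performs best, which demonstrates the effectiveness of extracting deep features with a pre-trained deep neural network, training SVMs with multiple kernel functions, and the pixel-level refinement step.

Fig. 7 Quantitative comparison of different components of the proposed algorithm on different datasets in terms of PR curves ((a) ECSSD; (b) MSRA5K; (c) SOD)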

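2.4 Running time analysis

Table 1 compares the average detection time of 10 algorithms. The proposed algorithm takes its training samples directly from the input image and obtains deep features from a pre-trained CNN, so it needs neither offline training nor tedious manual labeling of a dataset. It takes about 5.6 s to process an image of 300×400 pixels. Because the strong saliency map is generated from multiscale computation, computing the scales in parallel in a more efficient programming language would reduce the running time and make the algorithm more practical. Taking the quantitative measures and the subjective visual results together, the proposed algorithm compares favorably with the handcrafted-feature methods.

Table 1 Comparison of running times

Algorithm	Time/s	Language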
MR	0.94	MATLAB
PCA	3.03	C
MSS	1.01	MATLAB
RBD	0.23	MATLAB
BL	16.94	MATLAB
LPS	1.47	MATLAB
RR	2.73	MATLAB
BSCA	1.02	MATLAB
SMD	0.74	MATLAB+C
Ours	5.6	MATLAB+C

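3 Conclusion

This paper proposes a salient object detection algorithm based on deep features and multiple kernel boosting learning. It requires neither an offline-trained model nor manually collected labels, and learns its classifier from training data taken from the test image itself. The method segments the input image at multiple scales with SLIC superpixel segmentation, extracts deep features of the image with the classic CNN AlexNet, obtains a reliable sample set from the weak saliency map, produces a superpixel-level strong saliency map by multiple kernel boosting learning, and refines saliency at the pixel level according to color and position information. Comparative experiments on the three commonly used public datasets MSRA5K, ECSSD, and SOD show that the algorithm has advantages on complex images and produces saliency maps close to human visual perception.

The main limitation is that an existing CNN extracts deep features from each segmented block of the image, so the time complexity of the algorithm is proportional to the number of segmented regions, which makes the algorithm inefficient. Future work will study saliency detection with an end-to-end CNN to improve both efficiency and performance, as well as instance-level salient object segmentation.

References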

[1] Zhao R, Ouyang W, Wang X G. Person re-identification by saliency learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(2): 356–370. [DOI:10.1109/TPAMI.2016.2544310]

[2] Zhang F, Du B, Zhang L P. Saliency-guided unsupervised feature learning for scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2015, 53(4): 2175–2184. [DOI:10.1109/TGRS.2014.2357078]

[3] Yang X Y, Qian X M, Xue Y. Scalable mobile image retrieval by exploring contextual saliency[J]. IEEE Transactions on Image Processing, 2015, 24(6): 1709–1721. [DOI:10.1109/TIP.2015.2411433]

[4] Sun M, Farhadi A, Taskar B, et al. Salient montages from unconstrained videos[C]//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014: 472-488.[DOI:10.1007/978-3-319-10584-0_31]

[5] Cheng M M, Mitra N J, Huang X L, et al. Global contrast based salient region detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 569–582. [DOI:10.1109/TPAMI.2014.2345401]

[6] Li C Y, Yuan Y C, Cai W D, et al. Robust saliency detection via regularized random walks ranking[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 2710-2717.[DOI:10.1109/CVPR.2015.7298887]

[7] Tong N, Lu H C, Ruan X, et al. Salient object detection via bootstrap learning[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 1884-1892.[DOI:10.1109/CVPR.2015.7298798]

[8] Zhou X F, Liu Z, Sun G L, et al. Improving saliency detection via multiple kernel boosting and adaptive fusion[J]. IEEE Signal Processing Letters, 2016, 23(4): 517–521. [DOI:10.1109/LSP.2016.2536743]

[9] Yan Q, Xu L, Shi J P, et al. Hierarchical saliency detection[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 1155-1162.[DOI:10.1109/CVPR.2013.153]

[10] Tong N, Lu H C, Zhang L H, et al. Saliency detection with multi-scale superpixels[J]. IEEE Signal Processing Letters, 2014, 21(9): 1035–1039. [DOI:10.1109/LSP.2014.2323407]

[11] Zhang Q, Lin J J, Tao Y Y, et al. Salient object detection via color and texture cues[J]. Neurocomputing, 2017, 243: 35–48. [DOI:10.1016/j.neucom.2017.02.064]

[12] Zhu W J, Liang S, Wei Y C, et al. Saliency optimization from robust background detection[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 2814-2821.[DOI:10.1109/CVPR.2014.360]

[13] Li H Y, Lu H C, Lin Z, et al. Inner and inter label propagation:salient object detection in the wild[J]. IEEE Transactions on Image Processing, 2015, 24(10): 3176–3186. [DOI:10.1109/TIP.2015.2440174]

[14] Peng H W, Li B, Ling H B, et al. Salient object detection via structured matrix decomposition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 818–832. [DOI:10.1109/TPAMI.2016.2562626]

[15] Yan X Y, Wang Y H, Song Q, et al. Salient object detection via boosting object-level distinctiveness and saliency refinement[J]. Journal of Visual Communication and Image Representation, 2017, 48: 224–237. [DOI:10.1016/j.jvcir.2017.06.013]

[16] Yang C, Zhang L H, Lu H C, et al. Saliency detection via graph-based manifold ranking[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 3166-3173.[DOI:10.1109/CVPR.2013.407]

[17] Zhang Q, Luo D S, Li W J, et al. Two-stage absorbing Markov chain for salient object detection[C]//Proceedings of 2017 IEEE International Conference on Image Processing. Beijing: IEEE, 2017: 895-899.[DOI:10.1109/ICIP.2017.8296410]

[18] Qin Y, Lu H C, Xu Y Q, et al. Saliency detection via cellular automata[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 110-119.[DOI:10.1109/CVPR.2015.7298606]

[19] Kuen J, Wang Z H, Wang G. Recurrent attentional network for saliency detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 3668-3677.[DOI:10.1109/CVPR.2016.78]

[20] Luo Z M, Mishra A, Achkar A, et al. Non-local deep features for salient object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6593-6601.[DOI:10.1109/CVPR.2017.698]

[21] Wang L J, Lu H C, Ruan X, et al. Deep networks for saliency detection via local estimation and global search[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3183-3192.[DOI:10.1109/CVPR.2015.7298938]

[22] Li G B, Yu Y Z. Visual saliency based on multiscale deep features[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 5455-5463.[DOI:10.1109/CVPR.2015.7299184]

[23] Lee G, Tai Y W, Kim J. Deep saliency with encoded low level distance map and high level features[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 660-668.[DOI:10.1109/CVPR.2016.78]

[24] Liu N, Han J W. DHSNet: deep hierarchical saliency network for salient object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 678-686.[DOI:10.1109/CVPR.2016.80]

[25] Li X, Zhao L M, Wei L N, et al. DeepSaliency:multi-task deep neural network model for salient object detection[J]. IEEE Transactions on Image Processing, 2016, 25(8): 3919–3930. [DOI:10.1109/TIP.2016.2579306]

[26] Wang L Z, Wang L J, Lu H C, et al. Saliency detection with recurrent fully convolutional networks[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 825-841.[DOI:10.1007/978-3-319-46493-0_50]

[27] Achanta R, Shaji A, Smith K, et al. SLIC superpixels compared to state-of-the-art superpixel methods[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11): 2274–2282. [DOI:10.1109/TPAMI.2012.120]

[28] Margolin R, Tal A, Zelnik-Manor L, et al. What makes a patch distinct?[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 1139-1146.[DOI:10.1109/CVPR.2013.151]

[29] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.[DOI:10.1109/CVPR.2014.81]

[30] Krähenbühl P, Koltun V. Efficient inference in fully connected CRFs with Gaussian edge potentials[C]//Proceedings of the 24th International Conference on Neural Information Processing Systems. Granada, Spain: Curran Associates Inc., 2011: 109-117.

[31] Margolin R, Zelnik-Manor L, Tal A. How to evaluate foreground maps[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 248-255.[DOI:10.1109/CVPR.2014.39]