Published: 2021-10-16
DOI: 10.11834/jig.200317
2021 | Volume 26 | Number 10

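Image Analysis and Recognition

Multi-path collaborative salient object detection based on RGB-T images

Jiang Tingting¹, Liu Yu¹, Ma Xin², Sun Jinglin¹

1. School of Microelectronics, Tianjin University, Tianjin 300072, China;

2. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China

Received: 2020-06-28; Revised: 2020-08-26; Preprint: 2020-09-02

Supported by: Key Science and Technology Specific Projects of Yunnan Province: Digital Research and Application Demonstration of Yunnan Characteristic Industries (202002AD080001); Key Science and Technology Specific Projects of Tianjin (18ZXRHSY00190)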

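About the authors: Jiang Tingting, born in 1995, male, master's student; his research focuses on salient object detection based on RGB-T images. E-mail: jtt18822197331@163.com
Liu Yu, male, professor; his research interests include multimedia signal processing and machine learning. E-mail: liuyu@tju.edu.cn
Ma Xin, corresponding author, female, lecturer; her research focuses on digital signal processing. E-mail: maxin2789@126.com
Sun Jinglin, female, doctoral student; her research interests include artificial intelligence and machine vision. E-mail: sunjinglin@tju.edu.cn
*Corresponding author: Ma Xin maxin2789@126.com

CLC number: TP391

Document code: A

Article ID: 1006-8961(2021)10-2388-12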

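Abstract

Objective Salient object detection underpins many machine vision applications, yet many existing methods perform poorly in complex scenes, for example when the salient object resembles the background or when illumination is low. To improve saliency detection, we propose a multi-path collaborative salient object detection method for RGB-T (thermal) images.

Method The model consists of two backbone networks and three decoding branches. The backbones extract feature representations of the RGB and thermal images; the decoding branches predict the salient objects from the RGB features, the thermal features, and their fused features, respectively, in a collaborative and complementary manner. In the backbones, a feature enhance module fuses complementary information from the two modalities, and a suitably modified pyramid pooling module captures global semantic information from deep features. During decoding, a channel attention mechanism further distinguishes the semantic differences between channels of the features produced by the convolutional neural network (CNN).

Result On the VT821 and VT1000 datasets, the proposed method achieves maximum F-measure values of 0.8437 and 0.8805 and mean absolute error (MAE) values of 0.0394 and 0.0322, respectively, improving overall detection performance over the compared methods.

Conclusion Comparative experiments show that the proposed method makes saliency detection more stable and yields better results in some low-light scenes.

Key words

RGB-T salient object detection; multi-modal image fusion; multi-path collaborative prediction; channel attention mechanism; pyramid pooling module (PPM)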

Multi-path collaborative salient object detection based on RGB-T images

Jiang Tingting¹, Liu Yu¹, Ma Xin², Sun Jinglin¹

1. School of Microelectronics, Tianjin University, Tianjin 300072, China;

2. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China

Supported by: Key Science and Technology Specific Projects of Yunnan Province: Digital Research and Application Demonstration of Yunnan Characteristic Industries (202002AD080001); Key Science and Technology Specific Projects of Tianjin (18ZXRHSY00190)

Abstract

Objective Saliency detection is a fundamental technology in computer vision and image processing that aims to identify the most visually distinctive objects or regions in an image. As a preprocessing step, salient object detection plays a critical role in many computer vision applications, including visual tracking, scene classification, image retrieval, and content-based image compression. Numerous salient object detection methods have been proposed, but most are designed for RGB images or RGB-depth (RGB-D) images, and they still struggle in some complex scenarios. RGB methods may fail to distinguish salient objects from the background when the foreground and background are similar or the contrast is low. RGB-D methods also degrade under low light and varying illumination. Because thermal infrared images are insensitive to illumination conditions, we propose a multi-path collaborative salient object detection method that improves saliency detection by exploiting the multi-modal feature information of RGB and thermal images.

Method We design a novel end-to-end deep neural network for RGB-thermal (RGB-T) salient object detection. It consists of an encoder network and a decoder network and includes a feature enhance module, a pyramid pooling module, a channel attention module, and an l₁-norm fusion strategy. First, the main body of the model contains two backbone networks that extract the feature representations of the RGB and thermal images, respectively. Three decoding branches then predict saliency maps in a coordinated and complementary manner from the extracted RGB features, the thermal features, and their fused features, respectively. The two backbone streams share the same structure, which is based on the Visual Geometry Group 19-layer (VGG-19) network. To better fit the saliency detection task, we keep only the five convolutional blocks of VGG-19 and discard the last pooling layer and the fully connected layers to preserve more spatial information from the input image. Second, the feature enhance module extracts and fuses multi-modal complementary cues from the RGB and thermal streams. The modified pyramid pooling module captures global semantic information from deep-level features, which is used to locate salient objects. Finally, in the decoding process, the channel attention mechanism distinguishes the semantic differences between channels, improving the decoder's ability to separate salient objects from the background. The entire model is trained end to end. Our training set consists of 900 aligned RGB-T image pairs randomly selected from each subset of the VT1000 dataset. To prevent overfitting, we augment the training set by flipping and rotation. Our method is implemented with the PyTorch toolbox and trained on a PC with a GTX 1080Ti GPU with 11 GB of memory. The input images are uniformly resized to 256×256 pixels. The momentum, weight decay, and learning rate are set to 0.9, 0.0005, and 1E-9, respectively. During training, the cross-entropy loss is used to supervise the entire network.

Result We compare our model with four state-of-the-art saliency models, two RGB-based and two RGB-D-based, on two public datasets, VT821 and VT1000. The quantitative evaluation metrics are F-measure, mean absolute error (MAE), and precision-recall (PR) curves, and we also provide several saliency maps of each method for visual comparison. The experimental results show that our model outperforms the other methods overall, and its saliency maps have more refined shapes under challenging conditions such as poor illumination and low contrast. On VT821, our method obtains the best maximum F-measure and MAE among the compared methods: the maximum F-measure (higher is better) is 0.26 percentage points higher, and the MAE (lower is better) is 0.17 percentage points lower, than those of the second-ranked method. On VT1000, our model also achieves the best maximum F-measure, 88.05%, which is 0.46 percentage points higher than that of the second-ranked method. Its MAE of 3.22%, however, is 0.09 percentage points higher than, and thus slightly worse than, that of the first-ranked method.

Conclusion We propose a CNN-based method for RGB-T salient object detection. To the best of our knowledge, existing saliency detection methods are mostly based on RGB or RGB-D images, so exploring CNNs for RGB-T salient object detection is worthwhile. The experimental results on two public RGB-T datasets show that the proposed method performs better overall than the compared state-of-the-art methods, especially in challenging scenes with poor illumination, complex backgrounds, or low contrast, which indicates that fusing multi-modal information from RGB and thermal images is effective. However, public datasets for RGB-T salient object detection remain scarce, and dataset size strongly affects the performance of deep networks. Detection speed is also a key measure when saliency detection serves as a preprocessing step for other computer vision tasks. In future work, we will therefore collect more high-quality datasets for RGB-T salient object detection and design more lightweight models to speed up detection.

Key words

RGB-T salient object detection; multi-modal images fusion; multi-path collaborative prediction; channel attention mechanism; pyramid pooling module(PPM)

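0 Introduction

Salient object detection aims to separate the parts of an image that most attract human visual attention. It plays an important role as a preprocessing step in many machine vision applications, such as image retrieval (Fan et al., 2015), image compression (Guo and Zhang, 2010), semantic segmentation (Wei et al., 2017), and visual tracking (Wang et al., 2015), where it helps reduce computation and improve performance.

With the rapid development of deep learning and its strong feature extraction ability, deep-learning-based salient object detection has produced a wealth of results. Most methods, however, take RGB or RGB-D (depth) images as input, and both have limitations. The former have difficulty detecting salient objects accurately in scenes with low contrast or complex backgrounds; the latter have difficulty obtaining depth information for transparent or translucent objects, dark objects, and objects beyond the sensor range. More importantly, illumination can severely degrade both kinds of methods. Because thermal infrared images are unaffected by illumination and have good penetrating ability, saliency detection methods based on RGB-T (thermal) images have been proposed. Tang et al. (2019) and Tu et al. (2019) each built an RGB-T saliency detection dataset and proposed a cooperative ranking algorithm and a collaborative graph learning algorithm, respectively, for RGB-T saliency detection. These methods fuse RGB and thermal features to some extent but remain limited in special cases, such as salient objects located at the image boundary or resembling the background. Zhang et al. (2019) used convolutional neural networks (CNNs) to learn deep features of RGB-T images on the public RGB-T dataset (Tang et al., 2019) and fused multi-level RGB-T features so that the two modalities complement each other, improving RGB-T saliency detection.

To further address the limitations above and explore RGB-T image fusion for salient object detection, this paper proposes a method based on RGB and thermal feature enhancement and multi-path collaborative prediction. The model is divided into an encoder and a decoder. The two backbone networks of the encoder, based on VGG19 (Visual Geometry Group 19-layer net) (Simonyan and Zisserman, 2015), extract RGB and thermal image features, respectively. After each level of features is output, a feature enhance module (FEM) mutually enhances the corresponding RGB and thermal features to fuse more complementary information, finally yielding two feature representations that emphasize the RGB image and the thermal image, respectively. A suitably modified pyramid pooling module (PPM) (Wang et al., 2017) captures global semantic information from deep features to further locate salient objects. In the decoding stage, three branches predict saliency in a collaborative and complementary manner. Two branches take the RGB and thermal features from the backbones as input and use a channel attention module (CAM) (Zhang et al., 2018) to merge low-level backbone features and global semantic information into the features being decoded, making the detection results more complete and accurate. The third branch first fuses the two backbone features with the $l_1$-norm fusion strategy (Li and Wu, 2019) and then decodes the result into its own prediction. Finally, the three predictions are concatenated and convolved to adaptively output the final result.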

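1 Proposed method

As shown in Fig. 1, the overall framework comprises feature extraction and feature decoding. The feature extraction part extracts RGB and thermal image features with two structurally identical backbone networks built from the pretrained VGG19. To better suit salient object detection, only the five convolutional blocks are kept; the pooling layer of the fifth level and the final fully connected layers are discarded. The feature decoding part contains three branches, drawn in different colors in Fig. 1, whose features interact in a collaborative and complementary manner to drive the model toward the best detection result. The network also includes the FEM, PPM, CAM, and $l_1$-norm fusion modules and is trained end to end under ground truth (GT) supervision.

Fig. 1 Overall architecture of the proposed multi-path collaborative RGB-T salient object detection model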

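1.1 Feature enhance module (FEM)

To make the RGB and thermal images complement each other better, and inspired by Zhao et al. (2019), we place an FEM after every convolutional block of the two backbones. The FEM modulates each feature with the same-level feature of the other backbone, turning the original single-modal features into multi-modal ones and thereby enhancing them.

Fig. 2 takes the RGB backbone as an example. $\boldsymbol{F}_{l}^{\mathrm{RGB}}$ and $\boldsymbol{F}_{l}^{\mathrm{T}}$ denote the output features of the five convolutional blocks of the RGB and thermal backbones (the fifth level is not pooled), and the two features at the same level have the same resolution. In the RGB backbone, $\boldsymbol{F}_{l}^{\mathrm{T}}$ serves as the condition that modulates and enhances the same-level $\boldsymbol{F}_{l}^{\mathrm{RGB}}$. Because thermal images have low resolution and high noise, using the extracted thermal features directly to enhance the RGB features would harm the fusion. Li et al. (2020) addressed this with the gated feature-wise transform (GFT) module. Similarly, we first apply a sigmoid to $\boldsymbol{F}_{l}^{\mathrm{T}}$ and multiply the result by the original $\boldsymbol{F}_{l}^{\mathrm{T}}$ to obtain $\widetilde{\boldsymbol{F}}_{l}^{\mathrm{T}}$, suppressing the noise as far as possible. We then scale and shift $\boldsymbol{F}_{l}^{\mathrm{RGB}}$ with $\widetilde{\boldsymbol{F}}_{l}^{\mathrm{T}}$ and add the original $\boldsymbol{F}_{l}^{\mathrm{RGB}}$ to the modulated feature through a residual connection (Zhang et al., 2017b), obtaining $\widetilde{\boldsymbol{F}}_{l}^{\mathrm{RGB}}$. Finally, a conventional residual convolution unit (He et al., 2016) outputs the enhanced feature $\boldsymbol{F}_{e\_l}^{\mathrm{RGB}}$.

Fig. 2 Diagram of the feature enhance module

These operations can be summarized as

$ \widetilde{\boldsymbol{F}}_{l}^{\mathrm{T}}=\sigma\left(\boldsymbol{F}_{l}^{\mathrm{T}}\right) \times \boldsymbol{F}_{l}^{\mathrm{T}} $

(1)

$ \widetilde{\boldsymbol{F}}_{l}^{\mathrm{RGB}}=\boldsymbol{F}_{l}^{\mathrm{RGB}} \times \widetilde{\boldsymbol{F}}_{l}^{\mathrm{T}}+\widetilde{\boldsymbol{F}}_{l}^{\mathrm{T}}+\boldsymbol{F}_{l}^{\mathrm{RGB}} $

(2)

$ \boldsymbol{F}_{e\_l}^{\mathrm{RGB}}=\boldsymbol{C}\left(\widetilde{\boldsymbol{F}}_{l}^{\mathrm{RGB}}, W\right)+\widetilde{\boldsymbol{F}}_{l}^{\mathrm{RGB}} $

(3)

where $\sigma$ denotes the sigmoid operation, $\boldsymbol{C}$ denotes a 3×3 convolution, $W$ denotes the convolution parameters, and $l \in \{1, 2, 3, 4, 5\}$ indexes the levels of the backbone. Similarly, in the thermal backbone, the output feature $\boldsymbol{F}_{l}^{\mathrm{T}}$ of each convolutional block is the feature being modulated, and the FEM produces the corresponding enhanced feature $\boldsymbol{F}_{e\_l}^{\mathrm{T}}$.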
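As an illustration of Eqs. (1)-(3), the following is a minimal pure-Python sketch that treats each feature as a flat list of values. The function names, the toy inputs, and the `conv` argument are hypothetical stand-ins: in the model, `conv` would be a learned 3×3 convolution, which is not reproduced here.

```python
import math


def sigmoid(v):
    """Logistic function used as the gate in Eq. (1)."""
    return 1.0 / (1.0 + math.exp(-v))


def fem_modulate(f_rgb, f_t):
    """Eqs. (1)-(2): gate the thermal feature, then scale, shift and add the residual."""
    gated_t = [sigmoid(t) * t for t in f_t]                  # Eq. (1)
    return [r * g + g + r for r, g in zip(f_rgb, gated_t)]   # Eq. (2)


def fem(f_rgb, f_t, conv):
    """Eq. (3): residual unit around `conv` (stand-in for the learned 3x3 convolution)."""
    f_tilde = fem_modulate(f_rgb, f_t)
    return [c + x for c, x in zip(conv(f_tilde), f_tilde)]


def zero_conv(feats):
    """Assumed placeholder: a 'convolution' that outputs zeros."""
    return [0.0] * len(feats)


# Toy values (assumed): a zero thermal response leaves the RGB value unchanged.
enhanced = fem([1.0, 2.0], [0.0, 2.0], zero_conv)
```

With the zero placeholder, Eq. (3) returns the modulated feature itself, which makes the gating effect of Eqs. (1)-(2) easy to inspect in isolation.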

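1.2 $l_1$-norm fusion strategy

Most existing multi-modal feature fusion algorithms simply add or multiply features. This works to a degree but is a coarse way to select salient features. Li and Wu (2019) proposed an $l_1$-norm-based multi-modal image fusion method that better selects the more important features from the different modalities as output, as shown in Fig. 3. The block-based averaging step of that method, however, slows training considerably, so we omit it when adopting the $l_1$-norm fusion strategy.

Fig. 3 Diagram of the $l_1$-norm fusion strategy

As Fig. 1 shows, the fifth-level outputs of the two backbones pass through the feature enhance module to give $\boldsymbol{F}_{e\_5}^{\mathrm{RGB}}$ and $\boldsymbol{F}_{e\_5}^{\mathrm{T}}$, written uniformly as $\boldsymbol{\varphi}_{1:C}^{i}$ in Fig. 3, where $i = 1$ denotes RGB, $i = 2$ denotes thermal, $\boldsymbol{\varphi}_{m}^{i}$ is the feature of the $m$-th channel, and $C$ is the total number of channels. The module applies the $l_1$-norm and a weight calculation to obtain the final fused feature $\boldsymbol{F}^{\mathrm{fusion}}$:

$ C^{i}(x, y)=\left\|\boldsymbol{\varphi}_{1:C}^{i}(x, y)\right\|_{1} $

(4)

$ w^{i}(x, y)=C^{i}(x, y) / \sum\limits_{j=1}^{2} C^{j}(x, y) $

(5)

$ \boldsymbol{F}^{\mathrm{fusion}}(x, y)=\sum\limits_{i=1}^{2} w^{i}(x, y) \times \boldsymbol{\varphi}_{1:C}^{i}(x, y) $

(6)

where $(x, y)$ are pixel coordinates, $C^{i}(x, y)$ is the $l_1$-norm of modality $i$ across channels at each position, and $w^{i}(x, y)$ is the weight of the RGB or thermal channel information in the fused feature at each position.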
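A minimal pure-Python sketch of Eqs. (4)-(6) may make the per-pixel weighting concrete. It assumes features stored as nested lists shaped [C][H][W]; the function name and the small `eps` guard against an all-zero pixel are assumptions added for illustration and are not part of the paper.

```python
def l1_norm_fusion(phi_rgb, phi_t, eps=1e-12):
    """Fuse two [C][H][W] features with per-pixel l1-norm weights, Eqs. (4)-(6)."""
    channels, height, width = len(phi_rgb), len(phi_rgb[0]), len(phi_rgb[0][0])
    fused = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            # Eq. (4): l1-norm over channels at pixel (x, y) for each modality.
            c_rgb = sum(abs(phi_rgb[m][y][x]) for m in range(channels))
            c_t = sum(abs(phi_t[m][y][x]) for m in range(channels))
            # Eq. (5): normalized weights (eps is an assumed numerical guard).
            total = c_rgb + c_t + eps
            w_rgb, w_t = c_rgb / total, c_t / total
            # Eq. (6): weighted sum of the two modalities, channel by channel.
            for m in range(channels):
                fused[m][y][x] = w_rgb * phi_rgb[m][y][x] + w_t * phi_t[m][y][x]
    return fused


# Toy single-channel, single-pixel example (assumed values).
fused = l1_norm_fusion([[[1.0]]], [[[3.0]]])
```

In this toy case the thermal response is three times stronger, so it receives a weight of about 0.75 and the fused value is about 2.5, rather than the plain average of 2.0.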

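1.3 Channel attention module and pyramid pooling module

The features output at different levels of a deep network have different properties. Deeper features have a wider receptive field and more global semantic information and therefore locate salient objects better, but repeated pooling discards much detail. Shallower features carry more spatial information and better edge structure and can refine the final detection result. Accordingly, and inspired by Ha et al. (2017), we fuse the low-level backbone features into the decoding process to make the detection results more complete. Most existing methods fuse features by concatenation or addition and treat every channel alike. Zhang et al. (2018), however, pointed out that different channels of CNN features correspond to different semantic information, which is important for distinguishing salient objects. We therefore adopt the CAM to extract more discriminative semantic information from the different channels and improve detection performance. In addition, to locate salient objects more accurately and reduce the redundant background information introduced by low-level features, we use the PPM (Wang et al., 2017) to generate global semantic information that further separates salient objects from the background.

Fig. 4 illustrates the PPM. The output features of Conv5-4 in the two backbones (not pooled; np stands for no pooling) are denoted $\boldsymbol{F}_{\mathrm{np\_5}}^{\mathrm{RGB/T}}$ and serve as the PPM input. The module first applies a 1×1 convolution that halves the number of channels, giving $\widetilde{\boldsymbol{F}}_{\mathrm{np\_5}}^{\mathrm{RGB/T}}$. Tasks such as salient object detection must handle objects at multiple scales, and global average pooling alone would lose much spatial information. $\widetilde{\boldsymbol{F}}_{\mathrm{np\_5}}^{\mathrm{RGB/T}}$ is therefore average-pooled at several scales to produce four multi-scale semantic feature blocks of sizes 1×1, 2×2, 4×4, and 8×8. To keep the weights of the different-scale blocks balanced, a 1×1 convolution follows each pooling and reduces the channel count to 1/4 of that of $\widetilde{\boldsymbol{F}}_{\mathrm{np\_5}}^{\mathrm{RGB/T}}$. All blocks are then upsampled to the size of $\widetilde{\boldsymbol{F}}_{\mathrm{np\_5}}^{\mathrm{RGB/T}}$ and concatenated. Finally, a 3×3 convolution extracts further features to give the PPM output $\boldsymbol{F}_{\mathrm{out}}^{\mathrm{PPM}}$.

Fig. 4 Diagram of the pyramid pooling module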
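The multi-scale pooling step of the PPM can be sketched in pure Python on a single channel. This is only an illustration under stated assumptions: the learned 1×1 and 3×3 convolutions are omitted, the map is assumed square with a side divisible by every bin size, nearest-neighbor upsampling is assumed (the text does not name the interpolation), and all function names are hypothetical.

```python
def avg_pool_to_bins(channel, bins):
    """Average-pool a square H x W channel (nested lists) into a bins x bins grid."""
    step = len(channel) // bins  # assumes H == W and H divisible by bins
    return [
        [
            sum(
                channel[y][x]
                for y in range(by * step, (by + 1) * step)
                for x in range(bx * step, (bx + 1) * step)
            ) / (step * step)
            for bx in range(bins)
        ]
        for by in range(bins)
    ]


def upsample_nearest(pooled, size):
    """Upsample a bins x bins grid back to size x size by repeating cells."""
    scale = size // len(pooled)
    return [[pooled[y // scale][x // scale] for x in range(size)] for y in range(size)]


def pyramid_context(channel, bin_sizes=(1, 2, 4, 8)):
    """Return one upsampled context map per pyramid scale (1x1, 2x2, 4x4, 8x8)."""
    size = len(channel)
    return [upsample_nearest(avg_pool_to_bins(channel, b), size) for b in bin_sizes]


# Toy 8 x 8 channel (assumed values): entry (y, x) holds y * 8 + x.
toy = [[y * 8 + x for x in range(8)] for y in range(8)]
maps = pyramid_context(toy)
```

The 1×1 scale yields the global mean at every position, while the 8×8 scale reproduces the toy map exactly; the intermediate scales sit between these two extremes, which is what lets the module describe objects of different sizes.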

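In Fig. 1, the low-level backbone features fused with the high-level semantic features output by the PPM are denoted $\boldsymbol{F}_{f\_l}^{\mathrm{RGB/T}}$ ($l \in \{1, 2, 3\}$) and form one input of the CAM. Fig. 5 shows the CAM in detail. $\boldsymbol{F}_{f\_l}^{\mathrm{RGB/T}}$ first passes through an adaptive layer, made of two 3×3 convolutions, that enlarges its receptive field. It is then concatenated with the output feature $\boldsymbol{F}_{n}^{\mathrm{D}}$ of decoder $D_{n}$ to give the preliminarily fused feature $\boldsymbol{F}_{n}^{\mathrm{C}}$. Global average pooling turns $\boldsymbol{F}_{n}^{\mathrm{C}}$ into a channel feature vector, and a fully connected layer captures the interdependence between channels. A sigmoid function then yields the importance $W$ of each channel, which is multiplied with $\boldsymbol{F}_{n}^{\mathrm{C}}$ to give the reweighted feature $\widetilde{\boldsymbol{F}}_{n}^{\mathrm{C}}$. Finally, a 1×1 convolution restores the channel count of $\widetilde{\boldsymbol{F}}_{n}^{\mathrm{C}}$ to that of the input feature, giving the CAM output $\boldsymbol{F}_{\mathrm{out}}^{\mathrm{CAM}}$.

Fig. 5 Diagram of the channel attention module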
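The channel reweighting at the core of the CAM (global average pooling, fully connected layer, sigmoid, multiplication) can be sketched as follows. The sketch assumes a single fully connected layer with explicitly passed weights, since the number and size of the layers are not specified above; the adaptive layer, the concatenation, and the final 1×1 convolution are left out, and all names are hypothetical.

```python
import math


def channel_attention(features, fc_weights, fc_bias):
    """Reweight a [C][H][W] feature by per-channel importance.

    fc_weights ([C][C]) and fc_bias ([C]) stand in for the learned
    fully connected layer; they are illustrative assumptions.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Fully connected layer models dependencies between channels.
    logits = [
        sum(w * p for w, p in zip(fc_weights[c], pooled)) + fc_bias[c]
        for c in range(len(features))
    ]
    # Sigmoid maps each logit to an importance weight in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    reweighted = [
        [[weights[c] * v for v in row] for row in ch]
        for c, ch in enumerate(features)
    ]
    return weights, reweighted


# Toy input (assumed): two channels of size 1 x 2 and an all-zero layer.
toy = [[[1.0, 3.0]], [[2.0, 2.0]]]
weights, reweighted = channel_attention(toy, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

With an all-zero layer every channel gets weight 0.5; a trained layer would instead raise the weights of channels that respond to salient objects and lower those dominated by background.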

1.4 Loss function

As shown in Fig. 1, the decoder contains three decoding branches that predict saliency from different features in a collaborative and complementary manner. Each branch produces a saliency map: $\boldsymbol{S}_{1}$ (RGB feature branch), $\boldsymbol{S}_{2}$ (thermal feature branch), and $\boldsymbol{S}_{3}$ (fused feature branch). The three predictions are then concatenated and convolved to give the final map $\boldsymbol{S}_{0}$. Because the accuracy of each branch's map affects the final output $\boldsymbol{S}_{0}$, we compute the cross-entropy loss between every map $\boldsymbol{S}_{i}$ ($i=0, 1, 2, 3$) and the ground truth (GT), so that each map is supervised and pushed toward the ground truth. This supervision helps tune the parameters of each branch and improves the accuracy of the final output $\boldsymbol{S}_{0}$. Following the loss forms of Wang and Gong (2019) and Chen and Li (2019), the weight of each map's loss is set to 1. The complete loss function is

$ L=-\sum\limits_{i=0}^{3}\left(G \log \left(\boldsymbol{S}_{i}\right)+(1-G) \log \left(1-\boldsymbol{S}_{i}\right)\right) $

(7)

where $G$ is the ground truth of the saliency map.
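Eq. (7) can be sketched in pure Python as below. The sketch assumes that the loss of each map is summed over pixels (Eq. (7) leaves the pixel index implicit) and clamps predictions with a small `eps` to avoid log(0); both choices, and the function names, are assumptions for illustration.

```python
import math


def binary_cross_entropy(pred, gt, eps=1e-7):
    """Cross-entropy between a flat saliency map and its ground truth, summed over pixels."""
    total = 0.0
    for s, g in zip(pred, gt):
        s = min(max(s, eps), 1.0 - eps)  # assumed clamp to keep log() finite
        total -= g * math.log(s) + (1.0 - g) * math.log(1.0 - s)
    return total


def total_loss(saliency_maps, gt):
    """Eq. (7): equally weighted sum over the maps S_0 (final), S_1, S_2 and S_3."""
    return sum(binary_cross_entropy(s, gt) for s in saliency_maps)


# Toy example (assumed values): four maximally uncertain 2-pixel predictions.
gt = [1.0, 0.0]
loss = total_loss([[0.5, 0.5]] * 4, gt)
```

Each uncertain pixel contributes log 2, so the toy loss equals 8 log 2; the loss approaches zero as all four maps approach the ground truth.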

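2 Experimental results and analysis

2.1 Experimental setup

2.1.1 Datasets

Few deep learning methods currently target RGB-T images, mainly because RGB-T data for saliency detection are scarce: only two public RGB-T saliency detection datasets exist, VT821 and VT1000, provided by Tang et al. (2019) and Tu et al. (2019). VT821 contains 821 pairs of RGB and thermal images with corresponding GT, but many of its thermal images are not well aligned with the RGB images. VT1000 largely resolves the alignment problem and collects 1000 RGB-T pairs from diverse scenes, including low light, multiple salient objects, and small salient objects. We randomly select 900 pairs from the various scenes of VT1000 as the training set and use the remaining 100 pairs as the VT1000 test set. From VT821 we manually pick 223 reasonably well-aligned pairs as the VT821 test set. To mitigate the overfitting caused by the small amount of data, we augment the 900 training pairs by mirror flipping and 180° rotation, obtaining 2700 RGB-T pairs for training.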
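The augmentation described above can be sketched with images stored as nested lists. Horizontal mirroring is assumed because the flip axis is not stated, and the function names are hypothetical; the point is that the same transform must be applied to the RGB image, the thermal image, and the GT so that they stay aligned.

```python
def mirror(img):
    """Horizontal mirror flip (assumed axis): reverse every row."""
    return [row[::-1] for row in img]


def rotate_180(img):
    """Rotate by 180 degrees: reverse the row order and every row."""
    return [row[::-1] for row in img[::-1]]


def augment_pair(rgb, thermal, gt):
    """Return the original sample plus its mirrored and rotated versions."""
    samples = [(rgb, thermal, gt)]
    for transform in (mirror, rotate_180):
        samples.append((transform(rgb), transform(thermal), transform(gt)))
    return samples


# Toy 2 x 2 "image" (assumed values) used for all three inputs.
toy = [[1, 2], [3, 4]]
samples = augment_pair(toy, toy, toy)
```

Each training pair yields three samples, which matches the 900 × 3 = 2700 pairs reported above.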

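2.1.2 Parameter settings

The model is implemented in PyTorch and trained on an NVIDIA 1080Ti GPU (11 GB of memory) using stochastic gradient descent (SGD) with a momentum of 0.9, a weight decay of 0.0005, and a learning rate of 1E-9. Training and test images are uniformly resized to 256×256 pixels.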

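2.2 Evaluation metrics

We use three common metrics, the precision-recall (PR) curve, F-measure, and mean absolute error (MAE), to compare salient object detection algorithms.

In the PR curve, $P$ denotes precision and $R$ denotes recall. Every prediction is binarized at a given threshold and compared with the binarized GT to compute the PR values for that threshold. The threshold sweeps [0, 255] in steps of 1, and the resulting 256 PR pairs are plotted as the PR curve.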
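The threshold sweep can be sketched for a single image as follows. The `>=` binarization rule and the convention that an empty prediction has precision 1 are assumptions made for this sketch, and averaging over a dataset is omitted; function names are hypothetical.

```python
def precision_recall(pred, gt, threshold):
    """Binarize a flat prediction (values 0-255) and compare with a 0/1 ground truth."""
    tp = fp = fn = 0
    for p, g in zip(pred, gt):
        predicted_fg = p >= threshold  # assumed binarization rule
        if predicted_fg and g:
            tp += 1
        elif predicted_fg:
            fp += 1
        elif g:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention for empty predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(pred, gt):
    """One (precision, recall) pair for each of the 256 thresholds in [0, 255]."""
    return [precision_recall(pred, gt, t) for t in range(256)]


# Toy 4-pixel prediction and ground truth (assumed values).
curve = pr_curve([255, 200, 50, 0], [1, 1, 0, 0])
```

Low thresholds mark everything as salient (high recall, low precision), while high thresholds keep only the most confident pixels (high precision, lower recall).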

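The F-measure is computed as

$ F_{\beta}=\frac{\left(1+\beta^{2}\right) \times P \times R}{\beta^{2} \times P+R} $

(8)

where $\beta^{2}$ is set to 0.3 (Zhang et al., 2017a) to give more weight to $P$. Sweeping the threshold over [0, 255] gives the maximum value maxF and the mean value aveF of this metric.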
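Eq. (8) and the derived maxF and aveF values can be sketched as below, with $\beta^{2} = 0.3$ as in the F-measure formula. Returning 0 when both precision and recall are 0 is an assumed convention, and the function names are hypothetical.

```python
def f_measure(precision, recall, beta_sq=0.3):
    """Eq. (8) with beta^2 = 0.3; returns 0.0 when the denominator vanishes (assumed)."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom > 0 else 0.0


def max_and_mean_f(pr_pairs):
    """maxF and aveF over the (precision, recall) pairs of a threshold sweep."""
    scores = [f_measure(p, r) for p, r in pr_pairs]
    return max(scores), sum(scores) / len(scores)


# Toy sweep with two thresholds (assumed values).
max_f, ave_f = max_and_mean_f([(1.0, 1.0), (0.0, 0.0)])
```

Because $\beta^{2} < 1$, a drop in precision lowers the score more than the same drop in recall: `f_measure(0.5, 1.0)` is about 0.565, whereas `f_measure(1.0, 0.5)` is about 0.813.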

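MAE is the average per-pixel absolute error between the predicted map and the GT (Perazzi et al., 2012):

$ \mathrm{MAE}=\frac{1}{W \times H} \sum\limits_{x=1}^{W} \sum\limits_{y=1}^{H}|\boldsymbol{S}(x, y)-\boldsymbol{G}(x, y)| $

(9)

where $W$ and $H$ are the image width and height, $(x, y)$ are pixel coordinates, $\boldsymbol{S}$ is the saliency map produced by the model, and $\boldsymbol{G}$ is the manually annotated ground-truth map.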
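Eq. (9) translates directly into a few lines of Python. The sketch assumes both maps are nested lists of equal size with values scaled to [0, 1]; the function name is hypothetical.

```python
def mean_absolute_error(saliency, gt):
    """Eq. (9): mean per-pixel absolute difference between two H x W maps."""
    height, width = len(saliency), len(saliency[0])
    total = sum(
        abs(s - g)
        for row_s, row_g in zip(saliency, gt)
        for s, g in zip(row_s, row_g)
    )
    return total / (width * height)


# Toy 2 x 2 maps (assumed values): two exact pixels and two that are off by 0.5.
mae = mean_absolute_error([[1.0, 0.0], [0.5, 0.5]], [[1, 0], [1, 0]])
```

Unlike the F-measure, MAE needs no threshold, so it also penalizes uncertain (gray) predictions such as the two 0.5 pixels in the toy example.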

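2.3 Comparison with state-of-the-art methods

To verify the effectiveness of the proposed method, we compare it with four existing saliency detection methods: the RGB-based methods PoolNet (Liu et al., 2019) and CPDNet (Wu et al., 2019), and the RGB-D-based methods DMRA (Piao et al., 2019) and A2dele (Piao et al., 2020). Because the existing RGB-T methods have not released public code, they are not included in the comparison. For a fairer comparison, we suitably modified and retrained the RGB and RGB-D methods. The RGB-based methods take only RGB images as training and test input, so we prepend a fusion layer to the original network: the raw RGB and thermal images are concatenated, and a 3×3 convolution converts the concatenated input into a 3-channel feature that is fed to the original network for training. The retrained models are denoted PoolNet+ and CPDNet+. The RGB-D-based methods are trained on RGB-D image pairs, so we replace the D channel with the T channel and retrain them before testing.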

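2.3.1 Qualitative analysis

Fig. 6 shows some results of the RGB-based methods and of their counterparts retrained with the T channel. The models retrained on RGB-T input perform better under insufficient light and when the salient object resembles the background, and their detections improve visibly. This indicates that the T channel can, to some extent, complement the RGB channels and thereby improve saliency detection.

Fig. 6 Visual comparison of the saliency detection results of RGB methods and their modified multi-modal versions

((a) RGB images; (b) thermal infrared images; (c) PoolNet; (d) PoolNet+; (e) CPDNet; (f) CPDNet+; (g) GT)

Fig. 7 shows the predictions of the different methods in a variety of scenes. Rows 1-3 are ordinary scenes with ample light and clear contrast, where all methods predict the salient objects fairly accurately. Rows 4 and 5 contain multiple salient objects, and in rows 6 and 7 the salient objects are transparent. In these two kinds of scenes, most other methods recover only part of the salient objects, whereas our method locates them fairly accurately and produces better edge structure. Rows 8-10 are poorly lit. The RGB-based methods PoolNet and CPDNet struggle to detect salient objects in low light, and the darker the scene, the worse the result. PoolNet+ and CPDNet+ fuse the thermal image only coarsely but still improve detection to some extent. The predictions of the two RGB-D-based methods are also incomplete. By comparison, our results are closest to the true objects.

Fig. 7 Visual comparison of saliency detection results among different methods

((a) RGB images; (b) thermal infrared images; (c) PoolNet; (d) PoolNet+; (e) CPDNet; (f) CPDNet+; (g) DMRA; (h) A2dele; (i) ours; (j) GT)

This analysis shows that the proposed multi-path collaborative RGB-T salient object detection method remains feasible and stable in challenging environments.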

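2.3.2 Quantitative analysis

Table 1 lists the F-measure and MAE of the different methods on VT821 and VT1000. Overall, our method performs better than the compared methods on both datasets: it obtains the best maxF on both datasets and the best MAE on VT821, and it ranks second on the remaining metrics. Its advantage is more evident on VT821, which was not used for training, indicating good generalization. PoolNet+ and CPDNet+ improve substantially over the original PoolNet and CPDNet on VT1000 but gain little on VT821, where some metrics even get slightly worse. Possible reasons are that VT821 itself contains some alignment errors and that the retraining simply concatenates the RGB and thermal channels without further optimization.

Table 1 Comparison of F-measure and MAE on the VT821 and VT1000 datasets among different methods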


Method	VT821			VT1000		
	maxF	aveF	MAE	maxF	aveF	MAE
PoolNet	0.8191	0.8047	0.0485	0.8315	0.8164	0.0579
PoolNet+	0.8177	0.8086	0.0411†	0.8759†	0.8656*	0.0313*
CPDNet	0.7971	0.7882	0.0425	0.8396	0.8315	0.0361
CPDNet+	0.7915	0.7796	0.0471	0.8738	0.8570	0.0333
DMRA	0.8411†	0.8270*	0.0417	0.8613	0.8444	0.0381
A2dele	0.7513	0.7482	0.0621	0.8609	0.8575	0.0401
Ours	0.8437*	0.8232†	0.0394*	0.8805*	0.8626†	0.0322†
Note: * and † mark the best and second-best results in each column, respectively.

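Fig. 8 shows the PR curves of the methods on VT821 and VT1000. The PR curve of our method lies above those of the other methods; across the thresholds used to trace the curve, its PR values stay in a favorable region and span a narrow range. This shows that fusing RGB and thermal images is effective for improving saliency detection.

Fig. 8 PR curves of the proposed method and other state-of-the-art methods on two public datasets

((a) VT821; (b) VT1000)

Table 2 lists the average per-image runtime of each method on VT821 and VT1000. All tests run on an NVIDIA 1080Ti GPU (11 GB of memory) with test images resized to 256×256 pixels. Our model takes about 0.11 s per image, which is not fast compared with the other methods (only PoolNet, at 0.13 s, is slower) but is acceptable overall. The main reasons are that the input is an RGB-T image pair, so the two backbones must each run VGG19 to extract multi-modal features and enhance each other, and that the branches are tightly coupled and must cooperate and exchange information, which slows detection to some extent. In future work we will explore more lightweight model structures to speed up saliency detection.

Table 2 Average runtime of the compared methods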


Method	Average runtime/s
PoolNet	0.13
PoolNet+	0.10
CPDNet	0.05
CPDNet+	0.06
DMRA	0.09
A2dele	0.03
Ours	0.11

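2.4 Ablation study

To further analyze how the CAM, PPM, FEM, and $l_1$-norm fusion strategy interact, we run ablation experiments on the contribution of each module. Fig. 9 and Table 3 give the visual comparison and the quantitative results. The baseline keeps only the basic framework, that is, two feature extraction backbones plus three basic decoding branches; it retains the $l_1$-norm fusion module to provide the input of the fused-feature decoding branch. +CAM, +PPM, +FEM, and -L1 denote adding or removing the named module relative to the preceding model, which isolates each module's effect on the final test results.

Fig. 9 Visual results of the ablation analysis

((a) RGB images; (b) thermal infrared images; (c) baseline; (d) +CAM; (e) +PPM; (f) +FEM/ours; (g) -L1; (h) GT)

Table 3 Quantitative comparison of the ablation analysis on two public datasets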


Method	VT821			VT1000		
	maxF	aveF	MAE	maxF	aveF	MAE
baseline	0.8323	0.8004	0.0448	0.8625	0.8398	0.0355
+CAM	0.8488	0.8176	0.0429	0.8746	0.8575	0.0354
+PPM	0.8506*	0.8193	0.0411	0.8763	0.8560	0.0337
+FEM/ours	0.8437	0.8232	0.0394*	0.8805*	0.8626*	0.0322*
-L1	0.8489	0.8236*	0.0417	0.8660	0.8486	0.0369
Note: * marks the best value in each column.

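1) Channel attention module (CAM). Merging the spatial information of low-level backbone features into the decoded features inevitably introduces some redundant noise. We therefore add the CAM to the baseline and use the channel attention mechanism to exploit the semantic differences that salient objects and background exhibit across feature channels. Fig. 9 and Table 3 show that with the CAM the detected shapes are more complete, and the F-measure and MAE improve over the baseline on both datasets.

2) Pyramid pooling module (PPM). The PPM is added on top of the CAM. Its main role is to extract high-level semantic information from high-level features so as to locate salient objects more accurately and reject background noise. Our model first fuses the low-level backbone features with the semantic features generated by the PPM and then feeds them to the CAM to be combined with the decoded features. The results in Fig. 9 and Table 3 show that the PPM further improves the overall performance of the model.

3) Feature enhance module (FEM) and $l_1$-norm fusion strategy. Fig. 9 and Table 3 show that adding the FEM while keeping the $l_1$-norm fusion module makes the final detections more accurate and complete. All metrics reach their best values on VT1000; on VT821, maxF drops but aveF and MAE improve. When the $l_1$-norm fusion module is replaced by simple addition, all metrics clearly worsen on VT1000, and on VT821 the MAE also rises by 0.23 percentage points (higher is worse). The FEM and the $l_1$-norm fusion strategy thus reinforce each other. While the backbones extract features, the FEM injects information from one modality into the other, so that the single-modal features gradually become information-rich dual-modal features, one leaning toward the RGB modality and the other toward the thermal modality. The $l_1$-norm fusion strategy then selects the most important information from the two enhanced features and combines it to form the input of the fused-feature decoding branch. Both modules therefore play an irreplaceable role in fully fusing the RGB and thermal features, and our model keeps both.

The ablation results show that the CAM, PPM, FEM, and $l_1$-norm fusion modules designed in this paper are all feasible and effective for improving saliency detection performance.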

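3 Conclusion

To address the poor saliency maps that existing methods produce in special environments such as low light and low contrast, this paper proposes a multi-path collaborative RGB-T salient object detection method that is trained end to end. The model contains several sub-modules. In the feature extraction backbones, the feature enhance module fuses the complementary information of RGB and thermal images, and the pyramid pooling module generates global semantic information for locating salient objects. During decoding, the channel attention mechanism extracts more discriminative feature information from the different channels to improve saliency detection. Through complementary, collaborative processing across its prediction branches, the model adaptively outputs the final detection result.

We test and compare the method with existing methods on VT821 and VT1000. Qualitatively, the detections produced by our model have better edge structure, are closer to the true objects, and are relatively better in low-light and low-contrast scenes. Quantitatively, our model achieves the best maximum F-measure on both datasets and the best MAE on VT821, and its remaining average F-measure and MAE values are very close to the best. The model therefore improves the overall performance of saliency detection to some extent.

Because the model contains multiple prediction branches that exchange information during prediction, it has a relatively large number of parameters and no clear advantage in speed. In addition, the scarcity of RGB-T saliency detection datasets limits the final performance of the trained model to some extent. In future work, we will explore more lightweight network models and collect RGB-T salient object detection datasets covering more scenes. We also plan to record the illumination intensity of each scene in real time as an additional dimension of the image data, which we hope will help tune the model parameters.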

References

Chen H and Li Y F. 2019. Three-stream attention-aware network for RGB-D salient object detection. IEEE Transactions on Image Processing, 28(6): 2825-2835 [DOI: 10.1109/TIP.2019.2891104]

Fan D P, Wang J and Liang X M. 2015. Improving image retrieval using the context-aware saliency areas. Applied Mechanics and Materials, 734: 596-599 [DOI: 10.4028/www.scientific.net/AMM.734.596]

Guo C L and Zhang L M. 2010. A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Transactions on Image Processing, 19(1): 185-198 [DOI: 10.1109/TIP.2009.2030969]

Ha Q S, Watanabe K, Karasawa T, Ushiku Y and Harada T. 2017. MFNet: towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes//Proceedings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vancouver, Canada: IEEE: 5108-5115 [DOI: 10.1109/IROS.2017.8206396]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]

Li C L, Xia W, Yan Y, Luo B and Tang J. 2020. Segmenting objects in day and night: edge-conditioned CNN for thermal image semantic segmentation. IEEE Transactions on Neural Networks and Learning Systems, 32(7): 3069-3082 [DOI: 10.1109/TNNLS.2020.3009373]

Li H and Wu X J. 2019. DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing, 28(5): 2614-2623 [DOI: 10.1109/TIP.2018.2887342]

Liu J J, Hou Q B, Cheng M M, Feng J S and Jiang J M. 2019. A simple pooling-based design for real-time salient object detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3912-3921 [DOI: 10.1109/CVPR.2019.00404]

Perazzi F, Krähenbühl P, Pritch Y and Hornung A. 2012. Saliency filters: contrast based filtering for salient region detection//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 733-740 [DOI: 10.1109/CVPR.2012.6247743]

Piao Y R, Ji W, Li J J, Zhang M and Lu H C. 2019. Depth-induced multi-scale recurrent attention network for saliency detection//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7253-7262 [DOI: 10.1109/ICCV.2019.00735]

Piao Y R, Rong Z K, Zhang M, Ren W S and Lu H C. 2020. A2dele: adaptive and attentive depth distiller for efficient RGB-D salient object detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE [DOI: 10.1109/CVPR42600.2020.00908]

Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition[EB/OL]. https://arxiv.org/pdf/1409.1556.pdf

Tang J, Fan D Z, Wang X X, Tu Z Z and Li C L. 2019. RGBT salient object detection: benchmark and a novel cooperative ranking approach. IEEE Transactions on Circuits and Systems for Video Technology, 30(12): 4421-4433 [DOI: 10.1109/TCSVT.2019.2951621]

Tu Z Z, Xia T, Li C L, Wang X X, Ma Y and Tang J. 2019. RGB-T image saliency detection via collaborative graph learning. IEEE Transactions on Multimedia, 22(1): 160-173 [DOI: 10.1109/TMM.2019.2924578]

Wang F L, Zhen Y, Zhong B N and Ji R R. 2015. Robust infrared target tracking based on particle filter with embedded saliency detection. Information Sciences, 301: 215-226 [DOI: 10.1016/j.ins.2014.12.022]

Wang N N and Gong X J. 2019. Adaptive fusion for RGB-D salient object detection. IEEE Access, 7: 55277-55284 [DOI: 10.1109/ACCESS.2019.2913107]

Wang T T, Borji A, Zhang L H, Zhang P P and Lu H C. 2017. A stagewise refinement model for detecting salient objects in images//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4039-4048 [DOI: 10.1109/ICCV.2017.433]

Wei Y C, Liang X D, Chen Y P, Shen X H, Cheng M M, Feng J S, Zhao Y and Yan S C. 2017. STC: a simple to complex framework for weakly-supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11): 2314-2320 [DOI: 10.1109/TPAMI.2016.2636150]

Wu Z, Su L and Huang Q M. 2019. Cascaded partial decoder for fast and accurate salient object detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3902-3911 [DOI: 10.1109/CVPR.2019.00403]

Zhang L H, Yang C, Lu H C, Ruan X and Yang M H. 2017a. Ranking saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9): 1892-1904 [DOI: 10.1109/TPAMI.2016.2609426]

Zhang P P, Wang D, Lu H C, Wang H Y and Ruan X. 2017b. Amulet: aggregating multi-level convolutional features for salient object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 202-211 [DOI: 10.1109/ICCV.2017.31]

Zhang Q, Huang N C, Yao L, Zhang D W, Shan C F and Han J G. 2019. RGB-T salient object detection via fusing multi-level CNN features. IEEE Transactions on Image Processing, 29: 3321-3335 [DOI: 10.1109/TIP.2019.2959253]

Zhang X N, Wang T T, Qi J Q, Lu H C and Wang G. 2018. Progressive attention guided recurrent network for salient object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 714-722 [DOI: 10.1109/CVPR.2018.00081]

Zhao J X, Cao Y, Fan D P, Cheng M M, Li X Y and Zhang L. 2019. Contrast prior and fluid pyramid integration for RGBD salient object detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3922-3931 [DOI: 10.1109/CVPR.2019.00405]