RGB-D Salient Object Detection under Few-Shot Conditions

He Jing, Fu Keren (National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China)

Abstract
Objective Existing RGB-D (RGB-depth) salient object detection methods are typically trained in a fully supervised manner on a relatively small RGB-D training set, which greatly limits their generalization ability. Inspired by few-shot learning, this paper treats RGB-D salient object detection as a few-shot problem and applies two families of few-shot learning techniques, model hypothesis-space optimization and training sample augmentation, to explore and solve RGB-D salient object detection under few-shot conditions. Method Hypothesis-space optimization performs multi-task learning over the RGB and RGB-D salient object detection tasks and constrains the model's hypothesis space through parameter sharing, thereby transferring knowledge learned from the auxiliary RGB salient object detection task to the RGB-D task. Training sample augmentation applies a depth estimation algorithm to extra RGB data to synthesize the corresponding depth maps, and uses the RGB images together with the synthesized depth maps to train the RGB-D salient object detection task. Result Comparative experiments on nine datasets show that introducing few-shot learning methods effectively improves RGB-D salient object detection performance. We further compare different few-shot learning methods under different RGB-D salient object detection models (including typical middle-fusion and late-fusion models) and provide related analysis and discussion. Conclusion This paper attempts to apply few-shot learning to RGB-D salient object detection, exploring and exploiting two different few-shot learning methods to transfer knowledge from extra RGB images. Extensive experiments verify the feasibility and effectiveness of introducing few-shot learning to improve RGB-D salient object detection, and offer insights for the subsequent introduction of few-shot learning into other multi-modal detection tasks.
Keywords
RGB-D salient object detection using few-shot learning

He Jing, Fu Keren (National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China)

Abstract
Objective Salient object detection (SOD) is widely used in computer vision pre-processing tasks such as video/image segmentation, visual tracking, and video/image compression. Current RGB-depth (RGB-D) SOD methods can be categorized into fully supervised and self-supervised approaches. Fully supervised RGB-D SOD effectively fuses the complementary information of the two modalities, i.e., the input RGB images and their corresponding depth maps, through one of three fusion schemes (early/middle/late). Self-supervised SOD, in contrast, pre-trains on a small number of unlabeled samples to capture contextual information. However, existing RGB-D SOD methods are mostly trained on a small RGB-D training set in a fully supervised manner, so their generalization ability is greatly restricted. Inspired by emerging few-shot learning methods, we treat RGB-D SOD as a few-shot problem and adopt model hypothesis-space optimization and training sample augmentation to explore and solve RGB-D SOD under few-shot conditions.
Method For model hypothesis-space optimization, knowledge learned from an extra RGB SOD task is transferred to the RGB-D SOD task via multi-task learning over the two tasks, and the hypothesis space of the model is constrained by sharing model parameters. On the model side, since both middle and late fusion allow additional supervision to be added to the network, the JL-DCF model is selected for middle fusion and the DANet model for late fusion. To improve the effectiveness and generalization of the RGB-D SOD task, in JL-DCF, RGB-D and RGB data are simultaneously fed into the network for joint online training and optimization, with the coarse RGB prediction supervised to further optimize the network.
In view of the commonality between semantic segmentation and saliency detection, the dual attention network for scene segmentation (DANet) is transferred to an RGB-D salient object detection network. Similar to the joint training of JL-DCF, additional RGB supervision is added to the RGB branch of the two-stream DANet. For training sample augmentation, a depth estimation algorithm generates the corresponding depth map for each additional RGB image, and the RGB images together with the synthesized depth maps are used to train the RGB-D SOD task. We adopt ResNet-101 as the network backbone. The input image size is 320×320×3 for the JL-DCF network and is fixed to 480×480×3 for the DANet network. Each depth map is converted into a three-channel map by gray-scale mapping. The training set is composed of data from NJU2K, NLPR, and DUTS, and the test sets are NJU2K, NLPR, STERE, RGBD135, LFSD, SIP, DUT-RGBD, ReDWeb-S, and DUTS (note that DUT-RGBD and ReDWeb-S are tested on their complete datasets of 1 200 and 3 179 samples, respectively). The evaluation metrics are as follows: S-measure (Sα), maximum F-measure (Fβmax), maximum E-measure (Eφmax), and mean absolute error (M). The experiments are based on the PyTorch framework. The momentum is 0.99, the learning rate is 0.000 05, and the weight decay is set to 0.000 5. Stochastic gradient descent is used for training, accelerated on an NVIDIA RTX 2080S GPU. 1) Hypothesis-space optimization: training 50 epochs takes about 20 hours. 2) Sample augmentation: training 50 epochs takes about 100 hours, and a weighting coefficient α = 2 200/10 553 ≈ 0.21 is applied so that learning from the two data sources remains roughly balanced.
Result Comparative experiments on nine datasets show that introducing few-shot learning methods can effectively improve the performance of RGB-D salient object detection.
In addition, we compare different few-shot learning methods under different RGB-D SOD models (including a typical middle-fusion model and a typical late-fusion model) and provide relevant analysis and discussion. Visualized saliency maps further demonstrate the potential of our few-shot RGB-D SOD method.
Conclusion We introduce few-shot learning into RGB-D salient object detection and develop two different few-shot learning strategies for transferring knowledge from additional RGB data. Our research may also benefit the subsequent introduction of few-shot learning into more multi-modal detection tasks.
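As a minimal, hypothetical illustration of the sample-balancing step described in the Method section, the sketch below computes the weighting coefficient α = 2 200/10 553 ≈ 0.21 and combines the supervised RGB-D loss with the down-weighted auxiliary RGB loss. The function names and the exact way the two loss terms are combined are assumptions for illustration, not the paper's actual implementation.

```python
# Sample counts stated in the abstract: 2 200 labeled RGB-D pairs
# (NJU2K + NLPR) versus 10 553 extra RGB images (DUTS).
RGBD_SAMPLES = 2200
RGB_SAMPLES = 10553


def balance_weight(n_rgbd: int, n_rgb: int) -> float:
    """Down-weight the auxiliary RGB loss by the ratio of dataset sizes."""
    return n_rgbd / n_rgb


def joint_loss(loss_rgbd: float, loss_rgb: float, alpha: float) -> float:
    """Illustrative combination: supervised RGB-D loss plus the
    alpha-weighted auxiliary RGB loss (hypothetical formulation)."""
    return loss_rgbd + alpha * loss_rgb


alpha = balance_weight(RGBD_SAMPLES, RGB_SAMPLES)
print(round(alpha, 2))  # 0.21, matching the abstract's 2 200/10 553 ≈ 0.21
```

With this weight, one epoch over the large RGB pool contributes roughly the same total gradient signal as one epoch over the small RGB-D pool.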
Keywords
