Single-modality self-supervised information mining for cross-modality person re-identification

Wu Ancong; Lin Chengzhi; Zheng Weishi

Published: 2022-10-20
DOI: 10.11834/jig.211050
2022 | Volume 27 | Number 10

Wu Ancong, Lin Chengzhi, Zheng Weishi (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China)

Abstract

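Objective In intelligent surveillance video analysis, person re-identification is the fundamental problem of matching pedestrians across cameras with non-overlapping fields of view. For single-modality matching of visible images, existing methods already achieve strong performance on public benchmark datasets. However, when person re-identification must span normal-light and low-light scenes, cross-modality matching between visible images and infrared images remains unsatisfactory. The difficulty is twofold: 1) the marked visual difference between visible and infrared images, which are captured in different spectral ranges, makes the modality gap hard to eliminate; 2) human annotators can hardly recognize pedestrian identities across modalities, so labeled data are scarce. To address these two problems, this paper studies how to mine single-modality self-supervised information from easily obtained labeled visible-image auxiliary data, providing prior knowledge to guide the learning of the cross-modality matching model. Method A data augmentation method called random single-channel mask is proposed: masks are applied to the three channels of an input visible image so that the information of only one randomly chosen channel is retained, which drives the model to extract features that are insensitive to the spectral range. A pre-training and fine-tuning method based on mutual learning between a three-channel model and a single-channel model is also proposed. It uses the relationship between three-channel data and single-channel data to mine and transfer robust cross-spectrum self-supervised information, improving the matching ability of the cross-modality matching model. Result Cross-modality person re-identification experiments were conducted on the visible-infrared multi-modality pedestrian datasets SYSU-MM01 (Sun Yat-Sen University Multiple Modality 01), RGBNT201 (RGB, near infrared, thermal infrared, 201) and RegDB. The results show that the proposed method reaches state-of-the-art performance on all three datasets. Compared with the best result among the compared methods, the mean average precision (mAP) on RGBNT201 improves by up to nearly 5%. Conclusion The proposed single-modality cross-spectrum self-supervised information mining method uses single-modality visible-image auxiliary data to mine self-supervised information that is insensitive to changes in spectral range. This information guides single-modality pre-training and multi-modality supervised fine-tuning, and improves the performance of cross-modality person re-identification.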
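The random single-channel mask described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two discarded channels are filled, so they are set to zero here, and the function name `random_single_channel_mask` and the application probability `p` are hypothetical.

```python
import numpy as np


def random_single_channel_mask(image, rng, p=0.5):
    """Keep one randomly chosen RGB channel and zero the other two.

    image: array of shape (H, W, 3).
    rng:   a numpy.random.Generator.
    p:     probability of applying the mask (hypothetical parameter,
           not taken from the paper).
    Returns the (possibly masked) image and the index of the kept channel,
    or None as the index when the image is left unchanged.
    """
    if image.ndim != 3 or image.shape[-1] != 3:
        raise ValueError("expected an (H, W, 3) image")
    if rng.random() >= p:
        return image.copy(), None
    keep = int(rng.integers(3))          # 0 = R, 1 = G, 2 = B
    mask = np.zeros(3, dtype=image.dtype)
    mask[keep] = 1                       # assumption: masked channels become zero
    return image * mask, keep            # mask broadcasts over H and W


rng = np.random.default_rng(0)
image = rng.integers(1, 256, size=(4, 2, 3)).astype(np.float32)
masked, keep = random_single_channel_mask(image, rng, p=1.0)
```

With `p=1.0` the mask is always applied, so `masked` retains exactly one channel of `image`; in training, a smaller `p` would mix three-channel and single-channel inputs.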

Keywords

person re-identification; cross-modality retrieval; infrared image; self-supervised learning; mutual learning

Single-modality self-supervised information mining for cross-modality person re-identification

Wu Ancong, Lin Chengzhi, Zheng Weishi (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China)

Abstract

Objective Urban video surveillance systems have developed rapidly. Analyzing surveillance video is essential for security, but processing such large volumes of data manually is time-consuming and costly, and intelligent video analysis is an effective way to reduce this burden. To analyze events involving specific pedestrians, person re-identification is the basic problem of matching pedestrians across cameras with non-overlapping views, so that the trajectories of persons in a camera network can be obtained. The key challenges for person re-identification are cross-camera scene variations, such as illumination, resolution, occlusion and background clutter. Thanks to the development of deep learning, single-modality visible image matching has achieved remarkable performance on benchmark datasets. However, visible image matching is not applicable in low-light scenarios such as night-time outdoor scenes or dark indoor scenes. To cope with low light, most surveillance cameras can automatically switch to acquiring near infrared images, which are visually different from visible images. When person re-identification must span normal-light and low-light scenes, the performance of current cross-modality matching between visible images and infrared images is unsatisfactory. Visible-infrared cross-modality person re-identification therefore needs further study.

For visible-infrared cross-modality person re-identification, there are two key challenges. First, the spectra and visual appearance of visible images and infrared images differ significantly. Visible images contain three channels of red (R), green (G) and blue (B) responses, while infrared images contain only one channel of near infrared responses, which leads to a large modality gap. Second, labeled data are scarce, because it is difficult for human annotators to identify the same pedestrian across visible and infrared images. The current multi-modality benchmark dataset contains only 500 identities, which is not sufficient for training deep models. Existing visible-infrared cross-modality person re-identification methods mainly focus on bridging the modality gap, and the small labeled data problem is still largely ignored.

Method To provide prior knowledge for learning the cross-modality matching model, we study self-supervised information mining on single-modality data, using auxiliary labeled visible images. First, we propose a data augmentation method called random single-channel mask. For three-channel visible images as input, random masks are applied to preserve the information of only one channel, so that the learned features are robust to spectrum change. The random single-channel mask forces the first layer of the convolutional neural network to learn kernels specific to the R, G or B channel that extract shared appearance and shape features. Furthermore, for pre-training and fine-tuning, we propose mutual learning between a single-channel model and a three-channel model. To mine and transfer cross-spectrum robust self-supervision information, mutual learning leverages the interrelations between single-channel data and three-channel data. We observe that the three-channel model focuses on extracting color-sensitive features, while the single-channel model focuses on extracting color-invariant features. Transferring this complementary knowledge by mutual learning improves the matching performance of the cross-modality matching model.

Result Extensive comparative experiments were conducted on the SYSU-MM01, RGBNT201 and RegDB datasets. Compared with the state-of-the-art methods, our method improves mean average precision (mAP) on RGBNT201 by up to 5%.

Conclusion We propose a single-modality cross-spectrum self-supervised information mining method, which uses auxiliary single-modality visible images to mine cross-spectrum robust self-supervision information. The prior knowledge in this self-supervision information guides single-modality pre-training and multi-modality fine-tuning, giving the cross-modality person re-identification model better matching ability.
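The abstract does not give the loss used for mutual learning between the three-channel and single-channel models. As a rough illustration only, the sketch below uses the generic deep-mutual-learning form, in which each model minimizes its own cross-entropy plus a KL term that pulls its predicted identity distribution toward its peer's. This is a swapped-in, commonly used formulation and may differ from the paper's; the function names, the `weight` parameter and the toy logits are all hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards the logarithms against exact zeros


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true identity labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + EPS).mean())


def kl_divergence(p, q):
    """Mean KL(p || q) over the batch."""
    return float((p * (np.log(p + EPS) - np.log(q + EPS))).sum(axis=1).mean())


def mutual_learning_losses(logits_rgb, logits_single, labels, weight=1.0):
    """Return (loss for the three-channel model, loss for the single-channel model).

    Each loss is the model's own cross-entropy plus `weight` times the KL
    divergence from its peer's prediction to its own (assumed formulation).
    """
    p_rgb = softmax(logits_rgb)
    p_single = softmax(logits_single)
    loss_rgb = cross_entropy(p_rgb, labels) + weight * kl_divergence(p_single, p_rgb)
    loss_single = cross_entropy(p_single, labels) + weight * kl_divergence(p_rgb, p_single)
    return loss_rgb, loss_single


# Toy logits for a batch of 2 images and 3 identities (made-up numbers).
logits_rgb = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
logits_single = np.array([[1.0, 1.0, -0.5], [0.0, 2.0, 0.0]])
labels = np.array([0, 1])
loss_rgb, loss_single = mutual_learning_losses(logits_rgb, logits_single, labels)
```

When the two models agree exactly, the KL terms vanish and each loss reduces to plain cross-entropy, so the extra term only acts where the color-sensitive and color-invariant predictions differ.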

Keywords

person re-identification; cross-modality retrieval; infrared image; self-supervised learning; mutual learning