A survey of infrared-visible cross-modal pedestrian detection

Bie Qian; Wang Xiao; Xu Xin; Zhao Qijun; Wang Zheng; Chen Jun; Hu Ruimin

Published: 2023-05-16
DOI: 10.11834/jig.220670
2023 | Volume 28 | Number 5

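Bie Qian^1,2, Wang Xiao^1,2, Xu Xin^1,2, Zhao Qijun^3, Wang Zheng^4, Chen Jun^4, Hu Ruimin^4 (1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065; 2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan 430065; 3. National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065; 4. Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072)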

Abstract

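In well-lit conditions, visible images provide information that helps detect pedestrians, such as color and texture, but they perform poorly in low-illumination scenes. Infrared images cannot provide color or texture, but because they are formed from differences in thermal radiation and do not depend on lighting, they can effectively separate pedestrian regions from the background in low-illumination scenes and provide clear pedestrian contours. Because the two modalities are intuitively complementary, pedestrian detection that uses infrared and visible images together is regarded as a promising research direction. It has attracted wide attention and has substantially advanced applications in security (such as surveillance and autonomous driving) and in epidemic prevention and control. This paper comprehensively reviews work on infrared-visible cross-modal pedestrian detection and reflects on future directions. First, the topic has distinctive properties. A visible image carries three channels of color information, whereas an infrared image carries a single channel of temperature-difference information; how to fully exploit the complementarity of the two modalities despite this essential difference is the core challenge and main task of the field. Second, the problems addressed by recent research fall into two categories: large modality difference and difficulty of practical application. The modality-difference problem divides into image misalignment and inadequate fusion. The practical-application problem divides into annotation cost, real-time detection, and hardware cost. This paper describes each of these main research directions in detail and summarizes them. It then introduces the datasets and evaluation metrics relevant to cross-modal pedestrian detection and compares the related methods at different levels using different metrics. Finally, it discusses open problems in the field and offers some thoughts on future directions.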
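As a hedged illustration of the channel difference described in the abstract, the sketch below stacks a single-channel infrared image onto a three-channel visible image to form a four-channel input. This is only one simple way to combine the two modalities and is not taken from any surveyed method. The function name `stack_rgb_ir` and the nested-list image representation are assumptions made for illustration, and the sketch presumes the image pair is already pixel-aligned.

```python
from typing import List, Tuple

RGBPixel = Tuple[float, float, float]
FusedPixel = Tuple[float, float, float, float]


def stack_rgb_ir(rgb: List[List[RGBPixel]], ir: List[List[float]]) -> List[List[FusedPixel]]:
    """Append the infrared value to each visible pixel (hypothetical helper).

    rgb: H x W grid of (r, g, b) tuples from the visible image.
    ir:  H x W grid of single thermal values from the infrared image.
    Returns an H x W grid of (r, g, b, t) tuples.
    """
    # The two images must cover the same pixel grid; a misaligned pair
    # would pair each color with the wrong thermal value.
    if len(rgb) != len(ir) or any(len(a) != len(b) for a, b in zip(rgb, ir)):
        raise ValueError("visible and infrared images must have the same height and width")
    return [
        [(r, g, b, t) for (r, g, b), t in zip(rgb_row, ir_row)]
        for rgb_row, ir_row in zip(rgb, ir)
    ]
```

Calling `stack_rgb_ir` on a 1 x 2 visible image and a 1 x 2 infrared image yields a 1 x 2 grid of four-value pixels. The size check makes the alignment assumption explicit rather than silently pairing mismatched pixels.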

Keywords

cross-modal pedestrian detection; visible image; infrared image; deep learning; pedestrian detection

Visible-infrared cross-modal pedestrian detection: a summary

Bie Qian^1,2, Wang Xiao^1,2, Xu Xin^1,2, Zhao Qijun^3, Wang Zheng^4, Chen Jun^4, Hu Ruimin^4 (1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China; 2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan 430065, China; 3. National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China; 4. Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China)

Abstract

Pedestrian detection aims to accurately locate pedestrian instances in a given input image. However, because visible images are sensitive to illumination changes, they perform poorly in low-visibility conditions such as extreme weather. Hence, pedestrian detection based on visible images alone is poorly suited to around-the-clock applications such as autonomous driving and video surveillance. Infrared images, which are formed from the temperature difference between the human body and the environment, can provide a clear pedestrian contour in such low-visibility scenes. When light is sufficient, visible images in turn provide information that infrared images lack, such as hair, facial, and other appearance features. Visible and infrared images thus provide complementary visual information. However, the key challenge in using them together is to exploit the complementarity between the two modalities while coping with their modality-specific noise. The two modalities differ fundamentally: a visible image carries color information in red, green, and blue (RGB) channels, whereas an infrared image has a single channel that carries temperature information. Because their imaging mechanisms differ, the wavelength ranges of the two also differ. Cross-modal pedestrian detection approaches based on deep learning have developed rapidly. This survey reviews and analyzes representative research on cross-modal pedestrian detection in recent years. The problems addressed fall into two categories: 1) the large difference between the two modalities and 2) the difficulty of applying cross-modal detectors in real scenes. Work on real-world application covers three problems: data annotation cost, real-time detection, and hardware cost. Work on the modality difference covers two problems: misalignment and inadequate fusion.
The misalignment problem arises because visible-infrared image pairs are expected to be strictly aligned, so that features from the two modalities match at corresponding positions. The inadequate-fusion problem concerns how to maximize the mutual benefit of the two modalities. Early research on inadequate fusion studied the fusion stage (when to fuse), whereas later research studied the fusion method (how to fuse). The fusion stage can be divided into three levels: image, feature, and decision. Similarly, fusion methods can be divided into three categories: image, feature, and detection. Subsequently, we introduce some commonly used cross-modal pedestrian detection datasets, including the Korea Advanced Institute of Science and Technology (KAIST), forward looking infrared radiometer (FLIR), computer vision center-14 (CVC-14), and low-light visible-infrared paired (LLVIP) datasets. We then introduce evaluation metrics for cross-modal pedestrian detectors, including miss rate (MR), mean average precision (mAP), and speed, measured as the time taken to process one pair of visible and thermal images. Finally, we summarize the unresolved challenges in cross-modal pedestrian detection and discuss future directions. 1) In the real world, the two sensors differ in parallax and field of view, so misalignment between visible and infrared features is a major concern. Unaligned features can degrade detector performance, hinder the use of unaligned data in datasets, and to some extent limit the deployment of dual sensors in real life. Resolving the positional misalignment between the two modalities is therefore a key research direction.
2) At present, the datasets for cross-modal pedestrian detection are all captured in clear weather, and current advanced methods address around-the-clock pedestrian detection only in clear weather. To realize a cross-modal pedestrian detection system that works all day and in all weather, research must go beyond daytime and nighttime data captured in clear weather and also consider data captured under extreme weather conditions. 3) Recent studies on cross-modal pedestrian detection focus on datasets captured by vehicle-mounted cameras. Compared with datasets captured from a surveillance viewpoint, vehicle-mounted datasets contain more varied scenes, which helps suppress over-fitting. However, nighttime images in vehicle-mounted datasets may be brighter than those in surveillance datasets because of the vehicle's headlights. We therefore expect that datasets from multiple viewpoints can be used together to train a cross-modal pedestrian detector, which would both increase the robustness of the model in darker scenes and suppress over-fitting to a particular scene. 4) Autonomous driving systems and robot systems must respond quickly to detection results. Although many models run fast inference on a GPU (graphics processing unit), inference speed on real devices still needs to be optimized, so real-time detection will remain an ongoing development direction for cross-modal pedestrian detection. 5) Cross-modal pedestrian detection still performs poorly on small-scale pedestrians and on partially or heavily occluded pedestrians. Yet occlusion is very common in daily life, and driver-assistance systems need to detect small, distant pedestrians in order to alert drivers to slow down in advance. Detecting small-scale and occluded pedestrians is therefore a direction for future research.
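To make the miss rate (MR) metric named in the abstract concrete, the following minimal sketch computes it as the fraction of ground-truth pedestrians that a detector fails to find, which equals one minus recall. The function name `miss_rate` and its count-based interface are assumptions for illustration; the benchmark protocols covered by the survey may define matching rules and averaging that this sketch does not reproduce.

```python
def miss_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of ground-truth pedestrians that were missed (hypothetical helper).

    true_positives:  ground-truth pedestrians matched by a detection.
    false_negatives: ground-truth pedestrians with no matching detection.
    """
    if true_positives < 0 or false_negatives < 0:
        raise ValueError("counts must be non-negative")
    total = true_positives + false_negatives
    if total == 0:
        # With no ground-truth pedestrians, the miss rate is undefined.
        raise ValueError("at least one ground-truth pedestrian is required")
    return false_negatives / total
```

For example, a detector that finds 80 of 100 annotated pedestrians has 20 false negatives, so its miss rate is 20 / 100 = 0.2. Lower values are better, in contrast to mAP, where higher values are better.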

Keywords

cross-modal pedestrian detection; visible image; infrared image; deep learning; pedestrian detection