许可1, 刘心溥1, 汪汉云2, 万建伟1, 郭裕兰1(1.国防科技大学;2.信息工程大学)
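Objective Object detection based on the fusion of visible and infrared dual-modal images has attracted sustained attention in recent years and is an effective means of handling object detection in complex scenes. However, feature fusion in existing dual-modal detection algorithms has two major problems: first, the fusion methods are simple, and element-wise addition or concatenation of features yields poor fusion results; second, the algorithm structure contains only a feature fusion process and lacks a feature selection process, so useful features cannot be exploited efficiently. Method To address these problems, this paper proposes a visible-infrared image fusion object detection algorithm based on dynamic feature selection, which contains two novel modules, a dynamic fusion layer and a dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and uses a Transformer structure to fuse the multi-source image feature maps multiple times, enriching the feature representation; the dynamic selection layer is embedded in the neck network and uses three attention mechanisms to enhance the multi-scale feature maps and select useful features. Result The proposed algorithm is validated experimentally on three public datasets, FLIR, LLVIP, and VEDAI, and compared with several feature fusion methods. Relative to the baseline model, mAP50 improves by 1.3%, 0.6%, and 3.9%, mAP75 by 4.6%, 2.6%, and 7.5%, and mAP by 3.2%, 2.1%, and 3.1%, respectively. Ablation experiments on the relevant structures further verify the effectiveness of the proposed algorithm. Conclusion The proposed visible-infrared image fusion object detection algorithm based on dynamic feature selection effectively fuses the feature information of the visible and infrared image modalities and improves object detection performance.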
Infrared-Visible image feature dynamic selection object detection network
Xu Ke1, Liu Xinpu1, Wang Hanyun2, Wan Jianwei1, Guo Yulan1 (1. National University of Defense Technology; 2. Information Engineering University)
Objective Object detection based on the fusion of visible and infrared dual-modal images has attracted significant attention in recent years and is an effective approach to object detection in complex scenes. An object detection algorithm can be roughly divided into three stages. The first stage is feature extraction, which extracts geometric features from the input data. Next, the extracted features are fed into the neck network for multi-scale feature fusion. Finally, the fused features are passed to the detection network, which outputs the detection results. Dual-modal detection algorithms follow the same process to localize and classify objects. The difference is that traditional object detection works on single-modal visible images, whereas dual-modal detection works on both visible and infrared images. A dual-modal detection algorithm uses information from the infrared and visible images simultaneously, fusing them to obtain more comprehensive and accurate target information and thereby improving the accuracy and robustness of detection. Traditional fusion methods include pixel-level fusion and feature-level fusion. Pixel-level fusion applies a straightforward weighted overlay to the two images, enhancing the contrast and edge information of the targets. Feature-level fusion extracts features from the infrared and visible images separately and combines them to strengthen the representation of the targets. However, the feature fusion process of existing dual-modal detection algorithms faces two major issues. First, the feature fusion methods employed are relatively simple, relying on element-wise addition or concatenation of features. 
As a consequence, these methods yield unsatisfactory fusion results, which limits the performance of subsequent object detection. Second, the algorithm structure contains only a feature fusion process and lacks a feature selection process, so valuable features are not used efficiently. Method To address these two issues, we propose a visible and infrared image fusion object detection algorithm based on dynamic feature selection. Overall, the algorithm enhances the conventional YOLOv5 detector through modifications to its backbone, neck, and detection head components. We select CSPDarkNet53 as the backbone, with an identical structure for the visible and infrared image branches. The algorithm incorporates two novel modules: a dynamic fusion layer and a dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and uses a Transformer structure to fuse the multi-source image feature maps multiple times, enriching the feature representation. The dynamic selection layer is embedded in the neck network and uses three attention mechanisms (scale, spatial, and channel), implemented with SENet and deformable convolutions, to enhance the multi-scale feature maps and select useful features. In line with standard practice in object detection, we use the detection head of YOLOv5 to generate detection results. The training loss is the sum of the bounding box regression loss, the classification loss, and the confidence loss, implemented with Generalized Intersection over Union (GIoU), cross-entropy, and squared-error functions, respectively. Result We validate the proposed algorithm on three publicly available datasets, FLIR, LLVIP, and VEDAI, using mean average precision (mAP) as the evaluation metric. 
Compared with the baseline model, which fuses features by element-wise addition, our algorithm improves mAP50 by 1.3%, 0.6%, and 3.9% and mAP75 by 4.6%, 2.6%, and 7.5% on the three datasets, respectively. It also improves mAP by 3.2%, 2.1%, and 3.1% on the respective datasets, effectively reducing missed detections and false alarms. Moreover, we conduct ablation experiments on the two novel modules, the dynamic fusion layer and the dynamic selection layer. The complete model, which incorporates both layers, achieves the best performance on all three test datasets, validating the effectiveness of the proposed algorithm. We also compare the model size and computational efficiency of our algorithm with those of state-of-the-art algorithms; the experiments show that our algorithm significantly improves detection performance at the cost of a slight increase in parameters and computation. Furthermore, to better reveal the mechanism of the dynamic fusion layer, we visualize the attention weight matrices of the three dynamic fusion layers in the backbone. The visual analysis confirms that the dynamic fusion layer effectively integrates the feature information from the visible and infrared images. Conclusion We propose a visible and infrared image fusion object detection algorithm based on a dynamic feature selection strategy, which incorporates two novel modules: a dynamic fusion layer and a dynamic selection layer. Through extensive experiments, we demonstrate that the algorithm effectively integrates feature information from the visible and infrared image modalities, thereby improving object detection performance. However, the algorithm slightly increases computational complexity and requires the input visible and infrared images to be registered in advance, which limits its application scenarios. 
Lightweight fusion modules and algorithms capable of processing unregistered visible and infrared image pairs will be the focus of future research on multimodal fusion object detection.
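
As an illustration of the bounding box regression term named in the Method part, the following is a minimal sketch of a GIoU-based loss for a single pair of axis-aligned boxes. The function names `giou` and `giou_loss`, the `(x1, y1, x2, y2)` box format, and the `1 - GIoU` form of the loss are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 (assumed format).
    The result lies in (-1, 1]; it equals 1 only for identical boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle; width/height clamp to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # Penalize the empty part of the enclosing box, so that disjoint
    # boxes still receive a distance-dependent score.
    return iou - (enclose - union) / enclose


def giou_loss(pred_box, target_box):
    """Regression loss as 1 - GIoU (assumed form); 0 for a perfect match."""
    return 1.0 - giou(pred_box, target_box)


if __name__ == "__main__":
    print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(giou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap
    print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, the enclosing-box penalty keeps the loss sensitive to how far apart the predicted and target boxes are.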