Liu Xiang1, Li Hui1, Cheng Yuanzhi2, Kong Xiangzhen3, Chen Shuangmin1 (1. School of Information Science and Technology, Qingdao University of Science and Technology; 2. School of Computer Science and Technology, Harbin Institute of Technology; 3. Department of Industrial Engineering & Innovation Sciences, Eindhoven University of Technology)
Objective 3D multi-object tracking is a highly challenging task. Multimodal fusion of images and point clouds can improve multi-object tracking performance, but owing to scene complexity and the heterogeneity of multimodal data, the adequacy of fusion and the robustness of association remain urgent problems. A 3D multi-object tracking method based on image and point cloud multi-information perception association is therefore proposed. Method First, a hybrid soft attention module is proposed that enhances image semantic features using a channel separation technique, enabling better information interaction between channel and spatial attention. Second, a semantic-feature-guided multimodal fusion network is proposed that deeply and adaptively fuses point cloud features, image features, and point-wise image features, suppressing interference between modalities and improving tracking of distant small objects and occluded objects. Finally, a multiple-information perception affinity matrix is constructed that combines intersection over union (IoU), Euclidean distance, appearance information, and direction similarity for data association, increasing the matching rate between trajectories and detections and improving tracking performance. Result The method is evaluated on the KITTI and NuScenes benchmarks and compared with existing advanced tracking methods. On KITTI, it reaches 76.94% HOTA and 88.12% MOTA, improvements of 1.48% and 3.49% over the best-performing compared model. On NuScenes, it reaches 68.3% AMOTA and 57.9% MOTA, improvements of 0.6% and 1.1% over the best-performing compared model; overall performance on both datasets surpasses the advanced tracking methods. Conclusion The proposed method tracks objects in complex scenes accurately, is more robust, and is better suited to 3D multi-object tracking in autonomous driving scenarios.
3D Multi-Object Tracking Based on Image and Point Cloud Multi-Information Perception Association
Liu Xiang1, Li Hui1, Cheng Yuanzhi2, Kong Xiangzhen3, Chen Shuangmin1 (1. School of Information Science and Technology, Qingdao University of Science and Technology; 2. School of Computer Science and Technology, Harbin Institute of Technology; 3. Department of Industrial Engineering & Innovation Sciences, Eindhoven University of Technology)
Objective 3D multi-object tracking is a challenging task in autonomous driving and plays a crucial role in the safety and reliability of the perception system. RGB cameras and LiDAR sensors are the sensors most commonly used for this task. RGB cameras provide rich semantic information but lack depth; LiDAR point clouds provide accurate position and geometric information but are unordered, unevenly distributed, and vary in density with range (dense nearby, sparse at a distance). Multimodal fusion of images and point clouds can improve multi-object tracking performance, but because of scene complexity and the heterogeneity of the modalities, existing fusion methods are insufficient and fail to obtain rich fused features. In addition, existing methods usually measure the similarity between objects only by the intersection over union (IoU) or the Euclidean distance between predicted and detected bounding boxes, which easily causes trajectory fragmentation and identity switching. The adequacy of multimodal fusion and the robustness of data association therefore remain urgent problems. To this end, a 3D multi-object tracking method based on image and point cloud multi-information perception association is proposed. Method First, a hybrid soft attention module is proposed that enhances image semantic features using a channel separation technique, enabling better information interaction between channel and spatial attention.
The module comprises two sub-modules. The first is the soft channel attention sub-module: a global average pooling layer compresses the spatial information of the image features into a channel vector; two fully connected layers then capture the correlations between channels; a sigmoid function yields the channel attention map; and multiplying this map with the original features gives the channel-enhanced features. The second is the soft spatial attention sub-module. To make better use of the channel attention map within spatial attention, a channel separation mechanism first splits the channel-enhanced features into two groups along the channel axis according to the channel attention map, an important-channel group and a minor-channel group, with the channel order preserved during separation. Each group is then enhanced separately with spatial attention, and the two enhanced groups are summed to obtain the final enhanced features. Second, a semantic-feature-guided multimodal fusion network is proposed in which point cloud features, image features, and point-wise image features are fused deeply and adaptively, suppressing interference between modalities and exploiting the complementarity of point clouds and images to improve tracking of distant small objects and occluded objects.
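The two attention sub-modules described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the fully connected weights `w1`/`w2`, the number of important channels `k`, and the simplified spatial attention (a channel-mean map in place of learned convolutions) are all assumptions for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_soft_attention(feat, w1, w2, k=None):
    """Sketch of the hybrid soft attention module on a (C, H, W) feature map.

    w1, w2: weights of two hypothetical fully connected layers.
    k: number of channels assigned to the important group (default C // 2).
    """
    C, H, W = feat.shape

    # --- soft channel attention sub-module ---
    v = feat.mean(axis=(1, 2))                 # global average pooling -> (C,)
    a = sigmoid(w2 @ np.maximum(w1 @ v, 0.0))  # two FC layers + sigmoid -> channel map
    fc = feat * a[:, None, None]               # channel-enhanced features

    # --- channel separation (channel order preserved via boolean masks) ---
    if k is None:
        k = C // 2
    thresh = np.sort(a)[::-1][k - 1]           # k-th largest attention value
    important = a >= thresh
    groups = [fc * important[:, None, None],   # important-channel group
              fc * (~important)[:, None, None]]  # minor-channel group

    # --- soft spatial attention per group, then sum the two groups ---
    out = np.zeros_like(feat)
    for g in groups:
        s = sigmoid(g.mean(axis=0))            # simplified spatial map -> (H, W)
        out += g * s[None, :, :]
    return out
```

The boolean masks keep the original channel ordering, matching the paper's note that separation does not reorder channels.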
Specifically, the network first maps the point cloud features, image features, and point-wise image features to the same number of channels and concatenates them. A series of convolutional layers then models the correlations among the features, a sigmoid function produces an adaptive weight for each modality, and each weight is multiplied with its original features to obtain per-modality attention features, which are summed to form the final fused features. Finally, a multiple-information perception affinity matrix is constructed that combines IoU, Euclidean distance, appearance information, and direction similarity for data association, increasing the matching rate between trajectories and detections and improving tracking performance. Specifically, a Kalman filter first predicts the state of each trajectory in the current frame; the IoU, Euclidean distance, and direction similarity between the detected and predicted boxes are then computed and combined into a position affinity; and the appearance affinity matrix and the position affinity matrix are combined by a weighted sum into the final multiple-information perception affinity matrix. Based on this matrix, the Hungarian algorithm completes association matching between objects in adjacent frames. Result The proposed modules are first validated on the KITTI validation set. Ablation experiments show that each module, namely hybrid soft attention, semantic-feature-guided multimodal fusion, and the multiple-information perception affinity matrix, improves the tracking performance of the model to a different degree, demonstrating the effectiveness of the proposed modules.
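The multiple-information affinity construction and the association step can be sketched as follows. This is a hypothetical illustration: the weights `w_pos`/`w_app`, the exponential mapping of Euclidean distance to a similarity, axis-aligned 2D boxes, and the greedy assignment (a simple stand-in for the Hungarian algorithm the paper uses) are assumptions for demonstration, not the paper's exact formulation.

```python
import numpy as np

def box_iou_2d(a, b):
    """Axis-aligned IoU; boxes are np.array (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def affinity_matrix(pred_boxes, det_boxes, pred_app, det_app,
                    pred_yaw, det_yaw, w_pos=0.5, w_app=0.5):
    """Combine IoU, Euclidean distance, direction similarity (position affinity)
    and cosine appearance similarity into one matrix (hypothetical weights)."""
    T, D = len(pred_boxes), len(det_boxes)
    A = np.zeros((T, D))
    for i in range(T):
        for j in range(D):
            iou = box_iou_2d(pred_boxes[i], det_boxes[j])
            ci = 0.5 * (pred_boxes[i][:2] + pred_boxes[i][2:])  # box centers
            cj = 0.5 * (det_boxes[j][:2] + det_boxes[j][2:])
            dist = np.exp(-np.linalg.norm(ci - cj))             # distance -> (0, 1]
            direc = 0.5 * (1.0 + np.cos(pred_yaw[i] - det_yaw[j]))
            pos = (iou + dist + direc) / 3.0                    # position affinity
            app = float(pred_app[i] @ det_app[j] /
                        (np.linalg.norm(pred_app[i]) * np.linalg.norm(det_app[j]) + 1e-8))
            A[i, j] = w_pos * pos + w_app * app                 # weighted sum
    return A

def greedy_match(A, thresh=0.3):
    """Greedy stand-in for the Hungarian matching step."""
    pairs, A = [], A.copy()
    while A.size and A.max() > thresh:
        i, j = np.unravel_index(A.argmax(), A.shape)
        pairs.append((int(i), int(j)))
        A[i, :], A[:, j] = -1.0, -1.0   # remove matched row and column
    return pairs
```

In practice the greedy loop would be replaced by an optimal assignment solver such as `scipy.optimize.linear_sum_assignment`, which implements the Hungarian method the paper names.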
The method is then evaluated on the KITTI and NuScenes benchmarks and compared with existing advanced 3D multi-object tracking methods. On KITTI, it reaches 76.94% HOTA and 88.12% MOTA, improvements of 1.48% and 3.49% over the best-performing compared model. On NuScenes, it reaches 68.3% AMOTA and 57.9% MOTA, improvements of 0.6% and 1.1% over the best-performing compared model; overall performance on both datasets surpasses the advanced tracking methods. Conclusion The proposed method alleviates missed detection of occluded objects and distant small objects, identity switching, and trajectory fragmentation, and can accurately and stably track multiple objects in complex scenes. Compared with existing competing methods, it delivers more advanced tracking performance and better robustness and is better suited to scenarios such as autonomous driving environment perception and intelligent transportation.