Jia Di, Wang Qian, Zhao Jinyuan, Pang Yuheng (Liaoning Technical University)
Objective Affected by occlusion and accumulated error, existing real-time 6-dimensional (6D) pose tracking methods perform poorly in complex scenes. To address this, this paper proposes a highly robust network for real-time 6D pose tracking of rigid objects. Method In the overall network design, the current-frame color and depth (RGB-D) image and the previous frame's pose estimate are processed by dimension-raising residual sampling filtering and feature encoding to obtain the pose difference, which is combined with the previous frame's pose estimate to compute the target's current 6D pose. In the residual sampling filtering module, the self-gated Swish activation function is adopted to retain the target's detail features and improve tracking accuracy. In the feature aggregation module, the extracted features are decomposed into horizontal and vertical components, which capture long-range dependencies in time and space while preserving position information, generating a set of complementary, position- and time-aware feature maps; this strengthens feature extraction and accelerates network convergence. Result Experiments on the YCB-Video and YCBInEOAT datasets show that the proposed method reaches a tracking speed of 90.9 Hz, with tracking accuracies of 93.24 on the average distance of model points (ADD) and 95.84 on the average closest point distance (ADD-S), both higher than those of comparable methods. The proposed method leads other current rigid-body pose tracking methods in both accuracy and speed: compared with the se(3)-TrackNet network, its ADD and ADD-S are 25.95 and 30.91 higher when trained on only 6,000 sets of synthetic data, 31.72 and 28.75 higher with 8,000 sets, and 35.57 and 21.07 higher with 10,000 sets, and it achieves highly robust 6D pose tracking of targets under severe occlusion. Conclusion Driven by synthetic data, the proposed network tracks the target's 6D pose accurately in real time and converges quickly; the experimental results verify the effectiveness of the method.
A fast-convergence network for target 6D pose tracking driven by synthetic data
Jia Di, Wang Qian, Zhao Jinyuan, Pang Yuheng (Liaoning Technical University)
Objective Rigid object pose estimation is one of the fundamental and most challenging problems in computer vision, and it has garnered significant attention in recent years. Researchers seek methods to localize the multiple degrees of freedom (DOF) of rigid objects in a three-dimensional scene, namely translation in position and rotation in orientation. With the development of computer vision techniques, considerable progress has been made in rigid object pose estimation, and the task has become increasingly important in applications such as robotics, on-orbit servicing, autonomous driving, and augmented reality. Rigid object pose estimation can be roughly divided into two stages: the traditional stage (e.g., feature-based, template-matching, and three-dimensional-coordinate-based methods) and the deep learning stage (e.g., improved traditional methods and direct or indirect estimation methods). Although existing methods and their improved variants achieve high tracking accuracy, their precision deteriorates significantly when they are applied to new scenes or novel target objects, and they perform poorly in complex environments. In such cases, a large amount of training data across multiple scenarios is required, incurring high costs for data collection and network training. To address this issue, this paper proposes a synthetic-data-driven real-time tracking network for rigid object 6D pose with fast convergence and high robustness. It provides long-term stable 6D pose tracking of target rigid objects while greatly reducing the cost of data collection and the time required for network convergence. Method The improvement in convergence speed comes mainly from the overall network design, the residual sampling filtering module, and the feature aggregation module.
The rigid 6D pose transformation is calculated using Lie algebra and Lie group theory. The current-frame RGB-D image and the previous frame's pose estimate are transformed into a pair of four-dimensional tensors and fed into the network. The pose difference is obtained through residual sampling filtering and the feature encoder, and the current 6D pose of the target is computed jointly with the previous frame's pose estimate. In the residual sampling filtering module, the self-gated Swish activation function is used to retain target detail features, and the translation and rotation components are obtained by decoupling the target pose through the feature encoder and decoder, which improves the accuracy of pose tracking. In the feature aggregation module, the features are decomposed into horizontal and vertical components and aggregated into one-dimensional feature encodings, capturing long-range dependencies and preserving position information in both time and space. A set of complementary, position- and time-aware feature maps is generated to strengthen feature extraction and thereby accelerate network convergence. Result To ensure consistent training and testing environments, all experiments were conducted on a desktop computer with an Intel Core processor and an NVIDIA RTX 3060 GPU. Each target in the complete dataset contains approximately 23,000 sets of 176×176-pixel images, totaling about 15 GB. During training and validation, the batch size was set to 80, and the model was trained for 300 epochs. The initial learning rate was 0.01, decayed by factors of 0.9 and 0.99 from the 100th and 200th epochs, respectively. For evaluating tracking performance, the ADD metric is commonly used to assess pose estimation accuracy for non-symmetric objects.
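The inter-frame pose update on SE(3) described in the Method section can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the function name `exp_se3`, the variable names, and the example increment `xi_hat` are all hypothetical; it only shows how a predicted se(3) increment is composed with the previous pose via the exponential map.

```python
import numpy as np

def exp_se3(xi):
    """Exponential map from a 6-vector xi = (omega, v) in se(3)
    to a 4x4 homogeneous transform in SE(3) (Rodrigues' formula)."""
    omega, v = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    # Skew-symmetric matrix of the rotation part
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    if theta < 1e-8:
        R = np.eye(3) + K          # first-order approximation near zero
        V = np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * K
             + (1.0 - np.cos(theta)) / theta**2 * K @ K)
        V = (np.eye(3) + (1.0 - np.cos(theta)) / theta**2 * K
             + (theta - np.sin(theta)) / theta**3 * K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

# A tracker of this kind composes the predicted inter-frame increment
# with the previous pose estimate: T_t = T_{t-1} @ exp(xi_hat)
T_prev = np.eye(4)                                   # previous-frame pose
xi_hat = np.array([0.0, 0.0, 0.1, 0.01, 0.0, 0.0])   # small z-rotation + x-translation
T_curr = T_prev @ exp_se3(xi_hat)
```

Representing the network output as a small se(3) increment rather than an absolute pose is what lets the tracker accumulate motion frame by frame from a single initialization.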
ADD computes the Euclidean distance between each model point transformed by the predicted pose and the corresponding point transformed by the ground-truth pose, then averages these distances. However, ADD is not suitable for symmetric objects, since multiple correct poses may exist for a symmetric object in the same image. In such cases, the ADD-S metric is used instead: for each point of the model under the predicted pose, the distance to the closest point of the model under the ground-truth pose is computed, and these closest-point distances are averaged, making the metric tolerant of pose ambiguity and therefore more appropriate for symmetric objects. The YCB-Video and YCBInEOAT datasets were used to evaluate the relevant methods. YCB-Video contains complex scenes captured by a moving camera under severe occlusion, while YCBInEOAT involves rigid objects manipulated by a robotic arm; together the two datasets validate the generality and robustness of the network across different scenarios. The experimental results show that the tracking speed of the proposed method reaches 90.9 Hz, and the average distance of model points (ADD) and the average closest point distance (ADD-S) reach 93.24 and 95.84, respectively, both higher than those of comparable methods. Compared with se(3)-TrackNet, the previously most accurate tracker, the ADD and ADD-S of this method are 25.95 and 30.91 higher when trained on 6,000 sets of synthetic data, 31.72 and 28.75 higher with 8,000 sets, and 35.57 and 21.07 higher with 10,000 sets. The method achieves highly robust 6D pose tracking of targets in severely occluded scenes. Conclusion We propose a novel fast-converging network for tracking the pose of rigid objects, which combines the residual sampling filtering module and the feature aggregation module.
Given only a single initialization, the network provides long-term, effective 6D pose tracking of objects. Using a small amount of synthetic data, it quickly converges and performs well in complex scenes with severe occlusion and rapid displacement, demonstrating outstanding real-time tracking efficiency and accuracy. Experimental results on different datasets validate the superiority and reliability of the approach. In future work, we will continue to optimize the model, further improve tracking accuracy and convergence speed, address the network's reliance on CAD models, and extend it to category-level pose tracking.
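The ADD and ADD-S evaluation metrics used in the Results section can be sketched as follows. This is a generic numpy illustration of the standard definitions, assuming model points as an N×3 array and poses as 4×4 homogeneous transforms; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def transform(pts, T):
    """Apply a 4x4 homogeneous transform T to an (N, 3) point cloud."""
    return pts @ T[:3, :3].T + T[:3, 3]

def add_metric(pts, T_pred, T_gt):
    """ADD: mean distance between *corresponding* model points under
    the predicted and ground-truth poses (for non-symmetric objects)."""
    p = transform(pts, T_pred)
    g = transform(pts, T_gt)
    return np.linalg.norm(p - g, axis=1).mean()

def add_s_metric(pts, T_pred, T_gt):
    """ADD-S: for each predicted point, distance to the *closest*
    ground-truth point; tolerant of pose ambiguity for symmetric objects."""
    p = transform(pts, T_pred)
    g = transform(pts, T_gt)
    # Pairwise distance matrix between predicted and ground-truth points
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Toy usage: a pure 1 cm translation error yields ADD == 0.01
pts = np.random.default_rng(0).normal(size=(50, 3))
T_gt = np.eye(4)
T_pred = np.eye(4)
T_pred[:3, 3] = [0.01, 0.0, 0.0]
err_add = add_metric(pts, T_pred, T_gt)
err_add_s = add_s_metric(pts, T_pred, T_gt)
```

Because ADD-S matches each point to its nearest neighbor rather than its fixed counterpart, ADD-S never exceeds ADD for the same pose pair, which is why it is the preferred score for symmetric objects.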