Guo Wen, Liu Qigui, Ding Xinmiao (School of Information and Electronic Engineering, Shandong Technology and Business University)
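Objective To address identity switches caused by ambiguous pedestrian features and the loss of tracking accuracy caused by occlusion between targets in complex scenes, we propose the AIoU-Tracker multi-object tracking algorithm. Method First, an AIoU (Adaptive Intersection over Union) regression loss function is designed for the detection head of the backbone network. It measures box agreement in terms of overlap area, center-point distance, and aspect ratio, which mitigates the identity switches caused by the weak discriminability of ambiguous pedestrian features. Second, a simple and effective hierarchical association strategy is proposed: after high-score and low-score detection boxes have each been associated, the embedding information around the detection boxes that failed to associate is used for a further round of association, which improves association accuracy under occlusion. Result In comparative experiments against FairMOT on the MOT16 dataset, AIoU-Tracker raises HOTA (Higher Order Tracking Accuracy) from 58.3% to 59.8%, IDF1 (ID F1 Score) from 72.6% to 73.1%, and MOTA (Multi-Object Tracking Accuracy) from 69.3% to 74.4%; on the MOT17 dataset, HOTA rises from 59.3% to 59.9% and IDF1 from 72.3% to 72.9%. Conclusion The proposed feature-balancing tracking method achieves a better balance among the bounding-box size, heat-map, and center-point offset features during training and testing, which makes the multi-object tracking results more accurate.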
Multi-object tracking using adaptive-IoU loss and hierarchical association
Guo Wen, Liu Qigui, Ding Xinmiao (School of Information and Electronic Engineering, Shandong Technology and Business University)
Objective Multiple object tracking (MOT) is a mainstream task in computer vision. It aims to estimate the tracklets of multiple objects in videos and has important applications in autonomous driving, human-computer interaction, and human activity recognition. Many methods focus on improving tracking performance given the detection results. Re-ID-based trackers fall into two categories: Separate Detection and Embedding (SDE) models and Joint Detection and Embedding (JDE) models. An SDE model tunes the detection model and the Re-ID model separately, with the drawback that it cannot run in real time. A JDE model performs object detection and outputs both the object location and the appearance embedding used in the subsequent association step, which improves the algorithm's running speed. However, JDE trackers suffer from identity switches caused by ambiguous pedestrian features and from degraded tracking accuracy caused by occlusion between objects in complex scenes. To address these issues, we propose the AIoU-Tracker multi-object tracking algorithm. Method First, we design an AIoU (Adaptive Intersection over Union) regression loss function for the detection head of the backbone network. It measures overlap area, center-point distance, and aspect ratio, which alleviates the identity switches caused by ambiguous pedestrian features. Second, we propose a simple and effective hierarchical association method: high-score and low-score detection boxes are first associated separately, and the embedding information around the detection boxes that failed to associate is then used for a further round of association. This improves the association accuracy of multi-object tracking under occlusion. We use a variant of the DLA-34 network architecture as the backbone network.
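The three quantities that the regression loss is said to measure (overlap area, center-point distance, and aspect ratio) can be illustrated with a short sketch. The abstract does not give the AIoU formula, so the code below uses the well-known CIoU form as a stand-in and does not model the adaptive component; the function name `ciou_style_loss` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
import math


def ciou_style_loss(pred, target, eps=1e-9):
    """CIoU-style box regression loss for two (x1, y1, x2, y2) boxes.

    Stand-in for illustration only: this is NOT the paper's AIoU loss,
    whose exact (adaptive) form is not stated in the abstract.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Term 1: overlap area, expressed as IoU.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Term 2: squared center distance, normalized by the squared
    # diagonal of the smallest box enclosing both boxes.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Term 3: aspect-ratio consistency, weighted by alpha as in CIoU.
    v = (4 / math.pi ** 2) * (
        math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + center_term + alpha * v
```

In this form the loss stays informative for non-overlapping boxes, because the center-distance term still varies when the IoU is zero.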
Parameters pre-trained on the COCO dataset are used to initialize the model. The experiments are conducted on a system running Ubuntu 16.04 with 64 GB of memory, an RTX 2080 Ti GPU, and CUDA 10.2. We train the model with the Adam optimizer for 30 epochs, using an initial learning rate of 10⁻⁴ that is decayed to 10⁻⁵ after 20 epochs, and a batch size of 16. We apply standard data augmentation techniques, including rotation, scaling, and color jittering. The input image is resized to 1088×608, and the feature map resolution is 272×152. We evaluate our approach on the MOT Challenge benchmark, specifically the MOT16 and MOT17 datasets. Training uses the CrowdHuman dataset and the MIX dataset (ETH, CityPersons, CUHK-SYSU, Caltech, and PRW) together with MOT17. The ETH and CityPersons datasets provide only bounding box annotations, so we train only the detection branch on them. The Caltech, MOT17, CUHK-SYSU, and PRW datasets provide both bounding box positions and ID annotations, which allows both branches to be trained. To ensure a fair comparison, we remove the videos in the ETH dataset that overlap with the MOT17 test set. The CrowdHuman dataset contains only bounding box annotations, so we perform self-supervised training on it. To evaluate tracking performance, we use several well-defined metrics: Higher Order Tracking Accuracy (HOTA), Multi-Object Tracking Accuracy (MOTA), ID F1 Score (IDF1), False Positives (FP), False Negatives (FN), and Number of Identity Switches (IDs). MOTA primarily assesses the detection branch, IDF1 evaluates identity preservation and therefore the association performance, and HOTA provides a combined evaluation of detection and data association. Result We compare the performance of our method with existing methods on two datasets.
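Two of the metrics named above, MOTA and IDF1, have simple closed forms, which helps explain why the first is dominated by detection errors and the second by identity consistency. The sketch below follows the standard definitions; the function names and the counts in the usage note are illustrative assumptions, not values from the experiments.

```python
def mota(fp, fn, idsw, num_gt):
    """Multi-Object Tracking Accuracy.

    fp, fn and idsw are the false positives, false negatives and
    identity switches summed over all frames; num_gt is the total
    number of ground-truth boxes over all frames.
    """
    return 1.0 - (fp + fn + idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """ID F1 score from identity-level true/false positives and
    false negatives, i.e. counts taken after the global one-to-one
    matching between predicted and ground-truth trajectories."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

As a made-up example, 10 false positives, 20 false negatives, and 5 identity switches over 100 ground-truth boxes give a MOTA of 0.65. Because FP and FN usually far outnumber identity switches, MOTA mostly reflects detection quality, whereas IDF1 counts a detection as correct only if it carries the right identity.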
The comparative results are as follows. 1) On the MOT16 dataset, our HOTA is 59.8%, 1.5 percentage points higher than FairMOT; our MOTA is 74.4%, 5.1 percentage points higher; and our IDF1 is 73.1%, 0.5 percentage points higher. 2) On the MOT17 dataset, our HOTA is 59.9%, 0.6 percentage points higher than FairMOT, and our IDF1 is 72.9%, 0.6 percentage points higher. We also conduct ablation studies on the MOT17 dataset to verify the effectiveness of each component, and the results indicate that the proposed method alleviates the competition between the detection and Re-ID branches in multiple object tracking. In the ablation studies, adding the adaptive-IoU regression loss function reduces the number of identity switches. We further visualize the predicted Re-ID feature extraction positions, the bounding box size feature, the heat-map feature, and the center point offset feature. The visualizations show that our method is more robust than FairMOT. Moreover, the hierarchical association method makes association more robust: for example, an occluded identity can still be re-associated two frames later. Conclusion The proposed feature-balancing tracking method achieves a better balance among the bounding box size feature, heat-map feature, and center point offset feature during training and testing, resulting in more accurate multi-object tracking. In this study, we propose two improvements to the FairMOT framework. First, we design an AIoU regression loss module to optimize the detection branch, enabling it to optimize targets based on the current optimal distance and to extract more accurate appearance features.
Second, we optimize the Re-ID branch with a hierarchical association strategy module, which uses three-level matching to enhance the tracking system's association performance. Experimental results on the MOT17 dataset show HOTA increasing to 59.9%, IDF1 to 72.9%, and MOTA to 70.8%. However, the detection and Re-ID branches of the JDE tracking model still compete with each other, which can lead to a decrease in MOTA. Future research will focus on investigating this competition in the JDE tracking model.
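The three-level matching can be sketched as three successive matching passes over the tracks that remain unmatched. The abstract does not specify the cost functions, the thresholds, or the assignment solver, so everything below is an assumption for illustration: greedy matching stands in for an optimal assignment, level 2 uses IoU distance, level 3 takes the best cosine distance over a hypothetical `nearby_emb` list of embeddings sampled around each detection, and the thresholds `0.6`, `0.4`, and `0.5` are arbitrary.

```python
import math


def cosine_dist(a, b):
    """1 - cosine similarity of two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def iou_dist(a, b):
    """1 - IoU of two (x1, y1, x2, y2) boxes with positive area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return 1.0 - inter / union


def greedy_match(tracks, dets, cost_fn, max_cost):
    """Pair tracks and detections in order of increasing cost.

    Greedy matching is a simplification chosen for this sketch.
    """
    pairs = sorted(
        (cost_fn(t, d), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(dets)
    )
    used_t, used_d, matches = set(), set(), []
    for cost, ti, di in pairs:
        if cost > max_cost:
            break  # all remaining pairs are even more expensive
        if ti in used_t or di in used_d:
            continue
        used_t.add(ti)
        used_d.add(di)
        matches.append((tracks[ti]["id"], dets[di]["name"]))
    rest_t = [t for i, t in enumerate(tracks) if i not in used_t]
    rest_d = [d for i, d in enumerate(dets) if i not in used_d]
    return matches, rest_t, rest_d


def hierarchical_associate(tracks, dets, high_thresh=0.6):
    """Three-level association; all thresholds are illustrative."""
    high = [d for d in dets if d["score"] >= high_thresh]
    low = [d for d in dets if d["score"] < high_thresh]

    # Level 1: high-score detections, matched by appearance embedding.
    m1, tracks, high_left = greedy_match(
        tracks, high, lambda t, d: cosine_dist(t["emb"], d["emb"]), 0.4)

    # Level 2: low-score detections, matched by box overlap (assumed).
    m2, tracks, low_left = greedy_match(
        tracks, low, lambda t, d: iou_dist(t["box"], d["box"]), 0.5)

    # Level 3: detections that failed both levels, retried with the
    # embeddings sampled around them (hypothetical "nearby_emb" field).
    m3, tracks, dets_left = greedy_match(
        tracks, high_left + low_left,
        lambda t, d: min(cosine_dist(t["emb"], e) for e in d["nearby_emb"]),
        0.4)

    return m1 + m2 + m3, tracks, dets_left
```

Each track and detection is a plain dictionary here (`id`, `box`, `emb` for tracks; `name`, `score`, `box`, `emb`, `nearby_emb` for detections). A detection whose own embedding is corrupted by occlusion fails level 1, but it can still be recovered at level 3 if one of its surrounding embeddings is close to an unmatched track.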