Abstract
Objective In person re-identification, changes in a person's view cause appearance variation, which in turn leads to association errors. Existing methods mitigate this problem through view representation learning and view-based loss functions. However, most view representation learning methods mainly embed view labels and do not explicitly convey the spatial structure of person posture to the model, which weakens the model's perception of view. In addition, view-based loss functions typically cluster persons of the same identity by view, ignoring the erroneous associations caused by negative samples with similar appearance and the same view. Method To address these challenges, this paper proposes view-aware feature learning for person re-identification. First, we propose posture-based view feature learning, which explicitly captures the spatial structure of human posture. Second, the proposed view-adaptive triplet loss actively enlarges the margin between persons with similar appearance and the same view, thereby separating them. Result The proposed method is evaluated on large-scale public person re-identification datasets, including MSMT17 and Market1501. On MSMT17, compared with the second-best model, UniHCP, Rank-1 and mAP improve by 1.7% and 1.3%, respectively; ablation results on MSMT17 further confirm that the proposed algorithm effectively improves association performance. Conclusion The proposed algorithm effectively alleviates the degradation of association performance in person re-identification systems caused by the above challenges.
View-aware feature learning for person re-identification


Abstract: Objective In the contemporary digital and internet-driven environment, person re-identification (ReID) technology has become an integral component of domains such as intelligent surveillance, security, and new retail. However, in real-world scenarios, the same person often exhibits significant appearance differences due to changes in view, leading to degraded association performance. Existing methods typically enhance the model's representation and association ability through view representation learning and view-based loss functions that make the model perceive view information. While these methods have achieved outstanding results, significant challenges remain, which are elaborated below. Challenge 1: how to retain person representational capability in models with implicit view feature learning? In terms of view feature representation, existing methods based on the Transformer architecture convert view labels into feature vectors through a view embedding layer. Such simple labels prevent the model from perceiving complex posture information. Consequently, these methods learn view features only implicitly; that is, they do not explicitly convey to the model the spatial structure of person posture, such as the positions of keypoints and their topological relationships. As a result, the model may not precisely perceive person postures and views, which diminishes its representational capability for persons. To address this, our method embeds keypoint coordinates and models the topological structure between keypoints. Provided with this structured information, the model can understand person postures more intuitively, allowing explicit learning of person posture. Challenge 2: how to separate persons with similar appearance and the same view when indiscriminately pushing anchors away from hard negatives?
Regarding the design of view-based loss functions, many existing methods do not differentiate specific views and instead learn generic view features, which may strip the model of essential view information. Alternatively, some approaches leverage the triplet loss to reduce feature distances between persons with the same view, while increasing the distances between clusters of the same identity with opposing views and bringing clusters of adjacent views closer together. However, our analysis of error cases in real scenarios shows that persons with similar appearances and the same view often rank higher in retrieval results, degrading the performance of the ReID system. Moreover, although these methods set a uniform margin to push anchors away from hard negatives, persons with similar appearances and the same view may still not be distinctly separated. To address this issue, we introduce a large margin for different identities with similar appearances and the same view, pushing them apart. To address the outlined challenges, we introduce view-aware feature learning for person re-identification (VAFL). Method First, we propose view feature learning based on person posture (Pos2View). The view of a person is inherently determined by the spatial arrangement of body parts, which provides key cues about the view. Consequently, we integrate the person's posture information into the feature map, enhancing the model's ability to discern the person's view. Second, we propose a triplet loss with adaptive view (AdaView), which assigns adaptive margins between examples based on their views, thereby making the triplet loss view-aware. The original triplet loss updates the model by pulling the anchor and the hard positive example closer and pushing the hard negative example away from the anchor.
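For reference, the standard triplet loss just described is conventionally written as

```latex
\mathcal{L}_{\mathrm{tri}} = \max\!\big( d(f_a, f_p) - d(f_a, f_n) + m,\ 0 \big)
```

where $f_a$, $f_p$, and $f_n$ denote the features of the anchor, hard positive, and hard negative, $d$ is the Euclidean distance, and $m$ is a single fixed margin shared by all triplets. The view-adaptive variant described next replaces this single fixed $m$ with margins that depend on the view relation between examples.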
In contrast, our proposed AdaView emphasizes pushing persons with the same view and similar appearances far apart in the feature space. Specifically, these similar-appearance persons are the hard negatives in the mini-batch, i.e., those with the smallest Euclidean distance to the anchor. Because images of the same person with the same view are highly similar visually, we pull them closer in the feature space, forming sub-clusters of images with the same view; this corresponds to the minimal margin. To make the model sensitive to appearance changes caused by view shifts, we push apart the sub-clusters of the same person under different views, signified by a slightly larger margin. Finally, we deliberately increase the distance between images that have similar appearances but belong to different identities with the same view, reflected by the largest margin. Collectively, these steps define AdaView. Result In our comprehensive analysis, we assessed the performance of the proposed method against a variety of established person re-identification techniques. Our evaluation covers multiple public datasets, including Market1501 (Market), DukeMTMC-ReID, MSMT17, and CUHK. To gauge effectiveness, we use two primary metrics: Rank-1 (R1), which measures the accuracy of the first retrieval result, and mean Average Precision (mAP), which assesses overall ranking accuracy. We leverage person view annotations from select datasets and train a ResNet-based model to predict the views of persons in the MSMT17 dataset. We employ various data augmentation strategies and follow the hyperparameter settings of TransReID.
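The three-level margin scheme of AdaView can be illustrated with a minimal sketch. This is one plausible pairwise reading of the scheme, not the paper's implementation: the function name, the pairwise hinge formulation, and the margin values `m_pull`, `m_view`, and `m_neg` are all assumptions for illustration.

```python
import numpy as np

def adaview_pairwise_loss(feats, ids, views, m_pull=0.1, m_view=0.3, m_neg=0.6):
    """Hypothetical pairwise sketch of the AdaView margin scheme.

    Margin hierarchy (m_pull < m_view < m_neg):
      - same identity, same view: pull into a tight sub-cluster,
      - same identity, different view: keep sub-clusters slightly apart,
      - different identity, same view: push apart with the largest margin.
    """
    n = len(feats)
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    terms = []
    for i in range(n):
        for j in range(i + 1, n):
            if ids[i] == ids[j] and views[i] == views[j]:
                # same identity, same view: distance should stay below m_pull
                terms.append(max(d[i, j] - m_pull, 0.0))
            elif ids[i] == ids[j]:
                # same identity, different view: separate the sub-clusters slightly
                terms.append(max(m_view - d[i, j], 0.0))
            elif views[i] == views[j]:
                # different identity, same view: the hard case, largest margin
                terms.append(max(m_neg - d[i, j], 0.0))
            else:
                # different identity, different view: ordinary separation
                terms.append(max(m_view - d[i, j], 0.0))
    return float(np.mean(terms))
```

With this formulation, a close negative that shares the anchor's view incurs a larger penalty than an equally close negative seen from a different view, which is the behavior the abstract describes.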
In direct comparison with state-of-the-art methods, including classic person re-identification techniques and recent advances such as TransReID and UniHCP, the proposed method exhibits superior performance. Specifically, on the MSMT17 dataset, our approach surpasses UniHCP by 1.7% in R1 and 1.3% in mAP. This improvement can be attributed to VAFL, which enhances cluster differentiation and retrieval accuracy. We further conducted tests on generalized person re-identification tasks to validate the model's adaptability and stability in diverse scenarios. Compared with representative generalization methods, our approach demonstrates a slight edge, mainly due to VAFL's capacity to refine cluster boundaries and balance intraclass compactness against interclass dispersion. Our ablation study shows that removing the VAFL components significantly reduces performance, highlighting their critical role in the overall effectiveness of the method. These results confirm the robustness and superiority of our approach in person re-identification, paving the way for practical deployment in real-world applications. Conclusion In this paper, we introduce VAFL, which enhances the model's sensitivity to view and helps distinguish persons with similar appearances and the same view. Experimental results demonstrate that our approach performs strongly across various scenarios, confirming its efficiency and reliability.
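The posture-based view feature learning (Pos2View) idea of embedding keypoint coordinates and their topology can also be sketched. This is a toy illustration only: the 5-joint skeleton, the edge list, the normalization, and the single untrained message-passing step are all assumptions; the paper's actual keypoint set, encoder, and fusion mechanism are not specified in this abstract.

```python
import numpy as np

# Hypothetical toy skeleton: head, l_shoulder, r_shoulder, l_hip, r_hip.
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (1, 2), (3, 4)]

def pose_view_feature(keypoints, appearance):
    """Sketch of the Pos2View idea: normalize keypoint coordinates,
    propagate them over the skeleton graph so the topology is made
    explicit, and fuse the result with the appearance feature."""
    k = keypoints.shape[0]
    adj = np.eye(k)
    for i, j in EDGES:
        adj[i, j] = adj[j, i] = 1.0
    adj /= adj.sum(axis=1, keepdims=True)        # row-normalized adjacency
    coords = keypoints - keypoints.mean(axis=0)  # translation-invariant
    scale = np.abs(coords).max()
    if scale > 0:
        coords = coords / scale                  # scale-normalized
    pose = adj @ coords                          # one hop of message passing
    view_feat = pose.reshape(-1)                 # flatten joints to a vector
    return np.concatenate([appearance, view_feat])
```

The normalization makes the pose feature invariant to where the person appears in the image, while the adjacency multiplication mixes each joint with its skeletal neighbors, so the spatial arrangement of body parts (and hence the view) is conveyed explicitly rather than through a bare view label.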