Objective Person re-identification is a key technology for recognizing the same person across cameras. It faces challenges such as variations in appearance, illumination, pose, and background, and the core of distinguishing individual persons lies in the representation of their global and local features. To represent persons efficiently, this paper proposes a multi-resolution feature attention fusion method for person re-identification. Method The backbone of the method is based on HRNet (High-Resolution Network), which builds four branches through interleaved convolutions to extract multi-resolution person image features, both extracting features of different granularities and exchanging information between branches. Based on the four different resolution features output by HRNet, this paper leverages an attention mechanism and proposes a multi-resolution feature attention fusion method to represent persons efficiently. Result Experiments on the Market1501, CUHK03 and DukeMTMC-ReID datasets verify the effectiveness of the proposed method: Rank-1 reaches 95.3%, 72.8% and 90.5%, and mAP reaches 89.2%, 70.4% and 81.5%, respectively. On Market1501 and DukeMTMC-ReID, the results surpass the current best performance. Conclusion The proposed multi-resolution feature attention fusion method for person re-identification yields a strong person feature representation and significantly improves re-identification accuracy. Since the method focuses on strengthening the network's ability to extract features, it can be applied not only to person re-identification but also to image classification, object detection, and other computer vision tasks that rely on feature extraction.
Multi-resolution feature attention fusion method for person re-identification
Shen Qing, Tian Chang, Wang Jiabao, Jiao Shanshan, Du Lin (Army Engineering University of PLA)
Objective Person re-identification (ReID) is a computer vision task that re-identifies a queried person across non-overlapping surveillance camera views deployed at different locations by matching person images. As a fundamental problem in intelligent surveillance analysis, it has attracted increasing interest in recent years from the computer vision and pattern recognition research communities. Although great progress has been made in person ReID, it still faces many challenges such as occlusion, illumination, pose variation, and background clutter. The key to overcoming these difficulties is to design a convolutional neural network (CNN) architecture that extracts discriminative feature representations, which compact the intra-class variation (images of the same person) and separate the inter-class variation (images of different persons). A person ReID pipeline mainly comprises two stages: feature extraction and distance measurement. Most contemporary studies focus on feature extraction, because a good feature can effectively distinguish different persons. The designed CNN therefore needs to represent both the global and the local features of each person well. To fully mine the information contained in an image, we fuse the features of the same image at different resolutions to obtain a stronger feature representation, and we develop a multi-resolution feature attention fusion method for person re-identification.
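The distance-measurement stage mentioned above can be sketched as a simple ranking of gallery images by their similarity to a query feature. The feature vectors and the `rank_gallery` helper below are purely illustrative, not the paper's implementation:

```python
# Minimal sketch of the retrieval stage: rank gallery images by cosine
# similarity to a query feature. The toy 3-d features are illustrative only.
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from most to least similar."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery image
    return np.argsort(-sims)          # indices in descending similarity

query = np.array([1.0, 0.0, 1.0])
gallery = np.array([[1.0, 0.1, 0.9],   # visually close to the query
                    [0.0, 1.0, 0.0],   # a different person
                    [0.9, 0.0, 1.1]])  # also close to the query
print(rank_gallery(query, gallery))    # most similar gallery images first
```

In practice the features come from the trained CNN, and the same ranking logic applies regardless of the feature dimension.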
Method At present, mainstream person ReID methods are based on classical networks such as ResNet and VGGNet. The main characteristic of these networks is that the resolution of the feature maps becomes smaller and smaller as the network goes deeper. The resulting high-level features contain rich semantic information but lack spatial information. However, for the person ReID task, spatial information about the person is essential. HRNet (High-Resolution Network) is a multi-branch network that maintains high-resolution representations throughout the whole process. HRNet is constructed by interleaving convolutions, which not only helps obtain features of different granularities but also enables information exchange between the branches. HRNet outputs four feature representations at different resolutions. In this paper, we first evaluate the performance of each resolution's feature representation. The results show that the performance of the different resolutions is not consistent across datasets. Therefore, we propose an attention module to fuse the different resolution feature representations. The attention module generates four weights that sum to 1; each resolution's feature representation is scaled by its weight, and the final feature representation is the summation of the four weighted features.
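The fusion step described above can be sketched as a small PyTorch module. The layer sizes, the global pooling, and the assumption that each branch has already been projected to a common channel dimension are our simplifications, not the paper's exact design; the sigmoid-then-normalize weighting reflects the ablation finding that sigmoid normalization outperforms softmax:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse four branch features with learned weights that sum to 1.

    Assumes each HRNet branch output has already been globally pooled
    and projected to a common dimension `dim` (an illustrative sketch;
    the paper's exact layer configuration may differ).
    """
    def __init__(self, dim, num_branches=4):
        super().__init__()
        # One score per branch, predicted from the concatenated features.
        self.score = nn.Linear(dim * num_branches, num_branches)

    def forward(self, feats):                        # feats: list of 4 (B, dim) tensors
        concat = torch.cat(feats, dim=1)             # (B, 4*dim)
        w = torch.sigmoid(self.score(concat))        # per-branch weight in (0, 1)
        w = w / w.sum(dim=1, keepdim=True)           # normalize so weights sum to 1
        stacked = torch.stack(feats, dim=1)          # (B, 4, dim)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)  # weighted sum -> (B, dim)

fusion = AttentionFusion(dim=256)
branches = [torch.randn(8, 256) for _ in range(4)]   # stand-ins for 4 HRNet branches
fused = fusion(branches)
print(fused.shape)  # torch.Size([8, 256])
```

Because the weights are data-dependent, the module can emphasize whichever resolution is most informative for a given image, rather than committing to one resolution for all datasets.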
Result Experiments were conducted on three ReID datasets: Market1501, CUHK03 and DukeMTMC-ReID. Extensive experiments indicate that our method pushes performance to an exceptional level compared with most existing methods. We achieve Rank-1 accuracies of 95.6%, 72.8% and 90.5%, and mAP scores of 89.2%, 70.4% and 81.5%, on Market1501, CUHK03 and DukeMTMC-ReID, respectively. Our method achieves state-of-the-art results on the DukeMTMC-ReID dataset and yields performance competitive with state-of-the-art methods on Market1501 and CUHK03. It is worth emphasizing that the mAP score of our method is also the highest on the Market1501 dataset. In the ablation study, we evaluate the influence of three factors on model performance: the position of the attention module, the resolution of the input images, and the normalization method used for the weights. The results show that placing the attention module later in the network is better than placing it earlier, that image resolution has little influence on performance, and that sigmoid normalization outperforms softmax normalization.
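The Rank-1 and mAP scores reported above are computed per query and averaged over the query set. A minimal single-query sketch of both metrics is given below; it omits the camera-ID and junk-image filtering that the full Market1501-style evaluation protocol applies, and the toy similarities and identities are ours:

```python
import numpy as np

def rank1_and_ap(sim, gallery_ids, query_id):
    """Rank-1 hit and average precision for one query.

    sim: similarity of the query to each gallery image.
    Omits the camera-ID filtering used in full ReID protocols.
    """
    order = np.argsort(-sim)                      # best match first
    matches = (gallery_ids[order] == query_id)
    rank1 = float(matches[0])                     # is the top match correct?
    # Average precision: precision at each correct match, averaged.
    hits = np.cumsum(matches)
    precision = hits / (np.arange(len(matches)) + 1)
    ap = (precision * matches).sum() / matches.sum()
    return rank1, ap

sim = np.array([0.9, 0.2, 0.8, 0.1])              # toy query-gallery similarities
gallery_ids = np.array([5, 3, 5, 7])              # toy gallery identities
r1, ap = rank1_and_ap(sim, gallery_ids, query_id=5)
print(r1, ap)  # 1.0 1.0 (both correct matches ranked first)
```

mAP is then the mean of these per-query AP values, and Rank-1 is the fraction of queries whose top-ranked gallery image has the correct identity.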
Conclusion In this paper, we propose a multi-resolution attention fusion method for person re-identification. HRNet is well suited to extracting coarse-grained and fine-grained features, which is helpful for person re-identification. Through the ablation study, we found that the performance of the different resolution feature representations is not consistent across datasets. Thus, we propose an attention module to fuse the different resolution features. The attention module outputs four weights that represent the importance of each resolution's feature, and the fused feature is obtained by summing the weighted features. Experiments were conducted on the Market1501, CUHK03 and DukeMTMC-ReID datasets. The results show that our method outperforms several state-of-the-art person re-identification approaches and that the attention fusion method improves performance.