Multi-resolution feature attention fusion method for person re-identification
2020, Vol. 25, No. 5, pp. 946-955
Received: 2019-06-06; Revised: 2019-10-23; Accepted: 2019-10-30; Published in print: 2020-05-16
DOI: 10.11834/jig.190237
Objective
Person re-identification is a key technology for recognizing the same pedestrian across cameras. It faces challenges such as variations in appearance, illumination, pose, and background, and the core of distinguishing individual pedestrians lies in representing both their global and local features. To represent pedestrians efficiently, we propose a multi-resolution feature attention fusion method for person re-identification.
Method
With the help of an attention mechanism and built on the HRNet (high-resolution network) backbone, four branches constructed through interleaved convolutions extract multi-resolution pedestrian image features. This design extracts pedestrian features at different granularities and also exchanges information among branches, yielding an efficient feature representation of pedestrians.
Result
The effectiveness of the proposed method is verified on three datasets: Market1501, CUHK03, and DukeMTMC-ReID, where rank-1 accuracy reaches 95.3%, 72.8%, and 90.5%, and mAP (mean average precision) reaches 89.2%, 70.4%, and 81.5%, respectively. On the Market1501 and DukeMTMC-ReID datasets, the results surpass the best previously reported performance.
Conclusion
The proposed method focuses on improving the network's feature extraction ability to obtain a strong feature representation. It can be applied to feature-extraction-related computer vision tasks such as person re-identification, image classification, and object detection, and it significantly improves the accuracy of person re-identification.
Objective
Person re-identification (ReID) is a computer vision task of re-identifying a queried person across non-overlapping surveillance camera views deployed at different locations by matching images of the person. As a fundamental problem of intelligent surveillance analysis, person ReID has attracted increasing interest in recent years among the computer vision and pattern recognition research communities. Although great progress has been made in person ReID, it still faces challenges such as occlusion, illumination, pose variance, and background clutter. The key to solving these difficulties is to design a convolutional neural network (CNN) architecture that can extract discriminative feature representations. Specifically, the architecture should be capable of compacting the "intraclass" variation (obtained from the same individual) and separating the "interclass" variation (obtained from different individuals). The algorithm pipeline of person ReID mainly includes two stages, namely, feature extraction and distance measurement. Most contemporary studies have focused on feature extraction because a good feature can effectively distinguish different persons. Thus, the designed CNN needs to represent the global and local features of different individuals well. To fully mine the information contained in the image, we fuse the features of the same image at different resolutions to obtain a stronger feature representation and develop a multi-resolution feature attention fusion method for person ReID.
Method
At present, mainstream person ReID methods are based on classical networks such as ResNet and VGG (visual geometry group) Net. The main characteristic of these networks is that the resolution of the feature maps becomes increasingly smaller as the network deepens. Moreover, their high-level features contain sufficient semantic information but lack spatial information. However, for person ReID, the spatial information of an individual is necessary. The high-resolution network (HRNet) is a multi-branch network that maintains high-resolution representations throughout the whole process. HRNet is constructed with interleaved convolutions, which help obtain features of different granularities and facilitate information exchange among branches. HRNet outputs feature representations at four different resolutions. In this study, we first evaluate the performance of these different resolution feature representations. Results show that their performance is not consistent across datasets. Therefore, we propose an attention module to fuse the different resolution feature representations. The attention module generates four weights, which add up to 1; each resolution's feature representation is rescaled by its weight, and the final feature representation is the sum of the four rescaled features.
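A minimal PyTorch sketch of such a weighted fusion is given below. The class name, the branch channel widths (following HRNet-W32: 32/64/128/256), and the embedding size are our assumptions for illustration, not the authors' implementation; here the four weights are softmax-normalized so they sum to 1, as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionAttentionFusion(nn.Module):
    """Hypothetical weighted fusion of four multi-resolution branch features."""

    def __init__(self, in_channels=(32, 64, 128, 256), embed_dim=256):
        super().__init__()
        # Project each globally pooled branch descriptor to a common size.
        self.projections = nn.ModuleList(
            [nn.Linear(c, embed_dim) for c in in_channels]
        )
        # Predict one scalar weight per branch from the concatenated descriptors.
        self.weight_head = nn.Linear(embed_dim * len(in_channels), len(in_channels))

    def forward(self, feats):
        # feats: list of four maps, each (N, C_i, H_i, W_i), one per HRNet branch.
        pooled = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in feats]
        embeds = [proj(p) for proj, p in zip(self.projections, pooled)]
        logits = self.weight_head(torch.cat(embeds, dim=1))   # (N, 4)
        weights = F.softmax(logits, dim=1)                    # four weights summing to 1
        fused = (weights.unsqueeze(-1) * torch.stack(embeds, dim=1)).sum(dim=1)
        return fused, weights

# Toy usage with random branch outputs at four successively halved resolutions.
feats = [torch.randn(2, c, 64 // 2**i, 32 // 2**i)
         for i, c in enumerate((32, 64, 128, 256))]
fused, w = ResolutionAttentionFusion()(feats)
print(fused.shape)     # torch.Size([2, 256])
print(w.sum(dim=1))    # each row sums to (approximately) 1
```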
Result
Experiments are conducted on three ReID datasets: Market1501, CUHK03, and DukeMTMC-ReID. Results indicate that our method pushes the performance to an exceptional level compared with most existing methods. Rank-1 accuracies of 95.6%, 72.8%, and 90.5% and mAP (mean average precision) scores of 89.2%, 70.4%, and 81.5% are obtained on the Market1501, CUHK03, and DukeMTMC-ReID datasets, respectively. Our method achieves state-of-the-art results on the DukeMTMC-ReID dataset and yields performance competitive with state-of-the-art methods on the Market1501 and CUHK03 datasets; its mAP score is also the highest on the Market1501 dataset. In the ablation study, we evaluate the influence of three factors on the performance of our model, namely, the location of the attention module, the input image resolution, and the normalization method used for the weights. Results show that placing the attention module at the back of the network is better than placing it at the front, the image resolution has little influence on performance, and the sigmoid normalization method outperforms the softmax normalization method.
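For concreteness, the two weight-normalization choices compared in this ablation can be contrasted with a small, hypothetical PyTorch snippet (not the authors' code): softmax couples the four branch weights so they sum to 1 per image, whereas sigmoid gates each branch independently in (0, 1) with no sum constraint.

```python
import torch

logits = torch.randn(8, 4)  # hypothetical per-branch scores for a batch of 8 images

softmax_w = torch.softmax(logits, dim=1)  # competitive weights that sum to 1 per image
sigmoid_w = torch.sigmoid(logits)         # independent gates in (0, 1), no sum constraint

print(softmax_w.sum(dim=1))  # all (approximately) 1
print(sigmoid_w.sum(dim=1))  # unconstrained
```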
Conclusion
In this study, we proposed a multi-resolution attention fusion method for person ReID. HRNet is used as the backbone network to extract coarse-grained and fine-grained features, which are helpful for person ReID. Through an ablation study, we found that the performance of different resolution feature representations is not consistent across datasets. Thus, we proposed an attention module to fuse the different resolution features. The attention module outputs four weights representing the importance of the different resolution features, and the fused feature is obtained by accumulating the reweighted features. Experiments were conducted on the Market1501, CUHK03, and DukeMTMC-ReID datasets. Results showed that our method outperforms several state-of-the-art person ReID approaches and that the attention fusion method improves performance.
References
Dai Z Z, Chen M Q, Gu X D, Zhu S Y and Tan P. 2018. Batch DropBlock network for person re-identification and beyond [EB/OL]. (2018-11-17) [2019-09-20]. https://arxiv.org/pdf/1811.07130.pdf
Fu X Y, Qi Q, Huang Y, Ding X H, Wu F and Paisley J. 2018. A deep tree-structured fusion model for single image deraining [EB/OL]. (2018-11-21) [2019-09-20]. https://arxiv.org/pdf/1811.08632.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu J, Shen L, Albanie S, Sun G and Wu E H. 2018. Squeeze-and-excitation networks [EB/OL]. (2018-07-24) [2019-09-20]. https://arxiv.org/pdf/1709.01507.pdf
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2261-2269 [DOI: 10.1109/CVPR.2017.243]
Li X, Wang W H, Hu X L and Yang J. 2019. Selective kernel networks [EB/OL]. (2019-03-15) [2019-09-20]. https://arxiv.org/pdf/1903.06586.pdf
Lin Y T, Zheng L, Zheng Z D, Wu Y, Hu Z L, Yan C G and Yang Y. 2019. Improving person re-identification by attribute and identity learning. Pattern Recognition, 95: 151-161 [DOI: 10.1016/j.patcog.2019.06.006]
Luo H, Gu Y Z, Liao X Y, Lai S Q and Jiang W. 2019. Bag of tricks and a strong baseline for deep person re-identification [EB/OL]. (2019-03-17) [2019-09-20]. https://arxiv.org/pdf/1903.07071.pdf
Park J, Woo S, Lee J Y and Kweon I S. 2018. BAM: bottleneck attention module [EB/OL]. (2018-07-17) [2019-09-20]. https://arxiv.org/pdf/1807.06514.pdf
Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D and Batra D. 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 618-626 [DOI: 10.1109/ICCV.2017.74]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. (2014-09-04) [2019-09-20]. https://arxiv.org/pdf/1409.1556.pdf
Sun K, Xiao B, Liu D and Wang J D. 2019. Deep high-resolution representation learning for human pose estimation [EB/OL]. (2019-02-25) [2019-09-20]. https://arxiv.org/pdf/1902.09212.pdf
Sun Y F, Zheng L, Yang Y, Tian Q and Wang S J. 2018. Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline)//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 501-518 [DOI: 10.1007/978-3-030-01225-0_30]
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Tian M Q, Yi S, Li H S, Li S H, Zhang X S, Shi J P, Yan J J and Wang X G. 2018. Eliminating background-bias for robust person re-identification//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5794-5803 [DOI: 10.1109/CVPR.2018.00607]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need [EB/OL]. (2017-06-12) [2019-09-20]. https://arxiv.org/pdf/1706.03762.pdf
Wang G S, Yuan Y F, Chen X, Li J W and Zhou X. 2018. Learning discriminative features with multiple granularities for person re-identification//Proceedings of the 26th ACM International Conference on Multimedia. New York, USA: ACM: 274-282 [DOI: 10.1145/3240508.3240552]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention [EB/OL]. (2015-02-10) [2019-09-20]. https://arxiv.org/pdf/1502.03044.pdf
Zhang W, Hu S N, Liu K and Zha Z J. 2019. Learning compact appearance representation for video-based person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 29(8): 2442-2452 [DOI: 10.1109/TCSVT.2018.2865749]
Zheng F, Deng C, Sun X, Jiang X Y, Guo X W, Yu Z Q, Huang F Y and Ji R R. 2018. Pyramidal person re-identification via multi-loss dynamic training [EB/OL]. (2018-10-29) [2019-09-20]. https://arxiv.org/pdf/1810.12193.pdf
Zhong Z, Zheng L, Cao D L and Li S Z. 2017. Re-ranking person re-identification with k-reciprocal encoding//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3652-3661 [DOI: 10.1109/CVPR.2017.389]