Person re-identification based on deformation and occlusion mechanisms
2020, Vol. 25, No. 12: 2530-2540
Print publication date: 2020-12-16
Accepted: 2020-05-04
DOI: 10.11834/jig.200016
Weidong Shi, Yunzhou Zhang, Shuangwei Liu, Shangdong Zhu, Jining Bao. Person re-identification based on deformation and occlusion mechanisms[J]. Journal of Image and Graphics, 2020,25(12):2530-2540.
Objective
Pose variation and occlusion cause substantial appearance changes in pedestrians, posing a major challenge for person re-identification. To address these problems, this paper proposes a person re-identification algorithm that integrates deformation and occlusion mechanisms.
Method
To simulate pedestrian pose variation, two offsets (one horizontal, one vertical) are learned by convolution for every position of the feature map output by the backbone network; subsequent convolution operations extract deformed features by taking these per-position offsets into account, improving the network's robustness to pose changes. To handle occlusion, occluded pedestrian samples are simulated by erasing the feature regions corresponding to the high responses of the spatial attention map and keeping only the low-response regions, which further improves the network's ability to cope with occluded samples. In the testing stage, the features extracted by the two methods are concatenated with the backbone features to guarantee a robust feature descriptor.
Result
The proposed method is evaluated on three public large-scale person re-identification datasets: Market-1501, DukeMTMC-reID, and CUHK03 (detected and labeled). Rank-1 accuracy reaches 89.52%, 81.96%, 48.79%, and 50.29%, and mean average precision (mAP) reaches 73.98%, 64.45%, 43.77%, and 45.58%, respectively.
Conclusion
The proposed person re-identification algorithm, which integrates deformation and occlusion mechanisms, learns a more discriminative re-identification model and thus extracts more distinctive pedestrian features. In complex scenes in particular, it maintains high recognition accuracy when pedestrian pose changes and occlusion occur.
Objective
Person re-identification (re-ID) identifies a target person from a collection of images captured by networked cameras and is of great value in person retrieval and tracking. Owing to its important applications in public security and surveillance, person re-ID has attracted wide attention from academia and industry at home and abroad. Although most existing re-ID methods have achieved significant progress, person re-ID still faces two challenges resulting from viewpoint changes across surveillance cameras. First, pedestrians exhibit a wide range of pose variations. Second, people in public spaces are often occluded by various obstructions, such as bicycles or other people. These problems cause significant appearance changes and may introduce distracting information. As a result, the same pedestrian captured by different cameras may look drastically different, which hinders re-ID. One simple, effective way to address this problem is to obtain additional pedestrian samples: abundant practical scene images can be used to generate more pose-variant and occluded samples, helping re-ID systems achieve robustness in complex situations. Some researchers have taken both the image and a keypoint-based pose representation as inputs to generate target poses and views with generative adversarial networks (GANs). However, GANs usually suffer from convergence problems, and the generated target images often have poor texture. In random erasing, a rectangular region is randomly selected from an image or feature map, and its original pixel values are discarded to generate occluded examples. However, this approach only creates hard examples by spatially blocking the original image and, like the methods mentioned above, is very time consuming. To address these problems, we propose a person re-ID algorithm that generates hard deformation and occlusion samples.
Method
We use a deformable convolution module to simulate variations in pedestrian posture. The 2D offsets of the regular grid sampling locations on the last feature map of the ResNet50 backbone are predicted by a separate branch composed of multiple convolutional layers. These 2D offsets contain horizontal and vertical components, X and Y. The offsets are then applied to the feature maps to produce new, deformed features via resampling. In this way, the network can shift pedestrian posture in both horizontal and vertical directions and generate deformable features, improving its ability to deal with deformed images. To address the occlusion problem, we generate spatial attention maps with a spatial attention mechanism: additional convolutional operations on the last feature map of the ResNet50 backbone produce a spatial attention map that highlights the important spatial locations. We then mask out the most discriminative regions of the spatial attention map with a fixed threshold and retain only the low responses. The processed attention map is multiplied by the original features to produce occluded features. In this way, we simulate occluded pedestrian samples and further improve the network's ability to adapt to occlusion. At test time, we concatenate the two resulting features with the original features as the final descriptor. We implement and train our network with PyTorch on an NVIDIA TITAN GPU. The batch size is set to 32, and all images are rescaled to a fixed size of 256×128 pixels during training and testing. We use stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.000 5 to update the network parameters. The initial learning rate is 0.04 and is divided by 10 after 40 epochs (training lasts 60 epochs). The reduction ratio and erasing threshold are fixed to 16 and 0.7, respectively, on all datasets. We adopt random flipping as our data augmentation technique, and the ResNet50 backbone is pre-trained on the ImageNet dataset; the model is trained end to end. We adopt the cumulative match characteristic (CMC) and mean average precision (mAP) to compare the re-ID performance of the proposed method with that of existing methods.
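The two core operations of the method, offset-based resampling of a feature map and erasing high-response attention regions, can be sketched as follows. This is a minimal NumPy illustration of the ideas, not the authors' PyTorch implementation: the function names are hypothetical, and in the paper the offsets and attention map come from learned convolutional branches rather than being passed in directly (the erasing threshold of 0.7 matches the paper's setting).

```python
import numpy as np

def resample_with_offsets(feat, offsets):
    """Resample a feature map at positions shifted by per-location 2D offsets.

    feat:    (C, H, W) feature map.
    offsets: (2, H, W) horizontal (offsets[0]) and vertical (offsets[1])
             shifts in pixels for every spatial position.
    Returns a deformed feature map of the same shape, sampled with
    bilinear interpolation at the shifted locations.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(xs + offsets[0], 0, W - 1)  # shifted x coordinates
    sy = np.clip(ys + offsets[1], 0, H - 1)  # shifted y coordinates
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0                # bilinear weights
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0]
            + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0]
            + wy * wx * feat[:, y1, x1])

def erase_high_attention(feat, attn, threshold=0.7):
    """Zero out feature regions where the spatial attention is high.

    feat: (C, H, W) feature map; attn: (H, W) attention map in [0, 1].
    Erasing the most discriminative (high-response) regions and keeping
    only low responses simulates an occluded pedestrian sample.
    """
    keep = (attn < threshold).astype(feat.dtype)  # 1 where response is low
    return feat * keep[None, :, :]
```

With zero offsets, `resample_with_offsets` returns the input unchanged; non-zero offsets bend the sampling grid, which is what lets the network imitate pose changes in the horizontal and vertical directions.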
Result
The performance of the proposed method is evaluated on the public large-scale datasets Market-1501, DukeMTMC-reID, and CUHK03. A uniform random seed is used to ensure a fair comparison and reproducible results. On Market-1501, DukeMTMC-reID, and CUHK03 (detected and labeled), the proposed method obtains Rank-1 accuracy (the proportion of queries whose top-ranked gallery match is correct) of 89.52%, 81.96%, 48.79%, and 50.29%, respectively, while its mAP values on these datasets reach 73.98%, 64.45%, 43.77%, and 45.57%, respectively. On the detected and labeled CUHK03 datasets, the proposed method improves Rank-1 and mAP by 9.43%/8.74% and 8.72%/8.0%, respectively. These experimental results validate the competitive performance of the method on both small and large datasets.
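For reference, Rank-1 (the first point of the CMC curve) and mAP can be computed from ranked retrieval results as sketched below. This is a generic single-query sketch of the standard re-ID metrics, not the paper's evaluation code, and it omits dataset-specific details such as junk-image filtering.

```python
import numpy as np

def rank1_and_map(ranked_matches):
    """Compute Rank-1 accuracy and mAP from per-query match lists.

    ranked_matches: list of 1D binary sequences; entry q marks, for query q,
    whether each gallery image (sorted by descending similarity) shares the
    query identity (1) or not (0).
    """
    rank1_hits, aps = [], []
    for good in ranked_matches:
        good = np.asarray(good, dtype=float)
        rank1_hits.append(good[0])                     # top-1 correct?
        hits = np.cumsum(good)
        precision = hits / (np.arange(len(good)) + 1)  # precision@k
        # average precision: mean of precision at each correct match
        aps.append((precision * good).sum() / max(good.sum(), 1))
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```

For example, with two queries whose ranked matches are `[1, 0, 1, 0]` and `[0, 1, 0, 0]`, Rank-1 is 0.5 and mAP is (5/6 + 1/2) / 2 = 2/3.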
Conclusion
The proposed person re-ID method based on deformation and occlusion mechanisms builds a highly discriminative model for extracting robust pedestrian features. It maintains high recognition accuracy in complex application scenarios with occlusion and wide variations in pedestrian posture. The method also effectively mitigates model overfitting on small-scale datasets (e.g., CUHK03), thereby improving the recognition rate.
Keywords: person re-identification; deformation; occlusion; spatial attention mechanism; robustness