Video person re-identification based on BiLSTM and attention mechanism
2019, Vol. 24, No. 10, pp. 1703-1710
Received: 2019-03-06; Revised: 2019-05-02; Accepted: 2019-05-09; Published in print: 2019-10-16
DOI: 10.11834/jig.190056
Objective
Cross-camera, cross-scene video person re-identification is an important task in computer vision. In real-world scenes, illumination changes, occlusion, viewpoint changes, and cluttered backgrounds cause drastic variations in pedestrian appearance and make re-identification more difficult. To improve the robustness of video person re-identification systems in complex application scenarios, this paper proposes a video person re-identification algorithm that combines a bidirectional long short-term memory network (BiLSTM) with an attention mechanism.
Method
First, a convolutional neural network (CNN) built on a residual architecture is trained to learn spatial appearance features. A BiLSTM then extracts bidirectional temporal motion information. Finally, an attention mechanism fuses the learned spatial appearance features and temporal motion information into a discriminative video-level representation.
Result
The proposed method is compared experimentally with existing methods on two public large-scale datasets. On iLIDS-VID, the Rank-1 accuracy is 4.5% higher than that of the second-best method; on PRID2011, it is 3.9% higher. Ablation experiments on both datasets further verify the effectiveness of the proposed algorithm.
Conclusion
The proposed video person re-identification algorithm, which combines a BiLSTM with an attention mechanism, makes full use of the information in video sequences and learns more robust sequence features. Experimental results show that it significantly improves recognition performance on both datasets.
Objective
Video person re-identification (re-ID) has attracted much attention due to rapidly growing surveillance camera networks and the increasing demand for public safety. In recent years, the person re-identification task has become one of the core problems in intelligent surveillance and multimedia applications. The task aims to match image sequences of pedestrians across non-overlapping cameras distributed at different physical locations. Given a tracklet taken from one camera, re-ID is the process of matching the person against tracklets of interest in another view. In practice, video re-ID faces several challenges. The image quality of video frames tends to be rather low, and pedestrians exhibit a large range of pose variations because video acquisition is less constrained. Pedestrians in videos are usually moving, resulting in serious out-of-focus blur and scale variations. Moreover, the same person in different videos may look different: when people move between cameras, the large appearance changes caused by environmental and geometric variations increase the difficulty of the re-ID task. Many works have been proposed to deal with these issues. A typical video-based person re-ID system first extracts frame-wise features with a deep convolutional neural network (CNN). The extracted features are fed into recurrent neural networks (RNNs) to capture temporal structure, and an average or maximum temporal pooling operation is then applied to the RNN outputs to aggregate the features. However, average pooling considers only the generic features of a pedestrian sequence and neglects the specific features of individual samples, while maximum pooling concentrates on finding local salient features and may discard useful information. In this context, a video person re-ID algorithm based on bidirectional long short-term memory (BiLSTM) and an attention mechanism is proposed to make full use of temporal information and improve the robustness of person re-ID systems in complex surveillance scenes.
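To make the two criticized baselines concrete, here is a minimal sketch of average and maximum temporal pooling over per-frame features; the tensor names and shapes are illustrative, not taken from any released code:

```python
import torch

# feats: per-frame features of one tracklet, shape (T, D), e.g. the
# outputs of a CNN (optionally followed by an RNN) for T frames.
feats = torch.randn(16, 2048)

# Average pooling keeps only the generic sequence-level statistics;
# the specific cues of individual frames are smoothed away.
avg_repr = feats.mean(dim=0)         # (D,)

# Maximum pooling keeps only the locally most salient responses;
# other potentially useful information in the sequence is discarded.
max_repr = feats.max(dim=0).values   # (D,)
```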
Method
From the input video sequence, the proposed algorithm breaks the long sequence into short snippets and randomly selects a constant number of frames from each snippet. The snippets are fed into a pre-trained CNN to extract a feature representation of each frame, so the network learns a spatial appearance representation. A sequence representation is then computed by a BiLSTM along the temporal dimension and therefore encodes temporal motion information. The BiLSTM lets information flow forward and backward in time in a flexible manner, allowing the underlying temporal interactions to be fully exploited. After feature extraction, the frame-level and sequence-level features from the probe and gallery videos are fed independently into a dot-product attention network. After the correlation (the attention weight) between the sequence and each of its frames is calculated, the output sequence representation is reconstructed as a weighted sum of the frames at different spatial and temporal positions in the input sequence. With this attention mechanism, the network can alleviate sample noise and poor alignment in videos.
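A minimal sketch of this aggregation step, assuming dot-product attention between a BiLSTM-derived sequence summary and the per-frame features; the module layout, hidden size, and the mean-based sequence summary are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMAttnAggregator(nn.Module):
    """Aggregate per-frame CNN features into one sequence vector."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Bidirectional LSTM: temporal cues flow both forward and backward.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Project frame features to the BiLSTM output dimension so the
        # dot product between them is well defined.
        self.proj = nn.Linear(feat_dim, 2 * hidden_dim)

    def forward(self, frames):            # frames: (B, T, feat_dim)
        out, _ = self.bilstm(frames)      # (B, T, 2*hidden_dim)
        seq = out.mean(dim=1)             # sequence-level summary, (B, 2H)
        keys = self.proj(frames)          # frame-level keys, (B, T, 2H)
        # Dot-product attention: the correlation between the sequence
        # summary and each frame gives the attention weights.
        scores = torch.bmm(keys, seq.unsqueeze(2)).squeeze(2)   # (B, T)
        weights = F.softmax(scores, dim=1)
        # Reconstruct the sequence representation as a weighted sum.
        return (weights.unsqueeze(2) * out).sum(dim=1)          # (B, 2H)
```

Here the attention weights measure how well each frame agrees with the sequence-level summary, so noisy or poorly aligned frames receive small weights in the final representation.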
Our network is implemented on the PyTorch platform and trained on an NVIDIA GTX 1080 GPU. All training and testing images are rescaled to a fixed size of 256×128 pixels. A ResNet-50 pretrained on ImageNet serves as the backbone of our system. For network parameter training, we adopt stochastic gradient descent (SGD) with a momentum of 0.9. The learning rate is initially set to 0.001 and divided by 10 every 20 epochs. The batch size is set to 8, and the total training process lasts 40 epochs. The whole network is trained end-to-end in a joint identification and verification manner.
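The stated schedule maps directly onto standard PyTorch utilities; a sketch under the assumption that `model` stands in for the full ResNet-50 + BiLSTM + attention network:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 300)  # stand-in for the full re-ID network (assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Divide the learning rate by 10 after every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(40):  # the total training process lasts 40 epochs
    # ... one pass over the training loader (batch size 8), computing the
    # joint identification + verification loss and updating the weights ...
    scheduler.step()
```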
During testing, the query and gallery videos are encoded into feature vectors by the system described above. To compare the re-identification performance of the proposed method with existing state-of-the-art methods, we adopt the cumulative matching characteristic (CMC) at rank-1, rank-5, rank-10, and rank-20 on all datasets.
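For reference, CMC at these ranks can be computed from a query-gallery distance matrix as follows; this is a generic sketch for the single-shot setting used by iLIDS-VID and PRID2011 (one gallery tracklet per identity), not the authors' evaluation code:

```python
import numpy as np

def cmc_scores(dist, query_ids, gallery_ids, ranks=(1, 5, 10, 20)):
    """CMC for the single-shot setting: one gallery tracklet per identity."""
    order = np.argsort(dist, axis=1)                    # gallery sorted per query
    matches = gallery_ids[order] == query_ids[:, None]  # (num_query, num_gallery)
    first_hit = matches.argmax(axis=1)                  # rank of the true match
    return {r: float(np.mean(first_hit < r)) for r in ranks}
```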
Result
The proposed network is evaluated on two public benchmark datasets, iLIDS-VID and PRID2011. For iLIDS-VID, the 600 video sequences of 300 persons are randomly split, with 50% of the persons used for training and 50% for testing. For PRID2011, we follow the experimental setup of previous methods and use only the 400 video sequences of the first 200 persons, who appear in both cameras. The experiments on these two datasets are repeated 10 times with different train/test splits, and the results are averaged to ensure a stable evaluation. The Rank-1 accuracy (the proportion of queries whose correct match is ranked first) is 80.5% on iLIDS-VID and 87.6% on PRID2011. On iLIDS-VID, Rank-1 is 4.5% higher than that of the second-best method; on PRID2011, it is 3.9% higher. Extensive ablation studies verify the effectiveness of the BiLSTM and the attention mechanism: compared with a variant that uses only a unidirectional LSTM, Rank-1 (higher is better) increases by 10.9% on iLIDS-VID and 12.7% on PRID2011.
Conclusion
This work proposes a video person re-ID method based on BiLSTM and an attention mechanism. The proposed algorithm effectively learns spatio-temporal features relevant to the re-ID task. Furthermore, the BiLSTM allows temporal information to propagate not only from front to back but also in the reverse direction, and the attention mechanism adaptively selects discriminative information from the sequentially varying features. The proposed network significantly improves the recognition rate and has practical application value: it improves the robustness of video person re-ID systems in complex scenes and outperforms several state-of-the-art approaches.