Video person re-identification combining BiLSTM and an attention mechanism

Yu Chenyang, Wen Linfeng, Yang Gang, Wang Yutao (College of Information Science and Engineering, Northeastern University)

Abstract
Objective Cross-camera, cross-scene video person re-identification is an important task in computer vision. In real-world scenes, illumination changes, occlusion, viewpoint changes, and cluttered backgrounds cause drastic variations in pedestrian appearance, which increase the difficulty of re-identification. To improve the robustness of video person re-identification systems in complex application scenarios, this paper proposes a video person re-identification algorithm that combines a bi-directional long short-term memory network (BiLSTM) with an attention mechanism. Method First, a convolutional neural network (CNN) based on a residual architecture is trained to learn spatial appearance features; next, a BiLSTM extracts bi-directional temporal motion information; finally, an attention mechanism fuses the learned spatial appearance features and temporal motion information into a discriminative video-level representation. Result The proposed method is compared experimentally with existing methods on two public large-scale datasets. On iLIDS-VID, Rank-1 improves by 4.5% over the second-best method; on PRID2011, Rank-1 improves by 3.9% over the second-best method. Ablation studies on both datasets further verify the effectiveness of the proposed algorithm. Conclusion The proposed algorithm, which combines BiLSTM and an attention mechanism, makes full use of the information in video sequences and learns more robust sequence features. Experimental results show that it significantly improves recognition performance on different datasets.
Keywords
Video person re-identification based on BiLSTM and attention mechanism

Yu Chenyang,Wen Linfeng,Yang Gang,Wang Yutao(College of Information Science and Engineering,Northeastern University)

Abstract
Objective Video person re-identification (re-ID) has attracted much attention due to rapidly growing surveillance camera networks and the increasing demand for public safety. In recent years, person re-identification has become one of the core problems in intelligent surveillance and multimedia applications. It aims to match image sequences of pedestrians across non-overlapping cameras distributed at different physical locations. In other words, given a tracklet taken from one camera, re-ID is the process of matching the person against tracklets of interest in another view. In practice, video re-ID faces several challenges. Because video acquisition is much less constrained, the image quality of video frames tends to be rather low, and pedestrians exhibit a large range of pose variations. Moreover, the pedestrians in videos are usually moving, resulting in serious out-of-focus blur and scale variations, and the same person may look quite different in different videos. In short, when people move between cameras, the large appearance changes caused by environmental and geometric variations increase the difficulty of the re-ID task. Many works have been proposed to deal with these issues. A typical video-based person re-ID system first extracts frame-wise features with deep CNNs. The extracted features are then fed into recurrent neural networks (RNNs) to capture temporal structure information. Finally, average or max temporal pooling is applied to the RNN outputs to aggregate the features. However, average pooling considers only the generic features of a pedestrian sequence and neglects the specific features of individual samples, while max pooling concentrates on finding locally salient features and may discard much useful information.
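The pooling trade-off described above can be illustrated with a minimal NumPy sketch (the frame features are toy values standing in for real CNN/RNN outputs):

```python
import numpy as np

# Hypothetical frame-level features for a 4-frame tracklet (feature dim 3).
frame_feats = np.array([
    [0.2, 0.9, 0.1],
    [0.3, 0.8, 0.0],
    [0.1, 0.7, 0.9],  # a locally salient (possibly noisy) frame
    [0.2, 0.8, 0.1],
])

# Average pooling keeps only the generic appearance, diluting frame-specific cues.
avg_pooled = frame_feats.mean(axis=0)

# Max pooling keeps per-dimension salient responses but ignores the rest of each frame.
max_pooled = frame_feats.max(axis=0)

print(avg_pooled.shape, max_pooled.shape)  # both (3,)
```

Note how the salient third frame dominates two dimensions of the max-pooled vector while being averaged away in the mean-pooled one; the attention mechanism proposed below replaces both with learned per-frame weights.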
To address this, a video person re-ID algorithm based on a bi-directional LSTM (BiLSTM) and an attention mechanism is proposed, which aims to make full use of temporal information and to improve the robustness of person re-ID systems in complex surveillance scenes. Method First, the proposed algorithm breaks each long input video sequence into short snippets and randomly selects a constant number of frames per snippet. The snippets are then fed into a pre-trained CNN to extract a feature representation of each frame, allowing the network to learn a spatial appearance representation. Next, a sequence representation containing temporal motion information is computed by the BiLSTM along the temporal dimension. The BiLSTM lets information flow both forward and backward in a flexible manner, so the underlying temporal interactions can be fully exploited. After feature extraction, the frame-level and sequence-level features of the probe and gallery videos are fed independently into a dot-product attention network. Finally, after computing the correlation (the attention weight) between the sequence and its frames, the output sequence representation is reconstructed as a weighted sum of the frames at different spatial and temporal positions in the input sequence. Thanks to the attention mechanism, the network can alleviate sample noise and poor alignment in videos. Our network is implemented on the PyTorch platform and trained on an NVIDIA GTX 1080 GPU. All training and testing images are rescaled to a fixed size of 256×128. ResNet-50 pretrained on ImageNet serves as the backbone network. For parameter training, we adopt stochastic gradient descent (SGD) with a momentum of 0.9. The learning rate is initially set to 0.001 and divided by 10 every 20 epochs.
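The dot-product attention aggregation described above can be sketched as follows (a simplified NumPy version; the feature dimensions and the use of the mean as a stand-in for the BiLSTM sequence feature are illustrative assumptions, not the paper's exact layers):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(seq_feat, frame_feats):
    """Reweight frame features by their dot-product similarity to the
    sequence-level feature, then sum them into one video descriptor."""
    scores = frame_feats @ seq_feat        # one correlation score per frame
    weights = softmax(scores)              # attention weights, sum to 1
    return weights @ frame_feats, weights  # weighted sum over frames

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 128))   # 8 frames, 128-d frame-level features
seq = frames.mean(axis=0)            # stand-in for the BiLSTM sequence feature
video_feat, w = attend(seq, frames)

print(video_feat.shape, round(w.sum(), 6))  # (128,) 1.0
```

Frames that correlate poorly with the sequence feature (e.g. occluded or misaligned ones) receive small weights, which is how the mechanism suppresses sample noise.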
The batch size is set to 8 and training lasts 40 epochs in total. The whole network is trained end-to-end in a joint identification and verification manner. During testing, the query and gallery videos are encoded into feature vectors by the system described above. To compare the re-identification performance of the proposed method with existing advanced methods, we adopt Cumulative Matching Characteristics (CMC) at rank-1, rank-5, rank-10, and rank-20 on all datasets. Result The proposed network is evaluated on two public benchmark datasets, iLIDS-VID and PRID2011. For iLIDS-VID, the 600 video sequences of 300 persons are randomly split, with 50% of the persons used for training and 50% for testing. For PRID2011, we follow the experimental setup of previous methods and use only the 400 video sequences of the first 200 persons, who appear in both cameras. The experiments on both datasets are repeated 10 times with different train/test splits, and the results are averaged to ensure a stable evaluation. The rank-1 accuracies (the proportion of queries whose true match is ranked first) on the two datasets are 80.5% and 87.6%, respectively. On iLIDS-VID, rank-1 improves by 4.5% over the second-best method; on PRID2011, rank-1 improves by 3.9% over the second-best method. Extensive ablation studies verify the effectiveness of the BiLSTM and the attention mechanism: compared with using only an LSTM, rank-1 (higher is better) increases by 10.9% and 12.7% on iLIDS-VID and PRID2011, respectively. Conclusion A video person re-ID method based on a bi-directional LSTM (BiLSTM) and an attention mechanism is proposed in this study. Experimental results on benchmark datasets show that the proposed algorithm can effectively learn spatio-temporal features relevant to the re-ID task.
Furthermore, the proposed BiLSTM allows temporal information to propagate not only forward but also in the reverse direction, and the attention mechanism adaptively selects discriminative information from the sequentially varying features. The proposed network significantly improves the recognition rate and has practical application value. The experimental results also show that the proposed method improves the robustness of video person re-ID systems in complex scenes and outperforms several state-of-the-art approaches.
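The CMC evaluation protocol used above can be sketched in a few lines (a simplified single-gallery-shot version with toy distances and identities; real protocols average over repeated splits as described in the abstract):

```python
import numpy as np

def cmc(dist, query_ids, gallery_ids, ranks=(1, 5, 10, 20)):
    """Cumulative Matching Characteristics: for each query, find the rank of
    its true match in the distance-sorted gallery, then report the fraction
    of queries matched within each rank cutoff."""
    hits = np.zeros(len(gallery_ids))
    for d, qid in zip(dist, query_ids):
        order = np.argsort(d)                          # gallery sorted by distance
        match_rank = np.where(gallery_ids[order] == qid)[0][0]
        hits[match_rank:] += 1                         # a hit at this rank and beyond
    curve = hits / len(query_ids)
    return {r: curve[r - 1] for r in ranks if r <= len(gallery_ids)}

# Toy example: 3 queries against a 5-identity gallery.
gallery_ids = np.arange(5)
query_ids = np.array([0, 1, 2])
dist = np.array([
    [0.1, 0.9, 0.8, 0.7, 0.6],   # query 0: true match ranked 1st
    [0.5, 0.2, 0.9, 0.8, 0.7],   # query 1: true match ranked 1st
    [0.3, 0.4, 0.5, 0.1, 0.2],   # query 2: true match ranked 5th
])
result = cmc(dist, query_ids, gallery_ids, ranks=(1, 5))
print(result)  # rank-1 ≈ 0.667, rank-5 = 1.0
```

By construction the CMC curve is non-decreasing in the rank cutoff, which is why reporting rank-1 through rank-20 summarizes how quickly the correct match is retrieved.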
Keywords