Published: 2019-10-16
DOI: 10.11834/jig.190056
2019 | Volume 24 | Number 10

Video person reidentification based on BiLSTM and attention mechanism
Yu Chenyang, Wen Linfeng, Yang Gang, Wang Yutao
College of Information Science and Engineering, Northeastern University, Shenyang 110819, China

Abstract

Objective Video person re-identification across cameras and scenes is an important task in computer vision. In real-world scenarios, illumination changes, occlusion, viewpoint changes, and cluttered backgrounds cause drastic variations in pedestrian appearance and make re-identification difficult. To improve the robustness of video person re-identification systems in complex application scenarios, this paper proposes a video person re-identification algorithm that combines a bidirectional long short-term memory network (BiLSTM) with an attention mechanism. Method First, a convolutional neural network (CNN) based on a residual architecture is trained to learn spatial appearance features; a BiLSTM then extracts bidirectional temporal motion information; finally, an attention mechanism fuses the learned spatial appearance features and temporal motion information into a discriminative video-level representation. Result The method is compared with existing approaches on two public large-scale datasets. On iLIDS-VID, the Rank1 matching rate is 4.5% higher than that of the second-best method; on PRID2011, it is 3.9% higher than that of the second-best method. Ablation experiments on both datasets further verify the effectiveness of the proposed algorithm. Conclusion The proposed algorithm combining BiLSTM and an attention mechanism makes full use of the information in video sequences and learns more robust sequence features. Experimental results show that it significantly improves recognition performance on different datasets.

Keywords

computer vision; person re-identification; convolutional neural network; bidirectional long short-term memory (BiLSTM); attention mechanism

Video person reidentification based on BiLSTM and attention mechanism
Yu Chenyang, Wen Linfeng, Yang Gang, Wang Yutao
College of Information Science and Engineering, Northeastern University, Shenyang 110819, China

Abstract

Objective Video person reidentification (re-ID) has attracted much attention owing to rapidly growing surveillance camera networks and the increasing demand for public safety. In recent years, person re-identification has become one of the core problems in intelligent surveillance and multimedia applications. The task aims to match image sequences of pedestrians across non-overlapping cameras distributed at different physical locations. Given a tracklet taken from one camera, re-ID is the process of matching the person against tracklets of interest in another view. In practice, video re-ID faces several challenges. The image quality of video frames tends to be rather low, and pedestrians exhibit a large range of pose variations because video acquisition is less constrained. Pedestrians in videos are usually moving, resulting in serious out-of-focus blur and scale variations. Moreover, the same person may look different in different videos. When people move between cameras, the large appearance changes caused by environmental and geometric variations increase the difficulty of the re-ID task. Many approaches have been proposed to deal with these issues. A typical video-based person re-ID system first extracts frame-wise features with a deep convolutional neural network (CNN). The extracted features are then fed into recurrent neural networks (RNNs) to capture temporal structure information. Finally, average or maximum temporal pooling is applied to the RNN outputs to aggregate the features. However, average pooling considers only the generic features of a pedestrian sequence and neglects the specific features of the samples within it, while maximum pooling concentrates on local salient features and may discard useful information. We therefore propose a video person re-ID algorithm based on bidirectional long short-term memory (BiLSTM) and an attention mechanism to make full use of temporal information and improve the robustness of person re-ID systems in complex surveillance scenes. Method The proposed algorithm breaks each long input video sequence into short snippets and randomly selects a constant number of frames from each snippet. The snippets are fed into a pretrained CNN to extract a feature representation of each frame, so the network learns a spatial appearance representation. A sequence representation containing temporal motion information is then computed by a BiLSTM along the temporal dimension. The BiLSTM lets information flow both forward and backward in a flexible manner, allowing the underlying temporal interactions to be fully exploited. After feature extraction, the frame-level and sequence-level features of the probe and gallery videos are fed independently into a dot-product attention network. After the correlation (attention weight) between a sequence and its frames is calculated, the output sequence representation is reconstructed as a weighted sum of the frames at different spatial and temporal positions in the input sequence. With this attention mechanism, the network can alleviate sample noise and poor alignment in videos. Our network is implemented on the PyTorch platform and trained on an NVIDIA GTX 1080 GPU. All training and testing images are rescaled to a fixed size of 256×128 pixels. ResNet-50 pretrained on ImageNet is used as the backbone network of our system.
For network training, we adopt stochastic gradient descent (SGD) with a momentum of 0.9. The learning rate is initially set to 0.001 and divided by 10 every 20 epochs. The batch size is set to 8, and training lasts for 40 epochs in total. The whole network is trained end-to-end in a joint identification and verification manner. During testing, the query and gallery videos are encoded into feature vectors by the system described above. To compare the re-identification performance of the proposed method with existing advanced methods, we report the cumulative matching characteristics (CMC) at rank-1, rank-5, rank-10, and rank-20 on all datasets. Result The proposed network is evaluated on two public benchmark datasets, iLIDS-VID and PRID2011. For iLIDS-VID, the 600 video sequences of 300 persons are randomly split into 50% of persons for training and 50% for testing. For PRID2011, we follow the experimental setup of previous methods and use only the 400 video sequences of the first 200 persons, who appear in both cameras. The experiments on both datasets are repeated 10 times with different train/test splits, and the results are averaged to ensure a stable evaluation. The Rank-1 accuracy (the proportion of queries whose correct match is ranked first) reaches 80.5% on iLIDS-VID and 87.6% on PRID2011. On iLIDS-VID, Rank-1 is 4.5% higher than that of the second-best method; on PRID2011, it is 3.9% higher. Extensive ablation studies verify the effectiveness of the BiLSTM and the attention mechanism: compared with the variant that uses only an LSTM, Rank-1 (higher is better) increases by 12.7% on iLIDS-VID and 10.9% on PRID2011. Conclusion This work proposes a video person re-ID method based on BiLSTM and an attention mechanism. The proposed algorithm effectively learns spatio-temporal features relevant to the re-ID task. The BiLSTM allows temporal information to propagate not only forward but also backward, and the attention mechanism adaptively selects discriminative information from the sequentially varying features. The proposed network significantly improves the recognition rate and has practical application value. It improves the robustness of video person re-ID systems in complex scenes and outperforms several state-of-the-art approaches.

Key words

computer vision; person re-identification; convolutional neural network (CNN); bi-directional long short-term memory (BiLSTM); attention mechanism

0 Introduction

In recent years, with the growing emphasis on public safety and the development of video surveillance technology, more and more cameras have been deployed in crowded public places [1]. However, large-scale video surveillance systems produce massive amounts of data that are difficult to analyze and process quickly by manual effort alone, which has given rise to intelligent surveillance systems that complete monitoring tasks automatically with computer vision techniques [2]. Although face recognition technology is now relatively mature, effective face images often cannot be obtained in practical surveillance environments, so locating and retrieving pedestrians by whole-body information becomes very important. This has made person re-identification a research hotspot in computer vision that receives wide attention.

Person re-identification [3] aims to accurately recognize a pedestrian who has appeared in one camera when the same pedestrian appears again in other cameras. Because of changes in camera viewpoint, drastic variations in body pose, illumination, occlusion, and cluttered backgrounds [4], person re-identification algorithms still face great challenges. Current research is mainly divided into image-based and video-based person re-identification.

A single image contains limited information and is easily affected by illumination, occlusion, and other factors, whereas a continuous sequence of video frames contains rich temporal information and can better identify pedestrians under complex environments and drastic pose variations. Consequently, more and more work has focused on person re-identification from video sequences. The typical workflow of video-based person re-identification is shown in Fig. 1.

Fig. 1 Standard video person re-identification pipeline

McLaughlin et al. [5] first applied deep neural networks to the person re-identification task, using a recurrent neural network (RNN) to relate information across frames and obtaining good experimental results. Wu et al. [6] designed a similar architecture that jointly optimizes a convolutional neural network (CNN) and an RNN to extract a robust spatio-temporal feature representation for similarity measurement. Building on [5], Liu et al. [7] proposed the accumulative motion context network (AMOC), which extracts motion features from optical flow in addition to the features of the image sequence to improve the recognition rate. Zhang et al. [8] addressed the encoding of video sequences with a bidirectional recurrent neural network (BRNN) that learns forward and backward temporal cues simultaneously, effectively improving video person re-identification performance.

To better extract useful information from video sequences, recent work has also introduced attention mechanisms into video person re-identification. Xu et al. [9] proposed the jointly attentive spatial-temporal pooling network (ASTPN), which introduces a shared attention matrix for temporal modeling and enables information interaction between the probe and gallery sequences during frame selection. Liu et al. [10] designed a quality aware network (QAN) that estimates a quality score for each frame to weaken the influence of noisy samples. Zhou et al. [11] combined per-frame visual features with RNN hidden states to generate attention weights and further used spatial RNNs to integrate contextual information from six directions, enhancing the representation of each location in the feature map. Zhang et al. [12] designed a novel self-and-collaborative attention network that refines intra-sequence and inter-sequence feature representations with a non-parametric attention mechanism and outputs self-attended and collaboratively attended representations for each video sequence, enabling discriminative frame alignment between the probe and gallery sequences.

Although experimental results show that the CNN-RNN architecture achieves good performance, its adaptability and robustness in complex scenes and environments still need improvement. As the length of the input video sequence increases, RNNs suffer from vanishing or exploding gradients. Moreover, summarizing the temporal information of an entire sequence with average or maximum pooling does not fully reflect the characteristics of the samples and lacks the ability to discover discriminative frames in the sequence. Simply introducing an attention mechanism that scores the quality of each frame cannot discover the discriminative body parts within each frame.

To address these problems, this paper proposes a video person re-identification method that combines a bidirectional long short-term memory network (BiLSTM) with an attention mechanism. The BiLSTM replaces the ordinary recurrent network to better capture long-term spatio-temporal dependencies. Instead of simply encoding the video sequence by average pooling, a novel attention mechanism assigns attention scores over the whole spatial region and the temporal dimension to select frames and mine discriminative parts. The proposed algorithm is evaluated on two public large-scale person re-identification datasets, iLIDS-VID [13] and PRID2011 [14], and achieves high accuracy.

1 Method

The pipeline of the proposed video person re-identification algorithm combining BiLSTM and an attention mechanism is shown in Fig. 2. First, a pretrained convolutional neural network extracts a feature representation of each frame from the probe and gallery videos. A BiLSTM then computes the representation of the video sequence. The frame-level and sequence-level features of the probe and gallery videos are fed into an attention pooling network, which computes the correlation (attention) between a sequence and its frames; the output sequence feature is then reconstructed as a weighted sum of the frames at different temporal positions in the input sequence. Finally, the feature vectors of the probe and gallery sequences are fed into the loss functions to train the model.

Fig. 2 Pipeline of the proposed video person re-identification method

1.1 Spatio-temporal feature extraction

Deep convolutional neural networks have shown excellent performance across computer vision. In particular, the residual network ResNet [15] alleviates the performance degradation caused by increasing network depth, making networks easier to optimize and allowing accuracy gains from considerably increased depth; it has achieved excellent results in image classification, detection, and localization [16]. For sequence tasks, to improve model performance on sequence classification and overcome the limitations of conventional RNNs, Cornegruta et al. [17] proposed using a bidirectional recurrent neural network that, at a given time frame, exploits all available information from both the past and the future during training.

ResNet-50 is adopted as the image-level feature extractor for each frame. The number of input channels of its first convolutional layer is changed to 5 so that an RGB image and its optical flow map can be fed in together, and the final average pooling and fully connected layers are removed. Given a snippet $c$ containing $L$ frames, the feature vector produced by the CNN for the $t$-th frame is denoted $\boldsymbol{\varphi}_t(c)$, and the features of all $L$ frames form the set $\boldsymbol{\varPhi}(c) = \{\boldsymbol{\varphi}_t(c)\}_{t=1}^L$.
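A minimal PyTorch-style sketch of this frame-level extractor is given below. The torchvision layer names, the initialization of the two extra (optical-flow) input channels, and the final global average pooling to a 2 048-dimensional vector per frame are assumptions made for illustration; the paper does not specify these details.

import torch
import torch.nn as nn
from torchvision import models

class FrameFeatureExtractor(nn.Module):
    """ResNet-50 backbone with a 5-channel first convolution (RGB + optical flow),
    with the final average-pooling and fully connected layers removed."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)  # ImageNet pretraining, as in the paper
        old_conv = backbone.conv1
        # Replace the first convolution so that it accepts 5 input channels instead of 3.
        new_conv = nn.Conv2d(5, old_conv.out_channels, kernel_size=old_conv.kernel_size,
                             stride=old_conv.stride, padding=old_conv.padding, bias=False)
        with torch.no_grad():
            new_conv.weight[:, :3] = old_conv.weight                        # reuse the RGB filters
            new_conv.weight[:, 3:] = old_conv.weight.mean(1, keepdim=True)  # assumed init for flow channels
        backbone.conv1 = new_conv
        # Keep only the convolutional trunk (drop the final avgpool and fc layers).
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, frames):            # frames: (B*L, 5, 256, 128)
        fmap = self.trunk(frames)         # (B*L, 2048, 8, 4)
        return fmap.mean(dim=(2, 3))      # per-frame feature phi_t(c): (B*L, 2048)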

The key to video person re-identification is to aggregate a series of image-level features into a single snippet-level feature [18]. Unlike traditional methods that consider only the forward temporal information of the video sequence, a BiLSTM is used here to build bidirectional contextual information across video frames; its hidden state at each time step consists of two parts, as shown in Fig. 3.

Fig. 3 The BiLSTM layer adopted in the proposed network architecture

The BiLSTM captures the bidirectional context at the current time step and is defined as

$ \boldsymbol{b}_h(t) = E_{\rm LSTM}^{\rm b}\left( \boldsymbol{\varphi}_t(c), \boldsymbol{b}_h(t-1) \right) $ (1)

$ \boldsymbol{f}_h(t) = E_{\rm LSTM}^{\rm f}\left( \boldsymbol{\varphi}_t(c), \boldsymbol{f}_h(t-1) \right) $ (2)

$ \boldsymbol{o}(t) = \left\{ \boldsymbol{b}_h(t), \boldsymbol{f}_h(t) \right\} $ (3)

where $E_{\rm LSTM}^{\rm f}(\cdot)$ and $E_{\rm LSTM}^{\rm b}(\cdot)$ denote the forward and backward passes of the long short-term memory (LSTM), respectively; $t$ is the current time step; $\boldsymbol{f}_h(t-1)$ and $\boldsymbol{b}_h(t-1)$ are the memories of the BiLSTM, containing the information learned from the frames before and after the current time step, respectively; and $\boldsymbol{o}(t)$ is the final hidden state computed in the two directions. Global average pooling is finally applied to $\boldsymbol{o}(t)$ to obtain the sequence feature $\boldsymbol{Q}$, which contains bidirectional temporal information.
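The BiLSTM encoder of Eqs. (1)-(3) can be sketched as follows; the hidden size of 256 per direction (512 after concatenating the two directions, matching the stated 128×4 query dimension) is an assumption for illustration.

import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bidirectional LSTM over the per-frame CNN features, Eqs. (1)-(3),
    followed by global average pooling over time to obtain the sequence feature Q."""
    def __init__(self, in_dim=2048, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, frame_feats):       # frame_feats: (B, L, 2048)
        o, _ = self.bilstm(frame_feats)   # o(t) = {b_h(t), f_h(t)}: (B, L, 512)
        q = o.mean(dim=1)                 # global average pooling over time -> Q: (B, 512)
        return o, q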

1.2 Attention pooling

Within a video snippet the changes between adjacent frames are small, so the set of feature vectors contains redundant information. In addition, some frames become abnormal because of sudden occlusion or loss of the detected pedestrian. To make full use of the effective information in the feature vectors, an attention mechanism is needed to automatically select the discriminative information in the sequence features. Unlike the attention in [10], which only estimates a quality score for each frame, the attention adopted here is based on the dot-product attention used in machine translation, which generates the final sequence feature from a query and a set of key-value pairs [19]. Its structure is shown in Fig. 4.

Fig. 4 Illustration of the proposed attention mechanism

First, the output of the BiLSTM network is used as the query feature $\boldsymbol{Q}$. For the feature vector $\boldsymbol{\varphi}_t(c)$ of each frame $t$, a key projection and a value projection are built, and the projected results are passed through batch normalization layers to obtain the key feature $\boldsymbol{k}_t(c)$ and the value feature $\boldsymbol{v}_t(c)$. Attention pooling then proceeds in the following three steps (a code sketch of these steps is given after the list):

1) The dot product is used as the similarity function, and the similarity between the query and each key gives the weight

$ f\left( \boldsymbol{Q}, \boldsymbol{K}_t \right) = \boldsymbol{Q}^{\rm T} \boldsymbol{K}_t $ (4)

where $\boldsymbol{K}_t = \boldsymbol{k}_t(c)$ denotes the key feature of frame $t$.

2) The softmax function is applied along the temporal dimension to normalize the weights, which gives the attention weight of the $t$-th frame

$ \boldsymbol{a}_t = {\rm softmax}\left( f\left( \boldsymbol{Q}, \boldsymbol{K}_t \right) \right) = \frac{\exp\left( f\left( \boldsymbol{Q}, \boldsymbol{K}_t \right) \right)}{\sum\limits_{l=1}^{L} \exp\left( f\left( \boldsymbol{Q}, \boldsymbol{K}_l \right) \right)} $ (5)

3) The attention weights and the corresponding value features are combined in a weighted sum to obtain the final attended feature

$ \boldsymbol{\phi}(c) = f_{\rm att}\left( \boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} \right) = \sum\limits_{t=1}^{L} \boldsymbol{a}_t \circ \boldsymbol{V}_t $ (6)

where the symbol "$\circ$" denotes the Hadamard (element-wise) product, and $\boldsymbol{V}_t = \boldsymbol{v}_t(c)$ denotes the value feature of frame $t$.
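The sketch below illustrates these three steps. It is a simplification: the 128×4 query/key features are treated as flat 512-dimensional vectors, the attention weight of each frame is a scalar, and attention is applied only along the temporal dimension; these choices are assumptions of the sketch, not details given by the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Dot-product attention pooling, Eqs. (4)-(6): key/value projections with
    batch normalization, softmax over the L frames, and a weighted sum of the values."""
    def __init__(self, frame_dim=2048, qk_dim=512, v_dim=128):
        super().__init__()
        self.key_proj = nn.Sequential(nn.Linear(frame_dim, qk_dim), nn.BatchNorm1d(qk_dim))
        self.val_proj = nn.Sequential(nn.Linear(frame_dim, v_dim), nn.BatchNorm1d(v_dim))

    def forward(self, frame_feats, q):        # frame_feats: (B, L, 2048), q: (B, 512)
        B, L, D = frame_feats.shape
        flat = frame_feats.reshape(B * L, D)
        k = self.key_proj(flat).reshape(B, L, -1)   # key features K_t
        v = self.val_proj(flat).reshape(B, L, -1)   # value features V_t
        scores = torch.einsum('bd,bld->bl', q, k)   # f(Q, K_t) = Q^T K_t, Eq. (4)
        a = F.softmax(scores, dim=1)                # normalize over the frames, Eq. (5)
        return (a.unsqueeze(-1) * v).sum(dim=1)     # weighted sum -> phi(c): (B, 128), Eq. (6)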

During training, given a pair of video sequences $(p_n, g_m)$, a binary cross-entropy loss supervises the learning of snippet similarity estimation. The similarity estimate is defined as

$ s\left( p_n, g_m \right) = \sigma\left[ f\left( \left( \boldsymbol{\phi}(p_n) - \boldsymbol{\phi}(g_m) \right)^2 \right) \right] $ (7)

$ s^*\left( p_n, g_m \right) = \left\{ \begin{array}{ll} s\left( p_n, g_m \right) & n = m \\ 1 - s\left( p_n, g_m \right) & n \ne m \end{array} \right. $ (8)

where $\boldsymbol{\phi}(p_n)$ and $\boldsymbol{\phi}(g_m)$ denote the sequence features of snippets $p_n$ and $g_m$, $n$ and $m$ are person IDs, $f(\cdot)$ is a fully connected layer that maps the vector to a single value, and $\sigma[\cdot]$ is the sigmoid function that normalizes the similarity. The verification loss is then defined as

$ L_{\rm ver} = -\frac{1}{N_{\rm ver}} \sum\limits_{\left( p_n, g_m \right)} \log \left( s^*\left( p_n, g_m \right) \right) $ (9)

where $N_{\rm ver}$ denotes the number of sampled sequence pairs. In addition, a separate branch takes the per-frame CNN features as input, and the online instance matching (OIM) loss [20] supervises the prediction of person IDs:

$ L_{\rm id} = -\frac{1}{N} \sum\limits_{n=1}^{N} \sum\limits_{i=1}^{I} y^{i, n} \log \left( \frac{\exp\left( \boldsymbol{w}_i^{\rm T} \boldsymbol{x}_n \right)}{\sum\limits_{j=1}^{I} \exp\left( \boldsymbol{w}_j^{\rm T} \boldsymbol{x}_n \right)} \right) $ (10)

where $\boldsymbol{x}_n$ is the CNN feature of the $n$-th image; the training set contains $N$ images of $I$ persons, with $y^{i, n} = 1$ if the $n$-th image belongs to the $i$-th person and $y^{i, n} = 0$ otherwise; and $\boldsymbol{w}$ is the weight associated with the feature embedding $\boldsymbol{x}$. The final loss is

$ L_{\rm total} = L_{\rm ver} + L_{\rm id} $ (11)
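A sketch of the verification branch (Eqs. (7)-(9)) and of the total loss (Eq. (11)) is given below. The identification term L_id is the OIM loss of Xiao et al. [20], which maintains an external look-up table of identity features and is not re-implemented here.

import torch
import torch.nn as nn

class VerificationLoss(nn.Module):
    """Binary cross-entropy over the snippet-pair similarity of Eqs. (7)-(9)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)    # f(.) in Eq. (7): maps the vector to one logit

    def forward(self, phi_p, phi_g, same_id):
        # phi_p, phi_g: sequence features of the probe/gallery snippets, shape (B, feat_dim)
        # same_id: 1.0 where the two snippets share a person ID, else 0.0, shape (B,)
        diff = (phi_p - phi_g) ** 2                        # element-wise squared difference
        s = torch.sigmoid(self.fc(diff)).squeeze(1)        # similarity estimate, Eq. (7)
        s_star = torch.where(same_id > 0.5, s, 1.0 - s)    # Eq. (8)
        return -torch.log(s_star + 1e-8).mean()            # Eq. (9)

# Total loss, Eq. (11): L_total = L_ver + L_id, where L_id is computed by the OIM branch.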

2 Experiments

The experiments are implemented with the PyTorch deep learning framework on a personal computer with 32 GB of memory, an Intel(R) Core(TM) i7-4790K processor, and an NVIDIA GTX 1080 GPU with 8 GB of memory. The hyperparameters follow [21]. Stochastic gradient descent is used as the optimizer; the initial learning rate is 0.001 and is divided by 10 every 20 epochs; the batch size is 8; the number of frames $L$ per video snippet is 8; the feature dimension of the CNN output is 2 048, the dimension of the value feature is 128, and the dimensions of the key and query features are 128×4; each experiment is trained for 40 epochs.
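A training-loop sketch matching this setup is shown below. The momentum of 0.9 is taken from the English abstract; model, train_loader, and compute_total_loss are placeholders for the network, the snippet sampler (batch size 8, L = 8 frames), and the joint loss of Eq. (11).

import torch

def train(model, train_loader, compute_total_loss, num_epochs=40):
    """SGD with momentum 0.9; initial learning rate 0.001, divided by 10 every 20 epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    for epoch in range(num_epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_total_loss(model, batch)   # L_total = L_ver + L_id, Eq. (11)
            loss.backward()
            optimizer.step()
        scheduler.step()                              # decay the learning rate every 20 epochs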

The proposed algorithm is evaluated on the iLIDS-VID and PRID2011 datasets. For each test, the training and test sets are generated randomly, the experiment is repeated 10 times under the same conditions, and the average of the 10 runs is reported as the final result. The cumulative matching characteristic (CMC) curve and the Rank1 matching rate are used to evaluate performance. The CMC curve gives the probability that the queried person is found among the top $k$ results when searching the pedestrian gallery. The Rank1 matching rate ($k$ = 1) is an important indicator, representing the probability that the queried person appears as the first search result.
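A minimal sketch of this CMC evaluation is given below; it assumes the single-shot protocol of iLIDS-VID and PRID2011 (one gallery sequence per query identity) and Euclidean distance between sequence features.

import numpy as np

def cmc_curve(query_feats, gallery_feats, query_ids, gallery_ids, max_rank=20):
    """Cumulative matching characteristic: cmc[k-1] is the probability that the
    correct match appears within the top-k gallery results (cmc[0] is Rank1)."""
    # query_feats, gallery_feats: (Nq, D) and (Ng, D) arrays; *_ids: 1-D integer identity arrays
    dists = np.linalg.norm(query_feats[:, None] - gallery_feats[None, :], axis=2)
    cmc = np.zeros(max_rank)
    for i in range(len(query_ids)):
        order = np.argsort(dists[i])                                 # gallery sorted by distance
        hit = np.where(gallery_ids[order] == query_ids[i])[0][0]     # rank of the true match
        if hit < max_rank:
            cmc[hit:] += 1                                           # a hit at rank r counts for all k >= r
    return cmc / len(query_ids)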

2.1 Results on the iLIDS-VID dataset

The video data in the iLIDS-VID dataset were captured in an airport arrival hall by two cameras with non-overlapping views. The dataset contains 300 pedestrians with distinct IDs, each with a pair of video sequences from the two views, for a total of 600 sequences. Each pedestrian sequence has 23 to 192 frames, with an average length of 73 frames. The dataset is highly challenging because of clothing similarity between different pedestrians, illumination and viewpoint changes between cameras, cluttered backgrounds, and severe occlusion.

In each experiment, the 600 sequences are randomly split in half by identity: 300 sequences form the training set and the remaining 300 form the test set. The results are compared with other representative algorithms: the classical RNN-CNN network [6], the AMOC algorithm, which fuses optical-flow motion information [7], the ASTPN algorithm, which uses a shared attention matrix [9], and ASTPN combined with the ReRank method [22]. The results are listed in Table 1.

Table 1 Matching rates of different methods on iLIDS-VID  /%

Method              Rank1   Rank5   Rank10   Rank20
RNN-CNN[6]          58.0    84.0    91.0     96.0
AMOC[7]             68.7    94.3    98.3     99.3
ASTPN[9]            62.0    86.0    94.0     98.0
ASTPN+ReRank[22]    76.0    94.0    97.0     98.0
Ours                80.5    95.4    97.2     99.3

Note: bold indicates the best result for each rank.

As the data in Table 1 show, the proposed algorithm clearly improves the matching rate over existing algorithms. Its Rank1 reaches 80.5%, 4.5% higher than the second-best method, ASTPN+ReRank [22], and its Rank5 and Rank20 results also compare favorably with mainstream algorithms such as AMOC and ASTPN.

2.2 Results on the PRID2011 dataset

The PRID2011 dataset provides multiple pedestrian trajectories from two different surveillance cameras overlooking a sidewalk. Camera A contains 385 pedestrian video sequences and camera B contains 749, but only 200 pedestrians appear in both camera views. Each sequence has 5 to 675 frames, with an average of 100 frames per pedestrian; the whole dataset contains 24 541 images of 128×64 pixels. The dataset was collected in an uncrowded outdoor scene with a relatively simple, clean background and almost no occlusion.

In each experiment, the sequences of 100 randomly selected pedestrians form the training set and those of the remaining 100 pedestrians form the test set. The results are compared with other mainstream algorithms in Table 2.

Table 2 Matching rates of different methods on PRID2011  /%

Method              Rank1   Rank5   Rank10   Rank20
RNN-CNN[6]          70.0    90.0    95.0     97.0
AMOC[7]             83.7    98.3    99.4     100
ASTPN[9]            77.0    95.0    99.0     99.0
ASTPN+ReRank[22]    83.0    99.0    99.0     99.0
Ours                87.6    97.7    99.2     100

Note: bold indicates the best result for each rank.

As the data in Table 2 show, on PRID2011 the proposed method clearly improves the matching rate over the mainstream methods AMOC, ASTPN, and ASTPN+ReRank, with Rank1 gains of 3.9%, 10.6%, and 4.6%, respectively, which demonstrates its advantage.

2.3 Validation of algorithm effectiveness

To verify the effectiveness of the BiLSTM and the attention mechanism in the proposed network, comparative experiments were conducted on the PRID2011 and iLIDS-VID datasets; the results are shown in Fig. 5.

Fig. 5 Performance comparison of different algorithm variants on the two datasets ((a) iLIDS-VID; (b) PRID2011)

To verify the effectiveness of the BiLSTM, an LSTM and a BiLSTM were each used to aggregate the series of image-level features into a snippet-level feature, without the attention mechanism. On PRID2011, using only the BiLSTM layer improves Rank1 by 3.5% over using only the LSTM layer; on iLIDS-VID, the improvement is 1.3%. This shows that the BiLSTM learns temporal motion information better.

To verify the effectiveness of the attention mechanism, it was added on top of the BiLSTM layer. On PRID2011, the attention mechanism improves Rank1 by 7.4%; on iLIDS-VID, by 11.4%. This shows that the attention mechanism effectively improves re-identification performance.

3 Conclusion

Existing video person re-identification algorithms mainly address how to aggregate a series of image-level features into a robust snippet-level feature. To this end, this paper proposes an attention-based video person re-identification method: a convolutional neural network first learns the spatial appearance feature of each frame, a BiLSTM network then captures the temporal motion information of the video snippet, and finally a dot-product attention mechanism fuses the learned spatial appearance features and temporal motion information into a discriminative snippet-level feature. Experimental results on two datasets show that the method significantly improves video person re-identification performance. However, existing video re-identification datasets are generally small and lack sufficient samples for network training and optimization, so the network tends to overfit and generalize poorly. Future work could use generative adversarial networks (GAN) to enlarge the datasets. Combining video person re-identification with related problems such as multi-object tracking and human pose estimation is also a direction for future research.

References

  • [1] Xiang T, Gong S G. Video behavior profiling for anomaly detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(5): 893–908. [DOI:10.1109/TPAMI.2007.70731]
  • [2] Zheng W S, Wu A C. Asymmetric person re-identification:cross-view person tracking in a large camera network[J]. Scientia Sinica Informationis, 2018, 48(5): 545–563. [郑伟诗, 吴岸聪. 非对称行人重识别:跨摄像机持续行人追踪[J]. 中国科学:信息科学, 2018, 48(5): 545–563. ] [DOI:10.1360/N112018-00017]
  • [3] Gheissari N, Sebastian T B, Hartley R. Person reidentification using spatiotemporal appearance[C]//Proceedings of 2006 IEEE Conference on Computer Vision and Pattern Recognition. New York, USA: IEEE, 2006: 1528-1535.[DOI: 10.1109/CVPR.2006.223]
  • [4] Qi M B, Wang C C, Jiang J G, et al. Person re-identification based on multi-feature fusion and alternating direction method of multipliers[J]. Journal of Image and Graphics, 2018, 23(6): 827–836. [齐美彬, 王慈淳, 蒋建国, 等. 多特征融合与交替方向乘子法的行人再识别[J]. 中国图象图形学报, 2018, 23(6): 827–836. ] [DOI:10.11834/jig.170507]
  • [5] Mclaughlin N, del Rincon J M, Miller P. Recurrent convolutional network for video-based person re-identification[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 1325-1334.[DOI: 10.1109/CVPR.2016.148]
  • [6] Wu L, Shen C H, van den Hengel A. Deep recurrent convolutional networks for video-based person re-identification: an end-to-end approach[EB/OL]. 2016-06-12[2019-02-20]. https://arxiv.org/pdf/1606.01609.pdf.
  • [7] Liu H, Jie Z Q, Jayashree K, et al. Video-based person re-identification with accumulative motion context[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(10): 2788–2802. [DOI:10.1109/TCSVT.2017.2715499]
  • [8] Zhang W, Yu X D, He X Y. Learning bidirectional temporal cues for video-based person re-identification[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(10): 2768–2776. [DOI:10.1109/TCSVT.2017.2718188]
  • [9] Xu S J, Cheng Y, Gu K, et al. Jointly attentive spatial-temporal pooling networks for video-based person re-identification[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 4743-4752.[DOI: 10.1109/ICCV.2017.507]
  • [10] Liu Y, Yan J J, Ouyang W L. Quality aware network for set to set recognition[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 4694-4703.[DOI: 10.1109/CVPR.2017.499]
  • [11] Zhou Z, Huang Y, Wang W, et al. See the forest for the trees: joint spatial and temporal recurrent neural networks for video-based person re-identification[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6776-6785.[DOI: 10.1109/CVPR.2017.717]
  • [12] Zhang R M, Sun H B, Li J Y, et al. SCAN: self-and-collaborative attention network for video person re-identification[EB/OL].2018-07-20[2019-02-20]. https://arxiv.org/pdf/1807.05688.pdf.
  • [13] Wang T Q, Gong S G, Zhu X T, et al. Person re-identification by video ranking[C]//Proceedings of the 13th European Conference on Computer Vision. Switzerland: Springer, 2014: 688-703.[DOI: 10.1007/978-3-319-10593-2_45]
  • [14] Hirzer M, Beleznai C, Roth P M, et al. Person re-identification by descriptive and discriminative classification[C]//Proceedings of the 17th Scandinavian Conference on Image Analysis. Ystad, Sweden: Springer, 2011: 91-102.[DOI: 10.1007/978-3-642-21227-7_9]
  • [15] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.[DOI: 10.1109/CVPR.2016.90]
  • [16] Ren S Q, He K M, Girshick R, et al. Faster R-CNN:towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. [DOI:10.1109/TPAMI.2016.2577031]
  • [17] Cornegruta S, Bakewell R, Withey S, et al. Modelling radiological language with bidirectional long short-term memory networks[C]//Proceedings of the 7th International Workshop on Health Text Mining and Information Analysis. Austin, TX, USA: Association for Computational Linguistics, 2016: 17-27.[DOI: 10.18653/v1/W16-6103]
  • [18] Gao J Y, Nevatia R. Revisiting temporal modeling for video-based person ReID[EB/OL]. 2018-05-08[2019-02-20]. https://arxiv.org/pdf/1805.02104.pdf.
  • [19] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[EB/OL]. 2017-12-06[2019-02-20]. https://arxiv.org/pdf/1706.03762.pdf.
  • [20] Xiao T, Li S, Wang B C, et al. Joint detection and identification feature learning for person search[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 3376-3385.[DOI: 10.1109/CVPR.2017.360]
  • [21] Chen D P, Li H S, Xiao T, et al. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding[C]//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 1169-1178.[DOI: 10.1109/CVPR.2018.00128]
  • [22] Saha B, Ram K S, Mukhopadhyay J, et al. Video based person re-identification by re-ranking attentive temporal information in deep recurrent convolutional networks[C]//Proceedings of the 25th IEEE International Conference on Image Processing. Athens, Greece: IEEE, 2018: 1663-1667.[DOI: 10.1109/ICIP.2018.8451594]