Published: 2017-04-16 | DOI: 10.11834/jig.20170407 | 2017, Volume 22, Number 4 | Image Processing and Coding

Received: 2016-11-24; revised: 2016-12-22. Supported by: National Natural Science Foundation of China (61273285, 61673269, 61375019). First author: Zhang Jing (b. 1992), female, master's student in Control Science and Engineering, Department of Automation, Shanghai Jiao Tong University; her research interests are pattern recognition and computer vision. E-mail: crystal_zj@sjtu.edu.cn. CLC number: TP391. Document code: A. Article ID: 1006-8961(2017)04-0472-10


# Global-local metric learning for person re-identification
Zhang Jing, Zhao Xu
Department of Automation, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education, Shanghai 200240, China
Supported by: National Natural Science Foundation of China (61273285, 61673269, 61375019)

# Abstract

Objective The task in person re-identification is to match snapshots of people from non-overlapping camera views captured at different times and places. Intra-class images from different cameras show varying appearances owing to variations in illumination, background, occlusion, viewpoint, and pose. Feature representation and metric learning are the two major research directions in person re-identification. On the one hand, some studies focus on feature descriptors that are discriminative across classes and robust against intra-class variations. On the other hand, numerous metric learning algorithms have achieved good performance in person re-identification. However, comparing all samples with a single global metric is inappropriate for heterogeneous data. Several researchers have therefore proposed local metric learning, but these methods generally require complicated computations to solve convex optimization problems.

Method To improve the performance of metric learning algorithms while avoiding complex computation, this study applies the concept of local metric learning and combines it with global metric learning algorithms, such as cross-view quadratic discriminant analysis (XQDA) and metric learning by accelerated proximal gradient (MLAPG). In the training stage, all samples are softly partitioned into several clusters using a Gaussian mixture model (GMM). A local metric is learned on each cluster using a metric learning method such as XQDA or MLAPG, and a global metric is also learned on the entire training set. In the testing stage, the posterior probabilities of the test samples with respect to each GMM component are computed. For each pair of samples, the local metrics, weighted by the posterior probabilities of the GMM components, and the global metric, weighted by a cross-validated parameter, are integrated into the final metric for similarity evaluation. In this manner, different metrics are used to measure different pairs of samples, which better suits heterogeneous data sets. In particular, we also propose an effective local metric learning strategy for MLAPG that modifies the weights of the loss values of the sample pairs in the loss function with the posterior probabilities of the samples under each GMM component.

Result We conduct experiments on three challenging person re-identification data sets (VIPeR, PRID 450S, and QMUL GRID). Experimental results show that the proposed approach outperforms traditional global metric learning methods. The gain is most significant on the VIPeR data set, which exhibits more complex variations in background and clothing than the other data sets: matching accuracy improves by approximately 2.0%. In addition, we conduct experiments with different feature representations to verify the general effectiveness of the proposed method; matching accuracy improves by approximately 1.3% to 3.4% across the feature descriptors. This result shows that the proposed approach improves performance regardless of the feature descriptor used.

Conclusion We propose a novel framework that integrates global and local metric learning by taking advantage of both approaches. Many recent global metric learning approaches can be plugged into the proposed framework to obtain improved performance on the person re-identification problem. Compared with existing local metric learning approaches, the proposed framework integrates global metric learning methods flexibly and effectively, and it does not require complicated computation. Moreover, the proposed metric learning framework can be applied to many feature representation approaches.
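
To make the test-stage combination concrete, the following is a minimal Python sketch (an illustration, not the authors' code): a fitted scikit-learn `GaussianMixture` stands in for the GMM, taking the product $p_{ik}p_{jk}$ as the pair weight for local metric $k$ is one plausible reading of the weighting described above, and `w0` is the cross-validated global weight.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # GMM used for soft partitioning

def integrated_distance(x_i, x_j, M_global, M_locals, gmm, w0):
    """Integrated global-local Mahalanobis distance for one test pair.

    M_global: d x d global metric; M_locals: list of K local d x d metrics;
    gmm: fitted GaussianMixture with K components; w0: global weight.
    """
    diff = x_i - x_j
    # Posterior probabilities of each sample under every GMM component.
    p_i = gmm.predict_proba(x_i[None, :])[0]
    p_j = gmm.predict_proba(x_j[None, :])[0]
    # Global term, weighted by the cross-validated parameter w0 ...
    d = w0 * diff @ M_global @ diff
    # ... plus the local terms, weighted by the component posteriors
    # (the product rule for pairing the two posteriors is an assumption).
    for k, M_k in enumerate(M_locals):
        d += p_i[k] * p_j[k] * diff @ M_k @ diff
    return d
```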

# Key words

person re-identification; metric learning; local metric learning; integrated global-local metric learning; Gaussian mixture model

# 1.1 Global metric learning

In the Bayesian verification framework [12] adopted by KISSME [6], $\boldsymbol{\varSigma}_I$ and $\boldsymbol{\varSigma}_E$ denote the covariance matrices of the intra-class and inter-class sample differences, and the distance between samples $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ is

$d\left( \boldsymbol{x}_i, \boldsymbol{x}_j \right) = \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right)^{\rm{T}} \left( \boldsymbol{\varSigma}_I^{-1} - \boldsymbol{\varSigma}_E^{-1} \right) \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right)$ (1)

XQDA [2] extends this formulation by additionally learning a subspace projection $\boldsymbol{W}$; with $\boldsymbol{\varSigma}'_I$ and $\boldsymbol{\varSigma}'_E$ denoting the corresponding covariances in the projected space, the distance becomes

$d_{\boldsymbol{W}}\left( \boldsymbol{x}_i, \boldsymbol{x}_j \right) = \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right)^{\rm{T}} \boldsymbol{W} \left( \boldsymbol{\varSigma}_I^{\prime -1} - \boldsymbol{\varSigma}_E^{\prime -1} \right) \boldsymbol{W}^{\rm{T}} \left( \boldsymbol{x}_i - \boldsymbol{x}_j \right)$ (2)

MLAPG [5] instead formulates metric learning as minimizing a loss function $F(\boldsymbol{M})$ over the metric matrix $\boldsymbol{M}$ under a positive semidefinite constraint, which is solved by an accelerated proximal gradient method:

${\rm{min}}\; F\left( \boldsymbol{M} \right) \quad {\rm{s.t.}}\;\; \boldsymbol{M} \succeq 0$ (4)
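
To make Eq. (1) concrete, here is a minimal sketch (illustrative only; the function name and the ridge regularizer are editorial additions, not from the paper) that estimates the two covariances from labelled cross-view samples and forms the metric:

```python
import numpy as np

def kissme_metric(X_a, X_b, y_a, y_b, reg=1e-4):
    """Form M = Sigma_I^{-1} - Sigma_E^{-1} as in Eq. (1).

    X_a, X_b: feature matrices from the two camera views (n x d, m x d);
    y_a, y_b: identity labels. A small ridge term stabilizes the inverses.
    """
    intra, extra = [], []
    for i in range(len(X_a)):
        for j in range(len(X_b)):
            (intra if y_a[i] == y_b[j] else extra).append(X_a[i] - X_b[j])
    intra, extra = np.asarray(intra), np.asarray(extra)
    d = X_a.shape[1]
    # Second moments of the pairwise differences (their mean is ~0 by symmetry).
    sigma_I = intra.T @ intra / len(intra) + reg * np.eye(d)
    sigma_E = extra.T @ extra / len(extra) + reg * np.eye(d)
    return np.linalg.inv(sigma_I) - np.linalg.inv(sigma_E)
```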

# 2 Feature representation

The LOMO feature [2] is robust to changes in illumination, brightness, and viewpoint. The image is first preprocessed with a multi-scale Retinex transform [17] to overcome the color distortion caused by illumination changes. Features are then extracted from sliding windows over the image: within each window, color and texture histograms are computed, with each histogram bin recording the occurrence frequency of one feature value. For all windows within the same horizontal strip, the maximum over each feature dimension is taken as the strip feature, which makes the descriptor robust to viewpoint changes. This procedure is carried out at three scales to ensure scale invariance. The final descriptor has 26 960 dimensions.
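
The strip-wise maximization can be sketched in a few lines (only this step of LOMO, with hypothetical inputs; not the reference implementation):

```python
import numpy as np

def strip_max_pool(window_hists_per_strip):
    """Per-strip max pooling of sliding-window histograms.

    window_hists_per_strip: list of strips, each a list of equal-length
    histogram vectors from the windows in that strip. Taking the per-bin
    maximum within a strip tolerates horizontal (viewpoint) shifts;
    concatenating the strips yields the descriptor.
    """
    return np.concatenate(
        [np.max(np.stack(hists), axis=0) for hists in window_hists_per_strip]
    )
```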

The GOG feature [4] is a region descriptor based on hierarchical distributions of pixel features. The image is first divided into several large regions, each composed of smaller patches. Pixel features, including spatial coordinates, gradients, and color, are extracted within each patch; a Gaussian distribution is fitted to the pixel features of a patch to serve as the patch representation, and another Gaussian is then fitted to all patch representations within a region to form the region descriptor.
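
A simplified sketch of this patch-to-region hierarchy follows (illustrative only; the actual GOG descriptor embeds each Gaussian into an SPD-matrix space rather than the naive mean-plus-covariance flattening used here):

```python
import numpy as np

def gaussian_summary(F):
    """Summarize a feature set F (n x d) by its fitted Gaussian,
    flattened to a vector: mean plus upper triangle of the covariance."""
    mu = F.mean(axis=0)
    cov = np.cov(F, rowvar=False)
    return np.concatenate([mu, cov[np.triu_indices(F.shape[1])]])

def region_descriptor(patch_feature_sets):
    """Hierarchy: fit one Gaussian per patch (over its pixel features),
    then fit another Gaussian over all patch descriptors of the region."""
    patch_desc = np.stack([gaussian_summary(P) for P in patch_feature_sets])
    return gaussian_summary(patch_desc)
```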

The feature fusion network proposed by Wu et al. [3] effectively combines hand-crafted features with convolutional neural network (CNN) [18] features. The enhanced local feature (ELF16) descriptor [19] and the CNN features are jointly mapped into a unified space; through backpropagation, the CNN parameters are thus influenced by the hand-crafted features.

# 3.4 Local MLAPG

Section 1.1.2 introduced the principle of the MLAPG algorithm, which adopts an asymmetric weighting strategy: intra-class and inter-class sample pairs receive different weights in the loss function. In Eq. (3), $w_{ij}$ is the weight of the loss value of the sample pair ($\boldsymbol{x}_i$, $\boldsymbol{x}_j$). Local metric learning based on MLAPG can therefore be realized by modifying the weights $w_{ij}$; we call this method local MLAPG.

$\begin{array}{l} F_k\left( \boldsymbol{M}_k \right) = \sum\limits_{i = 1}^{n} \sum\limits_{j = 1}^{m} w_{ij}^{k}\, f_{\boldsymbol{M}_k}\left( \boldsymbol{x}_i, \boldsymbol{x}_j \right)\\ {\rm{s.t.}}\;\; \boldsymbol{M}_k \succeq 0 \end{array}$ (9)
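
The component-wise weights $w_{ij}^k$ in Eq. (9) can be sketched as follows (assuming, for illustration, that the original MLAPG pair weight is scaled by the product of the two samples' posteriors under component $k$; the function name is editorial):

```python
import numpy as np

def component_weights(W, P_a, P_b, k):
    """Per-component loss weights w_ij^k for Eq. (9).

    W: original MLAPG pair weights, shape (n, m);
    P_a, P_b: GMM posteriors for the two camera views, shapes (n, K), (m, K).
    """
    return W * np.outer(P_a[:, k], P_b[:, k])
```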

# 4.1 Experimental settings and evaluation criteria

The VIPeR, PRID 450S, and QMUL GRID data sets each contain pedestrian images captured by two cameras with non-overlapping views. In the experiments, the images from one camera view form the gallery set and those from the other view form the probe set; the re-identification task is, for each probe sample, to find its match of the same identity in the gallery. Under the common protocol, a data set is randomly split into two halves, one for training and one for testing, and the random experiment is repeated 10 times with results averaged. For VIPeR, for example, the usual practice is to randomly divide the 632 identity pairs into two disjoint halves of 316 pairs each for training and testing, repeat the random split 10 times, and average the CMC curves and rank-1 accuracies over the 10 trials as the final result.
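
This protocol can be summarized by a small sketch of the CMC computation in the single-shot setting (probe $i$'s true match is gallery item $i$; an editorial illustration). The reported results are such scores averaged over the 10 random splits.

```python
import numpy as np

def cmc_scores(dist, ranks=(1, 10, 20)):
    """CMC matching rates from a probe-by-gallery distance matrix,
    assuming the true match of probe i is gallery item i."""
    order = np.argsort(dist, axis=1)  # gallery indices sorted by distance
    # Position (0-based rank) of the correct match for every probe.
    hit_rank = np.argmax(order == np.arange(dist.shape[0])[:, None], axis=1)
    return {r: float(np.mean(hit_rank < r)) for r in ranks}
```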

# 4.2 Person re-identification data sets

The VIPeR data set [11] is the most widely used benchmark in person re-identification research. It contains 632 identity pairs captured by two cameras, with one image per identity per camera and large variations in background, illumination, viewpoint, and pose. All images are normalized to 128 × 48 pixels.

PRID 450S [22] contains 450 pedestrian image pairs from two static camera views; the image sizes are not uniform. In the experiments, to compute LOMO features with the method of [2], the images are first resized to 128 × 64 pixels.

The QMUL GRID data set [23] contains 250 image pairs captured in an underground station, plus 775 additional images that do not belong to these 250 identities and carry no class labels; they can be used to enlarge the test gallery. The images have low resolution and exhibit large variations in brightness and viewpoint. In each trial, 125 pairs are used for training, and the remaining 125 pairs together with the 775 unlabelled images form the test set.

# 4.3 Validation of integrated global-local metric learning

Table 1 Accuracy of integrated global-local metric learning and global approaches /%

| Data set | Method | $r$=1 | $r$=10 | $r$=20 |
| --- | --- | --- | --- | --- |
| VIPeR | Integrated global-local XQDA | 41.99 | 82.50 | 92.25 |
| VIPeR | XQDA[2] | 40.00 | 80.51 | 91.08 |
| VIPeR | Integrated global-local MLAPG | 42.47 | 83.45 | 93.29 |
| VIPeR | MLAPG[5] | 40.73 | 82.34 | 92.37 |
| PRID 450S | Integrated global-local XQDA | 60.62 | 89.82 | 94.62 |
| PRID 450S | XQDA[2] | 59.60 | 89.60 | 93.91 |
| PRID 450S | Integrated global-local MLAPG | 59.73 | 90.44 | 95.56 |
| PRID 450S | MLAPG[5] | 58.76 | 90.31 | 95.33 |
| QMUL GRID | Integrated global-local XQDA | 18.80 | 44.08 | 55.52 |
| QMUL GRID | XQDA[2] | 18.32 | 44.08 | 55.44 |
| QMUL GRID | Integrated global-local MLAPG | 18.08 | 43.44 | 55.92 |
| QMUL GRID | MLAPG[5] | 17.68 | 43.28 | 55.28 |

# 4.4 Validation of local metric learning

Table 2 Accuracy of local metric learning methods /%

| Method | $r$=1 | $r$=10 | $r$=20 |
| --- | --- | --- | --- |
| Local MLAPG | 41.39 | 83.13 | 93.39 |
| Integrated global-local MLAPG ($w_0$=0) | 40.16 | 81.84 | 91.65 |
| Integrated global-local XQDA ($w_0$=0) | 37.78 | 79.30 | 90.13 |
| MLAPG[5] | 40.73 | 82.34 | 92.37 |
| XQDA[2] | 40.00 | 80.51 | 91.08 |

# 4.5 Validation with different feature representations

Table 3 Accuracy of integrated global-local metric learning and global approaches with different features /%

| Method | $r$=1 | $r$=10 | $r$=20 |
| --- | --- | --- | --- |
| LOMO[2] + integrated XQDA | 41.99 | 82.50 | 92.25 |
| LOMO[2] + XQDA[2] | 40.00 | 80.51 | 91.08 |
| GOG (original)[4] + integrated XQDA | 42.15 | 83.67 | 91.90 |
| GOG (original)[4] + XQDA[2] | 38.77 | 81.30 | 91.36 |
| GOG (normalized)[4] + integrated XQDA | 43.89 | 85.16 | 93.64 |
| GOG (normalized)[4] + XQDA[2] | 42.53 | 84.40 | 92.97 |
| FFN (original)[3] + integrated XQDA | 31.58 | 71.80 | 83.99 |
| FFN (original)[3] + XQDA[2] | 28.86 | 68.13 | 81.14 |
| FFN (normalized)[3] + integrated XQDA | 32.59 | 73.86 | 86.49 |
| FFN (normalized)[3] + XQDA[2] | 30.13 | 72.75 | 85.73 |

# References

• [1] Zhao R, Ouyang W L, Wang X G. Person re-identification by salience matching[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia:IEEE, 2013:2528-2535.[DOI:10.1109/ICCV.2013.314]
• [2] Liao S C, Hu Y, Zhu X Y, et al. Person re-identification by local maximal occurrence representation and metric learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA:IEEE, 2015:2197-2206.[DOI:10.1109/CVPR.2015.7298832]
• [3] Wu S X, Chen Y C, Li X, et al. An enhanced deep feature representation for person re-identification[C]//Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision. Lake Placid, NY, USA:IEEE, 2016:1-8.[DOI:10.1109/WACV.2016.7477681]
• [4] Matsukawa T, Okabe T, Suzuki E, et al. Hierarchical Gaussian descriptor for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA:IEEE, 2016:1363-1372.[DOI:10.1109/CVPR.2016.152]
• [5] Liao S C, Li S Z. Efficient PSD constrained asymmetric metric learning for person re-identification[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile:IEEE, 2015:3685-3693.[DOI:10.1109/ICCV.2015.420]
• [6] Köstinger M, Hirzer M, Wohlhart P, et al. Large scale metric learning from equivalence constraints[C]//Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA:IEEE, 2012:2288-2295.[DOI:10.1109/CVPR.2012.6247939]
• [7] Zheng W S, Gong S G, Xiang T. Person re-identification by probabilistic relative distance comparison[C]//Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA:IEEE, 2011:649-656.[DOI:10.1109/CVPR.2011.5995598]
• [8] Huang S Y, Lu J W, Zhou J, et al. Nonlinear local metric learning for person re-identification[J/OL]. arXiv Preprint arXiv:1511.05169, 2015. 2015-11-16[2016-11-24].https://arxiv.org/abs/1511.05169v1.
• [9] Li W, Wang X G. Locally aligned feature transforms across views[C]//Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA:IEEE, 2013:3594-3601.[DOI:10.1109/CVPR.2013.461]
• [10] Bohné J, Ying Y M, Gentric S, et al. Large margin local metric learning[C]//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland:Springer International Publishing, 2014:679-694.[DOI:10.1007/978-3-319-10605-2_44]
• [11] Gray D, Brennan S, Tao H. Evaluating appearance models for recognition, reacquisition, and tracking[C]//Proceedings of 2007 IEEE International Workshop on Performance Evaluation of Tracking and Surveillance. Rio de Janeiro, Brazil:IEEE, 2007:41-47.
• [12] Moghaddam B, Jebara T, Pentland A. Bayesian face recognition[J]. Pattern Recognition, 2000, 33(11): 1771–1782. [DOI:10.1016/S0031-3203(99)00179-X]
• [13] Tseng P. On accelerated proximal gradient methods for convex-concave optimization[J/OL]. 2008-05-21[2016-11-24].http://www.csie.ntu.edu.tw/b97058/tseng/papers/apgm.pdf.
• [14] Zhan D C, Li M, Li Y F, et al. Learning instance specific distances using metric propagation[C]//Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada:ACM, 2009:1225-1232.[DOI:10.1145/1553374.1553530]
• [15] Saxena S, Verbeek J. Coordinated local metric learning[C]//Proceedings of 2015 IEEE International Conference on Computer Vision Workshop. Santiago, Chile:IEEE, 2015:369-377.[DOI:10.1109/ICCVW.2015.56]
• [16] Schroff F, Kalenichenko D, Philbin J. Facenet:a unified embedding for face recognition and clustering[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA:IEEE, 2015:815-823.[DOI:10.1109/CVPR.2015.7298682]
• [17] Jobson D J, Rahman Z, Woodell G A. A multiscale retinex for bridging the gap between color images and the human observation of scenes[J]. IEEE Transactions on Image Processing, 1997, 6(7): 965–976. [DOI:10.1109/83.597272]
• [18] Jia Y Q, Shelhamer E, Donahue J, et al. Caffe:convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, Florida, USA:ACM, 2014:675-678.[DOI:10.1145/2647868.2654889]
• [19] Chen Y C, Zheng W S, Lai J H. Mirror representation for modeling view-specific transform in person re-identification[C]//Proceedings of the 24th International Conference on Artificial Intelligence. Buenos Aires, Argentina:AAAI Press, 2015:3402-3408.
• [20] Wold S, Esbensen K, Geladi P. Principal component analysis[J]. Chemometrics and Intelligent Laboratory Systems, 1987, 2(1-3): 37–52. [DOI:10.1016/0169-7439(87)80084-9]
• [21] Vedaldi A, Fulkerson B. VLFeat:an open and portable library of computer vision algorithms[C]//Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy:ACM, 2010:1469-1472.[DOI:10.1145/1873951.1874249]
• [22] Roth P M, Hirzer M, Köstinger M, et al. Mahalanobis distance learning for person re-identification[M]//Gong S G, Cristani M, Yan S C, et al. Person Re-Identification. London, Britain:Springer, 2014:247-267.[DOI:10.1007/978-1-4471-6296-4_12]
• [23] Loy C C, Xiang T, Gong S G. Multi-camera activity correlation analysis[C]//Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA:IEEE, 2009:1988-1995.[DOI:10.1109/CVPR.2009.5206827]
• [24] Porikli F, Divakaran A. Multi-camera calibration, object tracking and query generation[C]//Proceedings of the 2003 International Conference on Multimedia and Expo. Baltimore, MD, USA:IEEE, 2003:653-656.[DOI:10.1109/ICME.2003.1221002]
• [25] Sánchez J, Perronnin F, Mensink T, et al. Image classification with the fisher vector:Theory and practice[J]. International Journal of Computer Vision, 2013, 105(3): 222–245. [DOI:10.1007/s11263-013-0636-x]