Published: 2019-02-16
DOI: 10.11834/jig.180398
2019 | Volume 24 | Number 2

ChinaMM 2018

加权多特征外观表示的实时目标追踪

陈莹莹, 房胜, 李哲

山东科技大学计算机科学与工程学院, 青岛 266590

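Received: 2018-06-22; Revised: 2018-08-10

Funding: National Natural Science Foundation of China (61502278, 61502277)

About the first author: Chen Yingying, born in 1993, female, master's student; her main research interest is object tracking. E-mail: skd2016yc@163.com;
Fang Sheng, male, professor; his main research interests are the coding, analysis, and understanding of multimedia information, wireless network transmission, machine learning, and intelligent services. E-mail: fangs99@126.com.

CLC number: TP391

Document code: A

Article ID: 1006-8961(2019)02-0291-11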

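Summary

Objective Object tracking is a key research direction in computer vision, with wide application in intelligent transportation, human-computer interaction, and related areas. Correlation-filter-based methods have made notable progress in this field thanks to their efficiency and robustness, yet the choice and representation of features remain the primary consideration when building the target appearance model. To make the appearance model more robust, a growing number of trackers replace the single raw grayscale feature with gradient features, color features, or combinations of them, but these methods do not use the properties of the features themselves to decide how much weight each feature should carry in the model.
Method This paper focuses on feature selection and fusion. A weight vector is introduced to fuse the features, and a tracker based on a weighted multi-feature appearance model is designed. Based on how each feature is computed, a linear equation in two unknowns is constructed, which turns solving for the weight vector into determining proportional coefficients between features. Combined with the dimensionality of each feature, this yields a finite set of integer solutions. The final coefficients are selected experimentally and normalized to give the weight vector, from which a new weighted hybrid feature model of target appearance is built.
Result Using the 100 video sequences of OTB-100, the proposed algorithm is compared with seven mainstream algorithms, five of which are correlation-filter methods, in terms of precision, average center error, and real-time performance. While remaining real-time, it achieves better tracking results on several sequences, including Basketball, DragonBaby, Panda, and Lemming. Averaged over the 100 sequences, its precision is 1.2% higher than that of the scale-adaptive tracker with multi-feature fusion.
Conclusion Within a correlation-filter tracking framework, a weight vector is introduced into the description of target appearance, giving a weighted multi-feature fusion tracker that tracks for longer in complex dynamic scenes and improves the robustness of the algorithm.

Keywords

correlation filter; appearance description; feature fusion; weighted feature; real-time tracking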

Real-time visual tracking via weighted multi-feature fusion on an appearance model

Chen Yingying, Fang Sheng, Li Zhe

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China

Supported by: National Natural Science Foundation of China (61502278, 61502277)

Abstract

Objective Visual tracking is an important research direction in computer vision and is widely applied in intelligent transportation, human-computer interaction, and other areas. Correlation filter-based trackers (CFTs) have achieved excellent performance in the tracking field owing to their efficiency and robustness. However, designing a robust tracking algorithm for complex dynamic scenes remains challenging because of illumination change, fast motion, background interference, target rotation, scale change, occlusion, and other factors. In addition, the selection and representation of features are the primary considerations in establishing a target appearance model during tracking. To improve the robustness of the appearance model, many trackers introduce gradient features, color features, or combined features in place of a single gray feature. However, they do not discuss the role of each feature or the relationships among the features in the model.
Method Research on correlation filter theory has brought remarkable improvements. Within this framework, the appearance model is used to represent the target and verify observations, and it is the most important part of any tracking algorithm. Features, in turn, are the most fundamental and most difficult aspect of appearance representation. Therefore, this study focuses on the selection and combination of features. Gradient features, color features, and raw pixels have been discussed in previous works. As a common descriptor of shape and edges, the gradient feature is invariant to translation and illumination to a certain degree and performs well in scenes with deformation, illumination change, and partial occlusion. However, the gradient feature of the target becomes indistinct, and its descriptive capability weakens, when the background contains considerable noise or when the target rotates or is blurred. In such cases the color feature can still distinguish the target from the background, because their colors usually differ. On this basis, a new tracking method called the weighted multi-feature fusion (WMFF) tracker is proposed, which introduces a weight vector to fuse multiple features in the appearance model. The model is dominated by the gradient feature and supplemented by the color feature and raw pixels, which compensates for the inadequacies of a single gradient feature and makes use of color information, so that the features complement one another. In detail, this study constructs a linear equation in the three weights based on how each feature is computed. Only the proportional relationships among the weights need to be solved, not their specific values. Taking the gradient feature as the reference turns the problem of solving for the weight vector into that of determining the proportional coefficients of the other two features, so the equation becomes a linear equation in two unknowns. Given the dimensionality of each feature, this equation has a finite set of integer solutions, and the final proportional coefficients are determined by experimental verification on a test sequence. The coefficients are normalized to obtain the weight vector, and a new weighted hybrid feature model of target appearance is built. The WMFF tracker adopts a tracking-by-detection framework, which includes feature extraction, model construction, filter training, target center detection, and model update.
Result A total of 100 video sequences from the object tracking benchmark (OTB-100) are used in the experiments to compare the performance with seven other state-of-the-art trackers, five of which are CFTs. A total of 11 attributes, such as illumination variation, occlusion, and scale variation, are annotated on the video sequences. The trackers are compared and analyzed using precision, average center error, average Pascal VOC overlap ratio, and frames per second as evaluation metrics. Precision and success plots on different sequences are also presented, and performance under different attributes is discussed. Experimental results on OTB-100 demonstrate that our tracker runs in real time and performs better than the other methods, especially on the Basketball, DragonBaby, Panda, and Lemming sequences. When the scene is subject to motion blur, occlusion, or deformation, the edge contours and even the gradient information of the target become indistinct, so the appearance model built on the gradient feature cannot distinguish the target accurately and tracking fails easily. In contrast, the WMFF tracker can use the color feature as a supplement to construct the appearance model when the gradient feature is ineffective, and thus tracks robustly. The color feature turns out to be as important as the gradient feature, and giving the two comparable shares produces the best feature combination. The proposed method outperforms the other algorithms on multiple sequences, and the average results on OTB-100 show that its precision is 1.2% higher than that of SAMF, the scale-adaptive kernel correlation filter tracker with feature integration.
Conclusion In this study, a weight vector is introduced to combine features in describing the appearance of the target, and a WMFF tracker is proposed based on a CFT framework. A new hybrid feature, HCG, dominated by the gradient feature and supplemented by the color and gray features, is used to model the appearance of the target. This model compensates for the deficiencies of a single feature and brings each feature into play, so that the features complement one another and the appearance model adapts to multiple complex scenes. The WMFF tracker tracks for longer than the other trackers in complex dynamic scenes and improves the robustness of the algorithm.

Key words

correlation filter; appearance description; feature fusion; weighted feature; real-time tracking

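0 Introduction

Object tracking is one of the most active topics in computer vision. Given the initial state (position and size) of a target in the first frame of a video sequence, the task is to estimate the target's state automatically in subsequent frames so that it is tracked continuously. It is widely used in motion analysis, action recognition, intelligent transportation, human-computer interaction, and related applications. Although tracking algorithms have made notable progress in recent years, designing a robust tracker for complex dynamic scenes remains challenging because of illumination change, fast motion, background clutter, target rotation, scale variation, occlusion, and other factors.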

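Existing tracking methods can be divided, by how the target model is built, into generative methods^[1-2] and discriminative methods^[3]. Generative methods model the target region in the current frame and, in the next frame, take the image region with the smallest reconstruction error with respect to that model as the target. Representative approaches include sparse coding^[1], particle filtering, and principal component analysis. Because these methods concentrate on describing the target itself and ignore background information, they tend to drift when the target changes sharply or is occluded. Discriminative methods cast tracking as a classification problem and design a robust classifier to separate the target from the background. Representative examples are multiple instance learning (MIL)^[3], boosting^[4], and the structured support vector machine (SVM)^[5]. Because they explicitly exploit the information that distinguishes foreground from background, discriminative methods are more robust and have gradually become the mainstream in object tracking; they include most current deep learning methods and the correlation-filter-based methods.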

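The success of deep learning in image detection and recognition has greatly stimulated research on object tracking. The MDNet algorithm^[6] reaches a tracking precision above 90%, but applying deep learning to tracking is hampered by the lack of training data. The power of deep models comes from learning effectively from large amounts of annotated training data, whereas tracking provides only the bounding box in the first frame as training data. Under these conditions it is nearly impossible to train a deep model for the current target at the start of tracking. In addition, existing deep-learning trackers still struggle to meet real-time requirements.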

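In recent years, discriminative trackers based on correlation filters have attracted wide attention because they improve tracking accuracy while running at very high speed. The underlying idea is that the more correlated two signals are, the larger their correlation value; that is, the more a region in a video frame resembles the initialized target, the larger the response. A correlation filter is learned from a large number of training samples, and in subsequent frames the target is located at the peak of the predicted response map. The minimum output sum of squared error filter (MOSSE)^[7] was the first to apply correlation to object tracking, and many improved algorithms followed. Despite this progress, the complexity of tracking scenes and the unpredictability of target appearance changes mean that research into efficient, robust correlation filter trackers remains of practical importance.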

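Building on existing correlation filter algorithms, this paper discusses the grayscale, color, and HOG features in turn and introduces a weight vector to fuse them by weighting, in place of a single feature or a simple concatenation. Specifically, a linear equation in two unknowns is constructed, so that solving for the weight vector becomes determining proportional coefficients between features. Combined with the dimensionality produced by each feature's own computation, this gives a finite set of integer solutions, from which the final coefficients are chosen. Normalizing them yields a new hybrid feature, dominated by HOG and supplemented by CN and Gray, that models the target appearance. The model compensates for the weaknesses of HOG alone and makes full use of each feature, so that the features complement one another and the appearance model adapts to a variety of complex scenes.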

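1 Related work

Discriminative tracking based on correlation filters learns a filter template by regressing training samples, obtained as cyclic shifts of the input features, onto a Gaussian-shaped target distribution, and uses the fast Fourier transform to speed up computation. It has become a mainstream approach in visual tracking. MOSSE^[7] first applied correlation filtering to visual tracking; it uses an adaptive training strategy and learns the correlation filter quickly by solving a least-squares problem on a single feature. Reference [8] exploits the circulant structure of the image to perform correlation detection between adjacent frames, which addresses the shortage of training samples, and converts correlation in the spatial domain into element-wise multiplication in the frequency domain, reaching real-time speed. Both methods, however, build the appearance model from raw grayscale values, which makes them sensitive to cluttered backgrounds and to backgrounds that resemble the target. To improve tracking, reference [9] represents the target with the histogram of oriented gradients (HOG) feature, which captures edge and gradient information, and obtains good results. Reference [10] extends reference [8] with color names (CN) attributes^[11]: RGB colors are mapped to an 11-dimensional space, and an adaptive principal component analysis (PCA) scheme reduces the 11 color attributes to 2 dimensions, which improves tracking performance. This method is suited to scenes with a clear color contrast between target and background, however, and it loses the target easily on low-resolution videos.

Although more and more trackers build on multiple features, selecting and fusing features is still difficult. The scale-adaptive tracker with multi-feature fusion (SAMF)^[12] combines raw grayscale, color attributes, and HOG into the target model and improves tracking results. References [13-14] use the same feature combination. However, reference [12] simply treats every feature dimension as a channel and concatenates all channels into one vector, without discussing the role of each feature or the relationships between features. Moreover, the features are computed differently and their dimensionalities are uneven. Raw pixels, for example, form a 1-dimensional feature and occupy a single channel of the combined feature, whereas HOG, with its many orientations, yields several dozen dimensions and naturally takes a much larger share. The features therefore do not carry equal standing. Reference [15] builds the appearance model from CNN features, but CNN features require pre-training on sufficient samples and, more importantly, fall far short of real-time requirements.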
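The frequency-domain speed-up used by these trackers rests on the correlation theorem: the responses of a filter to every cyclic shift of a signal can be obtained from one element-wise product of Fourier transforms. The following is a minimal 1-D sketch in Python with NumPy, not code from any of the cited trackers; the function names and the random test signals are illustrative assumptions.

```python
import numpy as np

def responses_bruteforce(x, w):
    """Response of filter w to every cyclic shift of x, computed explicitly."""
    n = len(x)
    # Row s holds x cyclically shifted by s positions (a circulant data matrix).
    shifts = np.stack([np.roll(x, s) for s in range(n)])
    return shifts @ w

def responses_fft(x, w):
    """The same responses via the correlation theorem, in O(n log n)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # illustrative 1-D "feature" signal
w = rng.standard_normal(64)   # illustrative filter
same = np.allclose(responses_bruteforce(x, w), responses_fft(x, w))
print(same)
```

The brute-force version builds all n shifted samples and costs O(n^2) operations, which is the cost that the circulant formulation avoids.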

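The appearance model, which represents the target and verifies observations, is the most important part of any tracking algorithm, and features are the most fundamental element of target representation. An effective appearance model has to take the following factors into account.

1) Feature representation. The target can be represented by different features, commonly grayscale, LBP, color, HOG, and Haar features. In the correlation filter framework, whether the selected features are effective has a major impact on tracking performance. First, the extracted features should be global features computed on image patches rather than keypoint-based descriptors. Second, features differ in how well they describe the target in different scenes; if only one or two fixed features are used, they may stop being effective when the scene changes, and tracking fails. Finally, because tracking must run in real time, feature extraction itself should not be computationally expensive.

2) Online update scheme, so that the tracker can adapt to appearance changes of the target and the background. When the target is occluded or lost, the original sample information is more accurate, while the more recent samples are themselves unreliable.

Based on the above analysis, this paper focuses on solving the weight equation for multi-feature fusion and on the scheme for building the target appearance model.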

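2 Weighted multi-feature tracking

2.1 Review of the SAMF algorithm

Reference [9] casts the learning of the correlation filter as a ridge regression problem, uses cyclic shifts to enlarge the training set, exploits the diagonalization of circulant matrices to reduce computational cost, and introduces the kernel trick to handle nonlinear classification. The filter is trained by minimizing the regression error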

$ \mathop {\min }\limits_\mathit{\boldsymbol{w}} \sum\limits_i {{{\left( {f\left( {{\mathit{\boldsymbol{x}}_i}} \right) - {y_i}} \right)}^2}} + \mathit{\boldsymbol{\lambda }}{\left\| \mathit{\boldsymbol{w}} \right\|^2} $

(1)

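where ${\mathit{\boldsymbol{x}}_i}$ denotes the $i$-th feature sample extracted from the image, $\mathit{\boldsymbol{w}}$ the classifier parameters, $f\left( \mathit{\boldsymbol{x}} \right) = {\mathit{\boldsymbol{w}}^{\rm{T}}}\mathit{\boldsymbol{x}}$ the correlation filter, and ${y_i}$∈[0, 1] the regression label, generated by a Gaussian function under the rule that samples closer to the positive sample are more likely to be the target. $\mathit{\boldsymbol{\lambda }}$ is the regularization parameter. KCF is far faster than other tracking-by-detection methods and also performs comparatively well. Its drawbacks are equally clear: besides the target deformation and scale problems mentioned in the original paper, its robustness needs improvement, and the failure rate is high under partial occlusion or fast motion. As an example, Fig. 1 compares the tracking center errors of KCF, which uses HOG, and the CN tracker, which uses color attributes, on the Deer and Girl2 sequences.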

Fig. 1 Center error comparison of single features on different sequences ((a) Deer; (b) Girl2)

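Fig. 1 shows that the model based on HOG alone is not discriminative enough and loses the target easily (frames 24-37 of Deer; frames 330-1500 of Girl2), whereas the CN feature separates target and background correctly in those frames and makes up for the weakness of HOG. This motivates a hybrid feature model in which gradient, color, and other features complement one another. Building on KCF, SAMF concatenates the image features as $C$ channels into one vector $\mathit{\boldsymbol{x}} = \left[ {{\mathit{\boldsymbol{x}}_1}, {\mathit{\boldsymbol{x}}_2}, \cdots , {\mathit{\boldsymbol{x}}_C}} \right]$, and the Gaussian kernel function of KCF is written as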

$ {\mathit{\boldsymbol{k}}^{\mathit{\boldsymbol{xx'}}}} = \exp \left( { - \frac{1}{{{\sigma ^2}}}\left( {{{\left\| \mathit{\boldsymbol{x}} \right\|}^2} + {{\left\| {\mathit{\boldsymbol{x'}}} \right\|}^2} - 2{\mathit{\boldsymbol{F}}^{ - 1}}\left( {\sum\limits_{c = 1}^C {{{\mathit{\boldsymbol{\hat x}}}_c} \odot {{\left( {{\mathit{\boldsymbol{\hat x}}}'_c} \right)}^{\rm{H}}}} } \right)} \right)} \right) $

(2)

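where $\sigma $ is the kernel parameter, $ \odot $ denotes the element-wise product, and $c$ is the channel index $\left( {c = 1, \cdots , C} \right)$. ${\mathit{\boldsymbol{F}}^{{\rm{ - }}1}}$ is the inverse Fourier transform. The algorithm uses three common features: the color attribute feature CN, the gradient feature HOG, and the raw image intensity. HOG is a widely used descriptor of shape and edges. It is invariant to translation and illumination to a certain degree and performs well in scenes with deformation, illumination change, and partial occlusion. However, when the background is noisy, or when the target rotates or is motion-blurred, the gradient feature of the target is indistinct and the descriptive power of HOG weakens. CN^[10] uses the mapping of reference [11] to convert RGB space into an 11-dimensional color feature space; it represents the color information of the target well and tracks well on color video sequences. HOG and CN describe the gradient and the color of the target respectively, and both are strongly discriminative. In addition, the normalized grayscale feature Gray, as the simplest 1-dimensional auxiliary feature, can be concatenated with any other feature to supplement and strengthen the representation.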
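To make the role of the inverse Fourier transform in Eq. (2) concrete, the sketch below evaluates the multi-channel Gaussian kernel for all cyclic shifts at once and compares it with an explicit loop over shifts. It is an illustrative Python/NumPy sketch, not the SAMF implementation; the function names, the toy patch size, and the value of $\sigma $ are assumptions.

```python
import numpy as np

def gaussian_kernel_fft(x, z, sigma):
    """Eq. (2) evaluated for all cyclic shifts at once; x, z have shape (H, W, C)."""
    xf = np.fft.fft2(x, axes=(0, 1))
    zf = np.fft.fft2(z, axes=(0, 1))
    # Sum over channels in the Fourier domain, then a single inverse transform.
    # Which factor is conjugated only fixes the direction in which shifts are counted.
    cross = np.real(np.fft.ifft2(np.sum(np.conj(xf) * zf, axis=2)))
    dist = np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)

def gaussian_kernel_bruteforce(x, z, sigma):
    """Reference: Gaussian kernel between x and each cyclic shift of z."""
    h, w, _ = x.shape
    k = np.empty((h, w))
    for dy in range(h):
        for dx in range(w):
            shifted = np.roll(z, shift=(-dy, -dx), axis=(0, 1))
            k[dy, dx] = np.exp(-np.sum((x - shifted) ** 2) / sigma ** 2)
    return k

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))   # toy 3-channel patch
z = rng.random((4, 4, 3))
k_fast = gaussian_kernel_fft(x, z, sigma=2.0)
k_slow = gaussian_kernel_bruteforce(x, z, sigma=2.0)
print(np.allclose(k_fast, k_slow))
```

Because the channels are summed before the inverse transform, adding channels increases the cost only linearly, which is why concatenating HOG, CN, and Gray remains cheap.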

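2.2 Problem statement

Because tracking scenes are diverse and complex, the feature combination above falls well short of the expected complementary effect. This section first analyzes the failure cases experimentally and theoretically; the following sections then present the proposed tracking method. Take Fig. 2 as an example. Before frame 25 of the Deer sequence, HOG gives the higher tracking precision, but around frames 25 and 55 the HOG-based representation loses the target. From frame 25 onward, the combined feature follows the failing HOG and tracking fails, which shows that the color feature is not being used effectively.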

Fig. 2 Center error comparison of single and multiple features on Deer

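In theory, the results of SAMF stay close to those of HOG, because in the concatenated feature vector HOG has 31 dimensions, CN has 10, and Gray only 1. HOG thus accounts for more than 70% of the fused dimensions, and the fused result is naturally biased towards HOG. Simply concatenating the three features therefore gives poor accuracy and robustness in such scenes: different types of features differ in their ability to describe the target, and SAMF cannot bring out the strengths of CN and the other features.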
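As a quick check of this figure, using the dimensions stated above (31 for HOG, 10 for CN, 1 for Gray), the shares of the plainly concatenated feature are

$ \frac{{31}}{{31 + 10 + 1}} = \frac{{31}}{{42}} \approx 73.8\% ,\quad \frac{{10}}{{42}} \approx 23.8\% ,\quad \frac{1}{{42}} \approx 2.4\% $

for HOG, CN, and Gray respectively.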

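2.3 Weighted multi-feature fusion algorithm

Based on the analysis above, this paper proposes a weighted feature combination scheme that starts from how the features themselves are computed. Assigning different weights to HOG, CN, and Gray yields a new hybrid feature, HCG, dominated by HOG and supplemented by CN and Gray, which models the target appearance and forms the tracker WMFF (weighted multi-feature fusion). The details are as follows.

Let ${\mathit{\boldsymbol{x}}_{\rm{h}}}$ denote the HOG feature of the extracted image patch, ${\mathit{\boldsymbol{x}}_{\rm{c}}}$ the CN feature, and ${\mathit{\boldsymbol{x}}_{\rm{g}}}$ the raw grayscale values, combined into the hybrid feature $\mathit{\boldsymbol{x}}{\rm{ = }}\left[ {{\mathit{\boldsymbol{x}}_{\rm{h}}}, {\mathit{\boldsymbol{x}}_{\rm{c}}}, {\mathit{\boldsymbol{x}}_{\rm{g}}}} \right]$. The classifier $f\left( \mathit{\boldsymbol{x}} \right)$ is trained on an image patch of size $M$×$N$ centered at the target position, with samples ${x_i}$ obtained by cyclic shifts, where $i \in \left\{ {\left[ {0, \cdots , M - 1} \right] \times \left[ {0, \cdots , N - 1} \right]} \right\}$. To distinguish the roles of the features, a weight vector $\mathit{\boldsymbol{\gamma = }}\left[ {{\gamma _{\rm{h}}}, {\gamma _{\rm{c}}}, {\gamma _{\rm{g}}}} \right]$ is introduced, and the objective function in Eq. (1) can be written as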

$ \mathop {\min }\limits_\mathit{\boldsymbol{w}} {\left\| {f\left( {\mathit{\boldsymbol{\gamma }}{\mathit{\boldsymbol{x}}^{\rm{T}}}} \right) - \mathit{\boldsymbol{y}}} \right\|^2} + \mathit{\boldsymbol{\lambda }}{\left\| \mathit{\boldsymbol{w}} \right\|^2} $

(3)

The filter function $f\left( \mathit{\boldsymbol{x}} \right)$ can then be written as a linear combination over the features, namely

$ f\left( \mathit{\boldsymbol{x}} \right) = {\gamma _{\rm{h}}}{f_{\rm{h}}}\left( \mathit{\boldsymbol{x}} \right) + {\gamma _{\rm{c}}}{f_{\rm{c}}}\left( \mathit{\boldsymbol{x}} \right) + {\gamma _{\rm{g}}}{f_{\rm{g}}}\left( \mathit{\boldsymbol{x}} \right) $

(4)

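Eq. (4) does not require the specific values of ${{\gamma _{\rm{h}}}, {\gamma _{\rm{c}}}, {\gamma _{\rm{g}}}}$, only their ratios. Since the KCF tracker^[9] with HOG already tracks well, and taking into account the dimensionality obtained when each feature is computed on its own, this paper takes HOG as the main feature and requires the ratios between features to satisfy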

$ {\gamma _{\rm{h}}} \times {d_{\rm{h}}} = {\gamma _{\rm{c}}} \times {d_{\rm{c}}} + {\gamma _{\rm{g}}} \times {d_{\rm{g}}} $

(5)

where ${d_i}$ denotes the dimensionality of each feature, $i \in \left\{ {{\rm{h, c, g}}} \right\}$. That is,

$ {d_{\rm{h}}} = {d_{\rm{c}}} \times \frac{{{\gamma _{\rm{c}}}}}{{{\gamma _{\rm{h}}}}} + {d_{\rm{g}}} \times \frac{{{\gamma _{\rm{g}}}}}{{{\gamma _{\rm{h}}}}} $

(6)

Letting $\frac{{{\gamma _{\rm{c}}}}}{{{\gamma _{\rm{h}}}}} = \alpha $ and $\frac{{{\gamma _{\rm{g}}}}}{{{\gamma _{\rm{h}}}}} = \beta $, Eq. (6) can be written as

$ {d_{\rm{c}}}\alpha + {d_{\rm{g}}}\beta = {d_{\rm{h}}} $

(7)

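with $\alpha $ and $\beta $ restricted to non-negative integers. Reference [9] extracts HOG features with the pdollar toolbox (https://www.mathworks.com/matlabcentral/fileexchange/56689-pdollar-toolbox), which gives 31 dimensions, referred to as 31 channels. The color feature obtained following reference [11] generally has 10 dimensions, referred to as 10 channels, and Gray has 1. With ${d_{\rm{h}}} = 31$, ${d_{\rm{c}}} = 10$, and ${d_{\rm{g}}} = 1$, the linear equation in two unknowns in Eq. (7) has only 4 non-negative integer solutions: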

$ \left\{ \begin{array}{l} \alpha = 0,\beta = {d_{\rm{h}}}/{d_{\rm{g}}}\\ \alpha = 1,\beta = \left( {{d_{\rm{h}}} - {d_{\rm{c}}}} \right)/{d_{\rm{g}}}\\ \alpha = 2,\beta = \left( {{d_{\rm{h}}} - 2{d_{\rm{c}}}} \right)/{d_{\rm{g}}}\\ \alpha = 3,\beta = \left( {{d_{\rm{h}}} - 3{d_{\rm{c}}}} \right)/{d_{\rm{g}}} \end{array} \right. $

(8)
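The solution set in Eq. (8) can be checked by brute-force enumeration. The sketch below is a minimal Python illustration that assumes the dimensions given in the text (31 for HOG, 10 for CN, 1 for Gray) and restricts $\alpha $ and $\beta $ to non-negative integers; the function name is hypothetical.

```python
def integer_solutions(d_h, d_c, d_g):
    """All non-negative integer pairs (alpha, beta) with d_c*alpha + d_g*beta = d_h."""
    solutions = []
    for alpha in range(d_h // d_c + 1):
        rest = d_h - alpha * d_c          # what beta * d_g still has to cover
        if rest % d_g == 0:
            solutions.append((alpha, rest // d_g))
    return solutions

solutions = integer_solutions(d_h=31, d_c=10, d_g=1)
print(solutions)   # [(0, 31), (1, 21), (2, 11), (3, 1)]
```

Each pair corresponds to one row of Eq. (8); the last pair is the one in which CN and HOG contribute almost equally.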

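To fix the values of $\alpha $ and $\beta $, the Box video, which is annotated with 9 challenge attributes such as illumination, scale variation, and occlusion, is used for testing, and the tracking precision is computed for each solution in Eq. (8). Fig. 3 is a scatter plot of tracking precision for the different values of $\alpha $. It shows that the larger $\alpha $ is, that is, the larger the share of the CN feature, the higher the precision. Precision is highest at $\alpha $=3, which by Eq. (8) gives $\beta $=1; the CN weight is then increased so that ${\mathit{\boldsymbol{x}}_{\rm{h}}}:{\mathit{\boldsymbol{x}}_{\rm{c}}} \approx {\bf{1}}:{\bf{1}}$. Because the dataset also contains grayscale videos, for which color features are unavailable, Gray is kept as a supplement: when CN is not applicable to the target, Gray takes its place. This experiment indicates that CN is as important as HOG for describing the target and more important than grayscale alone. All subsequent experiments therefore use $\alpha $=3.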

Fig. 3 Scatter plot of tracking precision versus $\alpha $
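With $\alpha $=3, Eq. (7) gives $\beta $=1 for the dimensions 31, 10, and 1, so the weights stand in the ratio 1 : 3 : 1 for HOG, CN, and Gray. The paper does not spell out how the ratios are normalized or how the weights enter the feature vector. The Python/NumPy sketch below therefore makes two assumptions: the weights are normalized to sum to one, and each feature block is scaled by its weight before concatenation. The function names and the random stand-in features are illustrative only.

```python
import numpy as np

def normalized_weights(alpha, beta):
    """Weights (gamma_h, gamma_c, gamma_g) from the ratios alpha and beta.

    Assumption: the weights are normalized to sum to one.
    """
    raw = np.array([1.0, float(alpha), float(beta)])
    return raw / raw.sum()

def fuse_hcg(hog, cn, gray, gamma):
    """Scale each feature block by its weight and stack along the channel axis."""
    return np.concatenate([gamma[0] * hog, gamma[1] * cn, gamma[2] * gray], axis=2)

gamma = normalized_weights(alpha=3, beta=1)            # order: HOG, CN, Gray
dims = np.array([31, 10, 1])                           # d_h, d_c, d_g from the text
rng = np.random.default_rng(0)
hog, cn, gray = (rng.random((8, 8, int(d))) for d in dims)  # stand-ins for real features
hcg = fuse_hcg(hog, cn, gray, gamma)

print(gamma)          # normalized weights
print(gamma * dims)   # weighted dimension share of each block
print(hcg.shape)      # fused feature: 42 channels
```

Under these assumptions the weighted shares of HOG and CN are 6.2 and 6.0, which matches the approximately 1 : 1 relation stated above, and Eq. (5) holds exactly since 6.2 = 6.0 + 0.2.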

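3 WMFF tracker

3.1 Filter training

The purpose of training is to find a filter template whose response is maximal when it is applied to the tracked object.

A nonlinear mapping function $\varphi \left( x \right)$, a column vector, is introduced to map the samples in Eq. (1) to a high-dimensional space in which they become linearly separable, so that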

$ \mathit{\boldsymbol{w}} = \sum\limits_i {{\mathit{\boldsymbol{\alpha }}_i}\varphi \left( {{x_i}} \right)} $

(9)

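where $\pmb \alpha $ is the variable in the dual space of $\mathit{\boldsymbol{w}}$ (a vector of coefficients, not to be confused with the scalar ratio $\alpha $ of Eq. (7)). By the circulant structure and diagonalizability of the kernel matrix^[9],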

$ \mathit{\boldsymbol{\hat \alpha }} = \frac{{\mathit{\boldsymbol{\hat y}}}}{{{{\mathit{\boldsymbol{\hat k}}}^{xx}} + \mathit{\boldsymbol{\lambda }}}} $

(10)

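where $\mathit{\boldsymbol{\widehat \alpha }}$ is the discrete Fourier transform of $\mathit{\boldsymbol{\alpha }}$, i.e. the filter coefficients; the kernel function ${\mathit{\boldsymbol{k}}^{xx}} = \varphi \left( \mathit{\boldsymbol{x}} \right)\varphi {\left( \mathit{\boldsymbol{x}} \right)^{\rm{T}}}$ is the first row of the kernel matrix $\mathit{\boldsymbol{K}}$, i.e. the radial basis function of Eq. (2); and $\mathit{\boldsymbol{\lambda }}$ is the regularization parameter. The filter $\mathit{\boldsymbol{\alpha }}$ is obtained by training according to Eq. (10).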
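For completeness, in the KCF formulation of reference [9] on which this section is based, the trained coefficients are applied to a new image patch as follows. Here $\mathit{\boldsymbol{z}}$ (the features of the candidate patch) and ${\mathit{\boldsymbol{k}}^{xz}}$ (the kernel correlation between the model $\mathit{\boldsymbol{x}}$ and $\mathit{\boldsymbol{z}}$, computed as in Eq. (2)) are symbols introduced only for this note:

$ f\left( \mathit{\boldsymbol{z}} \right) = {\mathit{\boldsymbol{F}}^{ - 1}}\left( {{{\mathit{\boldsymbol{\hat k}}}^{xz}} \odot \mathit{\boldsymbol{\hat \alpha }}} \right) $

The target is placed at the position where $f\left( \mathit{\boldsymbol{z}} \right)$ is largest.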

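3.2 Algorithm flow

The proposed algorithm adopts a tracking-by-detection framework consisting of feature extraction, model construction, filter training, target center detection, and model update. This paper concentrates on feature extraction and model construction. For color videos, the HOG, CN, and Gray features of the target window are extracted; for grayscale videos, Gray replaces CN as the supplement. The features are combined into the HCG feature as described in Section 2.3. The tracking procedure is summarized as follows:

1) Given the target position in the first frame, extract a window around it to obtain an $M$×$N$ image patch;

2) Extract each feature of the image patch, combine them into the HCG feature model according to Eqs. (4)-(8), and apply a window function to smooth the edges;

3) Apply the discrete Fourier transform (DFT) to move to the frequency domain and train the filter according to Eq. (10);

4) When the next frame arrives, extract the features from the window at the previous frame's position and combine them into HCG, sample by cyclic shifts, and correlate the result with the filter trained in step 3); the location of the maximum response is the target center in the current frame;

5) Update the target model and the filter template;

6) Repeat steps 4) and 5) until the whole video has been tracked.

Fig. 4 illustrates the execution of the weighted multi-feature tracking algorithm, highlighting the feature extraction and combination modules.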

Fig. 4 Execution process of the WMFF tracker
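The training and detection steps of this procedure can be sketched in a few lines. The Python/NumPy code below is a simplified illustration under stated assumptions, not the authors' MATLAB implementation: random arrays stand in for real HOG, CN, and Gray features; the weights (0.2, 0.6, 0.2) assume the 1 : 3 : 1 ratio normalized to unit sum; the kernel distance is divided by the number of feature elements for numerical stability, which does not appear in Eq. (2); and the window function and the model update of step 5) are omitted. All function names are hypothetical.

```python
import numpy as np

def gaussian_labels(h, w, sigma=2.0):
    """Regression target y: Gaussian peak at index (0, 0), wrapping cyclically."""
    dy = np.minimum(np.arange(h), h - np.arange(h))
    dx = np.minimum(np.arange(w), w - np.arange(w))
    return np.exp(-0.5 * (dy[:, None] ** 2 + dx[None, :] ** 2) / sigma ** 2)

def fuse_hcg(hog, cn, gray, gamma=(0.2, 0.6, 0.2)):
    """Weighted HCG feature (assumed form): scale each block, stack the channels."""
    return np.concatenate([gamma[0] * hog, gamma[1] * cn, gamma[2] * gray], axis=2)

def gaussian_correlation(x, z, sigma=0.5):
    """Gaussian kernel of Eq. (2) for all cyclic shifts; x, z have shape (H, W, C)."""
    xf = np.fft.fft2(x, axes=(0, 1))
    zf = np.fft.fft2(z, axes=(0, 1))
    cross = np.real(np.fft.ifft2(np.sum(np.conj(xf) * zf, axis=2)))
    # Assumption: normalize the squared distance by the number of elements.
    dist = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)

def train(x, y, lam=1e-4):
    """Eq. (10): filter coefficients in the Fourier domain."""
    return np.fft.fft2(y) / (np.fft.fft2(gaussian_correlation(x, x)) + lam)

def detect(alpha_f, x, z):
    """Step 4): response over all cyclic shifts of z; return the peak location."""
    k_xz = gaussian_correlation(x, z)
    response = np.real(np.fft.ifft2(np.fft.fft2(k_xz) * alpha_f))
    return np.unravel_index(np.argmax(response), response.shape)

rng = np.random.default_rng(0)
h, w = 16, 16
hog, cn, gray = (rng.standard_normal((h, w, c)) for c in (31, 10, 1))
x = fuse_hcg(hog, cn, gray)                    # steps 1)-2): model features
alpha_f = train(x, gaussian_labels(h, w))      # step 3): train the filter
z = np.roll(x, shift=(3, 5), axis=(0, 1))      # "next frame": target moved by (3, 5)
shift = detect(alpha_f, x, z)                  # step 4): locate the response peak
print(shift)
```

In this toy setting the new frame is an exact cyclic shift of the training patch, so the response peak recovers the displacement; on real video the peak is only an estimate, and the model would be updated after each frame.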

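4 Experiments

4.1 Experimental setup and dataset

All experiments were run in MATLAB R2016b on a 3.40 GHz Intel(R) Core(TM) i7-6700 CPU with 16 GB RAM. All test videos come from OTB-100^[16], which covers 11 common challenge attributes: illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane and out-of-plane rotation (IPR/OPR), out of view (OV), background clutter (BC), and low resolution (LR). Seven algorithms are used for comparison: the correlation-filter-based KCF^[9], CN^[10], SAMF^[12], DSST^[17], and Staple^[18], and the sparse-representation-based ASLA^[1] and PCOM^[19].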

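4.2 Evaluation methods

Three evaluation metrics are used to verify the effectiveness of the algorithm. The first is precision, $p = c/m$, where the tracking length $c$ is the number of frames whose center location error (CLE) is below a given threshold, as illustrated in Fig. 5, and $m$ is the total number of frames; the threshold is usually set to 20 pixels. The CLE of a frame is the Euclidean distance between the predicted and the ground-truth target centers, and the CLE reported for a sequence is the average over its frames. The second is the success rate $s$. A frame counts as successfully tracked if the overlap ratio of the bounding boxes exceeds a given threshold (usually 0.5), and the success rate is the percentage of successfully tracked frames. The Pascal VOC overlap ratio is computed as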

$ {\mathit{\boldsymbol{R}}_{\rm{V}}} = \frac{{\left| {{\mathit{\boldsymbol{R}}_{\rm{G}}} \cap {\mathit{\boldsymbol{R}}_{\rm{T}}}} \right|}}{{\left| {{\mathit{\boldsymbol{R}}_{\rm{G}}} \cup {\mathit{\boldsymbol{R}}_{\rm{T}}}} \right|}} $

(11)

Fig. 5 Tracking length

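where ${\mathit{\boldsymbol{R}}_{\rm{T}}}$ denotes the tracked bounding box and ${\mathit{\boldsymbol{R}}_{\rm{G}}}$ the ground-truth bounding box. The third is running speed, expressed in frames per second. In addition, the precision plots and success plots commonly used for trackers are given to show overall performance.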
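The first two metrics can be computed directly from per-frame bounding boxes. The following Python sketch is a minimal illustration, not the OTB toolkit: it assumes boxes are given as (x, y, width, height) tuples, and the function names and the three-frame toy data are hypothetical.

```python
import math

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    px, py = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return math.hypot(px - gx, py - gy)

def overlap_ratio(a, b):
    """Pascal VOC overlap of Eq. (11): intersection area over union area."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(preds, gts, threshold=20.0):
    """p = c / m: fraction of frames whose center error is below the threshold."""
    hits = sum(center_error(p, g) < threshold for p, g in zip(preds, gts))
    return hits / len(gts)

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    hits = sum(overlap_ratio(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Toy sequence: a perfect frame, a slightly offset frame, and a lost frame.
gts = [(0, 0, 10, 10)] * 3
preds = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
print(precision(preds, gts))      # 2 of 3 frames are within 20 pixels
print(success_rate(preds, gts))   # 1 of 3 frames overlaps by more than 0.5
```

The second frame shows why the two metrics differ: its center error of 5 pixels passes the precision test, but its overlap ratio of 1/3 fails the success test.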

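4.3 Performance comparison of tracking algorithms

4.3.1 Overall performance analysis

For reasons of space, this section presents several representative sequences. Table 1 summarizes the precision, average center error, and overlap ratio (VOR) obtained in the experiments. The best results are shown in bold and the second-best in italics. Fig. 6 gives the precision and success plots of the 8 trackers on different sequences under one-pass evaluation (OPE). Inspection of the sequences shows that in most scenes the target undergoes motion blur for one reason or another. The edge contours and even the gradient information of the target then become indistinct, and an appearance model built on HOG cannot separate target and background accurately. WMFF, in contrast, can fall back on the color feature when HOG fails, and so builds an appearance model that tracks more robustly.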

Table 1 Performance comparison of different tracking algorithms

Sequence	Metric	CN	KCF	SAMF	DSST	Staple	ASLA	PCOM	WMFF
Basketball	Precision	0.71	0.92	0.93	0.81	0.88	0.00	0.38	0.94
Basketball	CLE	14.22	7.89	6.12	10.92	18.24	194.18	85.41	6.44
Basketball	VOR	0.56	0.68	0.74	0.58	0.72	0.06	0.19	0.74
Coke	Precision	0.62	0.84	0.92	0.92	0.85	0.16	0.05	0.94
Coke	CLE	30.77	18.65	13.82	12.79	14.87	60.35	100.02	13.45
Coke	VOR	0.42	0.56	0.57	0.58	0.58	0.17	0.05	0.59
BlurCar1	Precision	0.01	0.99	1.00	0.99	0.65	0.03	0.01	1.00
BlurCar1	CLE	169.83	5.10	4.65	4.22	108.90	116.55	175.94	4.41
BlurCar1	VOR	0.05	0.80	0.81	0.76	0.53	0.13	0.04	0.81
Bird1	Precision	0.44	0.07	0.31	0.20	0.53	0.00	0.03	0.51
Bird1	CLE	87.00	152.34	92.93	144.03	88.56	170.11	165.26	26.99
Bird1	VOR	0.30	0.05	0.16	0.11	0.30	0.02	0.02	0.28
Deer	Precision	1.00	0.82	0.89	0.79	0.83	0.00	0.03	1.00
Deer	CLE	4.87	21.16	10.28	16.66	19.67	148.77	207.65	5.15
Deer	VOR	0.76	0.64	0.69	0.65	0.67	0.04	0.03	0.76
FaceOcc1	Precision	0.10	0.73	0.90	0.89	0.48	0.43	0.55	0.91
FaceOcc1	CLE	491.59	15.98	13.88	13.77	19.61	20.81	18.93	13.37
FaceOcc1	VOR	0.11	0.76	0.78	0.77	0.79	0.77	0.75	0.79
DragonBaby	Precision	0.32	0.34	0.55	0.06	0.76	0.24	0.33	0.75
DragonBaby	CLE	94.68	50.40	46.06	142.37	21.58	44.46	86.61	21.96
DragonBaby	VOR	0.25	0.31	0.39	0.06	0.54	0.29	0.25	0.57
Lemming	Precision	0.31	0.49	0.27	0.43	0.27	0.17	0.17	0.74
Lemming	CLE	90.55	77.87	154.13	81.91	156.82	180.04	170.35	25.97
Lemming	VOR	0.29	0.38	0.23	0.33	0.24	0.14	0.14	0.57
Panda	Precision	0.25	0.36	1.00	0.22	0.58	1.00	0.14	1.00
Panda	CLE	65.58	42.06	46.76	43.58	51.05	7.01	93.28	7.49
Panda	VOR	0.15	0.16	0.28	0.13	0.32	0.50	0.10	0.50
Note: The best results are shown in bold and the second-best in italics.

Fig. 6 Comparison of results of different algorithms on different test sequences

((a)precision plot of Basketball; (b)success plot of Basketball; (c)precision plot of Lemming; (d)success plot of Lemming; (e)precision plot of Panda; (f)success plot of Panda)

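4.3.2 Per-scene performance analysis

Take the Bird1 sequence as an example. Its tracking difficulties include DEF, FM, and OV; in particular, the target is hidden by clouds for a long period during flight, at which point almost all trackers lose the target, WMFF included. Compared with the other algorithms, however, WMFF attains a much lower center error while keeping a comparatively high overlap ratio and precision, which indicates that it tracks for longer and deviates less from the true target position. The reason is that just before the target enters the clouds, their translucency blurs the outline of the bird's wings, so the extracted HOG feature model can no longer describe the target accurately, whereas the color difference between the target and the clouds still separates the target from the cloud background. The results therefore exceed those of the HOG-based KCF and also confirm that the target model built on the HCG feature is more robust than SAMF.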

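The Deer sequence involves MB, FM, IPR, BC, and LR, and in particular interference from similar objects in the background. CN alone already tracks this sequence well, but with the combined feature of SAMF the center error increases, and inspection of the results shows that the target is lost in a few frames in the middle of the sequence. Those are exactly the frames in which similar nearby objects interfere with the target; its gradient feature is then indistinct and HOG describes it poorly. The WMFF model gives CN a larger share, lets it play a greater role, and tracks successfully.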

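The DragonBaby sequence involves SV, OCC, MB, FM, IPR, OPR, and OV. In particular, the boy's pose changes continually and by large amounts during the fight, which causes motion blur. Most algorithms, including SAMF, drift at this point, whereas WMFF makes full use of the color difference between target and background. Its tracking length far exceeds that of KCF and CN, and its precision is 36.4% higher than that of SAMF in relative terms (0.75 versus 0.55 in Table 1).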

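4.3.3 Feature comparison analysis

To confirm that the proposed method clearly improves tracking, the influence of each feature on the tracker is examined, and experiments that use each feature on its own verify the effectiveness of the proposed algorithm.

Fig. 7 compares the average precision and success plots on OTB-100 obtained with HOG, CN, and raw grayscale alone against those of the proposed algorithm. Weighted multi-feature fusion clearly outperforms every single feature, which demonstrates the effectiveness of feature fusion.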

Fig. 7 Performance comparison between single features and WMFF ((a) success plot on OTB-100; (b) precision plot on OTB-100)

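4.3.4 Runtime analysis

Most tracking applications require real-time operation, and a video stream appears smooth to the human eye only at 15 frames/s or more. To verify the real-time performance of WMFF, its average speed over the 100 sequences is compared, in frames per second, with that of the other 7 algorithms; Fig. 8 shows the results. KCF is the fastest, followed by CN, and WMFF ranks third with an average speed of about 150 frames/s. Feature combination makes it slower than CN and KCF, but it is about 50 frames/s faster than SAMF. It thus meets the real-time requirement while preserving accuracy.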

Fig. 8 Comparison of the average tracking speed of each algorithm on OTB-100

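5 Conclusion

Building on existing correlation filter trackers, this paper studies how the features of the target are extracted and combined. From an analysis of the grayscale, HOG, and CN features, it proposes a hybrid multi-feature model dominated by HOG and supplemented by CN and Gray, which improves robustness by assigning different weights to the features. Experimental comparison with existing algorithms shows that the proposed WMFF algorithm achieves the best or second-best value of each evaluation metric on multiple sequences.

The algorithm does not account for changes in target scale, which is another major factor affecting tracker accuracy. Future work will focus on capturing target scale changes.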

References

[1] Jia X, Lu H C, Yang M H. Visual tracking via adaptive structural local sparse appearance model[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012: 1822-1829.[DOI: 10.1109/CVPR.2012.6247880]

[2] Zhong W, Lu H C, Yang M H. Robust object tracking via sparse collaborative appearance model[J]. IEEE Transactions on Image Processing, 2014, 23(5): 2356–2368. [DOI:10.1109/TIP.2014.2313227]

[3] Babenko B, Yang M H, Belongie S. Robust object tracking with online multiple instance learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(8): 1619–1632. [DOI:10.1109/TPAMI.2010.226]

[4] Grabner H, Grabner M, Bischof H. Real-time tracking via online boosting[C]//Proceedings of 2006 British Machine Vision Conference. Edinburgh, UK: BMVA Press, 2006: 47-56.[DOI: 10.5244/C.20.6]

[5] Hare S, Saffari A, Torr P H S. Struck: structured output tracking with kernels[C]//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE, 2011: 263-270.[DOI: 10.1109/ICCV.2011.6126251]

[6] Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 4293-4302.[DOI: 10.1109/CVPR.2016.465]

[7] Bolme D S, Beveridge J R, Draper B A, et al. Visual object tracking using adaptive correlation filters[C]//Proceedings of 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE, 2010: 2544-2550.[DOI: 10.1109/CVPR.2010.5539960]

[8] Henriques J F, Caseiro R, Martins P, et al. Exploiting the circulant structure of tracking-by-detection with kernels[C]//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer, 2012: 702-715.[DOI: 10.1007/978-3-642-33765-9_50]

[9] Henriques J F, Caseiro R, Martins P, et al. High-Speed tracking with Kernelized correlation filters[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583–596. [DOI:10.1109/TPAMI.2014.2345390]

[10] Danelljan M, Khan F S, Felsberg M, et al. Adaptive color attributes for real-time visual tracking[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE, 2014: 1090-1097.[DOI: 10.1109/CVPR.2014.143]

[11] Khan F S, Anwer R M, van de Weijer J, et al. Color attributes for object detection[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012: 3306-3313.[DOI: 10.1109/CVPR.2012.6248068]

[12] Li Y, Zhu J K. A scale adaptive kernel correlation filter tracker with feature integration[C]//Proceedings of 2014 European Conference on Computer Vision. Zurich, Switzerland: Springer, 2015: 254-265.[DOI: 10.1007/978-3-319-16181-5_18]

[13] Xu F L, Wang H P, Song Y L, et al. A multi-scale kernel correlation filter tracker with feature integration and robust model updater[C]//Proceedings of the 29th Chinese Control and Decision Conference. Chongqing, China: IEEE, 2017: 1934-1939.[DOI: 10.1109/CCDC.2017.7978833]

[14] Huang D F, Luo L, Wen M, et al. Enable scale and aspect ratio adaptability in visual tracking with detection proposals[C]//Proceedings of the British Machine Vision Conference. Swansea, UK: BMVA Press, 2015.[DOI: 10.5244/C.29.185]

[15] Li F, Yao Y J, Li P H, et al. Integrating boundary and center correlation filters for visual tracking with aspect ratio variation[C]//Proceedings of 2017 IEEE International Conference on Computer Vision Workshop. Venice, Italy: IEEE, 2017: 2001-2009.[DOI: 10.1109/ICCVW.2017.234]

[16] Wu Y, Lim J, Yang M H. Object tracking benchmark[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1834–1848. [DOI:10.1109/TPAMI.2014.2388226]

[17] Danelljan M, Häger G, Khan F S, et al. Accurate scale estimation for robust visual tracking[C]//Proceedings of 2014 British Machine Vision Conference. Nottingham: BMVC Press, 2014.[DOI: 10.5244/C.28.65]

[18] Bertinetto L, Valmadre J, Golodetz S, et al. Staple: complementary learners for real-time tracking[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 1401-1409.[DOI: 10.1109/CVPR.2016.156]

[19] Wang D, Lu H C. Visual tracking via probability continuous outlier model[C]//Proceedings of 2014 IEEE Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 3478-3485.[DOI: 10.1109/CVPR.2014.445]