Published: 2020-06-16
DOI: 10.11834/jig.190437
2020 | Volume 25 | Number 6

Image Analysis and Recognition

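Adaptive scale sudden change object tracking

Ren Junli¹, Guo Hao¹, Dong Yafei², Liu Ru¹, An Jubai¹, Wang Yan¹

1. School of Information Science and Technology, Dalian Maritime University, Dalian 116026;

2. School of Software, Beihang University, Beijing 100191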

Received: 2019-08-29; Revised: 2019-11-15; Preprint: 2019-11-22

Funding: National Natural Science Foundation of China (61471079)

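About the first author: Ren Junli, born in 1991, female, master's student; research interests: object tracking and object detection. E-mail: Junli_Ren1@163.com;
Guo Hao, male, associate professor; research interests: image classification, image segmentation, and pattern recognition. E-mail: guohao0512@dlmu.edu.cn;
Dong Yafei, male, master's student; research interests: object detection and virtual reality. E-mail: dyf0126@163.com;
Liu Ru, female, master's student; research interest: image segmentation. E-mail: 1669594887@qq.com;
An Jubai, male, professor; research interests: digital image processing, computer vision, and marine oil spill processing. E-mail: jubaian@dlmu.edu.cn;
Wang Yan, female, master's student; research interest: image registration. E-mail: 657933235@qq.com.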

CLC number: TP391

Document code: A

Article ID: 1006-8961(2020)06-1150-10

Abstract

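Objective Sudden scale change is a highly challenging problem in object tracking. When the scale of the object changes abruptly within a short time, tracking cues are lost and accumulated tracking errors lead to drift. To address this problem, a detect-then-track algorithm that adapts to sudden scale change is proposed (kernelized correlation filter_you only look once, KCF_YOLO). Method In the training stage of tracking, a correlation filter tracker is used for fast tracking; in the detection stage, the YOLO (you only look once) V3 neural network is used. An adaptive template update strategy is designed: the similarity between each detected object and the object template, obtained by fusing color features and image fingerprint features, is compared with a threshold to judge whether the object is occluded and hence whether the object template should be updated in the current frame. Result To demonstrate the effectiveness of the method, experiments are run on 11 video sequences from the OTB (object tracking benchmark) 2015 dataset that are representative of sudden scale change, in which the object scale changes by factors ranging from 0.1 to 9.2. The proposed method achieves an average tracking precision of 0.955 and an average tracking speed of 36 frames/s; compared with classical scale-adaptive tracking algorithms, precision improves by 31.74% on average. Conclusion This paper combines correlation filtering and a neural network in a detect-then-track scheme, improving the adaptability of the algorithm to sudden scale change during tracking. The experimental results verify that adding the detection strategy effectively corrects the tracking drift caused by subsequent sudden scale change, and they confirm the effectiveness of the adaptive template update strategy.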

Keywords

object tracking; correlation filtering; neural network detection; sudden scale change; scale adaptation

Adaptive scale sudden change object tracking

Ren Junli¹, Guo Hao¹, Dong Yafei², Liu Ru¹, An Jubai¹, Wang Yan¹

1. School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China;

2. School of Software, Beihang University, Beijing 100191, China

Supported by: National Natural Science Foundation of China (61471079)

Abstract

Objective Video-based object detection and tracking have long been research topics of great interest in computer vision. Video object tracking has important research significance and broad application prospects in intelligent surveillance, human-computer interaction, robot visual navigation, and other areas. Although the theory of video object tracking has made considerable progress and several results have entered practical use, the technology still faces major challenges, such as scale change, illumination change, motion blur, object deformation, and occlusion. These make visual tracking difficult, particularly when the object scale changes abruptly within a short time: tracking cues are lost, and accumulated tracking errors lead to drift. If the object is normalized to a consistent scale, considerable scale information is lost. Sudden scale change is therefore a challenging problem in object tracking. To address it, this study proposes a tracking algorithm that adapts to sudden scale change (kernelized correlation filter_you only look once, KCF_YOLO). Method The algorithm uses a correlation filter tracker for fast tracking in the training stage and the you only look once (YOLO) V3 neural network in the detection stage. An adaptive template update strategy is also designed: the similarity between each detected object and the object template, obtained by fusing color features and image fingerprint features, is used to determine whether occlusion has occurred and whether the object template should be updated in the current frame. In the first frame of the video, the object is selected (the category of the tracked object is assumed to be person), and the object region is stored as the object template T. After selection, the training stage begins and the KCF algorithm is used for tracking. KCF extracts multichannel histogram of oriented gradients (HOG) features of the object template. During tracking, a region 1.5 times the size of the object template is used as the search range in the next frame, which considerably reduces the search range and markedly improves tracking speed. When the frame number is a multiple of 20, the detection stage begins and YOLO V3 is used for object detection. YOLO V3 identifies all persons (P₁, P₂, P₃, …, Pₙ) in the current frame. Each detected person is compared with the stored object template: color and image fingerprint features are extracted and their fused similarity is computed. If the similarity is greater than the average similarity over the previous 20 frames, the object template is updated to the person with the greatest similarity, and the scale of the tracking box is updated in accordance with the YOLO detection to achieve scale adaptation; otherwise, the object is judged to be occluded and the template is not updated. In the tracking stage, the updated or non-updated template is used as the latest state of the object for subsequent tracking. These steps are repeated every 20 frames until the video ends. The color and image fingerprint features are complementary. The image fingerprint feature uses the perceptual hash (PHash) algorithm. After the discrete cosine transform, most image information is concentrated in the low-frequency region, so computation is restricted to that region and color information is lost. The color feature captures the distribution of colors over the entire image. Combining the two improves the reliability of the similarity. Result A total of 11 video sequences representative of sudden scale change in the object tracking benchmark (OTB)-2015 dataset are tested to demonstrate the effectiveness of the proposed method. The average tracking precision of the algorithm is 0.955, and the average tracking speed is 36 frames/s. On self-recorded sequences in which the object reappears after being completely occluded for 130 frames, tracking precision is 0.9, which supports the validity of combining kernelized correlation filtering with the YOLO V3 network. Compared with classical scale-adaptive tracking algorithms, precision improves by 31.74% on average. Conclusion In this study, correlation filtering and a neural network are combined in a detect-then-track scheme, improving the adaptability of the algorithm to sudden scale change during tracking. The experimental results show that the detection strategy corrects the tracking drift caused by subsequent sudden scale change and confirm the effectiveness of the adaptive template update strategy. To address the inability of the traditional kernelized correlation filter to handle sudden scale change within a short time and the slow tracking speed of neural networks, this work builds a bridge between correlation filters and neural networks, opening a new direction for trackers that combine the two.

Key words

object tracking; correlation filtering; neural network detection; scale mutation; scale adaptation

0 Introduction

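Object tracking is an important and challenging topic in computer vision. Tracking precision depends heavily on external factors such as illumination change, occlusion, sudden scale change, camera shake, and object deformation (Choi et al., 2017). When the object scale changes abruptly within a short time, the scale differs greatly between frames, and normalizing it to a fixed scale discards much information. Although many tracking algorithms exist, none handles the tracking drift caused by sudden scale change satisfactorily. For a stable tracking system to cope with sudden scale change, two key points must be solved: scale adaptation and template updating.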

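In signal processing, cross-correlation indicates whether two signals are correlated in the frequency domain: the more similar the two signals, the higher their correlation value. Cross-correlation was first introduced into computer vision for object tracking by Bolme et al. (2010) with the minimum output sum of squared error (MOSSE) filter. To avoid overfitting, the CSK (circulant structure of tracking-by-detection with kernels) tracker of Henriques et al. (2012) added a regularization term to the MOSSE model and proposed the concept of a least-squares classifier; it also introduced circulant matrices and kernel functions, which greatly increased tracking speed.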

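Henriques et al. (2015) extended the kernelization to multiple channels, using a kernelized ridge regression model with multichannel HOG (histogram of oriented gradients) features in correlation tracking, and proposed the kernelized correlation filter (KCF). KCF brought considerable progress to correlation-filter tracking and attracted wide attention. However, KCF updates the object model by linear interpolation, so update errors accumulate and cause tracking drift; in particular, it cannot track accurately when the object scale changes abruptly.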

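Deep-learning-based trackers include FCNT (visual tracking with fully convolutional networks) (Wang et al., 2015), SiameseFC (fully-convolutional Siamese networks) (Bertinetto et al., 2016), MDNet (multi-domain convolutional neural networks) (Nam and Han, 2016), and TCNN (convolutional neural networks in a tree structure) (Nam et al., 2016). Because deep models have multilayer network structures and high algorithmic complexity, training and updating them is time-consuming; their average tracking speed is only about 4 frames/s, which hardly meets real-time requirements.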

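Convolutional neural networks are widely used in object detection, image segmentation, face recognition, and other fields. Ren et al. (2017) proposed Faster R-CNN (region-based convolutional neural network), which replaces the selective search algorithm with an RPN (region proposal network) to generate candidate boxes and, instead of running a convolution for each candidate box as it is generated, runs the convolution once after all candidate boxes have been generated; these two improvements greatly increase speed. To address the slow speed of two-stage detectors, YOLO (you only look once) introduced the one-stage approach, achieving true end-to-end detection by performing object classification and localization in a single step. YOLO runs at 45 frames/s, which meets the real-time requirement (25 frames/s is considered real time) (Redmon et al., 2016).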

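To improve recall and localization, YOLO V2 (Redmon and Farhadi, 2017) proposed a joint training method on top of YOLO V1, which uses labeled detection datasets for precise localization and classification datasets to increase the number of categories and robustness. YOLO V3 (Redmon and Farhadi, 2018) adds multi-scale prediction (feature pyramid networks, FPN) to YOLO V2 and uses a better backbone classification network (ResNet) and classifier, laying the foundation for improved precision and speed.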

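During feature extraction, convolutional neural networks adapt well to changes in the environment and in the scale and shape of the object. Considering both precision and speed, this paper uses an improved YOLO V3, trained on a large number of samples of different object categories, to detect objects during tracking, thereby reducing error accumulation and improving tracking precision. For scale adaptation, the YOLO V3 network detects the object every 20 frames and updates the scale, and a similarity-based strategy decides whether to update the template in the current frame.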

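1 Proposed algorithm

This paper proposes KCF_YOLO, an algorithm that combines YOLO V3 object detection with the KCF tracking algorithm. Fig. 1 shows the flow chart, Fig. 2 is a schematic of the detection network structure, and Fig. 3 shows the KCF training and tracking process, where FFT denotes the fast Fourier transform and IFFT the inverse fast Fourier transform.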

Fig. 1 General flow chart

Fig. 2 Schematic of the YOLO V3 network structure

Fig. 3 KCF training and tracking process

In KCF, the detected object position is the location at which the response function attains its maximum, that is

$ \hat{\boldsymbol{f}}(\boldsymbol{z}) = \hat{\boldsymbol{k}}^{xz} \odot \hat{\alpha} $

(1)

where a hat (^) denotes the discrete Fourier transform of a quantity, $ \odot $ denotes the element-wise product, $ \boldsymbol{x} $ is the training sample, $ \boldsymbol{z} $ is the candidate region, $ \hat{\boldsymbol{k}}^{xz} $ is the kernel correlation of $ \boldsymbol{x} $ and $ \boldsymbol{z} $, and $ \alpha $ is the optimization variable.

Experiments show that KCF tracks best when a Gaussian kernel is used and the single-channel formulation is extended to multiple channels. The Gaussian kernel is

$ \boldsymbol{k}^{xz} = \exp \left( - \frac{1}{\sigma^2} \left( \| \boldsymbol{x} \|^2 + \| \boldsymbol{z} \|^2 - 2 F^{-1} \left( \sum_c \hat{\boldsymbol{x}}_c^* \odot \hat{\boldsymbol{z}}_c \right) \right) \right) $

(2)

where $ \sigma $ is the standard deviation of the Gaussian; $ \boldsymbol{x} $ and $ \boldsymbol{z} $ are the samples whose correlation is computed; $ F^{-1} $ denotes the inverse Fourier transform; $ c $ is the channel index; $ \hat{\boldsymbol{x}}_c $ and $ \hat{\boldsymbol{z}}_c $ are the Fourier transforms of the $ c $-th channel of the training sample and the detection sample, respectively; and $ \hat{\boldsymbol{x}}_c^* $ is the complex conjugate of $ \hat{\boldsymbol{x}}_c $.
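Equations (1) and (2) can be illustrated with a minimal NumPy sketch. It follows Eq. (2) literally (no normalization by the number of elements, which some KCF implementations add), and the training rule $ \hat{\alpha} = \hat{\boldsymbol{y}} / (\hat{\boldsymbol{k}}^{xx} + \lambda) $ is taken from the KCF paper (Henriques et al., 2015) rather than from this text. The function names, the regularization value `lam`, and the random test features are illustrative assumptions.

```python
import numpy as np


def gaussian_kernel_correlation(x, z, sigma):
    """Eq. (2): Gaussian kernel correlation of x and z for all cyclic shifts.

    x and z are feature maps of shape (H, W, C); the result has shape (H, W).
    """
    xf = np.fft.fft2(x, axes=(0, 1))
    zf = np.fft.fft2(z, axes=(0, 1))
    # Sum over channels of conj(x_hat) * z_hat, then inverse FFT.
    cross = np.real(np.fft.ifft2(np.sum(np.conj(xf) * zf, axis=2)))
    dist = np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)


def train(x, y, sigma, lam=1e-4):
    """Kernel ridge regression in the Fourier domain (assumed from the KCF paper)."""
    kxx = gaussian_kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(kxx) + lam)


def detect(alpha_hat, x, z, sigma):
    """Eq. (1): response map and the location of its maximum."""
    kxz = gaussian_kernel_correlation(x, z, sigma)
    response = np.real(np.fft.ifft2(np.fft.fft2(kxz) * alpha_hat))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return response, tuple(int(p) for p in peak)


# Usage: a candidate that is a cyclic shift of the training sample
# should produce a response peak at that shift.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 3))       # stand-in for multichannel HOG features
y = np.zeros((16, 16))
y[0, 0] = 1.0                              # desired response: peak at zero shift
alpha_hat = train(x, y, sigma=0.5)
z = np.roll(x, shift=(2, 3), axis=(0, 1))  # object moved by (2, 3) cells
_, peak = detect(alpha_hat, x, z, sigma=0.5)
```

The whole computation uses only element-wise products and FFTs, which is why the correlation filter stage is fast enough to run on every frame.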

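KCF_YOLO proceeds as follows:

1) Object selection. Read the first frame of the video and select the object; the category of the tracked object is assumed to be person. Store the selected object region as the object template $ T $.

2) Detection stage. YOLO V3 is used for object detection. Building on feature extraction by a fully convolutional neural network, YOLO V3 adopts the Darknet-53 network structure (with 53 convolutional layers), borrows the residual-network design, and detects objects on feature maps at 3 different scales. YOLO V3 identifies all persons $ \left(P_{1}, P_{2}, P_{3}, \cdots, P_{n}\right) $ in the current frame.

3) Training stage. After the object is selected, KCF is used for tracking. KCF extracts multichannel HOG (histogram of oriented gradients) features of the object template, transforms them to the frequency domain with the FFT to reduce computation, then applies the IFFT and takes the point of maximum response as the predicted position. During tracking, a region 1.5 times the size of the object template is used as the search range in the next frame, which greatly reduces the search range and greatly increases tracking speed.

4) Template update stage. For every candidate person identified, extract the color feature $ {P_i^{\rm{c}}} $ and the image fingerprint feature $ {P_i^{\rm{p}}} $ and compare the similarity $ \left({P_i^{\rm{c}} + P_i^{\rm{p}}} \right) $. If the similarity is greater than $ S $ ($ S $ is the mean, over the previous 20 frames of the object, of the sum of the color feature and the image fingerprint feature), the object template is updated to $ {P_i} $, and the scale of the tracking box is updated to the object size detected by YOLO V3, which realizes scale adaptation; otherwise the object is judged to be occluded and the template is not updated.

5) Tracking stage. The updated or non-updated template is used as the latest state of the object for subsequent tracking. Steps 2) to 5) are repeated every 20 frames until the video ends, at which point tracking ends.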
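The control flow of steps 1) to 5) can be sketched as follows. This is an outline under stated assumptions, not the authors' C++ implementation: the tracker, the detector, the similarity measure, and the template store are passed in as hypothetical callables (`track`, `detect`, `similarity`, `set_template`), and $ S $ is interpreted as the mean similarity of the tracked box to the template over the previous `interval` frames.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), an assumed convention


def run_kcf_yolo(
    frames: Sequence[object],
    init_box: Box,
    track: Callable[[object, Box], Box],         # hypothetical KCF step
    detect: Callable[[object], List[Box]],       # hypothetical YOLO V3 person detector
    similarity: Callable[[object, Box], float],  # fused similarity to the current template
    set_template: Callable[[object, Box], None],
    interval: int = 20,
) -> List[Box]:
    """Return one box per frame, following steps 1) to 5)."""
    set_template(frames[0], init_box)            # step 1: store template T
    box = init_box
    history = [similarity(frames[0], init_box)]  # similarities of recent frames
    boxes = [init_box]
    for idx in range(1, len(frames)):
        frame = frames[idx]
        box = track(frame, box)                  # steps 3 and 5: correlation filter
        if idx % interval == 0:                  # step 2: detection stage
            threshold = sum(history) / len(history)  # S over the previous frames
            scored = [(similarity(frame, cand), cand) for cand in detect(frame)]
            if scored:
                best_score, best_box = max(scored, key=lambda item: item[0])
                if best_score > threshold:       # step 4: update template and scale
                    box = best_box
                    set_template(frame, best_box)
            # Otherwise the object is treated as occluded; the template is kept.
        history = (history + [similarity(frame, box)])[-interval:]
        boxes.append(box)
    return boxes
```

Because the detector runs only on every `interval`-th frame, the per-frame cost is dominated by the correlation filter, which is the trade-off between speed and scale accuracy discussed in the experiments.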

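Image fingerprint features are extracted with the PHash (perceptual hash) algorithm. Using the discrete cosine transform, the algorithm turns the features contained in a grayscale image into a fingerprint (hash value); computing image similarity then amounts to computing the similarity of the fingerprints. The steps are as follows:

1) Compute the discrete cosine transform (DCT). Compute the DCT of the image to obtain a 32×32 DCT coefficient matrix;

2) Reduce the DCT. Keep the top-left 8×8 matrix (the lowest frequencies);

3) Compute the mean. Compute the mean of the DCT coefficients;

4) Compute the hash value. From the 8×8 DCT matrix, build a 64-bit hash of 0s and 1s: coefficients greater than or equal to the DCT mean are set to "1", and those below the mean are set to "0". Flattening the resulting 2D array to 1D gives a 64-bit vector, which is the fingerprint of the image. The fingerprints of two images are computed, and their similarity is measured with the Hamming distance.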
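The four PHash steps can be sketched in NumPy as below. The sketch assumes a grayscale input array, uses nearest-neighbor sampling to reach 32×32 (the text does not say how the image is resized), takes the mean over the retained 8×8 block, and maps the Hamming distance to a similarity as 1 − d/64; these choices and the function names are assumptions for illustration.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


def resize_nearest(gray, size=32):
    """Nearest-neighbor resize of a 2D array to size x size (assumed resize method)."""
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    return gray[np.ix_(rows, cols)].astype(np.float64)


def phash(gray, size=32, keep=8):
    """Return the 64-element boolean fingerprint of a grayscale image."""
    small = resize_nearest(gray, size)
    c = dct_matrix(size)
    coeffs = c @ small @ c.T             # step 1: 32x32 2D DCT
    low = coeffs[:keep, :keep]           # step 2: top-left 8x8 block
    return (low >= low.mean()).ravel()   # steps 3 and 4: threshold at the mean


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return int(np.count_nonzero(a != b))


def fingerprint_similarity(a, b):
    """Map the Hamming distance to [0, 1]; 1 means identical fingerprints."""
    return 1.0 - hamming_distance(a, b) / a.size


# Usage with synthetic grayscale images.
rng = np.random.default_rng(0)
img_a = rng.random((64, 48))
img_b = rng.random((64, 48))
sim_same = fingerprint_similarity(phash(img_a), phash(img_a))
sim_diff = fingerprint_similarity(phash(img_a), phash(img_b))
```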

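Color features are computed as statistics in HSV (hue, saturation, value) space, which greatly reduces the influence of scale change and deformation of the object. Each HSV channel is divided evenly into m intervals, giving m³ (denoted N) color bins in total. Counting the pixels that fall in each color bin yields a vector, and the similarity between two such vectors is the color-feature similarity.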
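A minimal sketch of the m³-bin color feature follows. It assumes the patch is already in HSV with all three channels scaled to [0, 1], normalizes the histogram so that patches of different sizes are comparable, and uses histogram intersection as the vector similarity; the text does not name the similarity measure, so that choice is an assumption.

```python
import numpy as np


def hsv_histogram(hsv, m=8):
    """Normalized histogram with m**3 bins for an (H, W, 3) HSV array in [0, 1]."""
    idx = np.minimum((hsv * m).astype(np.int64), m - 1)    # per-channel bin index
    flat = (idx[..., 0] * m + idx[..., 1]) * m + idx[..., 2]
    hist = np.bincount(flat.ravel(), minlength=m ** 3).astype(np.float64)
    return hist / hist.sum()


def color_similarity(h1, h2):
    """Histogram intersection in [0, 1] (assumed similarity measure)."""
    return float(np.minimum(h1, h2).sum())


# Usage: the same patch at twice the size keeps the same color distribution.
rng = np.random.default_rng(0)
patch = rng.random((30, 20, 3))
patch_2x = np.repeat(np.repeat(patch, 2, axis=0), 2, axis=1)
sim_scaled = color_similarity(hsv_histogram(patch, m=4), hsv_histogram(patch_2x, m=4))
```

Because only the proportion of pixels per bin matters, the feature is largely insensitive to the size of the detected box, which is the property the paragraph above relies on.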

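In step 4), the YOLO V3 network detects all candidate persons; the color feature and the image fingerprint feature of each person are computed and compared for similarity with those of the object template. If the fused color and fingerprint similarity ($ {P_i^{\rm{c}}} $ + $ {P_i^{\rm{p}}} $) of a candidate is greater than $ S $, the person with the greatest similarity is taken as the object to keep tracking and the template is updated; otherwise the object is considered occluded and not yet reappeared, and the object template is not updated.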

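The color feature and the image fingerprint feature are complementary. The image fingerprint feature uses the perceptual hash algorithm: after the discrete cosine transform, most of the image information is concentrated in the low-frequency region, so computation is restricted to that region, and color information is lost. The color feature, in turn, describes the distribution of colors over the whole image. Combining the two makes the similarity more reliable.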

2 Dataset

Experiments are carried out on 11 representative video sequences from the OTB (object tracking benchmark) 2015 dataset in which the object scale changes abruptly within a short time. The selected videos and their main challenges are listed in Table 1. The main challenges are scale variation (SV), occlusion (OCC), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), illumination variation (IV), motion blur (MB), deformation (DEF), background clutter (BC), and low resolution (LR).

Table 1 Details of the video sequences selected for the experiments

No.	Video sequence	Main challenges	Frames of scale change	Object width and height change/mm	Scale change ratio
1	CarScale	SV, OCC, FM, IPR, OPR	135→187	(72, 44)→(158, 74)	3.69
2	Human2	SV, IV, MB, OPR	148→285	(95, 330)→(82, 209)	0.55
3	Human3	SV, OCC, DEF, OPR, BC	667→686→750	(29, 65)→(15, 51)→(26, 99)	0.41→3.36
4	Human5	SV, OCC, DEF	202→276→433→480	(15, 42)→(28, 90)→(55, 182)→(34, 99)	4→3.97→0.34
5	Human6	SV, OCC, DEF, FM, OPR, OV	170→250	(21, 59)→(61, 187)	9.206
6	Human7	SV, IV, OCC, DEF, MB, FM	1→69	(37, 116)→(27, 80)	0.503
7	RedTeam	SV, OCC, IPR, OPR, LR	548→822	(72, 34)→(23, 14)	0.132
8	Vase	SV, FM, IPR	58→79→90→110	(49, 61)→(140, 147)→(64, 75)→(145, 171)	6.885→0.233→5.166
9	Walking	SV, OCC, DEF	59→207	(28, 71)→(17, 47)	0.402
10	Walking2	SV, OCC, LR	186→262→378	(30, 79)→(17, 71)→(18, 62)	0.509→0.9246
11	Woman	SV, IV, OCC, DEF, MB, FM, OPR	553→586	(20, 81)→(41, 171)	4.378
Note: SV, scale variation; OCC, occlusion; FM, fast motion; IPR, in-plane rotation; OPR, out-of-plane rotation; IV, illumination variation; MB, motion blur; DEF, deformation; BC, background clutter; LR, low resolution; OV, out of view.

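These videos exhibit not only sudden scale change but also background clutter, occlusion, deformation, and other problems that make tracking harder. In Table 1, the frames of scale change are the start and end frame numbers of the selected representative scale changes; the values in parentheses under object width and height change are the width and height of the object at the start or end frame of the scale change; and the scale change ratio is the ratio of the object area at the end frame to the object area at the start frame. A ratio greater than 1 means that the scale grows, and a ratio less than 1 means that it shrinks.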
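As a worked example of this definition, the scale change ratio of a Table 1 row can be recomputed from its width and height pairs. The helper name `area_ratio` is illustrative.

```python
def area_ratio(start_wh, end_wh):
    """Ratio of end-frame box area to start-frame box area."""
    w0, h0 = start_wh
    w1, h1 = end_wh
    return (w1 * h1) / (w0 * h0)


# CarScale: (72, 44) -> (158, 74), i.e. 11692 / 3168
carscale = round(area_ratio((72, 44), (158, 74)), 2)    # 3.69
# RedTeam: (72, 34) -> (23, 14), i.e. 322 / 2448
redteam = round(area_ratio((72, 34), (23, 14)), 3)      # 0.132
```

Because the ratio is taken over areas, doubling both width and height corresponds to a ratio of 4, not 2.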

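Fig. 4 shows the sudden scale changes in the video sequences Human3, Human6, CarScale, and RedTeam. In Human3 the scale shrinks to 0.41 times within 19 frames (the factor is the ratio of the object areas in the two frames) and then grows to 3.36 times within the following 64 frames; in Human6 the scale grows to 9.21 times within 80 frames; in CarScale it grows to 3.69 times within 52 frames; and in RedTeam it shrinks to 0.132 times within 274 frames. A sudden scale change within a short time causes tracking cues to be lost, and traditional tracking methods cannot adapt to it, so errors accumulate and tracking drifts. This paper studies the combination of correlation filtering and neural networks to address this problem.

Fig. 4 Sudden scale changes in the video sequences ((a) Human3; (b) Human6; (c) CarScale; (d) RedTeam)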

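3 Experiments and analysis of results

The experiments were run on an Intel(R) Core(TM) i5-3210M CPU @ 2.50 GHz with 4.00 GB of memory under Linux, and the algorithm was implemented in C++. The comparison methods use publicly available MATLAB code run under Windows. All videos used in the experiments were downloaded from the OTB-50 and OTB-100 sequences on the official website.

3.1 Comparison of tracking results

The proposed method improves on KCF so that it adapts to sudden scale change. To demonstrate its effectiveness, tracking methods with scale adaptation capability are chosen for comparison. Precision here is measured by the error between the center of the tracked bounding box and the center of the manually annotated bounding box. The comparison results are given in Table 2.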

Table 2 Comparison of tracking precision

No.	Video sequence	KCF	SAMF	fDSST	DSST	TLD	KCF_YOLO (ours)
1	CarScale	0.806	0.849	0.813	0.758	0.853	1.000
2	Human2	0.171	0.646	0.160	0.364	0.257	0.832
3	Human3	0.006	0.006	0.006	0.306	0.008	0.950
4	Human5	0.265	0.244	0.997	0.024	1.000	1.000
5	Human6	0.290	0.923	0.876	0.448	0.458	0.989
6	Human7	0.472	0.448	0.498	0.448	1.000	0.896
7	RedTeam	1.000	1.000	1.000	1.000	0.721	1.000
8	Vase	0.793	0.782	0.764	0.849	0.964	0.841
9	Walking	1.000	1.000	1.000	1.000	0.426	1.000
10	Walking2	0.440	1.000	0.868	1.000	0.426	1.000
11	Woman	0.938	0.938	0.938	0.938	0.191	0.997
	Average	0.562	0.684	0.720	0.649	0.573	0.955
Note: bold indicates the best value in each row.
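One reading that reproduces the 31.74% average improvement quoted in the abstract is the difference, in percentage points, between the KCF_YOLO average and the mean of the five baseline averages in Table 2. The text does not state how the figure was derived, so the short check below is an interpretation rather than the authors' calculation.

```python
# Column averages copied from the last row of Table 2.
baseline_avg = {"KCF": 0.562, "SAMF": 0.684, "fDSST": 0.720, "DSST": 0.649, "TLD": 0.573}
ours_avg = 0.955

baseline_mean = sum(baseline_avg.values()) / len(baseline_avg)   # about 0.6376
gain_points = (ours_avg - baseline_mean) * 100                   # about 31.74
print(f"mean of baselines = {baseline_mean:.4f}, gain = {gain_points:.2f} points")
```

Under this reading the improvement is an absolute gain in precision, not a relative one.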

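As Table 2 shows, KCF (Henriques et al., 2015) updates only the $ \left({x, y} \right) $ position of the object while the size of the bounding box stays fixed, so it copes poorly with sudden scale change. SAMF (Li and Zhu, 2014) improves on KCF by adding color name (CN) features to feature extraction, that is, combining HOG and CN features, and by using a scale pool of 7 scales {1, 0.985, 0.99, 0.995, 1.005, 1.01, 1.015}; cycling through the pool to select the best scale sacrifices tracking speed. DSST (discriminative scale space tracker) (Danelljan et al., 2014) uses two mutually independent filters, a scale filter and a position filter, and performs scale estimation and object localization in a single pass; its scale component is initialized with 17 scale factors and 33 interpolated scale factors for scale estimation. fDSST (fast discriminative scale space tracking) (Danelljan et al., 2017) accelerates DSST by applying PCA (principal component analysis) dimensionality reduction and QR (orthogonal-triangular) decomposition to the position filter and the scale filter to lower the computational cost and increase speed. TLD (tracking-learning-detection) (Kalal et al., 2012) tracks with binary pattern features using a strategy of learning while detecting. The proposed KCF_YOLO detects first and then tracks, using neural network detection for template updating and scale adaptation. Table 3 summarizes the feature extraction and scale adaptation of these 6 trackers.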

Table 3 Differences among six trackers

Name	Features	Scale adaptive
KCF_YOLO (ours)	HOG, CN, PHash	Yes
KCF	HOG	No
SAMF	Raw pixels, HOG, CN	Yes
fDSST	HOG	Yes
DSST	HOG	Yes
TLD	Binary patterns	Yes

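Distance precision (DP) reflects the change in the position of the object center and thus the positional precision of a tracker, but it cannot reflect changes in object scale. Overlap precision (OP) is based on the ratio of the intersection of the predicted bounding box and the ground-truth bounding box to their union, known as the intersection over union (IoU):

$ {S_{{\rm{IoU}}}} = \frac{{area(A \cap B)}}{{area(A \cup B)}} $

where $ A $ and $ B $ denote the predicted object region and the ground-truth object region, respectively, and $ {S_{{\rm{IoU}}}} $ is the ratio of the area of their intersection to the area of their union, which reflects the scale adaptation ability of a tracker. Together the two metrics show the overall performance of a tracker. Fig. 5(a) shows that once the location error threshold exceeds 10, KCF_YOLO consistently outperforms the other tracking algorithms. In Fig. 5(b), KCF_YOLO has a large advantage over the other algorithms for IoU below 0.75, and the advantage is not obvious above 0.75. A possible reason is that the size of the tracking box is updated only every 20 frames; running YOLO V3 detection at a shorter interval would improve IoU but lower tracking speed. Trackers are usually judged at a distance threshold of 20 or 50 and an IoU of 0.5. Fig. 5 shows that KCF_YOLO has a clear advantage in this range: at a distance threshold of 20 and an IoU of 0.5, KCF_YOLO has the best DP and OP. The results show that, compared with other classical scale-adaptive tracking algorithms, effective detection and a suitable update strategy significantly improve tracking performance.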
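The two metrics can be sketched in plain Python as follows. Boxes are assumed to be `(x, y, width, height)` tuples with `(x, y)` the top-left corner, DP counts frames whose center error is at most the distance threshold, and OP counts frames whose IoU exceeds the overlap threshold; these conventions and the function names are assumptions for illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between the centers of two boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def distance_precision(pred, truth, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    hits = sum(center_error(p, t) <= threshold for p, t in zip(pred, truth))
    return hits / len(pred)


def overlap_precision(pred, truth, threshold=0.5):
    """Fraction of frames whose IoU exceeds the threshold."""
    hits = sum(iou(p, t) > threshold for p, t in zip(pred, truth))
    return hits / len(pred)


# Usage: a fixed-size box keeps a small center error but a low IoU
# when the ground-truth box has grown, which is what OP penalizes.
pred = [(10, 10, 20, 40), (10, 10, 20, 40)]
truth = [(10, 10, 20, 40), (0, -10, 40, 80)]
dp = distance_precision(pred, truth)   # both centers coincide
op = overlap_precision(pred, truth)    # second frame: IoU = 800 / 3200
```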

Fig. 5 Comparison curves of the six trackers ((a) distance precision; (b) overlap precision)

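3.2 Validation on datasets with sudden scale change on object reappearance

To validate the proposed method, two video sequences were recorded in which the tracked object is completely occluded and lost for about 200 frames before reappearing with a sudden change in scale. The tracking results of the 6 trackers are shown in Fig. 6.

Fig. 6 Datasets with sudden scale change when the object reappears ((a) the first dataset; (b) the second dataset)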

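For the video sequence in Fig. 6(a), #001 denotes the first frame. While the tracked object is unoccluded, KCF, SAMF, fDSST, DSST, TLD, and KCF_YOLO all track it accurately. After frame 134 the object becomes heavily occluded, and only SAMF, KCF, and KCF_YOLO still track it accurately. The object becomes completely occluded at frame 177 and reappears at frame 309 at 10.21 times its original scale. KCF, SAMF, fDSST, DSST, and TLD cannot cope with the sudden scale change: their tracking boxes stay where the object was lost, and they keep tracking that region as if it were the original object, so tracking errors accumulate and tracking drifts. KCF_YOLO has the YOLO V3 detection mechanism and can handle sudden scale change. When the object reappears, it detects all persons in the frame and compares each detection with the stored object template; when the similarity exceeds the threshold, the object is considered to have reappeared, and its appearance after the scale change is stored as the new object template, which reduces the loss of tracking cues. The YOLO V3 detection mechanism thus safeguards accurate tracking.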

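For the video sequence in Fig. 6(b), the object is small and moves fast, and it reappears after occlusion at 5.34 times its pre-occlusion scale. KCF, SAMF, fDSST, DSST, and TLD cannot adapt to this kind of scale change on reappearance: their search area remains 1.5 times the object region at the position held before the tracking error, and they take the wrongly tracked region as the object for subsequent tracking. Because the object reappears outside this range, they cannot track it accurately. KCF_YOLO adds the template update strategy. When KCF makes an error, the tracked region does not match the current object template, so the template is not updated; when the object reappears, similarity is used to judge whether it has reappeared and hence whether to update the template. The template update strategy corrects the tracking position in time, and because the template scale is also updated when the object is larger after occlusion, the tracker adapts to the object scale, which safeguards accurate detection and tracking. The experiments on these two reappearance sequences verify the robustness of the proposed algorithm on this type of video.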

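4 Conclusion

To address the inability of the traditional kernelized correlation filter to cope with sudden scale change within a short time and the slow tracking speed of neural networks, this paper builds a bridge between correlation filtering and neural networks: the correlation filter KCF is used in the training stage of tracking, and the neural network YOLO V3 is used in the detection stage. The detect-then-track strategy runs much faster than neural network tracking while also improving tracking precision. To adapt to sudden scale change, background clutter, camera shake, object deformation, and similar conditions, KCF is improved in two respects: 1) for template updating, similarity is compared after fusing color features with image fingerprint features to decide whether to update the object template; 2) for scale adaptation, the YOLO V3 neural network is used for detection so that the object scale is updated in time. These two improvements further increase the robustness of the tracker.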

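Experiments on representative sudden-scale-change sequences from the OTB2015 dataset show that the proposed method is promising for such scenes. On these sequences the method tracks the object successfully with an average precision of 0.955, 31.74% higher on average than the compared trackers, opening a new path for trackers that combine correlation filtering and neural networks. Adding the neural network detection mechanism lowers the speed of kernelized correlation filter tracking to 36 frames/s; extensive experiments indicate that a YOLO V3 detection interval of 20 frames gives good results in both tracking precision and tracking speed.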

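In future work, the YOLO V3 detection interval could be adjusted adaptively to improve tracking performance. Optimization methods and improvements to the neural network will also be explored to increase the speed of the fused tracker and promote better use of the combination in object tracking tasks.

References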

Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional Siamese networks for object tracking//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 850-865[DOI: 10.1007/978-3-319-48881-3_56]

Bolme D S, Beveridge J R, Draper B A and Lui Y M. 2010. Visual object tracking using adaptive correlation filters//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE: 2544-2550[DOI: 10.1109/CVPR.2010.5539960]

Choi J W, Chang H J, Yun S, Fischer T, Demiris Y and Choi J Y. 2017. Attentional correlation filter network for adaptive visual tracking//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 4828-4837[DOI: 10.1109/CVPR.2017.513]

Danelljan M, Häger G, Khan F S and Felsberg M. 2014. Accurate scale estimation for robust visual tracking//Proceedings of British Machine Vision Conference 2014. Nottingham, UK: BMVA Press: 1-11[DOI: 10.5244/C.28.65]

Danelljan M, Häger G, Khan F S and Felsberg M. 2017. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8): 1561-1575[DOI: 10.1109/TPAMI.2016.2609928]

Henriques J F, Caseiro R, Martins P and Batista J. 2012. Exploiting the circulant structure of tracking-by-detection with kernels//Proceedings of the 12th European conference on Computer Vision. Florence, Italy: Springer: 702-715[DOI: 10.1007/978-3-642-33765-9_50]

Henriques J F, Caseiro R, Martins P and Batista J. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 583-596[DOI: 10.1109/TPAMI.2014.2345390]

Kalal Z, Mikolajczyk K and Matas J. 2012. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7): 1409-1422[DOI: 10.1109/TPAMI.2011.239]

Li Y and Zhu J K. 2014. A scale adaptive kernel correlation filter tracker with feature integration//Proceedings of European Conference on Computer Vision. Zurich, Switzerland: Springer: 254-265[DOI: 10.1007/978-3-319-16181-5_18]

Nam H, Baek M and Han B. 2016. Modeling and propagating CNNs in a tree structure for visual tracking[EB/OL].[2019-08-20]. https://arxiv.org/pdf/1608.07242.pdf

Nam H and Han B. 2016. Learning multi-domain convolutional neural networks for visual tracking//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 4293-4302[DOI: 10.1109/CVPR.2016.465]

Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 1-6[DOI: 10.1109/CVPR.2016.91]

Redmon J and Farhadi A. 2017. Yolo9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 6517-6525[DOI: 10.1109/CVPR.2017.690]

Redmon J and Farhadi A. 2018. Yolov3: an incremental improvement[EB/OL].[2019-08-20]. https://arxiv.org/pdf/1804.02767.pdf

Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149[DOI: 10.1109/TPAMI.2016.2577031]

Wang L J, Ouyang W L, Wang X G and Lu H C. 2015. Visual tracking with fully convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 3119-3127[DOI: 10.1109/ICCV.2015.357]