Published: 2020-06-16
DOI: 10.11834/jig.190454
2020 | Volume 25 | Number 6

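Image Understanding and Computer Vision

Dynamic update correlation filter tracking based on appearance representation analysis

Qiang Zhuang, Shi Fanhuai

Department of Control Science and Engineering, College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China

Received: 2019-09-12; Revised: 2019-11-16

Supported by: Shanghai Key Research Project for Developing Agriculture through Science and Technology (Hu Nong Ke Chuang Zi (2018) No. 3-6)

First author: Qiang Zhuang, born in 1997, male, master's student. His main research interests are computer vision and pattern recognition. E-mail: qiangzhuang@tongji.edu.cn

CLC number: TP391

Document code: A

Article ID: 1006-8961(2020)06-1209-12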

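Abstract

Objective Discriminative tracking methods based on correlation filters and on Siamese neural networks have both made considerable progress, but the latter are computationally expensive and depend entirely on GPU (graphics processing unit) acceleration. Traditional correlation filter methods update the filter model at a fixed interval and therefore struggle to handle both fast-changing and ordinary targets. To address this problem, a dynamic model update algorithm based on analysis of the target's appearance state is proposed. It reduces the computational load, improves tracking accuracy, and achieves both robust tracking of slowly changing targets and accurate tracking of fast-changing targets. Method Optical flow histogram features of the target region image are computed from inter-frame information and classified with a support vector machine to decide whether the target's appearance is changing. A suitable correlation filter update interval is then set dynamically according to the target class and the dominant optical flow magnitude of the target region image. Foreground-background separation in the first frame further strengthens the learning of the target's appearance representation and improves tracking accuracy. Result On the OTB100 (object tracking benchmark with 100 sequences) benchmark dataset, the proposed algorithm is compared with six other fast tracking algorithms. Its precision and success rate are 86.4% and 64.9%, which are 1.4% and 0.9% higher than those of the runner-up, ECO-HC (efficient convolution operators using hand-crafted features). Under the highly challenging conditions of in-plane rotation, occlusion, out of view, and illumination variation, its precision is 3.0%, 4.4%, 5.2%, and 6.0% higher than that of the runner-up, and its success rate is 1.9%, 3.1%, 4.9%, and 4.0% higher. The algorithm runs at 32.15 frames/s on a CPU (central processing unit), which meets the real-time requirement of tracking. Conclusion The adaptive model update algorithm achieves better tracking accuracy while reducing the computational load, and is suitable for engineering deployment and application.

Keywords

object tracking; correlation filtering; optical flow; appearance state analysis; adaptive model update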

Dynamic update correlation filter tracking based on appearance representation analysis

Qiang Zhuang, Shi Fanhuai

Department of Control Science and Engineering, College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China

Supported by: Key research projects for developing agriculture through science and technology in Shanghai (3-6.2018)

Abstract

Objective Visual object tracking is one of the basic problems in computer vision research, with a solid theoretical basis and broad application value. As the technology is applied more widely, it faces increasingly complex environments. Factors such as scale changes, occlusion, and illumination variation introduce uncertainty, so robust, accurate, and fast tracking algorithms still need further research. In recent years, two categories of discriminative methods, based on discriminative correlation filters and on Siamese neural networks, have achieved high accuracy and robustness in tracking. However, Siamese trackers are limited by the heavy computation of convolutional neural networks (CNNs) and can only run on high-performance graphics processing units (GPUs), which seriously restricts their use in practical engineering environments. Trackers based on discriminative correlation filters have a simple framework, so they can learn and update an object's representation with hand-crafted features and achieve real-time tracking on a single central processing unit (CPU). This type of real-time tracker has been applied successfully on mobile platforms such as unmanned aerial vehicles. Under the traditional correlation filtering framework, updating the filter frame by frame leads to an excessive computational load and harms real-time performance. The sparse model update strategy proposed in recent years simply sets a fixed update interval, which slows the convergence of the tracking model and easily loses the target when it changes rapidly. Neither update strategy meets the growing application requirements in complex environments. To address the filter update strategy, this study proposes a dynamic update algorithm based on appearance representation analysis that reduces computation and improves tracking accuracy.

Method First, optical flow features are used to estimate the appearance state of the object. We compute the dense optical flow of the image in the predicted target region. When the object is simply translating, the target region changes little, so the flow magnitude at each pixel is small and the directions follow no consistent pattern. When the object is deforming or being occluded, the affected part generates considerably larger flow, which distinguishes it from an ordinary target. Optical flow histogram information is extracted by dividing the image into an m×n grid. The mean flow magnitude and mean flow angle over the pixels of each cell are collected to form the histogram feature vector. A support vector machine then classifies the feature vector to estimate the object's current appearance state. After this analysis, the flow magnitudes in the object region of the current frame are collected into a histogram with a bin width of 0.5. The update interval of the filter model is set according to the dominant flow magnitude and the target class, which realizes adaptive updating of the correlation filter. In addition, foreground-background separation based on the discrete cosine transform is applied in the first frame to obtain more accurate labeling information, reduce interference from similar background, and further improve the learning of the object representation.

Result The algorithm is tested on the OTB100 (object tracking benchmark with 100 sequences) dataset and compared with six fast tracking algorithms: ECO-HC (efficient convolution operators using hand-crafted features), SRDCF (spatially regularized discriminative correlation filter), Staple (sum of template and pixel-wise learners), KCF (kernelized correlation filter), DSST (discriminative scale space tracker), and CSK (circulant structure of tracking-by-detection with kernels). On five typical challenging video sequences, the proposed algorithm achieves higher tracking overlap by setting the model update interval adaptively. It alleviates the overfitting of traditional frame-by-frame updating algorithms such as Staple, and the tendency of ECO-HC to lose fast-changing objects because of its sparse update strategy. Quantitative analysis on the entire OTB100 dataset shows that the precision and success rate of the proposed algorithm are 86.4% and 64.9%, respectively, the best accuracy and robustness among the fast trackers that can run on a CPU. Moreover, under the highly challenging conditions of in-plane rotation, occlusion, out of view, and illumination variation, its precision is 3.0%, 4.4%, 5.2%, and 6.0% higher than that of the runner-up, and its success rate is 1.9%, 3.1%, 4.9%, and 4.0% higher. On an i7-6850K CPU the algorithm runs at 32.15 frames per second with a lower computational load than frame-by-frame updating, which meets the real-time requirement of tracking.

Conclusion This study proposes a dynamic update correlation filter tracking algorithm based on appearance representation analysis. The comparison results show that the improved algorithm achieves both robust tracking of slowly changing objects and accurate tracking of fast-changing objects, with real-time performance suitable for engineering deployment and application.

Key words

object tracking; correlation filtering; optical flow; appearance state analysis; adaptive model update

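0 Introduction

Visual object tracking is one of the basic problems in computer vision research, with a solid theoretical basis and broad application value. Its main task is as follows: given a bounding box containing the target in the first frame of a video sequence (hand-labeled or produced by an object detector), the tracking algorithm computes the target's position in subsequent frames. As visual tracking is applied more widely, the environments it faces become increasingly complex. Scale changes, occlusion, illumination variation, fast motion, and similar factors introduce additional uncertainty that strongly affects tracking algorithms, so robust, accurate, and fast visual object tracking still leaves much room for research.

Visual object tracking algorithms are mainly divided into generative and discriminative methods. The former model the target region in the current frame and take the region of the next frame that is most similar to the model as the predicted position. The latter apply machine learning to image features: a classifier is trained with the target region of the current frame as the positive sample and the background as negative samples, and is then used to find the predicted position in the next frame.

Two discriminative approaches, based on the discriminative correlation filter and on the Siamese network, have achieved high accuracy and robustness in tracking and have attracted much attention. Bertinetto et al. (2016b) proposed SiamFC (fully-convolutional Siamese network), which uses a fully convolutional Siamese network to measure the similarity between the current region and the target template and thereby predicts the target position in the current frame. Li et al. (2018) added the RPN (region proposal network) structure commonly used in object detection to SiamFC, appending a classification branch and a regression branch to the Siamese network for localization. Deeper networks (Li et al., 2019) and more accurate frameworks (Wang et al., 2019) followed, and the accuracy advantage of complex deep learning models became increasingly evident. However, Siamese trackers (Bertinetto et al., 2016b; Li et al., 2018, 2019; Wang et al., 2019) are limited by the heavy computation of convolutional neural networks (CNNs) and can only run on high-performance GPUs (graphics processing units), which seriously restricts their use in practical engineering environments. In contrast, correlation filter trackers have a simple framework, can learn and update using only a small number of hand-set parameters, and reach real-time speed on a CPU (central processing unit). Such fast algorithms have been applied successfully on mobile platforms such as unmanned aerial vehicles.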

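Correlation filter tracking was first proposed by Bolme et al. (2010) and balances accuracy and speed. The method solves for a correlation filter by minimizing the sum of squared errors, computes the convolution response between the image and the filter, and takes the position of the maximum response as the tracking result. Henriques et al. (2012) proposed CSK (circulant structure of tracking-by-detection with kernels), a kernelized correlation filter with a circulant structure that augments the sample set with circulant matrices and maps samples with a kernel function to achieve fast learning and tracking. These two early methods laid the foundation of the correlation filter framework.

Subsequently, Danelljan et al. (2015a) introduced color name (CN) features to improve tracking in color video. They also proposed the multi-scale tracker DSST (discriminative scale space tracker) (Danelljan et al., 2014), which uses MOSSE (minimum output sum of squared error) (Bolme et al., 2010) as the base tracker and trains a multi-scale tracker with histogram of oriented gradient (HOG) features. The method localizes the target with a correlation filter, crops target samples at each scale according to the scale factors and the target position, feeds these samples to a correlation filter, and determines the target scale from the maximum response. Henriques et al. (2015) proposed the kernelized correlation filter (KCF) with HOG features, which markedly improved tracking accuracy. Research in this period focused on more effective feature representations of the target image, improving robustness through suitable feature representation and scale estimation.

Danelljan et al. (2015c) proposed the spatially regularized discriminative correlation filter (SRDCF), which introduces spatial regularization weights during learning to suppress the background response and enlarge the search region. This greatly improved accuracy but reduced efficiency. Danelljan et al. (2017) proposed ECO-HC (efficient convolution operators using hand-crafted features) together with a sparse model update strategy, obtaining better accuracy while running in real time on a CPU and further refining the correlation filter framework.

The development of correlation filter algorithms shows that a suitable, well-designed filter framework is the basis for good tracking, and that accurate features describe image information better and yield better results within the same framework. Methods such as DeepSRDCF (spatially regularized discriminative correlation filter with deep feature) (Danelljan et al., 2015b), C-COT (continuous convolution operators for visual tracking) (Danelljan et al., 2016b), ECO (efficient convolution operators) (Danelljan et al., 2017), and UPDT (unveiling the power of deep tracking) (Bhat et al., 2018) integrate deep convolutional features into the correlation filter framework. They improve accuracy but sacrifice real-time performance, which limits their value in engineering applications.

Beyond feature representation, the online update of the correlation filter has long been neglected. Under the traditional framework, updating the filter online at every frame imposes a computational load that harms real-time performance and inevitably overfits to the most recent frames. The sparse update strategy of ECO-HC (Danelljan et al., 2017) simply sets a fixed update interval, which slows the convergence of the tracking model and easily loses the target when it changes rapidly. Once the target leaves the search region, it is hard to recover.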

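Building on the fast correlation filter framework, this paper further studies the online learning of the filter model during tracking and proposes a dynamic update correlation filter based on appearance representation analysis (DUCF), which overcomes the slow reaction of fixed-interval model updates to targets whose appearance changes rapidly. DUCF computes optical flow histogram features of the predicted target region and classifies them with a support vector machine (SVM) to decide whether the target's appearance is changing rapidly. It then sets a suitable model update interval dynamically according to the dominant optical flow magnitude of the target region. When the appearance changes rapidly, the filter is updated frequently to improve accuracy. When the target is stable, a sparse strategy updates the filter conservatively to improve efficiency and prevent overfitting. Ordinary targets are thus tracked stably while fast-changing targets are learned more quickly, which improves overall tracking performance.

With this dynamic filter update strategy, the proposed algorithm further improves the accuracy and robustness of fast tracking methods on the public OTB100 (object tracking benchmark with 100 sequences) dataset (Wu et al., 2015) while remaining real-time on a CPU, which gives it broad application value.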

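1 Related work

Correlation filter based visual tracking has attracted attention since it was proposed. The basic idea is to generate a filter template (the correlation filter) from the tracked target, correlate it with the image in the search region to compute matching scores between the image and the target template, and take the position of the maximum filter response as the tracking result. The filter is then updated with the target region image to adapt to changes in the target's appearance.

1.1 Target localization

A frame is represented by features such as histograms of oriented gradients and color features, denoted ${\mathit{\boldsymbol{x}}}_{j}$, with $D$ feature channels: ${\mathit{\boldsymbol{x}}}^{d}_{j}∈ R ^{N_{d}×N_{d}}$, where $N_{d}×N_{d}$ is the resolution of the $d$-th feature map. For the tracking problem, a $D$-channel correlation filter ${\mathit{\boldsymbol{f}}}=({\mathit{\boldsymbol{f}}} ^{1}, …, {\mathit{\boldsymbol{f}}} ^{D})$ is trained and convolved with the feature maps to predict the detection score of the target: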

$ {\mathit{\boldsymbol{S}}_f}\{ {\mathit{\boldsymbol{x}}_j}\} = \mathit{\boldsymbol{f}} * {\mathit{\boldsymbol{x}}_j} = \sum\limits_{d = 1}^D {{\mathit{\boldsymbol{f}}^d}} * \mathit{\boldsymbol{x}}_j^d $

(1)

where ${\mathit{\boldsymbol{S}}}_{f}\{{\mathit{\boldsymbol{x}}}_{j}\}$ is the convolution response map for sample ${\mathit{\boldsymbol{x}}}_{j}$.
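The channel-wise sum in Eq. (1) can be illustrated with a minimal pure-Python sketch. It assumes, for simplicity, that every filter channel has the same size as its feature map and that the convolution is circular; the function names `circular_convolve2d`, `detection_scores`, and `argmax2d` are hypothetical and are not taken from the paper's implementation.

```python
def circular_convolve2d(f, x):
    """Circular 2-D convolution of a filter channel f with a feature map x.

    Assumption: f and x are equally sized lists of lists.
    """
    n_rows, n_cols = len(x), len(x[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            acc = 0.0
            for i in range(n_rows):
                for j in range(n_cols):
                    acc += f[i][j] * x[(r - i) % n_rows][(c - j) % n_cols]
            out[r][c] = acc
    return out


def detection_scores(filters, features):
    """Eq. (1): sum the per-channel responses f^d * x^d over all D channels."""
    n_rows, n_cols = len(features[0]), len(features[0][0])
    score = [[0.0] * n_cols for _ in range(n_rows)]
    for f_d, x_d in zip(filters, features):
        response = circular_convolve2d(f_d, x_d)
        for r in range(n_rows):
            for c in range(n_cols):
                score[r][c] += response[r][c]
    return score


def argmax2d(score):
    """Return the (row, col) of the maximum response, i.e. the predicted position."""
    best = max((v, r, c) for r, row in enumerate(score) for c, v in enumerate(row))
    return best[1], best[2]
```

In practice this quadruple loop would be replaced by FFT-based multiplication; the loop form is only meant to make the summation over channels explicit.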

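The multi-channel filter is learned by minimizing the quadratic error $E({\mathit{\boldsymbol{f}}})$. A regularization term on ${\mathit{\boldsymbol{f}}}$ prevents overfitting, and the spatial regularization weights ${\mathit{\boldsymbol{w}}}$ suppress the background response: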

$ E(\mathit{\boldsymbol{f}}) = \sum\limits_{j = 1}^M {{\alpha _j}} \left\| {{\mathit{\boldsymbol{S}}_f}\{ {\mathit{\boldsymbol{x}}_j}\} - {\mathit{\boldsymbol{y}}_j}} \right\|_{{L^2}}^2 + \sum\limits_{d = 1}^D {\left\| {\mathit{\boldsymbol{w}} \cdot {\mathit{\boldsymbol{f}}^d}} \right\|_{{L^2}}^2} $

(2)

where ${\mathit{\boldsymbol{y}}}_{j}$ is the labeled score response map for sample ${\mathit{\boldsymbol{x}}}_{j}$, $α_{j}$ is the weight of sample ${\mathit{\boldsymbol{x}}}_{j}$, and $L^{2}$ denotes the L2 norm.

To reduce computation, the filter can likewise be trained in the frequency domain via the fast Fourier transform, which turns spatial-domain convolution into element-wise multiplication in the frequency domain:

$ E(\mathit{\boldsymbol{f}}) = \sum\limits_{j = 1}^M {{\alpha _j}} \left\| {\widehat {{\mathit{\boldsymbol{S}}_f}\{ {\mathit{\boldsymbol{x}}_j}\} } - {{\mathit{\boldsymbol{\hat y}}}_j}} \right\|_{{L^2}}^2 + \sum\limits_{d = 1}^D {\left\| {\mathit{\boldsymbol{\hat w}} * {{\mathit{\boldsymbol{\hat f}}}^d}} \right\|_{{L^2}}^2} $

(3)

where ^ denotes the Fourier transform of the corresponding variable.

Because extracting features on an $m×n$ grid of cells lowers the resolution of the response map, Newton iterations on the filter response map are used to improve localization accuracy:

$ {\mathit{\boldsymbol{X}}_{n + 1}} = {\mathit{\boldsymbol{X}}_n} - \frac{{\mathit{\boldsymbol{S}}_f^\prime ({\mathit{\boldsymbol{X}}_n})}}{{\mathit{\boldsymbol{S}}_f^{\prime \prime }({\mathit{\boldsymbol{X}}_n})}} = {\mathit{\boldsymbol{X}}_n} - {\mathit{\boldsymbol{H}}^{ - 1}}({\mathit{\boldsymbol{X}}_n}) \cdot {\mathit{\boldsymbol{J}}_{{S_f}}}({\mathit{\boldsymbol{X}}_n}) $

(4)

where ${\mathit{\boldsymbol{J}}}_{S_{f}}$ is the Jacobian matrix, ${\mathit{\boldsymbol{H}}}$ is the Hessian matrix, and ${\mathit{\boldsymbol{X}}}_{n}$ is the position of the response center after the $n$-th iteration.
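The Newton update of Eq. (4) can be sketched for a two-dimensional position as follows. This is only an illustration: the caller is assumed to supply the gradient and Hessian of the response as functions `grad` and `hess` (hypothetical names), whereas a real tracker would derive them from its interpolated response map.

```python
def newton_refine(grad, hess, x0, y0, iters=5):
    """Refine a response peak with X_{n+1} = X_n - H^{-1}(X_n) * J(X_n), Eq. (4).

    grad(x, y) returns (gx, gy); hess(x, y) returns (hxx, hxy, hyy)
    for a symmetric 2x2 Hessian. Both callables are assumptions.
    """
    x, y = x0, y0
    for _ in range(iters):
        gx, gy = grad(x, y)
        hxx, hxy, hyy = hess(x, y)
        det = hxx * hyy - hxy * hxy
        if abs(det) < 1e-12:  # singular Hessian: stop refining
            break
        # Closed-form inverse of the symmetric 2x2 Hessian applied to the gradient
        dx = (hyy * gx - hxy * gy) / det
        dy = (-hxy * gx + hxx * gy) / det
        x, y = x - dx, y - dy
    return x, y


# Hypothetical quadratic response with its peak at (2.3, 4.7)
def example_grad(x, y):
    return (-2.0 * (x - 2.3) - 0.5 * (y - 4.7),
            -4.0 * (y - 4.7) - 0.5 * (x - 2.3))


def example_hess(x, y):
    return (-2.0, -0.5, -4.0)


peak = newton_refine(example_grad, example_hess, 2.0, 5.0)
```

Starting from the coarse grid maximum `(2.0, 5.0)`, the iteration moves to the sub-cell peak of the example response.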

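1.2 Update strategy

After target localization, the search region image is added to the sample space as a training sample for updating the correlation filter. Danelljan et al. (2017) proposed generating a Gaussian mixture clustering of the samples by unsupervised learning and learning from the cluster means, which reduces the number of training samples and speeds up computation.

Once the sample space has been updated, it is used to update the correlation filter so that the filter adapts to changes in the target's appearance. Traditional correlation filter methods update the filter at every frame, which is computationally expensive, and the online update seriously reduces efficiency. Danelljan et al. (2017) proposed a sparse update strategy: a fixed model update interval $N_{s}$ is set and the filter is updated once every $N_{s}$ frames, which speeds up tracking. However, this simple sparse strategy slows convergence, and when the target's appearance changes or the target moves quickly, a model that is not updated promptly may lose the target.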

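2 Dynamic model update algorithm based on target appearance representation analysis

To address the filter update strategy, this paper proposes DUCF, a dynamic model update algorithm based on appearance representation analysis, which sets the model update interval dynamically by estimating how much the target's appearance state is changing. When the target is in a state where it is easily lost, such as fast motion, motion blur, or occlusion, the filter is updated actively to improve accuracy. When the target is stable and undergoing ordinary motion such as translation, a sparse strategy updates the filter conservatively to improve efficiency and prevent overfitting. Fig. 1 shows the overall flowchart of the algorithm.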

Fig. 1 Overall flowchart of the proposed algorithm

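2.1 Estimating changes in the target's appearance state

Optical flow describes the instantaneous velocity, on the imaging plane, of pixels belonging to objects moving in space. In a video sequence, the optical flow of the target image is easily computed from two adjacent frames and reflects the target's motion.

Let the gray value at a point of the image at time $t$ be $I(x, y, t)$, and write ${I_x} = \frac{{\partial I}}{{\partial x}}, {I_y} = \frac{{\partial I}}{{\partial y}}, {I_t} = \frac{{\partial I}}{{\partial t}}$. Under the global smoothness constraint (the Horn-Schunck (HS) method), the optical flow $u$, $v$ of the current frame is obtained by minimizing the error in Eq. (5):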

$ \begin{array}{*{20}{c}} {E = \int {\int {{{({I_x}u + {I_y}v + {I_t})}^2}} } + }\\ {\lambda \left[ {{{\left( {\frac{{\partial u}}{{\partial x}}} \right)}^2} + {{\left( {\frac{{\partial u}}{{\partial y}}} \right)}^2} + {{\left( {\frac{{\partial v}}{{\partial x}}} \right)}^2} + {{\left( {\frac{{\partial v}}{{\partial y}}} \right)}^2}} \right]{\rm{d}}x{\rm{d}}y} \end{array} $

(5)

where $λ$ is the smoothness weight.

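To characterize the optical flow of the target region more precisely, the flow is computed on the image inside the target bounding box rather than on the search region. Background clutter in the search region would interfere with the target and prevent accurate state estimation, and because the search region is four times the size of the bounding box, it would also increase the computational burden and slow the tracker. When the target simply translates, the target region image changes little, so the flow magnitude at each point is small and the directions follow no consistent pattern. When the target deforms or is occluded, the affected parts produce larger flow, which distinguishes it from an ordinary target.

The optical flow of the target region is first computed with the HS method; Fig. 2 visualizes the result. To analyze the target state, the target image is divided evenly into $m×n$ cells, each containing several pixels. Within each cell, the mean flow magnitude and mean flow angle over its pixels are computed as the cell's statistics. Fig. 3 illustrates the extraction of the optical flow histogram feature, where ${\mathit{\boldsymbol{v}}}$ denotes the feature. If each cell ${\mathit{\boldsymbol{Ω}}}_{k}$ has length $s_{c}$ and width $s_{r}$, its flow magnitude $I(k)$ and angle $R(k)$ are given by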

Fig. 2 Optical flow of the target image (the length of each blue arrow represents the flow magnitude, and its direction represents the flow angle)

Fig. 3 Extraction of the optical flow histogram feature

$ I(k) = \frac{1}{{{s_r} \times {s_c}}}\sum\limits_{i = 1}^{{s_r}} {\sum\limits_{j = 1}^{{s_c}} {\sqrt {u{{(i,j)}^2} + v{{(i,j)}^2}} } }, \quad (i,j) \in {\mathit{\boldsymbol{\varOmega}}_k} $

(6)

$ R(k) = \frac{1}{{{s_r} \times {s_c}}}\sum\limits_{i = 1}^{{s_r}} {\sum\limits_{j = 1}^{{s_c}} {\tan^{ - 1}\frac{{v(i,j)}}{{u(i,j)}}} }, \quad (i,j) \in {\mathit{\boldsymbol{\varOmega}}_k} $

(7)

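$m$ and $n$ should be chosen so that changes in the target's state show up clearly in the flow features above. If the grid is too coarse, averaging a large amount of flow data masks the target's changes and fails to reflect its state. If it is too fine, the features become sensitive to noise in complex environments and the computational cost rises sharply, which slows the tracker. After weighing these factors and testing experimentally, the target image is divided into a $5×5$ grid; Fig. 4 shows the grid on some example images.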

Fig. 4 Grid division of some example images

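Computing dense optical flow first and then averaging it over each cell reduces the influence of noise to some extent and makes the algorithm more stable than computing sparse flow only at keypoints within each cell.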
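The per-cell statistics of Eqs. (6) and (7) can be sketched as below. The sketch assumes the dense flow is already available as two equally sized 2-D lists `u` and `v`, uses `math.atan2` in place of $\tan^{-1}(v/u)$ so that pixels with $u=0$ do not cause a division by zero, and interleaves magnitude and angle in the output vector. The function name and the vector layout are assumptions, not the paper's specification.

```python
import math


def grid_flow_features(u, v, m=5, n=5):
    """Mean flow magnitude and mean flow angle for each cell of an m x n grid.

    u, v: dense flow components as 2-D lists with at least m rows and n columns.
    Returns a flat list [I(0), R(0), I(1), R(1), ...] of length 2 * m * n.
    """
    rows, cols = len(u), len(u[0])
    features = []
    for gi in range(m):
        r0, r1 = gi * rows // m, (gi + 1) * rows // m
        for gj in range(n):
            c0, c1 = gj * cols // n, (gj + 1) * cols // n
            magnitude_sum, angle_sum, count = 0.0, 0.0, 0
            for i in range(r0, r1):
                for j in range(c0, c1):
                    magnitude_sum += math.hypot(u[i][j], v[i][j])  # Eq. (6) summand
                    angle_sum += math.atan2(v[i][j], u[i][j])      # Eq. (7) summand
                    count += 1
            features.append(magnitude_sum / count)
            features.append(angle_sum / count)
    return features
```

With the $5×5$ grid chosen in the text, this layout yields a 50-dimensional feature vector per frame.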

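With this division, when the target image undergoes ordinary motion such as translation, its appearance changes little, the flow magnitude is small overall, and there is no consistent direction. When motion blur, occlusion, or out-of-view occurs, the flow magnitude is larger and the direction is fairly consistent within some regions. The target's appearance state is therefore estimated by classifying the optical flow histogram features.

Because online tracking must run in real time, the classifier should not be too complex, so that the algorithm stays efficient. Compared with classifiers such as neural networks, the support vector machine (SVM) (Chang and Lin, 2011) is fast and adds relatively little computational cost while maintaining classification accuracy. An SVM with a Gaussian kernel is therefore used to classify the optical flow: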

$ K({\mathit{\boldsymbol{v}}_i},{\mathit{\boldsymbol{v}}_j}) = {\rm{exp}}\left( { - \frac{{{{\left\| {{\mathit{\boldsymbol{v}}_i} - {\mathit{\boldsymbol{v}}_j}} \right\|}^2}}}{{2{\sigma ^2}}}} \right) $

(8)

where $σ>0$ is the bandwidth of the Gaussian kernel, and ${\mathit{\boldsymbol{v}}}_{i}, {\mathit{\boldsymbol{v}}}_{j}$ are the optical flow feature vectors of the $i$-th and $j$-th training samples.
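Eq. (8) can be written directly as a small function. The helper `gamma_to_sigma` relies on the LIBSVM convention that the RBF kernel is $\exp(-γ\|\boldsymbol{v}_i-\boldsymbol{v}_j\|^2)$, so $γ = 1/(2σ^2)$; whether a $γ$ value found by grid search refers to exactly this parameter is an assumption here.

```python
import math


def gaussian_kernel(v_i, v_j, sigma):
    """Eq. (8): exp(-||v_i - v_j||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(v_i, v_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def gamma_to_sigma(gamma):
    """Convert an RBF gamma (LIBSVM convention, an assumption) to the bandwidth sigma."""
    return math.sqrt(1.0 / (2.0 * gamma))
```

Identical feature vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart.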

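2.2 Adaptive model update

In complex environments the target's appearance often changes, and the correlation filter should be updated continually to follow it. However, overly frequent updates increase the computational load and harm real-time performance, whereas overly sparse updates slow convergence and reduce robustness. This paper adjusts the update interval dynamically based on analysis of the target's appearance state, which achieves adaptive model updating.

According to the characteristics of the flow magnitude, targets are divided into two classes by appearance. In the first class the appearance changes rapidly, which requires a small update interval or even an immediate update. In the second class the appearance changes slowly or hardly at all. Such targets are not easily lost, so a larger update interval is used to simplify computation and improve tracking stability.

After the SVM has classified the flow features, the flow magnitudes in the target region of the current frame are collected into a histogram with a bin width of 0.5 (see Fig. 5), and the bin with the highest count is taken as the dominant flow magnitude of the current frame.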

Fig. 5 Histogram of optical flow magnitudes for one frame
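The dominant flow magnitude described above can be sketched as a histogram vote. The function name is hypothetical, and two details are assumptions not stated in the text: bins are half-open intervals starting at 0, and ties between equally populated bins are resolved in favor of the smaller magnitude.

```python
import math
from collections import Counter


def dominant_flow_magnitude(u, v, bin_width=0.5):
    """Return the (lower, upper) edges of the most populated magnitude bin.

    u, v: dense flow components of the target region as non-empty 2-D lists.
    """
    counts = Counter()
    for row_u, row_v in zip(u, v):
        for du, dv in zip(row_u, row_v):
            counts[int(math.hypot(du, dv) // bin_width)] += 1
    # Highest count wins; on a tie, prefer the lower bin (assumption)
    best_bin = max(counts, key=lambda b: (counts[b], -b))
    return best_bin * bin_width, (best_bin + 1) * bin_width
```

The lower edge of the returned interval can then be compared against the ranges used to choose the update interval.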

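The dominant flow magnitude of the target image in the current frame further reflects how violently the target's appearance is changing. For targets of the first class, a dominant magnitude above the threshold indicates a drastic appearance change, and the filter is updated immediately in the current frame. Below the threshold, the update interval is set according to the magnitude. For targets of the second class, a larger update interval is set according to the magnitude to speed up computation and prevent overfitting. The update intervals $N_{s}$ for the two classes are listed in Tables 1 and 2.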

Table 1 Update interval $N_{s}$ for targets with changing appearance

Dominant flow magnitude	0~0.5	0.5~1	1~1.5	>1.5
$N_{s}$	7	5	4	1 (immediate update)

Table 2 Update interval $N_{s}$ for ordinary targets

Dominant flow magnitude	0~1	1~3	>3
$N_{s}$	9	7	1 (immediate update)
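Tables 1 and 2 amount to a simple lookup, sketched below. The interval values come from the tables; the treatment of boundary values (each range is taken as closed on the left and open on the right) and the function and constant names are assumptions.

```python
# (upper bound of range, N_s) pairs taken from Table 1 and Table 2
FAST_CHANGING_TABLE = [(0.5, 7), (1.0, 5), (1.5, 4)]
COMMON_TABLE = [(1.0, 9), (3.0, 7)]


def update_interval(dominant_magnitude, fast_changing):
    """Return the filter update interval N_s for the current frame.

    fast_changing: True if the classifier assigned the target to the class
    whose appearance is changing (Table 1), False otherwise (Table 2).
    """
    table = FAST_CHANGING_TABLE if fast_changing else COMMON_TABLE
    for upper, n_s in table:
        if dominant_magnitude < upper:
            return n_s
    return 1  # magnitude beyond the last range: update immediately
```

For example, a fast-changing target with a dominant magnitude of 1.2 gets $N_s=4$, while an ordinary target with the same magnitude gets $N_s=7$.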

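An update flag is maintained in the algorithm. If fewer than $N_{s}$ frames have passed since the last filter update, only the training samples are updated, the correlation filter is not retrained, and processing moves directly to the next frame. If $N_{s}$ frames have passed since the last filter update, the training samples are updated first, the correlation filter is then trained, and only afterward are subsequent frames processed. Setting $N_{s}$ dynamically in this way realizes the adaptive update of the correlation filter.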
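The update flag described above can be sketched as a small counter. The class name `UpdateScheduler` is hypothetical, and the exact bookkeeping (counting the current frame before comparing with $N_s$) is one reasonable reading of the text rather than the authors' implementation.

```python
class UpdateScheduler:
    """Decides, frame by frame, whether the correlation filter is retrained."""

    def __init__(self):
        self.frames_since_update = 0

    def step(self, n_s):
        """Call once per frame, after the training samples have been updated.

        n_s: update interval chosen for the current frame.
        Returns True if the filter should be retrained on this frame.
        """
        self.frames_since_update += 1
        if self.frames_since_update >= n_s:
            self.frames_since_update = 0
            return True
        return False


scheduler = UpdateScheduler()
decisions = [scheduler.step(n_s) for n_s in [3, 3, 3, 1, 9, 9]]
```

Because $N_s$ is re-evaluated at every frame, a sudden drop to $N_s=1$ triggers a retraining immediately, even if the previous interval had not yet elapsed.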

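The time complexity of the traditional correlation filter tracking algorithm is ${\rm{O}}\left({{N_{{\rm{CG}}}}CL\bar K} \right)$, where $N_{\rm CG}$ is the number of iterations used to solve for the filter, $C$ is the number of feature channels of the filter, $L$ is the number of online training samples, and $\bar K $ is the average number of Fourier coefficients per channel. With adaptive model updating, the time complexity of DUCF is reduced to ${\rm{O}}\left({{N_{{\rm{CG}}}}CL\bar K\left({E + 1/{N_{\rm{A}}}} \right)} \right)$, where ${N_{\rm{A}}}$ is the average model update interval and $E$ is the ratio of the time needed for appearance state estimation to the time the tracker needs to process one frame. Because the optical flow is computed on a small region and both the features and the classifier are simple, appearance state estimation takes about 10 ms, varying with target scale, so $E$ is about 0.5. $N_{\rm A}$ varies with how violently the target changes. Under the strategies of Tables 1 and 2 it is about 4 for typical targets, so $E+1/N_{\rm A}$ is less than 1. The proposed algorithm is therefore faster than traditional frame-by-frame updating, with a running speed between those of frame-by-frame updating and the simple sparse update strategy.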
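As a worked check of this estimate, substituting the typical values quoted above gives the first expression below. The second expression reads the measured speeds in Table 4 the same way, under the assumption that frame-by-frame ECO-HC is a fair stand-in for the traditional algorithm. The complexity model covers only filter training, so only rough agreement should be expected.

$ E + \frac{1}{N_{\rm A}} \approx 0.5 + \frac{1}{4} = 0.75 < 1, \qquad \frac{26.58}{32.15} \approx 0.83 $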

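2.3 Foreground-background separation in complex scenes

In tracking, the target region given in the first frame is a simple bounding box rather than a pixel-level annotation, so when the scene is complex, background near the target interferes with the training of the correlation filter. In the correlation filter framework the first-frame target carries a high weight. If feature extraction in that frame is strongly affected by background, learning of the target's appearance representation suffers and tracking becomes less robust.

To address this problem, the tracking model is initialized more carefully in the first frame: visual saliency detection is used to refine the target annotation.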

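Hou et al. (2012) proposed a saliency detection operator $S$ based on the discrete cosine transform:

$ S(\mathit{\boldsymbol{x}}) = {\rm{sign}}({\rm{DCT}}(\mathit{\boldsymbol{x}})) $

(9)

The saliency map $M$ is obtained by applying the inverse transform and squaring the result element-wise:

$ M(\mathit{\boldsymbol{x}}) = {\left[ {{\rm{IDCT}}(S(\mathit{\boldsymbol{x}}))} \right]^2} $

(10)

where ${\mathit{\boldsymbol{x}}}$ denotes the image being examined.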
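Eqs. (9) and (10) can be sketched in pure Python with a naive orthonormal DCT. This is an illustrative implementation for tiny single-channel images only. The function names are hypothetical, and the Gaussian smoothing that Hou et al. (2012) apply to the squared reconstruction is omitted here.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence (naive O(n^2) version)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)))
    return out


def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for k in range(n)))
    return out


def transform_2d(img, fn):
    """Apply a 1-D transform along rows, then along columns."""
    rows = [fn(list(r)) for r in img]
    cols = [fn(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def sign(value):
    return (value > 0) - (value < 0)


def saliency_map(img):
    """Eq. (9): keep only the signs of the DCT; Eq. (10): invert and square."""
    spectrum = transform_2d(img, dct_1d)
    signature = [[sign(c) for c in row] for row in spectrum]
    reconstruction = transform_2d(signature, idct_1d)
    return [[value * value for value in row] for row in reconstruction]
```

A production version would use a fast DCT routine and normalize the map before thresholding; both steps are left out to keep the sketch self-contained.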

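Based on this method, once the search region has been obtained, saliency detection and foreground-background separation are applied to the search region image. On the one hand, everything outside the search region is already discarded during feature extraction and need not be processed. On the other hand, the search region contains enough background for the saliency detector to work well.

A threshold is determined experimentally. Regions with saliency greater than or equal to the threshold are kept, and regions with saliency below it are suppressed. Note that when the background is complex or the target is part of a larger object, thresholding may yield several salient regions (as in Fig. 6(b)). In that case the annotation is used to keep the salient region containing the target center and suppress the others. When the shifted search region is processed, it suffices to keep the region containing the image center. If the image center of a sequence is misjudged as background, foreground-background separation is skipped for that sequence and the original image is used for subsequent processing. Experiments showed that a threshold of 0.1 works well.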
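The region selection rule above can be sketched as a flood fill from the image center. The sketch assumes the saliency map has been normalized so that a threshold of 0.1 is meaningful, and it uses 4-connectivity to define a region. Both choices, like the function name, are assumptions.

```python
from collections import deque


def keep_center_region(saliency, threshold=0.1):
    """Return a boolean mask of the salient region containing the image center.

    Returns None when the center itself falls below the threshold, in which
    case separation is skipped and the original image is used.
    """
    rows, cols = len(saliency), len(saliency[0])
    center_r, center_c = rows // 2, cols // 2
    if saliency[center_r][center_c] < threshold:
        return None
    mask = [[False] * cols for _ in range(rows)]
    mask[center_r][center_c] = True
    queue = deque([(center_r, center_c)])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]
                    and saliency[nr][nc] >= threshold):
                mask[nr][nc] = True
                queue.append((nr, nc))
    return mask
```

Salient pixels that are not connected to the center, such as a second bright blob in the background, stay outside the mask and can be suppressed.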

Fig. 6 Visual saliency detection results on some OTB100 sequences ((a) Bird1; (b) MotorRolling; (c) Panda)

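3 Experiments and analysis

3.1 Experimental details

Experiments were run on an i7-6850K hardware platform under Ubuntu 16.01 with MATLAB R2017a.

In the implementation, typical video sequences were selected by hand and their optical flow features extracted to form a training set, on which the optical flow SVM classifier was trained offline in advance. Grid search gave the optimal hyperparameters $c=2, γ=0.313$.

DUCF was tested on the OTB100 dataset (Wu et al., 2015) and compared with the fast non-deep-learning trackers ECO-HC (Danelljan et al., 2017), SRDCF (Danelljan et al., 2015c), Staple (sum of template and pixel-wise learners) (Bertinetto et al., 2016a), KCF (Henriques et al., 2015), DSST (Danelljan et al., 2014), and CSK (Henriques et al., 2012). Deep learning trackers such as SiamFC (Bertinetto et al., 2016b) and SiamRPN (visual tracking with Siamese region proposal network) (Li et al., 2018) cannot run in real time on an ordinary CPU and are hard to deploy in practical engineering environments, so they are not included in the comparison.

On the typical video sequences, the average central error (ACE) and the mean intersection over union (mIoU) are used to compare the tracking results of the proposed algorithm with those of the other fast trackers. ACE indicates the accuracy of the tracking result, while IoU reflects scale information and tracking robustness well.

For overall evaluation, precision curves and success rate curves are used. Precision is the fraction of frames in which the pixel distance between the predicted and the ground-truth target centers is below a threshold. The precision curve plots this fraction on the vertical axis against the distance threshold on the horizontal axis. The success rate is the fraction of frames in which the IoU between the predicted box $R_{\rm T}$ and the ground-truth box $R_{\rm G}$ exceeds a threshold. The success rate curve plots this fraction on the vertical axis against the overlap threshold on the horizontal axis. The IoU is computed as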

$ {R_{{\rm{IoU}}}} = \left| {\frac{{{R_{\rm{T}}} \cap {R_{\rm{G}}}}}{{{R_{\rm{T}}} \cup {R_{\rm{G}}}}}} \right| $

(11)
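The metrics defined above can be sketched as follows. Boxes are assumed to be `(x, y, w, h)` tuples, and the default thresholds of 20 pixels and 0.5 are illustrative choices rather than values stated in this paper. The benchmark curves are obtained by sweeping the thresholds.

```python
import math


def iou(box_a, box_b):
    """Eq. (11): intersection area over union area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance in pixels between the centers of two boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2.0) - (bx + bw / 2.0),
                      (ay + ah / 2.0) - (by + bh / 2.0))


def precision(predicted, ground_truth, distance_threshold=20.0):
    """Fraction of frames whose center error is within the distance threshold."""
    hits = sum(center_error(p, g) <= distance_threshold
               for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)


def success_rate(predicted, ground_truth, overlap_threshold=0.5):
    """Fraction of frames whose IoU exceeds the overlap threshold."""
    hits = sum(iou(p, g) > overlap_threshold
               for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)
```

Evaluating `precision` over a range of distance thresholds and `success_rate` over overlap thresholds from 0 to 1 yields the two curves used for overall evaluation.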

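Trackers are tested on the whole OTB100 dataset for an overall evaluation. In addition, tests on groups of challenging sequences, such as in-plane rotation, partial occlusion, and partially out of view, further show how the trackers behave in complex real-world environments.

3.2 Experimental results and analysis

Experiments were run on five typical, difficult video sequences. The tracking results are shown in Fig. 7, and the average center error and IoU on each sequence are listed in Table 3.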

Fig. 7 Comparison of tracking results on some video sequences ((a) Bird2; (b) DragonBaby; (c) Skiing; (d) Biker; (e) KiteSurf)

Table 3 Average center error (ACE, in pixels) and average IoU of tracking results on some video sequences

Sequence	CSK (Henriques et al., 2012)		DSST (Danelljan et al., 2014)		KCF (Henriques et al., 2015)		Staple (Bertinetto et al., 2016a)		SRDCF (Danelljan et al., 2015c)		ECO-HC (Danelljan et al., 2017)		DUCF (ours)
	ACE	IoU	ACE	IoU	ACE	IoU	ACE	IoU	ACE	IoU	ACE	IoU	ACE	IoU
Bird2	18.30	0.58	55.65	0.45	21.37	0.58	6.81	0.77	16.65	0.59	13.50	0.65	5.89	0.76
DragonBaby	87.91	0.21	142.57	0.06	50.40	0.32	18.99	0.57	69.38	0.24	19.82	0.54	21.25	0.56
Skiing	247.59	0.06	195.66	0.07	260.05	0.05	242.70	0.12	263.43	0.06	261.89	0.06	59.13	0.29
Biker	79.31	0.28	74.75	0.29	77.18	0.26	79.43	0.28	97.30	0.33	4.77	0.53	4.49	0.52
KiteSurf	36.47	0.26	28.80	0.33	17.27	0.49	3.28	0.68	15.28	0.48	63.93	0.37	1.89	0.85
Note: in the original table, the best value in each row (lowest ACE, highest IoU) is shown in bold.

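As Fig. 7(a) shows, thanks to foreground-background separation, the proposed algorithm can distinguish target details from a small number of samples in scenes where target and background are highly similar, and it tracks more accurately than the other algorithms.

The results in Fig. 7(b)(c) show that when the target changes drastically, especially under rotation and deformation, the sparse update of ECO-HC (Danelljan et al., 2017) prevents the filter model from learning the target's new appearance in time and reduces robustness. On the Skiing sequence it loses the target right at the start. The proposed algorithm instead updates actively at short intervals, learns the target representation in time, and achieves the best tracking result.

The results on the Biker sequence in Fig. 7(d) show that traditional frame-by-frame trackers such as Staple (Bertinetto et al., 2016a) tend to overfit when the target changes little (see the first and second images of Fig. 7(d)). The proposed algorithm uses the sparse update strategy at that point, which prevents overfitting and improves efficiency. When the rider jumps, the target changes quickly and DUCF switches to active updating, which gives more accurate and robust tracking. It still tracks well after the rider lands on the right.

Fig. 7(e) clearly exposes the weakness of the simple sparse update strategy. When an appearance change is accompanied by fast displacement, a slowly updated filter may not learn the new representation before the target leaves the search region, at which point the tracker loses the target completely and can hardly recover it. Adaptive updating of the correlation filter model based on appearance state estimation is therefore necessary. With it, the proposed algorithm learns a good tracking model and tracks the target accurately throughout.

Overall, the proposed algorithm tracks accurately on sequences with fast-moving targets and outperforms the other fast trackers that use a fixed model update interval.

A comprehensive quantitative analysis was carried out on all 100 sequences of OTB100. The overall results are shown in the curves of Fig. 8, and the results on groups of sequences with difficult attributes are shown in Fig. 9. The numbers of sequences labeled with in-plane rotation, occlusion, out of view, and illumination variation are 51, 48, 14, and 37, respectively.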

Fig. 8 Comparison curves of tracking results on the OTB100 dataset ((a) precision plots; (b) success plots)

Fig. 9 Comparison curves of tracking results on OTB100 sequences with different attributes ((a) in-plane rotation; (b) occlusion; (c) out of view; (d) illumination variation)

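The curves in Fig. 8 show that the proposed algorithm achieves a precision of 0.864 and a success rate of 0.649 on OTB100, the best accuracy and robustness among the fast trackers that can run on a CPU.

The results in Fig. 9 on the groups of challenging sequences show that DUCF improves tracking most markedly in complex environments. Under in-plane rotation, occlusion, out of view, and illumination variation, its precision is 3.0%, 4.4%, 5.2%, and 6.0% higher than that of the runner-up, and its success rate is 1.9%, 3.1%, 4.9%, and 4.0% higher. DUCF estimates the current appearance state of the target to update the correlation filter dynamically and separates target from background more reliably. It therefore adapts to the tracking demands of different environments, performs well in generic object tracking, and suits practical engineering applications.

The running speed was compared, in the same environment, with ECO-HC using sparse updates and frame-by-frame updates. The results are listed in Table 4.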

Table 4 Running speed of different algorithms

Algorithm	Speed/(frames/s)
ECO-HC ($N_{s}$=6)	49.04
ECO-HC ($N_{s}$=1)	26.58
DUCF (ours)	32.15
Note: the test platform CPU is an i7-6850K.

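Table 4 shows that the proposed algorithm tracks better and also runs faster than frame-by-frame updating. It effectively reduces the computational load of tracking, remains real-time on a CPU, and meets the practical requirements of mobile platforms.

4 Conclusion

This paper analyzed the shortcomings of training the correlation filter at a fixed model update interval, as traditional correlation filter algorithms do, when the target's appearance state changes, and proposed DUCF, a dynamic model update algorithm based on analysis of the target's appearance state. Optical flow features of the target region are extracted to analyze changes in the target's current appearance state, and the filter update interval is set dynamically according to the appearance state and the dominant optical flow magnitude of the target region image, which improves tracking of fast-changing targets. Finally, the improved algorithm was compared with other fast trackers on the OTB100 benchmark dataset. The results show that DUCF achieves both robust tracking of slowly changing targets and accurate tracking of fast-changing targets with good real-time performance, and that it is suitable for deployment on mobile platforms.

Because the appearance representation analysis module must run at every frame, its speed strongly affects the overall efficiency of the algorithm. Future work can therefore focus on making this analysis more efficient.

References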

Bertinetto L, Valmadre J, Golodetz S, Miksik O and Torr P H S. 2016a. Staple: complementary learners for real-time tracking//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1401-1409[DOI: 10.1109/CVPR.2016.156]

Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016b. Fully-convolutional siamese networks for object tracking//Proceedings of the European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 850-865[DOI: 10.1007/978-3-319-48881-3_56]

Bhat G, Johnander J, Danelljan M, Khan F S and Felsberg M. 2018. Unveiling the power of deep tracking//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 493-509[DOI: 10.1007/978-3-030-01216-8_30]

Bolme D S, Beveridge J R, Draper B A and Lui M Y. 2010. Visual object tracking using adaptive correlation filters//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2544-2550[DOI: 10.1109/CVPR.2010.5539960]

Chang C C and Lin C J. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3): 27 [DOI: 10.1145/1961189.1961199]

Danelljan M, Bhat G, Khan F S and Felsberg M. 2017. ECO: efficient convolution operators for tracking//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6931-6939[DOI: 10.1109/CVPR.2017.733]

Danelljan M, Häger G, Khan F S and Felsberg M. 2014. Accurate scale estimation for robust visual tracking//Proceedings of the British Machine Vision Conference 2014. Nottingham, UK: BMVA Press: 38[DOI: 10.5244/C.28.65]

Danelljan M, Häger G, Khan F S and Felsberg M. 2015a. Coloring channel representations for visual tracking//Proceedings of the 19th Scandinavian Conference on Image Analysis. Copenhagen, Denmark: Springer: 117-129[DOI: 10.1007/978-3-319-19665-7_10]

Danelljan M, Häger G, Khan F S and Felsberg M. 2015b. Convolutional features for correlation filter based visual tracking//Proceedings of 2015 IEEE International Conference on Computer Vision Workshop. Santiago, Chile: IEEE: 621-629[DOI: 10.1109/ICCVW.2015.84]

Danelljan M, Häger G, Khan F S and Felsberg M. 2015c. Learning spatially regularized correlation filters for visual tracking//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4310-4318[DOI: 10.1109/ICCV.2015.490]

Danelljan M, Robinson A, Khan F S and Felsberg M. 2016b. Beyond correlation filters: learning continuous convolution operators for visual tracking//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 472-488[DOI: 10.1007/978-3-319-46454-1_29]

Henriques J F, Caseiro R, Martins P and Batista J. 2012. Exploiting the circulant structure of tracking-by-detection with kernels//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 702-715[DOI: 10.1007/978-3-642-33765-9_50]

Henriques J F, Caseiro R, Martins P and Batista J. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 583-596 [DOI: 10.1109/TPAMI.2014.2345390]

Hou X D, Harel J and Koch C. 2012. Image signature: highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1): 194-201 [DOI: 10.1109/TPAMI.2011.146]

Li B, Wu W, Wang Q, Zhang F Y, Xing J L and Yan J J. 2019. SiamRPN++: evolution of Siamese visual tracking with very deep networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4277-4286[DOI: 10.1109/CVPR.2019.00441]

Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with Siamese region proposal network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980[DOI: 10.1109/CVPR.2018.00935]

Wang Q, Zhang L, Bertinetto L, Hu W M and Torr P H S. 2019. Fast online object tracking and segmentation: a unifying approach//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1328-1338[DOI: 10.1109/CVPR.2019.00142]

Wu Y, Lim J and Yang M H. 2015. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1834-1848 [DOI: 10.1109/TPAMI.2014.2388226]