Published: 2021-01-16
DOI: 10.11834/jig.200519
2021 | Volume 26 | Number 1

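Object Detection and Tracking

Real-time multitask framework for visual object tracking and video object segmentation

Li Han¹, Liu Kunhua¹, Liu Jiajie¹, Zhang Xiaoye²

1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China;

2. Guangdong Diankeyuan Energy Technology Co., Ltd, Guangzhou 510080, China

Received: 2020-08-26; Revised: 2020-10-21; Preprint: 2020-10-28

Supported by: National Key Research and Development Program of China (2018YFB1305002); National Natural Science Foundation of China (61773414, 62006256); Guangzhou Key Research and Development Program (202007050002)

First author: Li Han, born in 1997, male, master's student. Main research interest: computer vision. E-mail:lihan59@mail2.sysu.edu.cn;
Liu Kunhua, female, postdoctoral researcher. Main research interests: computer vision, environment perception for autonomous driving, and SLAM. E-mail:liukh5@mail.sysu.edu.cn;
Liu Jiajie, male, master's student. Main research interests: semantic segmentation and scene flow. E-mail:liujj73@mail3.sysu.edu.cn.

Corresponding author: Zhang Xiaoye, male, postdoctoral researcher. Main research interests: robotics, artificial intelligence, and computer vision. E-mail:xiaoyz@whu.edu.cn.

CLC number: TP391.4

Document code: A

Article ID: 1006-8961(2021)01-0101-12

Abstract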

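Objective Several multitask frameworks have been proposed for visual object tracking (VOT) and video object segmentation (VOS), but their accuracy and robustness remain poor. To address this problem, this paper proposes an end-to-end multitask framework for real-time VOT and VOS that fuses multi-scale context information with video interframe information. Method The proposed architecture uses an atrous spatial pyramid pooling module that covers more scales and is built from atrous depthwise separable convolutions, together with an interframe mask propagation module that carries interframe information. These modules strengthen the network's ability to segment target objects at multiple scales and make it more robust. Result On the VOT-2016 and VOT-2018 tracking datasets, the proposed method reaches an expected average overlap (EAO) of 0.462 and 0.408, which is 0.029 and 0.028 higher than SiamMask, respectively; it achieves state-of-the-art results and shows better robustness. Competitive results are also obtained on the DAVIS (densely annotated video segmentation)-2016 and DAVIS-2017 VOS datasets. On the multi-object DAVIS-2017 dataset, the method outperforms SiamMask: the mean Jaccard index of region similarity J_M and the mean F-measure of contour accuracy F_M reach 56.0 and 59.0, and the region and contour decay values J_D and F_D are 17.9 and 19.8, both lower than those of SiamMask. The method runs at 45 frames/s, which is real time. Conclusion The proposed end-to-end multitask framework for real-time VOT and VOS fully captures multi-scale context and exploits the information between video frames, which strengthens the segmentation of multi-scale target objects and improves robustness.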

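Keywords

visual object tracking; video object segmentation; fully convolutional network; atrous spatial pyramid pooling; interframe mask propagation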

Multitask framework for video object tracking and segmentation combined with multi-scale interframe information

Li Han¹, Liu Kunhua¹, Liu Jiajie¹, Zhang Xiaoye²

1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China;

2. Guangdong Diankeyuan Energy Technology Co., Ltd, Guangzhou 510080, China

Supported by: National Key Research and Development Program of China (2018YFB1305002);National Natural Science Foundation of China (61773414, 62006256)

Abstract

Objective Visual object tracking (VOT) is widely used in scenarios such as car navigation, automatic video surveillance, and human-computer interaction. It is a basic research task in video applications and requires inferring the correspondence of the target from frame to frame. Given the position of any object of interest in the first frame of the video, its position is estimated in all subsequent frames with the highest possible accuracy. Similar to VOT, semi-supervised video object segmentation (VOS) requires segmenting the target object in subsequent frames given the initial frame mask. It is also a basic research task of computer vision. However, the target object may undergo large changes in pose, scale, and appearance over the video sequence, and it may encounter difficult conditions such as occlusion, rapid movement, and truncation. Therefore, performing robust VOT and VOS in a semi-supervised manner in video sequences is challenging. The continuous nature of the video sequence itself brings additional contextual information to VOS. The interframe consistency of video enables the network to effectively transfer information from frame to frame. In VOS, the information from previous frames can be regarded as temporal context and can provide useful hints for subsequent predictions. Therefore, the effective use of additional information brought by video is extremely important for video tasks. Various multitask frameworks have been proposed for VOT and VOS, but their accuracy and robustness are poor. This paper proposes a multitask end-to-end framework for real-time VOT and VOS to address these problems. This framework combines multi-scale context information and video interframe information. Method In this work, the depthwise convolution inside depthwise separable convolution is replaced with an atrous depthwise convolution, thereby forming the atrous depthwise separable convolution. 
Depending on the atrous rate, the convolution can have different receptive fields while remaining lightweight. This study designs an atrous spatial pyramid pooling module with many atrous rates, composed of atrous depthwise separable convolutions, and applies it to the VOS branch so that the network can capture multi-scale context. This work uses atrous rates of 1, 3, 6, 9, 12, 24, 36, and 48 to convolve the feature map with different receptive fields and also applies adaptive pooling to the feature map. The resulting feature maps are concatenated, and a 1×1 convolution is used to transform the feature map channels. Through these operations, the feature map output by the module carries rich multi-scale context information, which enables the network to predict multi-scale targets. Continuity is a unique property of video sequences and brings additional contextual information to video tasks. Inspired by the reference-guided mask propagation algorithm, a mask propagation module is added to the VOS branch to provide location and segmentation information to the network. The proposed mask propagation module is composed of 3×3 convolutions with atrous rates of 2, 3, and 6. In our architecture, a multi-scale atrous spatial pyramid pooling module composed of atrous depthwise separable convolutions and an interframe mask propagation module with interframe information are used. These modules give the network a stronger ability to segment multi-scale target objects and make it more robust. 
Result All experiments in this work are performed on NVIDIA TITAN X graphics cards. The network is trained in two stages, and the training sets differ between stages because the datasets differ in nature. In the first stage, this work uses the YouTube-VOS, common objects in context (COCO), ImageNet-DET (DETection), and ImageNet-VID (VIDeo) datasets. For datasets without mask ground truth, the mask branch is not trained. For a video sequence with only a single frame, the image and mask of the previous frame in the interframe mask propagation module are set to be the same as those of the current frame. Following SiamMask, this work uses the stochastic gradient descent optimizer and a warm-up training strategy. The learning rate increases from 1×10^-3 to 5×10^-3 in the first 5 epochs. A logarithmic decay strategy is then used to reduce the learning rate to 2.5×10^-4 over 15 epochs. In the second stage, only the YouTube-VOS and COCO datasets are used for training. Both datasets have ground-truth masks, which helps improve the segmentation of video objects. The second stage uses a logarithmic decay strategy to reduce the learning rate from 2.5×10^-4 to 1.0×10^-4 over 20 epochs. The expected average overlaps of the proposed method on the VOT-2016 and VOT-2018 datasets reach 0.462 and 0.408, which are 0.029 and 0.028 higher than those of SiamMask, respectively. The proposed method achieves state-of-the-art results and shows better robustness. Competitive results are also achieved on the DAVIS-2016 and DAVIS-2017 datasets of VOS. On the multi-object DAVIS-2017 dataset, the proposed method performs better than SiamMask. The evaluation indexes J_M and F_M reach 56.0 and 59.0, respectively, and the decay values of the region and the contour, J_D and F_D, are 17.9 and 19.8, respectively, both lower than those of SiamMask. 
The running speed is 45 frames per second, reaching a real-time running speed. Conclusion In this study, we proposed a multitask end-to-end framework of real-time VOT and VOS. The proposed method integrates multi-scale context information and video interframe information, fully captures multi-scale context information, and utilizes the information between video frames. These features make the network robust to segmentation of multi-scale target objects.

Key words

visual object tracking(VOT); video object segmentation(VOS); fully convolutional network(FCN); atrous spatial pyramid pooling; inter-frame mask propagation

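0 Introduction

Visual object tracking (VOT) is widely used in scenarios such as car navigation, automatic video surveillance, and human-computer interaction. It is a basic research task in video applications and requires inferring the correspondence of the target from frame to frame. Given the position of an arbitrary object of interest in the first frame of a video, the task is to estimate its position in all subsequent frames as accurately as possible (Smeulders et al., 2014; Li X et al., 2019).

Similar to VOT, semi-supervised video object segmentation (VOS) requires segmenting the target object in the subsequent frames of a video given the mask of the initial frame; it is also a basic research task in computer vision. However, the target object may undergo large changes in pose, scale, and appearance over the video sequence, and it may encounter difficult conditions such as occlusion, fast motion, and truncation. Performing robust VOT and VOS in a semi-supervised manner is therefore challenging.

The continuity of a video sequence brings additional context to VOS. First, the interframe consistency of video allows the network to pass information effectively from frame to frame. In addition, information from previous frames can be regarded as temporal context that provides useful cues for subsequent predictions. Making effective use of this additional information is therefore very important for video tasks.

Wang et al. (2019) proposed a multitask framework for VOT and VOS. However, their framework considers neither multi-scale context nor the additional information between video frames, which limits its robustness. For multi-scale context in semantic image segmentation, Deeplabv3+ (Chen et al., 2018) uses atrous convolution to enlarge the receptive field while capturing multi-scale context, which substantially improves segmentation accuracy. Inspired by Deeplabv3+, this paper introduces multi-scale atrous convolution into the multitask VOT and VOS framework to address the lack of multi-scale context in current mainstream tracking and segmentation frameworks. Specifically, the proposed framework uses an atrous spatial pyramid pooling module composed of atrous depthwise separable convolutions and an interframe mask propagation module that takes the previous frame as input, so that it can capture multi-scale context and exploit the additional information in video to obtain more accurate results. To verify its effectiveness, the method is evaluated on the VOT-2016 and VOT-2018 tracking datasets and on the DAVIS (densely annotated video segmentation)-2016 and DAVIS-2017 VOS datasets. It achieves competitive results on these datasets; in particular, it reaches an expected average overlap (EAO) of 0.408 on VOT-2018, which supports the effectiveness of this work.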

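1 Related work

1.1 Visual object tracking

Research on VOT can proceed from different directions, including feature extraction (Henriques et al., 2015), template updating (Valmadre et al., 2017), classifier design (Zhang et al., 2017), and bounding box regression, with the aim of designing faster and more accurate trackers. Early feature extraction relied mainly on color features, texture features, or other handcrafted features. With the development of deep learning, deep convolutional features from convolutional neural networks (CNNs) have been widely adopted. Template updating improves the adaptability of the model, but it makes online tracking inefficient and can cause tracking drift. The introduction of correlation filter (CF) methods (Bolme et al., 2010; Danelljan et al., 2017; Li et al., 2017; Zhang et al., 2020) brought tracking efficiency and accuracy to an unprecedented level. Siamese networks have great potential in both matching accuracy and speed, and many researchers have replaced correlation filters with Siamese networks to good effect. As one of the pioneering Siamese works, SiamFC (fully-convolutional siamese networks) (Bertinetto et al., 2016) builds a fully convolutional Siamese network to train a tracker. Inspired by it, more researchers took up this line of work and proposed more advanced models. CFNet (correlation filter network) (Valmadre et al., 2017) introduces a correlation filter layer into the SiamFC framework and performs online tracking to improve accuracy. DSiam (dynamic siamese) (Guo et al., 2017) modifies the Siamese branches with two online transformation modules, yielding a dynamically learned Siamese network that improves accuracy while staying fast. SA-Siam (semantic and appearance siamese) (He et al., 2018) proposes a twofold Siamese network with a semantic branch and an appearance branch; the two branches are trained separately to preserve their heterogeneity and are combined at test time to improve tracking accuracy. However, to handle scale variation these Siamese networks must search over multiple scales, which slows them down.

Inspired by the region proposal network in Faster R-CNN (region-based convolutional network) (Ren et al., 2017), the SiamRPN (siamese region proposal network) tracker (Li et al., 2018) applies a region proposal network to the output of the Siamese network. By jointly training the classification and regression branches of the region proposal network, SiamRPN avoids the time-consuming step of extracting multi-scale feature maps for scale invariance and achieves very good results. However, SiamRPN struggles with distractors that look similar to the target. Building on SiamRPN, DaSiamRPN (distractor-aware siamese region proposal networks) (Zhu et al., 2018) applies hard negative mining during training; with data augmentation, it improves the discriminative power of the tracker and obtains more reliable results. Many frameworks have been derived from SiamFC, but they still could not be trained with deeper CNNs as backbones. To address this problem, SiamRPN++ (Li B et al., 2019) randomly shifts the position of the training target within the search region during training to remove the center bias. This makes it possible to train trackers on very deep backbones and reach better tracking accuracy. SiamCAR (siamese classification and regression) (Guo et al., 2020) proposes a new fully convolutional Siamese framework that splits visual tracking into two subproblems, classifying each pixel and regressing the object bounding box at that pixel, and solves tracking end to end in a per-pixel manner.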

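1.2 Video object segmentation

The goal of VOS is to segment the target object in subsequent video frames given the initial mask of the first frame. Researchers have proposed a wide range of methods for this task, which can be divided into methods based on online learning, methods based on offline learning, and methods based on tracking.

1) Methods based on online learning. To separate the target object from the background and from distractors, these methods fine-tune the segmentation network on the first frame. OSVOS (one-shot video object segmentation) (Caelles et al., 2017) fine-tunes a pretrained segmentation network on the first frame of the test video. OSVOS-S (semantic one-shot video object segmentation) (Maninis et al., 2019) improves OSVOS by introducing instance information. Many other methods use online learning as a technique to raise accuracy. These studies show that online learning effectively improves the discriminative ability of VOS models. However, in the semi-supervised setting an online method must update the model weights at test time, which requires a large number of optimization iterations and is therefore slow.

2) Methods based on offline learning. These methods use the information of the initial frame and pass the target information to subsequent frames through propagation or matching. MaskTrack (Perazzi et al., 2017) concatenates the mask predicted for the previous frame with the current frame to provide spatial guidance. FEELVOS (fast end-to-end embedding learning for video object segmentation) (Voigtlaender et al., 2019) proposes a semantic pixel-wise embedding together with global and local matching mechanisms to pass location information to subsequent frames. RGMP (reference-guided mask propagation) (Oh et al., 2018) uses a Siamese network to capture the positional similarity between the search image and the reference image. AGAMEVOS (a generative appearance model for end-to-end video object segmentation) (Johnander et al., 2019) uses a probabilistic generative model to predict the distributions of target and background features. These methods avoid computationally expensive online fine-tuning, but inefficient information flow still keeps them from running fast. Moreover, without a reliable target representation they usually reach only suboptimal accuracy.

3) Methods based on tracking. FAVOS (fast and accurate video object segmentation) (Cheng et al., 2018) proposes a part-based tracking method that tracks local regions of the target object. SiamMask (Wang et al., 2019) narrows the gap between tracking and segmentation by adding a mask branch to SiamRPN, and it runs in real time. However, SiamMask lacks the two modules used in this paper: the atrous spatial pyramid pooling module composed of atrous depthwise separable convolutions and the interframe mask propagation module that uses the previous frame. With these modules, the proposed framework can capture multi-scale context and exploit the additional information in video to obtain more accurate results.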

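2 Method

2.1 Atrous depthwise separable convolution

A depthwise separable convolution is computed in two steps: a depthwise convolution and a pointwise convolution. As shown in Fig. 1(a), the depthwise convolution does not change the number of channels of the feature map. It convolves each channel of the feature map separately and then stacks the per-channel outputs into a feature map with the same number of channels as the input. The pointwise convolution in Fig. 1(b) does not change the spatial size of the feature map. It uses 1×1 kernels and convolves only across channels; the number of output channels is set by the number of 1×1 kernels. In short, a depthwise separable convolution first filters each channel on its own with a single-channel kernel and then uses 1×1 convolutions to combine the channels into the required number of output channels. These two steps reduce the number of parameters and make the network more lightweight while preserving its performance. The proposed method replaces the depthwise convolution of Fig. 1(a) with the atrous depthwise convolution of Fig. 1(c), forming the atrous depthwise separable convolution. Depending on the atrous rate, this convolution has different receptive fields while remaining lightweight.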
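To make the parameter saving concrete, the following minimal sketch counts weights for a standard convolution and for a depthwise separable one, and computes the spatial extent covered by a dilated kernel. The channel counts (256 in, 256 out) are illustrative assumptions rather than values from the paper, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 kernels that mix the channels
    return depthwise + pointwise


def effective_kernel_size(k, rate):
    """Side length of the region spanned by a k x k kernel with atrous rate `rate`."""
    return k + (k - 1) * (rate - 1)


# Illustrative channel counts (assumed, not taken from the paper).
c_in, c_out, k = 256, 256, 3
print(standard_conv_params(c_in, c_out, k))        # 589824
print(depthwise_separable_params(c_in, c_out, k))  # 67840
print(effective_kernel_size(k, rate=6))            # 13
```

The atrous rate changes only the spacing of the kernel taps, so the weight count is the same for every rate; only the receptive field grows.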

Fig. 1 Atrous depthwise separable convolution

((a) depthwise convolution; (b) pointwise convolution; (c) atrous depthwise convolution)

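2.2 Atrous spatial pyramid pooling module with more scales

This paper designs an atrous spatial pyramid pooling (ASPP) module with a larger set of atrous rates, built from the atrous depthwise separable convolutions of Section 2.1, and applies it in the VOS branch so that the network can capture multi-scale context. As shown in Fig. 2, the module convolves the feature map with atrous rates of 1, 3, 6, 9, 12, 24, 36, and 48 to obtain different receptive fields and also applies adaptive pooling to the feature map. The resulting feature maps are concatenated, and a 1×1 convolution then transforms the number of channels. The output feature map therefore carries rich multi-scale context, which gives the network the ability to predict targets at multiple scales.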
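The effect of the eight rates can be tabulated with a few lines of Python. This sketch assumes 3×3 kernels in every atrous branch and, purely for illustration, a 15×15 input feature map with zero padding; neither assumption is confirmed by the text. For each rate it reports the extent of the dilated kernel and how many of the nine taps land inside the map when the kernel is centered on the middle pixel.

```python
RATES = [1, 3, 6, 9, 12, 24, 36, 48]  # atrous rates listed in the text


def kernel_extent(rate, k=3):
    """Side length of the region spanned by a k x k kernel with the given atrous rate."""
    return k + (k - 1) * (rate - 1)


def valid_taps_at_center(rate, size=15, k=3):
    """Number of kernel taps inside a size x size map when the kernel is centered
    on the middle pixel (zero padding assumed, so the other taps read zeros)."""
    center = size // 2
    half = k // 2
    offsets = [(i - half) * rate for i in range(k)]
    inside = [o for o in offsets if 0 <= center + o < size]
    return len(inside) ** 2


for r in RATES:
    print(r, kernel_extent(r), valid_taps_at_center(r))
```

Under these assumptions the largest rates span far more than a 15×15 map, so from the center pixel only the middle tap reads real data; on larger feature maps the same branches would contribute wider context.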

Fig. 2 Atrous spatial pyramid pooling module with more scales

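2.3 Interframe mask propagation module

Continuity is a property specific to video sequences, and it brings additional context to video tasks. First, the interframe consistency of video allows the network to pass information effectively from frame to frame. In addition, in VOS the information from previous frames can be regarded as temporal context that provides useful cues for subsequent predictions. Making effective use of this additional information is therefore very important for video tasks.

Inspired by RGMP (Oh et al., 2018), this paper adds a mask propagation module to the VOS branch to provide the network with localization and segmentation information. The module consists of 3×3 atrous convolutions with atrous rates of 2, 3, and 6. The previous frame and its mask are first concatenated and passed through two 3×3 convolutional layers to extract fused features. The fused feature map is then rescaled by upsampling to 15×15, and the fused features are fed into the mask propagation module together with the features that the network extracts from the current frame and passes through the correlation operation. Fig. 3 shows the interframe mask propagation flow.

Fig. 3 Interframe mask propagation flow chart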

2.4 Overall network framework

This paper uses the first four stages of ResNet-50 (He et al., 2016) as the backbone ${\mathit{\boldsymbol{f}}_\theta }$ and uses convolutions with stride 1 in the fourth stage so that the spatial resolution of the feature map does not become too small. A 1×1 convolution then reduces the number of channels of the backbone output. The template image $\mathit{\boldsymbol{z}}$ and the search image $\mathit{\boldsymbol{x}}$ are fed separately into the backbone ${\mathit{\boldsymbol{f}}_\theta }$ to obtain the features ${\mathit{\boldsymbol{f}}_\theta }\left(\mathit{\boldsymbol{z}} \right)$ and ${\mathit{\boldsymbol{f}}_\theta }\left(\mathit{\boldsymbol{x}} \right)$, and the correlation operation $g_{\theta}$ is then applied to the two feature maps, that is

$ {g_\theta }(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{x}}) = {\mathit{\boldsymbol{f}}_\theta }(\mathit{\boldsymbol{z}})*{\mathit{\boldsymbol{f}}_\theta }(\mathit{\boldsymbol{x}}) $

(1)

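where ${\mathit{\boldsymbol{f}}_\theta }$ is the backbone and $*$ denotes the correlation operation; a depthwise correlation, analogous to depthwise convolution, is used here.

Each position ${m}_{n}$ of the feature map obtained from the correlation operation represents the correlation between the template features and the search image at that position.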
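The depthwise correlation in Eq. (1) can be sketched in plain Python as follows. Each channel of the template features acts as a kernel that slides over the matching channel of the search features. The nested-list layout and the function name `depthwise_xcorr` are illustrative choices, not the authors' implementation.

```python
def depthwise_xcorr(search, template):
    """Depthwise cross-correlation.

    search:   list of C channels, each an Hs x Ws nested list
    template: list of C channels, each an Ht x Wt nested list (Ht <= Hs, Wt <= Ws)
    returns:  list of C response maps of size (Hs - Ht + 1) x (Ws - Wt + 1)
    """
    assert len(search) == len(template), "channel counts must match"
    out = []
    for s_ch, t_ch in zip(search, template):
        hs, ws = len(s_ch), len(s_ch[0])
        ht, wt = len(t_ch), len(t_ch[0])
        resp = []
        for i in range(hs - ht + 1):
            row = []
            for j in range(ws - wt + 1):
                acc = 0.0
                # Inner product of the template channel with the window at (i, j).
                for u in range(ht):
                    for v in range(wt):
                        acc += s_ch[i + u][j + v] * t_ch[u][v]
                row.append(acc)
            resp.append(row)
        out.append(resp)
    return out


# Tiny example: one channel, 3 x 3 search features, 2 x 2 template features.
search = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
template = [[[1, 0], [0, 1]]]
print(depthwise_xcorr(search, template))  # [[[6.0, 8.0], [12.0, 14.0]]]
```

Because channels are never mixed, the response keeps one map per channel, and each spatial position of that response corresponds to one candidate location ${m}_{n}$ in the search image.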

The overall network framework is shown in Fig. 4.

Fig. 4 Overall network framework

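For the region proposal network (RPN) branch used for visual object tracking, assume that there are $k$ anchors. Two 1×1 convolutions change the number of channels of the feature map to $2k$ and $4k$ for target classification and bounding box regression, giving $\mathit{\boldsymbol{A}}_{w \times h \times 2k}^{{\rm{cls}}}$ and $\mathit{\boldsymbol{A}}_{w \times h \times 4k}^{{\rm{reg}}}$, where $\rm{cls}$ indicates that the feature map is used for classification, $\rm{reg}$ indicates that it is used for bounding box regression, $w$ and $h$ are the width and height of the output feature map, and $2k$ and $4k$ are its numbers of channels.

As in Faster R-CNN (Ren et al., 2017), let $\mathit{\boldsymbol{A}}_{x}$, $\mathit{\boldsymbol{A}}_{y}$, $\mathit{\boldsymbol{A}}_{w}$, and $\mathit{\boldsymbol{A}}_{h}$ denote the center coordinates, width, and height of an anchor, and let $\mathit{\boldsymbol{G}}_{x}$, $\mathit{\boldsymbol{G}}_{y}$, $\mathit{\boldsymbol{G}}_{w}$, and $\mathit{\boldsymbol{G}}_{h}$ denote the center coordinates, width, and height of the ground-truth box. The following transformation is applied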

$ \begin{array}{l} \delta [0] = \frac{{{{\bf{G}}_x} - {{\bf{A}}_x}}}{{{{\bf{A}}_w}}}, \delta [1] = \frac{{{{\bf{G}}_y} - {{\bf{A}}_y}}}{{{{\bf{A}}_h}}}\\ \;\;\;\delta [2]{\rm{ }} = \ln \frac{{{{\bf{G}}_w}}}{{{{\bf{A}}_w}}}, \delta [3] = \ln \frac{{{{\bf{G}}_h}}}{{{{\bf{A}}_h}}} \end{array} $

(2)
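As a worked instance of Eq. (2), the sketch below encodes a ground-truth box relative to an anchor. The box values are hypothetical, and `decode_box` is not given in the text; it is simply the algebraic inverse of Eq. (2), included to check the encoding.

```python
import math


def encode_box(anchor, gt):
    """Regression targets of Eq. (2). Boxes are (cx, cy, w, h) tuples."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return [(gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah)]


def decode_box(anchor, delta):
    """Algebraic inverse of encode_box: recover (cx, cy, w, h) from the offsets."""
    ax, ay, aw, ah = anchor
    return (ax + delta[0] * aw, ay + delta[1] * ah,
            aw * math.exp(delta[2]), ah * math.exp(delta[3]))


anchor = (100.0, 100.0, 64.0, 32.0)  # hypothetical anchor
gt = (116.0, 92.0, 128.0, 32.0)      # hypothetical ground-truth box
delta = encode_box(anchor, gt)
print(delta)  # [0.25, -0.25, 0.6931471805599453, 0.0]
```

Normalizing the center offsets by the anchor size and taking the logarithm of the size ratios keeps the targets on a similar scale for anchors of different sizes.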

These offsets are then optimized with the smooth L1 loss, which gives the bounding box regression loss ${L}_{R}$

$ {L_R} = \sum\limits_{i = 0}^3 {{S_{L1}}} (\delta [i]) $

(3)

$ {S_{L1}}(x){\rm{ }} = \left\{ {\begin{array}{*{20}{l}} {0.5{x^2}}&{|x| < 1}\\ {|x| - 0.5}&{|x| \ge 1} \end{array}} \right. $

(4)
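Eqs. (3) and (4) translate directly into code. One point is an assumption here: Eq. (3) writes the loss as a function of $\delta[i]$ alone, and the sketch applies the smooth L1 function to the difference between the predicted offsets and the targets of Eq. (2), as is usual for RPN-style regression heads.

```python
def smooth_l1(x):
    """Smooth L1 function of Eq. (4)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def regression_loss(pred, target):
    """Sum of smooth L1 over the four box offsets, following Eq. (3).

    Assumption: the loss is evaluated on pred - target for each offset.
    """
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


print(smooth_l1(0.5))   # 0.125
print(smooth_l1(-2.0))  # 1.5
print(regression_loss([0.25, -0.25, 0.5, 0.0], [0.25, -0.25, 0.0, 2.0]))  # 1.625
```

The quadratic part keeps gradients small near zero, while the linear part limits the influence of large errors.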

The classification loss ${L}_{C}$ is the cross-entropy loss

$ {L_C} = - \log \left({\frac{{{{\rm{e}}^{{x_{{\rm{cls}}}}}}}}{{\sum\limits_j {{{\rm{e}}^{{x_j}}}} }}} \right) $

(5)
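A minimal sketch of Eq. (5) follows. It uses the log-sum-exp form, which is mathematically the same as the equation but avoids overflow for large logits; the function name and the list-based interface are illustrative.

```python
import math


def cross_entropy(logits, cls):
    """Cross-entropy of Eq. (5): -log(softmax(logits)[cls]), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[cls]


# Two equal logits (foreground/background): the loss is log 2.
print(cross_entropy([0.0, 0.0], 1))  # 0.6931471805599453
```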

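For the VOS branch, segmentation is performed, based on the tracking result, at each response position ${m}_{n}$ of the feature map obtained from the correlation operation. This process uses the atrous spatial pyramid pooling module and the mask propagation module and also exploits the low-level features of the network. During training, each label ${y}_{n}$ is set to 1 or -1 depending on whether the labeled position contains the target.

As in SiamMask (Wang et al., 2019), a binary logistic regression loss is applied at each response position ${m}_{n}$, and the mask loss is computed only at positions whose label ${y}_{n}$ is 1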

$ {L_M} = \sum\limits_n {\left({\frac{{1 + {y_n}}}{{2wh}}\sum\limits_{ij} {\log } \left({1 + {{\rm{e}}^{ - c_n^{ij}m_n^{ij}}}} \right)} \right)} $

(6)

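where ${c}_{n}$ is the per-pixel ground-truth mask label, whose size $w \times h$ is the mask size, and $c_n^{ij}$ is the mask label of pixel ($i$, $j$) at the $n$-th response position. In effect, $w \times h$ binary classifiers decide whether each pixel belongs to the target object given by the template image.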
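Eq. (6) can be checked numerically with the sketch below. The text does not state how $c_n^{ij}$ is encoded; the code assumes labels in {+1, -1}, as in SiamMask, and flattens each $w \times h$ mask into a list.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mask_loss(labels, targets, preds):
    """Mask loss of Eq. (6).

    labels:  y_n in {+1, -1}, one per response position n
    targets: c_n, one flattened list of +1/-1 pixel labels per position (assumed encoding)
    preds:   m_n, one flattened list of real-valued mask logits per position
    """
    total = 0.0
    for y, c, m in zip(labels, targets, preds):
        wh = len(c)
        per_pixel = sum(softplus(-cij * mij) for cij, mij in zip(c, m))
        # The factor (1 + y) / 2 is 1 for positive positions and 0 for negative ones.
        total += (1 + y) / (2 * wh) * per_pixel
    return total


labels = [1, -1]
targets = [[1, -1], [1, 1]]
preds = [[0.0, 0.0], [5.0, -5.0]]
print(mask_loss(labels, targets, preds))
```

In this example only the first position contributes, since its label is 1; the second position has label -1 and is masked out by the factor $(1 + y_n)/2$.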

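Finally, the loss of the overall framework is the weighted sum of the mask loss of the VOS branch and the classification and regression losses of the RPN branch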

$ L = \alpha {L_M} + \beta {L_C} + \gamma {L_R} $

(7)

where $\alpha$, $\beta$, and $\gamma$ are hyperparameters that balance the three losses.

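2.5 Training and testing details

The network is trained in two stages, and the training sets differ between stages because the datasets differ in nature.

In the first stage, this paper uses the YouTube-VOS, COCO (common objects in context), ImageNet-DET (DETection), and ImageNet-VID (VIDeo) datasets. For datasets without ground-truth masks, the mask branch is not trained. For samples that consist of a single frame only, the previous-frame image and mask in the interframe mask propagation module are set to be the same as those of the current frame. Following SiamMask, training uses the stochastic gradient descent (SGD) optimizer with a warm-up strategy: the learning rate increases from 1×10^-3 to 5×10^-3 over the first 5 epochs and is then reduced to 2.5×10^-4 over 15 epochs with a logarithmic decay schedule.

In the second stage, only the YouTube-VOS and COCO datasets are used for training. Both datasets have ground-truth masks, which helps improve the quality of video object segmentation. The second stage uses logarithmic decay to reduce the learning rate from 2.5×10^-4 to 1.0×10^-4 over 20 epochs.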
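The two-stage schedule can be written out per epoch as below. Several details are assumptions, since the text gives only the endpoints and durations: the warm-up is taken to be linear, "logarithmic decay" is taken to mean values evenly spaced in log space, and the peak value is repeated at the boundary between warm-up and decay.

```python
def linear_space(start, end, n):
    """n values from start to end, evenly spaced (n >= 2)."""
    step = (end - start) / (n - 1)
    return [start + step * i for i in range(n)]


def log_space(start, end, n):
    """n values from start to end, evenly spaced in log space (n >= 2)."""
    ratio = (end / start) ** (1.0 / (n - 1))
    return [start * ratio ** i for i in range(n)]


# Stage 1: 5 warm-up epochs, then 15 decay epochs.
stage1 = linear_space(1e-3, 5e-3, 5) + log_space(5e-3, 2.5e-4, 15)
# Stage 2: 20 decay epochs.
stage2 = log_space(2.5e-4, 1.0e-4, 20)

for epoch, lr in enumerate(stage1 + stage2):
    print(epoch, f"{lr:.6f}")
```

With spacing in log space, the rate shrinks by the same factor every epoch, so the absolute steps are large early in the decay and small near its end.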

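At test time, the mask predicted for frame $t$-1 is concatenated with the search image of frame $t$-1 to form the input from which the fused features are extracted. The fused features and the features produced by the correlation operation together form the input of the mask propagation module for frame $t$. As shown in Fig. 3, the mask is updated continually as the framework runs. Because the mask branch segments only the response position ${m}_{n}$ with the highest classification score in the tracking result, the method runs fast and reaches real-time speed.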
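The test-time propagation described above amounts to a simple loop in which each prediction becomes the prior for the next frame. In the sketch below, `step` is a hypothetical callable standing in for one forward pass of the network, and the mask for the first frame is assumed to be available; how it is obtained from an initial bounding box is not covered here.

```python
def run_sequence(frames, init_mask, step):
    """Propagate masks through a sequence.

    frames:    list of frames (any objects the model understands)
    init_mask: mask for frames[0] (assumed to be given)
    step:      hypothetical callable (frame_t, frame_prev, mask_prev) -> mask_t
    """
    masks = [init_mask]
    prev_frame, prev_mask = frames[0], init_mask
    for frame in frames[1:]:
        mask = step(frame, prev_frame, prev_mask)
        masks.append(mask)
        # The prediction for frame t becomes the prior for frame t + 1.
        prev_frame, prev_mask = frame, mask
    return masks


# Toy stand-in for the network: the "mask" is just a running count.
print(run_sequence(["f0", "f1", "f2"], 0, lambda f, pf, pm: pm + 1))  # [0, 1, 2]
```

Since each step depends only on the previous frame and its mask, memory use stays constant over the sequence, but an error in one frame can carry over to the following frames.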

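3 Experimental results

This section evaluates the proposed method on visual object tracking and video object segmentation. All experiments are run on NVIDIA TITAN X graphics cards.

3.1 Visual object tracking results

Evaluation uses the official visual object tracking (VOT) toolkit, with expected average overlap (EAO) as the measure; EAO accounts for both the accuracy and the robustness of a tracker. The method is validated on the VOT-2016 and VOT-2018 datasets and compared with state-of-the-art methods.

As shown in Table 1, the proposed method reaches an EAO of 0.462, an accuracy of 0.632, and a robustness of 0.191 on the VOT-2016 dataset. Its EAO is 0.029 higher than that of SiamMask (0.433) and is the best among the compared methods, and its robustness is also better (0.191 versus 0.214; lower is better), which demonstrates the effectiveness of the method.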

Table 1 Comparison of EAO, accuracy, robustness, and speed with state-of-the-art methods on the VOT-2016 dataset

Method	EAO	Accuracy	Robustness	Speed/(frames/s)
Staple	0.322	0.53	-	80
C-COT	0.331	0.53	-	0.3
SiamRPN	0.344	0.56	-	160
SiamMask	0.433	0.639	0.214	55
Ours	0.462	0.632	0.191	45
Note: bold indicates the best result in each column; "-" indicates that the result was not reported in the original paper.

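As shown in Table 2, the proposed method reaches an EAO of 0.408 on the VOT-2018 dataset, 0.028 higher than SiamMask, which confirms its effectiveness. Its accuracy is 0.604, its robustness is 0.253, and its speed is 45 frames/s. Although it runs slower than SiamMask (45 versus 55 frames/s), it is more robust (0.253 versus 0.276), which yields the higher EAO.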

Table 2 Comparison of EAO, accuracy, robustness, and speed with state-of-the-art methods on the VOT-2018 dataset

Method	EAO	Accuracy	Robustness	Speed/(frames/s)
CSRDCF	0.263	0.466	0.318	48.9
STRCF	0.345	0.523	0.215	2.9
DaSiamRPN	0.326	0.569	0.337	160
SiamRPN	0.244	0.490	0.460	200
SiamMask	0.380	0.609	0.276	55
Ours	0.408	0.604	0.253	45
Note: bold indicates the best result in each column.

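The results on the VOT-2016 and VOT-2018 datasets confirm the effectiveness of the method for visual object tracking: the better robustness of the model leads to a better EAO. Qualitative results are shown for bag, Bolt, and Fernando in Fig. 5 (bag, Bolt, and Fernando are from the VOT-2018 dataset; car shadow is from the DAVIS-2016 dataset; dog jump and horse jump are from the DAVIS-2017 dataset).

Fig. 5 Qualitative results of our method on visual object tracking and video object segmentation

((a) bag; (b) Bolt; (c) Fernando; (d) car shadow; (e) dog jump; (f) horse jump)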

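3.2 Video object segmentation results

Given only a simple initial bounding box, the model can perform video object segmentation without any preprocessing. The method is validated on the DAVIS-2016 (Perazzi et al., 2016) and DAVIS-2017 datasets. On VOS datasets, the Jaccard index of region similarity ($J$) and the F-measure of contour accuracy ($F$) are commonly used as the main metrics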

$ J = \frac{{\left| {\mathit{\boldsymbol{M}} \cap \mathit{\boldsymbol{G}}} \right|}}{{\left| {\mathit{\boldsymbol{M}} \cup \mathit{\boldsymbol{G}}} \right|}} $

(8)

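where the Jaccard index $J$ of region similarity is the ratio of the intersection to the union of the predicted mask $\mathit{\boldsymbol{M}}$ and the ground truth $\mathit{\boldsymbol{G}}$; it measures the accuracy of the pixel-level prediction.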

$ F = \frac{{2{P_{\rm{c}}}{R_{\rm{c}}}}}{{{P_{\rm{c}}} + {R_{\rm{c}}}}} $

(9)

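where the $\rm{F}$-measure of contour accuracy is the harmonic mean of the contour-based precision $P_{\rm{c}}$ and recall $R_{\rm{c}}$; it measures the accuracy of the segmented boundary.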
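Eqs. (8) and (9) can be illustrated on a toy example. The sketch computes $J$ for two small binary masks and $F$ from a given contour precision and recall; extracting and matching contours to obtain $P_{\rm{c}}$ and $R_{\rm{c}}$ is outside its scope, and the handling of empty masks is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J of Eq. (8) for two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def f_measure(precision, recall):
    """Contour accuracy F of Eq. (9) from contour precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
print(jaccard(pred, gt))    # 0.5
print(f_measure(0.5, 1.0))  # 0.6666666666666666
```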

For the DAVIS datasets, this paper reports the mean (M), recall (O), and decay (D) of $J$ and $F$, denoted $J_{\rm{M}}$, $J_{\rm{O}}$, $J_{\rm{D}}$ and $F_{\rm{M}}$, $F_{\rm{O}}$, $F_{\rm{D}}$, respectively.
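The three statistics can be sketched as follows for one sequence of per-frame scores. The definitions are assumptions based on the DAVIS benchmark (Perazzi et al., 2016) rather than on this text: recall is the fraction of frames scoring above 0.5, and decay is the mean over the first quarter of the sequence minus the mean over the last quarter. The scores are made up for illustration; the tables report the values as percentages.

```python
def mean_recall_decay(scores, threshold=0.5, bins=4):
    """Mean, recall, and decay of per-frame scores (assumed DAVIS-style definitions)."""
    n = len(scores)
    assert n >= bins, "need at least one frame per bin"
    mean = sum(scores) / n
    recall = sum(1 for s in scores if s > threshold) / n
    # Split the sequence into `bins` consecutive chunks of near-equal length.
    bounds = [round(i * n / bins) for i in range(bins + 1)]
    first = scores[bounds[0]:bounds[1]]
    last = scores[bounds[-2]:bounds[-1]]
    decay = sum(first) / len(first) - sum(last) / len(last)
    return mean, recall, decay


# Hypothetical per-frame J scores for one sequence.
scores = [0.9, 0.8, 0.8, 0.7, 0.6, 0.6, 0.4, 0.2]
print(mean_recall_decay(scores))
```

A lower decay means that quality degrades less over the course of a sequence.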

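The results of the proposed method on the DAVIS-2016 and DAVIS-2017 datasets are shown in Tables 3 and 4. As the tables show, methods such as OnAVOS (online adaptation of convolutional neural networks for video object segmentation) (Voigtlaender and Leibe, 2017) achieve the best results. However, OnAVOS, OSVOS, FAVOS, and OSMN (efficient video object segmentation via network modulation) (Yang et al., 2018) require fine-tuning and preprocessing of the network at test time, which makes them very slow and far from real time. Compared with the real-time SiamMask on the single-object DAVIS-2016 dataset, the proposed method is lower by 2.5 in $J_{\rm{M}}$ (69.2 versus 71.7) and by 2.3 in $F_{\rm{M}}$ (65.5 versus 67.8). On the multi-object DAVIS-2017 dataset, however, the proposed method outperforms SiamMask: $J_{\rm{M}}$ and $F_{\rm{M}}$ reach 56.0 and 59.0 (versus 54.3 and 58.5), and the region and contour decay values $J_{\rm{D}}$ and $F_{\rm{D}}$ are both lower than those of SiamMask (17.9 and 19.8 versus 19.3 and 20.9). The method runs at 45 frames/s, which is real time.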

Table 3 Comparison of region similarity J, contour accuracy F, and running speed with state-of-the-art methods on the DAVIS-2016 dataset (validation set)

Method	J_M	J_O	J_D	F_M	F_O	F_D	Speed/(frames/s)
RGMP	81.5	91.7	10.9	82.0	90.8	10.1	8
FAVOS	82.4	96.5	4.5	79.5	89.4	5.5	0.8
OnAVOS	86.1	96.1	5.2	84.9	89.7	5.8	0.08
OSMN	74.0	87.6	9.0	72.9	84.0	10.6	8.0
VPN	70.2	82.3	12.4	65.5	69.0	14.4	1.6
SiamMask	71.7	86.8	3.0	67.8	79.8	2.1	55
Ours	69.2	83.4	4.0	65.5	75.2	7.0	45
Note: bold indicates the best result in each column.

Table 4 Comparison of region similarity J, contour accuracy F, and running speed with state-of-the-art methods on the DAVIS-2017 dataset (validation set)

Method	J_M	J_O	J_D	F_M	F_O	F_D	Speed/(frames/s)
OSVOS	56.6	63.8	26.1	63.9	73.8	27.0	0.1
FAVOS	54.6	61.1	14.1	61.8	73.3	18.0	0.8
OnAVOS	61.6	67.4	27.9	69.1	75.4	26.6	0.08
OSMN	52.5	60.9	21.5	57.1	66.1	24.3	8.0
SiamMask	54.3	62.8	19.3	58.5	67.5	20.9	55
Ours	56.0	66.3	17.9	59.0	69.0	19.8	45
Note: bold indicates the best result in each column.

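Tables 3 and 4 show that, while keeping real-time speed, the proposed method comes close to the non-real-time methods in video object segmentation, particularly on DAVIS-2017 (a $J_{\rm{M}}$ of 56.0 versus 56.6 for OSVOS and 61.6 for OnAVOS). This indicates that, with the atrous spatial pyramid pooling module and the interframe mask propagation module, the network segments multi-scale target objects better and is more robust. Qualitative results are shown for car shadow, dog jump, and horse jump in Fig. 5. In addition, according to the results in Section 3.1, the EAO of the proposed method on the VOT-2016 and VOT-2018 datasets is about 0.03 higher than that of SiamMask, which further supports the effectiveness of the method.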

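4 Conclusion

This paper proposes an end-to-end multitask framework for real-time visual object tracking and video object segmentation that fuses multi-scale context information with video interframe information, which strengthens the network's ability to capture multi-scale target objects and improves its robustness. On the VOT-2016 and VOT-2018 tracking datasets, the framework reaches an expected average overlap of 0.462 and 0.408, which is 0.029 and 0.028 higher than SiamMask, respectively; it achieves state-of-the-art results and shows better robustness. Competitive results are also obtained on the DAVIS-2016 and DAVIS-2017 VOS datasets. On the multi-object DAVIS-2017 dataset, the method outperforms SiamMask: the mean region similarity $J_{\rm{M}}$ and the mean contour accuracy $F_{\rm{M}}$ reach 56.0 and 59.0, and the region and contour decay values $J_{\rm{D}}$ and $F_{\rm{D}}$ are 17.9 and 19.8, both lower than those of SiamMask. The method runs at 45 frames/s, which is real time.

However, compared with non-real-time methods, the VOS results of the proposed method still leave room for improvement. Future work will continue to study how to fuse mask information to obtain better results and will explore other approaches to multitask training for VOT and VOS. Possible directions include the following: following RGMP (reference-guided mask propagation) (Oh et al., 2018), concatenating the video frame and the mask into a four-channel image as the input stream and using the information shared between image and mask to improve segmentation; introducing interframe optical flow as auxiliary information to adapt the search region of the search frame, which would make the framework more robust to fast-moving objects; and finding a multitask training method in which VOT and VOS reinforce each other, so that segmentation results can be fed back into the tracking process.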

References

Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional siamese networks for object tracking//Proceedings of European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 850-865[DOI:10.1007/978-3-319-48881-3_56]

Bolme D S, Beveridge J R, Draper B A and Lui Y M. 2010. Visual object tracking using adaptive correlation filters//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2544-2550[DOI:10.1109/CVPR.2010.5539960]

Caelles S, Maninis K K, Pont-Tuset J, Leal-Taixé L, Cremers D and van Gool L. 2017. One-shot video object segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5320-5329[DOI:10.1109/CVPR.2017.565]

Chen L C, Zhu Y K, Papandreou G, Schroff F and Adam H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 833-851[DOI:10.1007/978-3-030-01234-2_49]

Cheng J C, Tsai Y H, Hung W C, Wang S J and Yang M H. 2018. Fast and accurate online video object segmentation via tracking parts//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7415-7424[DOI:10.1109/CVPR.2018.00774]

Danelljan M, Häger G, Khan F S, Felsberg M. 2017. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8): 1561-1575 [DOI:10.1109/TPAMI.2016.2609928]

Guo D Y, Wang J, Cui Y, Wang Z H and Chen S Y. 2020. SiamCAR: siamese fully convolutional classification and regression for visual tracking//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6268-6276[DOI:10.1109/CVPR42600.2020.00630]

Guo Q, Feng W, Zhou C, Huang R, Wan L and Wang S. 2017. Learning dynamic siamese network for visual object tracking//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1781-1789[DOI:10.1109/ICCV.2017.196]

He A F, Luo C, Tian X M and Zeng W J. 2018. A twofold siamese network for real-time object tracking//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4834-4843[DOI:10.1109/CVPR.2018.00508]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778[DOI:10.1109/CVPR.2016.90]

Henriques J F, Caseiro R, Martins P, Batista J. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 583-596 [DOI:10.1109/TPAMI.2014.2345390]

Johnander J, Danelljan M, Brissman E, Khan F S and Felsberg M. 2019. A generative appearance model for end-to-end video object segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8945-8954[DOI:10.1109/CVPR.2019.00916]

Li B, Wu W, Wang Q, Zhang F Y, Xing J L and Yan J J. 2019. SiamRPN++: evolution of siamese visual tracking with very deep networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4277-4286[DOI:10.1109/CVPR.2019.00441]

Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with siamese region proposal network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980[DOI:10.1109/CVPR.2018.00935]

Li F, Yao Y J, Li P H, Zhang D, Zuo W M and Yang M H. 2017. Integrating boundary and center correlation filters for visual tracking with aspect ratio variation//Proceedings of 2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy: IEEE: 2001-2009[DOI:10.1109/ICCVW.2017.234]

Li X, Zha Y F, Zhang T Z, Cui Z, Zuo W M, Hou Z Q, Lu H C, Wang H Z. 2019. Survey of visual object tracking algorithms based on deep learning. Journal of Image and Graphics, 24(12): 2057-2080 (李玺, 查宇飞, 张天柱, 崔振, 左旺孟, 侯志强, 卢湖川, 王菡子. 2019. 深度学习的目标跟踪算法综述. 中国图象图形学报, 24(12): 2057-2080) [DOI:10.11834/jig.190372]

Maninis K K, Caelles S, Chen Y, Pont-Tuset J, Leal-Taixé L, Cremers D, van Gool L. 2019. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6): 1515-1530 [DOI:10.1109/TPAMI.2018.2838670]

Oh S W, Lee J Y, Sunkavalli K and Kim S J. 2018. Fast video object segmentation by reference-guided mask propagation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7376-7385[DOI:10.1109/CVPR.2018.00770]

Perazzi F, Khoreva A, Benenson R, Schiele B and Sorkine-Hornung A. 2017. Learning video object segmentation from static images//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3491-3500[DOI:10.1109/CVPR.2017.372]

Perazzi F, Pont-Tuset J, McWilliams B, van Gool L, Gross M and Sorkine-Hornung A. 2016. A benchmark dataset and evaluation methodology for video object segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 724-732[DOI:10.1109/CVPR.2016.85]

Ren S Q, He K M, Girshick R, Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI:10.1109/TPAMI.2016.2577031]

Smeulders A W M, Chu D M, Cucchiara R, Calderara S, Dehghan A, Shah M. 2014. Visual tracking: an experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7): 1442-1468 [DOI:10.1109/TPAMI.2013.230]

Valmadre J, Bertinetto L, Henriques J, Vedaldi A and Torr P H S. 2017. End-to-end representation learning for correlation filter based tracking//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5000-5008[DOI:10.1109/CVPR.2017.531]

Voigtlaender P and Leibe B. 2017. Online adaptation of convolutional neural networks for video object segmentation[EB/OL].[2020-08-26]. https://arxiv.org/pdf/1706.09364.pdf

Voigtlaender P, Chai Y N, Schroff F, Adam H, Leibe B and Chen L C. 2019. FEELVOS: fast end-to-end embedding learning for video object segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9473-9482[DOI:10.1109/CVPR.2019.00971]

Wang Q, Zhang L, Bertinetto L, Hu W M and Torr P H S. 2019. Fast online object tracking and segmentation: a unifying approach//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1328-1338[DOI:10.1109/CVPR.2019.00142]

Yang L, Wang Y, Xiong X, Yang J and Katsaggelos A K. 2018. Efficient video object segmentation via network modulation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6499-6507[DOI:10.1109/CVPR.2018.00680]

Zhang L, Varadarajan J, Suganthan P N, Ahuja N and Moulin P. 2017. Robust visual tracking using oblique random forests//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5825-5834[DOI:10.1109/CVPR.2017.617]

Zhang Y L, Qian X Y, Zhang M, Ge H J. 2020. Correlation filter target tracking algorithm based on adaptive multifeature fusion. Journal of Image and Graphics, 25(6): 1160-1170 (张艳琳, 钱小燕, 张淼, 葛红娟. 2020. 自适应多特征融合相关滤波目标跟踪. 中国图象图形学报, 25(6): 1160-1170) [DOI:10.11834/jig.190304]

Zhu Z, Wang Q, Li B, Wu W, Yan J J and Hu W M. 2018. Distractor-aware siamese networks for visual object tracking//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 103-119[DOI:10.1007/978-3-030-01240-3_7]