Published: 2019-08-16
DOI: 10.11834/jig.180661
2019 | Volume 24 | Number 8

Image Analysis and Recognition

特征注意金字塔调制网络的视频目标分割

汤润发, 宋慧慧, 张开华, 姜斯浩

南京信息工程大学大气环境与装备技术协同创新中心江苏省大数据分析技术重点实验室, 南京 210044

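Received: 2018-12-14; Revised: 2019-03-11

Funding: National Natural Science Foundation of China (61872189, 61876088); Natural Science Foundation of Jiangsu Province (BK20170040); Six Talent Peaks Project of Jiangsu Province (XYDXX-015, XYDXX-045)

About the first author: Tang Runfa, born in 1995, male, master's student; main research interest: video object segmentation. E-mail: 980453623@qq.com;
Zhang Kaihua, male, professor; main research interests: image segmentation and object tracking. E-mail: zhkhua@gmail.com;
Jiang Sihao, male, master's student; main research interest: video object segmentation. E-mail: 2298580785@qq.com.

CLC number: TP391.4

Document code: A

Article ID: 1006-8961(2019)08-1349-09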

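Abstract

Objective Video object segmentation aims to segment the object of interest throughout a video sequence, given the annotated object mask in the first frame. Because the scale of target objects varies widely, existing video object segmentation algorithms lack an effective strategy for fusing feature information at different scales. This paper therefore proposes a feature attention pyramid modulating network for video object segmentation. Method A visual modulator network and a spatial modulator network first learn the visual and spatial information of the target object, which serves as a prior that guides the segmentation model to adapt to the appearance of the specific object. A feature attention pyramid module then mines global context information to handle the multi-scale nature of the target objects. Result On the DAVIS 2016 dataset, without online fine-tuning, the proposed method achieves results competitive with state-of-the-art methods that use online fine-tuning, reaching a $J$-mean of 78.7%. With online fine-tuning, the method achieves the best result on the DAVIS 2017 dataset, reaching a $J$-mean of 68.8%. Conclusion While segmenting the object of interest, the proposed algorithm effectively combines context information for object masks of different scales, reduces the loss of detail, and achieves high-quality video object segmentation.

Keywords

video object segmentation; fully convolutional network; modulator; spatial pyramid; attention mechanism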

Video object segmentation via feature attention pyramid modulating network

Tang Runfa, Song Huihui, Zhang Kaihua, Jiang Sihao

Jiangsu Key Laboratory of Big Data Analysis Technology, Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China

Supported by: National Natural Science Foundation of China (61872189, 61876088)

Abstract

Objective Video object segmentation aims to separate a target object from the background and other instances on the pixel level. Segmenting objects in videos is a fundamental task in computer vision because of its wide applications, such as video surveillance, video editing, and autonomous driving. Video object segmentation suffers from the challenging factors of occlusion, fast motion, motion blur, and significant appearance variation over time. In this paper, we leverage modulators to learn the limited visual and spatial information of a given target object to adapt the general segmentation network to the appearance of a specific object instance. Existing video object segmentation algorithms lack appropriate strategies to use feature information of different scales due to the multi-scale segmentation objects. Therefore, we design a feature attention pyramid module for video object segmentation. Method To adapt the generic segmentation network to the appearance of a specific object instance in one single feed-forward pass, we employ two modulators, namely, visual modulator and spatial modulator, to learn to adjust the intermediate layers of the generic segmentation network given an arbitrary target object instance. The modulator produces a list of parameters by extracting information from the image of the annotated object and the spatial prior of the object, which are injected into the segmentation model for layer-wise feature manipulation. The visual modulator network is a convolutional neural network (CNN) that takes the annotated visual object image as input and produces a vector of scale parameters for all modulation layers. The visual modulator is used to adapt the segmentation network to focus on a specific object instance, which is the annotated object in the first frame. The visual modulator implicitly learns an embedding of different types of objects. 
It should produce similar parameters to adjust the segmentation network for similar objects, whereas it should produce different parameters for different objects. The spatial modulator network is an efficient network that produces bias parameters based on the spatial prior input. Given that objects move continuously in a video, we set the prior as the predicted location of the object mask in the previous frame. Specifically, we encode the location information as a heatmap with a 2D Gaussian distribution on the image plane. The center and standard deviations of the Gaussian distribution are computed from the predicted mask of the previous frame. The spatial modulator downsamples the heatmap into different scales to match the resolution of different feature maps in the segmentation network and then applies a scale-and-shift operation on each downsampled heatmap to generate the bias parameters of the corresponding modulation layer. The scale problem of the segmentation network can be addressed by multi-scale pooling of the feature map. Fusing features of different scales combines context information from different receptive fields and merges the overall contour with texture details; thus, both large-scale and small-scale objects can be segmented with effective use of context information, reducing the loss of detail as much as possible and achieving high-quality pixel-level video object segmentation. PSPNet and the DeepLab system perform spatial pyramid pooling at different grid scales or dilation rates (the latter called atrous spatial pyramid pooling (ASPP)) to address this problem. In the ASPP module, dilated convolution is a sparse calculation that may cause grid artifacts, while the pyramid pooling module proposed in PSPNet may lose pixel-level localization information. These kinds of structures lack a global context prior to select the features in a channel-wise manner as in SENet and EncNet. 
On the other hand, using a channel-wise attention vector alone is not enough to extract multi-scale features effectively, and pixel-wise information is lacking. Inspired by SENet and ParseNet, we attempt to extract precise pixel-level attention for high-level features extracted from CNNs. Our proposed feature attention pyramid (FAP) module is capable of increasing the receptive fields and classifying small and big objects effectively, thus addressing the problem of multi-scale segmentation. Specifically, the FAP module combines the attention mechanism and the spatial pyramid and achieves context information fusion of different receptive fields by combining the features of different scales with the global context prior. We use 30×30, 15×15, 10×10, and 5×5 pooling in the pyramid structure to better extract context from different pyramid scales. Then, the pyramid structure concatenates the information of different scales, which can incorporate context features precisely. Furthermore, the original features from CNNs are multiplied in a pixel-wise manner by the pyramid attention features after passing through a 1×1 convolution. We also introduce a global pooling branch whose output is added to the output features. The resulting feature map provides improved attention for learning good feature representations, so that context information can be effectively combined when segmenting both large- and small-scale objects. Benefiting from the spatial pyramid structure, the FAP module can fuse context information of different scales and at the same time produce improved pixel-level attention for high-level feature maps. Result We validate the effectiveness and robustness of the proposed method on the challenging DAVIS 2016 and DAVIS 2017 datasets. The proposed method demonstrates competitive results on DAVIS 2016 compared with state-of-the-art methods that use online fine-tuning, and it outperforms these methods on DAVIS 2017. 
Conclusion In this study, we first use two modulator networks to learn the visual and spatial information of the segmentation object mask. The visual modulator produces channel-wise scale parameters to adjust the weights of different channels in the feature maps, while the spatial modulator generates element-wise bias parameters to inject the spatial prior into the modulated features. We use the modulators as a prior guidance to enable the segmentation model to adapt to the appearance of specific objects. In addition to segmentation of objects of interest, the mask for objects of different scales can effectively combine context information to reduce the loss of details, thereby achieving high-quality pixel-level video object segmentation.

Key words

video object segmentation; fully convolutional network; modulator; spatial pyramid; attention mechanism

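0 Introduction

Video object segmentation aims to accurately segment a specific object in all subsequent frames, given the annotated foreground object mask in the first frame, and plays an important role in the analysis and understanding of visual content. It supports better video understanding and benefits tasks such as interactive video editing, autonomous driving, and robot navigation. However, owing to complicating factors such as camera motion, object deformation, occlusion between instances, and dynamic background changes, video object segmentation remains a highly challenging task.

Existing video object segmentation algorithms fall roughly into two classes: unsupervised and semi-supervised methods. Unsupervised methods segment moving objects from the background without any prior knowledge of the target (such as an initial object mask); examples include methods based on probabilistic models^[1-2], motion^[3-4], and object proposals^[5].

Existing unsupervised methods usually rely on visual cues such as superpixels, saliency maps, or optical flow to obtain initial object regions, and they need to process the whole video in batch mode to refine the segmentation. In addition, generating and processing a large number of candidate regions in every frame is time-consuming. Recent unsupervised methods based on convolutional neural networks^[6-7] exploit the rich hierarchical features produced by models pre-trained on datasets such as ImageNet, together with large amounts of augmented data, to achieve highly accurate segmentation. However, because the motion of different instances and of dynamic backgrounds is easily confused, existing unsupervised methods struggle to segment multiple objects accurately.

Unlike unsupervised methods, semi-supervised video object segmentation methods are given an initial object mask from which they determine the key visual cues of the target. They can therefore handle multiple instances and usually outperform unsupervised methods. Semi-supervised methods are commonly divided into propagation-based and detection-based methods.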

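1) Propagation-based semi-supervised methods^[8-11] mainly exploit the temporal coherence of object motion: assuming that the object changes smoothly between the current frame and the next, the mask predicted for the previous frame can be propagated to the next frame as guidance, helping the segmentation model localize the object of interest. Perazzi et al.^[9] proposed MaskTrack, which feeds the previous frame's mask to the network as an additional input channel in order to exploit temporal context. Jampani et al.^[10] proposed the video propagation network, which propagates information across video frames using learned bilateral filtering. Oh et al.^[11] pass the annotated reference frame and the current frame together with the previous frame's mask to a deep network, and localize the object in the current frame from a rough estimate of the previous mask position. These methods rely on spatio-temporal connections between pixels, so they can adapt to complex deformation and movement of the target as long as its appearance and position change smoothly. However, they are vulnerable to temporal discontinuities such as occlusion and fast motion, and they tend to drift once propagation becomes unreliable.

2) Detection-based semi-supervised methods^[12-13] use the appearance of the target object in the given annotated frame and cast video object segmentation as a per-frame, pixel-level object detection task, processing each frame independently without considering temporal consistency. Caelles et al.^[12] proposed one-shot video object segmentation (OSVOS), which can segment objects through occlusions, is not limited by the range of motion, does not need to process frames sequentially, and does not propagate errors over time. Maninis et al.^[13] extended this approach by using an instance segmentation algorithm^[14] to obtain a semantic prior of the target object as guidance. Because these methods rarely depend on temporal consistency, they are robust to occlusion and drift. However, their estimates rely mainly on the object appearance in the annotated frame, so they often fail to adapt to drastic appearance changes and have difficulty separating object instances with similar appearance.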

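These semi-supervised methods usually operate in two stages: the first stage trains a generic fully convolutional network^[15] to segment foreground objects, and the second stage fine-tunes this network online on the first frame of the video so that the model adapts to the specific video sequence. Although these methods achieve high accuracy, the fine-tuning process takes a great deal of time and cannot meet the needs of real-time applications. Some of them also use optical flow, which makes the computation very expensive.

The OSMN framework^[16] segments arbitrary video objects by means of a visual modulator network and a spatial modulator network, and it achieved leading results on both DAVIS 2016 and DAVIS 2017. To further reduce the computational cost of semi-supervised segmentation and to achieve higher accuracy, this paper extends the OSMN framework. As shown in Fig. 1, the proposed model consists of four parts: a segmentation network, a visual modulator, a spatial modulator, and a feature attention pyramid module; the dashed parts of the figure denote upsampling operations. By extracting feature information from the first-frame annotation and from the spatial prior of the object in the previous frame, the modulators generate a list of parameters that modulate the convolutional layers of the segmentation network layer by layer. To cope with object instances of different sizes in the input image, a feature attention pyramid module is proposed to mine feature information at different scales, so that both large and small objects can be segmented. Experiments show that, compared with state-of-the-art video object segmentation algorithms, the proposed method without online fine-tuning is the most efficient while reaching comparable accuracy. With online fine-tuning, it outperforms current state-of-the-art methods on the most challenging DAVIS 2017 dataset.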

Fig. 1 Our model

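The contributions of this paper are as follows:

1) We extend the OSMN architecture with a novel feature attention pyramid module to handle the segmentation of specific object instances of different sizes;

2) On the DAVIS 2016 dataset, the method achieves performance comparable to methods that use online fine-tuning, and on the more challenging DAVIS 2017 dataset it surpasses the other methods;

3) We further show that feature modulation and online fine-tuning are complementary, and that combining the two improves performance further.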

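1 Feature attention pyramid network

Video object segmentation has two important cues: visual appearance and spatial motion. To use information from both the visual and the spatial domain, two modulator networks guide the segmentation network to adapt to the object of interest, based respectively on the annotated object mask of the first frame and on the spatial position of the previous frame's mask. The feature attention pyramid module then extracts features at different scales, enabling accurate segmentation of objects of different sizes.

1.1 Visual modulator

Given the annotated object mask of the first frame, the visual modulator enhances the feature maps related to the object and suppresses those unrelated to it, adjusting the segmentation network to focus on the specific object instance. Specifically, the visual modulator extracts semantic information such as category, color, shape, and texture from the annotated object in the first frame and generates the corresponding channel-wise weights that adapt the segmentation network to the target object. VGG-16 is used as the visual modulator; the last fully connected layer of the network, originally trained for ImageNet classification, is modified so that its output matches the number of parameters in the modulation layers of the segmentation network.

The input of the visual modulator is preprocessed in three steps. First, the annotated object in the first frame is cropped so that foreground pixels dominate, which improves feature learning for small objects. Second, background pixels are set to the mean pixel value, which suppresses the background and highlights the foreground. Finally, the cropped image is resized to a fixed input size of 224×224 pixels, and data augmentation is applied with random scaling of 10% and random rotation of 10°.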
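The preprocessing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the image is single-channel, the crop is the tight bounding box of the mask (the text only says that foreground pixels should dominate), the resize is nearest-neighbour, and augmentation is omitted. The helper names `crop_to_object` and `resize_nearest` are hypothetical and do not come from the paper.

```python
def crop_to_object(image, mask, mean_value):
    """Crop to the mask's bounding box and replace background pixels with mean_value."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    y0, y1, x0, x1 = rows[0], rows[-1], cols[0], cols[-1]
    return [
        [image[y][x] if mask[y][x] else mean_value for x in range(x0, x1 + 1)]
        for y in range(y0, y1 + 1)
    ]


def resize_nearest(image, size=224):
    """Nearest-neighbour resize of a 2D list to size x size (assumed resize method)."""
    h, w = len(image), len(image[0])
    return [[image[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]


# Toy 4x4 image whose pixel value encodes its position.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
crop = crop_to_object(image, mask, mean_value=0)
patch = resize_nearest(crop, size=4)
```

In a real pipeline `mean_value` would be a per-channel mean and `size` would stay at its default of 224.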

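1.2 Spatial modulator

To distinguish multiple instances of the same object class, the spatial modulator takes the previous location of the object in the image as input. Because objects move continuously in a video, the object mask predicted for the previous frame serves as a prior that helps the network roughly localize the object in the current frame. Specifically, the location information of the previous frame is encoded as a heatmap with a 2D Gaussian distribution on the image plane, whose center and standard deviation are computed from the predicted mask of the previous frame. To match the resolutions of the different feature maps in the segmentation network, the spatial modulator downsamples the 2D heatmap to different scales and applies a scale-and-shift operation to generate the bias parameters of the corresponding modulation layer: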

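$\boldsymbol{\beta}_{c}=\alpha_{c} \boldsymbol{m}+\eta_{c}$ (1)

where $\boldsymbol{m}$ is the Gaussian heatmap downsampled to the resolution of the corresponding modulation layer, and $\alpha_{c}$ and $\eta_{c}$ are the scaling and shifting parameters of the $c$-th channel, respectively.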
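As a rough illustration of Eq. (1), the sketch below builds a Gaussian heatmap from a binary mask and applies the per-channel scale and shift. It assumes an axis-aligned Gaussian with separate standard deviations along the two image axes and leaves out the downsampling to each layer's resolution. The function names and the small `eps` guard are illustrative choices, not details taken from the paper.

```python
import math


def mask_statistics(mask):
    """Center and per-axis standard deviation of the foreground pixels of a binary mask."""
    pts = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no foreground pixels")
    n = len(pts)
    cy = sum(y for y, _ in pts) / n
    cx = sum(x for _, x in pts) / n
    sy = math.sqrt(sum((y - cy) ** 2 for y, _ in pts) / n)
    sx = math.sqrt(sum((x - cx) ** 2 for _, x in pts) / n)
    return cy, cx, sy, sx


def gaussian_heatmap(height, width, cy, cx, sy, sx, eps=1e-6):
    """2D Gaussian heatmap m with peak value 1 at (cy, cx); eps avoids division by zero."""
    return [
        [math.exp(-0.5 * (((y - cy) / (sy + eps)) ** 2 + ((x - cx) / (sx + eps)) ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def spatial_bias(heatmap, alpha_c, eta_c):
    """Eq. (1): beta_c = alpha_c * m + eta_c, applied element-wise."""
    return [[alpha_c * v + eta_c for v in row] for row in heatmap]


# Previous-frame mask: a 3x3 object centered in a 5x5 frame.
prev_mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)] for y in range(5)]
cy, cx, sy, sx = mask_statistics(prev_mask)
m = gaussian_heatmap(5, 5, cy, cx, sy, sx)
beta = spatial_bias(m, alpha_c=2.0, eta_c=0.5)
```

The bias is largest at the estimated object center and falls off smoothly, which is how the coarse location prior enters the modulated features.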

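Eq. (1) has the same form as the scale-and-shift step of batch normalization (BN), so the bias parameters $\boldsymbol{\beta}_{c}$ can be generated by a 1×1 convolution followed by a batch normalization layer. As in batch normalization, the scaling and shifting parameters here play a normalizing role, accelerating convergence and helping to prevent vanishing or exploding gradients. The bottom of Fig. 1 shows the structure of the spatial modulator. Like MaskTrack^[9], the spatial modulator draws on the previous frame's mask, but it uses only very coarse location information and discards the rest. This is still enough to help the segmentation network infer the object mask from the RGB image, while preventing the model from relying too heavily on the previous mask and from drifting when the object moves substantially.

To prepare the 2D heatmap as input to the spatial modulator, the mean and standard deviation of the previous frame's mask are computed first, and the mask is then augmented with a random shift of 20% and a random scaling of 40%.

1.3 Segmentation network

The segmentation network is a fully convolutional network based on VGG-16. New modulation layers are defined after its convolutional layers (in practice only those of the last three VGG-16 stages, as discussed below), and their parameters are produced by the visual and spatial modulators, which are trained jointly with the segmentation network. The visual modulator produces channel-wise scale parameters that adjust the weights of the different channels in the feature maps, while the spatial modulator generates element-wise bias parameters that inject the spatial prior into the modulated features. Specifically, a modulation layer computes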

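$\boldsymbol{y}_{c}=\gamma_{c} \boldsymbol{x}_{c}+\boldsymbol{\beta}_{c}$ (2)

where $\gamma_{c}$ and $\boldsymbol{\beta}_{c}$ are the modulation parameters of the $c$-th channel from the visual modulator and the spatial modulator, respectively, $\boldsymbol{x}_{c}$ is the input feature map of channel $c$, and $\boldsymbol{y}_{c}$ is the modulated feature map. $\gamma_{c}$ is a scalar used for channel weighting, and $\boldsymbol{\beta}_{c}$ is a 2D matrix of element-wise bias values.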
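A minimal sketch of the modulation layer in Eq. (2), written with nested lists so that it runs without any dependencies. The function name `modulate` and the channel-first list layout are assumptions made for this illustration.

```python
def modulate(features, gammas, betas):
    """Eq. (2): y_c = gamma_c * x_c + beta_c for every channel c.

    features: list of C feature maps, each an H x W list of floats
    gammas:   list of C scalars (from the visual modulator)
    betas:    list of C bias maps, each H x W (from the spatial modulator)
    """
    return [
        [[gamma * x + b for x, b in zip(x_row, b_row)]
         for x_row, b_row in zip(x_map, b_map)]
        for x_map, gamma, b_map in zip(features, gammas, betas)
    ]


features = [[[1.0, 2.0], [3.0, 4.0]],      # channel 0
            [[1.0, 1.0], [1.0, 1.0]]]      # channel 1
gammas = [2.0, 0.0]                         # channel 1 is fully suppressed
betas = [[[0.0, 0.0], [0.0, 0.0]],
         [[0.5, 0.5], [0.5, 0.5]]]
modulated = modulate(features, gammas, betas)
```

With every scale equal to 1 and every bias equal to 0 the layer reduces to the identity, so the segmentation network behaves as if it were unmodulated.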

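Intuitively, a modulation layer should be added after every convolutional layer of the fully convolutional network. In experiments, however, adding modulation layers to the shallower layers degraded performance, possibly because the low-level features extracted there are very sensitive to the scale-and-shift operations introduced by the modulators. Modulation is therefore applied after all convolutional layers in the last three stages of VGG-16.

A side-output layer is attached to the last convolutional layer (after visual and spatial modulation) of each of the last three VGG-16 stages, as indicated by the dashed parts of Fig. 1, and its features are upsampled to a feature map with the same resolution as the input image. The feature maps of the different stages are then fused and passed through a 1×1 convolution to obtain the final segmentation result. This fuses high-level semantic information with low-level spatial information while learning features at different scales.

1.4 Feature attention pyramid module

Segmentation models usually adopt spatial pyramid pooling^[17-19] to capture context at multiple ranges. ParseNet^[17] uses image-level features to obtain global context. DeepLabv2^[18] proposed atrous spatial pyramid pooling, which captures multi-scale features with parallel atrous convolution layers at different rates. Other approaches aggregate context with long short-term memory (LSTM)^[19].

Inspired by attention mechanisms^[20-21], we construct precise pixel-level attention from the high-level features extracted by a convolutional neural network. A pyramid structure can extract features at different scales and effectively enlarge the receptive field, but it lacks a global context prior to guide channel-wise feature selection. On the other hand, an attention mechanism like that of SENet^[20] relies on a channel-wise attention vector, which cannot extract multi-scale features effectively and lacks pixel-wise spatial information.

Based on these observations, we propose the feature attention pyramid module shown in Fig. 2.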

Fig. 2 Feature attention pyramid module

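The module has three branches, which are combined in four steps.

1) Branch 1 (black dashed box) learns global context through global average pooling, followed by a 1×1 convolution with 512 channels and upsampling to the resolution of the original feature map, which yields global spatial information for every pixel.

2) To better extract context from different pyramid scales, branch 3 (red dashed box) uses pooling kernels of different sizes to mine features at different scales, each forming one pyramid level, and applies a 1×1 convolution after each level. Because the original feature map has 512 channels, each pyramid level is reduced to 128 channels. The pooling kernel sizes are set to 30×30, 15×15, 10×10, and 5×5.

3) Each pyramid-level feature is upsampled back to the resolution of the original feature map, and the levels are aggregated by a fusion operation followed by a 3×3 convolution.

4) The original features from the convolutional neural network (CNN) pass through branch 2 (a 1×1 convolution) and are multiplied pixel-wise by the fused multi-scale pyramid features from branch 3. The global context prior learned by branch 1 is then added pixel-wise as guidance, giving the final feature representation.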
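The data flow of the three branches can be illustrated on a single-channel feature map. The sketch below is a deliberate simplification and not the authors' implementation: every learned convolution (the 1×1 and 3×3 layers) is replaced by the identity, the fusion of pyramid levels is taken to be a plain average, pooling is non-overlapping, and upsampling is nearest-neighbour. All function names are hypothetical.

```python
def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling; H and W must be divisible by k."""
    h, w = len(fmap), len(fmap[0])
    if h % k or w % k:
        raise ValueError("feature map size must be divisible by the kernel size")
    return [
        [sum(fmap[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
         for j in range(0, w, k)]
        for i in range(0, h, k)
    ]


def upsample_nearest(fmap, h, w):
    """Nearest-neighbour upsampling of a 2D list to h x w."""
    ph, pw = len(fmap), len(fmap[0])
    return [[fmap[i * ph // h][j * pw // w] for j in range(w)] for i in range(h)]


def fap_single_channel(fmap, kernels):
    """Simplified feature attention pyramid on one channel (convolutions omitted)."""
    h, w = len(fmap), len(fmap[0])
    # Branch 1: global average pooling, broadcast back to every pixel.
    global_context = sum(map(sum, fmap)) / (h * w)
    # Branch 3: one pyramid level per kernel size, upsampled and averaged.
    levels = [upsample_nearest(avg_pool(fmap, k), h, w) for k in kernels]
    attention = [[sum(level[i][j] for level in levels) / len(levels)
                  for j in range(w)] for i in range(h)]
    # Branch 2 (identity here) times pyramid attention, plus global context.
    return [[fmap[i][j] * attention[i][j] + global_context for j in range(w)]
            for i in range(h)]


fmap = [[1.0, 1.0, 3.0, 3.0] for _ in range(4)]
out = fap_single_channel(fmap, kernels=(2, 4))
```

On a 30×30 feature map the kernel sizes named in the text would be passed as `kernels=(30, 15, 10, 5)`; the 4×4 toy input above uses `(2, 4)` only to keep the arithmetic easy to follow.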

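2 Experimental settings and results analysis

2.1 Training details

The network is trained end to end. Both the visual modulator and the segmentation network are initialized with a VGG-16 model pre-trained on the ImageNet classification task. Because the importance of each channel in each convolutional layer of the segmentation network is unknown at the start, the modulation parameters $\left\{\gamma_{c}\right\}$ are initialized to 1; the simplest way to do this is to set the weights and biases of the last fully connected layer of the visual modulator to 0 and 1, respectively. Because the rough spatial location in the current frame is likewise unknown, the weights of the spatial modulator are initialized randomly and learned adaptively during training. A balanced cross-entropy loss is used with the Adam optimizer, with momentum parameters $\beta_{1}$=0.9 and $\beta_{2}$=0.999. To better learn generic object features, the network is trained on two datasets. 1) It is first pre-trained on MS-COCO, the largest public semantic segmentation dataset (about 110 000 images), with a batch size of 8, for 200 000 iterations at a learning rate of 10^-5 and then 100 000 iterations at a learning rate of 10^-6. 2) To model the appearance changes of moving objects in video, the model is then trained on a video segmentation dataset (DAVIS 2016 or DAVIS 2017) with a batch size of 4 for 50 000 iterations at a learning rate of 10^-6. Experiments show that training on the two datasets is complementary.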
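The text names the loss only as a balanced cross-entropy, so the sketch below assumes the common class-balanced form in which foreground pixels are weighted by the fraction of background pixels and vice versa; the exact weighting used by the authors may differ. The `learning_rate` helper simply encodes the two-stage schedule stated above, and its stage names `"coco"` and `"davis"` are illustrative labels.

```python
import math


def balanced_cross_entropy(probs, labels, eps=1e-7):
    """Class-balanced binary cross-entropy over flat lists of pixels (assumed form).

    probs:  predicted foreground probabilities in [0, 1]
    labels: ground-truth labels, 1 for foreground and 0 for background
    """
    n = len(labels)
    n_pos = sum(labels)
    w_pos = (n - n_pos) / n   # foreground weight = fraction of background pixels
    w_neg = n_pos / n         # background weight = fraction of foreground pixels
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= w_pos * math.log(p) if y else w_neg * math.log(1.0 - p)
    return loss


def learning_rate(stage, iteration):
    """Learning-rate schedule of the two training stages described in the text."""
    if stage == "coco":    # 200 000 iterations at 1e-5, then 100 000 at 1e-6
        return 1e-5 if iteration < 200_000 else 1e-6
    if stage == "davis":   # 50 000 iterations at 1e-6
        return 1e-6
    raise ValueError(f"unknown stage: {stage}")


example_loss = balanced_cross_entropy([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0])
```

The balancing keeps a small object from being overwhelmed by the much larger number of background pixels.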

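2.2 Single-object video segmentation

Single-object segmentation is evaluated on the DAVIS 2016 dataset, which consists of 50 video sequences, 30 for training and 20 for testing, with 3 455 frames annotated at the pixel level; each sequence contains exactly one annotated object.

The main benchmark metrics of DAVIS 2016 are region similarity and contour accuracy. Region similarity $J$ is the intersection over union (IoU) of the predicted mask $\boldsymbol{M}$ and the ground truth $\boldsymbol{G}$: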

$J=\frac{|\boldsymbol{M} \cap \boldsymbol{G}|}{|\boldsymbol{M} \cup \boldsymbol{G}|}$
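For binary masks stored as nested lists, the region similarity can be computed as in the sketch below. Returning 1 when both masks are empty is a convention assumed here for the degenerate case, and the function name is illustrative.

```python
def region_similarity(pred, gt):
    """J = |M ∩ G| / |M ∪ G| for two binary masks of the same size."""
    pairs = [(p, g) for p_row, g_row in zip(pred, gt) for p, g in zip(p_row, g_row)]
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    if union == 0:      # both masks empty: treated as a perfect match (assumption)
        return 1.0
    return intersection / union


pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [1, 0]]
j = region_similarity(pred, gt)   # one shared pixel out of three covered pixels
```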

Contour accuracy $F$ treats the mask as a set of closed contours and computes the contour-based $F$-measure, a function of precision $P_{c}$ and recall $R_{c}$:

$F=\frac{2 P_{c} R_{c}}{P_{c}+R_{c}}$

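For each measure, DAVIS 2016 reports three statistics: mean, recall, and decay. $J$-mean is the region similarity averaged over all video sequences and $F$-mean is the contour accuracy averaged over all video sequences; these two are usually taken as the main evaluation criteria. $J$-recall is the fraction of frames whose $J$ exceeds the threshold 0.5, and $F$-recall is the fraction of frames whose $F$ exceeds 0.5. Decay measures how performance degrades over time: for example, with a video sequence divided into four equal parts, $J$-decay is the difference between the mean region similarity of the first part and that of the last part.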
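The three statistics can be illustrated for the per-frame scores of one sequence. The sketch follows the description in the text (threshold 0.5, four equal parts); the official benchmark code may split frames slightly differently, and averaging over sequences is left out. The function name `summarize` is hypothetical.

```python
def summarize(scores, threshold=0.5, parts=4):
    """Mean, recall and decay of a list of per-frame scores (J or F)."""
    n = len(scores)
    if n < parts:
        raise ValueError("need at least one frame per part")
    size = n // parts
    first, last = scores[:size], scores[-size:]
    return {
        "mean": sum(scores) / n,
        "recall": sum(1 for s in scores if s > threshold) / n,
        "decay": sum(first) / size - sum(last) / size,
    }


# Scores that degrade steadily over an 8-frame sequence.
stats = summarize([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
```

A positive decay means the later frames are segmented worse than the early ones, which is why lower values are better.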

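For a fair comparison, the proposed method is compared with the results published on the DAVIS Challenge website (https://davischallenge.org/index.html); the evaluation results are listed in Table 1.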

Table 1 Evaluation results of different methods on DAVIS 2016 dataset

Method	Online fine-tuning	$J$-mean	$J$-recall	$J$-decay↓	$F$-mean	$F$-recall	$F$-decay↓	$T$↓	Speed/(s/frame)
VPN^[10]	No	70.2	82.3	12.4	65.5	69.0	14.4	32.4	0.63
OSMN^[16]	No	74.0	87.6	9.0	72.9	84.0	10.6	20.9	0.14
OnAVOS-	No	73.6	—	—	—	—	—	—	3.55
Ours	No	78.7	90.1	5.7	78.4	89.7	7.8	7.8	0.117
OSVOS^[12]	Yes	79.8	93.6	14.9	80.6	92.6	15.0	37.8	—
MaskTrack^[9]	Yes	79.7	93.1	8.9	75.4	87.1	9.0	21.8	—
OnAVOS	Yes	86.1	96.1	5.2	84.9	89.7	5.8	19	—
Ours	Yes	81.7	93.1	11.0	84.4	96.0	14.8	14.8
Note: Bold indicates the best result and italics the second best; "↓" means lower is better; $T$ denotes temporal stability; a "-" suffix on a method name means that online fine-tuning is not used.

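Table 1 shows that, without online fine-tuning, the proposed method performs best on DAVIS 2016. In terms of $J$-mean, it gains 4.7 percentage points over the baseline OSMN framework (78.7% versus 74.0%), which is attributable to the feature attention pyramid module. Among the methods that use online fine-tuning, OSVOS post-processes the mask of each frame with a boundary detector, which contributes a gain of about 2%, and MaskTrack uses a conditional random field (CRF) as post-processing, which also improves its score. The proposed method reaches 78.7% without online fine-tuning or post-processing, which is comparable to OSVOS and MaskTrack. VPN and OSMN focus on speed, and fine-tuning would slow them down considerably, so their published results are obtained without fine-tuning; because no fine-tuned results are available for them, they are not included in the fine-tuned comparison. OnAVOS^[22] obtains the best results on DAVIS 2016 by using a very elaborate online fine-tuning strategy together with CRF post-processing, but with these two components removed (OnAVOS-) it reaches only 73.6%; without fine-tuning, the proposed method is both faster and more accurate than OnAVOS-.

2.3 Multi-object video segmentation

To further validate the model, additional experiments are carried out on the DAVIS 2017 dataset. DAVIS 2017 is the largest video segmentation dataset to date, containing 150 videos with 10 459 annotated frames and 376 object instances. As with DAVIS 2016, region similarity and contour accuracy are used as evaluation metrics, and their average is added as a global metric ($J$ & $F$-mean). The evaluation results are listed in Table 2.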

Table 2 Evaluation results of different methods on DAVIS 2017 dataset

Method	Online fine-tuning	$J$ & $F$-mean	$J$-mean	$J$-recall	$J$-decay↓	$F$-mean	$F$-recall	$F$-decay↓
OSMN^[16]	No	54.8	52.5	60.9	21.5	57.1	66.1	24.3
FAVOS^[23]	No	58.2	54.6	61.1	14.1	61.8	72.3	18.0
Ours	No	59.1	55.9	62.0	16.2	62.2	70.0	16.7
MaskTrack^[9]	Yes	54.3	51.2	59.7	28.3	57.3	65.5	29.1
OSVOS^[12]	Yes	60.3	56.6	63.8	26.1	63.9	73.8	27.0
OSMN+	Yes	—	60.8	—	—	—	—	—
OnAVOS	Yes	65.4	61.6	67.4	27.9	69.1	75.4	26.6
Ours	Yes	72.3	68.8	78.0	22.4	76.1	85.3	26.8
Note: Bold indicates the best result and italics the second best; "↓" means lower is better; a "+" suffix on a method name means that online fine-tuning is used.

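Table 2 compares the proposed method with existing state-of-the-art methods such as FAVOS^[23] and MaskTrack. The proposed algorithm achieves the best performance both without and with online fine-tuning. Although OnAVOS gives the best results on DAVIS 2016, its $J$-mean on the more challenging DAVIS 2017 dataset is only 61.6%, whereas the proposed method with online fine-tuning reaches a $J$-mean of 68.8%, a gain of about 7 percentage points, which demonstrates its advantage in handling instances of different scales.

Fig. 3 shows two challenging videos from DAVIS 2017: in the first, objects of different scales occlude one another, and in the second, similar objects occlude one another. The results on the first video show that the method based on the feature attention pyramid module handles objects of different scales effectively and produces very good results. In the second video, the algorithm mis-segments the yellow and green instances in frame 34, which indicates that the model still has room for improvement. To address this, future work will introduce a position-sensitive embedding to better distinguish the pixels of similar objects.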

Fig. 3 Video object segmentation results on the DAVIS 2017 dataset

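3 Conclusion

This paper uses the annotated object mask of the first frame and the spatial location information of the previous frame to build a visual modulator and a spatial modulator, which learn to adjust the intermediate layers of a generic segmentation network for an arbitrary target object instance so that the network focuses on the object of interest. To better handle objects of different scales, a novel feature attention pyramid module is proposed that learns to fuse features at different scales and, inspired by attention mechanisms, uses a global context prior to learn better feature representations, so that context information is combined effectively for both large and small objects. Compared with other segmentation algorithms, the proposed method obtains competitive results on both DAVIS 2016 and DAVIS 2017, and it performs best on the problem of instances at different scales. The method still has room for improvement: future work will introduce a position-sensitive embedding into the segmentation network to learn more powerful feature representations and thereby handle similar object instances that occlude one another.

References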

[1] Koh Y J, Kim C S. Primary object segmentation in videos based on region augmentation and reduction[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 7417-7425.[DOI: 10.1109/CVPR.2017.784]

[2] Wang S, Wu Q. Moving object segmentation algorithm based on Markov random field model[J]. Transducer and Microsystem Technologies, 2016, 35(7): 113–115, 119. [王闪, 吴秦. 基于马尔可夫随机场模型的运动对象分割算法[J]. 传感器与微系统, 2016, 35(7): 113–115, 119. ] [DOI:10.13873/J.1000-9787(2016)07-0113-03]

[3] Jang W D, Lee C, Kim C S. Primary object segmentation in videos via alternate convex optimization of foreground and background distributions[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 696-704.[DOI: 10.1109/CVPR.2016.82]

[4] Li X J, Zhang K H, Song H H. Unsupervised video segmentation by fusing multiple spatio-temporal feature representations[J]. Journal of Computer Applications, 2017, 37(11): 3134–3138, 3151. [李雪君, 张开华, 宋慧慧. 融合时空多特征表示的无监督视频分割算法[J]. 计算机应用, 2017, 37(11): 3134–3138, 3151. ] [DOI:10.11772/j.issn.1001-9081.2017.11.3134]

[5] Zhang D, Javed O, Shah M. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 628-635.[DOI: 10.1109/CVPR.2013.87]

[6] Jain S D, Xiong B, Grauman K. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 2117-2126.[DOI: 10.1109/CVPR.2017.228]

[7] Tokmakov P, Alahari K, Schmid C. Learning video object segmentation with visual memory[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. 2017: 4481-4490.

[8] Deng Z X, Hong H, Jin Y, et al. Research and improvement on video target segmentation algorithm based on spatio-temporal dual-stream full convolutional network[J]. Industrial Control Computer, 2018, 31(8): 113–114, 129. [邓志新, 洪泓, 金一, 等. 基于时空双流全卷积网络的视频目标分割算法研究及改进[J]. 工业控制计算机, 2018, 31(8): 113–114, 129. ] [DOI:10.3969/j.issn.1001-182X.2018.08.050]

[9] Perazzi F, Khoreva A, Benenson R, et al. Learning video object segmentation from static images[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 3491-3500.[DOI: 10.1109/CVPR.2017.372]

[10] Jampani V, Gadde R, Gehler P V. Video propagation networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 3154-3164.[DOI: 10.1109/CVPR.2017.336]

[11] Oh S W, Lee J Y, Sunkavalli K, et al. Fast video object segmentation by reference-guided mask propagation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7376-7385.[DOI: 10.1109/CVPR.2018.00770]

[12] Caelles S, Maninis K K, Pont-Tuset J, et al. One-shot video object segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 5320-5329.[DOI: 10.1109/CVPR.2017.565]

[13] Maninis K K, Caelles S, Chen Y H, et al. Video object segmentation without temporal information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [DOI:10.1109/TPAMI.2018.2838670]

[14] He K M, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 2980-2988.

[15] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3431-3440.

[16] Yang L, Wang Y, Xiong X, et al. Efficient video object segmentation via network modulation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 6499-6507.

[17] Liu W, Rabinovich A, Berg A C. ParseNet: looking wider to see better[EB/OL].[2018-12-14]. https://arxiv.org/pdf/1506.04579.pdf.

[18] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848. [DOI:10.1109/TPAMI.2017.2699184]

[19] Yan Z C, Zhang H, Jia Y Q, et al. Combining the best of convolutional layers and recurrent layers: a hybrid network for semantic segmentation[EB/OL].[2018-12-14]. https://arxiv.org/pdf/1603.04871.pdf.

[20] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7132-7141.

[21] Zhang H, Dana K, Shi J P, et al. Context encoding for semantic segmentation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7151-7160.[DOI: 10.1109/CVPR.2018.00747]

[22] Voigtlaender P, Leibe B. Online adaptation of convolutional neural networks for video object segmentation[EB/OL].[2018-12-14]. https://arxiv.org/pdf/1706.09364.pdf.

[23] Cheng J, Tsai Y H, Hung W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7415-7424.