Published: 2021-07-16
DOI: 10.11834/jig.200521
2021 | Volume 26 | Number 7

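Image Understanding and Computer Vision

Video instance segmentation based on temporal feature fusion

Huang Zetao¹, Liu Yang¹, Yu Chenglong², Zhang Jiajia¹, Wang Xuan¹,³, Qi Shuhan¹,³

1. Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China;

2. Shenzhen Institute of Information Technology, Shenzhen 518172, China;

3. Peng Cheng Laboratory, Shenzhen 518052, China

Received: 2020-08-27; Revised: 2021-01-25; Preprint: 2021-02-01

Supported by: National Natural Science Foundation of China (61902093); Natural Science Foundation of Guangdong Province (2020A1515010652)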

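About the authors: Huang Zetao, born in 1993, male, master's student. His research interests include computer vision, object detection, and image segmentation. E-mail: 18S151530@stu.hit.edu.cn
Liu Yang, male, assistant professor. His research interests include machine learning and artificial intelligence. E-mail: liu.yang@hit.edu.cn
Yu Chenglong, male, associate professor. His research interest is artificial intelligence. E-mail: yucl@sziit.edu.cn
Zhang Jiajia, male, associate researcher. His research interest is artificial intelligence. E-mail: zhangjiajia@hit.edu.cn
Wang Xuan, male, professor. His research interests include network multimedia and artificial intelligence. E-mail: wangxuan@cs.hitsz.edu
Qi Shuhan, corresponding author, male, assistant professor. His research interests include computer vision, multimedia information retrieval, and computer game playing. E-mail: shuhanqi@cs.hitsz.edu.cn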

CLC number: TP391

Document code: A

Article ID: 1006-8961(2021)07-1692-12

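Abstract

Objective With the rapid development of the mobile internet and artificial intelligence, massive amounts of video data are being generated, and processing and analyzing them is a challenging problem. Because of shooting angles, fast motion, and partial occlusion, objects in video often appear blurred and highly varied, and frame quality falls well short of that of common image datasets, which makes instance segmentation on video difficult. Most current video instance segmentation frameworks apply an image detection method directly to single frames and form the mask sequence of each object by association matching. They lack specific handling of difficult video scenes and ignore temporal information. Method We design a multi-task learning video instance segmentation model based on temporal feature fusion. To cope with the poor quality of ordinary video frames, the model combines a feature pyramid with scaled dot-product attention: the features of objects detected in other frames are weighted and aggregated into the current image features along the time axis, which strengthens the feature response of candidate objects and suppresses background information, and multi-scale fusion then enriches the spatial semantic information of the image. In addition, a point prediction network is added to the segmentation module to improve segmentation accuracy, and multi-task learning enables end-to-end simultaneous detection, segmentation, and association tracking of video objects. Result Experiments on the YouTube-VIS validation set show that the proposed method improves the mean average precision of video instance segmentation by about 2 percentage points over existing methods. Comparative experiments confirm that the proposed temporal feature fusion module improves video segmentation. Conclusion To address the neglect of temporal context and the lack of handling for difficult video scenes in current work, we propose a multi-task learning video instance segmentation model that fuses temporal features and improves the segmentation of objects in video.

Key words

computer vision; instance segmentation; video instance segmentation; scaled dot-product attention; multi-scale fusion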

Video instance segmentation based on temporal feature fusion

Huang Zetao¹, Liu Yang¹, Yu Chenglong², Zhang Jiajia¹, Wang Xuan¹,³, Qi Shuhan¹,³

1. Harbin Institute of Technology, Shenzhen 518055, China;

2. Shenzhen Institute of Information Technology, Shenzhen 518172, China;

3. Peng Cheng Laboratory, Shenzhen 518052, China

Supported by: National Natural Science Foundation of China(61902093); Natural Science Foundation of Guangdong Province, China(2020A1515010652)

Abstract

Objective With the rapid development of the mobile internet and artificial intelligence, video applications have become part of daily life, and large volumes of video data are generated every day. Beyond its sheer volume and storage cost, video content is itself complex and often contains many characters, actions, and scenes, so video understanding is more challenging than common image understanding. How to process and analyze such data remains a challenging problem for many researchers. Because of shooting angles and fast motion, objects in video often appear blurred and highly varied, and the quality of video frames lags well behind that of common image datasets. Video instance segmentation extends instance segmentation to video and comprises detecting, segmenting, and tracking object instances. A method must not only assign the pixels of each frame to semantic categories and object instances but also associate the instances across the entire video sequence. Defocus, motion blur, and partial occlusion make the task difficult and degrade performance. Existing video instance segmentation algorithms mainly apply image instance segmentation to predict the target mask in every frame and then use tracking algorithms to associate the detections into mask sequences along the video. However, these algorithms rely on the initial per-frame detection quality and ignore temporal context, so information is not effectively exchanged between frames, and classification and segmentation are unsatisfactory in difficult video scenes.
Method To address this problem, we design a multi-task learning model for video instance segmentation based on temporal feature fusion. We combine the feature pyramid network (FPN) with scaled dot-product attention in the temporal domain. FPN is a feature extractor built on the concept of a feature pyramid that aims to improve both accuracy and speed. It replaces the feature extractor in detectors such as Faster region convolutional neural network (R-CNN) and generates a higher-quality pyramid of feature maps. FPN fuses features along two pathways: the bottom-up pathway is the forward pass of the network, whereas the top-down pathway upsamples the top-level features and adds them element-wise to the corresponding features of the preceding layer. Scaled dot-product attention is the basic component of the multi-head attention module in the Transformer, a popular encoder-decoder attention network for machine translation. In the temporal feature fusion module, the object features detected in other frames are weighted and aggregated into the features of the current image, which strengthens the feature response of candidate objects and suppresses background information. The spatial semantic information of the image is then enriched by fusing the multi-scale features of the current frame. In this way, the model captures fine-grained correlations between other frames and the current frame and selectively aggregates the important features of other frames to enhance the current representation. In addition, a point prediction network added to the segmentation module improves segmentation precision over a plain fully convolutional segmentation head. Objects are then detected, segmented, and tracked simultaneously by our end-to-end multi-task learning framework.
Result Experiments on the YouTube-VIS dataset show that our method improves the mean average precision of video instance segmentation by about 2 percentage points over current methods. We also conduct a series of ablation experiments. First, we swap the segmentation head and compare a fully convolutional network with the point prediction segmentation network in the two-stage model. Second, because the temporal feature fusion module selects region proposal network (RPN) candidate objects from the auxiliary frame for information fusion during training, we compare different numbers of RPN objects. The best result, 32.7% AP, is obtained with 10 RPN objects. These results show that the proposed temporal feature fusion module improves video segmentation.
Conclusion We propose a two-stage video instance segmentation model with a temporal feature fusion module. In the first stage, the ResNet backbone extracts features from the input image, and the temporal feature fusion module extracts multi-scale features through the feature pyramid network and aggregates the object features detected in other frames to enhance the feature response of the current frame. The region proposal network then extracts multiple candidate objects from the image. In the second stage, the features of the proposals are fed into three parallel network heads. The detection head predicts the class and position of each object in the current image, the segmentation head predicts its instance mask, and the association head maintains object identity over time by matching the most similar instance in the instance storage space. In summary, our model combines the feature pyramid network and scaled dot-product attention for temporal feature fusion in video, which improves the accuracy of video segmentation.

Key words

computer vision; instance segmentation; video instance segmentation; scaled dot-product attention; multi-scale fusion

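0 Introduction

With the arrival of the "Internet+" and "AI+" era, traditional modes of information exchange and information carriers have changed greatly. People are no longer satisfied with sharing text, speech, and images, and information-rich video data has become part of daily life. On the one hand, video websites, live-streaming platforms, and short-video applications are increasingly popular, and video has become an important part of online multimedia consumption. On the other hand, video is the primary research medium in high-technology fields such as intelligent surveillance, autonomous driving, and virtual reality. This massive volume of video data places higher demands on video vision technologies such as video understanding and video recommendation, and it also opens up more application scenarios. How to process and analyze such data is a challenging problem for many researchers and an important topic in computer vision.

Video instance segmentation is an emerging task that extends image instance segmentation. Instance segmentation on a video sequence must detect and segment objects in every frame and also associate object instances across frames to obtain the complete mask sequence of each instance. In essence, pixels that belong to the same object in consecutive frames are given the same label, based on the object information in the video (Yang et al., 2019). Because video frames suffer from motion blur, defocus, partial occlusion, and appearance changes (Zhu et al., 2017a, b), and because several vision tasks are involved, building a system that segments well and associates instances correctly remains a major challenge.

Existing work on video instance segmentation falls into two classes of algorithms: mask propagation and tracking-by-detection.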

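Mask-propagation algorithms first obtain the object categories and masks of the first frame with an image instance segmentation method and then, guided by the first-frame masks, apply a video object segmentation algorithm to propagate the masks to subsequent frames (Jiang et al., 2019). For example, Yang et al. (2018) proposed efficient video object segmentation via network modulation, in which meta-learning-based modulators generate the parameters the segmentation model needs to adapt to a specific object instance, avoiding online fine-tuning on the first frame. Mask propagation maintains a continuous association of each object, but it depends heavily on first-frame detection: when detection in the first frame is poor, subsequent propagation is easily misled.

Tracking-by-detection algorithms apply an image detection method to each frame independently and then associate the per-frame results with a tracking method. For example, Bochinski et al. (2017) proposed IoUTracker, which matches detections by the intersection over union (IoU) between frames. It needs only the detection confidence and the IoU between adjacent frames, which makes tracking simple and fast, but no information is exchanged between frames. Yang et al. (2019) proposed MaskTrack R-CNN (region convolutional neural network), which adds a tracking branch to Mask R-CNN (He et al., 2017) to associate object instances across frames. Its feature extraction, detection, and segmentation are performed independently on each frame, without using information from other frames. STEm-Seg (spatio-temporal embeddings for instance segmentation), designed by Athar et al. (2020), takes a whole video clip as input. Its encoder extracts frame features with 2D convolutions and a feature pyramid network (FPN) (Lin et al., 2017) and stacks them along the time dimension, and its decoder processes the resulting spatio-temporal feature embeddings with 3D convolutions. The structure is simple and direct, but because the dataset is limited and 3D convolutions are hard to train, its results are moderate. Tracking-by-detection algorithms rely on image detection methods for detection and segmentation in all frames and make use of the per-frame detection results. However, some of these methods ignore temporal context, while others handle it directly with complex 3D convolutional networks that are hard to train. As a result, segmentation tends to degrade under the defocus, blur, and occlusion that are common in video.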

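To address these problems, we design a multi-task learning framework for video instance segmentation based on temporal feature fusion. It captures the subtle correlations between other frames and the current frame, strengthens the representation of candidate objects, suppresses background information, and makes better use of temporal information in video. The main contributions are as follows:

1) For the new task of video instance segmentation, we propose a multi-task learning framework that comprises a temporal feature fusion module and detection, segmentation, and association networks, and that performs instance segmentation of target objects over a video sequence;

2) The temporal feature fusion module combines a feature pyramid with temporal scaled dot-product attention. It exploits the temporal information of the video, strengthens the appearance representation of objects, and mitigates the weak feature responses caused by appearance changes and partial occlusion in some frames;

3) Compared with existing methods on a video instance segmentation dataset, the proposed method improves segmentation performance.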

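1 Related work

1.1 Image instance segmentation

Instance segmentation (Wang et al., 2019) is a challenging computer vision task that combines object detection (Ren et al., 2015; Girshick et al., 2014) and semantic segmentation (Long et al., 2015; Zhao et al., 2017; Chen et al., 2018a). It must assign every pixel of an image to a semantic category and further to an individual object instance (Hariharan et al., 2014). Current instance segmentation algorithms are mostly built by adding a semantic segmentation network to an object detector, and they are divided into two-stage and one-stage algorithms according to the underlying detector. Two-stage algorithms are proposal based: they generate candidate boxes and then classify the pixels inside each box to obtain a pixel-level segmentation of the instance. Mask R-CNN is the classic two-stage algorithm. It adds a semantic segmentation head based on a fully convolutional network (FCN) to the detection network and predicts the category, position, and segmentation mask of each object instance. Huang et al. (2019) added a mask scoring mechanism on top of it and designed MS R-CNN (mask scoring region CNN). YOLACT (you only look at coefficients) (Bolya et al., 2019) follows a similar idea to Mask R-CNN but extends a one-stage object detector. Later, PolarMask (Xie et al., 2020) modeled masks in polar coordinates on top of FCOS (fully convolutional one-stage object detection) (Tian et al., 2019), an anchor-free fully convolutional detector. It casts rays at different angles from the center point and learns the instance mask by regression, which introduced a new way of modeling masks.

The extension of instance segmentation to video is called video instance segmentation. Besides segmenting the object instances in each frame, it must also determine the correspondence between the objects in different frames.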

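1.2 Video object segmentation

Video object segmentation (VOS) is divided into semi-supervised and unsupervised segmentation according to whether the first-frame mask is provided at inference. Semi-supervised algorithms study how to propagate the given object mask through the whole video sequence (Chen et al., 2018b). The early OSVOS (one-shot video object segmentation) model (Caelles et al., 2017), the MaskTrack model (Perazzi et al., 2017), and the network-modulation model of Yang et al. (2018) all segment a single target object in the video sequence. Multi-object segmentation, proposed later, faces the challenges of single-object segmentation and must also handle occlusion and similarity among multiple objects. For example, Li et al. (2018) used a conditional random field to further classify the output of the segmentation network and determine the final object category.

Unsupervised video object segmentation requires no user-annotated prior on the target and instead detects the salient objects in the video automatically (Jain et al., 2017; Tokmakov et al., 2017). For example, Grundmann et al. (2010) over-segmented the video into spatio-temporal regions by appearance and built a graph model to solve the unsupervised segmentation problem.

Both kinds of methods segment generic objects and do not consider specific semantic categories.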

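1.3 Video instance segmentation

Video instance segmentation (VIS) extends instance segmentation to video and comprises detecting, segmenting, and tracking object instances. It must assign the pixels of each frame to semantic categories and object instances and also associate the instances that appear across the whole video sequence (Yang et al., 2019). Current algorithms mainly obtain initial detections with an image instance segmentation method and then fall into two classes according to how the detections are linked: by propagating object masks or by tracking.

These methods solve the segmentation of object instances in video and their association across frames, but they still depend heavily on the initial image detection and tend to ignore the contextual information in video, which easily leads to unsatisfactory classification and segmentation.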

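2 Model design

2.1 Problem definition

Let $\boldsymbol{C}=\{1, \ldots, K\}$ be a predefined set of category labels, where $K$ is the number of categories. Given a video sequence of $T$ frames, suppose that it contains $N$ instances belonging to $\boldsymbol{C}$. For each instance $i$, let $c^{i} \in \boldsymbol{C}$ denote its ground-truth category label and $\boldsymbol{m}_{p \cdots q}^{i}$ its binary foreground masks, where $p$ and $q$ are the first and last frames in which the instance appears, with $p \in[1, T]$ and $q \in[p, T]$.

For each predicted instance $j$, a video instance segmentation algorithm must predict a category label $\tilde{c}^{j} \in \boldsymbol{C}$, a confidence score $s^{j} \in[0, 1]$, and the corresponding binary mask sequence $\tilde{\boldsymbol{m}}_{\tilde{p} \cdots \tilde{q}}^{j}$.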

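2.2 Model overview

The proposed framework is shown in Fig. 1. It detects, segments, and associates the objects in a video simultaneously through two-stage multi-task learning. In the first stage, the ResNet backbone (He et al., 2016) extracts features from one input frame, and the temporal feature fusion module extracts multi-scale features through the feature pyramid network (FPN) and aggregates the features of objects already detected in other frames to enhance the feature response of the current frame. The region proposal network (RPN) then extracts multiple candidate objects from the image. In the second stage, the features of the candidate objects are fed into three parallel task heads. The detection head predicts the class and bounding box of each object in the current image, the segmentation head predicts its segmentation mask, and the association head maintains object identity over time by matching the most similar instance in the instance storage space. By sharing the lower layers that contain the temporal feature fusion module, the model strengthens the appearance representation of candidate objects, suppresses background information, and provides high-quality object features as input to detection, segmentation, and the other tasks.

Fig. 1 Video instance segmentation model based on temporal feature fusion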

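2.3 Temporal feature fusion module

Compared with ordinary images, video frames suffer from quality problems such as defocus and motion blur, but consecutive frames are clearly similar to one another. Inspired by the feature pyramid and scaled dot-product attention (Vaswani et al., 2017; Buades et al., 2005; Liu et al., 2019), we propose a temporal feature fusion module. It extracts multi-scale spatial features and computes similarities and an attention matrix from the temporal relation between the current frame and the objects detected in other frames, which gives the features of other frames weighted with respect to the current frame. These are then merged back into the multi-scale features through a residual connection. This strengthens the feature response of the current frame and benefits the subsequent detection and segmentation tasks.

As shown in Fig. 2, the enhanced response at a position $\boldsymbol{x}_{i}$ of the current frame is a weighted sum over all positions $\boldsymbol{y}_{j}$ of the detected objects, with weights given by dot-product similarity. Solid lines mark positions with high weights, and dashed lines mark positions with low weights. In this example, the person detected in frame $t-1$ has a high weight with the position of the person in frame $t$ and a low weight with the surfboard and the seawater in the background. Likewise, the detected surfboard has a relatively high weight with the features at the surfboard's position.

Fig. 2 Relation of the frames' features

As shown in Fig. 3, for the current input frame, the ResNet backbone and the FPN extract multi-scale features. The temporal feature fusion module first resizes the features of all scales to the common intermediate scale of $\boldsymbol{C}_{4}$ and then averages them into a single feature that combines the scales, $\boldsymbol{X}=\frac{1}{L} \sum\limits_{l=l_{\min }}^{l_{\max }} \boldsymbol{C}_{l}$, where $\boldsymbol{C}_{l}$ is the feature at level $l$ and $L$ is the number of levels. Next, scaled dot-product attention is computed between each object feature stored in the instance storage space and the current-frame feature $\boldsymbol{X}$, which produces features that aggregate the information of the different objects. Finally, the new feature is added to each of the original multi-scale features through a residual connection, which yields enhanced multi-scale features that carry spatio-temporal information.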
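As a minimal sketch of the multi-scale averaging step $\boldsymbol{X}=\frac{1}{L}\sum_{l}\boldsymbol{C}_{l}$, the following Python code resizes single-channel feature maps to a reference resolution and averages them. The nearest-neighbour resizing, the function names `resize_nearest` and `fuse_scales`, and the list-of-lists representation are assumptions made for illustration; the text does not state which resizing operator the model uses.

```python
def resize_nearest(fmap, out_h, out_w):
    """Resize a 2D feature map (list of rows) by nearest-neighbour sampling."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def fuse_scales(levels, ref_index):
    """Average all pyramid levels after resizing them to the reference level."""
    ref = levels[ref_index]
    h, w = len(ref), len(ref[0])
    resized = [resize_nearest(f, h, w) for f in levels]
    return [[sum(f[r][c] for f in resized) / len(levels) for c in range(w)]
            for r in range(h)]


# Three toy levels (fine to coarse); the middle one plays the role of C_4.
levels = [
    [[1.0] * 4 for _ in range(4)],
    [[2.0] * 2 for _ in range(2)],
    [[3.0]],
]
fused = fuse_scales(levels, ref_index=1)
```

Here `fused` has the resolution of the middle level, and every entry is the mean of the three constant maps.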

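Fig. 3 Temporal feature fusion module

Scaled dot-product attention is computed as in Eq. (1). Based on the visual features of the detected objects, this operation locates the regions of similar objects in the current image and suppresses irrelevant background regions, which gives an attention feature map focused on the objects.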

$ A(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=S\left(\frac{\boldsymbol{Q} \boldsymbol{K}^{\mathrm{T}}}{\sqrt{d_{k}}}\right) \boldsymbol{V} $

(1)

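where $S$ denotes the softmax function, $d_{k}$ is the dimension of the input feature vectors $\boldsymbol{Q}$ and $\boldsymbol{K}$ (set here to the default value 1 to avoid excessive scaling), and $\boldsymbol{V}$ is the feature to be weighted.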
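Eq. (1) can be sketched in plain Python as follows. This is an illustrative implementation on small list-of-lists matrices, not the authors' code; the function names and the toy inputs are assumptions, and `d_k` defaults to 1 as stated in the text.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, d_k=1.0):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d, K is n_k x d, V is n_k x d_v (lists of rows).
    """
    scale = math.sqrt(d_k)
    out = []
    for q in Q:
        # Similarity of this query position to every key position.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


# Toy example: two current-frame positions attend to two stored object positions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

Each query here matches one key much more strongly than the other, so each output row lies very close to the corresponding row of `V`.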

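The detailed procedure is shown in Fig. 4. As before, $\boldsymbol{Y}$ and $\boldsymbol{X}$ denote the feature vectors of the objects detected in other frames and of the current frame, respectively, and $\theta, \varphi, g$ are linear transformations of the feature vectors, implemented as 1×1 convolutions. The current-frame feature $\boldsymbol{X}$ is mapped to $\boldsymbol{Q}$, and the object feature $\boldsymbol{Y}$ stored in the instance storage space is mapped to $\boldsymbol{K}$ and $\boldsymbol{V}$. The dot-product similarity of $\boldsymbol{Q}$ and $\boldsymbol{K}$ gives the attention matrix, which is used to compute a weighted sum of $\boldsymbol{V}$. Finally, the result is added to the original feature through a residual connection, which yields an enhanced feature that carries object information from other frames. For each position $\boldsymbol{x}_{i}$ of $\boldsymbol{X}$, the attention-weighted feature is computed as

$ \boldsymbol{x}_{i}^{\prime}=\frac{1}{c(\boldsymbol{Y})} \sum\limits_{\forall j}\left(f\left(\boldsymbol{x}_{i}, \boldsymbol{y}_{j}\right) g\left(\boldsymbol{y}_{j}\right)\right) $

(2)

$ f\left(\boldsymbol{x}_{i}, \boldsymbol{y}_{j}\right)=\theta\left(\boldsymbol{x}_{i}\right)^{\mathrm{T}} \varphi\left(\boldsymbol{y}_{j}\right) $

(3)

where $c(\boldsymbol{Y})=\sum\limits_{\forall k} f\left(\boldsymbol{x}_{i}, \boldsymbol{y}_{k}\right)$ is the normalization factor that corresponds to the softmax operation above (the two coincide exactly when the similarities are exponentiated before normalization), and $f$ is the dot-product similarity function.

Fig. 4 Scaled dot-product attention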

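The final weighted feature passes through a last convolutional layer that restores the number of channels. Because several object features take part in the scaled dot-product attention with the current-frame feature, several weighted feature vectors are obtained, one per object. These are fused by averaging, scaled by a shared trainable scalar $\alpha$, and added once more to the original feature. Let $\boldsymbol{M}$ denote the set of object feature vectors stored in the instance storage space and ${Conv}\left(\boldsymbol{X}_{m}^{\prime}\right)$ the enhanced feature that incorporates the $m$-th object. The new combined feature $\boldsymbol{Z}$ is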

$ \boldsymbol{Z}=\alpha \times\left(\frac{1}{|\boldsymbol{M}|} \sum\limits_{m \in \boldsymbol{M}} {Conv}\left(\boldsymbol{X}_{m}^{\prime}\right)\right)+\boldsymbol{X} $

(4)
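The residual fusion of Eq. (4) can be illustrated with a short Python sketch. It assumes that the per-object features have already passed through the final convolution, so each element of `attended_per_object` stands for ${Conv}(\boldsymbol{X}_{m}^{\prime})$; the function name and the toy values are hypothetical.

```python
def fuse_objects(X, attended_per_object, alpha):
    """Eq. (4): Z = alpha * mean over objects of Conv(X'_m), plus X.

    X is an n x d list of rows; attended_per_object is a list with one
    n x d matrix per stored object (assumed already convolved).
    """
    m = len(attended_per_object)
    return [[alpha * sum(obj[i][j] for obj in attended_per_object) / m + X[i][j]
             for j in range(len(X[0]))]
            for i in range(len(X))]


# One spatial position with two channels, enhanced by two stored objects.
X = [[1.0, 1.0]]
attended_per_object = [[[2.0, 4.0]], [[4.0, 0.0]]]
Z = fuse_objects(X, attended_per_object, alpha=0.5)
```

With `alpha=0` the function returns the original feature unchanged, which shows how the trainable scalar controls how much temporal information is mixed in.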

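2.4 Multi-task network design

Instance segmentation on a video sequence essentially assigns the same label to the pixels of the same object in consecutive frames. It must detect and segment the object instances in every frame and associate the same instance across frames. Our model adds a parallel instance association module to Mask R-CNN and introduces the PointRend structure (Kirillov et al., 2020) into the segmentation head to further improve segmentation.

The region proposal network extracts multiple candidate regions from predefined anchors, and RoIAlign produces the feature vector of each candidate region, which serves as input to the subsequent tasks. The detection head uses partly shared fully connected layers and softmax to predict the regressed bounding box and the category of each instance.

Existing segmentation heads use a fully convolutional network (FCN) (Long et al., 2015) to classify the pixels of each candidate region, which solves image segmentation at the semantic level. However, the pixels that are hard to predict are concentrated on object boundaries, whereas an FCN treats interior and boundary pixels equally regardless of difficulty, so the segmentation is not satisfactory.

Following PointRend, we feed the candidate object features, which carry temporal information, into two sub-networks: coarse mask segmentation and boundary-point prediction. Coarse mask segmentation downsamples the candidate region features with a 2×2 convolutional layer and then uses a simple multilayer perceptron with two 1 024-dimensional hidden layers to obtain a coarse segmentation mask of the instance. Boundary-point prediction samples boundary points on the coarse mask and concatenates the coarse prediction at each point with the corresponding fine-grained features to form the feature vector of the point. This vector passes through a simple multilayer perceptron with three 256-dimensional hidden layers, and the results are merged into a more precise instance mask.

The association module is shown in Fig. 5. Two 1 024-dimensional fully connected layers produce a feature vector for each candidate object. Its inner-product similarity to the instances already in the instance storage space is then computed, the candidate is assigned the identity of the most similar instance, and the corresponding feature vector in the instance storage space is updated.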

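Fig. 5 Association module

For a candidate object $i$, $x=0$ means that it is a new instance, and $x \in[1, N]$ means that it is one of the $N$ previously identified instances. $\boldsymbol{\phi}_{i}$ and $\boldsymbol{\phi}_{j}$ denote the feature embeddings of the candidate object and of the $N$ existing instances, and $\boldsymbol{\phi}_{i}^{\mathrm{T}} \boldsymbol{\phi}_{j}$ is their inner product. The probability of assigning candidate $i$ to instance $x$ is computed by softmax as

$ p_{i}(x)= \begin{cases}\frac{1}{1+\sum\limits_{j=1}^{N} \mathrm{e}^{\boldsymbol{\phi}_{i}^{\mathrm{T}} \boldsymbol{\phi}_{j}}} & x=0 \\ \frac{\mathrm{e}^{\boldsymbol{\phi}_{i}^{\mathrm{T}} \boldsymbol{\phi}_{x}}}{1+\sum\limits_{j=1}^{N} \mathrm{e}^{\boldsymbol{\phi}_{i}^{\mathrm{T}} \boldsymbol{\phi}_{j}}} & x \in[1, N]\end{cases} $

(5)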
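Eq. (5) is a softmax in which the "new instance" option has a fixed logit of 0. The following Python sketch computes the assignment probabilities for one candidate; the function name `association_probs` and the toy embeddings are assumptions for illustration, and the max-shift is only there for numerical stability.

```python
import math


def association_probs(phi_i, existing):
    """Eq. (5): probabilities over {new instance} + N existing instances.

    Index 0 of the result is the 'new instance' case (logit 0); index x >= 1
    corresponds to the x-th stored instance embedding.
    """
    logits = [sum(a * b for a, b in zip(phi_i, phi_j)) for phi_j in existing]
    m = max([0.0] + logits)          # shift for numerical stability
    exp_new = math.exp(0.0 - m)      # the constant 1 in the denominator
    exps = [math.exp(l - m) for l in logits]
    denom = exp_new + sum(exps)
    return [exp_new / denom] + [e / denom for e in exps]


# A candidate whose embedding matches the first stored instance best.
probs = association_probs([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
best = probs.index(max(probs))
```

With an empty instance storage space the function returns `[1.0]`, so the first detection in a video is always treated as a new instance.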

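2.5 Loss function

The loss of the model is the sum of the losses of the classification, bounding-box regression, segmentation, and association tasks: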

$ L=L_{\mathrm{cls}}+L_{\mathrm{reg}}+L_{\text {mask }}+L_{\text {assoc }} $

(6)

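$L_{\mathrm{reg}}$ has two parts. Smooth L1 loss is used for RPN regression, and Balanced L1 loss is used for the refinement regression in the detection head:

$ {smooth}_{\mathrm{L} 1}(x)= \begin{cases}0.5 x^{2} & |x|<1 \\ |x|-0.5 & \text { otherwise }\end{cases} $

(7)

$ L_{b}(x)= \begin{cases}\frac{\alpha}{b}(b|x|+1) \ln (b|x|+1)-\alpha|x| & |x|<1 \\ \gamma|x|+C & \text { otherwise }\end{cases} $

(8)

Smooth L1 loss in the RPN speeds up convergence and prevents exploding gradients, while Balanced L1 loss balances the contribution of hard samples to the loss. Its hyperparameters satisfy $\alpha \ln (b+1)=\gamma$, with defaults $\alpha=0.5$ and $\gamma=1.5$. The source also lists $b=7$, although the constraint with these values of $\alpha$ and $\gamma$ gives $b=\mathrm{e}^{3}-1 \approx 19.1$.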
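The two regression losses in Eqs. (7) and (8) can be sketched as follows. Two points are assumptions rather than statements from the text: $b$ is derived from the constraint $\alpha \ln (b+1)=\gamma$ instead of being set directly, and the constant $C$ is chosen as $\gamma / b-\alpha$ so that the two branches of Eq. (8) meet at $|x|=1$.

```python
import math


def smooth_l1(x):
    """Eq. (7): quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def balanced_l1(x, alpha=0.5, gamma=1.5):
    """Eq. (8) with b taken from alpha * ln(b + 1) = gamma (assumption)."""
    b = math.exp(gamma / alpha) - 1.0
    ax = abs(x)
    if ax < 1.0:
        return alpha / b * (b * ax + 1.0) * math.log(b * ax + 1.0) - alpha * ax
    # C is not given in the text; this value makes the loss continuous at |x| = 1.
    C = gamma / b - alpha
    return gamma * ax + C


losses = [(x, smooth_l1(x), balanced_l1(x)) for x in (0.0, 0.5, 1.0, 2.0)]
```

Under these assumptions the gradient of the Balanced L1 branch for $|x|<1$ is $\alpha \ln (b|x|+1)$, which rises to $\gamma$ at $|x|=1$ and matches the slope of the linear branch.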

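$L_{\mathrm{cls}}$ and $L_{\mathrm{assoc}}$ both use the categorical cross-entropy loss. In Eq. (9), $p$ is the predicted foreground probability and $p^{*} \in\{0,1\}$ is the ground-truth label (1 for foreground, 0 for background).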

$ L_{\mathrm{cls}}=-\log \left(p \times p^{*}+(1-p)\left(1-p^{*}\right)\right) $

(9)

$ L_{\mathrm{assoc}}=-\sum\limits_{i} \log \left(p_{i}\left(x_{i}\right)\right) $

(10)

$L_{\text {mask }}$ follows PointRend and uses the binary cross-entropy between the predicted and ground-truth masks and between the predicted and ground-truth points: $L_{\text {mask }}=B C E\left(\boldsymbol{M}, \boldsymbol{M}^{*}\right)$.

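3 Experiments

3.1 Dataset

We evaluate the proposed framework on the public YouTube-VIS (video instance segmentation) dataset.

YouTube-VIS is an instance segmentation dataset derived from the large-scale YouTube-VOS video dataset. It contains high-resolution YouTube videos with 40 category labels, such as people, animals, and vehicles, together with 4 883 distinct object instances and 131 000 high-quality binary foreground masks. It is randomly split into a training set of 2 238 videos and a validation set of 302 videos, and the validation set is guaranteed to contain at least 4 instances of each category. All algorithms are trained on the training set and evaluated on the validation set.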

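3.2 Experimental setup

3.2.1 Implementation details

YouTube-VIS provides image data sampled every 5 frames. Features are extracted from each frame with ResNet. We use the ResNet-50 residual network, initialized with parameters pretrained on ImageNet (Deng et al., 2009) and Microsoft COCO (common objects in context) (Lin et al., 2014).

Input images from YouTube-VIS are resized to 640×360 pixels and flipped horizontally at random with probability 0.5 for data augmentation. To speed up training, we train in a distributed manner on 2 Tesla P100 GPUs with 8 images each, so that one batch contains 16 images.

During training, in addition to the current frame, the model takes another frame sampled at random from the same video as an auxiliary frame, which helps train the temporal feature fusion module and the association network. The $n$ detections with the highest confidence generated by the RPN on the auxiliary frame are used as input to the temporal feature fusion module, with $n$=10 in our experiments.

The parameters of the segmentation network follow PointRend. For boundary-point prediction, the adaptive point sampling strategy uses $k$=3 and $\beta$=0.75 during training.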
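The roles of $k$ and $\beta$ can be illustrated with a sketch of the training-time point sampling described in PointRend: $kN$ candidate points are drawn uniformly, the $\beta N$ most uncertain ones are kept, and the remaining $(1-\beta) N$ points are drawn uniformly. The callable `coarse_prob`, the uncertainty measure (distance of the foreground probability from 0.5), and the toy mask are assumptions for illustration.

```python
import random


def sample_points(coarse_prob, n, k=3, beta=0.75, rng=None):
    """Pick n training points in [0, 1)^2, biased toward uncertain regions."""
    rng = rng or random.Random(0)
    # 1) Over-generation: k * n uniformly distributed candidates.
    candidates = [(rng.random(), rng.random()) for _ in range(k * n)]
    # 2) Importance sampling: keep the beta * n most uncertain candidates,
    #    i.e. those whose coarse foreground probability is closest to 0.5.
    candidates.sort(key=lambda pt: abs(coarse_prob(pt) - 0.5))
    n_uncertain = int(beta * n)
    uncertain = candidates[:n_uncertain]
    # 3) Coverage: the remaining points are drawn uniformly.
    uniform = [(rng.random(), rng.random()) for _ in range(n - n_uncertain)]
    return uncertain + uniform


def coarse_prob(pt):
    """Toy coarse mask: probability falls linearly with x, boundary at x = 0.5."""
    return 1.0 - pt[0]


points = sample_points(coarse_prob, n=16)
```

With `n=16`, 12 of the returned points concentrate near the toy boundary and 4 are spread over the whole region.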

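The model is trained for the standard 12 epochs using stochastic gradient descent with momentum. The initial learning rate $\theta$ is 0.005, the momentum factor $\gamma$ is 0.9, and the weight decay $\eta$ is 0.000 1. To prevent the oscillation that a large initial learning rate may cause, the learning rate is warmed up gradually from a small value over the first 500 iterations, after which training continues at the initial learning rate of 0.005. The learning rate is reduced by 10% at the 8th and 11th epochs to avoid updating the network parameters too quickly.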

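3.2.2 Evaluation metrics

Video instance segmentation is evaluated with two metrics, mean average precision (AP) and average recall (AR), both of which are computed from the video intersection over union (IoU). Ten IoU thresholds are used, from 50% to 95% in steps of 5%.

For video instance segmentation, the IoU is extended from images to video sequences. Given a ground-truth mask sequence $\boldsymbol{m}_{p \cdots q}$ and a predicted mask sequence $\tilde{\boldsymbol{m}}_{\tilde{p} \cdots \tilde{q}}$, a frame in which the object is absent is padded with an empty mask, $\boldsymbol{m}_{t}=\mathbf{0}$ or $\tilde{\boldsymbol{m}}_{t}=\mathbf{0}$. In this way $p$ and $\tilde{p}$ are extended to 1, and $q$ and $\tilde{q}$ are extended to $T$. The IoU for video instance segmentation is then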

$ {IoU}(i, j)=\frac{\sum\limits_{t=1}^{T}\left|\boldsymbol{m}_{t}^{i} \cap \tilde{\boldsymbol{m}}_{t}^{j}\right|}{\sum\limits_{t=1}^{T}\left|\boldsymbol{m}_{t}^{i} \cup \tilde{\boldsymbol{m}}_{t}^{j}\right|} $

(11)

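where $i$ indexes the ground-truth instances and $j$ the predicted instances. The summed intersection of the two instances over the whole video is divided by their summed union. This allows the predicted and ground-truth mask sequences of an entire video to be compared in a consistent way and measures the accuracy of the segmentation masks.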
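Eq. (11) can be sketched in a few lines of Python. Representing each per-frame mask as a set of pixel coordinates, with an empty set for frames in which the instance is absent, is an assumption made to keep the example short; `video_iou` and `thresholds` are illustrative names.

```python
def video_iou(gt_masks, pred_masks):
    """Eq. (11): summed per-frame intersection over summed per-frame union.

    Each argument is a list over the T frames; every entry is a set of
    (row, col) pixels, or an empty set when the instance is absent.
    """
    inter = sum(len(g & p) for g, p in zip(gt_masks, pred_masks))
    union = sum(len(g | p) for g, p in zip(gt_masks, pred_masks))
    return inter / union if union else 0.0


# Three frames; the ground truth is absent in frame 2.
gt = [{(0, 0), (0, 1)}, set(), {(1, 1)}]
pred = [{(0, 1)}, {(2, 2)}, {(1, 1)}]
iou = video_iou(gt, pred)

# The ten IoU thresholds used for AP and AR: 0.50, 0.55, ..., 0.95.
thresholds = [round(0.5 + 0.05 * k, 2) for k in range(10)]
matched_at = [t for t in thresholds if iou >= t]
```

A prediction in a frame where the ground truth is empty adds to the union but not to the intersection, so spurious masks lower the score even when the remaining frames match perfectly.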

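3.3 Comparison of results

We compare our method with the following existing video instance segmentation methods.

Mask-propagation methods use the first frame of the video as guidance and apply a video object segmentation algorithm to propagate the object masks. They include OSMN (one-shot modulation network) (Yang et al., 2018) and FEELVOS (fast end-to-end embedding learning for video object segmentation) (Voigtlaender et al., 2019). Tracking-by-detection methods apply an object detection algorithm to obtain the objects in each image and then associate them across frames with various tracking methods. Existing algorithms include MaskTrack R-CNN and STEm-Seg.

Table 1 shows that our method segments better than the existing baselines on the validation set: AP improves by 2.1 percentage points and AR by 1.5 percentage points. Existing methods such as MaskTrack R-CNN ignore the visual information of other frames and attend only to the features of the current image, so detection and segmentation in different frames cannot benefit from one another. STEm-Seg takes a video clip as input and extracts spatio-temporal information with 3D convolutions. Although it is the first to introduce 3D convolutions for this task, the current dataset is too small to train the 3D convolutional weights well, so its results are not very good. Our method instead uses the temporal feature information of the video and transfers the visual features of an object in other frames onto the features of the current frame, which further improves segmentation accuracy.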

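Table 1 Quantitative comparisons of mAP

/%
Method	Type	AP	AP50	AP75	AR1	AR10
OSMN	Mask propagation	23.4	36.5	25.7	28.9	31.1
FEELVOS	Mask propagation	26.9	42.0	29.7	29.9	33.4
MaskTrack R-CNN	Tracking-by-detection	30.3	51.1	32.6	31.0	35.5
STEm-Seg	Tracking-by-detection	30.6	50.7	33.5	31.6	37.1
Ours	Tracking-by-detection	32.7	53.6	35.2	33.1	38.2
Note: the proposed method gives the best result in every column.

3.4 Ablation study

Our method adds a point prediction network to further improve the precision of the segmentation. Table 2 compares the mean average precision obtained with an FCN head and with the point prediction network. Introducing the rendering-based point sampling method, which predicts a subset of image points more finely, improves the segmentation of object masks to some extent.

Table 2 Results on the YouTube-VIS validation set

/%
Method	AP	AP50	AP75
FCN	30.3	51.1	32.6
Point prediction network	30.6	51.4	32.7
Note: the point prediction network gives the best result in every column.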

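During training, the temporal feature fusion module selects RPN candidate objects from the auxiliary frame for information fusion, so different numbers of objects need to be compared. The $n$ proposals with the highest confidence generated by the RPN on the auxiliary frame are selected as objects, and we compare $n$=2, 5, 10, 15, 20. As shown in Fig. 6, when few objects are selected, they do not cover most of the objects in the auxiliary frame, so the model improves only slightly, and performance increases with the number of objects. However, each frame in YouTube-VIS contains relatively few objects. When too many RPN proposals are selected, poorly recognized objects, duplicate detections, and even background features misclassified as objects are fed into the temporal feature fusion module, which degrades the model. Based on these results, we select the 10 RPN proposals with the highest confidence on the auxiliary frame by default.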
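The selection of the $n$ most confident auxiliary-frame proposals can be sketched as below. The proposal representation (a dictionary with a `score` field) and the function name `top_n_proposals` are hypothetical and only illustrate the ranking step.

```python
def top_n_proposals(proposals, n=10):
    """Keep the n proposals with the highest confidence score."""
    return sorted(proposals, key=lambda p: p["score"], reverse=True)[:n]


# Toy proposals from an auxiliary frame.
proposals = [
    {"id": 0, "score": 0.2},
    {"id": 1, "score": 0.9},
    {"id": 2, "score": 0.5},
]
selected = top_n_proposals(proposals, n=2)
selected_ids = [p["id"] for p in selected]
```

If fewer than `n` proposals are available, all of them are returned, which corresponds to frames with few candidate objects.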

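Fig. 6 RPN objects in temporal feature fusion

Fig. 7 shows the detection and segmentation results of the model in different video scenes. In the scenes of Fig. 7(a) and Fig. 7(b), the model detects the objects well and gives accurate segmentation masks. Fig. 7(c) shows that the model clearly distinguishes different instances of the same category, here leopards. For the occluded deer behind them and the occluded giant panda in Fig. 7(d), the temporal feature fusion module lets the model recognize the objects from the information in other frames. However, the model still has shortcomings: instances that are too close together are not separated well, and the same instance may receive duplicate detection boxes. There is therefore considerable room for improvement.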

Fig. 7 Results of the temporal feature fusion module ((a) person and surfboard in fast motion; (b) eagle with deformed posture; (c) multiple leopard instances; (d) occluded giant panda)

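4 Conclusion

Current work on video instance segmentation ignores temporal context in video and yields unsatisfactory classification and segmentation. To address this, we propose a multi-task learning video instance segmentation model with temporal feature fusion. The model better captures the subtle correlations between other frames and the current frame and selectively aggregates the key features of other frames to strengthen the current representation, which makes better use of temporal information and improves the subsequent tasks. Multi-task learning is then used to detect, segment, and associate the objects in the video simultaneously.

The current fusion mechanism operates on whole-image features, so it mainly benefits the semantic classification and coarse segmentation of objects, and improving fine mask segmentation requires further study. Future work will consider aggregating and strengthening features at the level of object instances and pixels, and across longer frame sequences, to improve the detection and segmentation of target objects.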

References

Athar A, Mahadevan S, Ošep A, Leal-Taixé L and Leibe B. 2020. STEm-Seg: spatio-temporal embeddings for instance segmentation in videos//Proceedings of 2020 European Conference on Computer Vision. Cham: Springer: 158-177[DOI: 10.1007/978-3-030-58621-8_10]

Bochinski E, Eiselein V and Sikora T. 2017. High-speed tracking-by-detection without using image information//Proceedings of 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Lecce, Italy: IEEE: 1-6[DOI: 10.1109/AVSS.2017.8078516]

Bolya D, Zhou C, Xiao F Y and Lee Y J. 2019. YOLACT: real-time instance segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea(South): IEEE: 9156-9165[DOI: 10.1109/ICCV.2019.00925]

Buades A, Coll B and Morel J M. 2005. A non-local algorithm for image denoising//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). San Diego, USA: IEEE: 60-65[DOI: 10.1109/CVPR.2005.38]

Caelles S, Maninis K K, Pont-Tuset J, Leal-Taixé L, Cremers D and Van Gool L. 2017. One-shot video object segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5320-5329[DOI: 10.1109/CVPR.2017.565]

Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. 2018a. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI:10.1109/TPAMI.2017.2699184]

Chen Y H, Pont-Tuset J, Montes A and Van Gool L. 2018b. Blazingly fast video object segmentation with pixel-wise metric learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1189-1198[DOI: 10.1109/CVPR.2018.00130]

Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255[DOI: 10.1109/CVPR.2009.5206848]

Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587[DOI: 10.1109/CVPR.2014.81]

Grundmann M, Kwatra V, Han M and Essa I. 2010. Efficient hierarchical graph-based video segmentation//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2141-2148[DOI: 10.1109/CVPR.2010.5539893]

Hariharan B, Arbeláez P, Girshick R and Malik J. 2014. Simultaneous detection and segmentation//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 297-312[DOI: 10.1007/978-3-319-10584-0_20]

He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2980-2988[DOI: 10.1109/ICCV.2017.322]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]

Huang Z J, Huang L C, Gong Y C, Huang C and Wang X G. 2019. Mask scoring R-CNN//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6402-6411[DOI: 10.1109/CVPR.2019.00657]

Jain S D, Xiong B and Grauman K. 2017. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 2117-2126[DOI: 10.1109/CVPR.2017.228]

Jiang S H, Song H H, Zhang K H, Tang R F. 2019. Video object segmentation method based on dual pyramid network. Journal of Computer Applications, 39(8): 2242-2246 (姜斯浩, 宋慧慧, 张开华, 汤润发. 2019. 基于双重金字塔网络的视频目标分割方法. 计算机应用, 39(8): 2242-2246) [DOI:10.11772/j.issn.1001-9081.2018122566]

Kirillov A, Wu Y X, He K M and Girshick R. 2020. PointRend: image segmentation as rendering//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9796-9805[DOI: 10.1109/CVPR42600.2020.00982]

Li S Y, Seybold B, Vorobyov A, Fathi A, Huang Q and Jay Kuo C C. 2018. Instance embedding transfer to unsupervised video object segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6526-6535[DOI: 10.1109/CVPR.2018.00683]

Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944[DOI: 10.1109/CVPR.2017.106]

Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755[DOI: 10.1007/978-3-319-10602-1_48]

Liu X Y, Ren H B and Ye T M. 2019. Spatio-temporal attention network for video instance segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision Workshop. Seoul, Korea(South): IEEE: 725-727[DOI: 10.1109/ICCVW.2019.00092]

Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440[DOI: 10.1109/CVPR.2015.7298965]

Perazzi F, Khoreva A, Benenson R, Schiele B and Sorkine-Hornung A. 2017. Learning video object segmentation from static images//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3491-3500[DOI: 10.1109/CVPR.2017.372]

Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 91-99

Tian Z, Shen C H, Chen H and He T. 2019. FCOS: fully convolutional one-stage object detection//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea(South): IEEE: 9626-9635[DOI: 10.1109/ICCV.2019.00972]

Tokmakov P, Alahari K and Schmid C. 2017. Learning video object segmentation with visual memory//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4491-4500[DOI: 10.1109/ICCV.2017.480]

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc. : 6000-6010

Voigtlaender P, Chai Y, Schroff F, Adam H, Leibe B and Chen L C. 2019. FEELVOS: fast end-to-end embedding learning for video object segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9473-9482[DOI: 10.1109/CVPR.2019.00971]

Wang Z Y, Yuan C, Li J C. 2019. Instance segmentation with separable convolutions and multi-level features. Journal of Software, 30(4): 954-961 (王子愉, 袁春, 黎健成. 2019. 利用可分离卷积和多级特征的实例分割. 软件学报, 30(4): 954-961) [DOI:10.13328/j.cnki.jos.005667]

Xie E Z, Sun P Z, Song X G, Wang W H, Liu X B, Liang D, Shen C H and Luo P. 2020. PolarMask: single shot instance segmentation with polar representation//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 12190-12199[DOI: 10.1109/CVPR42600.2020.01221]

Yang L J, Fan Y C and Xu N. 2019. Video instance segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea(South): IEEE: 5187-5196[DOI: 10.1109/ICCV.2019.00529]

Yang L J, Wang Y R, Xiong X H, Yang J C and Katsaggelos A K. 2018. Efficient video object segmentation via network modulation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 6499-6507[DOI: 10.1109/CVPR.2018.00680]

Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6230-6239[DOI: 10.1109/CVPR.2017.660]

Zhu X Z, Wang Y J, Dai J F, Yuan L and Wei Y C. 2017a. Flow-guided feature aggregation for video object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 408-417[DOI: 10.1109/ICCV.2017.52]

Zhu X Z, Xiong Y W, Dai J F, Yuan L and Wei Y C. 2017b. Deep feature flow for video recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4141-4150[DOI: 10.1109/CVPR.2017.441]