Spatial-temporal video object segmentation with graph convolutional network and attention mechanism

Yao Rui¹, Xia Shixiong¹, Zhou Yong¹, Zhao Jiaqi¹, Hu Fuyuan² (1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China; 2. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China)

Abstract
Objective The task of video object segmentation (VOS) is to track and segment a single object or multiple objects in a video sequence. VOS is an important problem in computer vision: given specific object masks provided manually or automatically on the first or reference frame, the goal is to segment those objects throughout the entire video sequence, and the task plays an important role in video understanding. According to the type of video object labels, VOS methods fall into four categories: unsupervised, interactive, semi-supervised, and weakly supervised. In this study, we address semi-supervised VOS; that is, the ground-truth object mask is given only in the first frame, the segmented object is arbitrary, and no further assumptions are made about the object category. Current semi-supervised VOS methods are mostly based on deep learning and can be divided into two types: detection-based methods and matching-based or motion propagation methods. Without using temporal information, detection-based methods learn an appearance model to perform pixel-level detection and object segmentation at each frame of the video. Matching-based or motion propagation methods exploit the temporal correlation of object motion to propagate the object mask from the first frame, or from a given mask frame, to subsequent frames. Matching-based methods first compute pixel-level matching between the features of the template frame and those of the current frame and then segment each pixel of the current frame directly from the matching result. Motion propagation methods fall into two types: one introduces optical flow to train the VOS model, and the other learns deep object features from the previous frame's object mask and refines the mask of the current frame. Most existing methods rely mainly on the reference mask of the first frame (assisted by optical flow or previous masks) to estimate the object segmentation mask. However, because these models are limited in how they model the spatial and temporal domains, they easily fail under rapid appearance changes or occlusion. Therefore, a spatial-temporal part-based graph model is proposed to generate robust spatial-temporal object features.

Method In this study, we propose an encoder-decoder VOS framework built on a spatial-temporal part-based graph. First, the encoder adopts a Siamese architecture whose input has two branches: a historical image frame stream and a current image frame stream. To simplify the model, we introduce a Markov assumption: the current mask is estimated given the current frame, the K-1 previous frames, and the K-1 previously estimated segmentation masks. One branch inputs the historical frames and masks to capture the dynamic features of the sequence, and the other branch inputs the current frame image and the segmentation mask of the previous frame. Both branches use ResNet50 as the base network, with weights initialized from the ImageNet pre-trained model. After obtaining the output of the Res5 stage, we use a global convolution module to produce image features, with the convolution kernel size set to 7 and the number of feature channels set to 512, matching the other feature dimensions.
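The following is a minimal PyTorch sketch of such a two-branch encoder, not the authors' implementation. Only the ResNet50 backbone, the ImageNet initialization, the kernel size of 7, and the 512 output channels come from the description above; the 4-channel frame-plus-mask input, the weight-copying scheme for the mask channel, the separable realization of the 7x7 global convolution, and all module names are illustrative assumptions.

```python
# Hypothetical sketch of one Siamese encoder branch (illustrative, not the
# authors' code). Assumption: each branch consumes an RGB frame concatenated
# with a 1-channel mask, and the 7x7 global convolution is realized with
# separable 1x7/7x1 branches, as in common "global convolution network" designs.
import torch
import torch.nn as nn
from torchvision import models

class GlobalConvModule(nn.Module):
    """Projects Res5 features to 512 channels with an effective 7x7 kernel."""
    def __init__(self, in_ch: int = 2048, out_ch: int = 512, k: int = 7):
        super().__init__()
        p = k // 2
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.branch_a(x) + self.branch_b(x)

class EncoderBranch(nn.Module):
    """ResNet50 (ImageNet weights) on a 4-channel input (RGB frame + mask),
    followed by the global convolution module."""
    def __init__(self):
        super().__init__()
        net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        with torch.no_grad():  # reuse RGB filters; average them for the mask channel
            conv1.weight[:, :3] = net.conv1.weight
            conv1.weight[:, 3:] = net.conv1.weight.mean(dim=1, keepdim=True)
        net.conv1 = conv1
        self.backbone = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                                      net.layer1, net.layer2, net.layer3, net.layer4)
        self.gcm = GlobalConvModule()

    def forward(self, frame: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame, mask], dim=1)  # B x 4 x H x W
        return self.gcm(self.backbone(x))    # B x 512 x H/32 x W/32
```

In a Siamese design, the historical branch would run this same weight-shared module over each of the K-1 previous frame-mask pairs.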
Next, we design a structural graph representation based on parts (nodes) and use a graph convolutional network to learn the object appearance model. To represent the spatial-temporal object model, we construct an undirected spatial-temporal part-based graph GST over the K frames (i.e., t-K, …, t-1) with dense grid parts as nodes, use a two-layer graph convolutional network to output the node feature matrix, and aggregate the spatial-temporal part features through max pooling. In addition, we construct an undirected spatial part-based graph GS (analogous to GST) that goes through the same two-layer graph convolution steps to obtain spatial part-based object features. The spatial-temporal and spatial part-based features are then aligned along the channel dimension to form a single feature with 256 channels. Because the outputs of the two feature models have different characteristics, we adopt an attention mechanism to assign different weights to the features; an illustrative sketch of the graph convolution and channel attention is given at the end of this Method section. To optimize the feature map, we introduce a residual module that improves edge details. Finally, the decoding module applies smooth refinement with an attention mechanism module and merges features of adjacent stages in a multi-scale context; specifically, it consists of three smooth refinement modules followed by a convolution layer and a Softmax layer, and it outputs the mask of the video object. Training proceeds in two stages: we first pre-train the network on simulated images generated from static images and then fine-tune this pre-trained model on the VOS dataset. The time window size K is set to 3. During testing, the reference frame image and mask are updated at an interval of 3 frames so that historical information is effectively memorized.
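As a rough illustration of the graph step described above, the sketch below, again hypothetical rather than the authors' implementation, pairs a standard two-layer graph convolution (in the familiar Kipf-Welling form out = Â ReLU(Â X W1) W2) with max-pooling aggregation and an SE-style channel attention gate. The normalized adjacency a_hat, the node layout, and the reduction ratio r are assumptions; only the two-layer structure, the max-pooling aggregation, and the 256 fused channels come from the text.

```python
# Illustrative sketch (not the authors' code) of the two-layer GCN over the
# part graph and a channel attention gate. a_hat is assumed to be a
# normalized adjacency built over the dense grid parts of the K frames.
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """out = a_hat @ relu(a_hat @ x @ W1) @ W2, then max-pool over nodes."""
    def __init__(self, in_dim: int = 512, hid_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # x: B x N x C node features; a_hat: B x N x N normalized adjacency
        h = torch.relu(a_hat @ self.w1(x))
        h = a_hat @ self.w2(h)
        return h.max(dim=1).values  # aggregate part features by max pooling

class ChannelAttention(nn.Module):
    """SE-style gate re-weighting the 256 fused channels (ratio r assumed)."""
    def __init__(self, ch: int = 256, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        w = self.fc(feat.mean(dim=(2, 3)))   # global average pool -> channel gate
        return feat * w[:, :, None, None]    # feat: B x C x H x W
```

How the pooled graph descriptors are broadcast back onto the 256-channel feature map before the gate is not specified above, so that wiring is left out of the sketch.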
Result In the experiments, the proposed method, which requires neither online fine-tuning nor post-processing, is compared with 12 recent methods on two datasets. On the DAVIS (densely annotated video segmentation)-2016 dataset, its Jaccard similarity-mean (J-M) & F measure-mean (F-M) score reaches 85.3%, 1.7% higher than that of the best-performing comparison method. On the DAVIS-2017 dataset, its J-M & F-M score reaches 68.6%, 1.2% higher than that of the best-performing comparison method. In addition, a comparative experiment on network inputs and post-processing is carried out on the DAVIS-2016 dataset; the results show that the proposed method improves the effect of multi-frame spatial-temporal features.

Conclusion In this work, we study the problem of building a robust spatial-temporal object model for VOS. A spatial-temporal VOS method with a part-based graph is proposed to alleviate visual object drift caused by appearance changes, and the smooth refinement module adds object edge detail, improving segmentation performance. The experimental results show that our model outperforms several state-of-the-art VOS approaches.

Keywords
