Spatial-temporal video object segmentation with graph convolutional network and attention mechanism
2021, Vol. 26, No. 10, Pages: 2376-2387
Received: 2020-07-16; Revised: 2020-08-26; Accepted: 2020-09-02; Published in print: 2021-10-16
DOI: 10.11834/jig.200357
Objective
Learning a spatial-temporal object model from large amounts of data is crucial for semi-supervised video object segmentation. Existing methods mainly rely on the reference mask of the first frame (assisted by optical flow or previously estimated masks) to estimate the object segmentation mask, but because these models are limited in how they model the spatial and temporal domains, they easily fail under rapid appearance changes or occlusion. We therefore propose a spatial-temporal part-based graph convolutional network model to generate robust spatial-temporal object features.
Method
First, a Siamese encoder with two branches is used: one branch takes the historical frames and masks to capture the dynamic features of the sequence, and the other takes the current frame image and the segmentation mask of the previous frame. Second, a spatial-temporal part-based graph is constructed, and a graph convolutional network learns spatial-temporal features to strengthen the appearance and motion models of the object; a channel attention module is introduced, and the resulting robust spatial-temporal object model is passed to the decoding module. Finally, multi-scale image features from adjacent stages are combined to segment the object from the spatial-temporal information.
Result
The proposed method is compared with 12 recent methods on the DAVIS (densely annotated video segmentation)-2016 and DAVIS-2017 datasets. On DAVIS-2016 it performs well: the Jaccard similarity-mean (J-M) and F measure-mean (F-M) score reaches 85.3%, which is 1.7% higher than the best-performing compared method. On DAVIS-2017, the J-M and F-M score reaches 68.6%, which is 1.2% higher than the best-performing compared method. In addition, a comparative experiment on network inputs and post-processing on DAVIS-2016 shows that the proposed method improves the effectiveness of multi-frame spatial-temporal features.
Conclusion
The proposed method requires neither online fine-tuning nor post-processing. The spatial-temporal part-based graph model alleviates visual object drift caused by changes in object appearance, and the smooth refinement module adds edge detail of the object, improving the performance of video object segmentation.
Objective
The task of video object segmentation (VOS) is to track and segment a single object or multiple objects in a video sequence. VOS is an important problem in computer vision: given specific object masks provided manually or automatically on the first or reference frame, the goal is to segment these specific objects throughout the entire video sequence. VOS plays an important role in video understanding. According to the type of object labels, VOS methods can be divided into four categories: unsupervised, interactive, semi-supervised, and weakly supervised. In this study, we deal with the problem of semi-supervised VOS; that is, the ground-truth object mask is given only in the first frame, the segmented object is arbitrary, and no further assumptions are made about the object category. Current semi-supervised VOS methods are mostly based on deep learning and can be divided into two types: detection-based methods and matching-based or motion-propagation methods. Without using temporal information, detection-based methods learn an appearance model to perform pixel-level detection and object segmentation at each frame of the video. Matching-based or motion-propagation methods exploit the temporal correlation of object motion to propagate the object mask from the first frame, or another given mask frame, to subsequent frames. Matching-based methods first compute pixel-level matches between the features of the template frame and the current frame, and then segment each pixel of the current frame directly from the matching result. Motion-propagation methods come in two types: one introduces optical flow to train the VOS model; the other learns deep object features from the object mask of the previous frame and refines the object mask of the current frame. Most existing methods mainly rely on the reference mask of the first frame (assisted by optical flow or a previous mask) to estimate the object segmentation mask. However, because these models are limited in how they model the spatial and temporal domains, they easily fail under rapid appearance changes or occlusion. Therefore, a spatial-temporal part-based graph model is proposed to generate robust spatial-temporal object features.
Method
In this study, we propose an encode-decode VOS framework built on a spatial-temporal part-based graph. First, we use a Siamese architecture for the encoder. The input has two branches: a historical-frame stream and a current-frame stream. To simplify the model, we introduce a Markov assumption; that is, the model is given the current frame, the $$K$$-1 previous frames, and the $$K$$-1 previously estimated segmentation masks. One branch takes the historical frame images and masks to extract dynamic features, and the other branch takes the current frame image and the segmentation mask of the previous frame. Both branches use ResNet50 as the base network, with weights initialized from the ImageNet pre-trained model. After the Res5 stage, a global convolution module outputs the image features, with the convolution kernel size set to 7 and the number of feature channels set to 512, the same as the other feature dimensions.
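A minimal sketch of how one such encoder branch could be written, assuming PyTorch/torchvision; the class names (SiameseEncoder, GlobalConvModule) and the 4-channel input stem are illustrative assumptions rather than the authors' released code:

```python
# Sketch only: a shared ResNet-50 branch that takes an RGB frame plus a 1-channel
# guidance mask and produces 512-channel features via a large-kernel (k = 7)
# "global convolution" module, as described above.
import torch
import torch.nn as nn
import torchvision


class GlobalConvModule(nn.Module):
    """Large-kernel convolution approximated by separable (k,1)/(1,k) branches."""
    def __init__(self, in_ch=2048, out_ch=512, k=7):
        super().__init__()
        p = k // 2
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)


class SiameseEncoder(nn.Module):
    """One branch of the Siamese encoder; both branches share these weights."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)  # ImageNet weights
        # Accept 4 channels: RGB image + 1-channel mask (previous or historical mask).
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.stages = nn.Sequential(*list(backbone.children())[:-2])  # up to Res5
        self.gcm = GlobalConvModule(2048, 512, k=7)

    def forward(self, frame, mask):
        x = torch.cat([frame, mask], dim=1)   # (B, 4, H, W)
        return self.gcm(self.stages(x))       # (B, 512, H/32, W/32)
```

The historical branch would run this encoder on each of the $$K$$-1 (frame, mask) pairs, and the current branch on the current frame paired with the previous mask.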
Next, we design a structural graph representation model based on parts (nodes) and use a graph convolutional network to learn the object appearance model. To represent the spatial-temporal object model, we construct an undirected spatial-temporal part-based graph $${\mathit{\boldsymbol{G}}_{{\rm{ST}}}}$$ on the $$K$$ historical frames (i.e., frames $$t$$-$$K$$, …, $$t$$-1) with dense grid parts (nodes), use a two-layer graph convolutional network to output the feature matrix, and aggregate the object features of the spatial-temporal parts through max pooling. In addition, we construct an undirected spatial part-based graph $${\mathit{\boldsymbol{G}}_{{\rm{S}}}}$$ (similar to $${\mathit{\boldsymbol{G}}_{{\rm{ST}}}}$$), process it with the same two-layer graph convolutional network, and obtain the spatial part-based object features.
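A minimal sketch of the two-layer graph convolution over the part (node) features, assuming the standard normalized propagation rule of a graph convolutional network; the class name PartGCN and the feature dimensions are assumptions:

```python
# Sketch only: two-layer GCN over grid-part nodes followed by max pooling,
# which aggregates the part features into one spatial-temporal object descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartGCN(nn.Module):
    def __init__(self, in_dim=512, hid_dim=256, out_dim=256):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    @staticmethod
    def normalize(adj):
        # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        d = adj.sum(dim=1).pow(-0.5)
        return d.unsqueeze(1) * adj * d.unsqueeze(0)

    def forward(self, node_feats, adj):
        # node_feats: (N, 512) part features pooled from the encoder feature maps
        # adj:        (N, N) undirected spatial-temporal (or spatial) adjacency
        a = self.normalize(adj)
        h = F.relu(a @ self.w1(node_feats))   # first graph convolution layer
        h = a @ self.w2(h)                    # second graph convolution layer
        return h.max(dim=0).values            # max pooling over parts -> (256,)
```

The same module can be applied to both $${\mathit{\boldsymbol{G}}_{{\rm{ST}}}}$$ and $${\mathit{\boldsymbol{G}}_{{\rm{S}}}}$$ with their respective adjacency matrices.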
Next, the spatial-temporal part-based features and the spatial part-based features are channel-aligned into a whole feature with 256 channels. Because the spatial-temporal part-based feature model and the spatial part-based feature model produce outputs with different characteristics, we adopt an attention mechanism to assign different weights to all the features. To optimize the feature map, we introduce a residual module to improve the edge details.
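The channel attention step could be sketched in the spirit of squeeze-and-excitation; the reduction ratio and the module name below are assumptions:

```python
# Sketch only: per-channel re-weighting of the fused spatial-temporal/spatial feature.
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: global average pool
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                        # excitation: channel weights
```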
Finally, in the decoding module, we construct a smooth refinement module, add an attention mechanism module, and merge features of adjacent stages in a multi-scale context. Specifically, the decoding module consists of three smooth refinement modules followed by a convolution layer and a Softmax layer, and it outputs the mask of the video object.
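One smooth refinement block of the decoder might look like the following sketch, assuming bilinear upsampling and a residual smoothing path; the exact layer layout is an assumption:

```python
# Sketch only: upsample the coarser decoder feature, merge it with the skip feature
# from the adjacent encoder stage, and smooth the result with a residual branch.
import torch.nn as nn
import torch.nn.functional as F


class SmoothRefine(nn.Module):
    def __init__(self, skip_ch, dec_ch=256):
        super().__init__()
        self.align = nn.Conv2d(skip_ch, dec_ch, kernel_size=3, padding=1)
        self.smooth = nn.Sequential(
            nn.Conv2d(dec_ch, dec_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dec_ch, dec_ch, kernel_size=3, padding=1))

    def forward(self, coarse, skip):
        up = F.interpolate(coarse, size=skip.shape[-2:],
                           mode='bilinear', align_corners=False)
        merged = up + self.align(skip)        # merge features of adjacent stages
        return merged + self.smooth(merged)   # residual smoothing of edge details
```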
The training process includes two stages: first, we pre-train the network with simulated images generated from static images; second, we fine-tune this pre-trained model on the VOS dataset. The time window size $$K$$ is set to 3. During testing, the reference frame image and mask are updated every 3 frames so that historical information can be effectively memorized.
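The test-time procedure can be summarized by the sketch below; model, frames, and first_mask are hypothetical placeholders for the trained network, the video frames, and the given first-frame mask:

```python
# Sketch only: segment a sequence frame by frame, keeping a history window of K-1
# (frame, mask) pairs and refreshing the reference every 3 frames, as described above.
def segment_sequence(model, frames, first_mask, K=3, update_every=3):
    history_frames = [frames[0]] * (K - 1)     # K-1 historical frames
    history_masks = [first_mask] * (K - 1)     # and their masks
    prev_mask = first_mask
    masks = [first_mask]
    for t in range(1, len(frames)):
        pred = model(history_frames, history_masks, frames[t], prev_mask)
        masks.append(pred)
        prev_mask = pred
        if t % update_every == 0:              # update reference frames and masks
            history_frames = history_frames[1:] + [frames[t]]
            history_masks = history_masks[1:] + [pred]
    return masks
```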
Result
In the experiments, the proposed method requires neither online fine-tuning nor post-processing, and it is compared with 12 recent methods on two datasets. On the DAVIS (densely annotated video segmentation)-2016 dataset, our Jaccard similarity-mean (J-M) & F measure-mean (F-M) score reaches 85.3%, which is 1.7% higher than the best-performing compared method. On the DAVIS-2017 dataset, our J-M & F-M score reaches 68.6%, which is 1.2% higher than the best-performing compared method. At the same time, a comparative experiment on network inputs and post-processing is carried out on the DAVIS-2016 dataset.
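For reference, the reported metrics follow the standard DAVIS definitions: with predicted mask $$M$$, ground-truth mask $$G$$, and boundary precision and recall $$P$$ and $$R$$, the region similarity and boundary accuracy are

$$\mathcal{J} = \frac{|M \cap G|}{|M \cup G|}, \qquad \mathcal{F} = \frac{2PR}{P + R}$$

J-M and F-M are their means over the annotated frames, and the single score quoted above is taken as the average of the two means, following the usual $$\mathcal{J}\&\mathcal{F}$$ measure.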
Conclusion
In this work, we study the problem of building a robust spatial-temporal object model for VOS. A spatial-temporal part-based graph VOS method is proposed to alleviate visual object drift. Experimental results show that our model outperforms several state-of-the-art VOS approaches.