Video object segmentation via feature attention pyramid modulating network
2019, Vol. 24, No. 8, pp. 1349-1357
Received: 2018-12-14
Revised: 2019-03-11
Published in print: 2019-08-16
DOI: 10.11834/jig.180661
Objective
Video object segmentation aims to segment the object of interest throughout an entire video sequence, given its annotated mask in the first frame. Because segmented objects vary widely in scale, existing video object segmentation algorithms lack effective strategies for fusing feature information at different scales. We therefore propose a feature attention pyramid modulating network for video object segmentation.
Method
First, a visual modulator network and a spatial modulator network learn the visual and spatial information of the segmented object, which serves as a prior that guides the segmentation model to adapt to the appearance of the specific object. Then, a feature attention pyramid module mines global context information to handle segmented objects at multiple scales.
Result
Experiments show that on the DAVIS 2016 dataset, without online fine-tuning, our method achieves results competitive with state-of-the-art methods that use online fine-tuning, reaching a $J$-mean of 78.7%. With online fine-tuning, our method achieves the best performance on the DAVIS 2017 dataset, with a $J$-mean of 68.8%.
Conclusion
While segmenting the object of interest, the proposed algorithm effectively combines context information for object masks at different scales, reducing the loss of detail and achieving high-quality video object segmentation.
Objective
Video object segmentation aims to separate a target object from the background and other instances at the pixel level. Segmenting objects in videos is a fundamental task in computer vision because of its wide applications, such as video surveillance, video editing, and autonomous driving. It suffers from challenging factors including occlusion, fast motion, motion blur, and significant appearance variation over time. In this paper, we leverage modulators that learn the limited visual and spatial information of a given target object to adapt the general segmentation network to the appearance of a specific object instance. Moreover, because segmented objects appear at multiple scales, existing video object segmentation algorithms lack appropriate strategies for exploiting feature information at different scales. Therefore, we design a feature attention pyramid module for video object segmentation.
Method
To adapt the generic segmentation network to the appearance of a specific object instance in a single feed-forward pass, we employ two modulators, namely a visual modulator and a spatial modulator, which learn to adjust the intermediate layers of the generic segmentation network given an arbitrary target object instance. Each modulator produces a list of parameters by extracting information from the image of the annotated object and from the spatial prior of the object; these parameters are injected into the segmentation model for layer-wise feature manipulation, as sketched below. The visual modulator network is a convolutional neural network (CNN) that takes the annotated visual object image as input and produces a vector of scale parameters for all modulation layers. It adapts the segmentation network to focus on a specific object instance, namely the object annotated in the first frame, and implicitly learns an embedding of different object types: it should produce similar parameters for similar objects and different parameters for different objects. The spatial modulator network is an efficient network that produces bias parameters from a spatial prior input.
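The abstract gives no equation for the modulation, but the description above corresponds to a scale-and-shift operation on each intermediate feature map. A minimal PyTorch sketch follows; the function name `modulate` and all tensor shapes are our assumptions, not the paper's code:

```python
import torch

def modulate(features: torch.Tensor,
             scale: torch.Tensor,
             bias: torch.Tensor) -> torch.Tensor:
    """Scale-and-shift modulation of one intermediate feature map (a sketch).

    features: (N, C, H, W) activations of the generic segmentation network.
    scale:    (N, C) channel-wise parameters from the visual modulator.
    bias:     (N, 1, H, W) element-wise parameters from the spatial modulator.
    """
    # Broadcast the channel-wise scale over all spatial positions,
    # then add the spatially varying bias derived from the spatial prior.
    return features * scale.view(*scale.shape, 1, 1) + bias
```

Each modulated layer of the segmentation network would receive its own scale vector and bias map.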
Because objects move continuously in a video, we set the prior to the predicted location of the object mask in the previous frame. Specifically, we encode the location information as a heatmap with a 2D Gaussian distribution on the image plane, whose center and standard deviations are computed from the predicted mask of the previous frame (see the sketch after this paragraph). The spatial modulator downsamples the heatmap to different scales to match the resolutions of the feature maps in the segmentation network and then applies a scale-and-shift operation to each downsampled heatmap to generate the bias parameters of the corresponding modulation layer.
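As a rough illustration of this spatial prior, the sketch below builds the 2D Gaussian heatmap from the previous frame's predicted mask. The helper name and the use of the mask pixels' mean and per-axis standard deviation are assumptions consistent with the description above:

```python
import torch

def gaussian_heatmap(prev_mask: torch.Tensor) -> torch.Tensor:
    """Encode the previous frame's predicted mask as a 2D Gaussian heatmap.

    prev_mask: (H, W) binary mask, assumed non-empty; returns (H, W) in (0, 1].
    """
    h, w = prev_mask.shape
    ys, xs = torch.nonzero(prev_mask, as_tuple=True)
    # Center and spread of the Gaussian are taken from the mask pixels.
    cy, cx = ys.float().mean(), xs.float().mean()
    sy = ys.float().std().clamp(min=1.0)
    sx = xs.float().std().clamp(min=1.0)
    yy, xx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    return torch.exp(-0.5 * (((yy - cy) / sy) ** 2 + ((xx - cx) / sx) ** 2))
```

The spatial modulator would then downsample this heatmap (e.g., with `F.interpolate`) to each feature resolution and apply a learned scale-and-shift to obtain the bias terms.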
The scale problem of the segmentation network can be addressed by multi-scale pooling of the feature maps. Fusing features of different scales merges context information from different receptive fields and combines the overall contour with texture details; thus, segmentation of both large-scale and small-scale objects can draw on context information with as little loss of detail as possible, achieving high-quality pixel-level video object segmentation. PSPNet and the DeepLab system address this problem by performing spatial pyramid pooling at different grid scales or dilation rates (atrous spatial pyramid pooling, ASPP). On the one hand, the dilated convolution in the ASPP module is a sparse computation that may cause gridding artifacts, and the pyramid pooling module proposed in PSPNet may lose pixel-level localization information; both structures lack a global context prior for selecting features in a channel-wise manner, as in SENet and EncNet. On the other hand, a channel-wise attention vector alone is not enough to extract multi-scale features effectively, because pixel-wise information is lacking. Inspired by SENet and ParseNet, we attempt to extract precise pixel-level attention for the high-level features extracted by CNNs. Our proposed feature attention pyramid (FAP) module enlarges the receptive fields and classifies small and large objects effectively, thus solving the problem of multi-scale segmentation.
Specifically, the FAP module combines the attention mechanism with a spatial pyramid, fusing context information from different receptive fields by combining features of different scales while drawing on the global context prior. We use 30×30, 15×15, 10×10, and 5×5 pools in the pyramid structure to extract context at different pyramid scales. The pyramid structure then concatenates the information from the different scales, which incorporates context features precisely. Furthermore, the original CNN features, after passing through a 1×1 convolution, are multiplied pixel-wise by the pyramid attention features. We also introduce a global pooling branch whose output is concatenated with the output features; the resulting feature map provides improved channel-wise attention for learning good feature representations, so that context information is effectively shared between the segmentation of large-scale and small-scale objects. Benefiting from the spatial pyramid structure, the FAP module fuses context information of different scales while producing improved pixel-level attention for high-level feature maps; a sketch of this data flow follows.
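The exact layer configuration of the FAP module is not specified in this abstract. The PyTorch sketch below only mirrors the data flow described above (pyramid pooling at 30×30, 15×15, 10×10, and 5×5, concatenation, pixel-wise attention on the 1×1-convolved features, and a concatenated global pooling branch); the class name, channel splits, and the sigmoid on the attention map are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAP(nn.Module):
    """Sketch of a feature attention pyramid: context pooled at several grid
    sizes forms a pixel-level attention map that reweights the high-level CNN
    features; a global pooling branch is concatenated to the output."""

    def __init__(self, channels: int, pool_sizes=(30, 15, 10, 5)):
        super().__init__()
        # channels is assumed divisible by the number of pyramid levels.
        self.pool_sizes = pool_sizes
        # One 1x1 conv per pyramid level to compress the pooled context.
        self.level_convs = nn.ModuleList(
            nn.Conv2d(channels, channels // len(pool_sizes), 1)
            for _ in pool_sizes)
        # Fuse the concatenated pyramid levels into a pixel-wise attention map.
        self.attn_conv = nn.Conv2d(channels, channels, 1)
        # 1x1 conv on the original features before attention is applied.
        self.feat_conv = nn.Conv2d(channels, channels, 1)
        # Global pooling branch.
        self.global_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) high-level features, H and W assumed >= 30.
        h, w = x.shape[2:]
        # Pool to each pyramid grid, compress, and upsample back to (H, W).
        levels = [
            F.interpolate(conv(F.adaptive_avg_pool2d(x, size)),
                          (h, w), mode="bilinear", align_corners=False)
            for conv, size in zip(self.level_convs, self.pool_sizes)]
        attn = torch.sigmoid(self.attn_conv(torch.cat(levels, dim=1)))
        # Pixel-wise attention on the 1x1-convolved original features.
        out = self.feat_conv(x) * attn
        # Global context branch, broadcast spatially and concatenated.
        g = self.global_conv(F.adaptive_avg_pool2d(x, 1)).expand(-1, -1, h, w)
        return torch.cat([out, g], dim=1)  # (N, 2C, H, W)

# Example: fap = FAP(512); y = fap(torch.randn(2, 512, 60, 60))  # (2, 1024, 60, 60)
```

Because the global branch doubles the channel count, a subsequent convolution in the segmentation head would reduce it back before prediction.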
Result
We validate the effectiveness and robustness of the proposed method on the challenging DAVIS 2016 and DAVIS 2017 datasets. The proposed method demonstrates competitive results on DAVIS 2016 compared with state-of-the-art methods that use online fine-tuning, and it outperforms these methods on DAVIS 2017.
Conclusion
In this study, we first use two modulator networks to learn the visual and spatial information of the segmented object mask. The visual modulator produces channel-wise scale parameters that adjust the weights of different channels in the feature maps, while the spatial modulator generates element-wise bias parameters that inject the spatial prior into the modulated features. These modulators serve as prior guidance that enables the segmentation model to adapt to the appearance of a specific object. In addition, when segmenting objects of interest, masks of objects at different scales can effectively combine context information to reduce the loss of detail, thereby achieving high-quality pixel-level video object segmentation.
References
Koh Y J, Kim C S. Primary object segmentation in videos based on region augmentation and reduction[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 7417-7425. [DOI: 10.1109/CVPR.2017.784]
Wang S, Wu Q. Moving object segmentation algorithm based on Markov random field model[J]. Transducer and Microsystem Technologies, 2016, 35(7): 113-115, 119. [DOI: 10.13873/J.1000-9787(2016)07-0113-03]
Jang W D, Lee C, Kim C S. Primary object segmentation in videos via alternate convex optimization of foreground and background distributions[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 696-704. [DOI: 10.1109/CVPR.2016.82]
Li X J, Zhang K H, Song H H. Unsupervised video segmentation by fusing multiple spatio-temporal feature representations[J]. Journal of Computer Applications, 2017, 37(11): 3134-3138, 3151. [DOI: 10.11772/j.issn.1001-9081.2017.11.3134]
Zhang D, Javed O, Shah M. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 628-635. [DOI: 10.1109/CVPR.2013.87]
Jain S D, Xiong B, Grauman K. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 2117-2126. [DOI: 10.1109/CVPR.2017.228]
Tokmakov P, Alahari K, Schmid C. Learning video object segmentation with visual memory[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 4481-4490.
Deng Z X, Hong H, Jin Y, et al. Research and improvement on video target segmentation algorithm based on spatio-temporal dual-stream full convolutional network[J]. Industrial Control Computer, 2018, 31(8): 113-114, 129.
Perazzi F, Khoreva A, Benenson R, et al. Learning video object segmentation from static images[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 3491-3500. [DOI: 10.1109/CVPR.2017.372]
Jampani V, Gadde R, Gehler P V. Video propagation networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 3154-3164. [DOI: 10.1109/CVPR.2017.336]
Oh S W, Lee J Y, Sunkavalli K, et al. Fast video object segmentation by reference-guided mask propagation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7376-7385. [DOI: 10.1109/CVPR.2018.00770]
Caelles S, Maninis K K, Pont-Tuset J, et al. One-shot video object segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 5320-5329. [DOI: 10.1109/CVPR.2017.565]
Maninis K K, Caelles S, Chen Y H, et al. Video object segmentation without temporal information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [DOI: 10.1109/TPAMI.2018.2838670]
He K M, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 2980-2988.
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3431-3440.
Yang L, Wang Y, Xiong X, et al. Efficient video object segmentation via network modulation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 6499-6507.
Liu W, Rabinovich A, Berg A C. ParseNet: looking wider to see better[EB/OL]. [2018-12-14]. https://arxiv.org/pdf/1506.04579.pdf
Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. [DOI: 10.1109/TPAMI.2017.2699184]
Yan Z C, Zhang H, Jia Y Q, et al. Combining the best of convolutional layers and recurrent layers: a hybrid network for semantic segmentation[EB/OL]. [2018-12-14]. https://arxiv.org/pdf/1603.04871.pdf
Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7132-7141.
Zhang H, Dana K, Shi J P, et al. Context encoding for semantic segmentation[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7151-7160. [DOI: 10.1109/CVPR.2018.00747]
Voigtlaender P, Leibe B. Online adaptation of convolutional neural networks for video object segmentation[EB/OL]. [2018-12-14]. https://arxiv.org/pdf/1706.09364.pdf
Cheng J, Tsai Y H, Hung W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7415-7424.