Video instance segmentation based on temporal feature fusion
2021, Vol. 26, No. 7: 1692-1703
Print publication date: 2021-07-16
Accepted: 2021-02-01
DOI: 10.11834/jig.200521
Zetao Huang, Yang Liu, Chenglong Yu, Jiajia Zhang, Xuan Wang, Shuhan Qi. Video instance segmentation based on temporal feature fusion[J]. Journal of Image and Graphics, 2021, 26(7): 1692-1703.
目的 (Objective)
With the rapid development of the mobile internet and artificial intelligence, massive amounts of video data are produced continuously, and how to process and analyze these data is a challenging problem for researchers. Owing to shooting angle, fast motion, and partial occlusion, objects in video often appear blurry and varied, leaving a considerable quality gap between video frames and common image datasets, which makes instance segmentation of video data difficult. Most current video instance segmentation frameworks rely on image detection methods that process single frames directly and then assemble mask sequences of the same object through association matching; they lack dedicated handling of difficult video scenes and ignore temporal information in video.
方法 (Method)
This paper designs a multi-task learning video instance segmentation model based on temporal feature fusion. To address the relatively poor image quality of ordinary video, the model combines a feature pyramid with a scaled dot-product attention mechanism: object features detected in other frames are weighted and aggregated onto the current image features along the temporal dimension, strengthening the feature response of candidate objects and suppressing background information, and multi-scale features are then fused to enrich the spatial semantic information of the image. In addition, a point prediction network is added to the segmentation module to improve segmentation accuracy, and multi-task learning enables end-to-end simultaneous detection, segmentation, and association tracking of objects in video.
结果 (Result)
Experiments on the YouTube-VIS validation set show that, compared with existing methods, our method improves the mean average precision of video instance segmentation by about 2%. Comparative experiments confirm that the proposed temporal feature fusion module improves video segmentation.
结论 (Conclusion)
To address the neglect of temporal context information and the lack of handling of difficult video scenes in current video instance segmentation work, this paper proposes a multi-task learning video instance segmentation model that fuses temporal features, improving the segmentation of objects in video.
Objective
With the rapid development of the mobile internet and artificial intelligence, a growing number of video applications are occupying people's daily life, and large volumes of video data are generated every day. Beyond the sheer number of videos and their high storage cost, video content itself is complex and often contains many characters, actions, and scenes, so video understanding is more challenging and more urgent than common image understanding. How to process and analyze these video data is a challenging problem for researchers. Because of shooting angle and fast motion, objects in video often appear blurry and varied, and a wide gap exists between the quality of common image datasets and that of video datasets. Video instance segmentation extends instance segmentation to the video domain and involves detecting, segmenting, and tracking object instances: a method must not only assign the pixels of each frame to the corresponding semantic categories and object instances but also associate instances across the entire video sequence. Video defocus, motion blur, and partial occlusion make video instance segmentation difficult and degrade performance. Existing video instance segmentation algorithms mainly apply image instance segmentation to predict the target mask in every frame and then use tracking algorithms to associate the detections into mask sequences along the video. However, these algorithms rely on the initial per-frame detection quality and ignore temporal context information, so information is not effectively transmitted and exchanged between frames, and classification and segmentation performance is unsatisfactory in difficult video scenes.
Method
To solve this problem, this study designs a multi-task learning video instance segmentation model based on temporal feature fusion. We combine a feature pyramid network and a scaled dot-product attention operation in the temporal domain. The feature pyramid network is a feature extractor designed around the concept of a feature pyramid to improve both accuracy and speed; it replaces the single-scale feature extractor of Faster R-CNN (region-based convolutional neural network) and produces a higher-quality pyramid of feature maps. In general, the feature pyramid network fuses features along two pathways, bottom-up and top-down: the bottom-up pathway is the forward pass of the backbone, while the top-down pathway upsamples the higher-level features and adds them element-wise to the corresponding features of the previous level, as in the sketch below.
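To make the top-down merge concrete, the following is a minimal PyTorch sketch of generic FPN-style fusion as described above; the `TopDownFPN` name, channel widths, and layer choices are our own assumptions, not the paper's exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Minimal top-down feature pyramid fusion (illustrative sketch).

    `in_channels` lists the channel widths of the bottom-up maps C2..C5
    (ResNet-style defaults assumed); every level is projected to
    `out_channels`.
    """

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convs align the channel widths of the bottom-up features.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):
        # feats: list of bottom-up maps, highest resolution first.
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Walk top-down: upsample the coarser map, add element-wise.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]
```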
and then conduct element-wise addition with the corresponding features of the previous layer. The scaled dot-product attention is the basic component of the multi-head attention module in the transformer
which is a popular encoder-to-decoder attention network in machine translation. With the temporal feature fusion module
the object features detected by other frames are weighted and aggregated to the current image features to strengthen the feature response of candidate object and suppress the background information. Then
the spatial semantic information of the image is enriched by fusing multi-scale features of the current frame. Thus
the model can capture the fine correlation information between other frames and the current frame
and selectively aggregate the important features of other frames to enhance the representation of the current features. At the same time
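The fusion step itself can be viewed as scaled dot-product attention applied across frames. Below is a minimal sketch under assumed shapes: current-frame proposal features serve as queries and auxiliary-frame proposal features as keys and values; the projection matrices `w_q`, `w_k`, `w_v` and the residual connection are our assumptions, not the paper's exact design.

```python
import math
import torch

def temporal_attention_fusion(curr, aux, w_q, w_k, w_v):
    """Aggregate auxiliary-frame proposal features onto the current frame.

    curr: (N, d) features of N current-frame proposals (queries).
    aux:  (M, d) features of M proposals gathered from other frames.
    w_q, w_k, w_v: (d, d) learned projection matrices (assumed shapes).
    """
    q = curr @ w_q                                # (N, d) queries
    k = aux @ w_k                                 # (M, d) keys
    v = aux @ w_v                                 # (M, d) values
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    scores = q @ k.t() / math.sqrt(q.size(-1))    # (N, M) affinities
    weights = torch.softmax(scores, dim=-1)
    fused = weights @ v                           # (N, d) aggregated context
    # Residual connection keeps each proposal's original response.
    return curr + fused
```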
At the same time, a point prediction network added to the segmentation module improves segmentation precision compared with an ordinary fully convolutional segmentation head. The objects are then detected, segmented, and tracked simultaneously in the video by our end-to-end multi-task learning video instance segmentation framework.
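The point prediction head refines the mask where the coarse prediction is least certain. The sketch below shows one plausible selection rule, in the style of PointRend uncertainty sampling; the function name and the logit-magnitude criterion are assumptions, since the exact rule is not restated in this abstract.

```python
import torch

def select_uncertain_points(coarse_logits, num_points):
    """Pick the mask locations whose prediction is closest to logit 0.

    coarse_logits: (N, 1, H, W) per-proposal coarse mask logits.
    Returns (N, num_points) flat indices into H*W to refine pointwise.
    """
    n, _, h, w = coarse_logits.shape
    flat = coarse_logits.view(n, h * w)
    # Logits near zero correspond to probabilities near 0.5, i.e. the
    # most uncertain points, typically on the mask boundary.
    uncertainty = -flat.abs()
    return uncertainty.topk(num_points, dim=1).indices
```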
Result
Experiments on the YouTube-VIS dataset show that our method improves the mean average precision of video instance segmentation by nearly 2% compared with current methods. We also conduct a series of ablation experiments. On the one hand, we plug different segmentation heads into the model and compare the effect of a fully convolutional head versus a point prediction head on the two-stage video instance segmentation model. On the other hand, because the temporal feature fusion module must select RPN (region proposal network) candidate objects from the auxiliary frame for information fusion during training, we compare different numbers of RPN objects; the best result, 32.7% AP, is obtained with 10 RPN objects. These results show that the proposed temporal feature fusion module improves video segmentation.
Conclusion
In this study, a two-stage video instance segmentation model with a temporal feature fusion module is proposed. In the first stage, the ResNet backbone extracts features from an input image; the temporal feature fusion module further extracts features at multiple scales through the feature pyramid network and aggregates object features detected in other frames to enhance the feature response of the current frame; the region proposal network then extracts multiple candidate objects from the image. In the second stage, the features of the proposals are fed into three parallel network heads: the detection head predicts object classes and positions in the current image, the segmentation head predicts the instance masks of the current image, and the association head links objects over time by matching each detection with the most similar instance in the instance storage space. In summary, our video instance segmentation model applies the feature pyramid network and scaled dot-product attention to temporal feature fusion, which improves the accuracy of the video segmentation result.
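The association head's matching step can be illustrated with a minimal sketch under simplifying assumptions: cosine similarity against a non-empty memory of instance embeddings and a fixed threshold for spawning new tracks, neither of which is guaranteed to match the paper's exact matching score.

```python
import torch
import torch.nn.functional as F

def associate_instances(det_embeds, memory_embeds, new_id_start, threshold=0.5):
    """Match detections to stored instances by cosine similarity.

    det_embeds:    (N, d) embeddings of current-frame detections.
    memory_embeds: (M, d) embeddings of previously seen instances (M > 0).
    Returns a list of instance ids; here the memory row index doubles as
    the id, and unmatched detections receive fresh ids.
    """
    sim = F.cosine_similarity(
        det_embeds.unsqueeze(1), memory_embeds.unsqueeze(0), dim=-1)  # (N, M)
    ids, next_id = [], new_id_start
    for row in sim:
        best = row.argmax().item()
        if row[best] >= threshold:
            ids.append(best)        # continue an existing track
        else:
            ids.append(next_id)     # start a new instance track
            next_id += 1
    return ids
```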
computer vision; instance segmentation; video instance segmentation; scaled dot-product attention; multi-scale fusion