Video instance segmentation based on temporal feature fusion
2021, Vol. 26, No. 7: 1692-1703
Print publication date: 2021-07-16
Accepted: 2021-02-01
DOI: 10.11834/jig.200521
Zetao Huang, Yang Liu, Chenglong Yu, Jiajia Zhang, Xuan Wang, Shuhan Qi. Video instance segmentation based on temporal feature fusion[J]. Journal of Image and Graphics, 2021, 26(7): 1692-1703.
目的 (Objective)
With the rapid development of the mobile internet and artificial intelligence, massive amounts of video data are produced continuously, and how to process and analyze these data is a challenging problem for researchers. Owing to shooting angle, fast motion, and partial occlusion, objects in video often appear blurry and varied, leaving a considerable quality gap between video frames and common image datasets, which makes instance segmentation of video data difficult. Most current video instance segmentation frameworks rely on image detection methods that process single frames directly and then assemble mask sequences of the same object through association matching; they lack dedicated handling of difficult video scenes and ignore temporal information in video.
方法 (Method)
This paper designs a multi-task learning video instance segmentation model based on temporal feature fusion. To address the relatively poor image quality of ordinary video, the model combines a feature pyramid with a scaled dot-product attention mechanism: object features detected in other frames are weighted and aggregated onto the current image features along the temporal dimension, strengthening the feature response of candidate objects and suppressing background information, and multi-scale features are then fused to enrich the spatial semantic information of the image. In addition, a point prediction network is added to the segmentation module to improve segmentation accuracy, and multi-task learning enables end-to-end simultaneous detection, segmentation, and association tracking of objects in video.
结果 (Result)
Experiments on the YouTube-VIS validation set show that, compared with existing methods, our method improves the mean average precision of video instance segmentation by about 2%. Comparative experiments confirm that the proposed temporal feature fusion module improves video segmentation.
结论 (Conclusion)
To address the neglect of temporal context information and the lack of handling of difficult video scenes in current video instance segmentation work, this paper proposes a multi-task learning video instance segmentation model that fuses temporal features, improving the segmentation of objects in video.
Objective
With the rapid development of the mobile internet and artificial intelligence, a growing number of video applications are occupying people's daily life, and large volumes of video data are generated every day. Beyond the sheer number of videos and their high storage cost, video content itself is complex and often contains many characters, actions, and scenes, so video understanding is more challenging and more urgent than common image understanding. How to process and analyze these video data is a challenging problem for researchers. Because of shooting angle and fast motion, objects in video often appear blurry and varied, and a wide gap exists between the quality of common image datasets and that of video datasets. Video instance segmentation extends instance segmentation to the video domain and involves detecting, segmenting, and tracking object instances: a method must not only assign the pixels of each frame to the corresponding semantic categories and object instances but also associate instances across the entire video sequence. Video defocus, motion blur, and partial occlusion make video instance segmentation difficult and degrade performance. Existing video instance segmentation algorithms mainly apply image instance segmentation to predict the target mask in every frame and then use tracking algorithms to associate the detections into mask sequences along the video. However, these algorithms rely on the initial per-frame detection quality and ignore temporal context information, so information is not effectively transmitted and exchanged between frames, and classification and segmentation performance is unsatisfactory in difficult video scenes.
Method
To solve this problem, this study designs a multi-task learning video instance segmentation model based on temporal feature fusion. We combine a feature pyramid network and a scaled dot-product attention operation in the temporal domain. The feature pyramid network is a feature extractor designed around the concept of a feature pyramid to improve both accuracy and speed; it replaces the single-scale feature extractor of Faster R-CNN (region-based convolutional neural network) and produces a higher-quality pyramid of feature maps. In general, the feature pyramid network fuses features along two pathways, bottom-up and top-down: the bottom-up pathway is the forward pass of the backbone, while the top-down pathway upsamples the higher-level features and adds them element-wise to the corresponding features of the previous level, as in the sketch below.
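To make the top-down merge concrete, the following is a minimal PyTorch sketch of generic FPN-style fusion as described above; the `TopDownFPN` name, channel widths, and layer choices are our own assumptions, not the paper's exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Minimal top-down feature pyramid fusion (illustrative sketch).

    `in_channels` lists the channel widths of the bottom-up maps C2..C5
    (ResNet-style defaults assumed); every level is projected to
    `out_channels`.
    """

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convs align the channel widths of the bottom-up features.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):
        # feats: list of bottom-up maps, highest resolution first.
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Walk top-down: upsample the coarser map, add element-wise.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]
```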
and then conduct element-wise addition with the corresponding features of the previous layer. The scaled dot-product attention is the basic component of the multi-head attention module in the transformer
which is a popular encoder-to-decoder attention network in machine translation. With the temporal feature fusion module
the object features detected by other frames are weighted and aggregated to the current image features to strengthen the feature response of candidate object and suppress the background information. Then
the spatial semantic information of the image is enriched by fusing multi-scale features of the current frame. Thus
the model can capture the fine correlation information between other frames and the current frame
and selectively aggregate the important features of other frames to enhance the representation of the current features. At the same time
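The fusion step itself can be viewed as scaled dot-product attention applied across frames. Below is a minimal sketch under assumed shapes: current-frame proposal features serve as queries and auxiliary-frame proposal features as keys and values; the projection matrices `w_q`, `w_k`, `w_v` and the residual connection are our assumptions, not the paper's exact design.

```python
import math
import torch

def temporal_attention_fusion(curr, aux, w_q, w_k, w_v):
    """Aggregate auxiliary-frame proposal features onto the current frame.

    curr: (N, d) features of N current-frame proposals (queries).
    aux:  (M, d) features of M proposals gathered from other frames.
    w_q, w_k, w_v: (d, d) learned projection matrices (assumed shapes).
    """
    q = curr @ w_q                                # (N, d) queries
    k = aux @ w_k                                 # (M, d) keys
    v = aux @ w_v                                 # (M, d) values
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    scores = q @ k.t() / math.sqrt(q.size(-1))    # (N, M) affinities
    weights = torch.softmax(scores, dim=-1)
    fused = weights @ v                           # (N, d) aggregated context
    # Residual connection keeps each proposal's original response.
    return curr + fused
```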
At the same time, a point prediction network added to the segmentation module improves segmentation precision compared with an ordinary fully convolutional segmentation head. The objects are then detected, segmented, and tracked simultaneously in the video by our end-to-end multi-task learning video instance segmentation framework.
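The point prediction head refines the mask where the coarse prediction is least certain. The sketch below shows one plausible selection rule, in the style of PointRend uncertainty sampling; the function name and the logit-magnitude criterion are assumptions, since the exact rule is not restated in this abstract.

```python
import torch

def select_uncertain_points(coarse_logits, num_points):
    """Pick the mask locations whose prediction is closest to logit 0.

    coarse_logits: (N, 1, H, W) per-proposal coarse mask logits.
    Returns (N, num_points) flat indices into H*W to refine pointwise.
    """
    n, _, h, w = coarse_logits.shape
    flat = coarse_logits.view(n, h * w)
    # Logits near zero correspond to probabilities near 0.5, i.e. the
    # most uncertain points, typically on the mask boundary.
    uncertainty = -flat.abs()
    return uncertainty.topk(num_points, dim=1).indices
```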
Result
Experiments on the YouTube-VIS dataset show that our method improves the mean average precision of video instance segmentation by nearly 2% compared with current methods. We also conduct a series of ablation experiments. On the one hand, we plug different segmentation heads into the model and compare the effect of a fully convolutional head versus a point prediction head on the two-stage video instance segmentation model. On the other hand, because the temporal feature fusion module must select RPN (region proposal network) candidate objects from the auxiliary frame for information fusion during training, we compare different numbers of RPN objects; the best result, 32.7% AP, is obtained with 10 RPN objects. These results show that the proposed temporal feature fusion module improves video segmentation.
Conclusion
In this study, a two-stage video instance segmentation model with a temporal feature fusion module is proposed. In the first stage, the ResNet backbone extracts features from an input image; the temporal feature fusion module further extracts features at multiple scales through the feature pyramid network and aggregates object features detected in other frames to enhance the feature response of the current frame; the region proposal network then extracts multiple candidate objects from the image. In the second stage, the features of the proposals are fed into three parallel network heads: the detection head predicts object classes and positions in the current image, the segmentation head predicts the instance masks of the current image, and the association head links objects over time by matching each detection with the most similar instance in the instance storage space. In summary, our video instance segmentation model applies the feature pyramid network and scaled dot-product attention to temporal feature fusion, which improves the accuracy of the video segmentation result.
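The association head's matching step can be illustrated with a minimal sketch under simplifying assumptions: cosine similarity against a non-empty memory of instance embeddings and a fixed threshold for spawning new tracks, neither of which is guaranteed to match the paper's exact matching score.

```python
import torch
import torch.nn.functional as F

def associate_instances(det_embeds, memory_embeds, new_id_start, threshold=0.5):
    """Match detections to stored instances by cosine similarity.

    det_embeds:    (N, d) embeddings of current-frame detections.
    memory_embeds: (M, d) embeddings of previously seen instances (M > 0).
    Returns a list of instance ids; here the memory row index doubles as
    the id, and unmatched detections receive fresh ids.
    """
    sim = F.cosine_similarity(
        det_embeds.unsqueeze(1), memory_embeds.unsqueeze(0), dim=-1)  # (N, M)
    ids, next_id = [], new_id_start
    for row in sim:
        best = row.argmax().item()
        if row[best] >= threshold:
            ids.append(best)        # continue an existing track
        else:
            ids.append(next_id)     # start a new instance track
            next_id += 1
    return ids
```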
computer vision; instance segmentation; video instance segmentation; scaled dot-product attention; multi-scale fusion