Lightweight video object segmentation algorithm based on adaptive weight update
2023, Vol. 28, No. 12, pp. 3772-3783
Received: 2022-05-10; Revised: 2023-03-05; Published in print: 2023-12-16
DOI: 10.11834/jig.220409
Objective
To address the problems that existing video object segmentation (VOS) algorithms cannot adaptively update sample weights and that excessive redundant feature information causes unnecessary space and time consumption, a lightweight video object segmentation algorithm with adaptive weight update is proposed.
Method
First, to build a model with strong target discrimination, the proposed algorithm adaptively assigns each extracted feature a weight according to the quality of its representation. Second, to remove redundant information and increase the running speed, a lightweight memory module is constructed by optimizing the information storage strategy.
Result
Experimental results show that on the public datasets DAVIS2016 (densely annotated video segmentation) and DAVIS2017, the mean of region similarity and contour accuracy, $J\&F$, of the proposed algorithm reaches 85.8% and 78.3%, respectively, a clear advantage over the compared video object segmentation algorithms.
Conclusion
By exploiting historical frame information in a reasonable and non-redundant way, the proposed algorithm improves the generalization ability of target modeling and produces higher-quality target masks.
Objective
Video object segmentation (VOS) is a basic computer vision task that is widely used in video editing, video synthesis, autonomous driving, and other fields. This paper studies semi-supervised video object segmentation: given the ground-truth mask of the target in the first frame of a video, the segmentation mask of that target must be predicted in every remaining frame. The task is difficult for several reasons. First, the target object undergoes large appearance changes over the video sequence because of continuous motion and varying camera viewpoints. Second, the target may be occluded by other objects and temporarily disappear from a frame. Third, similar objects of the same category increase the difficulty of segmenting the specific target. Therefore, although annotations are provided in the first frame, semi-supervised VOS remains a challenge. Recently, algorithms based on memory networks have become the mainstream in video object segmentation. Space-time memory VOS (STMVOS) uses a memory network to store additional feature information of historical frames; when segmenting each frame, it matches the stored memory information against the feature information of the current frame pixel by pixel. Although STMVOS outperforms all previous methods, it suffers from slow segmentation speed because of its high computational complexity. The fast and robust target models (FRTM) algorithm also uses a memory network to store historical frame information but, unlike STMVOS, uses the memory to update its proposed target model. The target model takes the feature information from the backbone network as input and outputs a rough mask of the target. This rough mask is then fed to the subsequent refinement and segmentation network, which eventually outputs the fine segmentation mask of the target. After processing each frame, FRTM stores the features and mask of that frame in the memory module for subsequent updates of the target model. FRTM is 3.5 times faster than STMVOS while achieving competitive accuracy. However, FRTM faces several problems. First, it stores the feature information and mask of every processed frame in the memory module, which inevitably accumulates repetitive and redundant information. Second, when storing memory frames, FRTM mechanically assigns a fixed proportion of the weight to the most recently stored feature information without considering the quality of the current frame, which is clearly disadvantageous for training a target model with strong discrimination.
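To make the pipeline described above concrete, the following is a minimal PyTorch-style sketch of an FRTM-like inference step (backbone features, a lightweight target model producing a coarse mask, and a refinement network producing the fine mask). All class names, layer sizes, and tensor shapes are illustrative assumptions for this sketch, not the authors' or FRTM's actual implementation.

```python
# Minimal sketch of an FRTM-like inference step; names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TargetModel(nn.Module):
    """Lightweight per-video head: backbone features -> coarse target score map."""

    def __init__(self, feat_channels: int = 256):
        super().__init__()
        # Deliberately shallow so it can be re-optimized online from memory samples.
        self.head = nn.Sequential(
            nn.Conv2d(feat_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)  # coarse, low-resolution mask logits


class RefinementNetwork(nn.Module):
    """Turns the coarse score map plus backbone features into a full-resolution mask."""

    def __init__(self, feat_channels: int = 256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels + 1, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats, coarse_logits, out_size):
        x = torch.cat([feats, coarse_logits], dim=1)
        logits = self.refine(x)
        # Upsample to the original frame resolution.
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)


def segment_frame(backbone, target_model, refiner, frame):
    """One inference step: features -> coarse mask -> refined full-resolution mask."""
    feats = backbone(frame)                         # e.g. [B, 256, H/16, W/16]
    coarse = target_model(feats)                    # [B, 1, H/16, W/16]
    fine = refiner(feats, coarse, frame.shape[-2:])
    return torch.sigmoid(fine), feats               # mask probabilities + features for the memory
```

After each frame, the returned features and predicted mask would be pushed into the memory module and later used to re-optimize the target model, which is exactly where the weighting and storage issues discussed above arise.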
Method
To solve the above problems, this paper proposes a video object segmentation algorithm based on a memory module and adaptive weight update. First, the benchmark algorithm simply uses a linear update scheme that always gives the most recent frame the highest weight, without considering the quality of the features themselves. To achieve a more reasonable weight distribution, this study proposes a feature quality discrimination method based on mask mapping: when computing the weight of each feature to be stored in the memory module, both the inter-frame relationship and the feature quality are taken into account, and the corresponding weight is assigned adaptively. Second, the benchmark algorithm stores the features and corresponding mask of every frame in the memory module, which produces a certain degree of information redundancy. To remove redundant historical frame information, improve the running speed, and reduce the memory consumption of the algorithm, a lightweight memory module is constructed by optimizing the information storage strategy.
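A hypothetical sketch of these two ideas is shown below: a feature-quality score derived from the predicted mask sets the storage weight adaptively, and a fixed-capacity memory replaces its least useful entry instead of storing every frame. The quality heuristic, the replacement rule, and all names are assumptions made for illustration; they are not the paper's exact formulation.

```python
# Illustrative sketch of adaptive weighting plus a bounded (lightweight) memory.
import torch


def mask_quality(mask_probs: torch.Tensor) -> float:
    """Score in [0, 1]: confident, decisive masks score high, uncertain ones low."""
    # Mean distance of pixel probabilities from the undecided value 0.5, rescaled to [0, 1].
    # (The paper also factors in inter-frame relations; that term is omitted here for brevity.)
    return ((mask_probs - 0.5).abs().mean() * 2.0).item()


class LightweightMemory:
    """Fixed-capacity store of (features, mask, weight) used to update the target model."""

    def __init__(self, capacity: int = 20, base_weight: float = 1.0):
        self.capacity = capacity
        self.base_weight = base_weight
        self.entries = []  # each entry: {"feats": Tensor, "mask": Tensor, "weight": float}

    def add(self, feats: torch.Tensor, mask_probs: torch.Tensor) -> None:
        # Adaptive weight: scaled by mask quality instead of a fixed share for the newest frame.
        weight = self.base_weight * mask_quality(mask_probs)
        entry = {"feats": feats.detach(), "mask": mask_probs.detach(), "weight": weight}
        if len(self.entries) < self.capacity:
            self.entries.append(entry)
            return
        # Memory is full: overwrite the lowest-weight entry only if the new frame is better,
        # so redundant or low-quality frames never displace useful ones.
        worst = min(range(len(self.entries)), key=lambda i: self.entries[i]["weight"])
        if weight > self.entries[worst]["weight"]:
            self.entries[worst] = entry

    def training_batch(self):
        """Stacked features, masks, and per-sample weights for re-optimizing the target model."""
        feats = torch.cat([e["feats"] for e in self.entries], dim=0)
        masks = torch.cat([e["mask"] for e in self.entries], dim=0)
        weights = torch.tensor([e["weight"] for e in self.entries])
        return feats, masks, weights
```

In a practical system, the ground-truth first frame would typically be stored with the maximum weight and protected from replacement, since it is the only exactly labelled sample in the semi-supervised setting.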
Result
On the DAVIS2016 dataset, the region similarity $J$ of the proposed algorithm is 85.9%, its contour accuracy $F$ is 85.7%, its average $J\&F$ is 85.8%, and its speed is 13.5 frame/s. The average $J\&F$ of the proposed algorithm is 8.2% and 5.6% higher than those of MaskTrack and OSVOS, respectively, while its speed is two orders of magnitude higher. The proposed algorithm also outperforms the other mainstream algorithms introduced from 2017 to 2021 in terms of average $J\&F$. Specifically, it outperforms FRTM and G-FRTM in terms of average $J\&F$ by 2.3% and 1.5%, respectively, and FRTM is also inferior to the proposed algorithm in terms of speed. On the DAVIS2017 dataset, the proposed algorithm has a region similarity $J$ of 75.5%, contour accuracy $F$ of 81.1%, average $J\&F$ of 78.3%, and speed of 9.4 frame/s. This algorithm outperforms the early classical algorithms MaskTrack and OSVOS in terms of average $J\&F$ by 24% and 18%, respectively, and in terms of speed by two orders of magnitude. The proposed algorithm also outperforms the mainstream algorithms introduced from 2017 to 2021 in terms of average $J\&F$. Specifically, the average $J\&F$ values of FRTM and G-FRTM are 1.6% and 1.9% lower than those of the proposed algorithm, respectively, and this algorithm even has a higher speed than FRTM.
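For reference, the overall score $J\&F$ used above is the arithmetic mean of the region similarity $J$ and the contour accuracy $F$, which is consistent with the figures reported for both datasets:

```latex
J\&F = \frac{J+F}{2}, \qquad
\text{DAVIS2016: } \frac{85.9\% + 85.7\%}{2} = 85.8\%, \qquad
\text{DAVIS2017: } \frac{75.5\% + 81.1\%}{2} = 78.3\%
```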
Conclusion
In this paper, a video object segmentation algorithm based on a memory module and adaptive weight update is proposed. First, to capture the target area accurately and reduce the influence of noisy information on the target model, the proposed algorithm evaluates the quality of the feature information to be stored and assigns it a corresponding weight. Second, the algorithm uses a lightweight memory module to store the relevant information of historical frames. Even in challenging scenarios, the proposed algorithm can still generate an accurate and robust segmentation mask of the target, which demonstrates its effectiveness.
Bao L C, Wu B Y and Liu W. 2018. CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5977-5986 [DOI: 10.1109/CVPR.2018.00626]
Caelles S, Maninis K K, Pont-Tuset J, Leal-Taixé L, Cremers D and van Gool L. 2017. One-shot video object segmentation // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5320-5329 [DOI: 10.1109/CVPR.2017.565]
Chen X, Li Z X, Yuan Y, Yu G, Shen J X and Qi D L. 2020. State-aware tracker for real-time video object segmentation // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9381-9390 [DOI: 10.1109/CVPR42600.2020.00940]
Chen Y H, Pont-Tuset J, Montes A and van Gool L. 2018. Blazingly fast video object segmentation with pixel-wise metric learning // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1189-1198 [DOI: 10.1109/CVPR.2018.00130]
Cheng J C, Tsai Y H, Hung W C, Wang S J and Yang M H. 2018. Fast and accurate online video object segmentation via tracking parts // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7415-7424 [DOI: 10.1109/CVPR.2018.00774]
Ge W B, Lu X K and Shen J B. 2021. Video object segmentation using global and instance embedding learning // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 16831-16840 [DOI: 10.1109/CVPR46437.2021.01656]
Hu Y T, Huang J B and Schwing A G. 2018. VideoMatch: matching based video object segmentation // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 56-73 [DOI: 10.1007/978-3-030-01237-3_4]
Huang X H, Xu J R, Tai Y W and Tang C K. 2020. Fast video object segmentation with temporal aggregation network and dynamic template matching // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 8876-8886 [DOI: 10.1109/CVPR42600.2020.00890]
Ji G P, Fu K R, Wu Z, Fan D P, Shen J B and Shao L. 2021. Full-duplex strategy for video object segmentation // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 4902-4913 [DOI: 10.1109/ICCV48922.2021.00488]
Johnander J, Danelljan M, Brissman E, Khan F S and Felsberg M. 2019. A generative appearance model for end-to-end video object segmentation // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8945-8954 [DOI: 10.1109/CVPR.2019.00916]
Khoreva A, Benenson R, Ilg E, Brox T and Schiele B. 2019. Lucid data dreaming for object tracking [EB/OL]. [2022-05-10]. https://arxiv.org/pdf/1703.09554.pdf
Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with siamese region proposal network // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980 [DOI: 10.1109/CVPR.2018.00935]
Li H, Liu K H, Liu J J and Zhang X Y. 2021. Multitask framework for video object tracking and segmentation combined with multi-scale interframe information. Journal of Image and Graphics, 26(1): 101-112 [DOI: 10.11834/jig.200519]
Li X X and Loy C C. 2018. Video object segmentation with joint re-identification and attention-aware mask propagation // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 93-110 [DOI: 10.1007/978-3-030-01219-9_6]
Lin H J, Qi X J and Jia J Y. 2019. AGSS-VOS: attention guided single-shot video object segmentation // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3948-3956 [DOI: 10.1109/ICCV.2019.00405]
Luiten J, Voigtlaender P and Leibe B. 2019. PReMVOS: proposal-generation, refinement and merging for video object segmentation // Proceedings of the 14th Asian Conference on Computer Vision. Perth, Australia: Springer: 565-580 [DOI: 10.1007/978-3-030-20870-7_35]
Maninis K K, Caelles S, Chen Y, Pont-Tuset J, Leal-Taixé L, Cremers D and van Gool L. 2019. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6): 1515-1530 [DOI: 10.1109/TPAMI.2018.2838670]
Oh S W, Lee J Y, Sunkavalli K and Kim S J. 2018. Fast video object segmentation by reference-guided mask propagation // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7376-7385 [DOI: 10.1109/CVPR.2018.00770]
Oh S W, Lee J Y, Xu N and Kim S J. 2019. Video object segmentation using space-time memory networks // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9225-9234 [DOI: 10.1109/ICCV.2019.00932]
Park H, Yoo J, Jeong S, Venkatesh G and Kwak N. 2021. Learning dynamic network using a reuse gate function in semi-supervised video object segmentation // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 8401-8410 [DOI: 10.1109/CVPR46437.2021.00830]
Perazzi F, Khoreva A, Benenson R, Schiele B and Sorkine-Hornung A. 2017. Learning video object segmentation from static images // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3491-3500 [DOI: 10.1109/CVPR.2017.372]
Robinson A, Lawin F J, Danelljan M, Khan F S and Felsberg M. 2020. Learning fast and robust target models for video object segmentation // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 7404-7413 [DOI: 10.1109/CVPR42600.2020.00743]
Voigtlaender P, Chai Y N, Schroff F, Adam H, Leibe B and Chen L C. 2019. FEELVOS: fast end-to-end embedding learning for video object segmentation // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9473-9482 [DOI: 10.1109/CVPR.2019.00971]
Voigtlaender P and Leibe B. 2017. Online adaptation of convolutional neural networks for video object segmentation [EB/OL]. [2022-05-10]. https://arxiv.org/pdf/1706.09364.pdf
Voigtlaender P, Luiten J, Torr P H S and Leibe B. 2020. Siam R-CNN: visual tracking by re-detection // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6577-6587 [DOI: 10.1109/CVPR42600.2020.00661]
Wang Q, Zhang L, Bertinetto L, Hu W M and Torr P H S. 2019a. Fast online object tracking and segmentation: a unifying approach // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1328-1338 [DOI: 10.1109/CVPR.2019.00142]
Wang S Y, Hou Z Q, Wang N, Li F C, Pu L and Ma S G. 2021. Video object segmentation algorithm based on adaptive template updating and multi-feature fusion. Opto-Electronic Engineering, 48(10): #210193 [DOI: 10.12086/oee.2021.210193]
Wang Z Q, Xu J, Liu L, Zhu F and Shao L. 2019b. RANet: ranking attention network for fast video object segmentation // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3977-3986 [DOI: 10.1109/ICCV.2019.00408]
Xu K, Wen L Y, Li G R, Bo L F and Huang Q M. 2019. Spatiotemporal CNN for video object segmentation // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1379-1388 [DOI: 10.1109/CVPR.2019.00147]
Yang L J, Wang Y R, Xiong X H, Yang J C and Katsaggelos A K. 2018. Efficient video object segmentation via network modulation // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6499-6507 [DOI: 10.1109/CVPR.2018.00680]
Yang S, Zhang L, Qi J Q, Lu H C, Wang S and Zhang X X. 2021. Learning motion-appearance co-attention for zero-shot video object segmentation // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 1544-1553 [DOI: 10.1109/ICCV48922.2021.00159]
Yoon J S, Rameau F, Kim J, Lee S, Shin S and Kweon I S. 2017. Pixel-level matching for video object segmentation using convolutional neural networks // Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2186-2195 [DOI: 10.1109/ICCV.2017.238]
Zhang Y Z, Wu Z R, Peng H W and Lin S. 2020. A transductive approach for video object segmentation // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6947-6956 [DOI: 10.1109/CVPR42600.2020.00698]