Lightweight video object segmentation algorithm based on adaptive weight update

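Wang Shuiyuan1, Hou Zhiqiang1,2, Li Fucheng1,2, Ma Sugang1,2, Yu Wangsheng3 (1. College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; 2. Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; 3. Information and Navigation Institute, Air Force Engineering University, Xi'an 710077, China)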

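Abstract
Objective Existing video object segmentation (VOS) algorithms cannot adaptively update sample weights, and they use excessive redundant feature information, which incurs unnecessary space and time costs. To address these problems, a lightweight video object segmentation algorithm with adaptive weight update is proposed. Method First, to build a model with strong target discrimination, the proposed algorithm adaptively assigns each feature a weight according to the representation quality of the extracted feature. Second, to remove redundant information and improve running speed, a lightweight memory module is constructed by optimizing the information storage strategy. Result Experiments on the public datasets DAVIS2016 (densely annotated video segmentation) and DAVIS2017 show that the mean of region similarity and contour accuracy, J&F, of the proposed algorithm reaches 85.8% and 78.3%, respectively, a clear advantage over the compared video object segmentation algorithms. Conclusion By using historical frame information in a reasonable and redundancy-free way, the algorithm improves its generalization ability in target modeling and produces higher-quality target masks.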
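The two ideas in the Method summary, quality-dependent sample weights and a bounded memory, can be illustrated with a minimal sketch. This is not the paper's formulation: the class `WeightedMemory`, the parameters `base_rate` and `capacity`, the scalar `quality` score in [0, 1], and the rule of evicting the lowest-weight entry are all hypothetical choices made only for illustration, and the stored features themselves are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WeightedMemory:
    """Toy memory that tracks one weight per stored frame (features omitted)."""
    base_rate: float = 0.1   # hypothetical update rate given to a new frame
    capacity: int = 4        # hypothetical cap that keeps the memory small
    frame_ids: List[int] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)

    def add(self, frame_id: int, quality: float) -> None:
        """Store a frame whose weight is scaled by a quality score in [0, 1]."""
        quality = min(max(quality, 0.0), 1.0)
        # A fixed-proportion update would use base_rate alone; here the
        # rate shrinks when the (assumed) quality score is low.
        rate = self.base_rate * quality
        if not self.weights:
            # The first (annotated) frame starts with all of the weight.
            self.frame_ids, self.weights = [frame_id], [1.0]
            return
        # Scale old weights down so the total stays 1 after the new entry.
        self.weights = [w * (1.0 - rate) for w in self.weights]
        self.weights.append(rate)
        self.frame_ids.append(frame_id)
        if len(self.weights) > self.capacity:
            self._evict()

    def _evict(self) -> None:
        """Drop the lowest-weight entry, never the first frame, and renormalize."""
        idx = min(range(1, len(self.weights)), key=lambda i: self.weights[i])
        del self.frame_ids[idx]
        del self.weights[idx]
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]


# Usage: frame 2 has a low quality score, so it is the entry that is evicted.
memory = WeightedMemory(base_rate=0.1, capacity=3)
for frame_id, quality in [(0, 1.0), (1, 0.9), (2, 0.2), (3, 0.8)]:
    memory.add(frame_id, quality)
print(memory.frame_ids, [round(w, 3) for w in memory.weights])
```

In this sketch a low-quality frame receives a small weight and is the first candidate for removal, which mirrors the stated goals of down-weighting unreliable samples and avoiding redundant storage. How the paper actually scores quality and selects frames to store is not specified in the abstract.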
Keywords
Lightweight video object segmentation algorithm based on adaptive weight update

Wang Shuiyuan1, Hou Zhiqiang1,2, Li Fucheng1,2, Ma Sugang1,2, Yu Wangsheng3(1.College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China;2.Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China;3.Information and Navigation Institute, Air Force Engineering University, Xi'an 710077, China)

Abstract
Objective Video object segmentation (VOS) is a fundamental computer vision task that is widely used in video editing, video synthesis, autonomous driving, and other fields. This paper studies semi-supervised VOS: given the ground-truth mask of the target in the first frame of a video, the task is to predict the segmentation mask of that target in the remaining frames. The task is difficult for three reasons. First, the appearance of the target changes greatly over the sequence because of continuous motion and a varying camera viewpoint. Second, the target may disappear from a frame when it is occluded by other objects. Third, similar objects of the same category make it harder to segment the specified target. Therefore, even though an annotation is provided for the first frame, semi-supervised VOS remains challenging. Recently, memory-network-based algorithms have become mainstream in VOS. Space-time memory VOS (STMVOS) uses a memory network to store additional feature information from historical frames. When segmenting each frame, STMVOS matches the memory information against the features of the current frame pixel by pixel. Although STMVOS outperforms all previous methods, it is slow because of its high computational complexity. Fast and robust target models (FRTM) also uses a memory network to store historical frame information, but, unlike STMVOS, it uses the memory information to update a target model. The target model takes the features from the backbone network as input and outputs a coarse mask of the target. This coarse mask is then fed to the subsequent refinement and segmentation network, which outputs the final fine segmentation mask of the target.
After processing each frame, FRTM stores the features and mask of that frame in the memory module for subsequent updates of the target model. The speed of FRTM is 3.5 times higher than that of STMVOS while achieving competitive accuracy. However, FRTM has two problems. First, because it stores the features and mask of every processed frame, the memory module accumulates a large amount of repetitive, redundant information. Second, when storing memory frames, FRTM mechanically assigns a fixed proportion of weight to the most recently stored features without considering the quality of the current frame, which hinders the training of a highly discriminative target model.
Method To solve these problems, this paper proposes a VOS algorithm based on a memory module and adaptive weight update. First, the baseline algorithm simply uses a linear update that gives the most recent frame the highest weight and ignores the quality of the feature itself. To distribute the weights more reasonably, this study proposes a feature quality discrimination method based on mask mapping, which takes both inter-frame connection and feature quality into account when calculating the weight of each feature to be stored in the memory module. The corresponding weight is then assigned adaptively. Second, the baseline algorithm stores the features and corresponding mask of every frame in the memory module, which causes a certain degree of information redundancy. To remove redundant historical frame information, improve the running speed, and reduce the memory consumption of the algorithm, a lightweight memory module is constructed by optimizing the information storage strategy.
Result On the DAVIS2016 dataset, the proposed algorithm achieves a region similarity J of 85.9%, a contour accuracy F of 85.7%, an average J&F of 85.8%, and a speed of 13.5 frame/s. Its average J&F is 8.2 and 5.6 percentage points higher than those of MaskTrack and OSVOS, respectively, and its speed is higher by two orders of magnitude. The proposed algorithm also outperforms the other mainstream algorithms introduced from 2017 to 2021 in terms of average J&F. Specifically, it outperforms FRTM and G-FRTM in average J&F by 2.3 and 1.5 percentage points, respectively, and it is also faster than FRTM. On the DAVIS2017 dataset, the proposed algorithm achieves a region similarity J of 75.5%, a contour accuracy F of 81.1%, an average J&F of 78.3%, and a speed of 9.4 frame/s. It outperforms the early classical algorithms MaskTrack and OSVOS by 24 and 18 percentage points in average J&F, respectively, and by two orders of magnitude in speed. It also outperforms the mainstream algorithms introduced from 2017 to 2021 in terms of average J&F. Specifically, the average J&F of FRTM and G-FRTM is 1.6 and 1.9 percentage points lower than that of the proposed algorithm, respectively, and the proposed algorithm is also faster than FRTM.
Conclusion This paper proposes a VOS algorithm based on a memory module and adaptive weight update. First, to capture the target region accurately and reduce the influence of noisy information on the target model, the algorithm evaluates the quality of the feature information to be stored and assigns a corresponding weight. Second, it uses a lightweight memory module to store the relevant information of historical frames. Even in some challenging scenarios, the proposed algorithm generates accurate and robust segmentation masks of the target, which supports its effectiveness.
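The reported averages can be checked with a few lines of arithmetic. The sketch below assumes the usual DAVIS conventions, namely that region similarity J is the intersection over union of the predicted and ground-truth masks and that J&F is the arithmetic mean of J and F. The function names and the tiny example masks are illustrative only, and the contour accuracy F itself is not computed here.

```python
def region_similarity(pred, gt):
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (an assumed convention).
    return 1.0 if union == 0 else inter / union


def j_and_f_mean(j, f):
    """Arithmetic mean of region similarity J and contour accuracy F (in %)."""
    return (j + f) / 2.0


# J and F values as reported in the Result paragraph.
reported = {"DAVIS2016": (85.9, 85.7), "DAVIS2017": (75.5, 81.1)}
for name, (j, f) in reported.items():
    print(f"{name}: J&F = {j_and_f_mean(j, f):.1f}%")
# prints "DAVIS2016: J&F = 85.8%" and "DAVIS2017: J&F = 78.3%"

# Toy masks: one overlapping pixel out of two foreground pixels in the union.
toy_j = region_similarity([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Both computed means agree with the J&F values stated in the abstract (85.8% and 78.3%).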
Keywords
