Video object segmentation fusing visual words and a self-attention mechanism

Ji Chuanjun1,2, Chen Yadang1,2, Che Xun3 (1. School of Computer Science, School of Software, and School of Cyberspace Security, Nanjing University of Information Science and Technology, Nanjing 210044, China; 2. Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing 210044, China; 3. Nanjing OpenX Technology Co., Ltd., Nanjing 210006, China)

Abstract
Objective Video object segmentation (VOS) aims to segment the objects of interest throughout a video sequence given the target mask in the initial frame. However, irregular object shapes, distracting information in the background, and fast motion frequently occur in videos and degrade segmentation quality. To address this, this paper proposes a video object segmentation algorithm that fuses visual words with a self-attention mechanism. Method For the reference frame, the image is first fed into an encoder to extract pixel features at 1/8 of the original resolution. These features are then passed into an embedding space composed of several convolution kernels, and the result is up-sampled to the original size. Finally, using the target mask of the reference frame, the pixels in the embedding space are grouped by a clustering algorithm to form visual words that represent the target object. For the target frame, its image is likewise passed through the encoder and into the embedding space; a word-matching operation represents the embedded pixels with the visual words generated from the reference frame, yielding multiple similarity maps. A self-attention mechanism is then applied to the similarity maps to capture global dependencies, and the maximum along the channel direction is taken as the prediction. To handle appearance changes of the target object and mismatched visual words, an online update mechanism and a global correction mechanism are proposed to further improve accuracy. Result Experiments show that the proposed method achieves competitive results on the DAVIS (densely annotated video segmentation) 2016 and DAVIS 2017 video object segmentation datasets, with J&F-mean (Jaccard and F-score mean) scores, the mean of region similarity and contour accuracy, of 83.2% and 72.3%, respectively. Conclusion The proposed algorithm effectively handles interference caused by occlusion, deformation, and viewpoint changes, and achieves high-quality video object segmentation.
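To make the pipeline above concrete, the following is a minimal sketch rather than the authors' implementation: the function names build_visual_words and word_matching, the tensor shapes, the use of cosine similarity, the plain k-means clustering, and the words_per_object parameter are all assumptions introduced here for illustration, since the abstract does not specify the exact clustering algorithm or similarity measure.

```python
# Minimal sketch (not the authors' code): forming visual words from the reference
# frame and matching target-frame pixels against them to obtain similarity maps.
import torch
import torch.nn.functional as F


@torch.no_grad()
def build_visual_words(ref_embed, ref_mask, words_per_object=8, iters=10):
    """Cluster reference-frame embeddings of each labeled region into visual words.

    ref_embed: (C, H, W) pixel embeddings produced by the embedding space.
    ref_mask:  (H, W) integer mask, 0 = background, 1..K = objects
               (in this sketch the background also gets its own words).
    Returns an (N, C) dictionary of L2-normalised visual words.
    """
    C = ref_embed.shape[0]
    feats = F.normalize(ref_embed.reshape(C, -1).t(), dim=1)   # (H*W, C)
    labels = ref_mask.reshape(-1)
    words = []
    for obj_id in labels.unique():
        obj_feats = feats[labels == obj_id]                    # pixels of one region
        k = min(words_per_object, obj_feats.shape[0])
        centers = obj_feats[torch.randperm(obj_feats.shape[0])[:k]]
        for _ in range(iters):                                 # plain (spherical) k-means
            assign = (obj_feats @ centers.t()).argmax(dim=1)
            for j in range(k):
                sel = obj_feats[assign == j]
                if sel.numel() > 0:
                    centers[j] = F.normalize(sel.mean(dim=0), dim=0)
        words.append(centers)
    return torch.cat(words, dim=0)                             # (N, C) visual dictionary


def word_matching(tgt_embed, words):
    """Represent target-frame pixels by their similarity to each visual word.

    tgt_embed: (C, H, W) target-frame embeddings; words: (N, C).
    Returns (N, H, W) similarity maps, one channel per visual word.
    """
    C, H, W = tgt_embed.shape
    feats = F.normalize(tgt_embed.reshape(C, -1).t(), dim=1)   # (H*W, C)
    sim = feats @ words.t()                                    # (H*W, N)
    return sim.t().reshape(-1, H, W)                           # (N, H, W)
```

Because each output channel corresponds to one visual word, applying self-attention over these channels and then taking the channel-wise maximum, as described in the abstract, yields a per-pixel prediction score.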
Keywords
Visual words and self-attention mechanism fusion-based video object segmentation method

Ji Chuanjun1,2, Chen Yadang1,2, Che Xun3(1.School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China;2.Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing 210044, China;3.Nanjing OpenX Technology Co., Ltd., Nanjing 210006, China)

Abstract
Objective Video object segmentation (VOS) involves segmenting foreground objects from the background in a video sequence. It has applications in video detection, video classification, video summarization, and self-driving. Our research focuses on the semi-supervised setting, which estimates the mask of the target object in the remaining frames of a video given the target mask annotated in the initial frame. However, current video object segmentation algorithms are hampered by irregular object shapes, distracting information in the background, and very fast motion. Hence, our research develops a video object segmentation algorithm that integrates visual words with a self-attention mechanism. Method For the reference frame, the image is first fed into the encoder to extract features whose resolution is 1/8 that of the original image. The extracted features are then fed into the embedding space composed of several 3×3 convolution kernels, and the result is up-sampled to the original size. During training, pixels from the same target are pulled close together in the embedding space, while pixels from different targets are pushed apart. Finally, the visual words representing the target objects are formed by combining the mask information annotated in the reference frame with a clustering of the pixels in the embedding space. For the target frame, its image is first fed into the encoder and passed through the embedding space, and a word-matching operation then represents the pixels in the embedding space with a certain number of visual words to obtain similarity maps. However, learning visual words is challenging because there is no ground-truth information about the object parts they correspond to. Therefore, a meta-training algorithm alternates between unsupervised learning of the visual words and supervised learning of pixel classification given these visual words. The visual vocabulary allows more robust matching because, even when an object is occluded, deformed, viewed from a changed perspective, or disappears and reappears within the same video, the appearance of its parts may remain the same. The self-attention mechanism is then applied to the generated similarity maps to capture global dependencies, and the maximum value along the channel direction is taken as the predicted result. To resolve significant appearance changes and global mismatch issues, an efficient online update mechanism and a global correction mechanism are adopted to further improve accuracy. For the online update mechanism, the update timing affects the performance of the model. A shorter update interval means the dictionary is updated more frequently, which helps the network adapt better to dynamic scenes and fast-moving objects. However, an interval that is too short may introduce noisier visual words, which degrades the performance of the algorithm. Therefore, an appropriate update frequency is important; here, the visual dictionary is updated every 5 frames. Furthermore, to ensure that the prediction masks used to update visual words in the online update mechanism are reliable, a simple outlier removal process is applied to them. Specifically, given a region with the same predicted annotation, the region is accepted only if it intersects the object mask predicted in the previous frame.
If there is no intersection, the predicted region is discarded and the prediction for it is taken directly from the previous frame's result (a sketch of this update step follows the abstract). Result We validate the effectiveness and robustness of our method on the challenging DAVIS (densely annotated video segmentation) 2016 and DAVIS 2017 datasets. Compared with state-of-the-art methods, our method achieves a J&F-mean (Jaccard and F-score mean) score of 83.2% on DAVIS 2016 and 72.3% on DAVIS 2017. We achieve accuracy comparable to fine-tuning-based methods and reach a competitive speed/accuracy trade-off on the two video object segmentation datasets. Conclusion The proposed algorithm can effectively deal with interference caused by occlusion, deformation, and viewpoint changes, and achieves high-quality video object segmentation.
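The online update with outlier removal described in the Method can be illustrated with a short sketch. This is an assumption-laden illustration, not the paper's released code: the function names filter_outliers and online_update, the NumPy label masks, and the rebuild_fn hook (standing in for the clustering routine sketched earlier on this page) are all hypothetical.

```python
# Minimal sketch of the online update: every `update_interval` frames the visual
# dictionary is rebuilt from the current prediction, after discarding predicted
# regions that do not intersect the mask predicted for the previous frame.
import numpy as np


def filter_outliers(pred_mask, prev_mask):
    """Keep each predicted object region only if it overlaps the previous frame's
    mask for that object; otherwise fall back to the previous frame's region."""
    cleaned = np.zeros_like(pred_mask)
    for obj_id in np.unique(pred_mask):
        if obj_id == 0:                               # skip background
            continue
        region = pred_mask == obj_id
        if np.logical_and(region, prev_mask == obj_id).any():
            cleaned[region] = obj_id                  # accepted: intersects previous mask
        else:
            cleaned[prev_mask == obj_id] = obj_id     # rejected: reuse previous result
    return cleaned


def online_update(frame_idx, tgt_embed, pred_mask, prev_mask,
                  dictionary, rebuild_fn, update_interval=5):
    """rebuild_fn(embeddings, mask) stands in for the clustering routine that
    produces visual words; the dictionary is refreshed every `update_interval` frames."""
    cleaned = filter_outliers(pred_mask, prev_mask)
    if frame_idx % update_interval == 0:
        dictionary = rebuild_fn(tgt_embed, cleaned)
    return cleaned, dictionary
```

The 5-frame interval mirrors the setting stated in the abstract: frequent updates help track fast-moving objects, while overly frequent updates would let noisy predictions contaminate the visual dictionary.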
Keywords
