Semi-supervised video segmentation method based on multi-frame spatio-temporal attention

Luo Sihan, Yuan Xia, Liang Yongshun (Nanjing University of Science and Technology)

Abstract
Objective Traditional semi-supervised video segmentation mostly relies on optical flow to model the feature association between key frames and the current frame. However, optical flow is prone to errors caused by occlusion, special textures, and similar conditions, which leads to problems in multi-frame fusion. To fuse multi-frame features more effectively, this paper extracts the appearance features of the first frame and the positional information of adjacent key frames, and fuses them through a Transformer module and an improved PAN module, thereby learning and fusing multi-frame features on the basis of multi-frame spatio-temporal attention. The proposed method consists of two parts: video preprocessing (i.e., an appearance feature extraction network and a current-frame feature extraction network) and feature fusion based on the Transformer and the improved PAN module. Method Specifically, the algorithm proceeds in the following steps. First, an appearance feature extraction network is constructed to extract the appearance information of the first frame. Then, a current-frame feature extraction network extracts the features of the current frame, and a Transformer module fuses the features of the current frame with those of the first frame, so that the appearance information of the first frame guides the extraction of current-frame features. Next, local feature matching is performed between the mask maps of several adjacent frames and the feature map of the current frame, and the frames whose positional information correlates most strongly with the current frame are selected as adjacent key frames to guide the extraction of the current frame's positional information. Finally, an improved PAN feature aggregation module fuses deep semantic information with shallow semantic information. Result On the DAVIS-2016 dataset, the proposed method achieves J and F scores of 81.5% and 80.9%, and on the DAVIS-2017 dataset it achieves 78.4% and 77.9%, outperforming all comparison methods. The method runs at 22 frames/s, ranking second, slightly (1.6%) lower than the PLM algorithm. It also achieves competitive results on the YouTube-VOS dataset, with an average J and F of 71.2%, ahead of all comparison methods. Conclusion The proposed semi-supervised video segmentation algorithm based on multi-frame spatio-temporal attention effectively integrates global and local information while segmenting target objects, reduces the loss of detail, and improves the accuracy of semi-supervised video segmentation while maintaining high efficiency.
Keywords
Semi-supervised video segmentation method based on multi-frame spatio-temporal attention

Luo Sihan, Yuan Xia, Liang Yongshun (Nanjing University of Science and Technology)

Abstract
Objective Video Object Segmentation (VOS) aims to provide high-quality segmentation of target object instances throughout an input video sequence, obtaining pixel-level masks of the target objects and thereby finely separating the targets from the background. Compared with bounding-box-level tasks such as object tracking and detection (which select targets with rectangular boxes), VOS offers pixel-level accuracy, which is more conducive to accurately locating the target and outlining the details of its edges. Depending on the supervision information provided, video object segmentation can be divided into three scenarios: semi-supervised, interactive, and unsupervised video object segmentation. In this article, we focus on the semi-supervised task. In semi-supervised video object segmentation, a pixel-level annotated mask of the first frame of the video is provided, and subsequent prediction frames can fully utilize this annotated mask to assist in computing the segmentation result of each prediction frame. With the development of deep neural network technology, current semi-supervised VOS methods are mostly based on deep learning. These methods can be divided into three categories: detection-based, matching-based, and propagation-based methods. Detection-based segmentation algorithms treat video object segmentation as image object segmentation, without considering the temporal association of the video, on the assumption that a strong frame-level object detector and segmenter suffice to segment the target frame by frame. Matching-based methods typically segment video objects by computing pixel-level or semantic-feature matching scores between the template frame and the current prediction frame. Propagation-based methods propagate the feature information of the frames preceding the prediction frame to the prediction frame, and compute the correlation between the prediction-frame features and the previous-frame features to represent video context information; this context information locates the key areas of the entire video and can guide single-frame image segmentation. There are two types of motion-based propagation methods: one introduces optical flow to train the VOS model, and the other learns deep target features from the previous frame's target mask and refines the target mask in the current frame. Existing semi-supervised video segmentation methods mostly rely on optical flow to model the feature association between key frames and the current frame. However, optical flow is prone to errors caused by occlusion, special textures, and similar conditions, leading to problems in multi-frame fusion. To better integrate multi-frame features, this article extracts the appearance features of the first frame and the positional information of the adjacent key frames, and fuses them through a Transformer module and an improved PAN module, thereby learning and integrating features based on multi-frame spatio-temporal attention. Method In this study, we propose a semi-supervised video object segmentation method that fuses multi-frame appearance feature information and positional feature information through the Transformer mechanism.
Specifically, the algorithm is divided into the following steps: (1) Appearance information feature extraction network. First, we construct an appearance information feature extraction network. This module is modified from CSPDarknet53 and consists of CBS modules, CSPRes modules, a ResSPP module, and REP modules. The first frame of the video serves as the input, which is passed through three CBS modules to obtain the shallow features F_s. These features are then processed through six CSPRes modules, followed by a ResSPP module and, finally, another CBS module to produce the output F_d, representing the appearance information extracted from the first frame of the video. (2) Current frame feature extraction network. We then build a network to extract features from the current frame. This network comprises three cascaded CBS modules, which are used to extract the current frame's feature information. Simultaneously, the Transformer feature fusion module merges the features of the current frame with those of the first frame, so that the appearance information from the first frame guides the extraction of feature information from the current frame. The Transformer module consists of an encoder and a decoder. (3) Local feature matching. With the aid of the mask maps from several adjacent frames and the feature map of the current frame, local feature matching is performed. This process identifies the frames whose positional information correlates most strongly with the current frame and treats them as adjacent keyframes, which are then used to guide the extraction of positional information from the current frame. (4) Improved PAN feature aggregation module. Finally, the input feature maps are passed through an SPP module containing max-pooling layers of sizes 3×3, 5×5, and 9×9. The improved PAN structure then fuses features across different layers: the feature maps undergo a Concat operation, which integrates deep semantic information with shallow semantic information. By integrating these steps, the proposed method aims to improve the accuracy and robustness of video object segmentation. Result In the experiments, the proposed method requires neither online fine-tuning nor post-processing. Our algorithm was compared with 10 current mainstream methods on the DAVIS-2016 and DAVIS-2017 datasets, and with 5 methods on the YouTube-VOS dataset. On the DAVIS-2016 dataset, the algorithm achieved a region similarity score J and a contour accuracy score F of 81.5% and 80.9% respectively, an improvement of 1.2% over the best-performing comparison method. On the DAVIS-2017 dataset, the J and F scores reached 78.4% and 77.9% respectively, an improvement of 1.3% over the best-performing comparison method. The running speed of our algorithm is 22 frames/s, ranking second, slightly (1.6%) lower than the PLM algorithm. On the YouTube-VOS dataset, competitive results were also achieved, with average J and F scores reaching 71.2%, surpassing all comparison methods. Conclusion The semi-supervised video segmentation algorithm based on multi-frame spatio-temporal attention can effectively integrate both global and local information while segmenting target objects. This minimizes the loss of detail, and while maintaining high efficiency, it effectively improves the accuracy of semi-supervised video segmentation.
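To make step (2) concrete, the following is a minimal PyTorch-style sketch of how first-frame appearance features could guide current-frame features through a Transformer encoder-decoder. The module name CrossFrameFusion, the feature dimension, and the single-layer design are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch only: cross-frame Transformer fusion under assumed dimensions and names.
import torch
import torch.nn as nn


class CrossFrameFusion(nn.Module):
    """Fuses first-frame appearance features into current-frame features
    with one Transformer encoder layer and one decoder (cross-attention) layer."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Encoder refines the first-frame appearance tokens.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Decoder lets current-frame tokens attend to the encoded first-frame tokens.
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, first_frame_feat, cur_frame_feat):
        # Both inputs: (B, C, H, W) feature maps -> token sequences (B, H*W, C).
        B, C, H, W = cur_frame_feat.shape
        ref = first_frame_feat.flatten(2).transpose(1, 2)
        cur = cur_frame_feat.flatten(2).transpose(1, 2)
        memory = self.encoder(ref)          # encoded appearance tokens
        fused = self.decoder(cur, memory)   # appearance-guided current-frame tokens
        return fused.transpose(1, 2).reshape(B, C, H, W)


# Example: features at an assumed 1/16 resolution of a 384x384 frame.
fusion = CrossFrameFusion(dim=256, heads=8)
f0 = torch.randn(1, 256, 24, 24)   # first-frame appearance features
ft = torch.randn(1, 256, 24, 24)   # current-frame features
out = fusion(f0, ft)               # (1, 256, 24, 24)
```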
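Step (4) can likewise be sketched: an SPP block with parallel 3×3, 5×5, and 9×9 max-pooling branches, followed by a PAN-style concatenation of upsampled deep features with shallow features. The channel widths, the 1×1 reduction convolution, and the nearest-neighbor upsampling are assumptions for illustration, not the exact configuration used in the paper.

```python
# Sketch only: SPP pooling branches plus Concat-style deep/shallow fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pool branches concatenated with the input."""

    def __init__(self, channels=256):
        super().__init__()
        # stride=1 with padding k//2 keeps the spatial size unchanged for odd kernels.
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (3, 5, 9)]
        )
        # 1x1 conv squeezes the concatenated branches back to the input width (assumed).
        self.reduce = nn.Conv2d(channels * 4, channels, kernel_size=1)

    def forward(self, x):
        out = torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
        return self.reduce(out)


def fuse_deep_shallow(deep_feat, shallow_feat):
    """Upsample deep semantic features to the shallow resolution and concatenate them."""
    deep_up = F.interpolate(deep_feat, size=shallow_feat.shape[-2:], mode="nearest")
    return torch.cat([deep_up, shallow_feat], dim=1)


# Example: deep features at an assumed 1/32 resolution, shallow features at 1/8.
spp = SPP(channels=256)
deep = spp(torch.randn(1, 256, 12, 12))
shallow = torch.randn(1, 128, 48, 48)
fused = fuse_deep_shallow(deep, shallow)   # (1, 384, 48, 48)
```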
Keywords
