He Chujing, Liu Qinying, Wang Zilei (School of Big Data, University of Science and Technology of China, Hefei 230027, China; Department of Automation, University of Science and Technology of China, Hefei 230027, China)
Objective Video action detection aims to localize all persons in a video and determine their corresponding action categories. In real-world scenarios, video action detection faces two major problems. First, interactions may exist between different actors, so recognizing an action from each actor's own regional features alone is inaccurate. Second, an actor may carry multiple action labels at the same moment, and predicting each action class independently ignores their inherent correlations. To this end, this paper proposes a video action detection method that models interaction relationships and class dependencies. Method First, the feature extraction part extracts the regional features of each actor in the key frame. Then, the long short-term interaction part designs a short-term interaction module (STIM) and a long-term interaction module (LTIM) to model the short-term spatio-temporal interactions and the long-term temporal dependencies between actors, respectively; in particular, given the heterogeneity of the spatial and temporal dimensions, the STIM adopts a decoupling mechanism to handle spatial interactions and short-term temporal interactions separately. Finally, to address the multi-label problem, the classifier part designs a class relationship module (CRM) that computes the dependencies between classes to enhance the representations, and, exploiting the complementarity of the score predictions from different modules, a two-stage score fusion (TSSF) strategy is proposed to update the final probability scores. Result Quantitative and qualitative analyses are conducted on the public dataset AVA v2.1 (atomic visual actions version 2.1). In the quantitative analysis, with the mean average precision at an IoU threshold of 0.5, i.e., mAP@IoU 0.5 (mean average precision@intersection over union 0.5), as the main metric, our method achieves 31.0%, 50.8%, 22.3%, and 32.5% on all test subclasses, the Human Pose category, the Human-Object Interaction category, and the Human-Human Interaction category, respectively, improvements of 2.8%, 2.0%, 2.6%, and 3.6% over the baseline model; compared with other mainstream algorithms, the result on all subclasses is 0.8% higher than that of the second-best algorithm. In the qualitative analysis, the visualization results show that our model accurately captures the interactions between actors and that the class dependency modeling is reasonable and reliable. In addition, ablation experiments verify the effectiveness of each module. Conclusion The proposed video action detection method, which models interaction relationships and class dependencies, improves the recognition of interaction-related actions and, to a certain extent, alleviates the multi-label classification problem.
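The abstract names a two-stage score fusion (TSSF) strategy that exploits the complementarity of the different modules' score predictions, but it does not spell out the fusion rule. The following is a hypothetical sketch only: the choice of branches, the stage-one averaging, and the blending weight `alpha` are all illustrative assumptions, not the paper's actual rule.

```python
import numpy as np

def two_stage_score_fusion(stim_scores, ltim_scores, crm_scores, alpha=0.5):
    """Hypothetical sketch of a two-stage score fusion.

    Stage 1 fuses the two interaction branches by averaging; stage 2
    blends the result with the class-relationship (CRM) branch scores.
    All inputs: (N, C) per-actor, per-class probabilities in [0, 1].
    """
    stage1 = (stim_scores + ltim_scores) / 2.0        # fuse interaction branches
    stage2 = alpha * stage1 + (1.0 - alpha) * crm_scores
    return stage2
```

Any monotone combination of the branch scores would fit the abstract's description equally well; the point is only that fusion happens in two ordered stages rather than as one joint average.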
Video action detection with interaction relationship and class dependency modeling
He Chujing, Liu Qinying, Wang Zilei (School of Big Data, University of Science and Technology of China, Hefei 230027, China; Department of Automation, University of Science and Technology of China, Hefei 230027, China)
Objective Video action detection is one of the challenging tasks in computer vision and video recognition, which aims to locate all actors and recognize their actions in video clips. However, two major problems need to be resolved in the real world. First, there are pairwise interactions between actors in real scenes, so it is suboptimal to perform action recognition only in terms of each actor's own regional features; explicitly modeling these interactions can benefit the performance of action detection. Second, action detection is a multi-label classification task, because an actor may perform multiple types of actions at the same time; we argue that it is beneficial to consider the inherent dependencies between different classes. In this study, we propose a video action detection framework that simultaneously models the interactions between actors and the dependencies between categories. Method Our proposed framework consists of three main parts: actor feature extraction, long short-term interaction, and classification. In detail, the actor feature extraction part first utilizes the Faster region-based convolutional neural network (Faster R-CNN) as a person detector to detect the potential actors in the whole dataset, and we only keep the detected actors with relatively high confidence scores. We then take the SlowFast network as the backbone to extract features from the raw videos. For each actor, we apply the RoIAlign operation on the extracted feature map based on the actor's location to generate the corresponding actor feature. To include geometric information, the coordinates of the actors are embedded into their features. The actor features serve as the input of the following steps. The long short-term interaction part includes a short-term interaction module (STIM) and a long-term interaction module (LTIM).
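The actor feature extraction step described above (pooling the backbone feature map inside each detected box, then attaching the box coordinates as geometric information) can be sketched roughly as follows. This is a simplified stand-in, not the paper's implementation: plain average pooling replaces RoIAlign's bilinear sampling, and the four normalized box coordinates are concatenated directly rather than passed through a learned embedding.

```python
import numpy as np

def extract_actor_feature(feature_map, box, img_size):
    """Pool a backbone feature map inside an actor box and append a
    normalized-coordinate vector (simplified stand-in for RoIAlign).

    feature_map: (C, H, W) array from the video backbone (e.g. SlowFast).
    box: (x1, y1, x2, y2) in image pixels.
    img_size: (img_h, img_w) of the original frame.
    """
    C, H, W = feature_map.shape
    img_h, img_w = img_size
    x1, y1, x2, y2 = box
    # Map the pixel box onto the (coarser) feature grid.
    fx1 = int(np.floor(x1 / img_w * W))
    fy1 = int(np.floor(y1 / img_h * H))
    fx2 = max(fx1 + 1, int(np.ceil(x2 / img_w * W)))
    fy2 = max(fy1 + 1, int(np.ceil(y2 / img_h * H)))
    region = feature_map[:, fy1:fy2, fx1:fx2]
    pooled = region.mean(axis=(1, 2))            # (C,) region feature
    # Geometric embedding: box coordinates normalized to [0, 1].
    coords = np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])
    return np.concatenate([pooled, coords])      # (C + 4,) actor feature
```

Concatenating normalized coordinates is one common way to make the later interaction modules location-aware; the paper only states that coordinates are embedded into the features, so the exact embedding form here is an assumption.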
The STIM leverages the graph attention network (GAT) to model short-term spatio-temporal interactions between actors, and the LTIM uses a long-term feature bank (LFB) to model long-term temporal dependencies between actors. For short-term spatio-temporal interaction modeling, an intersection-over-union (IoU) tracker is used to link the bounding boxes of the same person over time. We further propose a decoupling mechanism to handle spatial and temporal interactions separately: the interaction between nodes of different actors at the same time step is defined as spatial interaction, and the interaction between nodes of the same actor at different time steps is defined as temporal interaction. A GAT is applied to each of them, where the nodes are the actor features and their pairwise relationships are represented by the edges. For long-term temporal dependency modeling, a sliding-window mechanism is used to build a long-term feature bank that contains a large amount of temporal context. The current short-term features and the long-term feature bank are fed into a non-local module, where the short-term features serve as queries and the features in the long-term feature bank serve as key-value pairs to extract relevant long-term temporal context. The classification part is based on a class relationship module (CRM), which first extracts a class-specific feature for each class of each actor; the features of the different classes of the same actor are then passed into a self-attention module to compute the semantic correlations between action classes. Finally, we propose a two-stage score fusion (TSSF) strategy to update the classification probability scores on the basis of the complementarity of the different modules. Result We carry out quantitative and qualitative analyses on the public dataset atomic visual actions version 2.1 (AVA v2.1).
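The decoupling mechanism in the STIM can be illustrated with a minimal sketch. Here the GAT layers are replaced by plain scaled dot-product self-attention for brevity, and the residual fusion at the end is an assumption; the point is only the decoupling itself: the spatial branch attends across actors within each time step, while the temporal branch attends across the time steps of one tracked actor.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.
    x: (..., L, D); each of the L nodes attends to all L nodes."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., L, L)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                 # softmax attention weights
    return w @ x

def decoupled_interaction(actor_feats):
    """Decoupled short-term interaction (GAT simplified to self-attention).
    actor_feats: (T, N, D) -- N tracked actors over T key frames,
    linked by the IoU tracker.

    Spatial branch: actors attend to each other within each time step.
    Temporal branch: each actor attends to its own features across time.
    """
    spatial = self_attention(actor_feats)              # attend over N, per t
    temporal = np.swapaxes(                            # attend over T, per actor
        self_attention(np.swapaxes(actor_feats, 0, 1)), 0, 1)
    return actor_feats + spatial + temporal            # residual fusion (assumed)
```

In the paper the two branches are full GAT layers with learned edge attention; this sketch keeps only the structural separation of the spatial and temporal graphs that the decoupling mechanism introduces.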
For the quantitative analysis, the evaluation metric is the average precision (AP) with an IoU threshold of 0.5; for each class we compute the AP and report the mean over all classes. The results of our method on all test sub-categories, the Human Pose category, the Human-Object Interaction category, and the Human-Human Interaction category are 31.0%, 50.8%, 22.3%, and 32.5%, respectively. Compared with the baseline, these are improvements of 2.8%, 2.0%, 2.6%, and 3.6%. Compared with the actor-centric relation network (ACRN), the video action transformer network (VAT), the actor-context-actor relation network (ACAR-Net), and other related algorithms, our method further improves the overall result by 0.8% over the second-best method. For the qualitative analysis, the visualization results demonstrate that our method accurately captures the interactions between actors, and they also reflect the rationality and reliability of the class dependency modeling. To verify the effectiveness of the proposed modules, a series of ablation experiments is conducted. Additionally, while the compared methods are all trained end-to-end, the method proposed in this paper uses a fixed backbone, which brings faster training and lower computing resource consumption. Conclusion To fully exploit the interactions between actors and the dependencies between classes, a video action detection framework is presented. The effectiveness of the proposed method is validated by the experimental results on the AVA v2.1 dataset.
video action detection; multi-label classification; interaction relationship modeling; two-stage fusion; deep learning; attention mechanism