Modeling interaction and profiling dependency-relevant video action detection
Vol. 28, Issue 5, Pages 1499-1512 (2023)
Published: 16 May 2023
DOI: 10.11834/jig.211040
He Chujing, Liu Qinying, Wang Zilei. 2023. Modeling interaction and profiling dependency-relevant video action detection. Journal of Image and Graphics, 28(05):1499-1512
Objective
Video action detection aims to localize the spatial positions of all persons in a video and to determine their corresponding action categories. In real-world scenarios, it faces two major problems. First, different actors may interact with one another, so recognizing actions from an actor's own regional features alone is inaccurate. Second, an actor may carry multiple action labels at the same moment, and predicting each action class independently ignores their inherent correlations. To this end, this paper proposes a video action detection method that models interaction relationships and class dependency.
Method
First, the feature extraction stage extracts a regional feature for each actor in the keyframe. Then, the long short-term interaction stage introduces a short-term interaction module (STIM) and a long-term interaction module (LTIM) to model, respectively, the short-term spatio-temporal interactions and the long-term temporal dependencies between actors. In particular, exploiting the heterogeneity of the spatial and temporal dimensions, STIM adopts a decoupling mechanism to handle spatial interactions and short-term temporal interactions separately. Finally, to address the multi-label problem, the classifier stage designs a class relationship module (CRM) that computes the dependencies between classes to enhance the representations, and a two-stage score fusion (TSSF) strategy is proposed to update the final probability scores by exploiting the complementarity of the score predictions from different modules.
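As a rough illustration of the decoupling mechanism described above, the following PyTorch-style sketch applies attention separately within each frame (spatial interaction among actors) and along each actor track (short-term temporal interaction). The module and tensor names are hypothetical assumptions, and the actual STIM may differ in its attention formulation and fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head attention over a set of node features of shape (N, C)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (N, C)
        attn = self.q(x) @ self.k(x).t() / x.shape[-1] ** 0.5   # pairwise scores (N, N)
        return x + F.softmax(attn, dim=-1) @ self.v(x)  # residual update

class DecoupledSTIM(nn.Module):
    """Spatial attention among actors in the same frame and temporal attention
    along each actor track, computed separately and then fused (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = GraphAttention(dim)
        self.temporal = GraphAttention(dim)

    def forward(self, feats):                           # feats: (T, N, C) tracked actor features
        T, N, _ = feats.shape
        spatial = torch.stack([self.spatial(feats[t]) for t in range(T)])               # per frame
        temporal = torch.stack([self.temporal(feats[:, n]) for n in range(N)], dim=1)   # per track
        return spatial + temporal                       # simple additive fusion for illustration

# toy usage: 4 frames, 3 tracked actors, 256-d features
stim = DecoupledSTIM(256)
out = stim(torch.randn(4, 3, 256))                      # -> (4, 3, 256)
```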
Result
Quantitative and qualitative analyses are conducted on the public AVA v2.1 (atomic visual actions version 2.1) dataset. In the quantitative analysis, the mean average precision at an IoU threshold of 0.5 (mAP@IoU 0.5) serves as the main evaluation metric. The proposed method achieves 31.0%, 50.8%, 22.3%, and 32.5% on all evaluated sub-categories, the Human Pose category, the Human-Object Interaction category, and the Human-Human Interaction category, respectively, improving over the baseline by 2.8%, 2.0%, 2.6%, and 3.6%. Compared with other mainstream algorithms, the result on all sub-categories is 0.8% higher than that of the second-best method. In the qualitative analysis, the visualizations show that the model accurately captures the interactions between actors and that the class dependency modeling is reasonable and reliable. In addition, ablation studies verify the effectiveness of each module.
Conclusion
The proposed video action detection method, which models interaction relationships and class dependency, improves the recognition of interaction-related actions and alleviates the multi-label classification problem to a certain extent.
Objective
Video action detection is a challenging task in computer vision and video understanding, which aims to locate all actors and recognize their actions in video clips. Two major problems must be resolved in real-world scenarios. First, there are pairwise interactions between actors in real scenes, so performing action recognition only on each actor's own regional features is suboptimal; explicitly modeling these interactions can benefit action detection. Second, action detection is a multi-label classification task because an actor may perform multiple types of actions at the same time, and we argue that it is beneficial to consider the inherent dependency between different classes. In this study, we propose a video action detection framework that simultaneously models the interactions between actors and the dependencies between categories.
Method
The proposed framework consists of three main parts: actor feature extraction, long short-term interaction, and classification. In detail, the actor feature extraction part first applies the Faster region-based convolutional neural network (Faster R-CNN) as a person detector to detect potential actors over the whole dataset, and only detections with relatively high confidence scores are kept. A SlowFast network is then taken as the backbone to extract features from the raw videos. For each actor, RoIAlign is applied to the extracted feature map at the actor's location to obtain the actor feature, and the actor's coordinates are embedded into the feature to include geometric information. These actor features serve as the input to the following stages. The long short-term interaction part includes a short-term interaction module (STIM) and a long-term interaction module (LTIM). STIM leverages a graph attention network (GAT) to model short-term spatio-temporal interactions between actors, and LTIM uses a long-term feature bank (LFB) to model long-term temporal dependencies between actors. For short-term spatio-temporal interaction modeling, an intersection-over-union (IoU) tracker links the bounding boxes of the same person across time. Considering the heterogeneity of the spatial and temporal dimensions, a decoupling mechanism is proposed to handle spatial and temporal interactions separately: the interaction between nodes of different actors at the same time step is defined as spatial interaction, and the interaction between nodes of the same actor at different time steps is defined as temporal interaction. A GAT is applied to each of them, where the nodes are the actor features and their pairwise relationships are represented by the edges. For long-term temporal dependency modeling, a sliding window mechanism is used to build a long-term feature bank that covers a large temporal context. The current short-term features and the long-term feature bank are fed into a non-local module, where the short-term features serve as queries and the features in the long-term feature bank serve as key-value pairs to extract relevant long-term temporal context. The classification part is built on a class relationship module (CRM), which first extracts a class-specific feature for each class of each actor; the features of different classes of the same actor are then passed into a self-attention module to compute the semantic correlations between action classes. Finally, a two-stage score fusion (TSSF) strategy is proposed to update the classification probability scores by exploiting the complementarity of different modules.
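The following is a minimal sketch of the long-term interaction step described above, in which current short-term actor features act as queries against a long-term feature bank used as key-value pairs. Class names, dimensions, and the residual fusion are illustrative assumptions rather than the paper's exact LTIM design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongTermInteraction(nn.Module):
    """Non-local style cross-attention from current actors to a feature bank."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries: short-term actor features of the current clip
        self.k = nn.Linear(dim, dim)   # keys:    entries of the long-term feature bank
        self.v = nn.Linear(dim, dim)   # values:  entries of the long-term feature bank

    def forward(self, short_term, bank):
        # short_term: (N, C) actors in the current clip
        # bank:       (M, C) features gathered with a sliding window over a long temporal span
        attn = self.q(short_term) @ self.k(bank).t() / short_term.shape[-1] ** 0.5
        context = F.softmax(attn, dim=-1) @ self.v(bank)   # long-term context per actor, (N, C)
        return short_term + context                        # residual fusion

# toy usage: 5 current actors attend to a bank of 60 past/future actor features
ltim = LongTermInteraction(256)
out = ltim(torch.randn(5, 256), torch.randn(60, 256))      # -> (5, 256)
```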
Result
We carry out quantitative and qualitative analyses on the public atomic visual actions version 2.1 (AVA v2.1) dataset. For the quantitative analysis, the evaluation metric is the mean average precision with an IoU threshold of 0.5: the average precision is computed for each class, and the mean over all classes is reported. The results of our method on all test sub-categories, the Human Pose category, the Human-Object Interaction category, and the Human-Human Interaction category are 31.0%, 50.8%, 22.3%, and 32.5%, respectively. Compared with the baseline, these results improve by 2.8%, 2.0%, 2.6%, and 3.6%, respectively. Compared with the actor-centric relation network (ACRN), the video action transformer network (VAT), the actor-context-actor relation network (ACAR-Net), and other related algorithms, our method outperforms the second-best method by a further 0.8% over all sub-categories. For the qualitative analysis, the visualization results demonstrate that our method can accurately capture the interactions between actors, and they also reflect the rationality and reliability of the class dependency modeling. To verify the effectiveness of the proposed modules, a series of ablation experiments is conducted. Additionally, whereas the compared methods are trained end to end, the proposed method uses a fixed backbone, which yields faster training and lower computing resource consumption.
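For reference, below is a simplified sketch of the evaluation protocol used above: a detection matches a ground-truth box of the same class at IoU >= 0.5, per-class AP is the area under the ranked precision-recall curve, and mAP is the mean over classes. Function names and the greedy matching are illustrative assumptions; the official AVA evaluation code handles details omitted here.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def average_precision(detections, gt_boxes, thr=0.5):
    """detections: list of (score, box) for one class; gt_boxes: list of ground-truth boxes."""
    detections = sorted(detections, key=lambda d: -d[0])    # rank by confidence
    matched, tp = set(), []
    for _, box in detections:
        best = max(range(len(gt_boxes)), key=lambda i: iou(box, gt_boxes[i]), default=None)
        hit = best is not None and best not in matched and iou(box, gt_boxes[best]) >= thr
        if hit:
            matched.add(best)                               # each ground truth matches at most once
        tp.append(1.0 if hit else 0.0)
    tp = np.array(tp)
    recall = np.concatenate([[0.0], np.cumsum(tp) / max(len(gt_boxes), 1)])
    precision = np.concatenate([[1.0], np.cumsum(tp) / (np.arange(len(tp)) + 1)])
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))  # area under the PR curve

# mAP@IoU 0.5 is then the mean of average_precision(...) over all evaluated classes.
```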
Conclusion
To fully exploit the interactions between actors and the dependencies between classes, a video action detection framework is presented. The effectiveness of the proposed method is validated by the experimental results on the AVA v2.1 dataset.
Keywords: video action detection; multi-label classification; interaction relationship modeling; two-stage fusion; deep learning; attention mechanism
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Feichtenhofer C, Fan H Q, Malik J and He K M. 2019. SlowFast networks for video recognition//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6201-6210 [DOI: 10.1109/ICCV.2019.00630]
Girdhar R, Carreira J, Doersch C and Zisserman A. 2019. Video action transformer network//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 244-253 [DOI: 10.1109/CVPR.2019.00033]
Gu C H, Sun C, Ross D A, Vondrick C, Pantofaru C, Li Y Q, Vijayanarasimhan S, Toderici G, Ricco S, Sukthankar R, Schmid C and Malik J. 2018. AVA: a video dataset of spatio-temporally localized atomic visual actions//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6047-6056 [DOI: 10.1109/CVPR.2018.00633]
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 386-397 [DOI: 10.1109/TPAMI.2018.2844175]
Li Y X, Zhang B S, Li J, Wang Y B, Lin W Y, Wang C J, Li J L and Huang F Y. 2021. LSTC: boosting atomic action detection with long-short-term context//Proceedings of the 29th ACM International Conference on Multimedia. Chengdu, China: ACM: 2158-2166 [DOI: 10.1145/3474085.3475374]
Luo H L, Tong K and Kong F S. 2019. The progress of human action recognition in videos based on deep learning: a review. Acta Electronica Sinica, 47(5): 1162-1173 [DOI: 10.3969/j.issn.0372-2112.2019.05.025]
Ni J C, Qin J and Huang D. 2021. Identity-aware graph memory network for action detection//Proceedings of the 29th ACM International Conference on Multimedia. Chengdu, China: ACM: 3437-3445 [DOI: 10.1145/3474085.3475503]
Pan J T, Chen S Y, Shou M Z, Liu Y, Shao J and Li H S. 2021. Actor-context-actor relation network for spatio-temporal action localization//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 464-474 [DOI: 10.1109/CVPR46437.2021.00053]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Singh G, Saha S, Sapienza M, Torr P and Cuzzolin F. 2017. Online real-time multiple spatiotemporal action localisation and prediction//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3657-3666 [DOI: 10.1109/ICCV.2017.393]
Sun C, Shrivastava A, Vondrick C, Murphy K, Sukthankar R and Schmid C. 2018. Actor-centric relation network//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 335-351 [DOI: 10.1007/978-3-030-01252-6_20]
Tang J J, Xia J, Mu X Z, Pang B and Lu C W. 2020. Asynchronous interaction aggregation for action detection//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 71-87 [DOI: 10.1007/978-3-030-58555-6_5]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Velickovic P, Cucurull G, Casanova A, Romero A, Liò P and Bengio Y. 2018. Graph attention networks//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR
Wang D Q and Zhao X. 2022. Class-aware network with global temporal relations for video action detection. Journal of Image and Graphics, 27(12): 3566-3580 [DOI: 10.11834/jig.211096]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Wu C Y, Feichtenhofer C, Fan H Q, He K M, Krahenbuhl P and Girshick R. 2019. Long-term feature banks for detailed video understanding//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 284-293 [DOI: 10.1109/CVPR.2019.00037]
Wu J C, Kuang Z H, Wang L M, Zhang W and Wu G S. 2020. Context-aware RCNN: a baseline for action detection in videos//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 440-456 [DOI: 10.1007/978-3-030-58595-2_27]