M2FA: multi-dimensional feature fusion attention mechanism for skeleton-based action recognition
2022, Vol. 27, No. 8, Pages: 2391-2403
Print publication date: 2022-08-16
Accepted: 2021-07-02
DOI: 10.11834/jig.210091
Quanyan Jiang, Xiaojun Wu, Tianyang Xu. M2FA: multi-dimensional feature fusion attention mechanism for skeleton-based action recognition[J]. Journal of Image and Graphics, 2022, 27(8): 2391-2403.
Objective
In action recognition tasks, properly exploiting spatio-temporal modeling and the correlations among channels is crucial for capturing rich motion information. Although graph convolutional networks have made steady progress in skeleton-based action recognition, previous attention mechanisms have brought no obvious improvement in classification accuracy when applied to them. Considering the importance of jointly modeling spatio-temporal interactions and channel dependencies, we propose a multi-dimensional feature fusion attention mechanism (M2FA).
Method
Unlike the design philosophies of widely used action recognition frameworks such as the convolutional block attention module (CBAM) and the two-stream adaptive graph convolutional network (2s-AGCN), M2FA explicitly acquires comprehensive dependency information through a feature fusion module embedded in the attention framework. Given a feature map, M2FA applies global average pooling along the spatial, temporal, and channel dimensions to infer the feature descriptor of each dimension. The feature map is then filtered with the fusion of these multi-dimensional descriptors to refine features adaptively, and multi-scale dynamic information is obtained by coupling a global feature branch that squeezes global dynamics with a local feature branch that uses only point-wise convolution layers.
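To make the mechanism above concrete, the following is a minimal PyTorch sketch of such a multi-dimensional feature fusion attention block. It assumes the feature layout (N, C, T, V) (batch, channels, frames, joints) that is common in skeleton-based GCNs; the multiplicative fusion of the descriptors, the reduction ratio, the residual connection, and the class name M2FASketch are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class M2FASketch(nn.Module):
    # Minimal sketch of a multi-dimensional feature fusion attention block.
    # Input x: (N, C, T, V) - batch, channels, frames, joints.
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Global branch: squeeze global dynamics, then restore the channel dimension.
        self.global_branch = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )
        # Local branch: point-wise (1x1) convolution only.
        self.local_branch = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Dimension-wise descriptors via global average pooling.
        chan_desc = x.mean(dim=(2, 3), keepdim=True)   # (N, C, 1, 1) channel descriptor
        temp_desc = x.mean(dim=(1, 3), keepdim=True)   # (N, 1, T, 1) temporal descriptor
        spat_desc = x.mean(dim=(1, 2), keepdim=True)   # (N, 1, 1, V) spatial descriptor
        # Fuse the three descriptors by broadcasting into one joint attention map.
        fused = chan_desc * temp_desc * spat_desc      # (N, C, T, V)
        # Global and local branches refine the fused map at two scales.
        attn = self.sigmoid(self.global_branch(fused) + self.local_branch(fused))
        # Filter the input features for adaptive refinement (residual kept).
        return x * attn + x

if __name__ == "__main__":
    feats = torch.randn(2, 64, 300, 25)  # e.g. 300 frames, 25 joints as in NTU-RGBD
    print(M2FASketch(64)(feats).shape)   # torch.Size([2, 64, 300, 25])

In a 2s-AGCN-style backbone, such a block would typically be appended after each graph-convolution unit so that the whole network remains trainable end-to-end.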
Result
Experiments are conducted on the skeleton-based action recognition datasets NTU-RGBD and Kinetics-Skeleton, and the recognition accuracy of M2FA is compared with that of its baseline 2s-AGCN and of recently proposed graph convolutional models. On the Kinetics-Skeleton validation set, M2FA improves classification accuracy by 1.8% over the baseline 2s-AGCN; on the two benchmarks of NTU-RGBD, M2FA improves classification accuracy over 2s-AGCN by 1.6% and 1.0%, respectively. Ablation studies further verify the effectiveness of the multi-dimensional feature fusion mechanism. The experimental results show that the proposed M2FA improves the classification performance of graph convolutional methods for skeleton-based action recognition.
Conclusion
Compared with the baseline 2s-AGCN and current mainstream graph convolutional models, the multi-dimensional feature fusion attention mechanism achieves the highest recognition accuracy. It can be integrated into skeleton-based architectures and trained end-to-end, yielding more accurate classification results.
Objective
Action analysis and recognition underpins a number of applications such as video surveillance, personal assistance, human-machine interaction, and sports video analysis. Compared with video-based action recognition methods, skeleton-based approaches have recently attracted attention because of their robustness in complex scenarios. Skeleton data, which locates the 2D or 3D spatial coordinates of the body joints, is mainly obtained via depth sensors or video-based pose estimation algorithms. Graph convolutional networks (GCNs) have been developed because traditional methods, which ignore the graphical structure of skeleton data, cannot capture the complete dependence among joints. A critical challenge is how to determine an adaptive graph structure for the skeleton data at the convolutional layers. The spatio-temporal graph convolutional network (ST-GCN) learns spatial and temporal features simultaneously by adding temporal edges between the corresponding joints of the spatial graph in consecutive frames. However, ST-GCN focuses on the physical connections between joints of the human body in the spatial graph and ignores internal dependencies in motion. Spatio-temporal modeling and channel-wise dependencies are crucial for capturing motion information in videos for the action recognition task. Despite the credibility of GCNs in skeleton-based action recognition, classical attention mechanisms have brought only limited improvement when applied to them. Our work highlights the importance of modeling spatio-temporal interactions and channel-wise dependencies jointly through a novel multi-dimensional feature fusion attention mechanism (M2FA).
Method
Our proposed model explicitly leverages comprehensive dependency information through a feature fusion module embedded in the attention framework, which differentiates it from other action recognition models that rely on additional information flows or on a complicated superposition of multiple existing attention modules. Given intermediate feature maps, M2FA infers feature descriptors along the spatial, temporal, and channel dimensions. The fusion of these descriptors filters the input feature maps for adaptive feature refinement. As a lightweight and general module, M2FA can be seamlessly integrated into any skeleton-based architecture and trained end-to-end together with the underlying recognition method.
Result
To verify its effectiveness, the algorithm is validated and analyzed on two large-scale skeleton-based action recognition datasets: NTU-RGBD and Kinetics-Skeleton. Ablation studies are carried out on both datasets to demonstrate the advantages of multi-dimensional feature fusion, and the analyses confirm the merit of M2FA for skeleton-based action recognition. On the Kinetics-Skeleton dataset, the action recognition accuracy of the proposed algorithm is 1.8% higher than that of the baseline algorithm (2s-AGCN). On the cross-view benchmark of the NTU-RGBD dataset, the recognition accuracy of the proposed method reaches 96.1%, which is higher than that of the baseline method; on the cross-subject benchmark, it reaches 90.1%. These results show that the accuracy of the skeleton-based action recognition model 2s-AGCN can be significantly improved by incorporating the adaptive attention mechanism. The proposed multi-dimensional feature fusion attention mechanism, M2FA, captures spatio-temporal interactions and the interconnections between potential channels.
Conclusion
We developed a novel multi-dimensional feature fusion attention mechanism (M2FA) that captures spatio-temporal interactions and channel-wise dependencies simultaneously. Our experimental results show consistent improvements in classification accuracy, demonstrating the advantages of M2FA.
action recognition; skeleton information; graph convolutional network (GCN); attention mechanism; spatio-temporal interaction; channel-wise dependencies; multi-dimensional feature fusion
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Chen L C, Yang Y, Wang J, Xu W and Yuille A L. 2016. Attention to scale: scale-aware semantic image segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 3640-3649 [DOI: 10.1109/CVPR.2016.396]
Du W B, Wang Y L and Qiao Y. 2018. Recurrent spatial-temporal attention network for action recognition in videos. IEEE Transactions on Image Processing, 27(3): 1347-1360 [DOI: 10.1109/TIP.2017.2778563]
Fu J, Liu J, Tian H J, Li Y, Bao Y J, Fang Z W and Lu H Q. 2019. Dual attention network for scene segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3141-3149 [DOI: 10.1109/CVPR.2019.00326]
Gao X, Hu W, Tang J X, Liu J Y and Guo Z M. 2019. Optimized skeleton-based action recognition via sparsified graph regression//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 601-610 [DOI: 10.1145/3343031.3351170]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Li B, Li X, Zhang Z F and Wu F. 2019a. Spatio-temporal graph routing for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1): 8561-8568 [DOI: 10.1609/aaai.v33i01.33018561]
Li M S, Chen S H, Chen X, Zhang Y, Wang Y F and Tian Q. 2019b. Actional-structural graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3590-3598 [DOI: 10.1109/CVPR.2019.00371]
Lin J, Gan C and Han S. 2019. TSM: temporal shift module for efficient video understanding//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7082-7092 [DOI: 10.1109/ICCV.2019.00718]
Ma M, Li Y B, Wu X Q, Gao J F and Pan H P. 2020. Human action recognition in videos utilizing key semantic region extraction and concatenation. Journal of Image and Graphics, 25(12): 2517-2529 [DOI: 10.11834/jig.200049]
Miech A, Laptev I and Sivic J. 2018. Learnable pooling with context gating for video classification [EB/OL]. [2021-02-01]. https://arxiv.org/pdf/1706.06905.pdf
Peng W, Hong X P, Chen H Y and Zhao G Y. 2020. Learning graph convolutional network for skeleton-based human action recognition by neural searching. Proceedings of 2020 AAAI Conference on Artificial Intelligence, 34(3): 2669-2676 [DOI: 10.1609/aaai.v34i03.5652]
Qiu Z F, Yao T and Mei T. 2017. Learning spatio-temporal representation with pseudo-3D residual networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5534-5542 [DOI: 10.1109/ICCV.2017.590]
Ran X Y, Liu K, Li G, Ding W W and Chen B. 2018. Human action recognition algorithm based on adaptive skeleton center. Journal of Image and Graphics, 23(4): 519-525 [DOI: 10.11834/jig.170420]
Shahroudy A, Liu J, Ng T T and Wang G. 2016. NTU RGB+D: a large scale dataset for 3D human activity analysis//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1010-1019 [DOI: 10.1109/CVPR.2016.115]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019a. Skeleton-based action recognition with directed graph neural networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7904-7913 [DOI: 10.1109/CVPR.2019.00810]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019b. Two-stream adaptive graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12018-12027 [DOI: 10.1109/CVPR.2019.01230]
Tan D T, Li S C, Chang W W and Li D L. 2020. Multi-feature fusion behavior recognition model. Journal of Image and Graphics, 25(12): 2541-2552 [DOI: 10.11834/jig.190637]
Tran D, Bourdev L, Fergus R, Torresani L and Paluri M. 2015. Learning spatiotemporal features with 3D convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4489-4497 [DOI: 10.1109/ICCV.2015.510]
Tran D, Wang H, Torresani L, Ray J, LeCun Y and Paluri M. 2018. A closer look at spatiotemporal convolutions for action recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6450-6459 [DOI: 10.1109/CVPR.2018.00675]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2021-02-01]. https://arxiv.org/pdf/1706.03762.pdf
Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O and Van Gool L. 2016. Temporal segment networks: towards good practices for deep action recognition//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 20-36 [DOI: 10.1007/978-3-319-46484-8_2]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Xiao T T, Fan Q F, Gutfreund D, Monfort M, Oliva A and Zhou B L. 2019. Reasoning about human-object interactions through dual attention networks//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3918-3927 [DOI: 10.1109/ICCV.2019.00402]
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI: 7444-7452