Spatio-temporal dual affine differential invariants for skeleton-based action recognition
2021, Vol. 26, No. 12: 2879-2891
Received: 2020-08-21; Revised: 2020-11-20; Accepted: 2020-11-27; Published in print: 2021-12-16
DOI: 10.11834/jig.200453
Objective
The dynamic variation of the human skeleton carries important information for action recognition. From the perspective of joint trajectories, the trajectories of the joints that are decisive for the action class convey the most significant information. Across repeated attempts at the same action, the trajectory of a corresponding joint generally shares the same basic shape, but its concrete form is subject to certain distortions. Based on an analysis of these distortion factors, we model the common transformations of joint trajectories in human motion as spatio-temporal dual affine transformations.
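Written out (with the symbols $A$, $\boldsymbol{b}$, $\alpha$, $\beta$ introduced here for illustration), the model sends a joint trajectory $\boldsymbol{x}(t)\in\mathbb{R}^{3}$ to a spatially transformed copy on an affinely rescaled time axis:

```latex
% Outer (spatial) affine transformation in 3D composed with an
% inner (temporal) affine transformation in 1D:
\tilde{\boldsymbol{x}}(t) = A\,\boldsymbol{x}(\alpha t + \beta) + \boldsymbol{b},
\qquad A \in \mathrm{GL}(3,\mathbb{R}),\quad \boldsymbol{b}\in\mathbb{R}^{3},\quad \alpha \neq 0 .
```

The outer map $(A,\boldsymbol{b})$ covers viewpoint change, skeleton size, and action amplitude; the inner map $(\alpha,\beta)$ covers uniform time scaling and shift.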
Method
First, the spatio-temporal dual affine transformation is described by a unified expression in the form of an inner (temporal) and an outer (spatial) transformation. Based on the differential relations between the trajectory curves before and after the transformation, we derive dual affine differential invariants that describe the local properties of joint trajectories. Exploiting the fact that the differential invariants and the joint coordinates share the same data structure, we propose a channel augmentation method: the input data are extended with the differential invariants along the channel dimension and then fed into a neural network for training and evaluation, improving the network's generalization ability.
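A minimal sketch of channel augmentation (assuming the (C, T, V, M) input layout used by ST-GCN-style networks and an 8-dimensional invariant vector precomputed per joint per frame; the function name is ours):

```python
import numpy as np

def channel_augment(coords: np.ndarray, stdadi: np.ndarray) -> np.ndarray:
    """Extend the network input along the channel dimension with invariants.

    coords : (3, T, V, M) raw 3D joint coordinates
             (T frames, V joints, M performers; ST-GCN layout).
    stdadi : (8, T, V, M) dual affine differential invariant vector
             for each joint at each frame.
    Returns an (11, T, V, M) array; the network itself is unchanged
    except that its first layer takes 11 input channels instead of 3.
    """
    assert coords.shape[1:] == stdadi.shape[1:]
    return np.concatenate([coords, stdadi], axis=0)

# Example shapes: 300 frames, 25 joints, 2 performers (NTU RGB+D format).
x = channel_augment(np.zeros((3, 300, 25, 2)), np.zeros((8, 300, 25, 2)))
print(x.shape)  # (11, 300, 25, 2)
```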
Results
Experiments on two large-scale action recognition datasets, NTU (Nanyang Technological University) RGB+D (NTU 60) and NTU RGB+D 120 (NTU 120), compare the proposed method with several state-of-the-art methods and two baselines, and show clear improvements under both evaluation protocols (cross-subject and cross-view recognition). Compared with spatio-temporal graph convolutional networks (ST-GCN) trained on raw data, the cross-subject and cross-view recognition accuracies on NTU 60 improve by 1.9% and 3.0%, respectively, and the cross-subject and cross-setup accuracies on NTU 120 improve by 5.6% and 4.5%, respectively. Compared with data augmentation, the invariant-feature-based channel augmentation yields clear gains under both protocols and improves the generalization ability of the network more effectively.
Conclusion
The proposed invariant features and channel augmentation combine the advantages of hand-crafted features and deep learning in an intuitive and effective way, improving the accuracy of skeleton-based action recognition and the generalization ability of the neural network.
Objective
Skeleton-based action recognition has attracted increasing attention in recent years, as the dynamics of human skeletons carry significant information for the task. A skeletal action can be seen as a time series of human poses, or equivalently as a combination of joint trajectories. Among all joints, the trajectories of the joints indicative of the action class convey the most significant information. When the same action is performed in different attempts, these trajectories are subject to distortions: two trajectories of a corresponding joint share a basic shape, yet appear in diverse forms due to individual factors. The distortions arise from spatial and temporal factors. Spatial factors include changes of viewpoint, skeleton size, and action amplitude, while temporal factors amount to time scaling along the sequence, i.e., the order and speed at which the action is performed. The spatial factors can be modeled as an affine transformation in 3D space, and uniform time scaling, the most commonly discussed temporal case, can be seen as an affine transformation in 1D. We combine these two kinds of distortion into a spatio-temporal dual affine transformation and propose a novel feature that is invariant under it; such invariance helps identify similar trajectories and thus facilitates skeleton-based action recognition.
Method
We propose a general method for constructing spatio-temporal dual affine differential invariants (STDADI). The invariants are rational polynomials in the derivatives of a joint trajectory, obtained by eliminating the transformation parameters, which yields robust, coordinate-system-independent features computed directly from the 3D coordinates. Bounding the degree of the polynomials and the order of the derivatives, we generate 8 independent STDADIs and combine them into an invariant vector at each moment for each human joint. Moreover, we propose an intuitive and effective technique called channel augmentation, which extends the input data with STDADI along the channel dimension for training and evaluation: the coordinate vector and the STDADI vector of each joint are concatenated at each frame. Channel augmentation introduces invariant information into the input data without changing the inner structure of the neural network. We use spatio-temporal graph convolutional networks (ST-GCN) as the backbone network. ST-GCN models skeleton data as a graph that encodes the spatial and temporal connections between human joints simultaneously, exploits local patterns and correlations in human skeletons, and expresses the importance of joints along the action sequence as joint weights in the spatio-temporal graph. This is in line with our STDADI, since both focus on describing joint dynamics, while our features further provide an expression that is invariant under the distortions.
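To illustrate the elimination idea with a minimal example of our own (not necessarily one of the paper's 8 STDADIs): under $\tilde{\boldsymbol{x}}(t)=A\boldsymbol{x}(\alpha t+\beta)+\boldsymbol{b}$, the $k$-th derivative transforms as $\tilde{\boldsymbol{x}}^{(k)}=\alpha^{k}A\boldsymbol{x}^{(k)}$, so a determinant of three derivatives picks up a factor $\det(A)$ times a power of $\alpha$, and a suitable rational combination cancels both:

```latex
% D_{ijk} denotes the determinant of the i-th, j-th and k-th derivatives:
D_{ijk} = \det\!\left(\boldsymbol{x}^{(i)},\,\boldsymbol{x}^{(j)},\,\boldsymbol{x}^{(k)}\right),
\qquad D_{ijk} \;\mapsto\; \alpha^{\,i+j+k}\det(A)\, D_{ijk}.

% The weights of D_{123}, D_{124}, D_{134} are \alpha^{6}\det(A),
% \alpha^{7}\det(A), \alpha^{8}\det(A), so the ratio below is invariant:
I = \frac{D_{123}\, D_{134}}{D_{124}^{\,2}}
\;\mapsto\;
\frac{\alpha^{6}\det(A)\cdot \alpha^{8}\det(A)}{\bigl(\alpha^{7}\det(A)\bigr)^{2}}\, I \;=\; I .
```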
Results
We examine both synthetic data and large-scale action recognition datasets to verify the effectiveness of STDADI. First, a 3D spiral line and selected joint trajectories from NTU RGB+D, subjected to random transformation parameters, confirm that STDADI is invariant under spatio-temporal dual affine transformations.
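A minimal numerical sketch of such a check (using the illustrative invariant $I$ defined in the Method section rather than the paper's actual STDADIs; the curve, sampling, and transformation parameters are arbitrary choices of ours):

```python
import numpy as np

def derivatives(traj, dt, order=4):
    """Finite-difference derivatives of an (N, 3) trajectory up to `order`."""
    ds = [traj]
    for _ in range(order):
        ds.append(np.gradient(ds[-1], dt, axis=0))
    return ds  # ds[k] is the k-th derivative

def invariant(traj, dt):
    """Illustrative invariant I = D123 * D134 / D124^2 (see Method)."""
    d = derivatives(traj, dt)
    det = lambda i, j, k: np.linalg.det(np.stack([d[i], d[j], d[k]], axis=-1))
    return det(1, 2, 3) * det(1, 3, 4) / det(1, 2, 4) ** 2

# Generic smooth 3D curve (spiral-like synthetic trajectory).
f = lambda t: np.stack([np.sin(t), np.cos(2 * t), t + 0.2 * t**2], axis=-1)

alpha, beta = 1.7, 0.3                  # inner (temporal) affine parameters
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))             # outer (spatial) affine parameters
b = rng.normal(size=3)

s = np.linspace(0.0, 2.0, 4000)         # parameter grid of transformed curve
t = alpha * s + beta                    # corresponding original parameters
x = f(t)                                # original trajectory, spacing alpha*ds
y = f(t) @ A.T + b                      # transformed trajectory, spacing ds

I_x = invariant(x, t[1] - t[0])
I_y = invariant(y, s[1] - s[0])

core = slice(100, -100)                 # drop boundary finite-difference error
err = np.median(np.abs(I_x[core] - I_y[core]) / np.abs(I_x[core]))
print(f"median relative deviation: {err:.2e}")  # small, up to discretization error
```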
Next, the proposed feature and method are validated on the large-scale action recognition dataset NTU (Nanyang Technological University) RGB+D (NTU 60) and its extended version NTU RGB+D 120 (NTU 120), currently the largest datasets with 3D joint annotations captured in a constrained indoor environment, and detailed studies examine the contributions of STDADI. The original ST-GCN and a data augmentation technique serve as baselines; the augmentation involves rotation, scaling, and shear transformations of the 3D skeletons, and we keep the same training strategy and hyper-parameters as the original ST-GCN. ST-GCN with channel augmentation performs well: compared with ST-GCN on raw data, the cross-subject and cross-view recognition accuracies on NTU 60 increase by 1.9% and 3.0%, respectively, and the cross-subject and cross-setup accuracies on NTU 120 increase by 5.6% and 4.5%, respectively. Since data augmentation mainly consists of 3D geometric transformations, it improves cross-view recognition considerably but contributes little in the cross-subject setting, whereas the spatio-temporal dual affine transformation assumption is validated under both evaluation criteria.
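For concreteness, a minimal sketch of the kind of geometric augmentation used by this baseline (rotation, scaling, and shear of the 3D skeletons; the parameter ranges and axis choices are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def random_geometric_augment(skeleton, rng):
    """Apply a random rotation, scaling, and shear to 3D skeleton data.

    skeleton : (3, T, V, M) joint coordinates.
    Returns the transformed skeleton; parameter ranges are illustrative.
    """
    theta = rng.uniform(-0.3, 0.3)                      # rotation about y-axis
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    S = np.diag(rng.uniform(0.9, 1.1, size=3))          # anisotropic scaling
    H = np.eye(3)
    H[0, 1], H[1, 0] = rng.uniform(-0.1, 0.1, size=2)   # shear in the x-y plane
    mat = R @ S @ H                                     # composed 3x3 transform
    return np.einsum('ij,jtvm->itvm', mat, skeleton)

rng = np.random.default_rng(0)
aug = random_geometric_augment(rng.random((3, 300, 25, 2)), rng)
```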
Conclusion
We propose a general method for constructing spatio-temporal dual affine differential invariants (STDADI) and demonstrate the effectiveness of this invariant feature, applied through channel augmentation, on the large-scale action recognition datasets NTU RGB+D and NTU RGB+D 120. This combination of hand-crafted features and data-driven methods improves both accuracy and generalization.
Anirudh R, Turaga P, Su J Y and Srivastava A. 2017. Elastic functional coding of riemannian trajectories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5): 922-936[DOI: 10.1109/TPAMI.2016.2564409]
Boulahia S Y, Anquetil E, Kulpa R and Multon F. 2016. HIF3D: handwriting-inspired features for 3D skeleton-based action recognition//Proceedings of the 23rd International Conference on Pattern Recognition. Cancun, Mexico: IEEE: 985-990[DOI: 10.1109/ICPR.2016.7899764]
Brown A B. 1935. Functional dependence. Transactions of the American Mathematical Society, 38(2): 379-379[DOI: 10.1090/S0002-9947-1935-1501816-5]
Cao C Q, Lan C L, Zhang Y F, Zeng W J, Lu H Q and Zhang Y N. 2019. Skeleton-based action recognition with gated convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 29(11): 3247-3257[DOI: 10.1109/TCSVT.2018.2879913]
Ding C Y, Liu K, Li G, Yan L, Chen B Y and Zhong Y M. 2020. Spatio-temporal weighted posture motion features for human skeleton action recognition research. Chinese Journal of Computers, 43(1): 29-40[DOI: 10.11897/SP.J.1016.2020.00029]
Du Y, Wang W and Wang L. 2015. Hierarchical recurrent neural network for skeleton based action recognition//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1110-1118[DOI: 10.1109/CVPR.2015.7298714]
Esling P and Agon C. 2012. Time-series data mining. ACM Computing Surveys, 45(1): #12[DOI: 10.1145/2379776.2379788]
He X X, Shao C X and Xiong Y. 2014. A new similarity measure based on shape information for invariant with multiple distortions. Neurocomputing, 129: 556-569[DOI: 10.1016/j.neucom.2013.09.003]
Hussein M E, Torki M, Gowayyed M A and El-Saban M. 2013. Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations//Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China: AAAI Press: 2466-2472
Kacem A, Daoudi M, Amor B B, Berretti S and Alvarez-Paiva J C. 2020. A novel geometric framework on gram matrix trajectories for human behavior understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1): 1-14[DOI: 10.1109/TPAMI.2018.2872564]
Ke Q H, Bennamoun M, An S J, Sohel F and Boussaid F. 2017. A new representation of skeleton sequences for 3D action recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4570-4579[DOI: 10.1109/CVPR.2017.486]
Ke Q H, Bennamoun M, An S J, Sohel F and Boussaid F. 2018. Learning clip representations for skeleton-based 3D action recognition. IEEE Transactions on Image Processing, 27(6): 2842-2855[DOI: 10.1109/TIP.2018.2812099]
Kim T S and Reiter A. 2017. Interpretable 3D human action analysis with temporal convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, USA: IEEE: 1623-1631[DOI: 10.1109/CVPRW.2017.207]
Li B, Dai Y C, Cheng X L, Chen H H, Lin Y and He M Y. 2017a. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN//Proceedings of 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). Hong Kong, China: IEEE: 601-604[DOI: 10.1109/ICMEW.2017.8026282]
Li B, Li X, Zhang Z F and Wu F. 2019a. Spatio-temporal graph routing for skeleton-based action recognition. Proceedings of 2019 AAAI Conference on Artificial Intelligence, 33(1): 8561-8568[DOI: 10.1609/aaai.v33i01.33018561]
Li C, Zhong Q Y, Xie D and Pu S L. 2017b. Skeleton-based action recognition with convolutional neural networks//Proceedings of 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). Hong Kong, China: IEEE: 597-600[DOI: 10.1109/ICMEW.2017.8026285]
Li M S, Chen S, Chen X, Zhang Y, Wang Y F and Tian Q. 2019b. Actional-structural graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3590-3598[DOI: 10.1109/CVPR.2019.00371]
Liu H, Tu J H and Liu M Y. 2017a. Two-stream 3D convolutional neural network for skeleton-based action recognition[EB/OL]. [2020-08-06]. https://arxiv.org/pdf/1705.08106.pdf
Liu J, Shahroudy A, Perez M, Wang G, Duan L Y and Kot A C. 2020. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2684-2701[DOI: 10.1109/TPAMI.2019.2916873]
Liu J, Shahroudy A, Xu D and Wang G. 2016. Spatio-temporal LSTM with trust gates for 3D human action recognition//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 816-833[DOI: 10.1007/978-3-319-46487-9_50]
Liu J, Wang G, Duan L Y, Abdiyeva K and Kot A C. 2017b. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing, 27(4): 1586-1599[DOI: 10.1109/TIP.2017.2785279]
Liu J, Wang G, Hu P, Duan L Y and Kot A C. 2017c. Global context-aware attention LSTM networks for 3D action recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3671-3680[DOI: 10.1109/CVPR.2017.391]
Liu M Y, Liu H and Chen C. 2017d. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68: 346-362[DOI: 10.1016/j.patcog.2017.02.030]
Liu M Y and Yuan J S. 2018. Recognizing human actions as the evolution of pose estimation maps//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1159-1168[DOI: 10.1109/CVPR.2018.00127]
Müller M, Röder T and Clausen M. 2005. Efficient content-based retrieval of motion capture data. ACM Transactions on Graphics, 24(3): 677-685[DOI: 10.1145/1073204.1073247]
Pham H H, Salmane H, Khoudour L, Crouzil A, Zegers P and Velastin S A. 2019. Spatio-temporal image representation of 3D skeletal movements for view-invariant action recognition with deep convolutional neural networks. Sensors, 19(8): #1932[DOI: 10.3390/s19081932]
Sadjadi F A and Hall E L. 1980. Three-dimensional moment invariants. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(2): 127-136[DOI: 10.1109/TPAMI.1980.4766990]
Shahroudy A, Liu J, Ng T T and Wang G. 2016. NTU RGB+D: a large scale dataset for 3D human activity analysis//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1010-1019[ DOI:10.1109/CVPR.2016.115 http://dx.doi.org/10.1109/CVPR.2016.115 ]
Shao Z P and Li Y F. 2015. Integral invariants for space motion trajectory matching and recognition. Pattern Recognition, 48(8): 2418-2432[DOI: 10.1016/j.patcog.2015.02.029]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019a. Skeleton-based action recognition with directed graph neural networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7904-7913[DOI: 10.1109/CVPR.2019.00810]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019b. Two-stream adaptive graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12018-12027[DOI: 10.1109/CVPR.2019.01230]
Si C Y, Chen W T, Wang W, Wang L and Tan T N. 2019. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1227-1236[DOI: 10.1109/CVPR.2019.00132]
Veeriah V, Zhuang N F and Qi G J. 2015. Differential recurrent neural networks for action recognition//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4041-4049[DOI: 10.1109/ICCV.2015.460]
Vemulapalli R, Arrate F and Chellappa R. 2014. Human action recognition by representing 3D skeletons as points in a Lie group//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 588-595[DOI: 10.1109/CVPR.2014.82]
Wang H S and Wang L. 2017. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3633-3642[DOI: 10.1109/CVPR.2017.387]
Wang J, Liu Z C, Wu Y and Yuan J S. 2012. Mining actionlet ensemble for action recognition with depth cameras//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 1290-1297[DOI: 10.1109/CVPR.2012.6247813]
Wang P C, Li Z Y, Hou Y H and Li W Q. 2016. Action recognition based on joint trajectory maps using convolutional neural networks//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, the Netherlands: ACM: 102-106[DOI: 10.1145/2964284.2967191]
Wang P C, Li W Q, Ogunbona P, Wan J and Escalera S. 2018. RGB-D-based human motion recognition with deep learning: a survey. Computer Vision and Image Understanding, 171: 118-139[DOI: 10.1016/j.cviu.2018.04.007]
Wang Y Y, Li Y B and Ji X F. 2013. Human action recognition based on super-interest points features. Journal of Image and Graphics, 18(7): 805-812[DOI: 10.11834/jig.20130710]
Wu S D and Li Y F. 2009. Flexible signature descriptions for adaptive motion trajectory representation, perception and recognition. Pattern Recognition, 42(1): 194-214[DOI: 10.1016/j.patcog.2008.06.023]
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition[EB/OL]. [2020-08-06]. https://arxiv.org/pdf/1801.07455.pdf
Zhang P E, Lan C L, Xing J L, Zeng W J, Xue J R and Zheng N N. 2017. View adaptive recurrent neural networks for high performance human action recognition from skeleton data//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2136-2145[DOI: 10.1109/ICCV.2017.233]
Zheng X, Peng X D and Wang J X. 2018. Human action recognition based on pose spatio-temporal features. Journal of Computer-Aided Design and Computer Graphics, 30(9): 1615-1624[DOI: 10.3724/SP.J.1089.2018.16848]
Zhong Q B, Zheng C M and Piao S H. 2020. Research on skeleton-based action recognition with spatiotemporal fusion and human-robot interaction. CAAI Transactions on Intelligent Systems, 15(3): 601-608[DOI: 10.11992/tis.202006029]