Multimodal spatial-temporal feature representation and its application in action recognition
2023, Vol. 28, No. 4, Pages 1041-1055
Print publication date: 2023-04-16
DOI: 10.11834/jig.211217
Shi Haiyong, Hou Zhenjie, Chao Xin, Zhong Zhuokun. 2023. Multimodal spatial-temporal feature representation and its application in action recognition. Journal of Image and Graphics, 28(04):1041-1055
Objective
In human action recognition research, fusing depth data with skeleton data through multimodal methods can effectively improve action recognition rates. To address the large volume and high redundancy of depth image data, this paper proposes an algorithm that reduces redundancy by extracting the action frame sequence carrying the key temporal information, namely the centroid motion path relaxation algorithm, and, based on the characteristics of the different modalities, proposes a new spatio-temporal feature representation.
Method
Based on the motion distance of the centroid between adjacent frames, the centroid motion path relaxation algorithm computes a similarity coefficient for the active regions obtained by image differencing and then discards frames with high similarity, retaining the key temporal information that suffices to express the action. A new spatio-temporal feature representation is then built from the variation characteristics of the dynamic part of the image, the coordination of the body parts during motion, and locally salient features.
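A minimal sketch of how such a relaxation step could look, assuming frame differencing against the last kept frame, an IoU-style similarity coefficient for the active regions, and illustrative thresholds (`motion_thresh`, `dist_thresh`, `sim_thresh`); the paper's exact similarity measure and parameter values may differ:

```python
import numpy as np

def centroid(mask):
    """Centroid (row, col) of the active (nonzero) pixels of a binary mask."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return np.zeros(2)
    return np.array([ys.mean(), xs.mean()])

def relax_path(depth_frames, motion_thresh=30.0, dist_thresh=2.0, sim_thresh=0.9):
    """Drop frames whose active region is nearly identical to the last kept one.

    depth_frames: list of 2D depth maps (NumPy arrays).
    motion_thresh: depth change marking a pixel as 'active' (assumed value).
    dist_thresh / sim_thresh: centroid-move and overlap levels below/above
    which a frame counts as redundant (assumed values).
    Returns the indices of the kept key frames.
    """
    kept = [0]
    prev_active, prev_cent = None, None
    for i in range(1, len(depth_frames)):
        # Active part: pixels that changed versus the last kept frame.
        diff = np.abs(depth_frames[i].astype(np.float32)
                      - depth_frames[kept[-1]].astype(np.float32))
        active = diff > motion_thresh
        cent = centroid(active)
        if prev_active is not None:
            # Similarity coefficient of the two active regions (IoU here).
            inter = np.logical_and(active, prev_active).sum()
            union = max(np.logical_or(active, prev_active).sum(), 1)
            moved = np.linalg.norm(cent - prev_cent)
            if moved < dist_thresh and inter / union > sim_thresh:
                continue  # centroid barely moved, motion looks the same: drop
        kept.append(i)
        prev_active, prev_cent = active, cent
    return kept
```

Running this over a depth sequence yields the key-frame indices carrying the action's key temporal information; the temporal features described below would then be extracted only from those frames.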
Result
The method is validated on the MSR-Action3D dataset. The average cross-validation recognition rate over the three subsets is 95.743 2%, which is 2.443 2%, 4.763 2%, 0.343 2%, and 0.213 2% higher than the Multi-fused, CovP3DJ, D3D-LSTM (densely connected 3D-CNN and long short-term memory), and Joint Subset Selection methods, respectively. In an extended experiment on the complete dataset, the method achieves a cross-validation recognition rate of 93.040 3%, showing good robustness.
Conclusion
Experimental results show that the proposed redundancy-removal algorithm improves recognition after reducing redundancy, and that the extracted features have low mutual correlation and complement each other well in combined recognition, effectively improving classification accuracy.
Objective
Human motion recognition has been developing in computer vision and pattern recognition contexts such as human-computer interaction, motion analysis, intelligent monitoring, and virtual reality. Conventional action recognition mainly uses RGB image sequences captured by RGB cameras to obtain two-dimensional information for behavior recognition. To improve the ability to detect short-duration fragments, feature descriptors for RGB image sequences, such as the histogram of oriented gradients (HOG), the histogram of optical flow (HOF), and three-dimensional feature pyramids, are employed to characterize human behavior. Because RGB images describe object behavior only in terms of two-dimensional information, some researchers have exploited the insensitivity of depth data to ambient light and coordinated the depth information of the image with RGB features to describe the related behavior. Multimodal methods for human behavior recognition fuse depth data and skeleton data, which can effectively improve action recognition rates. Depth maps are now widely used in human behavior recognition, but the collection of depth data needs to be optimized because of the time complexity of feature extraction and the space complexity of feature storage. To resolve these problems, we develop an algorithm that optimizes the frames of the depth map sequence and the resulting resource consumption, and we construct a new representation of motion features from the motion information of the centroid.
Method
First, the temporal feature vector is extracted from the time-sequence information of the depth map sequence; the centroid motion path relaxation algorithm performs de-duplication and redundancy removal on the depth images, and the spatial structure feature vector extracted from the skeleton map is spliced with the temporal vector to form the spatio-temporal feature input. Next, spatial features are extracted from a three-channel spatial feature map spliced from the original skeleton point coordinates. Finally, the fused probabilities of the spatio-temporal features and the spatial features are used for classification and recognition. The centroid motion path relaxation algorithm targets redundant information, the time complexity of feature extraction, and the space complexity of feature storage. For the skeleton data, a global motion-direction feature is proposed to fully reflect the integrity and coordination of limb movements. The extracted features are concatenated into the spatio-temporal feature vector, which is fused with and enhanced by the three-channel spatial feature map built from the original skeleton point coordinates. The effectiveness of the method is verified on the MSR-Action3D dataset.
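The following sketch illustrates this data flow under stated assumptions: `temporal_feat` and `spatial_feat` stand in for the depth-derived and skeleton-derived vectors, the three-channel map simply stacks the x/y/z joint coordinates over time, and the two streams are fused at the probability level with a weighted average (the exact fusion rule used in the paper is not specified here):

```python
import numpy as np

def spatio_temporal_vector(temporal_feat, spatial_feat):
    """Splice the depth-derived temporal vector with the skeleton-derived
    spatial structure vector into one spatio-temporal feature vector."""
    return np.concatenate([temporal_feat, spatial_feat])

def skeleton_coord_map(joints):
    """Build a three-channel spatial feature map from raw joint coordinates.

    joints: array of shape (T, J, 3) holding x/y/z for J joints over T frames.
    Returns a (3, T, J) 'image' whose channels are the x, y and z planes.
    """
    return np.transpose(np.asarray(joints, dtype=np.float32), (2, 0, 1))

def fuse_probabilities(p_spatio_temporal, p_spatial, w=0.5):
    """Late fusion of the two streams' class-probability vectors.

    w is an assumed mixing weight; returns the fused distribution and the
    predicted class index.
    """
    p = w * np.asarray(p_spatio_temporal) + (1.0 - w) * np.asarray(p_spatial)
    return p, int(np.argmax(p))
```

Each stream's classifier produces a class-probability vector; `fuse_probabilities` then selects the class with the highest fused score.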
Result
Experimental setting 1 demonstrates that the proposed method is 0.826 0% higher than the depth motion map (DMM)-local binary pattern (LBP) algorithm, 1.015 2% higher than DMM-CRC (collaborative representation classifier), 3.450 1% higher than the DMM-gradient local auto-correlation (DMM-GLAC) algorithm, 0.605 8% higher than the EigenJoints algorithm, and 10.624 5% higher than the space-time auto-correlation of gradients (STACOG) algorithm. After removing redundancy, the result in experimental setting 1 improves by a further 0.126 1%. Cross-validation in experimental setting 2 shows that the average classification recognition rate over the three subsets is 95.743 2%, which is 2.443 2% higher than the Multi-fused method, 4.763 2% higher than the CovP3DJ method, 0.343 2% higher than the D3D-LSTM method, and 0.213 2% higher than the Joint Subset Selection method. On the overall dataset, the method is 2.030 3% higher than the low-latency method, 0.240 3% higher than the combination-of-deep-models method, and 2.340 3% higher than the complex-network-coding method, achieving a classification recognition rate of 93.040 3% on the complete dataset.
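For reference, the three-subset evaluation on MSR-Action3D can be sketched as follows; the odd/even cross-subject split and the `train_fn`/`predict_fn` placeholders are assumptions standing in for the standard protocol and the fused classifier above, and the reported 95.743 2% corresponds to the mean over the three subset accuracies:

```python
import numpy as np

def cross_subject_accuracy(samples, train_fn, predict_fn):
    """Accuracy (%) under a cross-subject split of one MSR-Action3D subset.

    samples: list of (features, label, subject_id) tuples.
    train_fn / predict_fn: placeholders for the fused classifier above.
    Odd-numbered subjects train, even-numbered subjects test (assumed split).
    """
    train = [(f, y) for f, y, s in samples if s % 2 == 1]
    test = [(f, y) for f, y, s in samples if s % 2 == 0]
    model = train_fn(train)
    hits = sum(predict_fn(model, f) == y for f, y in test)
    return 100.0 * hits / len(test)

# Average recognition rate over the AS1/AS2/AS3 subsets:
# mean_rate = np.mean([cross_subject_accuracy(s, fit, predict)
#                      for s in (as1, as2, as3)])
```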
Conclusion
The proposed algorithm improves recognition after redundancy removal, and the extracted features have low mutual correlation and complement each other well, which effectively improves the accuracy of classification recognition.
Keywords: action recognition; centroid motion; key temporal information; spatio-temporal feature representation; multimodal fusion
Bian W, Tao D C and Rui Y. 2012. Cross-domain human action recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2): 298-307 [DOI: 10.1109/TSMCB.2011.2166761]
Cai X Y, Zhou W G, Wu L, Luo J B and Li H Q. 2016. Effective active skeleton representation for low latency human action recognition. IEEE Transactions on Multimedia, 18(2): 141-154 [DOI: 10.1109/TMM.2015.2505089]
Chao X, Hou Z J, Li X, Liang J Z, Huan J and Liu H Y. 2020. Action recognition under depth spatial-temporal energy feature representation. Journal of Image and Graphics, 25(4): 836-850 [DOI: 10.11834/jig.190351]
Chen C, Hou Z J, Zhang B C, Jiang J J and Yang Y. 2015. Gradient local auto-correlations and extreme learning machine for depth-based activity recognition//Proceedings of the 11th International Symposium on Advances in Visual Computing. Las Vegas, USA: Springer: 613-623 [DOI: 10.1007/978-3-319-27857-5_55]
Chen C, Liu K and Kehtarnavaz N. 2016. Real-time human action recognition based on depth motion maps. Journal of Real-Time Image Processing, 12(1): 155-163 [DOI: 10.1007/s11554-013-0370-1]
Chen C, Zhang B C, Hou Z J, Jiang J J, Liu M Y and Yang Y. 2017. Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features. Multimedia Tools and Applications, 76(3): 4651-4669 [DOI: 10.1007/s11042-016-3284-7]
El-Ghaish H A, Shoukry A and Hussein M E. 2018. CovP3DJ: skeleton-parts-based-covariance descriptor for human action recognition//Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. Funchal, Portugal: SciTePress: 343-350 [DOI: 10.5220/0006625703430350]
Gong C and Wu G. 2021. Design of cerebral palsy rehabilitation training system based on human-computer interaction//Proceedings of 2021 International Wireless Communications and Mobile Computing (IWCMC). Harbin, China: IEEE: 621-625 [DOI: 10.1109/IWCMC51323.2021.9498976]
He J Y, Lei J and Li G H. 2021. Temporal action detection based on feature pyramid hierarchies. Journal of Image and Graphics, 26(7): 1637-1647 [DOI: 10.11834/jig.200495]
Hirota K and Komuro T. 2021. Grasping action recognition in VR environment using object shape and position information//Proceedings of 2021 IEEE International Conference on Consumer Electronics (ICCE). Las Vegas, USA: IEEE: #9427608 [DOI: 10.1109/ICCE50685.2021.9427608]
Hu K J, Jiang M and Kong J. 2018. Human action recognition based on mixed joints feature. Transducer and Microsystem Technologies, 37(3): 138-140, 144 [DOI: 10.13873/J.1000-9787(2018)03-0138-03]
Jalal A, Kim Y H, Kim Y J, Kamal S and Kim D. 2017. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognition, 61: 295-308 [DOI: 10.1016/j.patcog.2016.08.003]
Keçeli A S, Kaya A and Can A B. 2018. Combining 2D and 3D deep models for action recognition with depth information. Signal, Image and Video Processing, 12(6): 1197-1205 [DOI: 10.1007/s11760-018-1271-3]
Kobayashi T and Otsu N. 2012. Motion recognition using local auto-correlation of space-time gradients. Pattern Recognition Letters, 33(9): 1188-1195 [DOI: 10.1016/j.patrec.2012.01.007]
Li R F, Wang L L and Wang K. 2014. A survey of human body action recognition. Pattern Recognition and Artificial Intelligence, 27(1): 35-48 [DOI: 10.3969/j.issn.1003-6059.2014.01.005]
Liu J T and Che Y L. 2021. Action recognition for sports video analysis using part-attention spatio-temporal graph convolutional network. Journal of Electronic Imaging, 30(3): #033017 [DOI: 10.1117/1.JEI.30.3.033017]
Liu L, Shao L and Rockett P. 2013. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition. Pattern Recognition, 46(7): 1810-1818 [DOI: 10.1016/j.patcog.2012.10.004]
Liu T T, Li Y P and Zhang L. 2019. Human action recognition based on multi-perspective depth motion maps. Journal of Image and Graphics, 24(3): 400-409 [DOI: 10.11834/jig.180375]
Liu T Y, Lu Z, Sun Y F, Liu F, He B M and Zhong J. 2020. Working activity recognition approach based on 3D deep convolutional neural network. Computer Integrated Manufacturing Systems, 26(8): 2143-2156 [DOI: 10.13196/j.cims.2020.08.015]
Ma Y X, Tan L, Dong X and Yu C C. 2019. Action recognition for intelligent monitoring. Journal of Image and Graphics, 24(2): 282-290 [DOI: 10.11834/jig.180392]
Mahjoub A B and Atri M. 2016. Human action recognition using RGB data//Proceedings of the 11th International Design and Test Symposium (IDT). Hammamet, Tunisia: IEEE: 83-87 [DOI: 10.1109/IDT.2016.7843019]
Maurice C, Madrigal F, Monin A and Lerasle F. 2019. A new Bayesian modeling for 3D human-object action recognition//Proceedings of the 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Taipei, China: IEEE: #8909873 [DOI: 10.1109/AVSS.2019.8909873]
Mudgal M, Punj D and Pillai A. 2021. Suspicious action detection in intelligent surveillance system using action attribute modelling. Journal of Web Engineering, 20(1): 129-146 [DOI: 10.13052/jwe1540-9589.2017]
Nguyen T N, Pham D T, Le T L, Vu H and Tran T H. 2018. Novel skeleton-based action recognition using covariance descriptors on most informative joints//Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE). Ho Chi Minh City, Vietnam: IEEE: 50-55 [DOI: 10.1109/KSE.2018.8573421]
Oreifej O and Liu Z C. 2013. HON4D: histogram of oriented 4D normals for activity recognition from depth sequences//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 716-723 [DOI: 10.1109/CVPR.2013.98]
Pham D T, Nguyen T N, Le T L and Vu H. 2019. Analyzing role of joint subset selection in human action recognition//Proceedings of the 6th NAFOSTED Conference on Information and Computer Science (NICS). Hanoi, Vietnam: IEEE: 61-66 [DOI: 10.1109/NICS48868.2019.9023859]
Ren Z L, Zhang Q S, Qiao P Y, Niu M L, Gao X Y and Cheng J. 2020. Joint learning of convolution neural networks for RGB-D-based human action recognition. Electronics Letters, 56(21): 1112-1115 [DOI: 10.1049/el.2020.2148]
Shen X P and Ding Y R. 2022. Human skeleton representation for 3D action recognition based on complex network coding and LSTM. Journal of Visual Communication and Image Representation, 82: #103386 [DOI: 10.1016/j.jvcir.2021.103386]
Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A and Blake A. 2011. Real-time human pose recognition in parts from single depth images//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, USA: IEEE: 1297-1304 [DOI: 10.1109/CVPR.2011.5995316]
Singh R, Dhillon J K, Kushwaha A K S and Srivastava R. 2019. Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition. Multimedia Tools and Applications, 78(21): 30599-30614 [DOI: 10.1007/s11042-018-6425-3]
Sun B, Wang S F, Kong D H, Wang L C and Yin B C. 2022. Real-time human action recognition using locally aggregated kinematic-guided skeletonlet and supervised hashing-by-analysis model. IEEE Transactions on Cybernetics, 52(6): 4837-4849 [DOI: 10.1109/TCYB.2021.3100507]
Wang H R, Yuan C F, Shen J F, Yang W K and Ling H B. 2018. Action unit detection and key frame selection for human activity prediction. Neurocomputing, 318: 109-119 [DOI: 10.1016/j.neucom.2018.08.037]
Xia L, Chen C C and Aggarwal J K. 2012. View invariant human action recognition using histograms of 3D joints//Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence, USA: IEEE: 20-27 [DOI: 10.1109/CVPRW.2012.6239233]
Xu Y, Hou Z J, Liang J Z, Chen C, Jia L and Song Y. 2019. Action recognition using weighted fusion of depth images and skeleton's key frames. Multimedia Tools and Applications, 78(1): 25063-25078 [DOI: 10.1007/s11042-019-7593-5]
Yang X D and Tian Y L. 2012. EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor//Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence, USA: IEEE: 14-19 [DOI: 10.1109/CVPRW.2012.6239232]
Yang X D, Zhang C Y and Tian Y L. 2012. Recognizing actions using depth motion maps-based histograms of oriented gradients//Proceedings of the 20th ACM International Conference on Multimedia. Nara, Japan: ACM: 1057-1060 [DOI: 10.1145/2393347.2396382]
Yu J H, Gao H W, Yang W, Jiang Y Q, Chin W, Kubota N and Ju Z J. 2020. A discriminative deep model with feature fusion and temporal attention for human action recognition. IEEE Access, 8: 43243-43255 [DOI: 10.1109/ACCESS.2020.2977856]