Action recognition under depth spatial-temporal energy feature representation
2020, Vol. 25, No. 4, Pages 836-850
Received: 2019-07-10; Revised: 2019-08-26; Accepted: 2019-09-02; Published in print: 2020-04-16
DOI: 10.11834/jig.190351
Objective
Human action recognition from depth map sequences is an important research area in machine vision and artificial intelligence. Existing studies suffer from excessive redundant information in depth map sequences and from missing temporal information in the generated feature maps. To address the redundancy problem, this study proposes a key frame algorithm that improves the computational efficiency of human action recognition algorithms. To address the missing temporal information, it presents a new feature representation for depth map sequences, the depth spatial-temporal energy map (DSTEM), which highlights the temporal character of human action features.
Method
The key frame algorithm removes redundant frames from the depth map sequence according to the redundancy coefficients of the differential image sequence, yielding a key frame sequence that suffices to describe the human action. The DSTEM algorithm builds an energy field according to the shape and motion characteristics of the human body to obtain the body's energy information, and then projects this energy information onto three orthogonal axes to obtain the DSTEM.
Result
Experimental results on the MSR_Action3D dataset show that the key frame algorithm reduces redundancy, and the computational efficiency of each algorithm improves by 20%~30% after key frame processing. The histogram of oriented gradient (HOG) features extracted from DSTEM not only reach a recognition accuracy of 95.54% on the database containing only positive actions, but also maintain an accuracy of 82.14% on the database containing both positive and reverse actions.
Conclusion
The key frame algorithm reduces redundant information in depth map sequences and speeds up feature map extraction. DSTEM not only preserves the spatial information of human actions highlighted by the energy field, but also completely records their temporal information, and thus maintains high recognition accuracy on action data that carry temporal information.
Objective
Action recognition is a research hotspot in machine vision and artificial intelligence. It has been applied to human-computer interaction, biometrics, health monitoring, video surveillance, somatosensory games, robotics, and other fields. Early studies of action recognition were mainly performed on color video sequences acquired by RGB cameras. However, color video sequences are sensitive to illumination changes. With the development of imaging technology, and especially the launch of depth cameras, researchers have begun to conduct human action recognition studies on depth map sequences obtained by depth cameras. However, numerous problems remain, such as excessive redundant information in the depth map sequences and missing temporal information in the generated feature maps. These problems decrease the computational efficiency of human action recognition algorithms and reduce the final recognition accuracy. To address the problem of excessive redundant information in the depth map sequence, this study proposes a key frame algorithm that removes redundant frames from the depth map sequence, which improves the computational efficiency of human action recognition algorithms; at the same time, the feature map still represents the human action accurately after key frame processing. To address the problem of missing temporal information in the feature map generated from the depth map sequence, this study presents a new representation, namely, the depth spatial-temporal energy map (DSTEM), which completely preserves the temporal information of the depth map sequence. DSTEM improves the accuracy of human action recognition on databases with temporal information.
Method
The key frame algorithm first performs an image difference operation between adjacent frames of the depth map sequence to produce a differential image sequence. Next, a redundancy coefficient is computed for each frame of the differential image sequence. Then, the frame with the maximum redundancy coefficient is located in the depth map sequence and deleted. Finally, these steps are repeated several times to obtain a key frame sequence that expresses the human action. In this way, the algorithm removes redundant information by discarding redundant frames of the depth map sequence.
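As a concrete illustration of the procedure just described, the following Python sketch implements key frame selection under one plausible reading of the abstract. The redundancy coefficient formula is an assumption (the abstract does not specify it); here a frame that differs little from its predecessor is treated as highly redundant, and `num_key_frames` is a hypothetical parameter.

```python
import numpy as np

def redundancy_coefficients(frames):
    """Illustrative redundancy coefficient (assumed, not the paper's exact
    formula): a frame that differs little from its predecessor carries
    little new motion information, so its coefficient is high."""
    diffs = [np.abs(frames[i].astype(np.float32) -
                    frames[i - 1].astype(np.float32)).mean()
             for i in range(1, len(frames))]
    # The first frame anchors the sequence, so give it zero redundancy.
    return np.array([0.0] + [1.0 / (1.0 + d) for d in diffs])

def extract_key_frames(frames, num_key_frames):
    """Repeatedly delete the frame with the maximum redundancy coefficient
    (recomputed after every deletion) until num_key_frames frames remain."""
    frames = list(frames)
    while len(frames) > num_key_frames:
        coeffs = redundancy_coefficients(frames)
        del frames[int(np.argmax(coeffs))]  # drop the most redundant frame
    return frames

# Usage: depth_seq is a list of (H, W) depth maps from one action sample.
# key_seq = extract_key_frames(depth_seq, num_key_frames=20)
```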
The DSTEM algorithm first builds an energy field of the human body, according to the shape and motion characteristics of the body, to obtain the energy information of the human action. Next, the energy information is projected onto three orthogonal Cartesian planes to generate 2D projection maps from three viewpoints. Subsequently, two of the 2D projection maps are selected and projected onto three orthogonal axes to generate 1D energy distribution lists. Finally, the 1D energy distribution lists are spliced in temporal order to form the DSTEM on the three orthogonal axes. DSTEM reflects the temporal information of human action through the projection of the action's energy information onto the three orthogonal axes. Compared with previous feature map algorithms, DSTEM not only preserves the spatial contour of the human action but also completely records its temporal information.
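The abstract describes DSTEM construction only at a high level; the sketch below is one possible Python interpretation. The energy field definition (silhouette weighted by frame-to-frame depth change) and the depth-axis histogram used for the third projection are labeled assumptions, not the paper's exact formulas.

```python
import numpy as np

def energy_field(depth_frame, prev_frame):
    """Assumed energy field: combines body shape (the depth silhouette)
    with motion (frame-to-frame depth change). The paper's exact field
    definition is not given in the abstract."""
    shape = (depth_frame > 0).astype(np.float32)  # body silhouette
    motion = np.abs(depth_frame.astype(np.float32) -
                    prev_frame.astype(np.float32))
    return shape * (1.0 + motion)

def dstem(frames, depth_bins=64):
    """Per frame, project the energy field onto three orthogonal axes and
    splice the 1D distributions in temporal order, so each column of an
    output map is one frame's energy distribution along that axis."""
    x_lists, y_lists, z_lists = [], [], []
    for t in range(1, len(frames)):
        e = energy_field(frames[t], frames[t - 1])  # (H, W) energy map
        x_lists.append(e.sum(axis=0))               # image x axis
        y_lists.append(e.sum(axis=1))               # image y axis
        mask = frames[t] > 0                        # depth (z) axis as histogram
        z, _ = np.histogram(frames[t][mask], bins=depth_bins, weights=e[mask])
        z_lists.append(z)
    return (np.stack(x_lists, axis=1),              # DSTEM on the x axis
            np.stack(y_lists, axis=1),              # DSTEM on the y axis
            np.stack(z_lists, axis=1))              # DSTEM on the z axis
```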
Result
In this study, the public dataset MSR_Action3D is used to evaluate the effectiveness of the proposed methods. The experimental results show that the key frame algorithm removes redundant information from the depth map sequence, and the computational efficiency of each feature map algorithm improves after key frame processing. In particular, the DSTEM algorithm improves its computational efficiency by nearly 30% after key frame processing because DSTEM is sensitive to redundant frames in the depth map sequence. After key frame processing, the recognition accuracy of each algorithm also improves; the accuracy of DSTEM rises noticeably in every test, by nearly 5%. The experimental results further show that DSTEM-HOG (histogram of oriented gradient) achieves, or ties, the highest human action recognition accuracy in all tests. DSTEM-HOG reaches an accuracy of 95.54% on the database with only positive actions, higher than that of the other algorithms, which indicates that DSTEM completely preserves the spatial information of the depth map sequence. Moreover, DSTEM-HOG maintains an accuracy of 82.14% on the database with both positive and reverse actions, nearly 40% higher than the other algorithms: its recognition rate is 34% higher than that of MHI (motion history image)-HOG, which retains only part of the temporal information, and 50% higher than that of DMM (depth motion map)-HOG, which retains no temporal information. These results indicate that DSTEM completely describes the temporal information of the depth map sequence.
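For readers reproducing this kind of pipeline, the sketch below shows how HOG features might be extracted from the three DSTEM maps and fed to a classifier. The abstract does not state the classifier or the HOG parameters, so scikit-image, scikit-learn, the linear SVM, the 64×64 resize, and all HOG settings here are assumptions.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def dstem_hog_feature(dstem_maps, size=(64, 64)):
    """Concatenate HOG descriptors of the three per-axis DSTEMs. Sequences
    differ in length, so each map is first resized to a fixed shape
    (the size and HOG parameters are illustrative, not the paper's)."""
    feats = []
    for m in dstem_maps:  # one map per orthogonal axis
        m = resize(m, size, anti_aliasing=True)
        feats.append(hog(m, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2), block_norm='L2-Hys'))
    return np.concatenate(feats)

# Usage: X_train/X_test are lists of (x, y, z) DSTEM map triples,
# y_train/y_test the action labels.
# clf = SVC(kernel='linear')
# clf.fit([dstem_hog_feature(x) for x in X_train], y_train)
# accuracy = clf.score([dstem_hog_feature(x) for x in X_test], y_test)
```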
Conclusion
The experimental results show that the proposed methods are effective. The key frame algorithm reduces the redundant frames in the depth map sequence and improves the computational efficiency of human action recognition algorithms; after key frame processing, recognition accuracy also improves markedly. DSTEM not only retains the spatial information of actions, which is highlighted by the energy field, but also completely records the temporal information of actions. In addition, DSTEM maintains the highest recognition accuracy on conventional databases and superior accuracy on databases with temporal information. These results prove that DSTEM completely retains both the spatial and the temporal information of human action, and that it can distinguish between positive and reverse actions.
Ao L, Shu J W and Li M Q. 2010. Data deduplication techniques. Journal of Software, 21(5): 916-929 [DOI:10.3724/SP.J.1001.2010.03761]
Bergmeir C, Hyndman R J and Koo B. 2018. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics and Data Analysis, 120: 70-83 [DOI:10.1016/j.csda.2017.11.003]
Bobick A F and Davis J W. 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3): 257-267 [DOI:10.1109/34.910878]
Chen C, Jafari R and Kehtarnavaz N. 2015a. Improving human action recognition using fusion of depth camera and inertial sensors. IEEE Transactions on Human-Machine Systems, 45(1): 51-61 [DOI:10.1109/THMS.2014.2362520]
Chen C, Jafari R and Kehtarnavaz N. 2015b. UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor//Proceedings of 2015 International Conference on Image Processing. Quebec City, Canada: IEEE: 168-172 [DOI:10.1109/ICIP.2015.7350781]
Chen C, Jafari R and Kehtarnavaz N. 2015c. Action recognition from depth sequences using depth motion maps-based local binary patterns//Proceedings of 2015 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, HI, USA: IEEE: 1092-1099 [DOI:10.1109/WACV.2015.150]
Li R F, Wang L L and Wang K. 2014. A survey of human body action recognition. Pattern Recognition and Artificial Intelligence, 27(1): 35-48 [DOI:10.3969/j.issn.1003-6059.2014.01.005]
Li W Q, Zhang Z Y and Liu Z C. 2010. Action recognition based on a bag of 3D points//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. San Francisco, CA, USA: IEEE: 9-14 [DOI:10.1109/CVPRW.2010.5543273]
Lu J, Lee W S, Gan H and Hu X W. 2018. Immature citrus fruit detection based on local binary pattern feature and hierarchical contour analysis. Biosystems Engineering, 171: 78-90 [DOI:10.1016/j.biosystemseng.2018.04.009]
Oreifej O and Liu Z C. 2013. HON4D: histogram of oriented 4D normals for activity recognition from depth sequences//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE: 716-723 [DOI:10.1109/CVPR.2013.98]
Peng Y X, Zhao Y Z and Zhang J C. 2019. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology, 29(3): 773-786 [DOI:10.1109/TCSVT.2018.2808685]
Radman A, Zainal N and Suandi S A. 2017. Automated segmentation of iris images acquired in an unconstrained environment using HOG-SVM and GrowCut. Digital Signal Processing, 64: 60-70 [DOI:10.1016/j.dsp.2017.02.003]
Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A and Blake A. 2011. Real-time human pose recognition in parts from single depth images//Proceedings of CVPR 2011. Providence, RI, USA: IEEE: 1297-1304 [DOI:10.1109/CVPR.2011.5995316]
Sun G F, Wu L, Liu Q, Zhu C and Chen E H. 2013. Recommendations based on collaborative filtering by exploiting sequential behaviors. Journal of Software, 24(11): 2721-2733 [DOI:10.3724/SP.J.1001.2013.04478]
Vemulapalli R, Arrate F and Chellappa R. 2014. Human action recognition by representing 3D skeletons as points in a Lie group//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE: 588-595 [DOI:10.1109/CVPR.2014.82]
Xia L and Aggarwal J K. 2013. Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE: 2834-2841 [DOI:10.1109/CVPR.2013.365]
Xu Y, Hou Z J, Liang J Z, Chen C, Jia L and Song Y. 2018. Action recognition using weighted fusion of depth images and skeleton's key frames. Journal of Computer-Aided Design and Computer Graphics, 30(7): 1313-1320 [DOI:10.3724/SP.J.1089.2018.16771]
Yang X D, Zhang C Y and Tian Y L. 2012. Recognizing actions using depth motion maps-based histograms of oriented gradients//Proceedings of the 20th International Conference on Multimedia. Nara, Japan: ACM: 1057-1060 [DOI:10.1145/2393347.2396382]
Zhang J C and Peng Y X. 2019a. Hierarchical vision-language alignment for video captioning//Proceedings of the 25th International Conference on Multimedia Modeling. Thessaloniki, Greece: Springer: 42-54 [DOI:10.1007/978-3-030-05710-7_4]
Zhang J C and Peng Y X. 2019b. Object-aware aggregation with bidirectional temporal graph for video captioning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE: 8319-8328 [DOI:10.1109/CVPR.2019.00852]
Zhang P and Wang R S. 2005. A survey of detecting regions of interest in a static image. Journal of Image and Graphics, 10(2): 142-148 [DOI:10.3969/j.issn.1006-8961.2005.02.002]
Zhang R J and Wang G J. 2005. Constrained Bézier curves' best multi-degree reduction in the L2-norm. Progress in Natural Science, 15(9): 843-850 [DOI:10.1080/10020070512331343010]
Zhang T and Ping X J. 2004. Reliable detection of spatial LSB steganography based on difference histogram. Journal of Software, 15(1): 151-158 [DOI:10.13328/j.cnki.jos.2004.01.018]
Zhao Y Z and Peng Y X. 2017. Saliency-guided video classification via adaptively weighted learning//Proceedings of 2017 IEEE International Conference on Multimedia and Expo. Hong Kong, China: IEEE: 847-852 [DOI:10.1109/ICME.2017.8019343]
Zhou X C, Tu D W, Chen Y, Zhao Q J and Zhang Y C. 2010. Moving object detection under dynamic background based on phase-correlation and differential multiplication. Chinese Journal of Scientific Instrument, 31(5): 980-983 [DOI:10.19650/j.cnki.cjsi.2010.05.004]