Published: 2018-04-16
Image Understanding and Computer Vision
Received: 2017-07-07; Revised: 2017-09-14
Foundation item: National Natural Science Foundation of China (61571345, 91538101, 61550110247)
First author: RAN Xianyu (1990-), female, master's student in computer application technology, School of Computer Science and Technology, Xidian University; her main research interest is action recognition. E-mail: ranxianyu@foxmail.com
CLC number: TP37
Document code: A
Article ID: 1006-8961(2018)04-0519-07
Abstract
Objective Human action recognition based on 3D skeletons has been a popular topic in computer vision, the goal of which is to automatically segment, capture, and recognize human actions. It has been widely applied in real-world settings: over the past several decades it has served surveillance, video games, robotics, human-human and human-computer interaction, and health care, and researchers have explored it since the 1960s. Three-dimensional data can be acquired in four ways: marker-based motion capture systems; reconstruction of 3D information from multi-view 2D image sequences; range sensors; and RGB videos. However, extracting data with motion capture or reconstruction is inconvenient. Range sensors are expensive and difficult to use in human environments, acquire data slowly, and estimate distance poorly. RGB images usually provide only the appearance of the objects in the scene; with such limited information, certain problems, such as separating a foreground and background of similar color and texture, are difficult, if not impossible, to solve. RGB data are also highly sensitive to illumination, viewpoint, occlusion, clutter, and dataset diversity, and RGB video sensors cannot capture all the information that humans need. The rapid development of depth sensors such as the Microsoft Kinect in recent years has provided not only color image data but also 3D depth image information. Depth images record the distance between the sensor and the body, yielding considerable information, and real-time skeleton tracking combined with a support vector machine can recognize various postures and extract key information.
Research on computer vision algorithms based on 3D skeletons has thus attracted significant attention in the last few years; the many skeleton-based algorithms studied have produced numerous achievements and contributions. However, existing action recognition algorithms select a fixed joint as the coordinate center, which leads to a low recognition rate. An adaptive skeleton-center algorithm for human action recognition is proposed to solve this low-accuracy problem. Method The algorithm loads the frames of skeleton action sequences from a human action dataset, removes redundant frames, and obtains the original coordinate matrix by preprocessing the sequences. Rigid-vector and joint-angle features are generated from the original coordinate matrix. An adaptive value is determined from the changes in the rigid-vector and joint-angle values; the coordinate center is adaptively selected according to this value and used to renormalize the original matrix. The action coordinate matrix is then denoised with dynamic time warping, the Fourier temporal pyramid method reduces its temporal displacement and noise, and a support vector machine classifies the result. Result Compared with existing algorithms, such as histograms of 3D joints (HOJ3D), conditional random field (CRF), EigenJoints, profile hidden Markov model (HMM), relation matrix of 3D rigid bodies + principal geodesic distance, and actionlet algorithms, the proposed algorithm performs better on different datasets. On the UTKinect dataset, its action recognition rate is 4.28% higher than that of the HOJ3D algorithm and 3.48% higher than that of the CRF algorithm.
On the MSRAction3D dataset, the action recognition rate of the proposed algorithm is 9.57% higher than that of the HOJ3D algorithm, 2.07% higher than that of the profile HMM algorithm, and 6.17% higher than that of the EigenJoints algorithm. Action Sets (AS)1, AS2, and AS3 are subsets of MSRAction3D. On AS2 the proposed algorithm does not match the best competing algorithms, but on AS1 and AS3 its recognition rates are high. Conclusion The proposed algorithm addresses the low accuracy of existing action recognition algorithms, which stems from adopting a fixed joint as the coordinate center. Simulation results show that it effectively improves the accuracy of human action recognition, with a recognition rate higher than those of existing algorithms. On the UTKinect dataset, its recognition rate is at least 3% higher than those of other algorithms, and the single-action recognition rate reaches 90%. On the MSRAction3D dataset, the algorithm shows advantages on the AS1 and AS3 subsets, but its rate on AS2 is not ideal, particularly for upper-limb actions, so the algorithm still needs improvement there. It is generally efficient for single-action recognition; complex action recognition is the next research direction.
Keywords
human action recognition; skeleton sequence; feature extraction; adaptive; renormalization
0 Introduction
Action recognition has long been an active research topic in computer vision and has been widely applied in surveillance, video games, robotics, human-computer interaction, health care, and other fields. Occlusion, illumination changes, viewpoint changes, and background clutter make its development difficult. The advance of depth sensors has opened new opportunities, and many researchers have been studying 3D-skeleton-based algorithms. Rahmani et al. [1] proposed a novel robust nonlinear knowledge-transfer model for human action recognition. Lu et al. [2] obtained complete skeleton information from depth images captured by a depth sensor. Xia et al. [3] recognized human actions with histograms of 3D joint positions. Han et al. [4] proposed a hierarchical-learning method for human action recognition, comprising feature extraction based on hierarchical nonlinear dimensionality reduction and action modeling based on a cascaded discriminative model. Yang et al. [5] proposed a joint-position-difference method, a new functional feature of joints that combines static posture, motion information, and offset. Vemulapalli et al. [6] represented a person's skeleton joints as points in a Lie group, explicitly modeling the 3D geometric relations between body parts with rotations and translations. Ding et al. [7] computed strongly similar postures with the relation matrix of 3D rigid bodies, built representative postures by spectral clustering of the sample data, and recognized actions from discrete symbol sequences generated by global linear feature functions constructed from the spectral embedding.
All of the above methods share one trait: when normalizing all joints to model a human action, they use a single, predefined fixed coordinate center. However, a fixed coordinate center does not suit every moving joint; when a joint moves little relative to the coordinate center, the recognized information carries errors. To address this shortcoming, we propose an algorithm that adaptively selects the skeleton coordinate center. According to the motion state of each action, the algorithm parameterizes the joint motion features during feature extraction and adaptively selects the coordinate center, effectively improving the recognition rate of human actions.
The overall framework is shown in Fig. 1. First, the adaptive-skeleton-center algorithm extracts features: the angular velocity and angular acceleration of each rigid-body joint are computed with the hip and the neck as centers and compared to obtain the feature values, and the coordinate center of each action is re-determined. Dynamic time warping (DTW) [8] then handles variations in execution rate, the Fourier temporal pyramid (FTP) representation [9] removes high-frequency coefficients, and an SVM performs the final classification. The shaded part of Fig. 1 is the core of our algorithm; without it, the figure reduces to the original framework [6].
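To make the high-frequency removal step concrete, here is a minimal stand-in sketch: a plain DFT of a 1D joint trajectory that keeps only the lowest-frequency coefficients, which is the effect the FTP representation exploits (the function name and the coefficient count are our assumptions, not the paper's implementation):

```python
import cmath

def low_freq_coeffs(signal, k):
    """Discrete Fourier transform of a 1D trajectory, keeping only the
    first k (lowest-frequency) complex coefficients; the discarded
    remainder is the high-frequency, noise-dominated part."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n))
            for f in range(k)]

# A constant trajectory has all its energy in the DC coefficient.
coeffs = low_freq_coeffs([1.0, 1.0, 1.0, 1.0], 2)
```

In the full method this truncation is applied recursively over a pyramid of temporal segments rather than over the whole sequence at once.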
1 Representation of human skeleton joints
As shown in Fig. 2(a), a rigid body is defined by any two adjacent skeleton joints, so a human action can be viewed as the trajectories of a set of rigid bodies formed by skeleton joints, the skeleton being represented by 3D joint positions. If a skeleton has M joints, there are M-1 rigid bodies, and a human pose can be represented by the relation matrix
$ \boldsymbol{R}_F = (\boldsymbol{R}_{i,j})_{(M-1) \times (M-1)} $ | (1) |
where $\boldsymbol{R}_{i,j}$ describes the relative geometric relation between rigid bodies $i$ and $j$ in frame $F$.
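As a toy illustration of this representation, the sketch below builds the M-1 rigid-body vectors from a list of 3D joint positions; the 4-joint chain is a made-up example, not the 16-joint topology of Fig. 2(a):

```python
# Each rigid body is the vector between two adjacent skeleton joints.
def rigid_bodies(joints, edges):
    """joints: list of (x, y, z) tuples; edges: list of (i, j) index pairs.
    Returns one 3D vector per rigid body."""
    return [tuple(joints[j][k] - joints[i][k] for k in range(3))
            for i, j in edges]

# Toy 4-joint chain -> 3 rigid bodies (a skeleton with M joints has M-1).
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.0, 2.0, 0.0)]
chain = [(0, 1), (1, 2), (2, 3)]
bodies = rigid_bodies(joints, chain)
```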
2 Adaptive skeleton-center algorithm
As shown in Fig. 2(a), the human skeleton is represented by 16 joints and 15 rigid bodies; the joints marked in red, joint 2 (neck) and joint 8 (hip), are the two preset coordinate centers. The motion of each rigid body can be regarded as circular motion whose radius is the rigid body's length, so each rigid body has an angular velocity and an angular acceleration through which its change can be judged. These two quantities vary differently across actions, and the same joint in the same action varies differently with respect to different coordinate centers; human motion can therefore be parameterized as the joint-angle features of each action's rigid bodies under different coordinate centers.
Taking (x, y) as the displayed screen coordinates and z as the depth measurement, a series of rigid-body trajectories in 3D space can be parameterized by the matrix
$ \boldsymbol{P}_F = [\boldsymbol{R}_F, \boldsymbol{G}_{iF}] $ | (2) |
where $\boldsymbol{R}_F$ is the relation matrix of Eq. (1) and $\boldsymbol{G}_{iF}$ is the joint-angle feature of rigid body $i$ in frame $F$, defined as
$ \boldsymbol{G}_{iF} = [\alpha_{iF}, \beta_{iF}, \delta_{iF}] $ | (3) |
where $\alpha_{iF}$, $\beta_{iF}$, and $\delta_{iF}$ denote the joint angle, angular velocity, and angular acceleration of rigid-body joint $i$ in frame $F$. The joint angle is given by
$ \alpha_{iF} = \arccos\frac{\boldsymbol{r}_i \cdot \boldsymbol{r}_j}{\left| \boldsymbol{r}_i \right| \left| \boldsymbol{r}_j \right|} $ | (4) |
where $\boldsymbol{r}_i$ and $\boldsymbol{r}_j$ are the vectors of the two adjacent rigid bodies forming the joint angle. The angular velocity is given by
$ \beta_{iF} = \frac{\alpha_{i(F+1)} - \alpha_{iF}}{t_2 - t_1} $ | (5) |
where $t_1$ and $t_2$ are the time stamps of frames $F$ and $F+1$. The angular acceleration is given by
$ \delta_{iF} = \frac{\beta_{i(F+1)} - \beta_{iF}}{(t_3 - t_2) - (t_2 - t_1)} $ | (6) |
where $t_3$ is the time stamp of frame $F+2$.
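Eqs. (4) and (5) can be sketched directly; the helper names below are ours, and the example vectors and timestamps are arbitrary:

```python
import math

def joint_angle(ri, rj):
    """Eq. (4): angle between the two rigid-body vectors forming a joint."""
    dot = sum(a * b for a, b in zip(ri, rj))
    norm = (math.sqrt(sum(a * a for a in ri))
            * math.sqrt(sum(a * a for a in rj)))
    return math.acos(dot / norm)

def angular_velocity(alpha_next, alpha_cur, t1, t2):
    """Eq. (5): finite-difference angular velocity between two frames."""
    return (alpha_next - alpha_cur) / (t2 - t1)

# Perpendicular rigid bodies give a joint angle of pi/2 rad.
a = joint_angle((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```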
When a rigid-body joint angle changes noticeably relative to a coordinate center, its angular velocity and angular acceleration vary over wide ranges. Table 1 lists these ranges for the main rigid-body joint angles of two actions; angular velocity is in rad/s and angular acceleration in rad/s². Rigid-body joint 3.6.11 denotes the joint angle between the rigid body formed by joints 3 and 6 and the rigid body formed by joints 6 and 11; hand catch(hip) denotes the hand catch action computed with hip as the coordinate center, and hand catch(neck) the same action with neck as the center. For hand catch(hip), the angular velocities and angular accelerations of moving joints 3.6.11, 1.3.6, 2.4.7, and 13.9.8 vary more than those of the other rigid bodies; for draw tick, the angular velocities and angular accelerations of the markedly moving joint angles 3.6.11, 4.7.12, 10.8.5, 2.4.7, 1.3.6, and 2.5.8 vary widely.
Table 1
Ranges of angular velocity and angular acceleration of joint angles of different actions relative to the two coordinate centers
| Rigid-body joint angle | hand catch(hip) ang. vel. | hand catch(hip) ang. acc. | hand catch(neck) ang. vel. | hand catch(neck) ang. acc. | draw tick(hip) ang. vel. | draw tick(hip) ang. acc. | draw tick(neck) ang. vel. | draw tick(neck) ang. acc. |
|---|---|---|---|---|---|---|---|---|
| 3.6.11 | 0.18~8.97 | 5.31~269.25 | 0.15~8.69 | 4.62~260.72 | 0.52~15.90 | 15.70~477.15 | 0.54~14.61 | 16.12~438.32 |
| 4.7.12 | 0.08~2.90 | 2.40~87.14 | 0.08~2.89 | 2.33~86.66 | 0.09~4.32 | 2.83~129.65 | 0.13~4.46 | 4.02~133.92 |
| 1.3.6 | 0.02~3.46 | 0.54~103.82 | 0.04~3.50 | 1.23~104.93 | 0.19~3.04 | 3.26~91.34 | 0.16~2.96 | 4.79~88.69 |
| 2.4.7 | 0.14~4.02 | 4.33~120.62 | 0.14~4.02 | 4.33~120.62 | 0.14~5.96 | 4.18~178.73 | 0.14~5.96 | 4.18~178.7 |
| 2.5.8 | 0.09~2.67 | 2.72~80.19 | 0.08~2.78 | 2.38~83.39 | 0.14~3.08 | 4.27~92.31 | 0.15~3.19 | 4.36~95.80 |
| 9.8.5 | 0.25~2.38 | 7.55~71.38 | 0.21~2.03 | 6.30~60.98 | 0.18~2.80 | 5.54~84.04 | 0.10~2.93 | 2.99~87.83 |
| 10.8.5 | 0.10~2.04 | 2.99~61.32 | 0.10~2.04 | 2.99~61.32 | 0.08~5.08 | 2.48~152.50 | 0.08~5.08 | 2.48~152.5 |
| 13.9.8 | 0.02~3.48 | 0.64~104.29 | 0.05~3.55 | 1.44~105.80 | 0.15~2.02 | 4.59~60.57 | 0.11~2.46 | 3.41~73.69 |
Between the two coordinate centers, the one with respect to which a rigid body's angular velocity and angular acceleration vary over the wider range reflects the motion of the action more distinctly; the coordinate center of each action is therefore selected adaptively, as summarized in the following algorithm.
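The selection rule can be sketched as follows; this is our reading of the adaptive-value comparison, with hypothetical function names, threshold values, and data:

```python
def adaptive_center(ranges_hip, ranges_neck, vel_thr, acc_thr):
    """ranges_*: dict mapping joint-angle id -> (velocity range, acceleration
    range). Returns 'hip' or 'neck', whichever center makes more joint
    angles exceed the thresholds, i.e. shows the motion more distinctly."""
    def score(ranges):
        return sum((v > vel_thr) + (a > acc_thr) for v, a in ranges.values())
    return 'hip' if score(ranges_hip) >= score(ranges_neck) else 'neck'

# Toy example: motion is more pronounced relative to the hip center.
hip = {'3.6.11': (8.97, 269.25), '1.3.6': (3.46, 103.82)}
neck = {'3.6.11': (2.00, 50.00), '1.3.6': (1.00, 40.00)}
center = adaptive_center(hip, neck, vel_thr=3.0, acc_thr=100.0)
```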
Algorithm: human action recognition with an adaptive skeleton center
Input: skeleton action sequence frames
Output: recognition accuracy
1) Read the skeleton action sequence frames and remove redundant frames; normalize the sequence twice, with the hip joint and the neck joint as coordinate centers, to obtain the hip-centered coordinate matrix dh and the neck-centered coordinate matrix dn;
2) From the two coordinate matrices dh and dn obtained in step 1), extract the joint-angle features of each rigid body;
3) From the two coordinate matrices, compute the angular velocity and angular acceleration of each joint angle according to Eqs. (4)-(6);
4) Determine the angular-velocity and angular-acceleration thresholds of the joint angles from the sequence;
5) Compare each joint angle's angular velocity and angular acceleration against the thresholds under the hip and neck centers to obtain four adaptive values;
6) Compare the four adaptive values obtained in step 5), select the coordinate center of each action accordingly, and renormalize the original coordinate matrix;
7) For the resulting action coordinate matrix, handle execution-rate variation with dynamic time warping (DTW) [8], remove high-frequency coefficients with the Fourier temporal pyramid (FTP) representation [9], and classify with a support vector machine (SVM) to obtain the recognition accuracy.
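The rate-alignment part of step 7) can be illustrated with a bare-bones DTW distance [8] (a textbook sketch, not the paper's implementation; the FTP and SVM stages are omitted):

```python
def dtw_distance(seq_a, seq_b):
    """Classic O(len_a * len_b) dynamic time warping distance between two
    1D feature sequences, absorbing differences in execution rate."""
    inf = float('inf')
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# The same action performed at half speed still matches exactly.
dist = dtw_distance([0.0, 1.0, 2.0], [0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
```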
3 Experimental results and analysis
3.1 UTKinect-Action dataset
3.2 MSRAction3D dataset
The MSRAction3D dataset [10] consists of human action sequences captured with a depth camera similar to the Kinect, with 3D joint positions extracted from the depth sequences. It contains 20 actions performed by 10 subjects, each subject repeating every action three times, for a total of 557 action sequences. One challenge of this dataset is that many actions are very similar because the motion variation is small. It comprises three subsets, AS1, AS2, and AS3, where AS1 and AS2 group simple similar actions and AS3 groups complex actions. Table 3 lists the adaptively selected coordinate center and the recognition rate of each action, and Table 4 compares our method with others on MSRAction3D, showing that our method achieves a higher recognition rate.
Table 3
Coordinate center and recognition rate of each action on the MSRAction3D dataset
| Action set | Action | Coordinate center | Recognition rate/% |
|---|---|---|---|
| AS1 | horizontal arm wave | neck | 97.5 |
| AS1 | hammer | hip | 78.5 |
| AS1 | forward punch | neck | 98.33 |
| AS1 | high throw | hip | 98.18 |
| AS1 | hand clap | hip | 100 |
| AS1 | bend | neck | 85.67 |
| AS1 | tennis serve | neck | 97.33 |
| AS1 | pickup & throw | hip | 61.63 |
| AS2 | high arm wave | neck | 70.33 |
| AS2 | hand catch | hip | 62.33 |
| AS2 | draw x | neck | 83.01 |
| AS2 | draw tick | hip | 82.67 |
| AS2 | draw circle | neck | 37.33 |
| AS2 | two-hand wave | neck | 99.33 |
| AS2 | forward kick | neck | 75.48 |
| AS2 | side boxing | hip | 100 |
| AS3 | high throw | hip | 100 |
| AS3 | forward kick | hip | 100 |
| AS3 | side kick | neck | 92.67 |
| AS3 | jogging | neck | 94.00 |
| AS3 | tennis swing | hip | 92.67 |
| AS3 | tennis serve | neck | 99.33 |
| AS3 | golf swing | hip | 100 |
| AS3 | pickup & throw | hip | 80.82 |
Table 4
Comparison of recognition rates of our method and other algorithms on the MSRAction3D dataset
4 Conclusion
Summarizing the shortcomings of existing action recognition methods, we proposed a human action recognition algorithm with an adaptive skeleton center. It parameterizes the motion features, representing each rigid body by the variation of its angular velocity and angular acceleration; the feature extraction stage in Fig. 1 is the highlight of this work. Adaptively selecting the coordinate center according to the action's features exploits each action's most salient characteristics and improves the recognition rate. Tests on the benchmark datasets show that our method recognizes actions more accurately than existing methods. On the UTKinect dataset, its recognition rate is at least 3% higher than those of the other methods, and the average single-action recognition rate reaches 90%. On MSRAction3D, the advantage is large on subset AS1 and present on AS3, but the rate on AS2 is unsatisfactory: similar upper-limb actions are poorly recognized, which is where the method needs improvement. At present the method works well for single actions; recognizing multiple complex actions is the next research direction.
References
[1] Rahmani H, Mian A, Shah M. Learning a deep model for human action recognition from novel viewpoints[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. [DOI:10.1109/TPAMI.2017.2691768]
[2] Lu T W, Peng L, Miao S J. Human action recognition of hidden Markov model based on depth information[C]//Proceedings of the 15th International Symposium on Parallel and Distributed Computing. Piscataway: IEEE, 2016: 354-357. [DOI:10.1109/ISPDC.2016.58]
[3] Xia L, Chen C C, Aggarwal J K. View invariant human action recognition using histograms of 3D joints[C]//Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence, RI: IEEE, 2012: 20-27. [DOI:10.1109/CVPRW.2012.6239233]
[4] Han L, Xu X X, Liang W, et al. Discriminative human action recognition in the learned hierarchical manifold space[J]. Image and Vision Computing, 2010, 28(5): 836-849. [DOI:10.1016/j.imavis.2009.08.003]
[5] Yang X D, Tian Y L. Effective 3D action recognition using EigenJoints[J]. Journal of Visual Communication and Image Representation, 2014, 25(1): 2-11. [DOI:10.1016/j.jvcir.2013.03.001]
[6] Vemulapalli R, Arrate F, Chellappa R. Human action recognition by representing 3D skeletons as points in a Lie group[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH: IEEE, 2014: 588-595. [DOI:10.1109/CVPR.2014.82]
[7] Ding W W, Liu K, Li G, et al. Human action recognition using spectral embedding to similarity degree between postures[C]//Proceedings of 2016 Visual Communications and Image Processing. Chengdu: IEEE, 2016: 1-4. [DOI:10.1109/VCIP.2016.7805441]
[8] Müller M. Information Retrieval for Music and Motion[M]. Berlin, Heidelberg: Springer-Verlag, 2007: 2-5.
[9] Wang J, Liu Z C, Wu Y, et al. Mining actionlet ensemble for action recognition with depth cameras[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI: IEEE, 2012: 1290-1297. [DOI:10.1109/CVPR.2012.6247813]
[10] Ding W W, Liu K, Fu X J, et al. Profile HMMs for skeleton-based human action recognition[J]. Signal Processing: Image Communication, 2016, 42: 109-119. [DOI:10.1016/j.image.2016.01.010]