Inversion model of linear dynamic system for human action recognition
2019, Vol. 24, No. 9, pp. 1450-1457
Received: 2018-12-28; Revised: 2019-04-13; Published in print: 2019-09-16
DOI: 10.11834/jig.180657
Objective
Human action recognition has extremely broad application prospects in video surveillance, ambient assisted living, human-computer interaction, intelligent driving, and other fields. Occlusion of target objects, shadows in the video background, illumination changes, viewpoint changes, multi-scale variation, and changes in people's clothing and appearance make video processing and analysis very difficult. To address these issues, this paper uses forward and inverse time series to construct a tensor-based linear dynamical model, estimates the model parameters as action-sequence descriptors, and constructs a more complete observation matrix.
Method
First, human joint points are extracted from depth images to build forward and inverse human-skeleton sequences in tensor form. A tensor-based linear dynamical system and Tucker decomposition are then used to learn the parameter tuple (A_F, A_I, C), where C represents the spatial information of the human skeleton and A_F and A_I describe the dynamics of the forward and inverse time series, respectively. An observation matrix is constructed from the parameter tuple, so that an action can be represented as the subspace of the observation matrix, corresponding to a point on a Grassmann manifold. Finally, action recognition is completed by dictionary learning and sparse coding on the Grassmann manifold.
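The step from the learned tuple (A_F, A_I, C) to a point on the Grassmann manifold can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the stacking order of the forward and inverse blocks and all variable names are assumptions, and the dimensions are arbitrary toy values.

```python
import numpy as np

def observability_subspace(A_f, A_i, C, m=3):
    """Stack the m-order observability blocks of the forward (A_f) and
    inverse (A_i) dynamics under the output matrix C, then orthonormalize,
    so the action is represented as a linear subspace -- a point on a
    Grassmann manifold."""
    blocks = [C]
    Pf, Pi = C.copy(), C.copy()
    for _ in range(m - 1):
        Pf = Pf @ A_f          # C A_F^k terms (forward dynamics)
        blocks.append(Pf)
    for _ in range(m - 1):
        Pi = Pi @ A_i          # C A_I^k terms (inverse dynamics)
        blocks.append(Pi)
    O = np.vstack(blocks)      # finite observability matrix
    Q, _ = np.linalg.qr(O)     # orthonormal basis spanning col(O)
    return Q

# Toy usage with assumed dimensions: 4 hidden states, 6 output features.
rng = np.random.default_rng(0)
n = 4
C = rng.standard_normal((6, n))
A_f = rng.standard_normal((n, n)) * 0.5
A_i = rng.standard_normal((n, n)) * 0.5
Q = observability_subspace(A_f, A_i, C, m=3)
print(Q.shape)  # (30, 4): a 4-dimensional subspace of R^30
```

Orthonormalizing the stacked matrix is what makes the representation a well-defined subspace rather than a particular basis, which is the precondition for comparing actions on the Grassmann manifold.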
Result
Experimental results show that on the MSR-Action 3D dataset, the recognition rate of the proposed algorithm is 13.55% higher than that of the Eigenjoints algorithm, 2.79% higher than that of the local tangent bundle support vector machine (LTBSVM) algorithm, and 1% higher than that of the tensor-based linear dynamical system (tLDS) algorithm. On the UT-Kinect dataset, its recognition rate is 5.8% higher than that of LTBSVM and 1.3% higher than that of tLDS.
Conclusion
Extensive experimental evaluation verifies that the tLDS model built from forward and inverse time series effectively addresses the above problems and improves the human action recognition rate.
Objective
Human action recognition has a very wide application prospect in fields such as video surveillance, ambient assisted living, human-computer interaction, and intelligent driving. In image or video analysis, most of these tasks use color and texture cues in 2D images for recognition. However, due to occlusion, shadows, illumination changes, perspective changes, scale changes, intra-class variations, and similarities between classes, the recognition rate of human behavior is not ideal. In recent years, with the release of 3D depth cameras such as Microsoft Kinect, 3D depth data can provide pictures of scene changes, thereby improving the recognition rates for the first three challenges of human recognition. In addition, 3D depth cameras provide powerful human motion capture technology, which can output the 3D joint positions of a human skeleton. Therefore, much attention has been paid to skeleton-based action recognition. The linear dynamical system (LDS) is the most common method for encoding spatio-temporal time-series data in various disciplines because of its simplicity and efficiency. A new method is proposed to obtain the parameters of a tensor-based LDS with forward and inverse action sequences and to construct a complete observation matrix. The linear subspace of the observation matrix, which maps to a point on a Grassmann manifold, is obtained for human action recognition. In this manner, an action can be expressed as a subspace spanned by the columns of the matrix, corresponding to a point on the Grassmann manifold. On the basis of this representation, classification can be performed using dictionary learning and sparse coding.
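For context, the classical vector LDS underlying this approach admits a simple closed-form (suboptimal) parameter estimate in the style of Doretto et al.'s dynamic-texture method: SVD the observation matrix, take C from the left singular vectors, recover the hidden states, and fit the transition matrix by least squares. The sketch below shows that baseline; the paper's tensor variant replaces the SVD with a Tucker decomposition, and the function and variable names here are illustrative.

```python
import numpy as np

def learn_lds(Y, n):
    """Closed-form LDS estimation for an observation matrix Y
    (features x frames) with hidden-state dimension n."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                        # output / appearance matrix
    X = np.diag(s[:n]) @ Vt[:n, :]      # hidden state sequence
    # A maps x_t to x_{t+1}: least-squares fit A = X_{1:} X_{:-1}^+
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C

# Toy usage: 20-dimensional observations over 50 frames, 5 hidden states.
rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 50))
A, C = learn_lds(Y, n=5)
print(A.shape, C.shape)  # (5, 5) (20, 5)
```

Running the same recipe on an inverse (time-reversed) sequence yields a second transition matrix, which is the role played by A_I in the proposed parameter tuple.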
Method
Considering the dynamics and persistence of human behavior, we do not vectorize the time series in the usual way but retain its tensor structure; that is, we transform the high-dimensional data into a low-dimensional subspace to analyze the factors affecting actions from various angles (modes). In this method, human skeletons are modeled using human joint points, which are initially extracted from a depth camera recording. To preserve the original spatio-temporal information of an action video and enhance the accuracy of human action recognition, we represent the time series of skeleton motions as a three-order tensor, with each skeleton represented as a two-order tensor. With this action representation, Tucker tensor decomposition is applied for dimensionality reduction. Using the tensor-based LDS model with forward and inverse action sequences, we learn a parameter tuple (A_F, A_I, C), in which C represents the spatial appearance of skeleton information, A_F describes the dynamics of the forward time series, and A_I describes the dynamics of the inverse time series. Because human behavior has a limited duration and does not extend indefinitely in time, we use an m-order observability matrix to approximate the extended observability matrix. When m is small, it is insufficient to describe the entire action sequence; in the case of cyclic sub-actions in human behavior, even increasing m cannot simulate the follow-up action. Combining the observability matrix of the inverse action sequence with that of the forward sequence compensates for this shortcoming in describing cyclic sub-actions, improves the performance of the system, increases the completeness of the finite observation matrix, and reduces computational complexity. Thus, the finite observability matrix can be adopted as the feature descriptor of an action sequence with forward and inverse actions. In classifying points on Grassmann manifolds, a simple method is to embed the manifold into a Euclidean space through its tangent bundles; however, this method does not necessarily provide an accurate estimate and requires intensive computation. Instead, sparse coding and dictionary learning are carried out to classify points on the Grassmann manifold, preserving the Grassmann projection distance (chordal metric) under a diffeomorphic embedding. Sparse coding on Grassmann manifolds finds a set of linear subspaces such that each linear subspace can be represented as a linear combination of them.
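The projection (chordal) metric mentioned above has a direct expression in terms of orthonormal bases: the Frobenius distance between the corresponding projection matrices. The sketch below pairs it with a 1-nearest-neighbor classifier as a simplified stand-in for the paper's dictionary learning and sparse coding; all names are illustrative.

```python
import numpy as np

def chordal_distance(Q1, Q2):
    """Projection (chordal) metric between the subspaces spanned by the
    orthonormal bases Q1 and Q2: ||Q1 Q1^T - Q2 Q2^T||_F / sqrt(2)."""
    return np.linalg.norm(Q1 @ Q1.T - Q2 @ Q2.T, 'fro') / np.sqrt(2)

def nearest_subspace(query, gallery, labels):
    """1-NN classification of a query subspace against stored action
    subspaces -- a simplified stand-in for sparse-coding classification."""
    d = [chordal_distance(query, Q) for Q in gallery]
    return labels[int(np.argmin(d))]

# Toy usage: two orthogonal 1-D subspaces of R^2.
e1 = np.eye(2)[:, :1]
e2 = np.eye(2)[:, 1:]
print(chordal_distance(e1, e1))  # 0.0 (identical subspaces)
print(chordal_distance(e1, e2))  # 1.0 (orthogonal subspaces)
```

Because the metric depends on Q only through the projection Q Q^T, it is invariant to the choice of basis, which is exactly what a Grassmann-manifold representation requires.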
Result
The MSR-Action 3D dataset comprises depth sequences captured by depth cameras and includes time-segmented action sequences that have been preprocessed to remove the background. The dataset contains 20 actions performed by 10 different subjects, with each action repeated three times without any interaction with objects. The UT-Kinect dataset consists of 200-frame depth sequences acquired indoors with Kinect sensors. It contains 10 actions, namely, walking, standing up, picking up, moving, waving, throwing, pushing, sitting down, pulling, and clapping; each action is performed twice by 10 different people. To assess the effect of the subspace dimension on the recognition rate, we test subspace dimensions ranging from 1 to 20. Experiments on the MSR-Action 3D and UT-Kinect datasets demonstrate the excellent performance of the proposed method. On the MSR-Action 3D dataset, the recognition rate of our algorithm is 13.55% higher than that of the Eigenjoints algorithm, 2.79% higher than that of the local tangent bundle support vector machine (LTBSVM) algorithm, and 1% higher than that of the tLDS algorithm. On the UT-Kinect dataset, the recognition rate of the proposed algorithm is 5.8% higher than that of LTBSVM and 1.3% higher than that of tLDS.
Conclusion
We develop a novel action representation, namely, the tensor-based LDS model with forward and inverse action sequences. The proposed model translates 3D human skeleton sequences into tensor time series without unfolding the skeletons into column vectors, and Tucker decomposition is used to estimate the model parameters as action descriptors. Through an extensive set of experimental assessments, we verify that the model significantly improves the human action recognition rate. Major contributions enabled by the proposed method include several novel skeleton-based tensor representations. In subsequent research, we intend to apply the tensor-based LDS model with forward and inverse action sequences to multi-person interactions.