Inversion model of linear dynamic system for human action recognition
2019, Vol. 24, No. 9, pp. 1450-1457
Received: 2018-12-28; Revised: 2019-04-13; Published in print: 2019-09-16
DOI: 10.11834/jig.180657
Objective
Human action recognition has extremely broad application prospects in video surveillance, ambient assisted living, human-computer interaction, intelligent driving, and other fields. Occlusion of target objects, shadows in the video background, illumination changes, viewpoint changes, multi-scale variation, and changes in people's clothing and appearance make video processing and analysis very difficult. To address these issues, this paper uses forward and inverse time series to construct a tensor-based linear dynamical model, estimates the model parameters as action-sequence descriptors, and constructs a more complete observation matrix.
Method
First, human joint points are extracted from depth images to build forward and inverse human-skeleton sequences in tensor form. A tensor-based linear dynamical system and Tucker decomposition are then used to learn the parameter tuple (A_F, A_I, C), where C represents the spatial information of the human skeleton and A_F and A_I describe the dynamics of the forward and inverse time series, respectively. An observation matrix is constructed from the parameter tuple, so that an action can be represented as the subspace of the observation matrix, corresponding to a point on a Grassmann manifold. Finally, action recognition is completed by dictionary learning and sparse coding on the Grassmann manifold.
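The step from the learned tuple (A_F, A_I, C) to a point on the Grassmann manifold can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the stacking order of the forward and inverse blocks and all variable names are assumptions, and the dimensions are arbitrary toy values.

```python
import numpy as np

def observability_subspace(A_f, A_i, C, m=3):
    """Stack the m-order observability blocks of the forward (A_f) and
    inverse (A_i) dynamics under the output matrix C, then orthonormalize,
    so the action is represented as a linear subspace -- a point on a
    Grassmann manifold."""
    blocks = [C]
    Pf, Pi = C.copy(), C.copy()
    for _ in range(m - 1):
        Pf = Pf @ A_f          # C A_F^k terms (forward dynamics)
        blocks.append(Pf)
    for _ in range(m - 1):
        Pi = Pi @ A_i          # C A_I^k terms (inverse dynamics)
        blocks.append(Pi)
    O = np.vstack(blocks)      # finite observability matrix
    Q, _ = np.linalg.qr(O)     # orthonormal basis spanning col(O)
    return Q

# Toy usage with assumed dimensions: 4 hidden states, 6 output features.
rng = np.random.default_rng(0)
n = 4
C = rng.standard_normal((6, n))
A_f = rng.standard_normal((n, n)) * 0.5
A_i = rng.standard_normal((n, n)) * 0.5
Q = observability_subspace(A_f, A_i, C, m=3)
print(Q.shape)  # (30, 4): a 4-dimensional subspace of R^30
```

Orthonormalizing the stacked matrix is what makes the representation a well-defined subspace rather than a particular basis, which is the precondition for comparing actions on the Grassmann manifold.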
Result
Experimental results show that on the MSR-Action 3D dataset, the recognition rate of the proposed algorithm is 13.55% higher than that of the Eigenjoints algorithm, 2.79% higher than that of the local tangent bundle support vector machine (LTBSVM) algorithm, and 1% higher than that of the tensor-based linear dynamical system (tLDS) algorithm. On the UT-Kinect dataset, its recognition rate is 5.8% higher than that of LTBSVM and 1.3% higher than that of tLDS.
Conclusion
Extensive experimental evaluation verifies that the tLDS model built from forward and inverse time series effectively addresses the above problems and improves the human action recognition rate.
Objective
Human action recognition has a very wide application prospect in fields such as video surveillance, ambient assisted living, human-computer interaction, and intelligent driving. In image or video analysis, most of these tasks use color and texture cues in 2D images for recognition. However, due to occlusion, shadows, illumination changes, perspective changes, scale changes, intra-class variations, and similarities between classes, the recognition rate of human behavior is not ideal. In recent years, with the release of 3D depth cameras such as Microsoft Kinect, 3D depth data can provide pictures of scene changes, thereby improving the recognition rates for the first three challenges of human recognition. In addition, 3D depth cameras provide powerful human motion capture technology, which can output the 3D joint positions of a human skeleton. Therefore, much attention has been paid to skeleton-based action recognition. The linear dynamical system (LDS) is the most common method for encoding spatio-temporal time-series data in various disciplines because of its simplicity and efficiency. A new method is proposed to obtain the parameters of a tensor-based LDS with forward and inverse action sequences and to construct a complete observation matrix. The linear subspace of the observation matrix, which maps to a point on a Grassmann manifold, is obtained for human action recognition. In this manner, an action can be expressed as a subspace spanned by the columns of the matrix, corresponding to a point on the Grassmann manifold. On the basis of this representation, classification can be performed using dictionary learning and sparse coding.
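For context, the classical vector LDS underlying this approach admits a simple closed-form (suboptimal) parameter estimate in the style of Doretto et al.'s dynamic-texture method: SVD the observation matrix, take C from the left singular vectors, recover the hidden states, and fit the transition matrix by least squares. The sketch below shows that baseline; the paper's tensor variant replaces the SVD with a Tucker decomposition, and the function and variable names here are illustrative.

```python
import numpy as np

def learn_lds(Y, n):
    """Closed-form LDS estimation for an observation matrix Y
    (features x frames) with hidden-state dimension n."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                        # output / appearance matrix
    X = np.diag(s[:n]) @ Vt[:n, :]      # hidden state sequence
    # A maps x_t to x_{t+1}: least-squares fit A = X_{1:} X_{:-1}^+
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C

# Toy usage: 20-dimensional observations over 50 frames, 5 hidden states.
rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 50))
A, C = learn_lds(Y, n=5)
print(A.shape, C.shape)  # (5, 5) (20, 5)
```

Running the same recipe on an inverse (time-reversed) sequence yields a second transition matrix, which is the role played by A_I in the proposed parameter tuple.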
Method
Considering the dynamics and persistence of human behavior, we do not vectorize the time series in the usual way but retain its tensor structure; that is, we transform the high-dimensional data into a low-dimensional subspace to analyze the factors affecting actions from various angles (modes). In this method, human skeletons are modeled using human joint points, which are initially extracted from a depth camera recording. To preserve the original spatio-temporal information of an action video and enhance the accuracy of human action recognition, we represent the time series of skeleton motions as a three-order tensor, with each skeleton represented as a two-order tensor. With this action representation, Tucker tensor decomposition is applied for dimensionality reduction. Using the tensor-based LDS model with forward and inverse action sequences, we learn a parameter tuple (A_F, A_I, C), in which C represents the spatial appearance of skeleton information, A_F describes the dynamics of the forward time series, and A_I describes the dynamics of the inverse time series. Because human behavior has a limited duration and does not extend indefinitely in time, we use an m-order observability matrix to approximate the extended observability matrix. When m is small, it is insufficient to describe the entire action sequence; in the case of cyclic sub-actions in human behavior, even increasing m cannot simulate the follow-up action. Combining the observability matrix of the inverse action sequence with that of the forward sequence compensates for this shortcoming in describing cyclic sub-actions, improves the performance of the system, increases the completeness of the finite observation matrix, and reduces computational complexity. Thus, the finite observability matrix can be adopted as the feature descriptor of an action sequence with forward and inverse actions. In classifying points on Grassmann manifolds, a simple method is to embed the manifold into a Euclidean space through its tangent bundles; however, this method does not necessarily provide an accurate estimate and requires intensive computation. Instead, sparse coding and dictionary learning are carried out to classify points on the Grassmann manifold, preserving the Grassmann projection distance (chordal metric) under a diffeomorphic embedding. Sparse coding on Grassmann manifolds finds a set of linear subspaces such that each linear subspace can be represented as a linear combination of them.
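The projection (chordal) metric mentioned above has a direct expression in terms of orthonormal bases: the Frobenius distance between the corresponding projection matrices. The sketch below pairs it with a 1-nearest-neighbor classifier as a simplified stand-in for the paper's dictionary learning and sparse coding; all names are illustrative.

```python
import numpy as np

def chordal_distance(Q1, Q2):
    """Projection (chordal) metric between the subspaces spanned by the
    orthonormal bases Q1 and Q2: ||Q1 Q1^T - Q2 Q2^T||_F / sqrt(2)."""
    return np.linalg.norm(Q1 @ Q1.T - Q2 @ Q2.T, 'fro') / np.sqrt(2)

def nearest_subspace(query, gallery, labels):
    """1-NN classification of a query subspace against stored action
    subspaces -- a simplified stand-in for sparse-coding classification."""
    d = [chordal_distance(query, Q) for Q in gallery]
    return labels[int(np.argmin(d))]

# Toy usage: two orthogonal 1-D subspaces of R^2.
e1 = np.eye(2)[:, :1]
e2 = np.eye(2)[:, 1:]
print(chordal_distance(e1, e1))  # 0.0 (identical subspaces)
print(chordal_distance(e1, e2))  # 1.0 (orthogonal subspaces)
```

Because the metric depends on Q only through the projection Q Q^T, it is invariant to the choice of basis, which is exactly what a Grassmann-manifold representation requires.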
Result
The MSR-Action 3D dataset comprises depth sequences captured by depth cameras and includes time-segmented action sequences that have been preprocessed to remove the background. The dataset contains 20 actions performed by 10 different subjects, with each action repeated three times without any interaction with objects. The UT-Kinect dataset consists of 200-frame depth sequences acquired indoors with Kinect sensors. It contains 10 actions, namely, walking, standing up, picking up, moving, waving, throwing, pushing, sitting down, pulling, and clapping; each action is performed twice by 10 different people. To assess the effect of the subspace dimension on the recognition rate, we test subspace dimensions ranging from 1 to 20. Experiments on the MSR-Action 3D and UT-Kinect datasets demonstrate the excellent performance of the proposed method. On the MSR-Action 3D dataset, the recognition rate of our algorithm is 13.55% higher than that of the Eigenjoints algorithm, 2.79% higher than that of the local tangent bundle support vector machine (LTBSVM) algorithm, and 1% higher than that of the tLDS algorithm. On the UT-Kinect dataset, the recognition rate of the proposed algorithm is 5.8% higher than that of LTBSVM and 1.3% higher than that of tLDS.
Conclusion
We develop a novel action representation, namely, the tensor-based LDS model with forward and inverse action sequences. The proposed model translates 3D human skeleton sequences into tensor time series without unfolding the skeletons into column vectors, and Tucker decomposition is used to estimate the model parameters as action descriptors. Through an extensive set of experimental assessments, we verify that the model significantly improves the human action recognition rate. Major contributions enabled by the proposed method include several novel skeleton-based tensor representations. In subsequent research, we intend to apply the tensor-based LDS model with forward and inverse action sequences to multi-person interactions.