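Application of the linear dynamic system inversion model to human action recognition

Ding Wenwen1, Liu Kai2, Tang Fengqin1, Fu Xujia1 (1. School of Mathematical Sciences, Huaibei Normal University, Huaibei 235000; 2. School of Computer Science and Technology, Xidian University, Xi'an 710071)

Abstract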
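Objective Human action recognition shows very broad application prospects in video surveillance, ambient assisted living, human-computer interaction, and intelligent driving. Occlusion of the target, shadows in the video background, illumination changes, viewpoint changes, multi-scale variation, and changes in clothing and appearance make video processing and analysis very difficult. To address this, we use forward and inverse time series to construct a tensor-based linear dynamic model, estimate the model parameters as the action sequence descriptor, and construct a more complete observability matrix. Method Human joint points are first extracted from depth images, and forward and inverse human skeleton sequences are built in tensor form. A tensor-based linear dynamical system and Tucker decomposition are then used to learn the parameter tuple (AF, AI, C), in which C represents the spatial information of the human skeleton, and AF and AI describe the dynamics of the forward and inverse time series, respectively. The observability matrix is constructed from the parameter tuple, so that an action can be represented as a subspace of the observability matrix, corresponding to a point on the Grassmann manifold. Finally, action recognition is performed through dictionary learning and sparse coding on the Grassmann manifold. Result Experiments show that on the MSR-Action 3D dataset, the proposed algorithm outperforms the EigenJoints algorithm by 13.55%, the local tangent bundle support vector machine (LTBSVM) algorithm by 2.79%, and the tensor-based linear dynamical system (tLDS) algorithm by 1%. On the UT-Kinect dataset, its recognition rate is 5.8% higher than that of the LTBSVM algorithm and 1.3% higher than that of the tLDS algorithm. Conclusion Extensive experimental evaluation verifies that the tLDS model built from forward and inverse time series handles the above problems well and improves the human action recognition rate.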
Keywords
Inversion model of linear dynamic system for human action recognition

Ding Wenwen1, Liu Kai2, Tang Fengqin1, Fu Xujia1(1.School of Mathematical Sciences, Huaibei Normal University, Huaibei 235000, China;2.School of Computer Science and Technology, Xidian University, Xi'an 710071, China)

Abstract
Objective Human action recognition has broad application prospects in fields such as video surveillance, ambient assisted living, human-computer interaction, and intelligent driving. In image and video analysis, most of these tasks rely on color and texture cues in 2D images. However, occlusion, shadows, illumination changes, viewpoint changes, scale changes, intra-class variation, and inter-class similarity keep the recognition rate of human actions below what is desired. In recent years, 3D depth cameras such as Microsoft Kinect have made depth data available. Depth data describe changes in the 3D scene and thereby improve recognition rates under the first three of these challenges (occlusion, shadows, and illumination changes). In addition, 3D depth cameras provide powerful human motion capture and can output a human skeleton as a set of 3D joint positions. Skeleton-based action recognition has therefore attracted considerable attention. The linear dynamical system (LDS) is widely used across disciplines to encode spatio-temporal time-series data because of its simplicity and efficiency. We propose a method that estimates the parameters of a tensor-based LDS from forward and inverse action sequences to construct a more complete observability matrix. An action is then expressed as the subspace spanned by the columns of this matrix, which corresponds to a point on the Grassmann manifold, and classification is performed using dictionary learning and sparse coding.
Method Considering the dynamics and persistence of human actions, we do not vectorize the time series as is commonly done but retain its tensor structure; that is, the high-dimensional data are projected onto low-dimensional subspaces so that the factors affecting an action can be analyzed along each mode. Human skeletons are modeled using joint points, which are first extracted from depth camera recordings. To preserve the original spatio-temporal information of an action video and improve recognition accuracy, we represent the skeleton motion as a time series stored in a third-order tensor, in which each skeleton is a second-order tensor. With this representation, Tucker decomposition is applied for dimensionality reduction. Using the tensor-based LDS model with forward and inverse action sequences, we learn a parameter tuple (AF, AI, C), in which C represents the spatial appearance of the skeleton, AF describes the dynamics of the forward time series, and AI describes the dynamics of the inverse time series. Because human actions have a limited duration and do not extend indefinitely in time, we approximate the extended observability matrix with an m-order (finite) observability matrix. When m is small, this matrix is insufficient to describe the entire action sequence, and when an action contains cyclic sub-actions, even increasing m cannot model the subsequent motion. Combining it with the observability matrix of the inverse action sequence compensates for this shortcoming in describing cyclic sub-actions, improves performance, increases the completeness of the finite observability matrix, and reduces computational complexity. The finite observability matrix built from forward and inverse sequences is therefore adopted as the feature descriptor of an action sequence.
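The construction of a finite observability matrix from forward and inverse dynamics can be illustrated with a minimal sketch. It is not the authors' tensor-based implementation: it assumes a vectorized LDS fitted by a truncated SVD (each skeleton flattened into a column), a single C shared by both directions, and a vertical stacking of the forward and inverse blocks. The function names `fit_lds`, `finite_observability`, and `action_descriptor` are hypothetical.

```python
import numpy as np

def fit_lds(Y, n):
    """Fit a vectorized LDS to Y (d x T, one flattened skeleton per column).

    Returns C (d x n), A_F (n x n, forward dynamics) and A_I (n x n,
    dynamics of the time-reversed sequence). Sharing C between both
    directions is an assumption of this sketch.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                          # spatial basis, orthonormal columns
    X = s[:n, None] * Vt[:n]              # latent states, shape (n, T)
    # Least-squares transition matrix: X[:, t+1] ~ A_F @ X[:, t]
    A_F = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    Xr = X[:, ::-1]                       # same states, reversed in time
    A_I = Xr[:, 1:] @ np.linalg.pinv(Xr[:, :-1])
    return C, A_F, A_I

def finite_observability(C, A, m):
    """Stack C, CA, ..., CA^(m-1) into an (m*d) x n matrix."""
    blocks, M = [], C
    for _ in range(m):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)

def action_descriptor(Y, n, m):
    """Orthonormal basis of the stacked forward and inverse observability matrices."""
    C, A_F, A_I = fit_lds(Y, n)
    O = np.vstack([finite_observability(C, A_F, m),
                   finite_observability(C, A_I, m)])
    Q, _ = np.linalg.qr(O)                # columns span the action subspace
    return Q

# Synthetic stand-in data: 20 joints x 3 coordinates, 40 frames.
rng = np.random.default_rng(0)
Y = rng.standard_normal((60, 40))
Q = action_descriptor(Y, n=5, m=3)
print(Q.shape)                            # (360, 5)
```

The orthonormal columns of `Q` give one representative of the subspace, that is, one point on the Grassmann manifold. For noise-free data generated by an invertible transition matrix, the fitted `A_I` in this sketch is the inverse of `A_F`.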
To classify points on the Grassmann manifold, a simple approach is to embed the manifold into a Euclidean space through its tangent bundle. This approach does not necessarily provide an accurate estimate, and it is computationally intensive. Instead, sparse coding and dictionary learning are carried out on the Grassmann manifold through a diffeomorphic embedding that preserves the Grassmann projection distance (chordal metric). Sparse coding on the Grassmann manifold finds a set of linear subspaces such that each linear subspace can be represented as a linear combination of them.
Result The MSR-Action 3D dataset comprises depth sequences captured by a depth camera. Its action sequences are temporally segmented and have been preprocessed to remove the background. The dataset contains 20 actions performed by 10 different subjects, with each action repeated three times and no interaction with objects. The UT-Kinect dataset comprises 200 depth sequences acquired indoors with Kinect sensors. It contains 10 actions, namely, walking, standing up, picking up, moving, waving, throwing, pushing, sitting down, pulling, and clapping, and each action is repeated twice by 10 different people. To assess the effect of the subspace dimension on the recognition rate, we test subspace dimensions from 1 to 20. On the MSR-Action 3D dataset, the recognition rate of the proposed algorithm is 13.55% higher than that of the EigenJoints algorithm, 2.79% higher than that of the local tangent bundle support vector machine (LTBSVM) algorithm, and 1% higher than that of the tensor-based LDS (tLDS) algorithm. On the UT-Kinect dataset, it is 5.8% higher than that of the LTBSVM algorithm and 1.3% higher than that of the tLDS algorithm.
Conclusion We develop a novel action representation, namely, the tensor-based LDS model with forward and inverse action sequences. The model represents 3D human skeleton sequences as tensor time series without unfolding the skeletons into column vectors, and Tucker decomposition is used to estimate the model parameters, which serve as action descriptors. Extensive experimental evaluation verifies that the model improves the human action recognition rate. The main contributions include several novel skeleton-based tensor representations. In future work, we intend to apply the tensor-based LDS model with forward and inverse action sequences to multi-person interactions.
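As a supplementary illustration of the Grassmann projection distance (chordal metric) named in the Method, the following sketch compares two subspaces through their projection matrices. It assumes each subspace is given by a matrix with orthonormal columns and uses the unnormalized convention ||XX^T - YY^T||_F (some authors include a factor of 1/sqrt(2)). The function name `projection_distance` is hypothetical, and the sketch does not cover the dictionary learning or sparse coding steps.

```python
import numpy as np

def projection_distance(X, Y):
    """Frobenius distance between the projectors X X^T and Y Y^T.

    X (d x p) and Y (d x q) must have orthonormal columns. The identity
    ||XX^T - YY^T||_F^2 = p + q - 2 ||X^T Y||_F^2 avoids forming the
    d x d projection matrices explicitly.
    """
    p, q = X.shape[1], Y.shape[1]
    sq = p + q - 2.0 * np.linalg.norm(X.T @ Y, "fro") ** 2
    return np.sqrt(max(sq, 0.0))          # clip tiny negative rounding error

# Two 2-dimensional subspaces of R^6 that are orthogonal to each other.
E = np.eye(6)
X, Y = E[:, :2], E[:, 2:4]
print(projection_distance(X, Y))          # 2.0
```

The distance depends only on the subspaces, not on the chosen bases: replacing `X` with `X @ R` for any orthogonal `R` leaves the result unchanged, which is what makes it a distance between points on the Grassmann manifold.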
Keywords
