
Published: 2019-09-16
DOI: 10.11834/jig.180657
2019 | Volume 24 | Number 9




Image Analysis and Recognition










Inversion model of linear dynamic system for human action recognition
Ding Wenwen1, Liu Kai2, Tang Fengqin1, Fu Xujia1
1. School of Mathematical Sciences, Huaibei Normal University, Huaibei 235000, China;
2. School of Computer Science and Technology, Xidian University, Xi'an 710071, China
Supported by: National Natural Science Foundation of China (61571345, 61550110247)

Abstract

Objective Human action recognition has very wide application prospects in fields such as video surveillance, human-computer interfaces, ambient assisted living, human-computer interaction, and intelligent driving. In image or video analysis, most of these tasks use color and texture cues in 2D images for recognition. However, due to occlusion, shadows, illumination changes, perspective changes, scale changes, intra-class variations, and similarities between classes, the recognition rate of human behavior is not ideal. In recent years, with the release of 3D depth cameras such as Microsoft Kinect, 3D depth data can capture scene structure and its changes, thereby improving the recognition rates for the first three challenges of human recognition. In addition, 3D depth cameras provide powerful human motion capture technology, which can output the human skeleton as 3D joint positions. Therefore, much attention has been paid to skeleton-based action recognition. The linear dynamical system (LDS) is the most common method for encoding spatio-temporal time-series data in various disciplines due to its simplicity and efficiency. A new method is proposed to obtain the parameters of a tensor-based LDS from forward and inverse action sequences and thereby construct a more complete observability matrix. The linear subspace of the observability matrix, which maps to a point on the Grassmann manifold, is then obtained for human action recognition. In this manner, an action can be expressed as a subspace spanned by the columns of the matrix, corresponding to a point on the Grassmann manifold, and classification can be performed on that basis using dictionary learning and sparse coding. Method Considering the dynamics and persistence of human behavior, we do not vectorize the time series in the usual way but retain its tensor form; that is, we are not limited to transforming high-dimensional vectors into a low-dimensional subspace, which lets us analyze the factors affecting actions from various angles (modes). In this method, human skeletons are modeled using joint points, which are initially extracted from a depth camera recording. To preserve the original spatio-temporal information of an action video and enhance the accuracy of human action recognition, we arrange the time series of skeleton motions as a three-order tensor and represent each skeleton as a two-order tensor. With this action representation, Tucker tensor decomposition is applied for dimensionality reduction. Using the tensor-based LDS model with forward and inverse action sequences, we learn a parameter tuple (A_F, A_I, C), in which C represents the spatial appearance of skeleton information, A_F describes the dynamics of the forward time series, and A_I describes the dynamics of the inverse time series. We use an $m$-order observability matrix to approximate the extended observability matrix because human behavior has a limited duration and does not extend indefinitely in time. When $m$ is small, it is insufficient to describe the entire action sequence; in the case of cyclic sub-actions in human behavior, even increasing $m$ cannot simulate the follow-up action. Combining the observability matrix of the inverse action sequence provides a description of cyclic sub-actions, evidently making up for this shortcoming, improving system performance, increasing the completeness of the finite observability matrix, and reducing computational complexity.
Thus, the finite observability matrix can be adopted as the feature descriptor for an action sequence with forward and inverse parts. In classifying points on the Grassmann manifold, a seemingly simple method is to embed the manifold into a Euclidean space through its tangent bundle; however, this does not necessarily provide an accurate estimate and requires intensive computation. Instead, sparse coding and dictionary learning are carried out to classify points on the Grassmann manifold under a diffeomorphism that preserves the Grassmann projection distance (chordal metric). Sparse coding on the Grassmann manifold finds a set of linear subspaces so that each linear subspace can be represented as a linear combination of them. Result The MSR-Action 3D dataset comprises depth sequences captured by depth cameras. It includes temporally segmented action sequences that have been preprocessed to remove the background. The dataset contains 20 actions performed by 10 different subjects, with each action repeated three times and no interaction with objects. The UT-Kinect dataset consists of 200 depth sequences acquired indoors using Kinect sensors. It contains 10 actions, namely, walking, standing up, picking up, carrying, waving, throwing, pushing, sitting down, pulling, and clapping. Each action is repeated twice by 10 different people. To assess the effect of different subspace dimensions on the recognition rate, we test subspace dimensions ranging from 1 to 20. Experiments on the MSR-Action 3D and UT-Kinect datasets demonstrate the excellent performance of the proposed method. Results show that the recognition rate of the algorithm is 13.55% higher than that of the EigenJoints algorithm, 2.79% higher than that of the LTBSVM (local tangent bundle support vector machine) algorithm, and 1% higher than that of the tLDS (tensor-based LDS) algorithm on the MSR-Action 3D dataset. For the UT-Kinect dataset, the recognition rate of the proposed algorithm is 5.8% higher than that of the LTBSVM algorithm and 1.3% higher than that of the tLDS algorithm. Conclusion We develop a novel action representation, namely, the tensor-based LDS model with forward and inverse action sequences. The proposed model translates 3D human skeleton sequences into tensor time series without unfolding the skeletons into column vectors, and Tucker decomposition is used to estimate the parameters of the model as action descriptors. Through an extensive set of experimental assessments, we verify that this model significantly improves the rate of human action recognition. Major contributions of the proposed method include several novel skeleton-based tensor representations. Our next step in subsequent research is to apply the tensor-based LDS model with forward and inverse action sequences to multi-person interactions.

Key words

forward and inverse time series; human action recognition; human skeleton; linear dynamical system (LDS); Grassmann manifold

0 Introduction

The many open challenges of human action recognition have made it one of the most active research topics in computer vision, with extremely broad application prospects in video surveillance, human-machine interfaces, ambient assisted living, human-computer interaction, and intelligent driving. Early research mainly analyzed still images or video, mostly using color and texture cues in 2D images for recognition. Occlusion of target objects, background shadows, illumination changes, viewpoint changes, scale changes, and variations in clothing and appearance make such video processing and analysis very difficult [1-7].

In recent years, with the release of 3D depth cameras such as Microsoft Kinect and Intel RealSense, not only 3D depth data of the scene but also fairly powerful human motion capture has become available, outputting the 3D joint positions of the human skeleton; as a result, human action recognition based on skeleton joints has attracted much attention [8]. Abdelkader et al. [9] modeled the planar curves of human silhouettes as trajectories on a Riemannian manifold, used dynamic time warping (DTW) for temporal alignment, and finally characterized the trajectories in the space with a Markov graphical model for action classification. Devanne et al. [10] measured the similarity between joint trajectories of human skeletons on a Riemannian manifold and classified directly on the manifold with K-nearest neighbors (KNN). Vemulapalli et al. [11] treated the bones between joints as rigid bodies, describing the relative 3D geometry between body parts as the relative 3D geometry between rigid bodies: the relative geometry of two given rigid bodies can be described by a rigid-body transformation (rotation and translation) represented in the special Euclidean group SE(3). Each pose of an action is then a point in the Lie group SE(3)×…×SE(3), and a skeleton sequence is a curve on this curved manifold. For classification, the curves on SE(3)×…×SE(3) are mapped to the corresponding tangent space and a support vector machine (SVM) completes the classification, with very good results. Slama et al. [12] extracted the motion trajectories of 3D human joints in Euclidean space as time series and modeled each motion, represented by its time series, as a linear dynamical system (LDS) to simulate its dynamics. The subspace of the model's observability matrix [13] corresponds to a point on the Grassmann manifold; using the Riemannian geometry of the manifold together with appropriate tangent vectors, they studied statistical modeling of inter-class and intra-class variation. Building on Slama's work, Ding et al. [14] proposed a tensor-based linear dynamical system (tLDS) to model action sequences represented as tensors, applying Tucker decomposition [15] to the tensor action sequence and estimating the parameters of a generalized LDS so that an action maps to a point on a Grassmann manifold; finally, distances between points are obtained with a metric on the Grassmann manifold, and sparse coding and dictionary learning [13-14] classify the human actions.

The above methods share a common feature: when modeling and analyzing human motion with 3D joints, they all work with the forward (time-forward) action sequence and never consider the reversed (time-inverse) sequence. To address this shortcoming, this paper proposes estimating the parameters of a tensor-based linear dynamical system separately from the forward and inverse action sequences, thereby constructing a more complete observability matrix [10] and effectively improving the human action recognition rate. The overall framework is shown in Fig. 1.

Fig. 1 Structure of the inversion model of the linear dynamic system

1 Tensor-based linear dynamical system model

1.1 Tensors

$\boldsymbol{X} \in \mathbf{R}^{I_1 \times I_2 \times \ldots \times I_N}$ is an $N$-dimensional array, or $N$-order tensor, where $\boldsymbol{X}_{i_1 i_2 \ldots i_N}$ is the $(i_1, i_2, \ldots, i_N)$-th element of $\boldsymbol{X}$, $I_1, I_2, \ldots, I_n, \ldots, I_N$ are the sizes of the dimensions of $\boldsymbol{X}$, and $n$ indexes the mode-$n$ dimension of $\boldsymbol{X}$.

Vectorization reshapes an $N$-order tensor $\boldsymbol{X}$ into a vector $\mathrm{vec}(\boldsymbol{X}) \in \mathbf{R}^{I}$ with $I = I_1 I_2 \cdots I_N$; element $k$ of the vectorized tensor satisfies $\mathrm{vec}(\boldsymbol{X})_k = \boldsymbol{X}_{i_1 i_2 \ldots i_N}$, where $k = 1 + \sum\limits_{p=1}^{N} (i_p - 1) \prod\limits_{m=1}^{p-1} I_m$. A tensor can also be unfolded along mode $n$ into a matrix $\boldsymbol{X}_{(n)} \in \mathbf{R}^{I_n \times (I_1 I_2 \cdots I_{n-1} I_{n+1} \cdots I_N)}$, called the mode-$n$ unfolding of the $N$-order tensor, and can be factored by Tucker decomposition [15] into the mode-$n$ products of a core tensor with a set of factor matrices.
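As a concrete illustration (not from the paper), the following numpy sketch shows column-major vectorization, which realizes the index formula above, and one standard convention for the mode-$n$ unfolding:

```python
import numpy as np

# 3rd-order example tensor with I1 = 2, I2 = 3, I3 = 4.
X = np.arange(24, dtype=float).reshape(2, 3, 4)

# Column-major (Fortran-order) flattening places element (i1, ..., iN)
# at k = 1 + sum_p (i_p - 1) * prod_{m < p} I_m (1-based indices).
vec_X = X.flatten(order='F')

def unfold(T, n):
    """Mode-n unfolding: mode-n fibers become the columns of an
    I_n x (product of the remaining dimensions) matrix."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order='F')

X1 = unfold(X, 0)   # shape (2, 12)
X2 = unfold(X, 1)   # shape (3, 8)
```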

Although a human action sequence is composed of a set of simple skeletons, the skeleton model carries not only shape information but also the temporal information of the action sequence. Since an action sequence is an ordered, multi-dimensional tensor sequence, the action can be observed from multiple angles (modes), which better captures the intrinsic variation within each mode and its independence from the other modes.

1.2 Tensor representation of human skeleton sequences

Much earlier work vectorizes skeleton features and represents an action sequence simply as a concatenation of skeleton feature vectors; the drawback is high dimensionality, which easily triggers the curse of dimensionality. Considering the dynamics and persistence of human behavior, we do not vectorize the time series in the usual way but retain its tensor nature, i.e., we are not limited to mapping high-dimensional vectors into a low-dimensional subspace, so the factors influencing an action can be analyzed from every angle (mode).

A human skeleton consists of joints and the bones between them, which are kinematically linked; treating each bone as a link, the limbs can be chained together according to their motion relations. Suppose a skeleton has $N$ joints; then there are $M = N-1$ rigid bodies, each being the bone between two joints. Each skeleton can then be represented as a 2-order tensor $\boldsymbol{Y} = [\boldsymbol{e}_{1,2}, \boldsymbol{e}_{1,3}, \ldots, \boldsymbol{e}_{i,j}, \ldots, \boldsymbol{e}_{N,N-1}]^{\mathrm{T}}_{M \times 9}$, where $\boldsymbol{e}_{i,j} = [\boldsymbol{v}_i, \boldsymbol{v}_j, \boldsymbol{v}_i - \boldsymbol{v}_j]$ and $\boldsymbol{v}_i = [x_i, y_i, z_i]$ is the 3D coordinate of a joint. A skeleton sequence of $\tau$ frames is then the 3-order tensor $[\boldsymbol{Y}(1), \boldsymbol{Y}(2), \ldots, \boldsymbol{Y}(t), \ldots, \boldsymbol{Y}(\tau)]$.
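A minimal numpy sketch of this construction; the kinematic chain in `bones` is a placeholder, not the joint layout used in the paper:

```python
import numpy as np

def frame_to_tensor(joints, bones):
    """Build the M x 9 skeleton tensor Y for one frame: each row is
    e_ij = [v_i, v_j, v_i - v_j] for one bone (i, j)."""
    rows = [np.concatenate([joints[i], joints[j], joints[i] - joints[j]])
            for i, j in bones]
    return np.stack(rows)                      # shape (M, 9)

def sequence_to_tensor(joint_seq, bones):
    """Stack tau frames into the 3rd-order tensor [Y(1), ..., Y(tau)]."""
    return np.stack([frame_to_tensor(f, bones) for f in joint_seq])

# Toy usage: a hypothetical 20-joint skeleton with M = 19 bones.
bones = [(i, i + 1) for i in range(19)]        # placeholder kinematic chain
joint_seq = np.random.randn(30, 20, 3)         # 30 frames, 20 joints, xyz
Y = sequence_to_tensor(joint_seq, bones)       # shape (30, 19, 9)
```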

1.3 Linear dynamical system (based on forward time series)

A linear dynamical system studies the qualitative behavior of a dynamical system by computing its equilibrium point and linearizing around it. Given the forward tensor action sequence $[\boldsymbol{Y}(1), \boldsymbol{Y}(2), \ldots, \boldsymbol{Y}(t), \ldots, \boldsymbol{Y}(\tau)]$, the linear dynamical system is

$ \left\{ \begin{aligned} & \boldsymbol{Y}(t) = \boldsymbol{C} \otimes \boldsymbol{X}(t) + \boldsymbol{W}(t) \quad & \boldsymbol{W}(t) \sim \mathrm{N}(0, \boldsymbol{E}) \\ & \boldsymbol{X}(t+1) = \boldsymbol{A}_{\mathrm{F}} \otimes \boldsymbol{X}(t) + \boldsymbol{Q}(t) \quad & \boldsymbol{Q}(t) \sim \mathrm{N}(0, \boldsymbol{M}) \end{aligned} \right. $ (1)

In Eq. (1) the first equation is the observation equation and the second is the state equation. $\boldsymbol{Y}(t)$ is the observed state at time $t$, the tensor modeling the skeleton of each video frame; $\boldsymbol{X}(t)$ is the hidden state at time $t$. $\boldsymbol{A}_{\mathrm{F}}$ is the state transition matrix from time $t$ to time $t+1$; $\boldsymbol{W}(t)$ and $\boldsymbol{Q}(t)$ are the measurement and modeling noise, zero-mean normally distributed random variables with covariance matrices $\boldsymbol{E}$ and $\boldsymbol{M}$, respectively. $\boldsymbol{C}$, an orthogonal matrix, maps the hidden state to the output. If the number of columns of $\boldsymbol{C}$ equals the product of all dimensions of the tensor $\boldsymbol{X}$, then $\otimes$ defines the product of a matrix and a tensor, with $\boldsymbol{Y}_{j_1 \ldots j_N} = (\boldsymbol{C} \otimes \boldsymbol{X})_{j_1 \ldots j_N} = \sum\limits_{l} \boldsymbol{C}_{kl}\, \mathrm{vec}(\boldsymbol{X})_l$ and $\mathrm{vec}(\boldsymbol{C} \otimes \boldsymbol{X}) = \boldsymbol{C}\, \mathrm{vec}(\boldsymbol{X})$, where $k$ is the row index and $l$ the column index. Since $\boldsymbol{C}$ captures the spatial appearance and $\boldsymbol{A}_{\mathrm{F}}$ the temporal dynamics, starting from the initial condition $\boldsymbol{X}(1)$, the expected observation sequence can be constructed from the pair $(\boldsymbol{A}_{\mathrm{F}}, \boldsymbol{C})$ [3]:

$ \mathrm{E}\left[ \begin{matrix} \boldsymbol{Y}(1) \\ \boldsymbol{Y}(2) \\ \boldsymbol{Y}(3) \\ \vdots \end{matrix} \right] = \left[ \begin{matrix} \boldsymbol{C}^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}})^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}}^{2})^{\mathrm{T}} \\ \vdots \end{matrix} \right] \otimes \boldsymbol{X}(1) = \boldsymbol{O}_{\mathrm{F}}^{\infty} \otimes \boldsymbol{X}(1) $ (2)
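The identity $\mathrm{vec}(\boldsymbol{C} \otimes \boldsymbol{X}) = \boldsymbol{C}\,\mathrm{vec}(\boldsymbol{X})$ behind this construction can be checked numerically; a small sketch (illustrative dimensions, column-major vectorization assumed):

```python
import numpy as np

I1, I2 = 3, 4                                  # hidden state X is I1 x I2
C = np.random.randn(I1 * I2, I1 * I2)          # columns = product of dims of X
X = np.random.randn(I1, I2)

vec = lambda T: T.flatten(order='F')           # column-major vectorization

# Elementwise definition of the product: row k of C against vec(X).
Y_vec = np.array([C[k] @ vec(X) for k in range(C.shape[0])])
Y = Y_vec.reshape(I1, I2, order='F')           # C (x) X, back in tensor form

assert np.allclose(vec(Y), C @ vec(X))         # vec(C (x) X) = C vec(X)
```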

Because human actions have finite duration and do not extend indefinitely in time, we use the $m$-order observability matrix $\boldsymbol{O}_{\mathrm{F}}^{m} = [\boldsymbol{C}^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}})^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}}^{2})^{\mathrm{T}}, \ldots, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}}^{m-1})^{\mathrm{T}}]^{\mathrm{T}}$ to approximate the extended observability matrix $\boldsymbol{O}_{\mathrm{F}}^{\infty} = [\boldsymbol{C}^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}})^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}}^{2})^{\mathrm{T}}, \ldots]^{\mathrm{T}}$ and describe the intrinsic features of the LDS model [12]. Since the subspace of $\boldsymbol{O}_{\mathrm{F}}^{m}$ is invariant to the choice of state-space basis, the time series can also be represented by the column space $\boldsymbol{S} = \boldsymbol{P}(:, 1\!:\!d)$ of the $m$-order observability matrix, where $\boldsymbol{P}$ is obtained from the singular value decomposition [3] $\boldsymbol{O}_{\mathrm{F}}^{m} = \boldsymbol{P}\boldsymbol{\varSigma}\boldsymbol{Q}^{\mathrm{T}}$ and $\boldsymbol{\varSigma}$ is a positive semidefinite diagonal matrix whose diagonal entries are the singular values. In this way, each action sequence is represented by the $d$-dimensional subspace $\boldsymbol{S}$ spanned by $\boldsymbol{O}_{\mathrm{F}}^{m}$ and regarded as a point on the Grassmann manifold $\boldsymbol{G}(p, d)$, and action sequences can be classified by classifying points on the manifold. The Grassmann manifold is the set of all $d$-dimensional linear subspaces of a $p$-dimensional vector space, where $p$ is determined by $m$.
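A numpy sketch of this step (helper names and dimensions are ours, chosen for illustration; with the 20-joint skeleton above, each observed frame vectorizes to $19 \times 9 = 171$ entries):

```python
import numpy as np

def observability(C, A, m):
    """Stack [C; C A; C A^2; ...; C A^(m-1)] into the m-order
    observability matrix."""
    blocks, CAk = [], C.copy()
    for _ in range(m):
        blocks.append(CAk)
        CAk = CAk @ A
    return np.vstack(blocks)

def subspace(O, d):
    """Column subspace S = P(:, 1:d) from the SVD O = P Sigma Q^T."""
    P, _, _ = np.linalg.svd(O, full_matrices=False)
    return P[:, :d]

C = np.linalg.qr(np.random.randn(171, 10))[0]  # orthogonal measurement matrix
A_F = np.random.randn(10, 10)
A_F /= np.linalg.norm(A_F, 2)                  # keep the powers of A_F bounded
S = subspace(observability(C, A_F, m=5), d=8)  # a point on G(p, d), p = 5 * 171
```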

Approximating the infinite observability matrix $\boldsymbol{O}_{\mathrm{F}}^{\infty}$ with the finite $\boldsymbol{O}_{\mathrm{F}}^{m}$ raises the question of how to choose $m$: a smaller $m$ cannot model the asymptotic behavior of extended observability, while a larger $m$ makes the finite observability matrix more informative but increases the computational cost.

2 Linear dynamical system based on forward and inverse time series

Given the time-reversed action sequence $[\boldsymbol{Y}(\tau), \boldsymbol{Y}(\tau-1), \ldots, \boldsymbol{Y}(t), \ldots, \boldsymbol{Y}(1)]$, the linear dynamical system is

$ \left\{ \begin{aligned} & \boldsymbol{Y}(t) = \boldsymbol{C} \otimes \boldsymbol{X}(t) + \boldsymbol{W}(t) \quad & \boldsymbol{W}(t) \sim \mathrm{N}(0, \boldsymbol{E}) \\ & \boldsymbol{X}(t+1) = \boldsymbol{A}_{\mathrm{I}} \otimes \boldsymbol{X}(t) + \boldsymbol{Q}(t) \quad & \boldsymbol{Q}(t) \sim \mathrm{N}(0, \boldsymbol{M}) \end{aligned} \right. $ (3)

where $\boldsymbol{A}_{\mathrm{I}}$ is the state transition matrix from time $t$ to time $t-1$. For the time-reversed action sequence, the expected observation sequence can be expressed through $(\boldsymbol{A}_{\mathrm{I}}, \boldsymbol{C})$ as

$ \mathrm{E}\left[ \begin{matrix} \boldsymbol{Y}(\tau) \\ \boldsymbol{Y}(\tau-1) \\ \boldsymbol{Y}(\tau-2) \\ \vdots \end{matrix} \right] = \left[ \begin{matrix} \boldsymbol{C}^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}})^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}}^{2})^{\mathrm{T}} \\ \vdots \end{matrix} \right] \otimes \boldsymbol{X}(\tau) = \boldsymbol{O}_{\mathrm{I}}^{\infty} \otimes \boldsymbol{X}(\tau) $

Similarly, the $m$-order observability matrix $\boldsymbol{O}_{\mathrm{I}}^{m} = [\boldsymbol{C}^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}})^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}}^{2})^{\mathrm{T}}, \ldots, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}}^{m-1})^{\mathrm{T}}]^{\mathrm{T}}$ approximates the extended observability matrix $\boldsymbol{O}_{\mathrm{I}}^{\infty} = [\boldsymbol{C}^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}})^{\mathrm{T}}, (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}}^{2})^{\mathrm{T}}, \ldots]^{\mathrm{T}}$. Combining with the forward action sequence above, the extended observability matrix can be written as

$ \boldsymbol{O}^{\infty} = \left[ \begin{matrix} \boldsymbol{C}^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}})^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}}^{2})^{\mathrm{T}} \\ \vdots \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}}^{2})^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}})^{\mathrm{T}} \end{matrix} \right] $ (4)

The $m$-order observability matrix based on the forward-inverse action sequence is then the combination of the forward and inverse observability matrices, i.e.,

$ \boldsymbol{O}^{m} = \left[ \begin{matrix} \boldsymbol{C}^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}})^{\mathrm{T}} \\ \vdots \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{F}}^{m-1})^{\mathrm{T}} \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}}^{m-1})^{\mathrm{T}} \\ \vdots \\ (\boldsymbol{C}\boldsymbol{A}_{\mathrm{I}})^{\mathrm{T}} \end{matrix} \right] $ (5)

With the observability matrix of the forward or inverse sequence alone, a small $m$ is insufficient to describe the whole action sequence, and only by increasing $m$ indefinitely does the finite observability matrix become more complete, at a higher computational cost. Moreover, once an action contains cyclic sub-actions (for example, waving above the head involves four or five back-and-forth waves), even increasing $m$ cannot capture the follow-up of the behavior. Incorporating the observability matrix of the inverse sequence yields a description of the continuation of cyclic sub-actions, which clearly compensates for this shortcoming, improves system performance, increases the completeness of the finite observability matrix, and reduces computational complexity.
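A sketch of this combination (same assumed helpers and dimensions as above):

```python
import numpy as np

def observability(C, A, m):
    blocks, CAk = [], C.copy()
    for _ in range(m):
        blocks.append(CAk)
        CAk = CAk @ A
    return np.vstack(blocks)               # [C; C A; ...; C A^(m-1)]

def combined_observability(C, A_F, A_I, m):
    """Eq. (5): forward blocks first, then the inverse blocks
    C A_I^(m-1), ..., C A_I in descending order of the power."""
    n = C.shape[0]
    O_F = observability(C, A_F, m)
    O_I = observability(C, A_I, m)
    # Drop the duplicate leading C block, then reverse the inverse blocks.
    inv_blocks = [O_I[k * n:(k + 1) * n] for k in range(1, m)][::-1]
    return np.vstack([O_F] + inv_blocks)
```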

3 Sparse coding on the Grassmann manifold

To classify points on the Grassmann manifold, a seemingly simple approach is to embed the manifold into a Euclidean space through its tangent bundle; however, the resulting estimates are not necessarily accurate and the computation is intensive. To avoid these limitations, a common alternative is to redefine arithmetic and distance on the Grassmann manifold with the chordal metric $d(\boldsymbol{S}_1, \boldsymbol{S}_2) = \| \Phi(\boldsymbol{S}_1) - \Phi(\boldsymbol{S}_2) \|_{\mathrm{F}} = \| \hat{\boldsymbol{S}}_1 - \hat{\boldsymbol{S}}_2 \|_{\mathrm{F}}$, where $\boldsymbol{S}$ is the column space of the $m$-order observability matrix above. As analyzed above, each action sequence is represented by the $d$-dimensional subspace $\boldsymbol{S}$ spanned by $\boldsymbol{O}^{m}$ and regarded as a point on the Grassmann manifold $\boldsymbol{G}(p, d)$, so defining distances between points on the manifold amounts to defining distances between subspaces. The projection map $\Phi: \boldsymbol{G}(p, d) \to \boldsymbol{PG}(p, d)$ embeds the Grassmann manifold $\boldsymbol{G}(p, d)$ into the space of idempotent symmetric matrices $\boldsymbol{PG}(p, d)$ via $\Phi(\boldsymbol{S}) = \boldsymbol{S}\boldsymbol{S}^{\mathrm{T}} = \hat{\boldsymbol{S}}$, where $\boldsymbol{S} = \mathrm{span}(\boldsymbol{S})$ is the optimal subspace of the matrix $\boldsymbol{S}$. Sparse coding and dictionary learning are then performed under this diffeomorphism, which preserves the Grassmann projection distance (chordal metric), to classify points on the manifold. Sparse coding on the Grassmann manifold seeks a set of linear subspaces such that every linear subspace can be represented as a linear combination of this set. Given a dictionary $\boldsymbol{D} = \{\hat{\boldsymbol{D}}_1, \ldots, \hat{\boldsymbol{D}}_j, \ldots, \hat{\boldsymbol{D}}_N\}$, a query sample $\hat{\boldsymbol{X}}$, and coefficients $\boldsymbol{y} = [y_1, y_2, \ldots, y_N]$, where $\hat{\boldsymbol{D}}_j, \hat{\boldsymbol{X}} \in \boldsymbol{PG}(p, d)$, the sparse coding objective with a sparsity penalty is

$ l(\hat{\boldsymbol{X}}, \boldsymbol{D}) \cong \min\limits_{\boldsymbol{y}} \left\| \hat{\boldsymbol{X}} - \sum\limits_{j=1}^{N} y_j \hat{\boldsymbol{D}}_j \right\|_{\mathrm{F}}^{2} + \lambda \left\| \boldsymbol{y} \right\|_{1} $

where $\lambda$ is the sparsity penalty parameter. See [16-17] for a general introduction to sparse coding and for further mathematical details of sparse coding and dictionary learning on linear subspaces.
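A minimal sketch of the projection embedding, the chordal metric, and the evaluation of this objective (the $\ell_1$ solver for $\boldsymbol{y}$ is omitted; all names are illustrative):

```python
import numpy as np

def proj(S):
    """Projection embedding Phi(S) = S S^T into PG(p, d)."""
    return S @ S.T

def chordal_dist(S1, S2):
    """Chordal metric d(S1, S2) = ||Phi(S1) - Phi(S2)||_F."""
    return np.linalg.norm(proj(S1) - proj(S2), 'fro')

def sc_objective(X_hat, D_hats, y, lam):
    """||X_hat - sum_j y_j D_hat_j||_F^2 + lam * ||y||_1."""
    recon = sum(yj * Dh for yj, Dh in zip(y, D_hats))
    return np.linalg.norm(X_hat - recon, 'fro') ** 2 + lam * np.abs(y).sum()

# Toy usage: two random 3-dimensional subspaces of R^10.
S1 = np.linalg.qr(np.random.randn(10, 3))[0]
S2 = np.linalg.qr(np.random.randn(10, 3))[0]
print(chordal_dist(S1, S2))
```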

4 Experimental results and performance analysis

4.1 MSR-Action 3D dataset

The MSR-Action 3D dataset [19] consists of depth sequences captured by a depth camera, including temporally segmented action sequences preprocessed to remove the background. It contains 20 actions performed by 10 different subjects, each action repeated three times, without interaction with any objects.

Following the experimental protocol of [19], the dataset is divided into the action subsets AS1, AS2, and AS3 listed in Table 1, each containing 8 actions. AS1 and AS2 group actions that are relatively simple and similar to one another; "relatively simple" means that few joints are involved (in a wave, for example, only one hand moves). AS3 groups relatively complex actions, i.e., those involving more joints, such as pick-up & throw.

Table 1 Three action subsets in the MSR-Action 3D database

Action subset   Actions
AS1             horizontal arm wave, hammer, punch, high throw, hand clap, bend, tennis serve, pick-up & throw
AS2             high arm wave, hand catch, draw x, draw tick, draw circle, two-hand wave, forward kick, side boxing
AS3             high throw, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick-up & throw

Three test protocols are provided in [19]. Test 1 uses 1/3 of the action samples for training and the remaining 2/3 for testing; Test 2 uses 2/3 for training and 1/3 for testing. To examine how differently subjects perform the same action, Test 3 uses the samples of half of the subjects for training and those of the other half for testing. The experiments are repeated with different orders $m$ and the average performance is reported; Table 2 compares the proposed method with the other methods.
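A sketch of these three splits over hypothetical sample records; which repetitions and which subjects form the training set is our assumption, as the fractions above do not fix it:

```python
def split(samples, protocol):
    """samples: dicts like {'subject': 1..10, 'repeat': 1..3, ...}."""
    if protocol == 'test1':                      # 1/3 train, 2/3 test
        train = [s for s in samples if s['repeat'] == 1]
    elif protocol == 'test2':                    # 2/3 train, 1/3 test
        train = [s for s in samples if s['repeat'] in (1, 2)]
    else:                                        # 'test3': cross-subject halves
        train = [s for s in samples if s['subject'] <= 5]
    test = [s for s in samples if s not in train]
    return train, test
```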

Table 2 Recognition rates of our method and other methods in the cross tests on the MSR-Action 3D database (%)

Method        Test 1   Test 2   Test 3   Overall
LARP [11]     95.29    83.87    98.22    92.46
LTBSVM [12]   95.55    84.90    98.73    93.06
tLDS [14]     96.81    89.14    98.83    94.85
Ours          97.34    90.28    99.94    95.85

As Table 2 shows, the forward-inverse model achieves an overall accuracy of 95.85% on the MSR-Action 3D dataset, clearly better than the other skeleton-based action recognition methods. The experiments on the subsets AS1, AS2, and AS3 show that the model based on forward and inverse time series outperforms the other methods in distinguishing both similar and complex actions.

Following the experimental protocol of [19], the cross-subject test setting is also applied to the entire dataset, which is more challenging than splitting it into the three subsets (AS1, AS2, and AS3). Under this protocol, the recognition rates on MSR-Action 3D are listed in Table 3; our method reaches 95.12%.

Table 3 Recognition rates on the MSR-Action 3D database under the protocol of [19] (%)

Method          Recognition rate
LARP [11]       89.48
LTBSVM [12]     91.21
SCK+DCK [18]    91.45
3RB-tLDS [14]   94.96
Ours            95.12

To assess the effect of the subspace dimension on the recognition rate, subspace dimensions $d$ from 1 to 20 are tested; Fig. 2 plots recognition performance as a function of the subspace dimension. From dimension 1 to 16 the recognition rate rises with the dimension and then falls as the dimension increases further, indicating that a small dimension loses information, whereas a subspace with many dimensions remains noisy and causes confusion between action classes.

Fig. 2 Recognition rate for different learning methods and subspace dimensions

4.2 UT-Kinect dataset

The UT-Kinect dataset comprises 200 depth sequences (6 220 frames in total) captured indoors with a Kinect sensor. It covers 10 actions: walk, stand up, pick up, carry, wave, throw, push, sit down, pull, and clap hands; each action is performed twice by 10 different subjects, and sequence lengths range from 5 to 120 frames. The dataset is challenging because a given action is performed with large variations; pick up, for instance, is performed with either one hand or both hands. For comparison, we apply the leave-one-out cross-validation (LOOCV) protocol proposed by Xia et al. [20] on the UT-Kinect sequences: in each run, 199 sequences are used for training and only one for testing. Table 4 reports the recognition rates of our method and of the LTBSVM [12] and tLDS [14] methods.
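A generic sketch of this protocol (the classifier `fit`/`predict` callables are placeholders for the pipeline described above):

```python
def loocv_accuracy(sequences, labels, fit, predict):
    """Leave-one-out: train on 199 sequences, test on the held-out one."""
    correct = 0
    for i in range(len(sequences)):
        train_X = sequences[:i] + sequences[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_X, train_y)
        correct += predict(model, sequences[i]) == labels[i]
    return correct / len(sequences)
```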

Table 4 Recognition rates for each action type on the UT-Kinect database (%)

Action       LTBSVM [12]   tLDS [14]   Ours
Walk         100           85          90
Stand up     100           100         100
Pick up      100           100         100
Carry        100           95          100
Wave         100           85          90
Throw        60            95          95
Push         65            100         100
Sit down     80            100         100
Pull         85            100         100
Clap hands   95            100         100
Average      88.5          96.48       97.5

As Table 4 shows, our method recognizes every action with accuracy above 80%, and its overall recognition rate is roughly 9% and 1% higher than those of the LTBSVM [12] and tLDS [14] methods, respectively. In this setting, the diversity of views makes it possible to capture the embedded variations across different views. Following the protocol of [19], we also use the samples of one subject for testing and those of the other nine subjects for training; under this cross-subject setting, the recognition rate of our method is 5.8% and 1.3% higher than that of LTBSVM [12] and tLDS [14], respectively, indicating that the method copes better with intra-class variation and inter-class similarity and is more robust.

5 Conclusion

Using forward and inverse action sequences and the parametric model of a tensor-based linear dynamical system, this paper constructs a novel action representation. The LDS parameters are estimated separately from the forward and reversed action sequences to build a more complete observability matrix, which is mapped to a point on the Grassmann manifold, where dictionary learning and sparse coding complete the action recognition.

Experiments on the MSR-Action 3D and UT-Kinect action datasets show that the method handles viewpoint changes, scale changes, intra-class variation, and inter-class similarity well, and effectively improves the recognition rate of human actions. However, the tensor-based skeleton sequence representation is generated from a fixed ordering of the joints, its semantics are not explicit, and the structural information between joints is lost; this is where the method needs improvement. Our next research goal is to apply the forward-inverse tensor-based LDS model to multi-person interaction.

References

  • [1] Luo H L, Wang C J, Lu F. Survey of video behavior recognition[J]. Journal on Communications, 2018, 39(6): 169–180. [罗会兰, 王婵娟, 卢飞. 视频行为识别综述[J]. 通信学报, 2018, 39(6): 169–180. ] [DOI:10.11959/j.issn.1000-436x.2018107]
  • [2] Chen Y P, Qiu W G. Review of visual-based human behavior recognition algorithms[J]. Application Research of Computers, 2019(7): 1–10. [陈煜平, 邱卫根. 基于视觉的人体行为识别算法研究综述[J]. 计算机应用研究, 2019(7): 1–10. ]
  • [3] Ding W W. 3D skeleton-based spatio-temporal representation and human action recognition[D]. Xi'an: Xidian University, 2017. [丁文文. 基于3维骨架的时空表示与人体行为识别[D]. 西安: 西安电子科技大学, 2017.]
  • [4] Ma Y X, Tan L, Dong X, et al. Action recognition for intelligent monitoring[J]. Journal of Image and Graphics, 2019, 24(2): 282–290. [马钰锡, 谭励, 董旭, 等. 面向智能监控的行为识别[J]. 中国图象图形学报, 2019, 24(2): 282–290. ] [DOI:10.11834/jig.180392]
  • [5] Ran X Y, Liu K, Li G, et al. Human action recognition algorithm based on adaptive skeleton center[J]. Journal of Image and Graphics, 2018, 23(4): 519–525. [冉宪宇, 刘凯, 李光, 等. 自适应骨骼中心的人体行为识别算法[J]. 中国图象图形学报, 2018, 23(4): 519–525. ] [DOI:10.11834/jig.170420]
  • [6] Chen L L, Wei H, Ferryman J. A survey of human motion analysis using depth imagery[J]. Pattern Recognition Letters, 2013, 34(15): 1995–2006. [DOI:10.1016/j.patrec.2013.02.006]
  • [7] Ye M, Zhang Q, Wang L, et al. A survey on human motion analysis from depth data[C]//Proceedings of Dagstuhl 2012 Seminar on Time-of-Flight Imaging and GCPR 2013 Workshop on Imaging New Modalities. Berlin, Heidelberg: Springer, 2013: 149-187.[DOI:10.1007/978-3-642-44964-2_8]
  • [8] Shotton J, Sharp T, Kipman A, et al. Real-time human pose recognition in parts from single depth images[J]. Communications of the ACM, 2013, 56(1): 116–124. [DOI:10.1145/2398356.2398381]
  • [9] Abdelkader M F, Abd-Almageed W, Srivastava A, et al. Silhouette-based gesture and action recognition via modeling trajectories on Riemannian shape manifolds[J]. Computer Vision and Image Understanding, 2011, 115(3): 439–455. [DOI:10.1016/j.cviu.2010.10.006]
  • [10] Devanne M, Wannous H, Berretti S, et al. 3D human action recognition by shape analysis of motion trajectories on Riemannian manifold[J]. IEEE Transactions on Cybernetics, 2015, 45(7): 1340–1352. [DOI:10.1109/TCYB.2014.2350774]
  • [11] Vemulapalli R, Arrate F, Chellappa R. Human action recognition by representing 3D skeletons as points in a lie group[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 588-595.[DOI:10.1109/CVPR.2014.82]
  • [12] Slama R, Wannous H, Daoudi M, et al. Accurate 3D action recognition using learning on the Grassmann manifold[J]. Pattern Recognition, 2015, 48(2): 556–567. [DOI:10.1016/j.patcog.2014.08.011]
  • [13] Turaga P, Veeraraghavan A, Srivastava A, et al. Statistical analysis on manifolds and its applications to video analysis[M]//Schonfeld D, Shan C, Tao D C, et al. Video Search and Mining. Berlin, Heidelberg: Springer, 2010: 115-144.[DOI:10.1007/978-3-642-12900-1_5]
  • [14] Ding W W, Liu K, Belyaev E, et al. Tensor-based linear dynamical systems for action recognition from 3D skeletons[J]. Pattern Recognition, 2018, 77: 75–86. [DOI:10.1016/j.patcog.2017.12.004]
  • [15] Kolda T G, Bader B W. Tensor decompositions and applications[J]. SIAM Review, 2009, 51(3): 455–500. [DOI:10.1137/07070111X]
  • [16] Xie Y C, Ho J, Vemuri B. On a nonlinear generalization of sparse coding and dictionary learning[C]//Proceedings of the 30th International Conference on Machine Learning. Atlanta: International Conference on Machine Learning, 2013: 1480-1488.
  • [17] Harandi M, Hartley R, Shen C H, et al. Extrinsic methods for coding and dictionary learning on Grassmann manifolds[J]. International Journal of Computer Vision, 2015, 114(2-3): 113–136. [DOI:10.1007/s11263-015-0833-x]
  • [18] Doretto G, Chiuso A, Wu Y N, et al. Dynamic textures[J]. International Journal of Computer Vision, 2003, 51(2): 91–109. [DOI:10.1023/A:1021669406132]
  • [19] Li W Q, Zhang Z Y, Liu Z C. Action recognition based on a bag of 3D points[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops. San Francisco: IEEE, 2010: 9-14.[DOI:10.1109/CVPRW.2010.5543273]
  • [20] Xia L, Chen C C, Aggarwal J K. View invariant human action recognition using histograms of 3D joints[C]//Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence: IEEE, 2012: 20-27.[DOI:10.1109/CVPRW.2012.6239233]