Published: 2017-04-16 | DOI: 10.11834/jig.20170408 | 2017, Volume 22, Number 4 | Image Processing and Coding

1. Shanghai Key Laboratory of Multidimensional Information Processing, College of Information Science and Technology, East China Normal University, Shanghai 200241, China;
2. Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
Received: 2016-10-26; revised: 2016-12-15. Funding: National Natural Science Foundation of China (61302125, 61377107); Science and Technology Commission of Shanghai Municipality (14DZ2260800). First author: Ling Peipei (1991-), female, M.S. candidate in communication and information systems, East China Normal University; her research interests include computer vision and digital image processing. E-mail: ppling.work@hotmail.com. CLC number: TP391.4; Document code: A; Article number: 1006-8961(2017)04-0482-10


Human action recognition based on privileged information
Ling Peipei1, Qiu Song1, Cai Mingming1, Xu Wei2, Feng Ying1
1. Shanghai Key Laboratory of Multidimensional Information Processing, College of Information Science and Technology, East China Normal University, Shanghai 200241, China;
2. Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
Supported by: National Natural Science Foundation of China (61302125, 61377107); Grants from Science and Technology Commission of Shanghai Municipality (14DZ2260800)

# Abstract

Objective Human action recognition is an area of considerable academic and practical value, with applications in intelligent surveillance, video retrieval, human-computer interaction, live entertainment, virtual reality, and health care. In human learning, a teacher can provide students with information hidden in examples, explanations, comments, and comparisons. However, such teacher-supplied information is seldom exploited in human action recognition. This study treats 3D depth features as privileged information to help solve human action recognition problems and to demonstrate the superiority of this new learning paradigm over the classical one. This paper reports the details of the new paradigm and its corresponding algorithms.

Method The human body can be represented as an articulated system of rigid segments connected by joints, and human motion can be regarded as a continuous evolution of the spatial configuration of these segments. Since the release of commodity depth cameras, an increasing number of studies have represented human activities by the 3D positions of tracked joints, achieving relatively good performance. However, such 3D algorithms face numerous practical limitations: the equipment is inconvenient and the computation costly. Extracting joints from RGB video sequences is difficult, which limits recognition performance. This study applies 3D depth features as privileged information to address this challenge. In particular, we adopt a skeletal representation that explicitly models the 3D geometric relationships among body parts using rotations and translations in 3D space, expressed as points in a Lie group. For the basic 2D features to be combined with the privileged information, we use several descriptors, including the motion scale-invariant feature transform (MoSIFT), motion boundary histograms (MBH), and their combinations.

Privileged information is available during training but not during testing. As in the traditional classification problem, the new algorithm focuses on learning a classifier, here the support vector machine+ (SVM+). The SVM+ algorithm, which exploits both the basic and the privileged information, determines its solution much as SVM does within the classical pattern recognition framework: it finds the optimal separating hyperplane that incurs few training errors while exhibiting a large margin. However, SVM+ is computationally costlier than SVM. This study applies the new algorithm to human activity recognition, which is convenient at test time because the 3D information is required only in the training set.

Result We evaluate our method on two challenging datasets, UTKinect-Action and Florence3D-Action, with three different 2D features. SVM+ exploits both the 2D basic features and the 3D privileged information, whereas SVM uses only the 2D basic features. The results show that the proposed SVM+ outperforms SVM and is less sensitive to the relevant parameters. This paper reports the recognition performance, the effect of varying the number of training samples and the parameters, and the confusion matrices of both SVM and SVM+ on the two datasets. The privileged information helps suppress the noise in the original 2D basic features and increases the robustness of human activity recognition.

Conclusion The role of a teacher in providing remarks, explanations, and analogies is highly important. This study proposes a new human action recognition method based on privileged information, and the experimental results on the two datasets show its effectiveness. The proposed method needs to extract the 3D privileged information only during training, so no depth information acquisition device is required during testing. The method offers fast learning and low computational complexity and can be widely used in low-cost, real-time human action recognition.

# Key words

human action recognition; privileged information; support vector machine (SVM); support vector machine+ (SVM+); 3D Lie group features

# 1.1 Method overview

1) Input:

(1) The 2D features and the 3D depth features of the human action dataset, together with their labels $\left( {{x_1}, x_1^*, {y_1}} \right)$, $\left( {{x_2}, x_2^*, {y_2}} \right)$, $\cdots$, $\left( {{x_L}, x_L^*, {y_L}} \right)$, ${x_i} \in \mathit{\boldsymbol{X}}$, $x_i^* \in {\mathit{\boldsymbol{X}}^*}$, ${y_i} \in \{-1, 1\}$, where $x_i$ is the 2D feature of a training sample, $x_i^*$ is its 3D (depth) feature, $y_i$ is the corresponding label, and $L$ is the size of the training set.

(2) The 2D features of the human actions to be recognized

 $\{ {x_1}, {x_2}, \cdots, {x_N}\}, \quad {x_j} \in \mathit{\boldsymbol{X}}$

where $x_j$ is the 2D feature of a test sample and $N$ is the size of the test set.

2) Training. The 3D depth features assist the learning on the 2D basic features, yielding a support vector machine that incorporates privileged information (SVM+). The resulting classifier is then applied to the actions to be recognized, and its results are compared with those of the classical support vector machine (SVM).

3) Testing and output. Since human action datasets generally contain multiple action classes, a one-vs-rest scheme is used to obtain the recognition results of the SVM+ classifier.
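Step 3) reduces the multiclass problem to several binary ones. The sketch below illustrates only that one-vs-rest wiring; for simplicity, each binary model is trained with a tiny perceptron-style update instead of the paper's SVM+ (which would also consume the privileged features $x_i^*$), and the feature vectors and labels are toy values:

```python
# One-vs-rest wrapper (step 3 above), with a stand-in perceptron-style
# trainer in place of the paper's SVM+ binary classifier.

def train_binary(X, y, epochs=50, lr=0.1):
    """Toy linear trainer; y must be in {-1, +1}. Stands in for SVM+."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified -> perceptron update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def one_vs_rest_train(X, labels, classes):
    """Train one binary model per class: class c vs. all other classes."""
    models = {}
    for c in classes:
        y = [1 if l == c else -1 for l in labels]
        models[c] = train_binary(X, y)
    return models

def one_vs_rest_predict(models, x):
    """Predict the class whose binary model gives the highest score."""
    def score(m):
        w, b = m
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=lambda c: score(models[c]))

# Toy 2-D feature vectors for three action classes (hypothetical data)
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1], [1.0, 1.0], [0.9, 0.9]]
labels = [0, 0, 1, 1, 2, 2]
models = one_vs_rest_train(X, labels, classes=[0, 1, 2])
print(one_vs_rest_predict(models, [0.05, 0.95]))
```

In the full method, `train_binary` would be replaced by an SVM+ solver that receives $(x_i, x_i^*, y_i)$ triples during training; the surrounding one-vs-rest logic is unchanged.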

# 2.3 Analysis of experimental results

The SVM columns report the recognition rates obtained when 2D features are used for both training and testing, with the optimal parameters selected on validation samples. The SVM+ columns correspond to the proposed method: the 3D features are used as privileged information for the training samples, while the test samples remain unchanged.
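For context, the way the privileged 3D features enter training only (and never testing) can be seen in the SVM+ primal of Vapnik's learning-using-privileged-information framework [13-14]; the linear form is shown here for clarity:

$$
\begin{aligned}
\min_{\boldsymbol{w},\, b,\, \boldsymbol{w}^*,\, b^*}\quad & \frac{1}{2}\|\boldsymbol{w}\|^2 + \frac{\gamma}{2}\|\boldsymbol{w}^*\|^2 + C\sum_{i=1}^{L}\left(\langle \boldsymbol{w}^*, x_i^*\rangle + b^*\right)\\
\text{s.t.}\quad & y_i\left(\langle \boldsymbol{w}, x_i\rangle + b\right) \ge 1 - \left(\langle \boldsymbol{w}^*, x_i^*\rangle + b^*\right)\\
& \langle \boldsymbol{w}^*, x_i^*\rangle + b^* \ge 0, \quad i = 1, \cdots, L
\end{aligned}
$$

The slack of each training sample is modeled by the correcting function $\langle \boldsymbol{w}^*, x_i^*\rangle + b^*$ defined on the privileged features, so at test time only $\boldsymbol{w}$ and $b$, which depend on the 2D features alone, are needed.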

Table 1 Recognition performance of the proposed method on the UTKinect-Action dataset

| Dictionary size | MoSIFT 2D features (SVM) | MoSIFT 2D features (ours) | MBH 2D features (SVM) | MBH 2D features (ours) | COM 2D features (SVM) | COM 2D features (ours) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 500 | 0.4444 | 0.4647 | 0.7576 | 0.8081 | 0.7677 | 0.7879 |
| 1 000 | 0.5354 | 0.5354 | 0.7374 | 0.8283 | 0.8081 | 0.8384 |
| 2 000 | 0.5556 | 0.5556 | 0.7980 | 0.8182 | 0.8182 | 0.8182 |

# 2.3.2 Analysis of feature extraction methods

The COM feature is a combined feature obtained by supplementing the MBH feature with HOG and HOF features. The local HOG and HOF features are computed at spatiotemporal interest points detected with the algorithm proposed by Laptev [15]; once combined, the three features compensate for each other's weaknesses, so the COM feature achieves higher recognition rates than the MBH feature alone.
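A minimal sketch of this kind of descriptor combination, assuming per-block L2 normalization before concatenation (the paper does not specify the exact normalization, and the histogram sizes below are toy values):

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a descriptor block to (approximately) unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

def combine_descriptors(*parts):
    """Normalize each descriptor block separately, then concatenate,
    forming a combined (COM-style) feature vector."""
    out = []
    for p in parts:
        out.extend(l2_normalize(p))
    return out

# Toy HOG / HOF / MBH histograms (hypothetical sizes and values)
hog = [1.0, 2.0, 2.0]
hof = [0.0, 3.0, 4.0]
mbh = [5.0, 0.0, 0.0]
com = combine_descriptors(hog, hof, mbh)
print(len(com))  # 9
```

Normalizing each block separately keeps any one descriptor from dominating the combined feature when the blocks differ in dimensionality or scale.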

# References

• [1] Carlsson S, Sullivan J. Action recognition by shape matching to key frames//Proceedings of the 2001 IEEE Computer Society Workshop on Models versus Exemplars in Computer Vision. New York, USA:IEEE, 2001:18.
• [2] Efros A A, Berg A C, Mori G, et al. Recognizing action at a distance//Proceedings of the 9th IEEE International Conference on Computer Vision. Nice, France:IEEE, 2003, 2:726-733.[DOI: 10.1109/ICCV.2003.1238420]
• [3] Denman S, Fookes C, Sridharan S. Improved simultaneous computation of motion detection and optical flow for object tracking//Proceedings of the 2009 Digital Image Computing:Techniques and Applications. Melbourne, VIC:IEEE, 2009:175-182.[DOI: 10.1109/DICTA.2009.35]
• [4] Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2):91-110.[DOI: 10.1023/B:VISI.0000029664.99615.94]
• [5] Wang H, Kläser A, Schmid C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1): 60–79. [DOI:10.1007/s11263-012-0594-8]
• [6] Shotton J, Fitzgibbon A, Cook M, et al. Real-time human pose recognition in parts from single depth images//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI:IEEE, 2011:1297-1304.[DOI: 10.1109/CVPR.2011.5995316]
• [7] Xia L, Chen CC, Aggarwal J K. View invariant human action recognition using histograms of 3D joints//Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence, RI:IEEE, 2012:20-27.[DOI: 10.1109/CVPRW.2012.6239233]
• [8] Vemulapalli R, Arrate F, Chellappa R. Human action recognition by representing 3D skeletons as points in a Lie group//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH:IEEE, 2014:588-595.[DOI: 10.1109/CVPR.2014.82]
• [9] Wang J, Zheng H C. View-robust action recognition based on temporal self-similarities and dynamic time warping//Proceedings of the 2012 IEEE International Conference on Computer Science and Automation Engineering. Zhangjiajie, China:IEEE, 2012:498-502.[DOI: 10.1109/CSAE.2012.6272822]
• [10] Schuldt C, Laptev I, Caputo B. Recognizing human actions:a local SVM approach//Proceedings of the 17th International Conference on Pattern Recognition. Cambridge:IEEE, 2004:32-36.[DOI: 10.1109/ICPR.2004.1334462]
• [11] Simon T, Nguyen M H, De La Torre F, et al. Action unit detection with segment-based SVMs//Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA:IEEE, 2010:2737-2744.[DOI: 10.1109/CVPR.2010.5539998]
• [12] Wang M Y, Zhang C L, Song Y. An improved multiple instance learning algorithm for object extraction//Proceedings of the 2010 Chinese Conference on Pattern Recognition. Chongqing, China:IEEE, 2010:1-5.[DOI: 10.1109/CCPR.2010.5659221]
• [13] Pechyony D, Vapnik V. Fast optimization algorithms for solving SVM+//Summa M G, Bottou L, Goldfarb B, et al. Statistical Learning and Data Science. Boca Raton, FL:Chapman and Hall, 2011.
• [14] Vapnik V N. Statistical Learning Theory. New York:Wiley, 1998:156-178.
• [15] Seidenari L, Varano V, Berretti S, et al. Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses//Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland, OR:IEEE, 2013:479-485.[DOI: 10.1109/CVPRW.2013.77]
• [16] Chen M Y, Hauptmann A. MoSIFT:Recognizing human actions in surveillance videos, CMU-CS-09-161. Pittsburgh, PA:Carnegie Mellon University, 2009.
• [17] Li W, Dai D X, Tan M K, et al. Fast algorithms for linear and kernel SVM+//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA:IEEE, 2016:2258-2266.[DOI: 10.1109/CVPR.2016.248]
• [18] Zhu Y, Chen W B, Guo G D. Fusing spatiotemporal features and joints for 3D action recognition//Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland, OR:IEEE, 2013:486-491.[DOI: 10.1109/CVPRW.2013.78]