融合时空图卷积的多人交互行为识别

成科扬1,2,3, 吴金霞1, 王文杉4, 荣兰1, 詹永照1(1.江苏大学计算机科学与通信工程学院, 镇江 212013;2.江苏省大数据泛在感知与智能农业应用工程研究中心, 镇江 212013;3.江苏大学网络空间安全研究院, 镇江 212013;4.中国电子科学研究院社会安全风险感知与防控大数据应用国家工程实验室, 北京 100041)

摘 要
目的 多人交互行为的识别在现实生活中有着广泛应用。现有的关于人类活动分析的研究主要集中在对单人简单行为的视频片段进行分类,而对于理解具有多人之间关系的复杂人类活动的问题还没有得到充分的解决。方法 针对多人交互动作中两人肢体行为的特点,本文提出基于骨架的时空建模方法,将时空建模特征输入到广义图卷积中进行特征学习,通过谱图卷积的高阶快速切比雪夫多项式进行逼近。同时对骨架之间的交互信息进行设计,通过捕获这种额外的交互信息增加动作识别的准确性。为增强时域信息的提取,创新性地将切片循环神经网络(recurrent neural network,RNN)应用于视频动作识别,以捕获整个动作序列依赖性信息。结果 本文在UT-Interaction数据集和SBU数据集上对本文算法进行评估,在UT-Interaction数据集中,与H-LSTCM(hierarchical long short-term concurrent memory)等算法进行了比较,相较于次好算法提高了0.7%,在SBU数据集中,相较于GCNConv(semi-supervised classification with graph convolutional networks)、RotClips+MTCNN(rotating clips+multi-task convolutional neural network)、SGC(simplifying graph convolutional)等算法分别提升了5.2%、1.03%、1.2%。同时也在SBU数据集中进行了融合实验,分别验证了不同连接与切片RNN的有效性。结论 本文提出的融合时空图卷积的交互识别方法,对于交互类动作的识别具有较高的准确率,普遍适用于对象之间产生互动的行为识别。
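The fast Chebyshev polynomial approximation of spectral graph convolution mentioned in the abstract can be sketched as follows. This is a minimal single-channel illustration, not the paper's implementation; the graph, filter order, and coefficients are assumptions.

```python
import numpy as np

def normalized_laplacian(A):
    """Scaled Laplacian L~ = 2L/lmax - I used in Chebyshev graph convolution."""
    d = A.sum(axis=1)
    L = np.diag(d) - A
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = Dinv @ L @ Dinv                       # symmetric normalized Laplacian
    lmax = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lmax - np.eye(A.shape[0])

def chebyshev_conv(x, A, theta):
    """Approximate spectral filtering y = sum_k theta_k T_k(L~) x,
    using the recurrence T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2}."""
    Lt = normalized_laplacian(A)
    Tx_prev, Tx = x, Lt @ x
    y = theta[0] * Tx_prev
    if len(theta) > 1:
        y = y + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_prev, Tx = Tx, 2.0 * Lt @ Tx - Tx_prev
        y = y + theta[k] * Tx
    return y
```

The recurrence keeps every step a sparse matrix-vector product, which is what makes the Chebyshev approximation fast compared with an explicit eigendecomposition.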
关键词
Multi-person interaction action recognition based on spatio-temporal graph convolution

Cheng Keyang1,2,3, Wu Jinxia1, Wang Wenshan4, Rong Lan1, Zhan Yongzhao1(1.School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China;2.Jiangsu Province Big Data Ubiquitous Perception and Intelligent Agricultural Application Engineering Research Center, Zhenjiang 212013, China;3.Cyberspace Security Academy of Jiangsu University, Zhenjiang 212013, China;4.National Engineering Laboratory for Public Security Risk Perception and Control by Big Data, China Academy of Electronic Sciences, Beijing 100041, China)

Abstract
Objective The recognition of multi-person interaction behavior has wide applications in real life. At present, research on human activity analysis mainly focuses on classifying video clips of simple behaviors of individual persons, and the problem of understanding complex human activities involving relationships among multiple people remains insufficiently addressed. In multi-person behavior recognition, the body information is richer and the description of two-person action features is more complex, so complicated recognition pipelines and low recognition accuracy arise easily. When the recognition object changes from a single person to multiple people, we need to attend not only to the action information of each person but also to the interaction information between different subjects, which existing methods cannot extract well. To solve this problem effectively, we propose a multi-person interaction behavior recognition algorithm based on skeleton graph convolution. Method The advantage of this method is that it can fully exploit the spatial and temporal dependence information between human joints. We design the interaction information between skeletons to discover the potential relationships between different individuals and different key points; by capturing this additional interaction information, we improve the accuracy of action recognition. Considering the characteristics of multi-person interaction behavior, this study proposes a skeleton-based spatio-temporal graph convolution model. In the spatial dimension, single-person and multi-person connections are designed separately. Within each frame, we design the single-person connections: in addition to the physical connections between body joints, potential correlations are added between joints without physical links, such as the left and right hands of one person.
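The single-person spatial modeling described above (physical bone edges plus added non-physical links such as the two hands) can be sketched as a minimal adjacency-matrix construction. The 15-joint layout and edge lists below are illustrative assumptions, not the paper's exact skeleton.

```python
import numpy as np

# Hypothetical 15-joint skeleton; joint indices are illustrative only.
NUM_JOINTS = 15
PHYSICAL_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4),        # spine and head
                  (1, 5), (5, 6), (6, 7),                 # left arm (7 = left hand)
                  (1, 8), (8, 9), (9, 10),                # right arm (10 = right hand)
                  (0, 11), (11, 12), (0, 13), (13, 14)]   # legs
# Added non-physical link between joints that often co-move: the two hands.
NON_PHYSICAL_EDGES = [(7, 10)]

def build_single_person_adjacency():
    """Symmetric adjacency with self-loops for one skeleton within a frame."""
    A = np.eye(NUM_JOINTS)
    for i, j in PHYSICAL_EDGES + NON_PHYSICAL_EDGES:
        A[i, j] = A[j, i] = 1.0
    return A
```

The non-physical edge lets the graph convolution aggregate features directly between distant but correlated joints instead of routing information along the whole kinematic chain.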
We also design the interaction connections between the two people within each frame. Euclidean distance is used to measure the correlation between interaction nodes and to determine which points of the two persons should be connected. In this way, the key-point connections between the two persons in a frame not only add the new, necessary interaction edges that serve as a bridge for describing the two persons' interactive actions, but also avoid noisy connections and keep the underlying graph sparse. In the time dimension, we segment the action sequence, taking every three frames as a processing unit. Joints are connected across three adjacent frames, and the larger set of neighboring joints expands the receptive field and helps the model learn temporal change information. Through this modeling in the spatial and temporal dimensions, we obtain a complex action skeleton graph. A generalized graph convolution model is used to extract and summarize the two-person action features, approximated by high-order fast Chebyshev polynomials of spectral graph convolution to obtain high-level feature maps. Meanwhile, to enhance the extraction of time-domain information, we apply the sliced recurrent neural network (RNN) to video action recognition to strengthen the characterization of two-person actions. By dividing the input sequence into multiple equal-length subsequences and extracting features from each subsequence with a separate RNN, all subsequences can be computed simultaneously, thereby overcoming the limitation of standard RNNs, which cannot be parallelized. Through the information transfer between layers, the local information of the subsequences is integrated in the higher-level network, which summarizes information from local to global so that the network can capture the dependency information of the entire action sequence.
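The sliced RNN idea above can be sketched minimally, assuming a plain tanh RNN cell and a two-level hierarchy (the paper's cell type and depth may differ). The low-level calls are mutually independent, which is exactly what makes slicing parallelizable.

```python
import numpy as np

def rnn_last_hidden(x, Wx, Wh, h0=None):
    """Simple tanh RNN; returns the final hidden state for sequence x: (T, d_in)."""
    h = np.zeros(Wh.shape[0]) if h0 is None else h0
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ Wx + h @ Wh)
    return h

def sliced_rnn(x, num_slices, Wx_low, Wh_low, Wx_top, Wh_top):
    """Split x (T, d_in) into equal-length subsequences, encode each one
    independently (these calls could run in parallel), then summarize the
    per-slice states with a top-level RNN to recover global dependencies."""
    slices = np.split(x, num_slices)       # requires T divisible by num_slices
    locals_ = np.stack([rnn_last_hidden(s, Wx_low, Wh_low) for s in slices])
    return rnn_last_hidden(locals_, Wx_top, Wh_top)
```

In a framework such as PyTorch the per-slice encoders would be batched into one call, so the wall-clock cost grows with the slice length rather than the full sequence length.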
The loss of information at the slice boundaries is addressed by taking three-frame actions as a processing unit. Result This study validates the proposed algorithm on two datasets (UT-Interaction and SBU) and compares it with other advanced interaction-recognition methods. The UT-Interaction dataset contains six classes of actions and the SBU interaction dataset has eight. We use 10-fold and 5-fold cross-validation for evaluation. On the UT-Interaction dataset, compared with H-LSTCM (hierarchical long short-term concurrent memory) and other methods, the performance improves by 0.7% over the second-best algorithm. On the SBU dataset, compared with GCNConv, RotClips+MTCNN, SGC, and other methods, our algorithm improves by 5.2%, 1.03%, and 1.2%, respectively. Fusion experiments on the SBU dataset further verify the effectiveness of the various connections and of the sliced RNN. The method can effectively extract the additional interaction information and performs well on the recognition of interactive actions. Conclusion The proposed interaction recognition method fusing spatio-temporal graph convolution achieves high accuracy on interactive actions and is generally applicable to recognizing behaviors that involve interactions between subjects.
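The Euclidean-distance criterion for selecting interaction connections between the two persons, as described in the Method, can be sketched as follows; the joint coordinates and threshold are hypothetical.

```python
import numpy as np

def interaction_edges(joints_a, joints_b, threshold):
    """Connect key points of two persons whose Euclidean distance falls below
    a threshold, yielding sparse cross-person edges instead of a dense graph.
    joints_a, joints_b: (N, 3) arrays of 3-D joint coordinates (assumed layout)."""
    edges = []
    for i, pa in enumerate(joints_a):
        for j, pb in enumerate(joints_b):
            if np.linalg.norm(pa - pb) < threshold:
                edges.append((i, j))
    return edges
```

Thresholding distances keeps only physically plausible interaction links (e.g. two hands during a handshake), so the cross-person part of the graph stays sparse and noise connections are suppressed.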
Keywords
