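A review of human action recognition based on skeleton information
Lu Jian, Li Xuanfeng, Zhao Bo, Zhou Jian (School of Electronics and Information, Xi'an Polytechnic University)
Abstract
Skeleton-based human action recognition aims to correctly identify the action classes in an input skeleton sequence containing one or more actions, and is an active research topic in computer vision. Compared with image-based human action recognition, skeleton-based methods are unaffected by interfering factors such as background and human appearance, and offer higher accuracy, robustness, and computational efficiency. Given the importance and timeliness of skeleton-based human action recognition, a comprehensive and systematic summary and analysis of these methods is of great significance. Against this background, this paper first reviews nine widely used skeleton-based action recognition datasets, divides them into single-view and multi-view datasets according to the data collection viewpoint, and discusses the characteristics and usage of each. Second, according to the backbone network used, it divides skeleton-based action recognition methods into methods based on handcrafted features, recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers, and analyzes the principles, strengths, and weaknesses of each. Among these, graph convolutional methods have become the most widely used because of their strong ability to capture spatial relationships. This paper adopts a new way of categorizing graph convolutional methods and reviews them comprehensively, aiming to offer researchers more ideas and approaches. Finally, it summarizes the problems of existing methods from eight aspects and proposes corresponding directions for future work.
Keywords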
A review of skeleton-based human action recognition
Lu Jian, Li Xuanfeng, Zhao Bo, Zhou Jian (School of Electronics and Information, Xi'an Polytechnic University)
Abstract
Skeleton-based human action recognition aims to correctly identify the action classes in skeleton sequences that contain one or more actions, and has been an active research topic in computer vision in recent years. Because actions are used both to carry out tasks and to express human emotions, action recognition can be applied in many fields, such as intelligent monitoring systems, human-computer interaction, virtual reality, and smart healthcare. Compared with RGB-based human action recognition, skeleton-based methods are less affected by interference factors such as background and human appearance, and achieve higher accuracy and robustness. In addition, the small amount of data they require and their high computational efficiency give them broad prospects in practical applications. A comprehensive and systematic summary and analysis of skeleton-based human action recognition methods is therefore of great significance. Compared with other reviews on skeleton-based action recognition, our contributions are as follows: 1) a more comprehensive summary of skeleton-based action datasets; 2) a more comprehensive summary of skeleton-based action recognition methods, including the latest Transformer-based ones; 3) a more instructive classification of graph convolutional methods; 4) a forward-looking summary of existing problems and prospects for future research directions. First, we introduce nine datasets commonly used for skeleton-based action recognition: MSR Action3D, MSR Daily Activity 3D, 3D Action Pairs, SYSU 3DHOI, UTD-MHAD, Northwestern-UCLA, NTU RGB+D 60, Skeleton-Kinetics, and NTU RGB+D 120.
To highlight the characteristics of these datasets, we divide them into single-view and multi-view datasets according to the data collection viewpoint, and focus on the traits and uses of each dataset. Second, based on the backbone network used by the models, this paper categorizes skeleton-based action recognition methods into methods based on handcrafted features, recurrent neural networks (RNN), convolutional neural networks (CNN), graph convolutional networks (GCN), and Transformers. Before the rise of deep learning, traditional algorithms built on handcrafted features were usually used to model human skeleton data. The key problem for these traditional methods is how to create an effective feature representation of human skeleton sequences. After deep learning showed excellent performance in fields such as face recognition, image classification, and image super-resolution, researchers began to use deep networks to model skeleton data. Among them, RNNs can effectively process continuous time series and are good at learning the temporal dependencies in sequence data. CNNs can effectively learn the high-level semantic information of skeleton data, and training a CNN-based model requires lower computational cost than training an RNN-based one. Unlike RNN-based methods, CNN-based methods must first reshape the skeleton data into pseudo-images: the columns of a pseudo-image are the features of all joints in one frame, and the rows are the features of a single joint across all frames. However, when RNN or CNN methods are used to model skeleton data, the topological structure of the human skeleton is ignored.
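The pseudo-image layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `skeleton_to_pseudo_image`, the use of the three joint coordinates as image channels, and the per-channel min-max scaling to 0..255 are choices made here for the example and are not taken from any specific method covered by the review.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) coordinates of one joint
Frame = List[Joint]                  # all joints in one frame
Pixel = Tuple[int, ...]              # one pixel with three channels


def skeleton_to_pseudo_image(sequence: List[Frame]) -> List[List[Pixel]]:
    """Arrange a skeleton sequence as a (joints x frames) image with 3 channels."""
    num_frames = len(sequence)
    num_joints = len(sequence[0])

    # Per-channel minimum and maximum over the whole sequence (assumed scaling).
    lo = [min(joint[c] for frame in sequence for joint in frame) for c in range(3)]
    hi = [max(joint[c] for frame in sequence for joint in frame) for c in range(3)]

    def scale(value: float, c: int) -> int:
        span = hi[c] - lo[c]
        if span == 0:
            return 0  # constant channel: map everything to 0
        return round(255 * (value - lo[c]) / span)

    # Row j holds joint j across all frames; column t holds all joints of frame t.
    return [
        [tuple(scale(sequence[t][j][c], c) for c in range(3)) for t in range(num_frames)]
        for j in range(num_joints)
    ]


# Toy sequence: 3 frames, 2 joints. Joint 0 moves, joint 1 stays still.
toy_sequence = [
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
    [(0.5, 1.0, 0.0), (1.0, 2.0, 0.0)],
    [(1.0, 2.0, 0.0), (1.0, 2.0, 0.0)],
]
image = skeleton_to_pseudo_image(toy_sequence)
```

In this layout, neighboring rows are adjacent only by joint index, not by the bones that connect them, which is one way to see why the skeleton topology is lost.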
Transforming the skeleton data into sequences of joint-coordinate vectors or a 2D grid cannot accurately describe the dynamic skeleton of the human body. Research has shown that graph convolution has a powerful ability to model topological graph structures, which makes it particularly suitable for modeling the human skeleton. The successful application of graph convolutional methods to skeleton-based action recognition has made them the most widely used methods at present. This paper adopts a novel way of categorizing GCN-based methods and reviews them comprehensively. The GCN-based methods are further classified according to the problems that the works target, aiming to provide researchers with more ideas and methods. The classification is as follows. Works that approach the task from the perspective of model optimization are divided into: 1) optimization of graph structure; 2) network lightweighting. Works that focus on extracting discriminative temporal and spatial features fall under: 3) optimization of temporal and spatial features. Works that address special scenarios such as missing joints and noisy joints fall under: 4) handling of missing and noisy joints. Finally, this paper summarizes the existing issues of current methods. It points out their limitations and challenges, assesses future development trends, and offers prospects for the field, helping readers gain a deeper understanding of the current state of this task and providing guidance for future research in this area.
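To make concrete how graph convolution uses the skeleton topology, the following is a minimal Python sketch of one spatial graph convolution over a skeleton graph. It assumes the common symmetric normalization D^{-1/2}(A + I)D^{-1/2} with self-loops; the 4-joint chain, the feature values, and the identity weight matrix are hypothetical and chosen only for illustration, and the GCN-based methods surveyed here use a variety of other graph constructions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def normalized_adjacency(num_joints: int, bones: List[Tuple[int, int]]) -> Matrix:
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    # Start from the identity so every joint has a self-loop.
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def graph_conv(features: Matrix, adj: Matrix, weight: Matrix) -> Matrix:
    """One layer: ReLU(adj @ features @ weight)."""
    out = matmul(matmul(adj, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical 4-joint chain 0-1-2-3 with 2 features per joint.
bones = [(0, 1), (1, 2), (2, 3)]
adj = normalized_adjacency(4, bones)
features = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the aggregation is visible
output = graph_conv(features, adj, weight)
```

After one layer, joint 1 receives part of joint 0's feature because a bone connects them, while joint 2, which is two bones away from joint 0, receives none of it. Information flows along the skeleton's edges rather than along array positions.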
Keywords