A review of skeleton-based human action recognition
Skeleton-based human action recognition aims to identify the classes of actions in skeleton sequences, each of which contains one or more actions. It has been an active research topic in computer vision in recent years. Because actions are used to perform tasks and express human emotions, action recognition can be applied in many fields, such as intelligent monitoring systems, human-computer interaction, virtual reality, and smart healthcare. Compared with RGB-based human action recognition, skeleton-based methods are less affected by interference factors such as background and human appearance, and have higher accuracy and robustness. In addition, the small amount of data required and the high computational efficiency give these methods broad prospects in practical applications. It is therefore of great significance to summarize and analyze skeleton-based human action recognition methods comprehensively and systematically. Compared with other reviews on skeleton-based action recognition, our contributions are as follows: 1) a more comprehensive summary of skeleton-based action datasets; 2) a more comprehensive summary of skeleton-based action recognition methods, including the latest Transformer-based methods; 3) a more instructive classification of graph convolutional methods; 4) a forward-looking summary of existing problems and outlook on future research directions. First, we introduce nine datasets commonly used for skeleton-based action recognition: the MSR Action3D dataset, MSR Daily Activity 3D dataset, 3D Action Pairs dataset, SYSU 3DHOI dataset, UTD-MHAD dataset, Northwestern-UCLA dataset, NTU RGB+D 60 dataset, Skeleton-Kinetics dataset, and NTU RGB+D 120 dataset.
To highlight the characteristics of these datasets, we divide them into single-view datasets and multi-view datasets according to the data collection perspective, and focus on the traits and uses of each dataset. Second, based on the backbone network used by the models, this paper categorizes skeleton-based action recognition methods into methods based on handcrafted features, methods based on recurrent neural networks (RNN), methods based on convolutional neural networks (CNN), methods based on graph convolutional networks (GCN), and methods based on Transformer. Before the rise of deep learning, traditional algorithms (handcrafted features) were usually used to model human skeleton data. The key problem for traditional methods is how to create an effective feature representation of human skeleton sequences. After deep learning methods showed excellent performance in fields such as face recognition, image classification, and image super-resolution, researchers began to use deep networks to model skeleton data. Among them, RNNs can effectively process continuous time series and are good at learning temporal dependencies in sequence data. CNNs can effectively learn the high-level semantic information of skeleton data, and training a CNN-based model requires lower computational cost than training an RNN-based one. Unlike RNN-based methods, CNN-based methods require the skeleton data to be reshaped into pseudo-images before the convolutional network is applied. The columns of the pseudo-image are the features of all joints in one frame, and the rows are the features of a certain joint across all frames. However, when RNN or CNN methods are used to model skeleton data, the topological structure of the human skeleton is ignored.
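The pseudo-image layout described above can be illustrated with a minimal sketch. It assumes a skeleton sequence stored as an array of shape (frames, joints, coordinates) and a simple per-channel min-max scaling to 8-bit values; the function name `skeleton_to_pseudo_image` and the scaling scheme are illustrative assumptions, not the encoding of any particular reviewed method.

```python
import numpy as np


def skeleton_to_pseudo_image(seq):
    """Map a skeleton sequence of shape (T, J, C) to a uint8 image of shape (J, T, C).

    Column t holds the features of all joints in frame t; row j holds the
    features of joint j across all frames. The per-channel min-max scaling
    to [0, 255] is an assumption made for this illustration.
    """
    seq = np.asarray(seq, dtype=np.float64)
    if seq.ndim != 3:
        raise ValueError("expected an array of shape (frames, joints, coords)")
    image = seq.transpose(1, 0, 2)  # (J, T, C): joints as rows, frames as columns
    lo = image.min(axis=(0, 1), keepdims=True)
    hi = image.max(axis=(0, 1), keepdims=True)
    scale = np.where(hi > lo, hi - lo, 1.0)  # guard against constant channels
    return np.round((image - lo) / scale * 255.0).astype(np.uint8)


# Hypothetical input: 30 frames, 25 joints, 3D coordinates.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(30, 25, 3))
pseudo_image = skeleton_to_pseudo_image(sequence)
print(pseudo_image.shape)  # (25, 30, 3)
```

The resulting array can be passed to an ordinary image CNN, with the coordinate axes playing the role of color channels. Note that neighboring rows are adjacent only by joint index, not by skeletal connectivity.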
Transforming the skeleton data into sequence vectors of joint coordinates or into a 2D grid cannot accurately describe the dynamic skeleton of the human body. Research has shown that graph convolution has a powerful ability to model topological graph structures, which makes it particularly suitable for modeling the human skeleton. The successful application of graph convolutional methods to skeleton-based action recognition has made them the most widely used methods at present. This paper adopts a new inductive classification and provides a comprehensive review of GCN-based methods. The GCN-based methods are further classified according to the problems that the research works target, aiming to provide researchers with more ideas and methods. The specific classification is as follows. Works conducted from the perspective of model optimization are divided into: 1) optimization of graph structure; 2) network lightweighting. Works that focus on extracting discriminative temporal and spatial features fall under: 3) optimization of temporal and spatial features. Works that address special scenarios such as missing joints and noisy joints fall under: 4) optimization for missing and noisy joints. Finally, this paper provides a comprehensive summary of the existing issues of current methods. It points out their limitations and challenges, evaluates future development trends, and outlines prospects for the field. In doing so, this review helps readers gain a deeper understanding of the current state of the task and provides guidance for future research in this area.
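To make concrete how graph convolution uses the skeleton topology, the following minimal sketch applies one spatial graph convolution to a toy skeleton. It assumes the commonly used symmetric normalization D^(-1/2)(A + I)D^(-1/2) with self-loops; the five-joint bone list, the function names, and the random weights are hypothetical and do not reproduce any specific method covered by this review.

```python
import numpy as np


def normalized_adjacency(num_joints, bones):
    """Build D^(-1/2) (A + I) D^(-1/2) for an undirected skeleton graph."""
    adj = np.eye(num_joints)  # self-loops keep each joint's own features
    for i, j in bones:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(x, a_hat, weight):
    """One spatial graph convolution with ReLU.

    x: (T, J, C_in) joint features per frame
    a_hat: (J, J) normalized adjacency
    weight: (C_in, C_out) feature transform shared by all joints
    returns: (T, J, C_out)
    """
    aggregated = np.einsum("ij,tjc->tic", a_hat, x)  # mix features of connected joints
    return np.maximum(aggregated @ weight, 0.0)


# Hypothetical toy skeleton: joint 1 is a hub connected to joints 0, 2, 3, 4.
toy_bones = [(0, 1), (1, 2), (1, 3), (1, 4)]
a_hat = normalized_adjacency(5, toy_bones)

rng = np.random.default_rng(0)
features = rng.normal(size=(30, 5, 3))   # 30 frames, 5 joints, 3D coordinates
weight = rng.normal(size=(3, 8))         # untrained weights, for illustration only
out = graph_conv(features, a_hat, weight)
print(out.shape)  # (30, 5, 8)
```

In this sketch each joint aggregates features only from itself and the joints it shares a bone with, which is the structural information that the sequence-vector and 2D-grid representations discard.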