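You Kaijun, Hou Zhenjie, Liang Jiuzhen, Zhong Zhuokun, Shi Haiyong (School of Computer Science and Artificial Intelligence, Changzhou University)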
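Abstract: Objective Depth map sequences, widely used in action recognition, convey the spatio-temporal structure of action data poorly and are easily affected by dark-colored objects and similar factors. Point cloud data provides rich spatial information and geometric features and thus compensates for these shortcomings of depth images, but most point cloud datasets are small and carry no temporal information. Method This paper first converts depth map sequences into 3D point cloud sequences to represent human action information, which compensates for the small scale of point cloud datasets and introduces the temporal notion of frames. To improve the utilization of spatio-temporal structure information and compensate for the spatio-temporal information lost when point clouds are downsampled, this paper proposes PTSINet, an action recognition network based on point cloud sequence temporal and spatial information injection. The network consists of two modules: a feature extraction module and a spatio-temporal information injection module. The feature extraction module extracts deep appearance contour features of the point cloud through abstraction operation layers, multi-layer perceptrons, and max pooling. The spatio-temporal information injection module injects temporal information through position encoding and spatial structure information through projection with a set of learnable, normally distributed random tensors. Result Extensive experiments were conducted on three public datasets, and the proposed network performed well. On the NTU RGB+D 60 dataset, its accuracy is 1.3% and 0.2% higher than that of PSTNet and SequentialPointNet, respectively; on the NTU RGB+D 120 dataset, its accuracy is 1.9% higher than that of PSTNet. To check the robustness of the model, comparison experiments were also run on the small MSR-Action3D dataset, where recognition accuracy is 1.07% higher than that of SequentialPointNet. Conclusion Through these operations, PTSINet captures static point cloud appearance contour features while incorporating dynamic spatio-temporal information, compensating for the spatio-temporal loss caused by downsampling during feature extraction.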
PTSINet: Action Recognition Network Based on Point Cloud Sequence Temporal and Spatial Information Injection
You Kaijun, Hou Zhenjie, Liang Jiuzhen, Zhong Zhuokun, Shi Haiyong (Changzhou University)
Abstract: Objective Human action recognition and deep learning have become research hotspots in computer vision because of their wide application in video surveillance, virtual reality, intelligent human-computer interaction, and other areas. In recent years, deep learning algorithms have attracted wide attention in both academia and industry and have been applied successfully in fields such as speech recognition and image recognition. Deep learning has achieved strong results in feature extraction from static images and has gradually been extended to behavior recognition. Traditional research on human behavior recognition focuses on depth image sequences, which are represented as 2D images. A depth image captures 3D information and also provides depth information, that is, the distance between the target and the depth camera within the field of view, and it is largely unaffected by external factors such as lighting and background. Although depth images can capture 3D information, most depth image algorithms extract behavior features with a multi-view method. The quality of the extracted spatio-temporal features then depends on the angle and number of views, which greatly reduces the utilization of 3D structural information, so much of the spatio-temporal structure of the 3D data is lost. With the rapid development of 3D acquisition technology, 3D sensors such as 3D scanners and lidar are becoming increasingly accessible and affordable. The 3D data collected by these sensors provides rich geometry, shape, and scale information and has applications in many fields, including autonomous driving, robotics, remote sensing, and healthcare.
The point cloud is a commonly used 3D representation that retains the original geometric information in 3D space without any discretization. It is therefore the preferred representation for scene understanding in many applications, such as autonomous driving and robotics. However, deep learning on 3D point clouds still faces major challenges, such as small dataset size. Method In this study, depth map sequences are first converted into 3D point cloud sequences to represent human behavior information, and large, well-established depth datasets are converted into point cloud datasets to compensate for the small size of existing point cloud datasets. Because the volume of point cloud data is huge, traditional point cloud deep learning networks sample the point cloud before feature extraction. The most commonly used algorithm is random subsampling, which inevitably destroys part of the structural information of the point cloud. To improve the utilization of temporal and spatial structure information and to compensate for the temporal and spatial information lost during random subsampling, this paper proposes PTSINet, an action recognition network based on point cloud sequence temporal and spatial information injection. The network consists of two modules: a feature extraction module and a spatio-temporal information injection module. The feature extraction module extracts deep appearance contour features of the point cloud through abstraction operation layers, multi-layer perceptrons, and max pooling; each abstraction operation layer comprises a sampling layer, a grouping layer, a CBAM layer, and a PointNet layer. The spatio-temporal information injection module injects temporal information and spatial structure information into the abstract features.
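The random subsampling step mentioned above can be illustrated with a minimal sketch. It assumes a frame is a list of (x, y, z) tuples; the function name `random_subsample` and the `seed` parameter are hypothetical and are not taken from the paper. Points are drawn uniformly without regard to geometry, which is why local structure can be lost.

```python
import random


def random_subsample(points, num_samples, seed=None):
    """Draw num_samples points uniformly at random, without replacement.

    The selection ignores geometry, so thin or sparse regions of the
    cloud may lose most of their points.
    """
    if num_samples > len(points):
        raise ValueError("num_samples exceeds the number of points")
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    return rng.sample(points, num_samples)


def subsample_sequence(frames, num_samples, seed=None):
    """Subsample every frame of a point cloud sequence to the same size."""
    rng = random.Random(seed)
    return [rng.sample(frame, num_samples) for frame in frames]
```

A fixed number of points per frame keeps the input shape constant across the sequence, which batch processing in a network generally requires.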
To inject temporal information, sine and cosine functions of different frequencies are used as the temporal position encoding, because they give each feature vector a unique and robust position even though the vectors themselves are unordered. To inject spatial structure information, the position-encoded abstract features are multiplied by a set of learnable random tensors drawn from a normal distribution and projected into the corresponding dimensional space. The network then learns the coefficients of the random tensors to find the projection space that best captures the structural relations among points. The result is passed to an inter-point attention module, which further learns the structural relationships between points in the point cloud. Result Extensive experiments were carried out on three public datasets, and the proposed network performed well. On the NTU RGB+D 60 dataset, its accuracy is 1.3% and 0.2% higher than that of PSTNet and SequentialPointNet, respectively, and far exceeds that of the other networks compared. On the NTU RGB+D 120 dataset, its accuracy is 0.1% lower than that of SequentialPointNet but 1.9% higher than that of PSTNet, and it remains ahead of the other networks. The NTU datasets are among the largest human action datasets. To check the robustness of the model and its behavior on small datasets, comparison experiments were also run on the small MSR-Action3D dataset, where the recognition accuracy of PTSINet is 1.07% higher than that of SequentialPointNet and far higher than that of the other networks. Conclusion In this paper, we propose PTSINet, a point cloud sequence feature extraction and spatio-temporal information injection network for behavior recognition.
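The temporal and spatial injection steps described in the Method part can be sketched as follows. This is only an illustration under stated assumptions: the base 10000 follows the common Transformer convention and is not given in the abstract, the encoding is added to the features (the abstract does not say whether it is added or concatenated), and the projection is shown at initialization only, whereas in the network its coefficients would be learned. All function names are hypothetical.

```python
import math
import random


def temporal_position_encoding(num_frames, dim):
    """Sinusoidal encoding: even channels use sine, odd channels cosine."""
    encoding = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            # Channels 2k and 2k+1 share the frequency 1 / 10000^(2k / dim)
            # (base 10000 is an assumption, not stated in the abstract).
            angle = t / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


def init_projection(in_dim, out_dim, seed=0):
    """Projection matrix with entries drawn from a standard normal.

    In a trained network these entries would be learnable parameters.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(out_dim)]
            for _ in range(in_dim)]


def inject(features, encoding, projection):
    """Add the frame encoding to each frame feature, then project it."""
    out_dim = len(projection[0])
    result = []
    for feat, enc in zip(features, encoding):
        x = [f + e for f, e in zip(feat, enc)]  # assumed: additive injection
        result.append([sum(x[i] * projection[i][j] for i in range(len(x)))
                       for j in range(out_dim)])
    return result
```

For example, `inject(features, temporal_position_encoding(T, C), init_projection(C, D))` maps T frame features of width C to T vectors of width D, each carrying a distinct frame position.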
Through coordinate transformation, depth map sequences are converted into 3D point cloud sequences to represent human behavior information, which compensates for the limited spatial information and geometric features of depth images and improves the utilization of spatio-temporal structure information. PTSINet not only obtains static point cloud contour features but also integrates dynamic temporal and spatial information, compensating for the temporal and spatial loss caused by sampling during feature extraction.
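The coordinate transformation from depth maps to point clouds can be sketched as below. The abstract does not give the exact transformation, so this sketch assumes a standard pinhole camera model; the intrinsics `fx`, `fy`, `cx`, `cy` and the function names are hypothetical placeholders, and a depth of zero is assumed to mark a missing measurement.

```python
def depth_frame_to_points(depth, fx, fy, cx, cy):
    """Back-project one depth frame (rows of depth values) to (x, y, z).

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx and y = (v - cy) * z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # assumed convention: non-positive depth is invalid
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def depth_sequence_to_point_clouds(frames, fx, fy, cx, cy):
    """Convert a depth map sequence into a point cloud sequence.

    Frame order is preserved, so the list index acts as the time step.
    """
    return [depth_frame_to_points(f, fx, fy, cx, cy) for f in frames]
```

Keeping one point list per frame preserves the frame order of the original depth sequence, which is what gives the resulting point cloud sequence its temporal dimension.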