Point cloud human behavior recognition combining coordinate transformation and spatiotemporal information injection

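You Kaijun, Hou Zhenjie, Liang Jiuzhen, Zhong Zhuokun, Shi Haiyong (College of Computer and Artificial Intelligence, Changzhou University, Changzhou 213000, China)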

Abstract
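Objective Depth map sequences, which are widely used in behavior recognition, represent the spatiotemporal structure of behavior data poorly and are easily affected by factors such as dark objects. Point cloud data provide rich spatial information and geometric features that compensate for these shortcomings, but most point cloud datasets are small and carry no temporal information. To improve the utilization of spatiotemporal structure information, this paper proposes a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection.

Method Depth map sequences are converted into 3D point cloud sequences, which compensates for the small size of point cloud datasets and introduces the temporal notion of frames. The network consists of two modules: a feature extraction module and a spatiotemporal information injection module. The feature extraction module extracts deep appearance contour features from the point cloud. The spatiotemporal information injection module injects temporal information into the contour features and then injects spatial structure information through a set of random tensor projections. Finally, multiple features from different levels are aggregated and fed into the classifier.

Result The method was validated on three public datasets, and the proposed network showed good performance. On the NTU RGB+D 60 dataset, its accuracy is 1.3% and 0.2% higher than that of PSTNet (point spatio-temporal network) and SequentialPointNet, respectively; on the NTU RGB+D 120 dataset, its accuracy is 1.9% higher than that of PSTNet. To verify the robustness of the model, comparative experiments were also run on the small MSR Action3D dataset, where recognition accuracy is 1.07% higher than that of SequentialPointNet.

Conclusion The proposed network captures static point cloud appearance contour features while also incorporating dynamic spatiotemporal information, compensating for the spatiotemporal loss caused by downsampling during feature extraction.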
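
The conversion of a depth map sequence into a 3D point cloud sequence mentioned in the Method part can be illustrated with a minimal sketch. The abstract does not give the conversion formula, so the pinhole camera back-projection used here is an assumption, and the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical placeholders rather than the authors' implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_frame_to_points(depth: List[List[float]],
                          fx: float, fy: float,
                          cx: float, cy: float) -> List[Point]:
    """Back-project one depth frame into 3D points.

    Assumes a pinhole camera model (not stated in the abstract):
    x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = depth[v][u].
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:                     # non-positive depth = no measurement
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def depth_sequence_to_point_clouds(frames: List[List[List[float]]],
                                   fx: float, fy: float,
                                   cx: float, cy: float) -> List[List[Point]]:
    """Convert every frame, keeping frame order so the result is a time-ordered sequence."""
    return [depth_frame_to_points(f, fx, fy, cx, cy) for f in frames]


if __name__ == "__main__":
    # Hypothetical 2x2 depth frame and intrinsics, for illustration only.
    frame = [[0.0, 2.0],
             [1.0, 0.0]]
    clouds = depth_sequence_to_point_clouds([frame, frame], 1.0, 1.0, 0.0, 0.0)
    print(len(clouds), clouds[0])
```

Keeping one point list per frame preserves the frame index, which is what allows temporal information to be attached to each cloud later.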
Keywords
Point cloud human behavior recognition based on coordinate transformation and spatiotemporal information injection

You Kaijun, Hou Zhenjie, Liang Jiuzhen, Zhong Zhuokun, Shi Haiyong (College of Computer and Artificial Intelligence, Changzhou University, Changzhou 213000, China)

Abstract
Objective Human motion recognition and deep learning have become research hotspots in computer vision because of their extensive applications in video surveillance, virtual reality, and intelligent human-computer interaction. Deep learning has achieved excellent results in feature extraction from static images and has gradually been extended to behavior recognition research. Traditional research on human behavior recognition focuses on depth image sequences under 2D information. A depth image not only captures 3D information but also provides depth information, which represents the distance between the target and the depth camera within the visual range, regardless of external factors such as lighting and background. Although depth images can capture 3D information, most depth image algorithms use a multi-view method to extract behavior features. The quality of the extracted spatiotemporal features depends on the angle and number of views, which considerably reduces the utilization of 3D structural information, and the spatiotemporal structure information of the 3D data is largely lost. With the rapid development of 3D acquisition technology, 3D sensors, including various types of 3D scanners and LiDAR, are becoming increasingly accessible and affordable. The 3D data collected by these sensors provide rich geometry, shape, and scale information and have many applications in fields such as autonomous driving, robotics, remote sensing, and healthcare. The point cloud is a commonly used 3D representation; it retains the original geometric information in 3D space without any discretization. It is therefore the preferred representation for many scene understanding applications, such as autonomous driving and robotics. However, deep learning on 3D point clouds still faces major challenges, such as small dataset size.

Method In this study, the depth map sequence is first converted into a 3D point cloud sequence to represent human behavior information, and large, authoritative depth datasets are converted into point cloud datasets to compensate for the small size of existing point cloud datasets. Given the huge amount of point cloud data, traditional point cloud deep learning networks sample the point cloud before feature extraction. The most commonly used algorithm is random subsampling, which inevitably destroys point cloud structural information. To improve the utilization of temporal and spatial structure information and compensate for the loss of such information during random subsampling, this study proposes a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection. The network consists of two modules: the feature extraction module and the spatiotemporal information injection module. The feature extraction module extracts deep appearance contour features of the point cloud through operations such as the abstraction manipulation layer, multilayer perceptron, and maximum pooling. The abstraction manipulation layer comprises the sampling, grouping, convolutional block attention module (CBAM), and PointNet layers. The spatiotemporal information injection module injects temporal and spatial structure information into the abstract features. For temporal information, sine and cosine functions of different frequencies are used as the temporal position encoding, because they encode the position of each vector in an otherwise unordered sequence uniquely and robustly. For spatial structure information, the position-encoded abstract features are multiplied by a group of learnable random tensors drawn from a normal distribution and projected onto the corresponding dimension space. The coefficients of the random tensors are then learned by the network to find the optimal projection space, one that better focuses on the structural relations within the point cloud. Subsequently, the features enter the interpoint attention module, which further learns the structural relationships between the points of the point cloud. Finally, the multilevel features from feature extraction and information injection are aggregated and fed into the classifier.

Result Extensive experiments were performed on three public datasets, and the proposed network exhibits good performance. Accuracy on the NTU RGB+D 60 dataset is 1.3% and 0.2% higher than that of PSTNet and SequentialPointNet, respectively, considerably exceeding the recognition accuracy of other networks. Although accuracy on the NTU RGB+D 120 dataset is 0.1% lower than that of SequentialPointNet, it remains in a leading position compared with other networks and is 1.9% higher than that of PSTNet. The NTU dataset is one of the largest human action datasets. To verify the robustness of the network on small datasets, comparative experiments were also performed on the small MSR Action3D dataset. There, the recognition accuracy of the proposed network is 1.07% higher than that of SequentialPointNet and considerably higher than that of other networks.

Conclusion In this study, we propose a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection. Through coordinate transformation, the depth map sequence is converted into a 3D point cloud sequence that characterizes human behavior information, compensating for the insufficient spatial information and geometric features of depth data and improving the utilization of spatiotemporal structure information. The proposed network not only obtains static point cloud contour features but also integrates dynamic temporal and spatial information to compensate for the temporal and spatial losses caused by sampling during feature extraction.
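
The temporal position encoding described in the Method part, built from sine and cosine functions of different frequencies, can be sketched as follows. The abstract does not give the exact formula, so this sketch assumes the standard Transformer-style encoding with base 10000 and assumes the encoding is added to the per-frame features. The names `temporal_position_encoding` and `inject_time` are hypothetical.

```python
import math
from typing import List


def temporal_position_encoding(num_frames: int, dim: int,
                               base: float = 10000.0) -> List[List[float]]:
    """Return a num_frames x dim table of sinusoidal codes.

    Assumed formula (standard Transformer encoding, not confirmed by the source):
    PE[t][2k] = sin(t / base^(2k/dim)),  PE[t][2k+1] = cos(t / base^(2k/dim)).
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even so sine/cosine pairs line up")
    table: List[List[float]] = []
    for t in range(num_frames):
        row: List[float] = []
        for i in range(0, dim, 2):              # i = 2k: one frequency per pair
            angle = t / (base ** (i / dim))
            row.append(math.sin(angle))         # even channel
            row.append(math.cos(angle))         # odd channel
        table.append(row)
    return table


def inject_time(features: List[List[float]]) -> List[List[float]]:
    """Add the frame-index code to each per-frame feature vector (assumed to be additive)."""
    dim = len(features[0])
    codes = temporal_position_encoding(len(features), dim)
    return [[f + c for f, c in zip(feat, code)]
            for feat, code in zip(features, codes)]


if __name__ == "__main__":
    # Three hypothetical 4-dimensional frame features.
    frame_features = [[1.0, 1.0, 1.0, 1.0] for _ in range(3)]
    for row in inject_time(frame_features):
        print([round(x, 4) for x in row])
```

Because each frame index maps to a distinct code, identical per-frame features become distinguishable by their position in the sequence.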
Keywords
