Gesture recognition combining spatiotemporal masks and spatial two-dimensional position encoding
Deng Gansen^{1}, Ding Wenwen^{1}, Yang Chao^{1}, Ding Chongyang^{2} (1. School of Mathematical Sciences, Huaibei Normal University; 2. Xidian University) Abstract
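Objective When features are extracted from dynamic gesture sequences, the correlation between fingers across different dynamic gestures is often ignored, which is an important cause of low gesture recognition rates. For example, the index finger and thumb are physically disconnected, but their interaction is important for recognizing the "pinch" action. To address this problem, this paper proposes a block-wise self-attention network based on two-dimensional encoding for gesture recognition; it is the first to apply spatial two-dimensional position encoding to hand joints. Method First, a spatiotemporal graph is constructed from the hand joint sequence. A spatial two-dimensional code is generated from the planar coordinates of the joints and merged with a one-dimensional encoder along the time axis to produce the spatiotemporal position encoding of the joints, which effectively handles abnormal poses in space and avoids disorder in time. Then, the spatiotemporal graph is divided into blocks according to the biological structure of the human hand, and spatial self-attention with a spatial mask captures the latent information between fingers. A temporal dilation strategy, combined with temporal self-attention and a temporal mask, captures the dynamic evolution of finger sequences over long time spans. Result On the DHG14/28 dataset, the algorithm is on average 4.47% higher than the Hpev algorithm and 2.71% higher than the MSISTGCN algorithm; on the SHREC'17 track dataset, it is on average 0.47% higher than the Hpev algorithm. Ablation experiments verify the soundness of the proposed strategies. Conclusion Extensive experimental evaluation verifies that the model built on blocking and spatiotemporal position encoding solves the above problem well and improves the gesture recognition rate.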
Keywords
Gesture recognition by combining spatiotemporal mask and spatial 2D position encoding
Deng Gansen^{1}, Ding Wenwen^{1}, Yang Chao^{1}, Ding Chongyang^{2} (1. School of Mathematical Sciences, Huaibei Normal University; 2. Xidian University) Abstract
Objective Gesture recognition methods often concentrate on the features of individual joints (nodes) and neglect the correlation between fingers, which is an important cause of low recognition rates. For example, the index finger and thumb are physically disconnected, but their interaction is important for recognizing the "pinch" action. A second cause is the lack of a suitable encoding of the spatial positions of the hand joints. To capture the correlation between fingers, we divide the hand joints into blocks; to encode spatial position, we compute a two-dimensional position code from the projection coordinates of each joint. To the best of our knowledge, this is the first time a spatial two-dimensional position encoding has been applied to hand joints. Method A spatiotemporal graph is generated from the gesture sequence. Because this graph contains both the physical connections of the joints and their temporal information, spatial and temporal features are learned separately by means of mask operations. The three-dimensional coordinates of the joints are projected to two-dimensional coordinates, which are fed into a spatial two-dimensional position encoder composed of sine and cosine functions of different frequencies. The projection plane is divided into grid cells, the sine and cosine encoder is evaluated in each cell, and the per-cell codes are combined to form the final spatial two-dimensional position code. Embedding this code into the spatial features of the nodes strengthens the spatial structure between nodes and avoids node disorder during movement.
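The spatial two-dimensional position encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the projection (simply dropping the z coordinate), the grid resolution `grid_size`, the frequency base of 10000, the channel layout (first half for the column index, second half for the row index), and all function names are assumptions made for illustration.

```python
import math


def sinusoid_1d(pos, dim, base=10000.0):
    """Encode a scalar position into `dim` channels (dim must be even)
    using interleaved sine/cosine pairs of decreasing frequency."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc


def to_grid_cell(x, y, bounds, grid_size):
    """Map planar coordinates to integer grid-cell indices (column, row)."""
    x_min, x_max, y_min, y_max = bounds

    def cell(v, lo, hi):
        if hi <= lo:  # degenerate extent: everything falls in cell 0
            return 0
        idx = int((v - lo) / (hi - lo) * grid_size)
        return min(max(idx, 0), grid_size - 1)

    return cell(x, x_min, x_max), cell(y, y_min, y_max)


def encode_joints_2d(joints_xyz, dim=64, grid_size=16):
    """Return one `dim`-channel position code per joint.

    Assumption: the 2D projection keeps (x, y) and discards z, and the
    grid covers the bounding box of the projected joints.
    """
    assert dim % 4 == 0, "dim must split into two even halves"
    xs = [p[0] for p in joints_xyz]
    ys = [p[1] for p in joints_xyz]
    bounds = (min(xs), max(xs), min(ys), max(ys))
    codes = []
    for x, y, _z in joints_xyz:
        gx, gy = to_grid_cell(x, y, bounds, grid_size)
        codes.append(sinusoid_1d(gx, dim // 2) + sinusoid_1d(gy, dim // 2))
    return codes
```

In this sketch, two joints that project into the same grid cell receive the same code regardless of depth, and the code would typically be added to the joint features before the graph convolution.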
A graph convolutional network aggregates and embeds the spatially encoded node features together with those of their neighbors, and the resulting spatiotemporal graph features are input into the spatial self-attention module to extract inter-finger correlation. To treat each finger as a unit, the nodes of the spatiotemporal graph are divided into blocks according to the biological structure of the human hand. A learnable linear transformation maps each finger to its query Q, key K, and value V feature vectors. The self-attention mechanism then computes the correlation between fingers in each frame of the spatiotemporal graph; combined with the spatial mask matrix, this yields the inter-finger correlation weights, which are used to update each finger feature. During this update, the spatial mask matrix disconnects the temporal relationships between fingers in the spatiotemporal graph, so that the time dimension does not affect the spatial correlation weight matrix. Similarly, the temporal self-attention module learns the temporal features of the fingers. First, a temporal one-dimensional position code is embedded into each frame so that frame order is available to the model during learning. To capture inter-frame correlation over longer distances, a temporal dilation strategy fuses the features of two adjacent frames. A learnable linear transformation then generates the query Q, key K, and value V feature vectors for each frame. Finally, the self-attention mechanism computes the correlation between frames; combined with the temporal mask matrix, this yields the inter-frame correlation weight matrix, which is used to update the features of each frame.
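The role of the two mask matrices can be sketched as follows. This is an illustrative reading of the description, not the authors' code: tokens are assumed to be ordered as (frame, finger block), the spatial mask is assumed to allow attention only within a frame, the temporal mask only within the same finger block across frames, and the learnable Q/K/V projections are omitted (the caller passes q, k, v directly). The function names are hypothetical.

```python
import math


def build_masks(num_frames, num_blocks):
    """Boolean masks over tokens ordered as (frame, block).

    spatial[i][j]  is True when tokens i and j lie in the same frame.
    temporal[i][j] is True when tokens i and j belong to the same block.
    """
    tokens = [(t, b) for t in range(num_frames) for b in range(num_blocks)]
    spatial = [[ti == tj for (tj, _bj) in tokens] for (ti, _bi) in tokens]
    temporal = [[bi == bj for (_tj, bj) in tokens] for (_ti, bi) in tokens]
    return spatial, temporal


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; masked-out pairs receive zero weight.

    Assumes every row of `mask` allows at least one position (true here,
    because each token may always attend to itself).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) * scale if mask[i][j] else float("-inf")
            for j, kj in enumerate(k)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights
```

Calling `masked_attention` once with the spatial mask and once with the temporal mask keeps the two correlation weight matrices separate, which is the effect the text attributes to the masks. A full model would additionally apply learned projections and multiple attention heads.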
When the features of each frame are updated, the temporal mask matrix likewise prevents the spatial dimension from affecting the temporal correlation weight matrix. A fully connected network, a ReLU activation function, and layer normalization are appended to each attention module to improve training efficiency, and the model finally outputs the learned feature vector for gesture recognition. Result The model is tested on two challenging datasets, DHG14/28 and SHREC'17 track. On DHG14/28, the model achieves the best recognition rate, on average 4.47% higher than the Hpev algorithm and 2.71% higher than the MSISTGCN algorithm. On the SHREC'17 track dataset, it is on average 0.47% higher than the Hpev algorithm. Ablation experiments confirm the need for the spatial two-dimensional position encoding. The model attains its best recognition rate when the node features have 64 dimensions and the number of self-attention heads is 8. Conclusion Extensive experimental evaluation verifies that the network model built on the block strategy and spatial two-dimensional position encoding not only strengthens the spatial structure of the nodes but also improves the gesture recognition rate by using the self-attention mechanism to learn the correlation between fingers that are not physically connected.
Keywords
