Adaptive multi-view skeleton network for industrial packing action recognition

Zhang Xueqi, Hu Haiyang, Pan Kailai, Li Zhongjin (School of Computer Science and Technology, Hangzhou Dianzi University)

Abstract
Objective Action recognition is becoming increasingly important in industrial manufacturing. In complex production environments, recognizing workers' actions and postures can improve production efficiency and quality. In recent years, action recognition based on skeleton data has attracted wide attention and research, and methods built mainly on graph convolutional networks (GCN) or long short-term memory networks (LSTM) have shown excellent recognition performance in experiments. However, these methods do not consider the occlusion, viewpoint changes, and similar subtle actions that can occur in a factory environment, all of which can strongly affect subsequent action recognition. To address this, this paper proposes a packing behavior recognition method based on a dual-view skeleton multi-stream network. Method First, stacked residual frames (RF, difference images) are used as the model input to extract features more effectively, and multiple views in complementary directions are used to handle occlusion of the human body. The differenced human skeletons extracted from the two complementary views are passed to an adaptive view transformation module, which rotates the skeleton data to the best virtual observation angle; the transformed skeleton data are then fed into a three-layer stacked long short-term memory (LSTM) network, and the classification scores of the different streams are fused to obtain the recognition result. In addition, to recognize subtle actions, a locally localized image convolutional network combined with an attention mechanism is adopted, and the cropped images are passed to a ResNeXt network for recognition. Finally, the skeleton and local-image recognition results are fused to predict the worker's action. Results Experiments on a packing scenario in a real production environment show that the proposed method reaches a packing behavior recognition accuracy of 92.31%, well ahead of existing mainstream action recognition methods. The method was also evaluated on the public NTU RGB+D dataset, achieving 85.52% and 93.64% under the CS and CV protocols, respectively, outperforming other networks and further validating its effectiveness and accuracy. Conclusion This paper proposes a human action recognition method that fully exploits human action information from multiple views and combines a skeleton network with a convolutional neural network model, effectively improving action recognition accuracy.
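The residual-frame input described above boils down to frame differencing: consecutive RGB frames are subtracted so that static background cancels and motion dominates. Below is a minimal Python sketch of this idea; the stride and the uint8 normalization are illustrative assumptions, not the authors' exact preprocessing.

```python
import numpy as np

def residual_frames(frames: np.ndarray, stride: int = 1) -> np.ndarray:
    """Convert a clip of RGB frames (T, H, W, C) into stacked difference
    images by subtracting each frame from the one `stride` steps later."""
    # Cast up so the subtraction of uint8 frames cannot wrap around.
    diff = frames[stride:].astype(np.int16) - frames[:-stride].astype(np.int16)
    # Keep only motion magnitude: static background regions cancel to ~0.
    return np.abs(diff).astype(np.uint8)
```

For a clip of T frames this yields T - stride difference images, which are then stacked to form the model input.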
Keywords
Adaptive multi-view skeleton network for industrial packing action recognition

Zhang Xueqi, Hu Haiyang, Pan Kailai, Li Zhongjin (School of Computer Science and Technology, Hangzhou Dianzi University)

Abstract
Objective Action recognition has become increasingly important in industrial manufacturing. By recognizing worker actions and postures in complex production environments, production efficiency and quality can be improved. In recent years, action recognition based on skeletal data has received widespread attention and research, with methods mainly based on graph convolutional networks (GCN) or long short-term memory networks (LSTM) exhibiting excellent recognition performance in experiments. However, these methods do not consider the occlusion, viewpoint changes, and similar subtle actions that occur in factory environments, all of which can significantly affect subsequent action recognition. Therefore, this paper proposes a packing behavior recognition method built on a dual-view skeleton multi-stream network. Method The network model consists of a main network and a sub-network. The main network takes two RGB videos from different viewpoints as input, recording the same worker performing the same action at the same time. The image difference method is then used to convert the input video into difference images, the 3D skeleton of the person is extracted from the depth map by a 3D pose estimation algorithm, and the skeleton is passed to the subsequent view transformation module. In the view transformation module, the skeleton data are rotated to find the best observation viewpoint, and the transformed skeleton sequence is fed into a three-layer stacked long short-term memory (LSTM) network; the classification scores of the different streams are then fused by weighting to obtain the recognition result of the main network. In addition, for similar behaviors and non-compliant "fake actions", we use a locally localized image convolutional network combined with an attention mechanism and pass the cropped images into a ResNeXt network for recognition. To focus on the key frames of the skeleton sequence, we also introduce a spatio-temporal attention mechanism for analyzing the action recognition sequences. The recognition scores of the main network and the sub-network are fused in proportion to obtain the final recognition result and predict the worker's behavior. Results First, CNN-based methods usually perform better than RNN-based methods, while GCN-based methods perform in between; mixing CNN and RNN structures to better exploit the spatio-temporal information of the skeleton further improves accuracy and recall. The method proposed in this paper achieves a packing behavior recognition accuracy of 92.31% and a recall of 89.72%, which are 3.96% and 3.81% higher, respectively, than the best of these baselines, placing it significantly ahead of existing mainstream behavior recognition methods. Second, the difference-image input combined with the skeleton extraction algorithm achieves 87.6% accuracy, better than using raw RGB images as input; although the frame rate drops to 55.3 frames per second, this is still within an acceptable range. Third, examining the influence of the adaptive transformation module and the multi-view module, we find that the adaptive transformation module greatly improves the recognition rate of the single-stream network, at the cost of a slight decrease in frame rate.
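As a sketch of one skeleton stream of the main network, the following PyTorch code rotates each skeleton sequence to a predicted virtual viewpoint and classifies it with a three-layer stacked LSTM. The small angle-regression LSTM, the layer sizes, and the 25-joint layout are assumptions made for illustration; the paper's exact view-adaptation architecture may differ.

```python
import torch
import torch.nn as nn

class ViewAdaptiveLSTM(nn.Module):
    """One skeleton stream: a small LSTM regresses per-clip rotation
    angles, the skeleton is rotated to that virtual viewpoint, and a
    three-layer stacked LSTM classifies the rotated sequence."""

    def __init__(self, num_joints: int = 25, hidden: int = 128,
                 num_classes: int = 10):
        super().__init__()
        in_dim = num_joints * 3
        self.angle_net = nn.LSTM(in_dim, 64, batch_first=True)
        self.angle_fc = nn.Linear(64, 3)  # yaw, pitch, roll of virtual view
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=3, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    @staticmethod
    def _rotation(angles: torch.Tensor) -> torch.Tensor:
        """Batch of 3x3 rotation matrices R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
        y, p, r = angles.unbind(dim=1)
        cy, sy = y.cos(), y.sin()
        cp, sp = p.cos(), p.sin()
        cr, sr = r.cos(), r.sin()
        zero, one = torch.zeros_like(y), torch.ones_like(y)
        Rz = torch.stack([cy, -sy, zero, sy, cy, zero, zero, zero, one], 1).view(-1, 3, 3)
        Ry = torch.stack([cp, zero, sp, zero, one, zero, -sp, zero, cp], 1).view(-1, 3, 3)
        Rx = torch.stack([one, zero, zero, zero, cr, -sr, zero, sr, cr], 1).view(-1, 3, 3)
        return Rz @ Ry @ Rx

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, J, 3) 3D joint positions from one camera view.
        B, T, J, _ = x.shape
        flat = x.reshape(B, T, J * 3)
        h, _ = self.angle_net(flat)
        R = self._rotation(self.angle_fc(h[:, -1]))        # one viewpoint per clip
        rotated = torch.einsum('bij,btkj->btki', R, x).reshape(B, T, J * 3)
        out, _ = self.lstm(rotated)
        return self.fc(out[:, -1])                         # per-class scores
```

For example, `ViewAdaptiveLSTM()(torch.randn(2, 30, 25, 3))` yields class scores for two 30-frame, 25-joint clips; one such stream is instantiated per view, and the streams' scores are fused by weighting as described above.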
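The sub-network's local-image branch can likewise be sketched: take the hand joint's image coordinates from the extracted skeleton, crop a patch around it, and score the patch with ResNeXt. The 112-pixel patch size and the torchvision resnext50_32x4d backbone are assumptions, and the paper's attention-based localization is simplified here to a plain skeleton-centred crop.

```python
import torch
import torchvision.models as models

# Sub-network backbone; the variant and class count are illustrative.
resnext = models.resnext50_32x4d(num_classes=10).eval()

def hand_patch_scores(frame: torch.Tensor, hand_xy: torch.Tensor,
                      size: int = 112) -> torch.Tensor:
    """frame: (3, H, W) image tensor; hand_xy: (2,) pixel position of the
    hand joint taken from the skeleton. Returns (1, num_classes) scores."""
    _, H, W = frame.shape
    # Clamp the centre so the square patch stays inside the image.
    cx = int(hand_xy[0].clamp(size // 2, W - size // 2))
    cy = int(hand_xy[1].clamp(size // 2, H - size // 2))
    patch = frame[:, cy - size // 2: cy + size // 2,
                     cx - size // 2: cx + size // 2]
    with torch.no_grad():
        return resnext(patch.unsqueeze(0))
```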
The experiments also show that the module learns to prefer observing actions from the front, because a frontal view scatters the skeleton joints as widely as possible, whereas the side view has the highest degree of mutual occlusion between bones and therefore the worst observation quality. For the dual-view setting, simply fusing the outputs of the two single streams already improves performance, and weighted averaging works best, yielding accuracies 3.83% and 3.03% higher than single streams S1 and S2, respectively. Under certain shooting angles, some actions suffer from object occlusion and human self-occlusion; this problem is solved by the two complementary views, since an action occluded in one view can still be well recognized in the other. In addition, evaluations were carried out on the public NTU RGB+D dataset, where the performance also surpassed that of other networks, further validating the effectiveness and accuracy of the proposed method. Conclusion The proposed method uses a two-stream network model. The main network is an adaptive multi-view RNN: two depth cameras at complementary viewpoints capture the same workstation, and the incoming RGB images are converted into difference images from which skeleton information is extracted. The skeleton data are then passed to the adaptive view transformation module to obtain the best skeleton observation viewpoint, and a three-layer stacked LSTM network produces the recognition result; the features of the two views are finally fused by weighting. In this way the main network counteracts the influence of occlusion and background clutter. To compensate for the insufficient accuracy on "fake actions" and similar actions, the sub-network adds recognition of hand images localized by the skeleton, sending the cropped local images to a ResNeXt network. Finally, the recognition results of the main network and the sub-network are fused. The human behavior recognition method proposed in this paper effectively utilizes human behavior information from multiple views and combines skeleton network and convolutional neural network models to significantly improve the accuracy of behavior recognition.
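The proportional score fusion described in the conclusion can be written in a few lines. A minimal sketch follows; the fusion weights alpha and beta are placeholders, not the values tuned in the paper.

```python
import torch

def fuse_scores(view1: torch.Tensor, view2: torch.Tensor,
                local_cnn: torch.Tensor,
                alpha: float = 0.5, beta: float = 0.7) -> torch.Tensor:
    """Late fusion of per-class scores, all shaped (B, num_classes):
    first average the two complementary skeleton views, then mix in the
    sub-network's local-image scores."""
    skeleton = alpha * view1 + (1.0 - alpha) * view2   # main network
    return beta * skeleton + (1.0 - beta) * local_cnn  # main + sub-network
```

The predicted action for each clip is then `fuse_scores(...).argmax(dim=1)`.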
Keywords
