Jia Di1,2, Li Yuyang1, An Tong1, Zhao Jinyuan1 (1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China; 2. Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao 125105, China)
Objective Hand pose estimation from a single RGB image suffers from low accuracy because of the complexity of gestures, the local self-similarity of finger features, and occlusion. To address this, a multiscale feature fusion network for monocular hand pose estimation is proposed. Method 1) A ResNet50 (50-layer residual network) module extracts feature maps of different resolutions from the RGB image. A channel conversion module then explicitly learns the dependencies among feature channels, enhancing important channel information and weakening minor channel information. 2) In the global regression module, feature maps of different resolutions are fused through designed inter-node connections so that both the detail and the global information of the image are fully exploited. A local optimization module continues to extract deeper features and produces Gaussian heatmaps of the hand joints, correcting joints whose regression is inaccurate because of occlusion and similar factors. 3) The smallest feature map output by the channel conversion module is processed by global pooling and a multilayer perceptron to obtain the handedness and the depth of the right hand relative to the left hand. 4) The above results are combined to obtain the final hand pose. Result The multiscale feature fusion network was trained on the InterHand2.6M and RHD (rendered handpose dataset) datasets. Both its mean root joint error and mean joint error are lower than those of comparable methods, and it is more robust in complex and occluded scenes. On the InterHand2.6M dataset, compared with InterNet, the proposed method reduces the mean joint error of interacting hands by 5.8%, the mean joint error of single hands by 8.3%, and the mean root joint error by 5.1%. On the RHD test set, the proposed method achieves the lowest mean hand joint error among comparable methods. Conclusion The proposed multiscale feature fusion network predicts hand joint positions more accurately and is well suited to hand pose estimation for complex gestures or under occlusion (code: https://github.com/cor-nersInHeart/hand-pose-esitmation.git)
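Step 2) of the method supervises the joints with Gaussian heatmaps. As a rough illustration only (not the paper's implementation; the 64-pixel heatmap size and the sigma are assumptions), a per-joint 2D Gaussian target can be rendered like this:

```python
import numpy as np

def gaussian_heatmap(joint_xy, size=64, sigma=2.0):
    """Render a 2D Gaussian centered at a joint location (x, y),
    peaking at 1.0 -- a common target for heatmap-based joint regression.
    `size` and `sigma` are illustrative choices, not values from the paper."""
    xs = np.arange(size, dtype=np.float32)   # shape (size,)  -> x axis
    ys = xs[:, None]                         # shape (size,1) -> y axis, broadcasts
    x, y = joint_xy
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

# Example: target heatmap for a joint at pixel (x=20, y=30);
# the map peaks at exactly that pixel.
hm = gaussian_heatmap((20, 30))
```

Training then minimizes the distance between predicted and target heatmaps per joint, which is what lets the local optimization module pull occluded joints back toward plausible positions.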
Complex gesture pose estimation network fusing multiscale features
Jia Di1,2, Li Yuyang1, An Tong1, Zhao Jinyuan1(1.School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China;2.Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao 125105, China)
Objective Hand pose estimation aims to identify and localize the key points of human hands in images and has a wide range of applications in computer vision. Existing methods can be categorized as depth based or RGB based. Depth-based methods estimate the hand pose from depth features but require specific devices, which constrains the user environment. Scholars therefore use RGB images for hand pose estimation; however, this approach struggles in occluded environments. In particular, hand pose estimation from a single RGB image has low accuracy because of the complexity of the pose, the local self-similarity of finger features, and occlusion. Edge information is usually ignored in hand pose estimation, yet it is important for extracting the information of occluded parts. Moreover, fingertips are small, which complicates the recognition of fingertip joints, and many existing RGB-based methods do not make good use of edge information. A multiscale feature fusion network for monocular hand pose estimation is proposed to address these problems.
Method Gesture images usually contain complex detailed features, and fingers and joints are strongly correlated. Therefore, using a single feature for hand pose estimation tends to ignore diverse feature information, complicating the accurate extraction of gesture information. The multiscale feature fusion network (MS-FF) estimates the hand pose from a single RGB image. Feature maps of different resolutions are extracted from the RGB image by a ResNet50 module and fed into the channel conversion module, which explicitly learns the dependencies between channels, enhancing important information and downplaying minor information. The level of feature information depends on the resolution of a feature map; thus, the global regression module produces high-resolution feature maps containing rich semantic information. These maps are separately input into the local optimization module to extract deep information. Gaussian heatmaps of the hand joints are obtained to improve the spatial generalization ability of the model, from which accurate joint locations can be recovered. The feature map with the smallest resolution from the channel conversion module is used to obtain the handedness and the relative depth between the wrist joints. The above results are combined to estimate the hand pose.
Result The PyTorch framework was used for training. The hand image was resized to 256 × 256 pixels and input into the network. The batch size was set to 16, and the network was trained for 20 epochs on an NVIDIA 3090 GPU. The initial learning rate was set to 0.0001 and reduced by a factor of 10 at the 15th and 17th epochs. The proposed method achieved better metrics than comparable methods on different test sets. InterHand2.6M (H+M) was selected as the training set. Compared with InterNet, MS-FF obtained a mean relative root position error of 30.92, a mean per-joint position error of 11.10 on single-hand sequences, and a mean per-joint position error of 15.14 on interacting-hand sequences, which are 5.1%, 8.3%, and 5.8% lower, respectively. Each finger also achieved a low error. MS-FF has few model parameters and low computational complexity while improving recognition accuracy; however, its running rate (28 frames/s) is lower than that of InterNet (53 frames/s). The qualitative results show hand poses with finger self-occlusion and mutual occlusion between hands; estimating such interacting hand poses is more difficult than predicting a single-hand pose. Even so, our method correctly predicts hand joint positions and hand poses under occlusion and achieves good recognition results on occluded gestures.
Conclusion This study proposes MS-FF, a multiscale feature fusion network for monocular hand pose estimation. MS-FF extracts information of different levels from feature maps of different resolutions to process the detailed information of occluded edges and fingertips effectively and to estimate hand poses accurately. It copes well with complex application scenarios, handling joints that are difficult to recognize and gestures that are recognized inaccurately in occlusion scenes. Channels contain various implicit information, and the information important for recognizing gestures must be emphasized; the channel conversion module therefore adjusts channel weights to enhance important information. Fingertips occupy a small portion of an image and are relatively difficult to identify; the global regression module generates feature maps of different resolutions with rich semantic information to exploit image edge details and deep information effectively, which is important for estimating finger poses. Because the global regression module may not accurately identify occluded joints, the local optimization module exploits the deep information in the feature maps and fuses feature maps of all levels to correct joints that do not regress to the correct position, so the network applies well to occlusion scenes. Our method effectively estimates single and interacting hand poses and avoids errors caused by occlusion to a certain extent, achieving high accuracy and robustness. However, the running rate of MS-FF is lower than that of InterNet because of the more complex construction of MS-FF, which increases serial waiting, kernel startup, and synchronization overhead. In future work, we will continue to optimize the model to reduce its running time while maintaining recognition accuracy, paving the way for fast and accurate gesture recognition in real scenes.
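The training schedule reported in the Result section (initial learning rate 0.0001, divided by 10 at the 15th and 17th of 20 epochs) is a standard step decay. A minimal sketch, assuming 0-based epoch indices for the milestones:

```python
def step_lr(epoch, base_lr=1e-4, milestones=(15, 17), gamma=0.1):
    """Step-decay schedule: multiply the base learning rate by `gamma`
    once for each milestone epoch already reached. The milestone values
    follow the abstract; treating them as 0-based is an assumption."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Learning rate across the 20 training epochs:
# 1e-4 for epochs 0-14, 1e-5 for epochs 15-16, 1e-6 for epochs 17-19.
schedule = [step_lr(e) for e in range(20)]
```

In a PyTorch training loop the same effect is typically obtained with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 17], gamma=0.1)`.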