Combined multi-view controlled fusion and joint correlation for 3D human pose estimation

Dong Jing; Zhang Hongru; Fang Xiaoyong; Zhou Dongsheng; Yang Xin; Zhang Qiang; Wei Xiaopeng

Published: 2024-05-14
DOI: 10.11834/jig.230908

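Dong Jing¹, Zhang Hongru¹, Fang Xiaoyong², Zhou Dongsheng¹, Yang Xin³, Zhang Qiang³, Wei Xiaopeng³ (1. Dalian University; 2. Hunan Institute of Technology; 3. Dalian University of Technology)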

Abstract

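Objective Multi-view 3D human pose estimation can estimate the depth of each joint from 2D images captured from multiple directions, overcoming the ill-posedness that occlusion and depth ambiguity cause in monocular 3D human pose estimation. However, when system performance is constrained by the effectiveness of the 2D pose estimation results, it is difficult to further improve the final 3D estimation accuracy. To address this, this paper proposes JCNet, a 3D human pose estimation algorithm that combines multi-view controlled fusion and joint correlation and consists of three parts: a multi-view fusion optimization module, a 2D pose refinement module, and a structural triangulation module.
Method First, the multi-view controlled fusion optimization module, built on an epipolar geometry framework, selectively applies the principles of epipolar geometry to improve the estimation quality of the 2D heatmaps while reducing the introduction of noise. Then, a 2D pose refinement method based on the joint learning of graph convolution and attention mechanisms uses the relationships between joints within a single view as a constraint, so that it better learns the global and local information of the human body and optimizes the 2D pose estimate. Finally, structural triangulation is introduced to obtain prior knowledge of human bone lengths and embed it in the 3D reconstruction process, improving 3D human pose estimation performance.
Result The algorithm was evaluated on two public datasets, Human3.6M and Total Capture, and one synthetic dataset, Occlusion-Person, where it achieved mean per-joint errors of 17.1 mm, 18.7 mm, and 10.2 mm, respectively, clearly outperforming existing multi-view 3D human pose estimation algorithms.
Conclusion This paper proposes a multi-view 3D human pose estimation algorithm that builds both the consistency of human joints across views and the intrinsic topological constraints of the human skeleton within each view. It optimizes the 2D estimation results, corrects erroneous poses, effectively improves the accuracy of 3D human pose estimation, and achieves the best estimation results.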

Keywords

multi-view 3D human pose estimation; joint correlation; graph convolution; attention mechanism; triangulation

Combined multi-view controlled fusion and joint correlation for 3D human pose estimation

(1. Dalian University; 2. Hunan Institute of Technology; 3. Dalian University of Technology)

Abstract

Objective 3D human pose estimation, which aims to estimate 3D joint positions from images or videos, is fundamental to understanding human behavior. It is widely used in downstream tasks such as human-computer interaction, virtual fitting, autonomous driving, and pose tracking. According to the number of cameras, it can be divided into monocular and multi-view 3D human pose estimation. Monocular estimation is ill-posed because of occlusion and depth ambiguity, which makes the 3D joint positions difficult to recover. Multi-view estimation, however, can obtain the depth of each joint from multiple images and thus overcomes this problem. Most recent methods estimate the 3D joint positions by lifting their 2D counterparts, measured in multiple images, into 3D space with a module called triangulation. This module is usually used in a two-stage procedure: first, a 2D pose detector estimates the 2D joint coordinates of the person in each view separately; then, triangulation reconstructs the 3D pose from the multi-view 2D poses. Building on this procedure, some methods use epipolar geometry to fuse human joint features, establishing correlations among the views and improving the accuracy of 3D estimation. However, when system performance is constrained by the effectiveness of the 2D estimation results, further improvement in the final 3D estimation accuracy is difficult to achieve. Therefore, to extract human contextual information for more effective 2D features, we construct a novel 3D pose estimation network that explores the correlation of the same joint across multiple views and the correlation between neighboring joints within a single view.
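The second stage of the two-stage procedure described above can be illustrated with a minimal sketch of plain linear triangulation for a single joint. This is a generic textbook formulation, not the structural triangulation used in JCNet; the function names `project`, `solve3`, and `triangulate` and the toy camera matrices are assumptions made for illustration, and the sketch uses only the Python standard library.

```python
def project(P, X):
    """Project 3D point X with a 3x4 camera matrix P; returns pixel (u, v)."""
    Xh = (*X, 1.0)  # homogeneous coordinates
    h = [sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3)]
    return h[0] / h[2], h[1] / h[2]


def solve3(M, y):
    """Solve the 3x3 system M x = y by Gaussian elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(M, y)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(i + 1, 3):
            f = a[r][i] / a[i][i]
            for c in range(i, 4):
                a[r][c] -= f * a[i][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (a[i][3] - sum(a[i][c] * x[c] for c in range(i + 1, 3))) / a[i][i]
    return x


def triangulate(cameras, points_2d):
    """Least-squares 3D position of one joint from its 2D detections.

    Each view contributes two linear equations derived from
    u = (P[0] . Xh) / (P[2] . Xh) and v = (P[1] . Xh) / (P[2] . Xh).
    """
    rows, rhs = [], []
    for P, (u, v) in zip(cameras, points_2d):
        for coord, r in ((u, 0), (v, 1)):
            rows.append([coord * P[2][c] - P[r][c] for c in range(3)])
            rhs.append(P[r][3] - coord * P[2][3])
    # Normal equations (A^T A) x = A^T b of the overdetermined system A x = b.
    AtA = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    Atb = [sum(row[i] * b for row, b in zip(rows, rhs)) for i in range(3)]
    return solve3(AtA, Atb)


# Toy setup (hypothetical): three identity-rotation cameras with shifted centers.
cameras = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, -1.0], [0.0, 0.0, 1.0, 0.0]],
]
joint_3d = (0.2, -0.1, 4.0)
detections = [project(P, joint_3d) for P in cameras]
estimate = triangulate(cameras, detections)
```

With noise-free detections the estimate coincides with the original point up to rounding. With noisy 2D detections the least-squares solution degrades, which is the dependence on 2D estimation quality that motivates the work.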
Method In this paper, we propose a multi-view 3D human pose estimation method based on joint point correlation (JCNet), which includes three parts: a multi-view controlled fusion optimization module, a 2D pose refinement module, and a structural triangulation module. First, a set of RGB images captured from multiple views is fed into the 2D detector to obtain 2D heatmaps, and the adaptive weight of each heatmap is learned by a weight learning network with an appearance information branch and a geometric information branch. On this basis, we construct a multi-view controlled fusion optimization module within an epipolar geometry framework, which analyzes the estimation quality of the joints in each camera view to steer the fusion process. Specifically, it selectively applies the principles of epipolar geometry to fuse all views according to the weights, so that low-quality estimates benefit from the auxiliary views while no noise is introduced into high-quality heatmaps. Subsequently, a 2D pose refinement module composed of attention mechanisms and graph convolution is applied. The attention mechanism enables the model to capture global context by assigning weights, while the graph convolutional network (GCN) exploits local information by aggregating the features of neighboring nodes and encodes the topological structure of the human skeleton. The network combining attention and GCN not only learns human body information better but also models the interdependence between joint points within a single view to refine the 2D pose estimation results. Finally, structural triangulation is introduced, which applies structural constraints of the human body and skeleton bone lengths during 2D-to-3D inference to improve the accuracy of 3D pose estimation. This paper adopts the pre-trained 2D backbone called simple baseline as the 2D detector to extract 2D heatmaps. The threshold ε = 0.99 is used to determine the joint estimation quality, and the number of layers N = 3 is chosen for the 2D pose refinement.
Result We compare the performance of JCNet with that of state-of-the-art models on two public datasets, Human3.6M and Total Capture, and a synthetic dataset, Occlusion-Person. The Mean Per Joint Position Error (MPJPE), which measures the Euclidean distance between the estimated 3D joint positions and the ground truth, is used as the evaluation metric. MPJPE reflects the quality of the estimated 3D human poses and provides an intuitive comparison of the performance of different methods. On the Human3.6M dataset, the proposed method reduces the error by a further 2.4 mm compared to the baseline AdaFuse. Moreover, because our network introduces rich prior knowledge and effectively models the connectivity of human joints, JCNet achieves at least a 10% improvement over most methods that do not use the Skinned Multi-Person Linear (SMPL) model. Compared to Learnable Human Mesh Triangulation (LMT), a method that incorporates the SMPL model and volumetric triangulation, our method still reduces the error by 0.5 mm. On the Total Capture dataset, our method improves performance by 2.6% over the strong baseline AdaFuse. On the Occlusion-Person dataset, JCNet achieves the best estimates for the vast majority of joints and improves performance by 19%. Furthermore, we compare visualizations of the 3D human pose estimates of our method and the baseline AdaFuse on the Human3.6M and Total Capture datasets to give a more intuitive demonstration of the estimation performance. The qualitative results on both datasets show that JCNet can use the prior constraints on skeleton bone length to correct implausible poses.
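As an illustration of the evaluation metric, the following minimal sketch computes MPJPE for one pose as the mean Euclidean distance between corresponding predicted and ground-truth joints. The function name `mpjpe` and the toy coordinates are assumptions for illustration; any alignment or protocol details used in the paper's evaluation are not reproduced here.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error for a single pose.

    pred, gt: equal-length sequences of (x, y, z) joint positions
    expressed in the same unit (for example millimetres).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Average the Euclidean distance over all joints.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy example (hypothetical values): one joint is exact, one is off by 5 units.
predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(predicted, ground_truth)  # (0 + 5) / 2 = 2.5
```

A dataset-level score would average this per-pose value over all evaluated frames.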
Conclusion We propose JCNet, a multi-view 3D human pose estimation method that builds both the consistency of human joints across multiple views and the intrinsic topological constraints of the human skeleton within each view. The method achieves excellent 3D human pose estimation performance. Experimental results on the public datasets show that JCNet has significant advantages in the evaluation metrics over other advanced methods, demonstrating its effectiveness and generalization ability.

Keywords

multi-view 3D human pose estimation; joint point correlation; graph convolutional network; attention mechanism; triangulation