3D Human Pose Sequence Estimation from Point Clouds Combining Temporal Feature Constraints and Joint Optimization

Liao Lianjun1,2,3, Zhong Chongyang1,2, Zhang Zhiheng1,2, Hu Lei1,2, Zhang Zihao1, Xia Shihong1,2 (1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; 3. School of Information Science and Technology, North China University of Technology, Beijing 100144, China)

Abstract
Objective Traditional 3D human pose estimation methods usually take a single-frame point cloud as input, which may ignore the inherent prior of human motion smoothness and lead to jittery artifacts. Meanwhile, real image datasets with 2D human pose annotations are relatively easy to obtain, whereas collecting large-scale real image datasets with high-quality 3D human pose annotations for fully supervised training is difficult. To address this, we propose a new 3D human pose estimation method for point cloud sequences. Method First, pose-aware point clouds are estimated from the depth image sequence. Then, a neural network that exploits temporal information is built to encode the spatial-temporal features of the pose-aware point cloud sequence. Weakly supervised deep learning is adopted to exploit the large amount of more easily obtained data annotated with 2D human poses. Finally, a multi-task network jointly trains human pose estimation and human motion prediction to improve optimization. Result The algorithm is evaluated on two datasets. On the ITOP (invariant-top view) dataset, the mean average precision (mAP) of our method is 0.99, 13.18, and 17.96 percentage points higher than those of the compared methods. On the NTU-RGBD dataset, the mAP of our method is 7.03 percentage points higher than that of the state-of-the-art WSM (weakly supervised adversarial learning) method. Ablation experiments on the ITOP dataset verify the effectiveness of each component of the algorithm. Compared with single-task training, joint training of human pose estimation and motion prediction in a multi-task network improves mAP by more than 2 percentage points. Conclusion The proposed 3D human pose estimation method for point cloud sequences makes full use of the prior of human motion continuity, obtains smoother pose estimates, and performs well on both the ITOP and NTU-RGBD datasets. With the multi-task joint optimization strategy, human pose estimation and motion prediction are solved jointly and mutually reinforce each other.
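The mAP figures above count a joint as correct when its predicted position lies within a distance threshold (10 cm) of the ground truth, averaged over joints. A minimal sketch of this metric; the array shapes and the toy data are illustrative assumptions, not the paper's evaluation code:

```python
import numpy as np

def pck_per_joint(pred, gt, threshold=0.10):
    """Fraction of correctly localized keypoints per joint.

    pred, gt: arrays of shape (frames, joints, 3), in meters.
    A joint is correct when its Euclidean error is below the
    threshold (10 cm in the evaluation reported above)."""
    errors = np.linalg.norm(pred - gt, axis=-1)   # (frames, joints)
    return (errors < threshold).mean(axis=0)      # (joints,)

def mean_ap(pred, gt, threshold=0.10):
    # mAP here: per-joint accuracy averaged over all joints.
    return pck_per_joint(pred, gt, threshold).mean()

# Toy check: two frames, two joints, joint 1 off by 20 cm in every frame.
gt = np.zeros((2, 2, 3))
pred = gt.copy()
pred[:, 1, 0] = 0.20
print(mean_ap(pred, gt))   # 0.5 (joint 0 always correct, joint 1 never)
```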
3D human pose sequence estimation from point clouds combining temporal feature constraints and a joint learning strategy

Liao Lianjun1,2,3, Zhong Chongyang1,2, Zhang Zhiheng1,2, Hu Lei1,2, Zhang Zihao1, Xia Shihong1,2(1.Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;2.School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China;3.School of Information Science and Technology, North China University of Technology, Beijing 100144, China)

Abstract
Objective Point cloud-based 3D human pose estimation is a key task in computer vision, with a wide range of applications in augmented reality/virtual reality (AR/VR), human-computer interaction (HCI), motion retargeting, and virtual avatar manipulation. Current deep learning-based 3D human pose estimation faces the following challenges: 1) the task is hampered by occlusion and self-occlusion ambiguity, and the noisy point clouds produced by depth cameras make it difficult to learn a proper human pose estimation model. 2) Current depth-image based methods mainly estimate the pose from a single image, which ignores the intrinsic prior of human motion smoothness and leads to jittery artifacts on consecutive point cloud sequences. A promising direction is to leverage point cloud sequences for high-fidelity human pose estimation by enforcing human motion smoothness; however, designing an effective way to obtain human poses by modeling point cloud sequences is challenging. 3) It is hard to collect large-scale real image datasets with high-quality 3D human pose annotations for fully supervised training, while real datasets with 2D human pose annotations are easy to collect. Moreover, human pose estimation is closely related to motion prediction, which aims to predict future motion; a challenging question is whether 3D human pose estimation and motion prediction can benefit each other. Method We develop a method to obtain high-fidelity 3D human poses from point cloud sequences, using a weakly supervised deep learning architecture to learn 3D human poses from 3D point cloud sequences. We design a two-stage human pose estimation pipeline that takes point cloud sequences as input. 1) The 2D pose information is estimated from the depth maps, so that the background is removed and the pose-aware point clouds are extracted.
To ensure that the normalized sequential point clouds share the same scale, all point clouds are normalized with respect to a fixed bounding box. 2) The spatial-temporal features of the pose-aware point cloud sequences are encoded by a hierarchical PointNet++ backbone followed by long short-term memory (LSTM) layers. To improve optimization, a multi-task network is employed to jointly solve the human pose estimation and motion prediction problems. To exploit more training data with 2D human pose annotations and to reduce ambiguity through the supervision of 2D joints, weakly supervised learning is adopted in our framework. Result To validate the performance of the proposed algorithm, several experiments are conducted on two public datasets: the invariant-top view dataset (ITOP) and the NTU-RGBD dataset. Our method is compared with several popular methods, including V2V-PoseNet, the viewpoint invariant method (VI), the Inference Embedded method, and the weakly supervised adversarial learning method (WSM). On the ITOP dataset, our mean average precision (mAP) is 0.99 percentage points higher than that of WSM at a threshold of 10 cm; compared with VI and the Inference Embedded method, our mAP is 13.18 and 17.96 percentage points higher, respectively. Our mean joint error is 3.33 cm, 5.17 cm, 1.67 cm, and 0.67 cm lower than those of VI, the Inference Embedded method, V2V-PoseNet, and WSM, respectively. The performance gain could originate from the sequential input data and from the constraints on motion parameters such as velocity and acceleration: 1) the sequential data are encoded through LSTM units, which yields smoother predictions and improves estimation performance; 2) the motion parameters alleviate the jitter caused by random sampling and complement the direct supervision on the joint coordinates. On the NTU-RGBD dataset, we compare our method with the state-of-the-art WSM method.
The mAP of our method is 7.03 percentage points higher than that of WSM when the threshold is set to 10 cm. Ablation experiments are also carried out on the ITOP dataset to investigate the effect of the individual components. To understand the effect of the sequential point cloud input, we conduct experiments with different temporal receptive fields; a receptive field of 1 corresponds to estimation without sequential data. The percentage of correct keypoints (PCK) drops to its lowest value of 88.57% when the receptive field is set to 1, increases as the receptive field grows from 1 to 5, and levels off when the receptive field exceeds 13. The PCK is 87.55% when the model is trained only on fully labeled data, and 90.58% when it is trained on both fully and weakly labeled data, showing that our weakly supervised learning method improves performance by about 3 percentage points. The experiments also demonstrate that our weakly supervised learning method works with only a small amount of fully labeled data. Compared with single-task training, jointly training human pose estimation and motion prediction in a multi-task network improves mAP by more than 2 percentage points. Conclusion Our method makes full use of the prior of human motion continuity to obtain smoother human pose estimation results. All experiments demonstrate that the contributed components are effective and that our method achieves state-of-the-art performance on the ITOP and NTU-RGBD datasets. The joint training strategy is valid for the mutually related tasks of human pose estimation and motion prediction. With the weakly supervised method on sequential data, more easy-to-access training data can be used, and our model is robust across different levels of training data annotation. It can be applied in scenarios that require high-quality human poses, such as motion retargeting and virtual fitting, and demonstrates the potential of using sequential data as input.
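As a rough illustration of the pipeline described above (per-frame point-set encoding, LSTM temporal encoding, and two task heads for pose estimation and motion prediction), the following PyTorch sketch substitutes a single PointNet-style set encoder (shared MLP plus max-pooling) for the paper's hierarchical PointNet++ backbone. All layer sizes, the joint count, and the class name are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PoseSequenceNet(nn.Module):
    """Sketch: point-set encoding per frame, LSTM over frames,
    and two heads for pose estimation and motion prediction."""

    def __init__(self, num_joints=15, feat_dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(            # shared per-point MLP
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)    # per-frame 3D joints
        self.motion_head = nn.Linear(feat_dim, num_joints * 3)  # next-frame joints

    def forward(self, clouds):
        # clouds: (batch, frames, points, 3), normalized to a fixed bounding box
        b, t, n, _ = clouds.shape
        feats = self.point_mlp(clouds).max(dim=2).values   # (b, t, feat_dim)
        temporal, _ = self.lstm(feats)                     # (b, t, feat_dim)
        pose = self.pose_head(temporal).view(b, t, -1, 3)
        next_pose = self.motion_head(temporal[:, -1]).view(b, -1, 3)
        return pose, next_pose

# A batch of 2 sequences, 5 frames, 512 points per frame.
model = PoseSequenceNet()
pose, next_pose = model(torch.randn(2, 5, 512, 3))
print(pose.shape, next_pose.shape)
```

Max-pooling over the point dimension makes the per-frame feature invariant to point ordering, which is why a set encoder rather than an ordered-sequence model is used within each frame.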
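The weakly supervised component supervises predicted 3D joints with 2D annotations. One common way to realize this, sketched here under an assumed pinhole-camera model (the function name, intrinsics, and toy values are hypothetical, not from the paper), is to project the predicted 3D joints into the image plane and penalize the 2D error:

```python
import torch

def weak_2d_loss(joints_3d, joints_2d_gt, intrinsics):
    """Project predicted 3D joints through pinhole intrinsics and
    penalize the 2D pixel error, so frames that carry only 2D pose
    annotations still provide a training signal.
    intrinsics = (fx, fy, cx, cy); all names are illustrative."""
    fx, fy, cx, cy = intrinsics
    u = joints_3d[..., 0] / joints_3d[..., 2] * fx + cx
    v = joints_3d[..., 1] / joints_3d[..., 2] * fy + cy
    proj = torch.stack([u, v], dim=-1)        # (..., joints, 2)
    return torch.mean((proj - joints_2d_gt) ** 2)

# A joint on the optical axis at 2 m projects to the principal point.
joints = torch.tensor([[[0.0, 0.0, 2.0]]])
gt_2d = torch.tensor([[[320.0, 240.0]]])
loss = weak_2d_loss(joints, gt_2d, (500.0, 500.0, 320.0, 240.0))
print(loss.item())   # 0.0
```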
