3D human pose sequence estimation from point clouds combining temporal feature and joint learning strategy
Vol. 27, Issue 12, Pages: 3608-3621 (2022)
Published: 16 December 2022
Accepted: 03 January 2022
DOI: 10.11834/jig.210836
Lianjun Liao, Chongyang Zhong, Zhiheng Zhang, Lei Hu, Zihao Zhang, Shihong Xia. 3D human pose sequence estimation from point clouds combining temporal feature and joint learning strategy [J]. Journal of Image and Graphics, 27(12): 3608-3621 (2022)
Objective
Traditional 3D human pose estimation methods usually take a single frame of point cloud as input, which may ignore the inherent prior of human motion smoothness and produce jittery artifacts. At present, real image datasets with 2D human pose annotations are relatively easy to obtain, while collecting large-scale real image datasets with high-quality 3D human pose annotations for fully supervised training is difficult. To address these problems, this paper proposes a new 3D human pose estimation method for point cloud sequences.
Method
First, pose-aware point clouds are estimated from the depth image sequence. A neural network is then built to exploit temporal information, encoding the spatio-temporal features of the pose-aware point cloud sequence. Weakly supervised deep learning is chosen so that the large and more easily obtained datasets with 2D human pose annotations can be exploited. Finally, a multi-task network jointly trains human pose estimation and human motion prediction to improve optimization.
Result
The proposed algorithm is evaluated on two datasets. On the ITOP (invariant-top view) dataset, the mean average precision (mAP) of our method is 0.99%, 13.18%, and 17.96% higher than that of the compared methods. On the NTU-RGBD dataset, the mAP of our method is 7.03% higher than that of the state-of-the-art WSM (weakly supervised adversarial learning) method. Ablation experiments on the ITOP dataset verify the effectiveness of each component of the algorithm. Compared with single-task training, joint training of human pose estimation and motion prediction with the multi-task network improves mAP by more than 2%.
Conclusion
The proposed 3D human pose estimation method for point cloud sequences makes full use of the prior of human motion continuity and obtains smoother pose estimates, achieving good results on both the ITOP and NTU-RGBD datasets. With the multi-task joint optimization strategy, the human pose estimation and motion prediction tasks are solved jointly and promote each other.
Objective
Point cloud-based 3D human pose estimation is a key task in computer vision, with a wide range of applications in augmented reality/virtual reality (AR/VR), human-computer interaction (HCI), motion retargeting, and virtual avatar manipulation. Deep learning-based 3D human pose estimation remains challenging in the following respects. 1) The task is constrained by occlusion and self-occlusion ambiguity; moreover, the noisy point clouds produced by depth cameras make it difficult to learn a proper human pose estimation model. 2) Current depth-image-based methods mainly focus on pose estimation from a single image, which ignores the intrinsic prior of human motion smoothness and leads to jittery artifacts on consecutive point cloud sequences. Leveraging point cloud sequences and enforcing human motion smoothness promises high-fidelity pose estimation, but it is challenging to design an effective way to obtain human poses by modeling point cloud sequences. 3) It is hard to collect large-scale real image datasets with high-quality 3D human pose annotations for fully supervised training, whereas real datasets with 2D human pose annotations are easy to collect. Moreover, human pose estimation is closely related to motion prediction, which aims to predict future motion from observed frames. An open question is whether 3D human pose estimation and motion prediction can benefit each other.
Method
We develop a method to obtain high-fidelity 3D human poses from point cloud sequences. A weakly supervised deep learning architecture is used to learn 3D human pose from 3D point cloud sequences, and we design a dual-level human pose estimation pipeline that takes point cloud sequences as input. 1) The 2D pose information is estimated from the depth maps, so that the background is removed and the pose-aware point clouds are extracted. To ensure that the normalized sequential point clouds share the same scale, normalization is carried out with a single fixed bounding box applied to all the point clouds.
2) The pose is encoded by a hierarchical PointNet++ backbone followed by long short-term memory (LSTM) layers, capturing the spatio-temporal features of the pose-aware point cloud sequences.
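The following sketch shows the overall shape of such a spatio-temporal encoder. For brevity it stands in for the hierarchical PointNet++ set-abstraction stages with a single shared MLP plus symmetric max-pooling per frame, and then runs an LSTM over the per-frame features. The class name, joint count, and layer widths are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PoseSequenceEncoder(nn.Module):
    """Simplified spatio-temporal encoder: a PointNet-style per-frame
    feature extractor (standing in for hierarchical PointNet++),
    an LSTM over time, and a per-frame joint regression head."""

    def __init__(self, num_joints=15, feat_dim=256, hidden_dim=256):
        super().__init__()
        # Shared MLP applied to every point independently.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, pts):
        # pts: (B, T, N, 3) batch of normalized point cloud sequences.
        B, T, N, _ = pts.shape
        f = self.point_mlp(pts)       # (B, T, N, feat_dim)
        f = f.max(dim=2).values       # symmetric max-pool over points
        h, _ = self.lstm(f)           # (B, T, hidden_dim), temporal encoding
        joints = self.head(h)         # (B, T, num_joints * 3)
        return joints.view(B, T, self.num_joints, 3)
```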
To improve the optimization effect, a multi-task network is employed to jointly resolve the human pose estimation and motion prediction problems.
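A joint objective of this kind can be sketched as a weighted sum of the two task losses computed on a shared backbone; the MSE penalties and the weight value below are our assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def multi_task_loss(pred_pose, gt_pose, pred_future, gt_future, w_pred=0.5):
    """Combined loss for two tasks sharing one encoder:
    pose estimation on observed frames (pred_pose vs. gt_pose) and
    motion prediction on future frames (pred_future vs. gt_future).
    w_pred balances the auxiliary prediction task (assumed value).
    """
    pose_loss = F.mse_loss(pred_pose, gt_pose)        # estimation task
    future_loss = F.mse_loss(pred_future, gt_future)  # prediction task
    return pose_loss + w_pred * future_loss
```

Because both heads backpropagate into the same temporal encoder, gradients from the prediction task encourage features that capture motion dynamics, which is one plausible reading of why the two tasks benefit each other.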
In addition, to use more training data with 2D human pose annotations and to reduce the depth ambiguity through the supervision of 2D joints, weakly supervised learning is adopted in our framework.
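One standard way to realize such weak supervision, sketched below under our own assumptions (a pinhole camera model and an MSE penalty; the paper's exact loss may differ), is to project the predicted 3D joints into the image plane and compare them with the 2D annotations.

```python
import torch
import torch.nn.functional as F

def project_to_image(joints_3d, fx, fy, cx, cy):
    """Pinhole projection of 3D joints (camera coordinates, z > 0)
    onto the image plane using the depth camera intrinsics."""
    x, y, z = joints_3d.unbind(dim=-1)
    u = fx * x / z + cx
    v = fy * y / z + cy
    return torch.stack([u, v], dim=-1)

def weak_2d_loss(pred_joints_3d, gt_joints_2d, fx, fy, cx, cy):
    """Supervise 3D predictions with 2D-only annotations by penalizing
    the reprojection error of the predicted joints."""
    proj = project_to_image(pred_joints_3d, fx, fy, cx, cy)
    return F.mse_loss(proj, gt_joints_2d)
```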
Result
To validate the performance of the proposed algorithm, experiments are conducted on two public datasets: the invariant-top view dataset (ITOP) and the NTU-RGBD dataset. Our method is compared with several popular methods, including V2V-PoseNet, the viewpoint invariant method (VI), the Inference Embedded method, and the weakly supervised adversarial learning method (WSM). On the ITOP dataset, our mean average precision (mAP) is 0.99 percentage points higher than that of WSM at a threshold of 10 cm, and 13.18 and 17.96 percentage points higher than those of VI and the Inference Embedded method, respectively. Our mean joint error is 3.33 cm, 5.17 cm, 1.67 cm, and 0.67 cm lower than that of the VI method, the Inference Embedded method, V2V-PoseNet, and WSM, respectively.
respectively. The performance gain could be originated from the sequential input data and the constraints from the motion parameters like velocity and the accelerated velocity. 1) The sequential data is encoded through the LSTM units
which could get the smoother prediction and improve the estimation performance. 2) The motion parameters can alleviate the jitters caused by random sampling and yield the direct supervision of the joint coordinates. For the NTU-RGBD dataset
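The velocity and acceleration constraints can be sketched as finite-difference terms that match the motion parameters of the prediction to those of the ground truth; the finite-difference formulation and the weights below are our assumptions.

```python
import torch

def motion_param_loss(pred, gt, w_vel=1.0, w_acc=1.0):
    """Constrain motion parameters of a predicted joint sequence.

    pred, gt: (B, T, J, 3) joint positions over T frames.
    Velocity and acceleration are approximated by first and second
    finite differences along time; matching them to the ground truth
    suppresses the frame-to-frame jitter caused by random sampling.
    """
    pred_vel = pred[:, 1:] - pred[:, :-1]
    gt_vel = gt[:, 1:] - gt[:, :-1]
    pred_acc = pred_vel[:, 1:] - pred_vel[:, :-1]
    gt_acc = gt_vel[:, 1:] - gt_vel[:, :-1]
    vel_loss = (pred_vel - gt_vel).pow(2).mean()
    acc_loss = (pred_acc - gt_acc).pow(2).mean()
    return w_vel * vel_loss + w_acc * acc_loss
```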
On the NTU-RGBD dataset, we compare our method with WSM: the mAP of our method is 7.03 percentage points higher when the threshold is set to 10 cm. At the same time, ablation experiments are carried out on the ITOP dataset to investigate the effect of the individual components. To understand the effect of the sequential point cloud input, we experiment with different temporal receptive fields; a receptive field of 1 corresponds to excluding sequential data altogether. The percentage of correct keypoints (PCK) drops to its lowest value of 88.57% when the receptive field is 1, increases as the receptive field grows from 1 to 5, and becomes stable once the receptive field exceeds 13. Trained only with fully labeled data, our PCK is 87.55%; trained with both fully and weakly labeled data, it reaches 90.58%. This shows that our weakly supervised learning method improves the model by about 3 percentage points, and the experiments demonstrate that it also works when only a small amount of fully labeled data is available. Compared with single-task training, the multi-task network improves the mAP of human pose estimation and motion prediction by more than 2 percentage points.
Conclusion
Our method makes full use of the prior of human motion continuity to obtain smoother human pose estimation results. All experiments demonstrate that the contributed components are effective and that our method achieves state-of-the-art performance on the ITOP and NTU-RGBD datasets. The joint training strategy is beneficial for the mutually related tasks of human pose estimation and motion prediction. With weakly supervised learning on sequential data, the method can use more easy-to-access training data and is robust across different levels of training-data annotation. It can be applied to scenarios that require high-quality human poses, such as motion retargeting and virtual fitting, and its use of sequential data as input offers further potential.
human motion; human pose estimation; human motion prediction; point cloud sequence; weakly-supervised learning
Akhter I, Simon T, Khan S, Matthews I and Sheikh Y. 2012. Bilinear spatiotemporal basis models. ACM Transactions on Graphics, 31(2): #17 [DOI: 10.1145/2159516.2159523]
Bütepage J, Black M J, Kragic D and Kjellström H. 2017. Deep representation learning for human motion prediction and classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1591-1599 [DOI: 10.1109/CVPR.2017.173]
Chang J Y, Moon G and Lee K M. 2018. V2V-PoseNet: voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5079-5088 [DOI: 10.1109/CVPR.2018.00533]
Dabral R, Mundhada A, Kusupati U, Afaque S and Jain A. 2017. Structure-aware and temporally coherent 3D human pose estimation [EB/OL]. [2021-07-10]. https://arxiv.org/pdf/1711.09250v1.pdf
Fragkiadaki K, Levine S, Felsen P and Malik J. 2015. Recurrent network models for human dynamics//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4346-4354 [DOI: 10.1109/ICCV.2015.494]
Haque A, Peng B Y, Luo Z L, Alahi A, Yeung S and Li F F. 2016. Towards viewpoint invariant 3D human pose estimation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 160-177 [DOI: 10.1007/978-3-319-46448-0_10]
Hossain M R I and Little J J. 2018. Exploiting temporal information for 3D human pose estimation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 69-86 [DOI: 10.1007/978-3-030-01249-6_5]
Kanazawa A, Zhang J Y, Felsen P and Malik J. 2019. Learning 3D human dynamics from video//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5607-5616 [DOI: 10.1109/CVPR.2019.00576]
Lee K, Lee I and Lee S. 2018. Propagating LSTM: 3D pose estimation based on joint interdependency//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 123-141 [DOI: 10.1007/978-3-030-01234-2_8]
Li Y Y, Bu R, Sun M C, Wu W, Di X H and Chen B Q. 2018. PointCNN: convolution on X-transformed points//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc.: 828-838
Li Z, Wang X, Wang F and Jiang P L. 2019. On boosting single-frame 3D human pose estimation via monocular videos//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 2192-2201 [DOI: 10.1109/ICCV.2019.00228]
Lin M D, Lin L, Liang X D, Wang K Z and Cheng H. 2017. Recurrent 3D pose sequence machines//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5543-5552 [DOI: 10.1109/CVPR.2017.588]
Liu J, Shahroudy A, Perez M, Wang G, Duan L Y and Kot A C. 2020. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2684-2701 [DOI: 10.1109/TPAMI.2019.2916873]
Martinez J, Hossain R, Romero J and Little J J. 2017. A simple yet effective baseline for 3D human pose estimation//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2659-2668 [DOI: 10.1109/ICCV.2017.288]
Min J Y, Chen Y L and Chai J X. 2009. Interactive generation of human animation with deformable motion models. ACM Transactions on Graphics, 29(1): #9 [DOI: 10.1145/1640443.1640452]
Newell A, Yang K Y and Deng J. 2016. Stacked hourglass networks for human pose estimation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 483-499 [DOI: 10.1007/978-3-319-46484-8_29]
Pavllo D, Feichtenhofer C, Grangier D and Auli M. 2019. 3D human pose estimation in video with temporal convolutions and semi-supervised training//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7745-7754 [DOI: 10.1109/CVPR.2019.00794]
Qi C R, Litany O, He K M and Guibas L. 2019. Deep Hough voting for 3D object detection in point clouds//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9276-9285 [DOI: 10.1109/ICCV.2019.00937]
Qi C R, Liu W, Wu C X, Su H and Guibas L J. 2018. Frustum PointNets for 3D object detection from RGB-D data//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 918-927 [DOI: 10.1109/CVPR.2018.00102]
Qi C R, Su H, Kaichun M and Guibas L J. 2017a. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Qi C R, Yi L, Su H and Guibas L J. 2017b. PointNet++: deep hierarchical feature learning on point sets in a metric space//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 5105-5114
Shahroudy A, Liu J, Ng T T and Wang G. 2016. NTU RGB+D: a large scale dataset for 3D human activity analysis//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1010-1019 [DOI: 10.1109/CVPR.2016.115]
Wang K Z, Zhai S F, Cheng H, Liang X D and Lin L. 2016. Human pose estimation from depth images via inference embedded multi-task learning//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, the Netherlands: Association for Computing Machinery: 1227-1236 [DOI: 10.1145/2964284.2964322]
Wang Z Y, Chai J X and Xia S H. 2021. Combining recurrent neural networks and adversarial training for human motion synthesis and control. IEEE Transactions on Visualization and Computer Graphics, 27(1): 14-28 [DOI: 10.1109/TVCG.2019.2938520]
Yao B P and Li F F. 2010. Modeling mutual context of object and human pose in human-object interaction activities//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 17-24 [DOI: 10.1109/CVPR.2010.5540235]
Zhang J, Felsen P, Kanazawa A and Malik J. 2019. Predicting 3D human dynamics from video//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7113-7122 [DOI: 10.1109/ICCV.2019.00721]
Zhang Z H, Hu L, Deng X M and Xia S H. 2020. Weakly supervised adversarial learning for 3D human pose estimation from point clouds. IEEE Transactions on Visualization and Computer Graphics, 26(5): 1851-1859 [DOI: 10.1109/TVCG.2020.2973076]
Zhou Y, Li Z M, Xiao S J, He C, Huang Z and Li H. 2018. Auto-conditioned recurrent networks for extended complex human motion synthesis//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR: #266
Zhou Y and Tuzel O. 2018. VoxelNet: end-to-end learning for point cloud based 3D object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4490-4499 [DOI: 10.1109/CVPR.2018.00472]
Zhou Y F, Dong H W and Saddik A E. 2020. Learning to estimate 3D human pose from point cloud. IEEE Sensors Journal, 20(20): 12334-12342 [DOI: 10.1109/JSEN.2020.2999849]