Video-based 3D human pose estimation combining sparse representation and deep learning
2020, Vol. 25, No. 3, pp. 456-467
Received: 2019-08-20
Revised: 2019-10-28
Accepted: 2019-11-04
Published in print: 2020-03-16
DOI: 10.11834/jig.190422
Objective
Errors in 2D pose estimation are the main source of error in 3D human pose estimation. The key to improving 3D human pose estimation is therefore to map the 2D pose, despite 2D error or noise, to the optimal and most reasonable 3D pose. This paper proposes a 3D pose estimation method that couples sparse representation with a deep model, combining the spatial geometric prior of 3D poses with temporal information to improve the accuracy of 3D pose estimation.
Method
First, a 3D deformable shape model fused with sparse representation is used to obtain a reliable 3D initial value for each single frame. A multi-channel long short-term memory (MLSTM) denoising encoder-decoder is then constructed, and the single-frame 3D initial values are fed into it as a temporal sequence. The MLSTM denoising encoder-decoder learns the temporal dependence of the human pose between adjacent frames and imposes a temporal smoothness constraint, yielding the final optimized 3D pose.
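As an illustrative sketch only (the exact form and weighting of the constraint used in the paper are our assumption), a temporal smoothness constraint of this kind penalizes the difference between the refined 3D poses of adjacent frames:

% \hat{S}_t is the refined 3D pose of frame t (a J x 3 joint matrix),
% T is the sequence length, and \lambda is an assumed weighting factor.
L_{\mathrm{smooth}} = \lambda \sum_{t=2}^{T} \bigl\lVert \hat{S}_t - \hat{S}_{t-1} \bigr\rVert_F^2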
Result
Comparative experiments were conducted on the Human3.6M dataset. For the two kinds of input data, the 2D ground-truth coordinates provided by the dataset and the 2D coordinates estimated by a convolutional neural network, the average reconstruction error over video sequences after optimization by the MLSTM denoising encoder-decoder decreased by 12.6% and 13%, respectively, compared with single-frame estimation, and by 6.4% and 9.1%, respectively, compared with the existing video-based sparse model method. For the estimated 2D coordinates, the average reconstruction error over video decreased by 12.8% compared with the existing deep model method.
Conclusion
The proposed combination of the temporal-information-based MLSTM denoising encoder-decoder with the sparse model effectively exploits the prior knowledge of 3D poses and the temporal and spatial dependence of continuously changing human poses across video frames, and improves the accuracy of 3D pose estimation from monocular video to a certain extent.
Objective
3D human pose estimation from monocular videos has long been an open research problem in the computer vision and graphics community. Understanding human posture and limb articulation is important for high-level computer vision tasks such as human-computer interaction, augmented and virtual reality, and human action or activity recognition. The recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end to end for direct prediction from images. The top-performing approaches have shown the effectiveness of dividing the task of 3D pose estimation into two steps: using a state-of-the-art 2D pose estimator to estimate 2D poses from images and then mapping them into 3D space. Results indicate that a large portion of the error of modern deep 3D pose estimation systems stems from 2D pose estimation error. Therefore, mapping a 2D pose containing error or noise into its optimal and most reasonable 3D pose is crucial. We propose a 3D pose estimation method that jointly uses a sparse representation and a deep model. Through this method, we combine the spatial geometric prior of 3D poses with temporal information to improve 3D pose estimation accuracy.
Method
First, we use a 3D deformable shape model that integrates sparse representation (SR) to represent rich variations of 3D human posture. A convex relaxation method based on L1/2 regularization transforms the nonconvex optimization problem of the shape-space model for a single-frame image into a convex programming problem and provides a reasonable initial value for each frame. In this manner, the possibility of ambiguous reconstructions is considerably reduced.
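As a rough sketch of this shape-space formulation (the notation and objective below are adapted from standard sparse shape models and are our assumption, not quoted from the paper), the 3D pose is written as a sparse combination of basis shapes, and the coefficients are estimated from the 2D observations with an L1/2 sparsity penalty:

% W: observed 2D joints; \Pi: orthographic projection; R: camera rotation;
% B_i: learned 3D basis shapes; c_i: sparse coefficients; \lambda: assumed weight.
\min_{c,\,R}\; \Bigl\lVert W - \Pi R \sum_{i=1}^{k} c_i B_i \Bigr\rVert_F^2 + \lambda \sum_{i=1}^{k} \lvert c_i \rvert^{1/2}

The convex relaxation replaces the nonconvex parts of such an objective with convex surrogates so that the single-frame problem can be solved reliably.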
Second, the initial 3D poses obtained from the SR module, regarded as 3D data corrupted by noise, are fed into a multi-channel long short-term memory (MLSTM) denoising encoder-decoder as pose sequences along the temporal dimension. The noisy 3D data are split into three components, X, Y, and Z, to preserve the spatial structure of the 3D pose. For each component, multilayer LSTM cells are used to capture the temporal variation across frames. The output of an LSTM unit is not the optimization result for the corresponding component; rather, the hidden layers of the LSTM implicitly encode the temporal dependence between adjacent frames of the human pose in the input sequence. The learned temporal information is added to the initial value through a residual connection, which maintains the temporal consistency of the 3D pose and effectively alleviates sequence jitter. Moreover, occluded joints can be corrected by the smoothness constraint between adjacent frames. Lastly, we obtain the optimized 3D pose estimates by decoding through the final linear layer.
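A minimal PyTorch sketch of such a multi-channel LSTM denoising encoder-decoder is given below; the joint count, layer sizes, and module structure are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn

class MLSTMDenoiser(nn.Module):
    """Hypothetical sketch of a multi-channel LSTM denoising encoder-decoder.
    Each of the X, Y, Z coordinate channels of a noisy 3D pose sequence is
    processed by its own multilayer LSTM; a linear layer decodes a correction
    that is added back to the input via a residual connection."""

    def __init__(self, num_joints=17, hidden_size=256, num_layers=2):
        super().__init__()
        # One multilayer LSTM per coordinate channel (X, Y, Z).
        self.channel_lstms = nn.ModuleList([
            nn.LSTM(input_size=num_joints, hidden_size=hidden_size,
                    num_layers=num_layers, batch_first=True)
            for _ in range(3)
        ])
        # Linear decoders map the hidden state back to per-joint corrections.
        self.decoders = nn.ModuleList([
            nn.Linear(hidden_size, num_joints) for _ in range(3)
        ])

    def forward(self, noisy_poses):
        # noisy_poses: (batch, seq_len, num_joints, 3) initial 3D estimates.
        refined = []
        for c in range(3):
            channel = noisy_poses[..., c]          # (batch, seq_len, num_joints)
            hidden, _ = self.channel_lstms[c](channel)
            correction = self.decoders[c](hidden)  # temporal correction
            refined.append(channel + correction)   # residual connection
        return torch.stack(refined, dim=-1)        # (batch, seq_len, num_joints, 3)

# Usage sketch: refine a 16-frame sequence of noisy single-frame estimates.
model = MLSTMDenoiser()
noisy = torch.randn(1, 16, 17, 3)
smooth = model(noisy)
print(smooth.shape)  # torch.Size([1, 16, 17, 3])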
Result
A comparative experiment is conducted on the Human3.6M dataset to verify the validity of the proposed method, and the results are compared with state-of-the-art methods. For quantitative evaluation, we follow the common protocol of aligning the predicted 3D pose with the ground-truth 3D pose through a similarity transformation and report the average per-joint error in millimeters between the estimated and ground-truth 3D poses. 2D joint ground truth and 2D pose estimates from a convolutional network are used separately as inputs. The quantitative results show that the proposed method remarkably improves 3D estimation accuracy. When the input is the 2D joint ground truth provided by the Human3.6M dataset, the average reconstruction error decreases by 12.6% after optimization by our model compared with individual-frame estimation, and by 6.4% compared with the existing video-based sparse model method. When the input consists of 2D pose estimates from a convolutional network, the average reconstruction error decreases by 13% after optimization by our model compared with single-frame estimation, by 12.8% compared with the existing deep model method, and by 9.1% compared with the existing video-based sparse model method.
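For reference, the evaluation protocol described above (mean per-joint position error in millimeters after aligning the prediction to the ground truth with a similarity transformation) can be sketched as the standard Procrustes-aligned error below; this is a generic implementation, not code released with the paper.

import numpy as np

def aligned_mpjpe(pred, gt):
    """Mean per-joint position error after similarity alignment.

    pred, gt: (J, 3) arrays of predicted and ground-truth 3D joints (mm).
    The prediction is aligned to the ground truth with the optimal scale,
    rotation, and translation before averaging the per-joint errors."""
    # Center both point sets.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Optimal rotation via SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:        # avoid reflections
        vt[-1, :] *= -1
        s[-1] *= -1
        r = vt.T @ u.T

    # Optimal scale and translation, then the aligned prediction.
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ r.T + mu_g

    return np.linalg.norm(aligned - gt, axis=1).mean()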
Conclusion
By combining our MLSTM denoising encoder-decoder based on temporal information with the sparse model, we adequately exploit the 3D pose prior knowledge and the temporal and spatial dependence of continuously changing human poses, achieving a remarkable improvement in the accuracy of 3D pose estimation from monocular video.