Dual auto-encoder network for human skeleton motion data optimization
Vol. 27, Issue 4, Pages 1277-1289 (2022)
Published: 16 April 2022
Accepted: 10 February 2021
DOI: 10.11834/jig.200780
Shujie Li, Haisheng Zhu, Lei Wang, Xiaoping Liu. Dual auto-encoder network for human skeleton motion data optimization. [J]. Journal of Image and Graphics 27(4):1277-1289(2022)
Objective
To optimize 3D-coordinate skeleton motion data corrupted by mixed noise, we propose an optimization network built from a bidirectional recurrent auto-encoder and a convolutional auto-encoder connected in series: the bidirectional recurrent auto-encoder gives the network's optimized output higher position accuracy, and the convolutional auto-encoder gives it better smoothness.
Method
First, a perceptual auto-encoder is pre-trained on a high-precision motion-capture database. Then the dual auto-encoder is trained on "noisy-clean" data pairs, with a hidden-unit constraint added during training. The constraint is provided by the pre-trained perceptual auto-encoder; it keeps the network output highly accurate, preserves a plausible skeleton structure, and makes the algorithm suitable for raising the level of detail of motion data.
Result
Experiments are conducted on a synthetic-noise dataset and a real-noise dataset, comparing against three recent deep learning methods: the convolutional auto-encoder (CAE), the bidirectional recurrent auto-encoder (BRA), and BRA with a perceptual constraint (BRA-P). On the three quantitative metrics of position error, bone-length error, and smoothness error, our method improves on the best of the three methods by 33.1%, 25.5%, and 12.2% respectively on the synthetic-noise dataset, and by 27.2%, 39.2%, and 16.8% on the real-noise dataset.
Conclusion
The proposed dual auto-encoder optimization network combines the advantages of both auto-encoders: the optimized output has higher position accuracy and better smoothness, while the skeleton structure of the motion data is well preserved.
Objective
Human motion data are widely used in virtual reality, human-computer interaction, computer games, and sports and medical applications. Motion capture aims to obtain high-precision human motion data. Commercial motion capture (MoCap) systems such as Vicon and Xsens provide high-precision motion data, but they are costly and cumbersome for users to wear. Low-cost motion capture technologies, including depth-sensor-based and camera-based approaches, have been developed as alternatives. However, the raw 3D skeleton motion data produced by these low-cost sensors suffer from calibration error, sensor noise, poor sensor resolution, and occlusion caused by body parts or clothing. The raw MoCap data therefore need to be optimized, i.e., missing data filled in and noise removed, before use. Convolutional auto-encoder (CAE) based methods can refine motion data corrupted by various noise features and amplitudes. Raw MoCap data such as Kinect skeleton data contain mixed noise of different types and amplitudes, arising from scene changes or self-occlusion during capture, and the bidirectional recurrent auto-encoder (BRA) has been applied to raw motion data with such heterogeneous mixed noise. In practice, BRA achieves higher position accuracy, whereas CAE produces much smoother results. We therefore propose an optimized dual auto-encoder network, named BCH, which connects a BRA and a CAE in series: the BRA gives the optimized data higher position accuracy, and the CAE gives the optimized data better smoothness.
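The serial BRA-then-CAE design described above can be sketched as a simple composition. The stub functions below are illustrative placeholders under assumed shapes (frames x flattened joint coordinates), not the paper's trained networks: the BRA stand-in is an identity, and a 3-tap moving average stands in for the smoothing effect of the CAE.

```python
import numpy as np

def bra_refine(motion):
    """Placeholder for the bidirectional recurrent auto-encoder (BRA).
    In the paper it restores position accuracy; here it passes data through."""
    return motion

def cae_refine(motion):
    """Placeholder for the convolutional auto-encoder (CAE).
    In the paper it smooths the BRA output; a temporal moving average
    stands in here. motion: array of shape (T, D)."""
    kernel = np.ones(3) / 3.0
    padded = np.pad(motion, ((1, 1), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid")
         for d in range(motion.shape[1])],
        axis=1,
    )

def bch_refine(noisy_motion):
    """BCH pipeline: BRA first (position accuracy), CAE second (smoothness)."""
    return cae_refine(bra_refine(noisy_motion))
```

For example, a noisy sequence of 8 frames with 2 joints (6 flattened coordinates) would be refined with `bch_refine(np.random.randn(8, 6))`, preserving the input shape.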
Method
First, a perceptual auto-encoder is pre-trained on high-precision motion capture data. Its loss function combines three terms: position loss, bone-length loss, and smoothness loss. Next, the dual auto-encoder is trained for optimization on paired "noisy-clean" data. The perceptual auto-encoder is composed of a convolutional encoder and a convolutional decoder: the encoder stacks convolutional, max-pooling, and activation layers, while the decoder combines inverse-pooling and convolutional layers. The dual auto-encoder consists of a BRA and a CAE; the CAE has a network structure similar to the perceptual auto-encoder described above. The BRA has two components, a bidirectional recurrent encoder and a bidirectional recurrent decoder. The encoder consists of two fully connected layers followed by one bidirectional long short-term memory (LSTM) cell, and the decoder is symmetric to the encoder. Together, the encoder and decoder recover corrupted motion data through projection and inverse projection. A hidden-unit constraint, defined via the pre-trained perceptual auto-encoder, is imposed when training the dual auto-encoder. Adam stochastic gradient descent is used to minimize the loss functions of the two networks, with a batch size of 16 and a learning rate of 0.00001. To avoid overfitting, a dropout rate of 0.2 is used. The perceptual auto-encoder is trained for 200 epochs and the dual auto-encoder for 300 epochs.
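The three loss terms named above can be sketched in numpy. The exact definitions and the equal weighting below are assumptions based on common practice (the abstract only names the terms): position loss as MSE, bone-length loss as the squared difference of per-bone lengths over an assumed `(parent, child)` bone list, and smoothness loss as a second-order temporal difference (acceleration) penalty.

```python
import numpy as np

def position_loss(pred, target):
    """Mean squared error between predicted and ground-truth joint positions.
    pred, target: arrays of shape (T, J, 3)."""
    return np.mean((pred - target) ** 2)

def bone_length_loss(pred, target, bones):
    """Mean squared difference of bone lengths.
    bones: assumed list of (parent, child) joint-index pairs."""
    def lengths(x):
        # (T, num_bones) matrix of Euclidean bone lengths per frame
        return np.stack(
            [np.linalg.norm(x[:, c] - x[:, p], axis=-1) for p, c in bones],
            axis=1,
        )
    return np.mean((lengths(pred) - lengths(target)) ** 2)

def smooth_loss(pred):
    """Penalize second-order temporal differences (joint acceleration)."""
    accel = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    return np.mean(accel ** 2)

def total_loss(pred, target, bones, w_pos=1.0, w_bone=1.0, w_smooth=1.0):
    """Weighted sum of the three terms; the weights are assumptions,
    not values reported in the paper."""
    return (w_pos * position_loss(pred, target)
            + w_bone * bone_length_loss(pred, target, bones)
            + w_smooth * smooth_loss(pred))
```

Note that a uniform translation of the whole skeleton changes only the position term: bone lengths are translation-invariant, and a constant offset has zero acceleration.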
Result
Experiments are conducted on a synthetic-noise dataset (the Carnegie Mellon University (CMU) Graphics Lab Motion Capture Database) and a raw motion dataset (captured synchronously with Kinect and the NOITOM MoCap system). In addition to the comparison with three deep learning methods (CAE, BRA, and BRA with a perceptual constraint, denoted BRA-P), ablation studies verify each component of our approach on the two datasets. The quantitative evaluation metrics are position loss (mean squared error, MSE), bone-length loss, and smoothness loss, and we compare optimization with the hidden-unit constraint against optimization without it. The results show that on both the synthetic-noise dataset and the raw motion dataset, our network outperforms the three existing deep learning networks in terms of position loss, bone-length loss, and smoothness loss. The ablation studies on the two datasets confirm that both the dual auto-encoder and the hidden-unit constraint contribute to the refined motion data. We also analyze the runtime of the proposed method on the raw motion test dataset: the time taken by BCH for motion data refinement is approximately the sum of the optimization times of the BRA and CAE methods, and is close to that of the BRA method alone.
Conclusion
We propose an optimized network with a dual auto-encoder and a hidden-unit constraint. Results on synthetic noise data and raw motion data demonstrate that the proposed network and the hidden-unit constraint yield optimized data with higher position accuracy and better smoothness, while maintaining the bone-length consistency of the motion data.
Keywords: deep learning; skeleton motion data refinement; dual autoencoder; hidden-unit constraint; Kinect motion data
Aristidou A, Lasenby J, Chrysanthou Y and Shamir A. 2018. Inverse kinematics techniques in computer graphics: a survey. Computer Graphics Forum, 37(6): 35-58 [DOI: 10.1111/cgf.13310]
Bütepage J, Black M J, Kragic D and Kjellström H. 2017. Deep representation learning for human motion prediction and classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1591-1599 [DOI: 10.1109/CVPR.2017.173]
CMU. 2020. CMU graphics lab motion capture database [DB/OL]. [2020-12-01]. http://mocap.cs.cmu.edu/
Cui Q J, Chen B J and Sun H J. 2019. Nonlocal low-rank regularization for human motion recovery based on similarity analysis. Information Sciences, 493: 57-74 [DOI: 10.1016/j.ins.2019.04.031]
Feng Y F, Xiao J, Zhuang Y T, Yang X S, Zhang J J and Song R. 2014. Exploiting temporal stability and low-rank structure for motion capture data refinement. Information Sciences, 277: 777-793 [DOI: 10.1016/j.ins.2014.03.013]
Fieraru M, Khoreva A, Pishchulin L and Schiele B. 2018. Learning to refine human pose estimation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, USA: IEEE: 318-327 [DOI: 10.1109/CVPRW.2018.00058]
Günter S, Schraudolph N N and Vishwanathan S V N. 2007. Fast iterative kernel principal component analysis. The Journal of Machine Learning Research, 8: 1893-1918 [DOI: 10.1007/s10846-007-9145-x]
Holden D, Saito J and Komura T. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics, 35(4): #138 [DOI: 10.1145/2897824.2925975]
Holden D, Saito J, Komura T and Joyce T. 2015. Learning motion manifolds with convolutional autoencoders//Proceedings of the SIGGRAPH Asia 2015 Technical Briefs. Kobe, Japan: ACM: #18 [DOI: 10.1145/2820903.2820918]
Holden D. 2018. Robust solving of optical motion capture data by denoising. ACM Transactions on Graphics, 37(4): #165 [DOI: 10.1145/3197517.3201302]
Hsieh C C and Kuo P L. 2008. An impulsive noise reduction agent for rigid body motion data using B-spline wavelets. Expert Systems with Applications, 34(3): 1733-1741 [DOI: 10.1016/j.eswa.2007.01.030]
Huang Y H, Kaufmann M, Aksan E, Black M J, Hilliges O and Pons-Moll G. 2018. Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics, 37(6): #185 [DOI: 10.1145/3272127.3275108]
Li S J, Zhou Y, Zhu H S, Xie W J, Zhao Y and Liu X P. 2019. Bidirectional recurrent autoencoder for 3D skeleton motion data refinement. Computers and Graphics, 81: 92-103 [DOI: 10.1016/j.cag.2019.03.010]
Li S J, Zhu H S, Zheng L P and Li L. 2020. A perceptual-based noise-agnostic 3D skeleton motion data refinement network. IEEE Access, 8: 52927-52940 [DOI: 10.1109/ACCESS.2020.2980316]
Liu G D and McMillan L. 2006. Estimation of missing markers in human motion capture. The Visual Computer, 22(9/11): 721-728 [DOI: 10.1007/s00371-006-0080-9]
Liu Z G, Zhou L Y, Leung H and Shum H P H. 2016. Kinect posture reconstruction based on a local mixture of gaussian process models. IEEE Transactions on Visualization and Computer Graphics, 22(11): 2437-2450 [DOI: 10.1109/TVCG.2015.2510000]
Lou H and Chai J X. 2010. Example-based human motion denoising. IEEE Transactions on Visualization and Computer Graphics, 16(5): 870-879 [DOI: 10.1109/TVCG.2010.23]
Mall U, Lal G R, Chaudhuri S and Chaudhuri P. 2017. A deep recurrent framework for cleaning motion capture data [EB/OL]. [2020-12-01]. https://arxiv.org/pdf/1712.03380.pdf
Moon G, Chang J Y and Lee K M. 2019. PoseFix: model-agnostic general human pose refinement network//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7765-7773 [DOI: 10.1109/CVPR.2019.00796]
Tangkuampien T and Suter D. 2006. Human motion de-noising via greedy kernel principal component analysis filtering//Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06). Hong Kong, China: IEEE: 457-460 [DOI: 10.1109/ICPR.2006.639]
Xia G Y, Sun H J, Zhang G Q and Feng L. 2016. Human motion recovery jointly utilizing statistical and kinematic information. Information Sciences, 339: 189-205 [DOI: 10.1016/j.ins.2015.12.041]
Xiao J, Feng Y F and Hu W Y. 2011. Predicting missing markers in human motion capture using ℓ1-sparse representation. Computer Animation and Virtual Worlds, 22(2/3): 221-228 [DOI: 10.1002/cav.413]
Xiao J, Feng Y F, Ji M M, Yang X S, Zhang J J and Zhuang Y T. 2015. Sparse motion bases selection for human motion denoising. Signal Processing, 110: 108-122 [DOI: 10.1016/j.sigpro.2014.08.017]
Yang J Y, Guo X, Li K, Wang M Y, Lai Y K and Wu F. 2020. Spatio-temporal reconstruction for 3D motion recovery. IEEE Transactions on Circuits and Systems for Video Technology, 30(6): 1583-1596 [DOI: 10.1109/TCSVT.2019.2907324]