Dual auto-encoder network for human skeleton motion data optimization

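Li Shujie, Zhu Haisheng, Wang Lei, Liu Xiaoping (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)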

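Abstract
Objective To optimize skeleton motion data in 3D-coordinate form that contain mixed noise, we propose an optimization network composed of a bidirectional recurrent auto-encoder and a convolutional auto-encoder connected in series. The bidirectional recurrent auto-encoder gives the optimized output higher position accuracy, and the convolutional auto-encoder gives it better smoothness. Method First, a perceptual auto-encoder is pre-trained on a high-precision motion capture database. The dual auto-encoder is then trained on paired "noisy-clean" data, with a hidden-variable constraint added during training. This constraint is returned by the pre-trained perceptual auto-encoder; it keeps the network output accurate and its skeletal structure plausible, which makes the algorithm suitable for improving the level of detail of motion data. Result Experiments were conducted on a synthetic-noise dataset and a real-noise dataset, and the method was compared with three recent deep learning methods: the convolutional auto-encoder (CAE), the bidirectional recurrent auto-encoder (BRA), and the bidirectional recurrent auto-encoder with perceptual constraint (BRA-P). On the three quantitative metrics of position error, bone-length error, and smoothness error, the proposed method improved on the three recent methods by 33.1%, 25.5%, and 12.2%, respectively, on the synthetic-noise dataset, and by 27.2%, 39.2%, and 16.8%, respectively, on the real-noise dataset. Conclusion The proposed dual auto-encoder optimization network combines the strengths of the two auto-encoders: its optimized output has higher accuracy and better smoothness, and it preserves the skeletal structure of the motion data well.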
Keywords
Dual auto-encoder network for human skeleton motion data optimization

Li Shujie, Zhu Haisheng, Wang Lei, Liu Xiaoping (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)

Abstract
Objective Human motion data are widely used in virtual reality, human-computer interaction, computer games, sports, and medical applications. Human motion capture aims to obtain highly precise human motion data. Motion capture (MoCap) systems such as Vicon and Xsens offer high-precision motion data, but they are costly and inconvenient for users to wear. Low-cost motion capture technologies, including depth-sensor-based and camera-based technologies, have been developed as alternatives for capturing human motion. However, the raw 3D skeleton motion data captured by these low-cost sensors suffer from calibration error, sensor noise, poor sensor resolution, and occlusion by body parts or clothing. Thus, the raw MoCap data should be optimized before use, i.e., missing data should be filled in and noise removed. The accuracy of human motion data optimized by a convolutional auto-encoder (CAE) depends on the noise types and noise amplitudes present in the input. Raw MoCap data, such as skeleton motion data captured by Kinect, contain mixed noise of different types and amplitudes, introduced during capture by scene changes or self-occlusion. The bidirectional recurrent auto-encoder (BRA) has therefore been used to optimize raw motion data containing heterogeneous mixed noise. The BRA output has higher position accuracy, whereas the CAE output is much smoother. Hence, we propose a dual auto-encoder optimization network named BCH, which consists of a BRA and a CAE connected in series. The BRA gives the optimized data higher position accuracy, and the CAE gives them better smoothness. Method First, a perceptual auto-encoder is pre-trained using high-precision motion capture data. The loss function for pre-training the perceptual auto-encoder consists of three terms: position loss, bone-length loss, and smooth loss.
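The three loss terms named above can be illustrated with a minimal sketch. The abstract does not give their formulas, so the definitions here are assumptions: position loss as mean squared error, bone-length loss as the mean squared difference between predicted and clean bone lengths, and smooth loss as the mean squared second-order temporal difference. The three-joint skeleton `BONES`, the function names, and the weights are hypothetical.

```python
import math

# Hypothetical toy skeleton: 3 joints connected by 2 bones (joint index pairs).
BONES = [(0, 1), (1, 2)]

def position_loss(pred, clean):
    """Mean squared error over all frames, joints, and coordinates."""
    total, count = 0.0, 0
    for frame_p, frame_c in zip(pred, clean):
        for joint_p, joint_c in zip(frame_p, frame_c):
            for a, b in zip(joint_p, joint_c):
                total += (a - b) ** 2
                count += 1
    return total / count

def bone_lengths(frame, bones=BONES):
    """Euclidean length of every bone in one frame."""
    return [math.dist(frame[i], frame[j]) for i, j in bones]

def bone_length_loss(pred, clean, bones=BONES):
    """Assumed form: mean squared gap between predicted and clean bone lengths."""
    total, count = 0.0, 0
    for frame_p, frame_c in zip(pred, clean):
        for lp, lc in zip(bone_lengths(frame_p, bones), bone_lengths(frame_c, bones)):
            total += (lp - lc) ** 2
            count += 1
    return total / count

def smooth_loss(pred):
    """Assumed form: mean squared second-order temporal difference."""
    total, count = 0.0, 0
    for t in range(1, len(pred) - 1):
        for j in range(len(pred[t])):
            for k in range(3):
                acc = pred[t - 1][j][k] - 2 * pred[t][j][k] + pred[t + 1][j][k]
                total += acc ** 2
                count += 1
    return total / count if count else 0.0

def total_loss(pred, clean, w_pos=1.0, w_bone=1.0, w_smooth=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (w_pos * position_loss(pred, clean)
            + w_bone * bone_length_loss(pred, clean)
            + w_smooth * smooth_loss(pred))
```

A motion clip is represented here as a nested list indexed by frame, joint, and coordinate. A clip that moves at constant velocity gives a smooth loss of zero, while a single-frame spike in one joint raises all three terms at once.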
Next, we train the dual auto-encoder on paired “noisy-clean” data. The perceptual auto-encoder is composed of a convolutional encoder and a convolutional decoder. The convolutional encoder consists of a convolutional layer, a max-pooling layer, and an activation layer; the convolutional decoder consists of an inverse-pooling layer and a convolutional layer. The dual auto-encoder consists of a BRA and a CAE. The network structure of the CAE is similar to that of the perceptual auto-encoder described above. The BRA has two components: a bidirectional recurrent encoder and a bidirectional recurrent decoder. The encoder consists of two fully connected layers followed by one bidirectional long short-term memory (LSTM) cell, and the structure of the decoder is symmetric with that of the encoder. Together, the encoder and the decoder recover corrupted motion data through projection and inverse projection. A hidden-unit constraint, defined using the perceptual auto-encoder, is imposed when training the dual auto-encoder. The Adam optimizer is used to minimize the loss functions of the two networks. The batch size is set to 16 and the learning rate is set to 0.00001. To avoid overfitting, we use a dropout rate of 0.2. The perceptual auto-encoder is trained for 200 epochs and the dual auto-encoder for 300 epochs. Result Experiments are conducted on a synthetic-noise dataset (the Carnegie Mellon University (CMU) Graphics Lab Motion Capture Database) and a raw motion dataset (captured synchronously by Kinect and the NOITOM MoCap system). In addition to comparisons with three deep learning methods (CAE, BRA, and BRA with perceptual constraint, called BRA-P), ablation studies verify each component of our approach on both datasets.
The quantitative evaluation metrics are position loss (mean squared error, MSE), bone-length loss, and smooth loss, and we compare motion data optimization with and without the hidden-unit constraint. The results show that, on both the synthetic-noise dataset and the raw motion dataset, our network outperforms the three existing deep learning networks in terms of position loss, bone-length loss, and smooth loss. The ablation studies on the two datasets show that both the dual auto-encoder and the hidden-unit constraint contribute to the refined motion data. We also analyzed the time performance of the proposed method on the raw motion test set. The time cost of BCH motion data refinement is approximately the sum of the time costs of the BRA and CAE methods, which is close to that of the BRA method alone. Conclusion We propose a dual auto-encoder optimization network with a hidden-unit constraint. Results on synthetic-noise data and raw motion data demonstrate that the proposed network and the hidden-unit constraint yield optimized data with higher position accuracy and better smoothness while maintaining the bone-length consistency of the motion data.
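The hidden-unit constraint mentioned in the Method part can be pictured with a small sketch. The abstract only says the constraint is defined using the pre-trained perceptual auto-encoder, so the form used here is an assumption: the mean squared error between the encoder's hidden units for the network output and for the clean data. `toy_encoder` is a hypothetical stand-in for the frozen perceptual encoder, and `w_hidden` is an illustrative weight.

```python
def flatten(motion):
    """Flatten a motion clip [frame][joint][xyz] into one list of floats."""
    return [c for frame in motion for joint in frame for c in joint]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def toy_encoder(motion):
    """Hypothetical stand-in for the frozen perceptual encoder.

    Averages each non-overlapping pair of frames, loosely mimicking
    temporal pooling. The real encoder is a trained convolutional network.
    """
    code = []
    for t in range(0, len(motion) - 1, 2):
        first, second = flatten([motion[t]]), flatten([motion[t + 1]])
        code.extend((x + y) / 2 for x, y in zip(first, second))
    return code

def hidden_constraint(pred, clean, encoder=toy_encoder):
    """Assumed form: MSE between hidden units of the output and the clean data."""
    return mse(encoder(pred), encoder(clean))

def training_loss(pred, clean, w_hidden=1.0, encoder=toy_encoder):
    """Reconstruction error plus the weighted hidden-unit constraint."""
    recon = mse(flatten(pred), flatten(clean))
    return recon + w_hidden * hidden_constraint(pred, clean, encoder)
```

Because the encoder is pre-trained on high-precision data and kept fixed, a penalty of this kind pushes the optimized output toward motions whose hidden representation matches that of clean motion, in addition to matching joint positions directly.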
Keywords
