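Objective Errors in 2D pose estimation are the main source of error in 3D human pose estimation. The key to improving 3D human pose estimation is therefore to map a 2D pose corrupted by error or noise to the optimal, most plausible 3D pose. This paper proposes a 3D pose estimation method that jointly uses sparse representation and a deep model, combining the geometric prior of the 3D pose space with temporal information to improve the accuracy of 3D pose estimation. Method First, a 3D deformable shape model incorporating sparse representation is used to obtain a reliable initial 3D estimate for each single frame. Then, an MLSTM denoising encoder/decoder is constructed; the single-frame initial 3D estimates are fed into it as a time series, and the MLSTM denoising encoder/decoder learns the temporal dependencies of the human pose between adjacent frames and imposes a temporal smoothness constraint to produce the final optimized 3D pose. Result Comparative experiments were conducted on the Human3.6M dataset. For the two kinds of input data, the 2D coordinates provided by the dataset and the 2D coordinates estimated by a convolutional neural network, the average reconstruction error over video sequences after optimization by the MLSTM denoising encoder/decoder decreased by 12.6% and 13%, respectively, relative to single-frame estimation. Relative to an existing video-based sparse model method, the average reconstruction error of the proposed method on video decreased by 6.4% and 9.1%, respectively. For the estimated 2D coordinates, relative to an existing deep model method, the average reconstruction error of the proposed method on video decreased by 12.8%. Conclusion The proposed temporal-information-based MLSTM denoising encoder/decoder, combined with the sparse model, makes effective use of prior knowledge of 3D poses and of the temporal and spatial dependencies in the continuous change of human pose between video frames, improving the accuracy of 3D pose estimation from monocular video.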
Spatial-temporal model for 3D human pose estimation via sparseness and deepness
Weinan Wang, Rong Zhang, Lijun Guo (Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo)
Objective Estimating 3D human pose from monocular video has long been an open research problem in the computer vision and graphics communities. Understanding human posture and limb articulation is important for high-level computer vision tasks such as human-computer interaction, augmented and virtual reality, and human action or activity recognition. The recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict 3D poses directly from images. The top-performing approaches have shown the effectiveness of dividing the task into two steps: using a state-of-the-art 2D pose estimator to estimate 2D poses from images, and then mapping them into 3D space. These results indicate that a large portion of the error of modern deep 3D pose estimation systems stems from 2D pose estimation error. It is therefore crucial for 3D human pose estimation to map a 2D pose containing error or noise to the optimal and most plausible 3D pose. We propose a 3D pose estimation method that jointly uses a sparse representation and a deep model, combining the spatial geometric prior of 3D poses with temporal information to improve 3D pose estimation accuracy. Method First, we employ a 3D deformable shape model that integrates sparse representation (SR) to represent rich variations in 3D human posture. A regularization-based convex relaxation transforms the non-convex single-frame optimization problem in the shape-space model into a convex programming problem and provides a reasonable initial value for each frame, which significantly reduces the possibility of ambiguous reconstructions. Second, the initial 3D poses obtained from the SR module are treated as noisy 3D data and fed, as pose sequences along the temporal dimension, into a multi-channel Long Short-Term Memory (MLSTM) denoising encoder-decoder. 
To preserve the spatial structure of the 3D pose, the noisy 3D data are split into their X, Y, and Z components. For each component, multi-layer LSTM cells capture the temporal variation across frames. The output of the LSTM units is not the optimized result for the corresponding component, but the temporal dependence between adjacent frames of the input pose sequence, implicitly encoded by the hidden layers of the LSTM units. Through a residual connection, the learned temporal information is added to the initial value, which maintains the temporal consistency of the 3D pose and effectively alleviates sequence jitter. Moreover, occluded joints can be corrected by the smoothness constraint between adjacent frames. Finally, the optimized 3D pose estimate is obtained by decoding with the final linear layer. Result To verify the validity of the proposed method, we conducted comparative experiments on the Human3.6M dataset against state-of-the-art methods. Following common practice, the predicted 3D pose is aligned with the ground-truth 3D pose by a similarity transformation, and we report the average per-joint error in millimeters between the estimated and the ground-truth 3D pose. Ground-truth 2D joints and 2D poses estimated by a convolutional network are used as inputs separately. The quantitative results suggest that the proposed method greatly improves 3D estimation accuracy. When the input is the ground-truth 2D joints provided by the Human3.6M dataset, the average reconstruction error decreases by 12.6% after optimization by our model, compared with single-frame estimation. Compared with an existing video-based sparse model method, the average reconstruction error of our method decreases by 6.4%. 
When the input is 2D poses estimated by a convolutional network, the average reconstruction error decreases by 13% after optimization by our model, compared with single-frame estimation. Compared with an existing deep model method, the average reconstruction error of our method decreases by 12.8%. Compared with an existing video-based sparse model method, the average reconstruction error of our method decreases by 9.1%. Conclusion By combining our temporal-information-based MLSTM encoder-decoder with the sparse model, we exploit prior knowledge of 3D poses together with the temporal and spatial dependence of continuous human pose changes, and improve the accuracy of 3D pose estimation from monocular video.
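
The evaluation protocol described under Result (align the predicted pose to the ground truth with a similarity transformation, then average the per-joint error) can be illustrated with a short NumPy sketch. This is not the authors' evaluation code: the function names `similarity_align`, `mpjpe`, and `pa_mpjpe` are hypothetical, and the alignment shown is the standard least-squares (Procrustes/Umeyama) solution, which is assumed here to be what the paper's similarity transformation refers to.

```python
import numpy as np


def similarity_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with a least-squares similarity transform.

    Solves for scale, rotation, and translation minimising the summed squared
    joint distances (Procrustes/Umeyama). Assumed, not taken from the paper.
    """
    mu_pred = pred.mean(axis=0)
    mu_gt = gt.mean(axis=0)
    x = pred - mu_pred  # centred prediction
    y = gt - mu_gt      # centred ground truth

    # Cross-covariance of the centred point sets and its SVD.
    u, s, vt = np.linalg.svd(x.T @ y)

    # Guard against reflections so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    diag = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(diag) @ u.T

    # Optimal isotropic scale for the chosen rotation.
    scale = (s * diag).sum() / (x ** 2).sum()

    # Joints are stored as rows, so the rotation is applied as x @ rot.T.
    return scale * (x @ rot.T) + mu_gt


def mpjpe(pred, gt):
    """Mean per-joint Euclidean error, in the units of the inputs (e.g. mm)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())


def pa_mpjpe(pred, gt):
    """Mean per-joint error after similarity alignment of pred to gt."""
    return mpjpe(similarity_align(pred, gt), gt)
```

For a video sequence, `pa_mpjpe` would be evaluated frame by frame and averaged; a relative decrease such as those reported above is then `(e_baseline - e_method) / e_baseline`.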