Human pose estimation based on cross-stage structure
2019, Vol. 24, No. 10, Pages 1692-1702
Received: 2019-01-30
Revised: 2019-05-05
Accepted: 2019-07-04
Published in print: 2019-10-16
DOI: 10.11834/jig.190028
Objective
Image-based human pose estimation is an important research topic in computer vision and is widely applied in human-computer interaction, surveillance, and image retrieval. However, the diversity of human visual appearance, occlusion, and cluttered backgrounds make human pose estimation a long-standing difficulty and a hot topic in the field. This paper focuses on the role of initial features in joint point localization and proposes cross-stage convolutional pose machines (CSCPM).
Method
First, a VGG (Visual Geometry Group) network is used to obtain the initial features of the image. These initial features are the basis for joint point localization, but they are also difficult to learn because of self-occlusion and cluttered backgrounds. Second, on top of the initial features, a multi-stage model is built to learn structural features at different scales; to alleviate the vanishing-gradient problem in deep learning, the initial features are concatenated into the features of every subsequent stage. Finally, a joint loss for multi-scale joint point localization is designed to learn the parameters of the deep network.
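The multi-scale joint loss described here follows the intermediate-supervision idea of convolutional pose machines; a plausible form (written in our own notation, not taken verbatim from the paper) sums a Euclidean belief-map loss over all stages and joints:

$$
L=\sum_{t=1}^{T}\sum_{p=1}^{P}\sum_{z\in\mathcal{Z}}\left\|b_{t}^{p}(z)-b_{*}^{p}(z)\right\|_{2}^{2},
$$

where $b_{t}^{p}$ is the belief map predicted for joint $p$ at stage $t$, $b_{*}^{p}$ is the ideal belief map built from the ground-truth location, and $\mathcal{Z}$ is the set of image positions.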
Result
Experiments on the two major human pose datasets, MPII (MPII Human Pose Dataset) and LSP (Leeds Sports Pose), compare the proposed model qualitatively and quantitatively with human pose estimation methods from the past three years. On the MPII dataset, the overall detection rate of the model is 89.1%, 0.7% higher than that of the second-best model; on the LSP dataset, the overall detection rate is 91.0%, 0.5% higher than that of the second-best model.
Conclusion
The experimental results show that initial feature learning can effectively judge self-occlusion and cluttered-background interference at the joint points, and that the CSCPM pose estimation model with its cross-stage structure outperforms existing human pose estimation models.
Objective
The rapid development of modern networks and computer technology has led people to move gradually toward an information-driven and intelligent era. In human pose estimation, advanced semantic interpretation and judgment results are obtained by processing, analyzing, and comprehending an input image or image sequence with a computer. Human pose estimation has a wide range of applications and development prospects in human-computer interaction, surveillance, image retrieval, motion analysis, virtual reality, perceptual interfaces, and other areas. Thus, image-based human pose estimation is an extremely important research topic in computer vision. However, human pose estimation has always been a difficult and active topic because of the diversity of human visual appearance, occlusion, and complex backgrounds. In this paper, we consider the problem of human pose estimation from a single still image. Traditional 2D human pose estimation algorithms are based on pictorial structures (PS) models, and they are hard to apply directly: PS-based methods need to detect individual human parts in images, but in the real world, detecting a single limb of the human body is very difficult because of background noise and the wide variety of human appearance. In recent years, the development of deep learning has led to new methods for human pose estimation. Compared with traditional algorithms, deep models have deeper hierarchies and the ability to learn more complex patterns. In this work, we mainly focus on the effect of initial features on joint point localization and propose cross-stage convolutional pose machines (CSCPM).
Method
First, the VGG network is used to obtain the preliminary initial features of the image, which are the basis of joint point localization. The VGG network inherits the frameworks of LeNet and AlexNet and adopts a 19-layer deep architecture; it is a preferred network for extracting convolutional neural network (CNN) features from images. The initial features retain more of the original information because the VGG network processes the image directly. Learning the parameters of the deep convolutional network is difficult because of the interference of self-occlusion and cluttered backgrounds. Second, on the basis of the initial features, a multi-stage model is constructed to learn structural features at different scales. The multi-stage model consists of a sequence of convolutional networks that repeatedly produce 2D belief maps for the location of each part. The initial features are concatenated into the features of each subsequent stage to solve the vanishing-gradient problem in initial feature learning. The network is divided into six stages: the first and second stages take the original image as input, and the third to sixth stages take the feature maps produced by the second stage as input. Finally, a joint loss function for multi-scale joint localization is designed to learn the parameters of the deep convolutional network. Each stage of the cross-stage convolutional pose machines (CSCPM) enforces supervision at intermediate points of the network. Intermediate supervision has the advantage that, even though the full architecture can have many layers, it does not fall prey to the vanishing-gradient problem because the intermediate loss functions replenish the gradients at each stage. We encourage the network to repeatedly arrive at such a representation by defining a loss function at the output of each stage that minimizes the Euclidean distance between the predicted and ideal belief maps for each part.
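The following is a minimal sketch (not the authors' released code) of the cross-stage idea described above, written in PyTorch-style Python: initial features from a VGG-like backbone are computed once and then re-concatenated with the belief maps of every later stage, and every stage output is supervised. The backbone depth, channel widths, kernel sizes, and the names `Stage`, `CSCPMSketch`, and `multi_stage_loss` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage(nn.Module):
    """One refinement stage; its input is [initial features ++ previous belief maps]."""
    def __init__(self, in_ch, n_joints):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_joints, kernel_size=1),   # one belief map per joint
        )

    def forward(self, x):
        return self.net(x)

class CSCPMSketch(nn.Module):
    def __init__(self, n_joints=16, n_stages=6, feat_ch=128):
        super().__init__()
        # Stand-in for the VGG-based initial feature extractor described in the text.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.stage1 = Stage(feat_ch, n_joints)
        # Every later stage sees the initial features concatenated with the
        # previous stage's belief maps (the cross-stage connection).
        self.later_stages = nn.ModuleList(
            Stage(feat_ch + n_joints, n_joints) for _ in range(n_stages - 1)
        )

    def forward(self, img):
        feats = self.backbone(img)              # initial features, computed once
        beliefs = [self.stage1(feats)]
        for stage in self.later_stages:
            x = torch.cat([feats, beliefs[-1]], dim=1)
            beliefs.append(stage(x))
        return beliefs                          # belief maps from every stage

def multi_stage_loss(beliefs, target):
    """Intermediate supervision: Euclidean (MSE) loss on every stage's output."""
    return sum(F.mse_loss(b, target, reduction='sum') for b in beliefs)
```

Under these assumptions, a training step would call `multi_stage_loss(model(images), target_belief_maps)` and back-propagate, so the gradient is replenished at every stage as described above.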
Result
We evaluate the proposed method on two widely used benchmarks, namely, the MPII (MPII Human Pose) dataset and the extended LSP (Leeds Sports Pose) dataset, and compare it with other human pose estimation methods from the past three years in terms of qualitative and quantitative analyses. In the experiments, the percentage of correct keypoints (PCK) measure is used to evaluate performance: a keypoint location is considered correct if its distance to the ground-truth location is no more than a certain threshold expressed as a fraction of the length of a body part. The official benchmark on the MPII dataset adopts PCKh (using a fraction of the head-segment length as reference) at 0.5, whereas the official benchmark on the LSP dataset adopts PCK at 0.2. On the MPII dataset, the total detection rate of the model is 89.1%, which is 0.7 percentage points higher than that of the model with the second-highest performance. On the LSP dataset, the total detection rate is 91.0%, which is 0.5 percentage points higher than that of the model with the second-highest performance. The qualitative results also show the benefits of the cross-stage structure: detection improves in difficult scenes, such as occlusion and complex backgrounds, because the concatenated initial features retain the original information.
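As a concrete illustration of the metric (a small sketch, not the official MPII/LSP evaluation script), PCK can be computed as below; the function name, array shapes, and variable names are assumptions.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5):
    """Percentage of correct keypoints.

    pred, gt : (N, K, 2) predicted and ground-truth joint coordinates.
    ref_len  : (N,) per-image reference length (head-segment length for PCKh@0.5
               on MPII; the LSP benchmark uses PCK@0.2 with its own reference).
    alpha    : threshold expressed as a fraction of the reference length.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)      # (N, K) pixel distances
    correct = dist <= alpha * ref_len[:, None]     # per-image threshold
    return correct.mean(axis=0)                    # per-joint detection rate

# per_joint = pck(pred_joints, gt_joints, head_lengths, alpha=0.5)  # PCKh@0.5
# total = per_joint.mean()                                          # total detection rate
```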
Conclusion
The human pose estimation model CSCPM is designed to address the failure cases of convolutional pose machines (CPM) in complex scenes, such as self-occlusion, cluttered backgrounds, and joints of nearby people. The model provides a sequential prediction framework for human pose estimation that introduces a cross-stage structure on top of the CPM model. The experimental results show that the proposed model improves the accuracy of human pose estimation and locates joint points more precisely. The effectiveness of the proposed initial feature learning and the benefit of the cross-stage structure are evaluated on two widely used human pose estimation benchmarks, and our approach achieves state-of-the-art performance on both. Initial feature learning can effectively judge self-occlusion and cluttered-background interference at the joint points, and the CSCPM, a human pose estimation model with a cross-stage structure, is superior to existing human pose estimation models.