Human pose estimation based on cross-stage structure
2019, Vol. 24, No. 10, Pages 1692-1702
Received: 2019-01-30
Revised: 2019-05-05
Accepted: 2019-07-04
Published in print: 2019-10-16
DOI: 10.11834/jig.190028
Objective
Image-based human pose estimation is an important research topic in computer vision and is widely applied in human-computer interaction, surveillance, and image retrieval. However, the diversity of human visual appearance, occlusion, and cluttered backgrounds make human pose estimation a long-standing difficulty and a hot topic in the field. This paper focuses on the role of initial features in joint point localization and proposes cross-stage convolutional pose machines (CSCPM).
Method
First, a VGG (Visual Geometry Group) network is used to obtain the initial features of the image. These initial features are the basis for joint point localization, but they are also difficult to learn because of self-occlusion and cluttered backgrounds. Second, on top of the initial features, a multi-stage model is built to learn structural features at different scales; to alleviate the vanishing-gradient problem in deep learning, the initial features are concatenated into the features of every subsequent stage. Finally, a joint loss for multi-scale joint point localization is designed to learn the parameters of the deep network.
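The multi-scale joint loss described here follows the intermediate-supervision idea of convolutional pose machines; a plausible form (written in our own notation, not taken verbatim from the paper) sums a Euclidean belief-map loss over all stages and joints:

$$
L=\sum_{t=1}^{T}\sum_{p=1}^{P}\sum_{z\in\mathcal{Z}}\left\|b_{t}^{p}(z)-b_{*}^{p}(z)\right\|_{2}^{2},
$$

where $b_{t}^{p}$ is the belief map predicted for joint $p$ at stage $t$, $b_{*}^{p}$ is the ideal belief map built from the ground-truth location, and $\mathcal{Z}$ is the set of image positions.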
Result
Experiments on the two major human pose datasets, MPII (MPII Human Pose Dataset) and LSP (Leeds Sports Pose), compare the proposed model qualitatively and quantitatively with human pose estimation methods from the past three years. On the MPII dataset, the overall detection rate of the model is 89.1%, 0.7% higher than that of the second-best model; on the LSP dataset, the overall detection rate is 91.0%, 0.5% higher than that of the second-best model.
Conclusion
The experimental results show that initial feature learning can effectively judge self-occlusion and cluttered-background interference at the joint points, and that the CSCPM pose estimation model with its cross-stage structure outperforms existing human pose estimation models.
Objective
The rapid development of modern networks and computer technology has led people to move gradually toward an information-driven and intelligent era. In human pose estimation, advanced semantic interpretation and judgment results are obtained by processing, analyzing, and comprehending an input image or image sequence with a computer. Human pose estimation has a wide range of applications and development prospects in human-computer interaction, surveillance, image retrieval, motion analysis, virtual reality, perceptual interfaces, and other areas. Thus, image-based human pose estimation is an extremely important research topic in computer vision. However, human pose estimation has always been a difficult and active topic because of the diversity of human visual appearance, occlusion, and complex backgrounds. In this paper, we consider the problem of human pose estimation from a single still image. Traditional 2D human pose estimation algorithms are based on pictorial structures (PS) models, and they are hard to apply directly: PS-based methods need to detect individual human parts in images, but in the real world, detecting a single limb of the human body is very difficult because of background noise and the wide variety of human appearance. In recent years, the development of deep learning has led to new methods for human pose estimation. Compared with traditional algorithms, deep models have deeper hierarchies and the ability to learn more complex patterns. In this work, we mainly focus on the effect of initial features on joint point localization and propose cross-stage convolutional pose machines (CSCPM).
Method
First, the VGG network is used to obtain the preliminary initial features of the image, which are the basis of joint point localization. The VGG network inherits the frameworks of LeNet and AlexNet and adopts a 19-layer deep architecture; it is a preferred network for extracting convolutional neural network (CNN) features from images. The initial features retain more of the original information because the VGG network processes the image directly. Learning the parameters of the deep convolutional network is difficult because of the interference of self-occlusion and cluttered backgrounds. Second, on the basis of the initial features, a multi-stage model is constructed to learn structural features at different scales. The multi-stage model consists of a sequence of convolutional networks that repeatedly produce 2D belief maps for the location of each part. The initial features are concatenated into the features of each subsequent stage to solve the vanishing-gradient problem in initial feature learning. The network is divided into six stages: the first and second stages take the original image as input, and the third to sixth stages take the feature maps produced by the second stage as input. Finally, a joint loss function for multi-scale joint localization is designed to learn the parameters of the deep convolutional network. Each stage of the cross-stage convolutional pose machines (CSCPM) enforces supervision at intermediate points of the network. Intermediate supervision has the advantage that, even though the full architecture can have many layers, it does not fall prey to the vanishing-gradient problem because the intermediate loss functions replenish the gradients at each stage. We encourage the network to repeatedly arrive at such a representation by defining a loss function at the output of each stage that minimizes the Euclidean distance between the predicted and ideal belief maps for each part.
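The following is a minimal sketch (not the authors' released code) of the cross-stage idea described above, written in PyTorch-style Python: initial features from a VGG-like backbone are computed once and then re-concatenated with the belief maps of every later stage, and every stage output is supervised. The backbone depth, channel widths, kernel sizes, and the names `Stage`, `CSCPMSketch`, and `multi_stage_loss` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage(nn.Module):
    """One refinement stage; its input is [initial features ++ previous belief maps]."""
    def __init__(self, in_ch, n_joints):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_joints, kernel_size=1),   # one belief map per joint
        )

    def forward(self, x):
        return self.net(x)

class CSCPMSketch(nn.Module):
    def __init__(self, n_joints=16, n_stages=6, feat_ch=128):
        super().__init__()
        # Stand-in for the VGG-based initial feature extractor described in the text.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.stage1 = Stage(feat_ch, n_joints)
        # Every later stage sees the initial features concatenated with the
        # previous stage's belief maps (the cross-stage connection).
        self.later_stages = nn.ModuleList(
            Stage(feat_ch + n_joints, n_joints) for _ in range(n_stages - 1)
        )

    def forward(self, img):
        feats = self.backbone(img)              # initial features, computed once
        beliefs = [self.stage1(feats)]
        for stage in self.later_stages:
            x = torch.cat([feats, beliefs[-1]], dim=1)
            beliefs.append(stage(x))
        return beliefs                          # belief maps from every stage

def multi_stage_loss(beliefs, target):
    """Intermediate supervision: Euclidean (MSE) loss on every stage's output."""
    return sum(F.mse_loss(b, target, reduction='sum') for b in beliefs)
```

Under these assumptions, a training step would call `multi_stage_loss(model(images), target_belief_maps)` and back-propagate, so the gradient is replenished at every stage as described above.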
Result
We evaluate the proposed method on two widely used benchmarks, namely, the MPII (MPII Human Pose) dataset and the extended LSP (Leeds Sports Pose) dataset, and compare it with other human pose estimation methods from the past three years in terms of qualitative and quantitative analyses. In the experiments, the percentage of correct keypoints (PCK) measure is used to evaluate performance: a keypoint location is considered correct if its distance to the ground-truth location is no more than a certain threshold expressed as a fraction of the length of a body part. The official benchmark on the MPII dataset adopts PCKh (using a fraction of the head-segment length as reference) at 0.5, whereas the official benchmark on the LSP dataset adopts PCK at 0.2. On the MPII dataset, the total detection rate of the model is 89.1%, which is 0.7 percentage points higher than that of the model with the second-highest performance. On the LSP dataset, the total detection rate is 91.0%, which is 0.5 percentage points higher than that of the model with the second-highest performance. The qualitative results also show the benefits of the cross-stage structure: detection improves in difficult scenes, such as occlusion and complex backgrounds, because the concatenated initial features retain the original information.
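As a concrete illustration of the metric (a small sketch, not the official MPII/LSP evaluation script), PCK can be computed as below; the function name, array shapes, and variable names are assumptions.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5):
    """Percentage of correct keypoints.

    pred, gt : (N, K, 2) predicted and ground-truth joint coordinates.
    ref_len  : (N,) per-image reference length (head-segment length for PCKh@0.5
               on MPII; the LSP benchmark uses PCK@0.2 with its own reference).
    alpha    : threshold expressed as a fraction of the reference length.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)      # (N, K) pixel distances
    correct = dist <= alpha * ref_len[:, None]     # per-image threshold
    return correct.mean(axis=0)                    # per-joint detection rate

# per_joint = pck(pred_joints, gt_joints, head_lengths, alpha=0.5)  # PCKh@0.5
# total = per_joint.mean()                                          # total detection rate
```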
Conclusion
The human pose estimation model CSCPM is designed to address the failure cases of convolutional pose machines (CPM) in complex scenes, such as self-occlusion, cluttered backgrounds, and joints of nearby people. The model provides a sequential prediction framework for human pose estimation that introduces a cross-stage structure on top of the CPM model. The experimental results show that the proposed model improves the accuracy of human pose estimation and locates joint points more precisely. The effectiveness of the proposed initial feature learning and the benefit of the cross-stage structure are evaluated on two widely used human pose estimation benchmarks, and our approach achieves state-of-the-art performance on both. Initial feature learning can effectively judge self-occlusion and cluttered-background interference at the joint points, and the CSCPM, a human pose estimation model with a cross-stage structure, is superior to existing human pose estimation models.