Human pose tracking based on multi-feature fusion in videos
2020, Vol. 25, No. 7, Pages 1459-1472
Received: 2019-10-08; Revised: 2019-12-12; Accepted: 2019-12-19; Published in print: 2020-07-16
DOI: 10.11834/jig.190494
Objective
The tracking accuracy of existing human pose tracking algorithms still needs improvement, especially for the flexibly moving arms. To improve tracking accuracy, this paper proposes, for the first time, a human pose tracking method that combines visual spatial-temporal information with a deep learning network.
Method
During tracking, video temporal information is used to compute the motion of the human target region, and this motion information propagates the part pose models between frames. Methods based on image spatial features detect relatively fixed body parts, such as the trunk and head, well but detect arms poorly; therefore, a lightweight deep learning network is constructed and trained to generate additional candidate samples for the arm parts. The network also produces arm feature consistency probability maps, which are combined with video spatial information to compute the optimal part poses, and the parts are then recombined into a complete human pose tracking result.
Result
The proposed algorithm is validated on two challenging human pose tracking datasets, VideoPose2.0 and YouTubePose, where it achieves average arm joint tracking accuracies of 81.4% and 84.5%, respectively, a clear improvement over existing methods. Experiments on the VideoPose2.0 dataset further verify that the proposed lower-arm additional sampling and arm feature consistency algorithms effectively improve the tracking accuracy of pose joints.
Conclusion
The proposed human pose tracking method, which combines spatial-temporal information with a deep learning network, effectively improves tracking accuracy, most notably for the flexibly moving lower-arm joints.
Objective
Human pose tracking in video sequences aims to estimate the pose of a certain person in each frame using image and video cues and to track the human pose consecutively throughout the entire video. This field has been increasingly investigated because the development of artificial intelligence and the Internet of Things makes human-computer interaction frequent. Robots and intelligent agents can understand human action and intention by visually tracking human poses. At present, researchers frequently use pictorial structure models to express human poses and use inference methods for tracking. However, the tracking accuracy of current human pose tracking methods needs to be improved, especially for the flexibly moving arms. Although different types of features describe different types of information, the crux of human pose tracking lies in utilizing and combining the appropriate features. We investigate the construction of effective features to accurately describe the poses of different body parts and propose a method that combines video spatial and temporal features with deep learning features to improve the accuracy of human pose tracking. This paper presents a novel human pose tracking method that effectively uses various kinds of video information to optimize human pose tracking in video sequences.
Method
An evaluation criterion is needed to track a visual target. Human pose is an articulated, complex visual target, and evaluating it as a whole leads to ambiguity. Therefore, this paper proposes a decomposable human pose expression model that tracks each body part separately through the video and recombines the parts into an entire body pose in each single image. The human pose is expressed as a principal component analysis (PCA) model of trained contour shapes, similar to a puppet, and each part pose contour can be calculated from key points and model parameters.
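To make the part model concrete, the following minimal sketch reconstructs one part contour from key points and model parameters; it is an illustration only, assuming a mean contour and a PCA basis learned offline from training shapes (all variable names are hypothetical):

import numpy as np

def reconstruct_part_contour(mean_contour, pca_basis, params, similarity):
    """Rebuild a body-part contour from PCA shape parameters.

    mean_contour : (N, 2) mean contour points of the part
    pca_basis    : (K, N, 2) principal components of contour variation
    params       : (K,) shape coefficients of this part pose
    similarity   : (scale, theta, tx, ty) transform, derived from the
                   part key points, mapping the model to image space
    """
    # Linear shape model: mean contour plus weighted components.
    contour = mean_contour + np.tensordot(params, pca_basis, axes=1)
    scale, theta, tx, ty = similarity
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    # Rotate, scale, and translate into image coordinates.
    return scale * contour @ rot.T + np.array([tx, ty])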
As human pose changes unpredictably, tracking while detecting improves the tracking accuracy, which differs from traditional visual tracking tasks. During tracking, the video temporal information in the region of each body part is used to calculate the motion of each part pose, and this motion information propagates the part contour from each frame to the next. The propagated parts are treated as body part candidates in the current frame for subsequent calculation.
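As a rough illustration of this propagation step, the sketch below shifts a part contour into the next frame using a dense optical flow field (assumed to be precomputed with an off-the-shelf flow estimator; the robust median step is an assumption, not necessarily the paper's exact formulation):

import numpy as np

def propagate_contour(contour, flow):
    """Propagate a part contour from frame t to frame t+1 via dense flow.

    contour : (N, 2) contour points in (x, y) image coordinates
    flow    : (H, W, 2) dense optical flow from frame t to frame t+1
    """
    h, w = flow.shape[:2]
    # Sample the flow at each contour point (rounded and clipped).
    xs = np.clip(np.round(contour[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(contour[:, 1]).astype(int), 0, h - 1)
    # Use the median motion of the sampled points so that a few
    # background-polluted flow vectors do not dominate.
    motion = np.median(flow[ys, xs], axis=0)
    return contour + motion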
During propagation, background motion can disturb and pollute the foreground motion estimate, causing the propagated part candidates to deviate. To reduce the influence of such deviations, a pictorial structure-based method is adopted to generate additional whole-body pose candidates, which are then decomposed into body part poses for part tracking and optimization. The pictorial structure-based method reliably detects relatively fixed body parts, such as the trunk and head, whereas its detection of arms is poor because arms move flexibly and their shapes change substantially and frequently. The problem of arm detection therefore has to be solved separately: a lightweight deep learning network is constructed and trained to generate probability maps for the key points of the lower arms, and sampling from these maps yields additional candidates of lower-arm poses.
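The sampling step can be pictured as drawing key-point locations in proportion to the network's probability maps, as in the sketch below (assuming one normalized map per lower-arm key point; the names and the pairing convention are illustrative):

import numpy as np

def sample_keypoints(prob_map, num_samples, rng=None):
    """Draw candidate key-point locations from a probability map.

    prob_map    : (H, W) non-negative map output by the lightweight network
    num_samples : number of candidate locations to draw
    """
    rng = rng or np.random.default_rng()
    h, w = prob_map.shape
    p = prob_map.ravel() / prob_map.sum()           # normalize to a distribution
    idx = rng.choice(h * w, size=num_samples, p=p)  # sample flat indices
    ys, xs = np.unravel_index(idx, (h, w))
    return np.stack([xs, ys], axis=1)               # (num_samples, 2) as (x, y)

# Pairing an elbow sample with a wrist sample gives one lower-arm candidate:
# elbows = sample_keypoints(elbow_map, 50)
# wrists = sample_keypoints(wrist_map, 50)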
The propagated and generated part pose candidates then need to be evaluated. The proposed evaluation considers image spatial information and deep learning knowledge. The spatial information comprises color and contour likelihoods: the color likelihood ensures the consistency of part color during tracking, and the contour likelihood ensures the consistency of the part model contour with the image contour features. In addition, the proposed deep learning network generates probability maps of lower-arm feature consistency for each side, revealing the image feature consistency of every lower-arm candidate. The spatial and deep learning features work together to evaluate and optimize the pose of each body part. The optimized parts are recombined into an integrated human pose, and implausible recombinations are eliminated by the shape constraints of the decomposable human model. The recombined, optimized whole-body pose is the tracking result for the current frame; it is decomposed and propagated to the next frame for subsequent tracking.
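One simple way to realize this joint evaluation is a weighted log-likelihood score per candidate, maximized over all propagated and sampled candidates of a part. The sketch below is an assumed combination rule for illustration, not the paper's exact weighting:

import numpy as np

def score_candidate(color_lh, contour_lh, consistency_lh,
                    weights=(1.0, 1.0, 1.0)):
    """Fuse color, contour, and deep feature-consistency likelihoods."""
    eps = 1e-12  # guard against log(0)
    w = np.asarray(weights)
    lhs = np.array([color_lh, contour_lh, consistency_lh])
    return float(np.sum(w * np.log(lhs + eps)))

def select_best_part_pose(candidates, color_fn, contour_fn, consistency_fn):
    """Return the candidate part pose with the highest combined score."""
    scores = [score_candidate(color_fn(c), contour_fn(c), consistency_fn(c))
              for c in candidates]
    return candidates[int(np.argmax(scores))]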
Result
Two publicly available, challenging human pose tracking datasets, namely, VideoPose2.0 and YouTubePose, are used to verify the proposed method. On the VideoPose2.0 dataset, the key point tracking accuracies for shoulders, elbows, and wrists are 90.5%, 82.6%, and 71.2%, respectively, and the average key point accuracy is 81.4%. These results are higher than those of state-of-the-art methods, such as the method based on a conditional random field model (by 15.3%), the method based on a tree structure reasoning model (by 3.9%), and the method based on a max-margin Markov model (by 8.8%). On the YouTubePose dataset, the key point tracking accuracies for shoulders, elbows, and wrists are 86.2%, 84.8%, and 81.6%, respectively, and the average key point accuracy is 84.5%. These results are higher than those of state-of-the-art methods, such as the method based on a flowing ConvNet model (by 13.7%), the method based on a dependent pairwise relation model (by 1.1%), and the method based on a mixed part sequence reasoning model (by 15.9%). The proposed key algorithms, additional sampling and feature consistency for the lower arm, are verified on the VideoPose2.0 dataset; they improve the tracking accuracy of the lower-arm joints by 5.2% and 31.2%, respectively.
Conclusion
Experimental results show that the proposed human pose tracking method, which uses spatial-temporal cues coupled with deep learning probability maps, can effectively improve pose tracking accuracy, especially for the flexibly moving lower arms.
Andriluka M, Pishchulin L, Gehler P and Schiele B. 2014. 2D human pose estimation: new benchmark and state of the art analysis//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE: 3686-3693 [DOI: 10.1109/CVPR.2014.471]
Anguelov D, Srinivasan P, Koller D, Thrun S, Rodgers J and Davis J. 2005. SCAPE: shape completion and animation of people. ACM Transactions on Graphics, 24(3): 408-416 [DOI: 10.1145/1073204.1073207]
Bajcsy R, Aloimonos Y and Tsotsos J K. 2018. Revisiting active perception. Autonomous Robots, 42(2): 177-196 [DOI: 10.1007/s10514-017-9615-3]
Cao Z, Simon T, Wei S E and Sheikh Y. 2017. Realtime multi-person 2D pose estimation using part affinity fields//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 7291-7299 [DOI: 10.1109/CVPR.2017.143]
Charles J, Pfister T, Magee D, Hogg D and Zisserman A. 2016. Personalizing human video pose estimation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 3063-3072 [DOI: 10.1109/CVPR.2016.334]
Chen Y L, Wang Z C, Peng Y X, Zhang Z Q, Yu G and Sun J. 2018. Cascaded pyramid network for multi-person pose estimation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 7103-7112 [DOI: 10.1109/CVPR.2018.00742]
Cherian A, Mairal J, Alahari K and Schmid C. 2014. Mixing body-part sequences for human pose estimation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE: 2353-2360 [DOI: 10.1109/CVPR.2014.302]
Duckworth P, Hogg D C and Cohn A G. 2019. Unsupervised human activity analysis for intelligent mobile robots. Artificial Intelligence, 270: 67-92 [DOI: 10.1016/j.artint.2018.12.005]
Fischler M A and Elschlager R A. 1973. The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1): 67-92 [DOI: 10.1109/T-C.1973.223602]
Fragkiadaki K, Hu H and Shi J B. 2013. Pose from flow and flow from pose//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland: IEEE: 2059-2066 [DOI: 10.1109/CVPR.2013.268]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE: 580-587 [DOI: 10.1109/CVPR.2014.81]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Red Hook, NY: ACM: 1097-1105
Kumar R and Batra D. 2016. Pose tracking by efficiently exploiting global features//Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Placid, New York: IEEE: 1-9 [DOI: 10.1109/WACV.2016.7477563]
Liu C. 2009. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. Cambridge, MA: Massachusetts Institute of Technology
López-Quintero M I, Marín-Jiménez M J, Muñoz-Salinas R and Medina-Carnicer R. 2017. Mixing body-parts model for 2D human pose estimation in stereo videos. IET Computer Vision, 11(6): 426-433 [DOI: 10.1049/iet-cvi.2016.0249]
Lowe D G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91-110 [DOI: 10.1023/B:VISI.0000029664.99615.94]
Ma M, Marturi N, Li Y B, Stolkin R and Leonardis A. 2016. A local-global coupled-layer puppet model for robust online human pose tracking. Computer Vision and Image Understanding, 153: 163-178 [DOI: 10.1016/j.cviu.2016.08.010]
Newell A, Yang K Y and Deng J. 2016. Stacked hourglass networks for human pose estimation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer: 483-499 [DOI: 10.1007/978-3-319-46484-8_29]
Pfister T, Charles J and Zisserman A. 2015. Flowing ConvNets for human pose estimation in videos//Proceedings of 2015 International Conference on Computer Vision. Santiago: IEEE: 1913-1921 [DOI: 10.1109/ICCV.2015.222]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI: 10.1007/s11263-015-0816-y]
Samanta S and Chanda B. 2016. A data-driven approach for human pose tracking based on spatio-temporal pictorial structure [EB/OL]. [2019-08-22]. https://arxiv.org/pdf/1608.00199.pdf
Sapp B, Weiss D and Taskar B. 2011. Parsing human motion with stretchable models//Proceedings of CVPR 2011. Providence: IEEE: 1281-1288 [DOI: 10.1109/CVPR.2011.5995607]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2019-08-22]. https://arxiv.org/pdf/1409.1556.pdf
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Wei S E, Ramakrishna V, Kanade T and Sheikh Y. 2016. Convolutional pose machines//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 4724-4732 [DOI: 10.1109/CVPR.2016.511]
Xu Y X and Chen F. 2015. Recent advances in local image descriptor. Journal of Image and Graphics, 20(9): 1133-1150 [DOI: 10.11834/jig.20150901]
Yang Y and Ramanan D. 2012. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12): 2878-2890 [DOI: 10.1109/TPAMI.2012.261]
Zhao L, Gao X B, Tao D C and Li X L. 2015. Tracking human pose using max-margin Markov models. IEEE Transactions on Image Processing, 24(12): 5274-5287 [DOI: 10.1109/TIP.2015.2473662]
Zheng L, Huang Y J, Lu H C and Yang Y. 2019. Pose-invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 28(9): 4500-4509 [DOI: 10.1109/TIP.2019.2910414]
Zuffi S, Freifeld O and Black M J. 2012. From pictorial structures to deformable structures//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence: IEEE: 3546-3553 [DOI: 10.1109/CVPR.2012.6248098]
Zuffi S, Romero J, Schmid C and Black M J. 2013. Estimating human pose with flowing puppets//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney: IEEE: 3312-3319 [DOI: 10.1109/ICCV.2013.411]