Human pose tracking based on multi-feature fusion in videos

Ma Miao1, Li Yibin2, Wu Xianqing1, Gao Jinfeng1, Pan Haipeng1 (1. Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. Shandong University, Jinan 250100, China)

Abstract
Objective The tracking accuracy of existing human pose tracking algorithms still needs improvement, especially for the flexibly moving arms. To improve accuracy, this paper proposes, for the first time, a human pose tracking method that combines visual spatial-temporal information with a deep learning network. Method During tracking, video temporal information is used to compute the motion of the human target region, and this motion information propagates the human part pose models between frames. Because methods based on image spatial features detect relatively fixed body parts such as the trunk and head well but perform poorly on arms, a lightweight deep learning network is constructed and trained to generate additional candidate samples for the arms. The network also produces arm feature-consistency probability maps, which are combined with video spatial information to compute the optimal pose of each part; the parts are then recombined into the complete human pose tracking result. Result The proposed method is validated on two challenging human pose tracking datasets, VideoPose2.0 and YouTubePose, achieving average arm joint tracking accuracies of 81.4% and 84.5%, respectively, a clear improvement over existing methods. Experiments on the VideoPose2.0 dataset further verify that the proposed additional lower-arm sampling and arm feature-consistency algorithms effectively improve the tracking accuracy of human pose joints. Conclusion The proposed method, which combines spatial-temporal information with a deep learning network, effectively improves human pose tracking accuracy, particularly for the flexibly moving lower-arm joints.
Keywords
Human pose tracking based on multi-feature fusion in videos

Ma Miao1, Li Yibin2, Wu Xianqing1, Gao Jinfeng1, Pan Haipeng1(1.Zhejiang Sci-Tech University, Hangzhou 310018, China;2.Shandong University, Jinan 250100, China)

Abstract
Objective Human pose tracking in video sequences aims to estimate the pose of a given person in each frame using image and video cues and to track that pose consecutively throughout the entire video. The field has drawn increasing attention because advances in artificial intelligence and the Internet of Things make human-computer interaction frequent, and robots or intelligent agents can understand human action and intention by visually tracking human poses. Researchers currently tend to express human poses with the pictorial structure model and track them with inference methods; however, the tracking accuracy of existing methods still needs improvement, especially for the flexibly moving arms. Because different types of features describe different kinds of information, the crux of human pose tracking lies in selecting and combining the appropriate features. We investigate the construction of effective features that accurately describe the poses of different body parts and propose a novel method that fuses video spatial and temporal features with deep learning features to improve the accuracy of human pose tracking in video sequences. Method Tracking a visual target requires an evaluation criterion, but the human pose is an articulated, complex target, and evaluating it as a whole leads to ambiguity. This paper therefore proposes a decomposable human pose expression model that tracks each body part separately through the video and recombines the parts into an entire body pose in each frame. The human pose is expressed as a puppet-like principal component analysis (PCA) model of trained contour shapes, so each part's pose contour can be calculated from its key points and the model parameters.
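The puppet-like PCA contour model can be sketched as follows. This is a minimal illustration under assumed names and dimensions (a four-point toy contour, a single hypothetical variation mode), not the authors' implementation.

```python
import numpy as np

class PartShapeModel:
    """Decomposable PCA contour-shape model for one body part (illustrative)."""

    def __init__(self, mean_contour, basis):
        # mean_contour: (2N,) mean of training contours, flattened (x1, y1, ..., xN, yN)
        # basis: (2N, K) top-K principal components of contour variation
        self.mean = mean_contour
        self.basis = basis

    def contour(self, params, translation, scale=1.0):
        """Reconstruct a part contour from PCA parameters and key-point placement."""
        pts = self.mean + self.basis @ params            # deform along PCA modes
        pts = pts.reshape(-1, 2) * scale + translation   # place via key points
        return pts

# Toy usage: a unit-square contour with one hypothetical variation mode.
mean = np.array([0, 0, 1, 0, 1, 1, 0, 1], dtype=float)
basis = np.eye(8)[:, :1]
model = PartShapeModel(mean, basis)
c = model.contour(np.zeros(1), translation=np.array([10.0, 20.0]))
```

With zero PCA parameters the model reproduces the mean shape at the given key-point location, which is the behavior a candidate generator would start from.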
Because the human pose changes unpredictably, tracking while detecting improves accuracy, which distinguishes this task from traditional visual tracking. During tracking, the video temporal information in each part's target region is used to calculate the motion of that part's pose, and the motion information then propagates the part contour from each frame to the next. The propagated parts serve as body part candidates in the current frame for subsequent calculation. During propagation, however, background motion can pollute the foreground motion information, so the candidates obtained by motion propagation may deviate. To limit the influence of such deviations, a pictorial-structure-based method is adopted to generate additional whole-body pose candidates, which are then decomposed into body part poses for part-level tracking and optimization. The pictorial-structure-based method detects relatively fixed body parts, such as the trunk and head, well, whereas its detection of arms is poor because arms move flexibly and their shapes change substantially and frequently. To solve the arm detection problem, a lightweight deep learning network is constructed and trained to generate probability maps for the key points of the lower arms; sampling from these probability maps yields additional lower-arm pose candidates. The propagated and generated part pose candidates then need to be evaluated, and the proposed evaluation method considers both image spatial information and deep learning knowledge.
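The two candidate sources described above, motion-based propagation and probability-map sampling, can be sketched as follows. The flow field is assumed given by any dense optical-flow estimator, and all function names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def propagate_contour(contour, flow):
    """Shift a part contour to the next frame by the mean flow over its points.

    contour: (N, 2) points; flow: (H, W, 2) per-pixel (dx, dy) to the next frame.
    """
    xs = np.clip(contour[:, 0].round().astype(int), 0, flow.shape[1] - 1)
    ys = np.clip(contour[:, 1].round().astype(int), 0, flow.shape[0] - 1)
    # Averaging over the part's pixels suppresses outliers from background
    # motion leaking into the region.
    mean_flow = flow[ys, xs].mean(axis=0)
    return contour + mean_flow

def sample_candidates(prob_map, n, rng=np.random.default_rng(0)):
    """Draw n lower-arm key-point candidates from a network probability map."""
    p = prob_map.ravel() / prob_map.sum()
    idx = rng.choice(p.size, size=n, p=p)
    ys, xs = np.unravel_index(idx, prob_map.shape)
    return np.stack([xs, ys], axis=1)   # (n, 2) candidate (x, y) positions

# Toy usage: a uniform (2, 1) flow field shifts every contour point by (2, 1).
flow = np.zeros((48, 64, 2))
flow[..., 0] = 2.0
flow[..., 1] = 1.0
contour = np.array([[10.0, 10.0], [20.0, 12.0]])
moved = propagate_contour(contour, flow)
```

Sampling from the probability map rather than taking only its argmax keeps several plausible lower-arm hypotheses alive for the later evaluation step.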
The spatial information includes color and contour likelihoods: the color likelihood function keeps the color of each part consistent during tracking, and the contour likelihood function keeps the part model contour consistent with the image contour features. The proposed deep learning network also generates a probability map of lower-arm feature consistency for each side, revealing how well the image features support each calculated lower-arm candidate. The spatial and deep learning features together evaluate and optimize the pose of every part, and the optimized parts are recombined into an integrated human pose, with implausible recombinations eliminated by the shape constraints of the proposed decomposable human model. The recombined optimized whole-body pose is the human pose tracking result for the current video frame; it is then decomposed and propagated to the next frame for subsequent tracking. Result Two publicly available and challenging human pose tracking datasets, VideoPose2.0 and YouTubePose, are used to verify the proposed human pose tracking method. On the VideoPose2.0 dataset, the key point tracking accuracies for shoulders, elbows, and wrists are 90.5%, 82.6%, and 71.2%, respectively, with an average of 81.4%. These results exceed those of state-of-the-art methods, including a method based on a conditional random field model (by 15.3%), a method based on a tree structure reasoning model (by 3.9%), and a method based on a max-margin Markov model (by 8.8%). On the YouTubePose dataset, the key point tracking accuracies for shoulders, elbows, and wrists are 86.2%, 84.8%, and 81.6%, respectively, with an average of 84.5%.
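The fusion of spatial likelihoods with the network's consistency probability can be sketched as a simple multiplicative score, so a candidate must agree with all cues at once. The exact likelihood forms are assumptions for illustration (a Bhattacharyya color similarity is used here); the paper only specifies that color, contour, and deep-learning cues are combined.

```python
import numpy as np

def color_likelihood(hist, ref_hist, eps=1e-8):
    """Bhattacharyya similarity between a candidate's and the reference color histogram."""
    return float(np.sum(np.sqrt(hist * ref_hist)) + eps)

def score_candidate(hist, ref_hist, contour_score, consistency_prob):
    # Multiplying the cues penalizes a candidate that fails any single one.
    return color_likelihood(hist, ref_hist) * contour_score * consistency_prob

def best_candidate(cands):
    """cands: list of (hist, ref_hist, contour_score, consistency_prob) tuples."""
    scores = [score_candidate(*c) for c in cands]
    return int(np.argmax(scores)), scores

# Toy usage: a color-consistent candidate outscores a color-drifted one.
h = np.array([0.5, 0.5])
off = np.array([1.0, 0.0])
i, scores = best_candidate([(h, h, 0.9, 0.8), (off, h, 0.9, 0.8)])
```

The winning part poses would then be recombined into a whole-body pose and checked against the decomposable model's shape constraints.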
These results exceed those of state-of-the-art methods, including a method based on a Flowing ConvNets model (by 13.7%), a method based on a dependent pairwise relation model (by 1.1%), and a method based on a mixed part sequence reasoning model (by 15.9%). The proposed key algorithms, additional sampling and feature consistency for the lower arms, are verified on the VideoPose2.0 dataset; they improve the tracking accuracy of the lower-arm joints by 5.2% and 31.2%, respectively. Conclusion Experimental results show that the proposed human pose tracking method, which couples spatial-temporal cues with deep learning probability maps, effectively improves pose tracking accuracy, especially for the flexibly moving lower arms.
Keywords
