Research on multiperson pose estimation combined with YOLOv3 pruning model
2021, Vol. 26, No. 4, pp. 837-846
Received: 2020-05-13
Revised: 2020-07-03
Accepted: 2020-07-20
Published in print: 2021-04-16
DOI: 10.11834/jig.200138
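Objective
To address localization and recognition problems in multiperson pose estimation in complex environments, improve estimation accuracy, remove the large number of redundant parameters in existing algorithms, and increase running speed, this paper proposes a multiperson pose estimation algorithm based on channel pruning of batch normalization (BN) layers, called the YOLOv3 prune pose estimator (YLPPE).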
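Method
The algorithm builds on the object detection algorithm YOLOv3 (you only look once v3) and the stacked hourglass network (SHN). The YOLOv3 anchor boxes are modified with an overlap-based K-means algorithm so that they better suit pedestrian detection, and the Trimming-YOLOv3 network is trained. The scaling factors of the batch normalization layers are then used to perform iterative channel pruning on Trimming-YOLOv3, with a pruning threshold and scaling factor set to obtain an effective pruning result, and the pruned network is trained to obtain Trim-Prune-YOLOv3. To connect it to the single-person pose estimation network, images are resized to 256×256 pixels (nonsquare images are zero-padded). Four hourglass subnetworks are then cascaded to form the stacked hourglass network, which improves overall pose estimation accuracy.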
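The anchor re-estimation step can be illustrated with a minimal sketch. It assumes that "overlap-based K-means" means clustering box widths and heights with the distance 1 - IoU, as is common for YOLO-style anchors; the abstract does not give the exact procedure, and the function names, the seed, and the sample boxes below are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU; return k anchors sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumed initialisation: k random boxes
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU (smallest 1 - IoU).
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        # Move each centroid to the mean width and height of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                updated.append((w, h))
            else:
                updated.append(centroids[i])  # leave an empty cluster unchanged
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical pedestrian-like boxes: tall and narrow, at two scales.
boxes = [(10, 30), (12, 32), (11, 31), (40, 120), (42, 118), (41, 122)]
anchors = kmeans_anchors(boxes, k=2)
```

Using 1 - IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which is the usual reason for this choice.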
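Result
Experiments on the MPII human pose dataset of the Max Planck Institute for Informatics show that the algorithm reaches a pose estimation accuracy of 83.9%; its time complexity is $O(n^2)$, and the number of model parameters is 42.9% lower than that of the original unpruned YOLOv3.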
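Conclusion
The multiperson pose estimation method combined with the YOLOv3 pruning algorithm effectively reduces the negative effect of complex environments on human pose estimation, achieves multiperson pose estimation in such environments with improved accuracy, reduces redundant model parameters, and increases the overall running speed of the algorithm. It delivers reasonably accurate multiperson pose estimation with good robustness and generalization ability.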
Objective
Human pose estimation has long been an active research direction in computer vision. Pose estimation for multiple people against a complex background is much more difficult than single-person pose estimation (SPPE) against a simple background. Factors such as complex backgrounds, the need to distinguish multiple people, and occlusion between bodies make accurate multiperson pose estimation considerably harder. Multiperson pose estimation algorithms fall mainly into "top-down" and "bottom-up" frameworks. A top-down framework proceeds from the whole to the parts: it detects the bounding box of each human body and then estimates the pose within each box independently. A bottom-up framework proceeds from the parts to the whole: it first detects body parts independently and then assembles them into human poses. Each framework has advantages and disadvantages. A top-down framework is susceptible to redundant bounding boxes, and its accuracy depends mainly on the quality of the detected human bounding boxes. With a bottom-up framework, when two or more people are very close together, the detected and assembled poses become ambiguous because the framework is based on local information and lacks global context; it is therefore more prone to pose assembly errors in multiperson pose estimation in complex environments. The aim here is more accurate multiperson pose estimation that retains a global view of the scene. Therefore, a multiperson pose estimation method that combines a pruned YOLOv3 model with SPPE is proposed to address localization and recognition problems in complex environments and to improve accuracy. YOLOv3, proposed in 2018, is an end-to-end object detection algorithm. It uses multiple residual blocks for feature extraction and a feature pyramid network for feature fusion, which greatly improves detection accuracy while maintaining real-time performance. However, the YOLOv3 model has many redundant parameters that reduce its running speed and overall performance. Model pruning assesses the importance of parameters and removes redundant ones to reduce model complexity and increase running speed. The stacked hourglass network, introduced in 2016, consists of multiple hourglass subnetworks and is highly flexible. Each hourglass subnetwork is built from residual modules and uses their feature combination and extraction capabilities to extract features from images or video. This network-subnetwork design makes the stacked hourglass network easy to extend. Stacking multiple subnetworks lets each subnetwork use the information extracted by the previous one, improving the accuracy with which the overall network predicts human joint positions.
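The top-down flow described above (detect each person, then estimate the pose inside each box) can be sketched as follows. This is only an illustration: `detect_people` and `estimate_pose` are hypothetical callables standing in for the detector and the single-person pose network, images are plain nested lists, and padding on the bottom and right is an assumption. The paper also resizes each input to 256×256 pixels, which is omitted here so that the coordinate mapping stays simple.

```python
def crop(image, box):
    """Crop box = (x0, y0, x1, y1) from a 2-D image; x1 and y1 are exclusive."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def pad_to_square(image, fill=0):
    """Zero-pad a 2-D image on the right and bottom so that height equals width."""
    height = len(image)
    width = len(image[0]) if height else 0
    side = max(height, width)
    padded = [row + [fill] * (side - width) for row in image]
    padded += [[fill] * side for _ in range(side - height)]
    return padded


def top_down_poses(image, detect_people, estimate_pose):
    """Detect people, then run single-person pose estimation on each padded crop."""
    poses = []
    for box in detect_people(image):
        patch = pad_to_square(crop(image, box))
        joints = estimate_pose(patch)  # (x, y) joints in patch coordinates
        x0, y0 = box[0], box[1]
        # Padding was added only on the right and bottom, so shifting by the
        # box origin maps the joints back to full-image coordinates.
        poses.append([(x + x0, y + y0) for x, y in joints])
    return poses
```

Because each person is handled independently, a missed or duplicated box translates directly into a missed or duplicated pose, which is why detector quality matters so much in this framework.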
Method
The algorithm is based on the object detection algorithm YOLOv3 and the stacked hourglass network. The YOLOv3 anchor boxes are modified with an overlap-based K-means algorithm to better suit pedestrian detection, and the Trimming-YOLOv3 network is trained. The scaling factors of the batch normalization layers are then used to perform iterative channel pruning on Trimming-YOLOv3, with a pruning threshold and scaling factor set to obtain an effective pruning result, and the pruned network is trained to obtain Trim-Prune-YOLOv3. To connect it to the SPPE network, images are resized to 256×256 pixels (nonsquare images are zero-padded). Four hourglass subnetworks are then cascaded to form the stacked hourglass network, improving overall pose estimation accuracy.
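One round of channel selection from batch normalization scaling factors might look like the sketch below. It assumes a single global threshold taken at a chosen pruning ratio over all |γ| values, in the style of network slimming; the abstract does not state how the threshold is set, and the function names, the sample values, and the keep-one-channel safeguard are hypothetical. In the iterative scheme described above, such a round would be followed by retraining before the next round.

```python
def global_threshold(gammas_per_layer, prune_ratio):
    """Return the |gamma| value below which about prune_ratio of all channels fall."""
    ranked = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    index = min(int(len(ranked) * prune_ratio), len(ranked) - 1)
    return ranked[index]


def channel_masks(gammas_per_layer, threshold):
    """Mark channels with |gamma| >= threshold as kept; never empty a whole layer."""
    masks = []
    for layer in gammas_per_layer:
        mask = [abs(g) >= threshold for g in layer]
        if not any(mask):
            # Assumed safeguard: keep the strongest channel so the layer survives.
            strongest = max(range(len(layer)), key=lambda i: abs(layer[i]))
            mask[strongest] = True
        masks.append(mask)
    return masks


def prune_once(gammas_per_layer, prune_ratio):
    """Run one pruning round; return the surviving scaling factors and the masks."""
    threshold = global_threshold(gammas_per_layer, prune_ratio)
    masks = channel_masks(gammas_per_layer, threshold)
    kept = [
        [g for g, keep in zip(layer, mask) if keep]
        for layer, mask in zip(gammas_per_layer, masks)
    ]
    return kept, masks


# Hypothetical scaling factors for two BN layers with four channels each.
gammas = [[0.9, 0.01, 0.5, 0.02], [0.03, 0.8, 0.7, 0.04]]
kept, masks = prune_once(gammas, prune_ratio=0.5)
```

A global threshold lets layers with many near-zero scaling factors lose more channels than layers whose channels all carry weight, so the pruning is not uniform across the network.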
Result
The method is evaluated on the MPII human pose dataset (MPII dataset) of the Max Planck Institute for Informatics, one of the most authoritative datasets in human pose estimation. MPII is a challenging multiperson pose dataset containing 3 844 training groups and 1 758 test groups, including occluded and overlapping people. It annotates 16 body joints, covering the head, shoulders, elbows, wrists, hips, knees, and ankles. On the MPII dataset, the multiperson pose estimation algorithm reaches an accuracy of 83.9%, its time complexity is $O(n^2)$, and the number of model parameters is 42.9% lower than that of the original unpruned YOLOv3.
Conclusion
The multiperson pose estimation method combined with the YOLOv3 pruning algorithm effectively reduces the negative effect of complex environments on human pose estimation and improves multiperson estimation accuracy in such environments, while model pruning reduces redundant model parameters and increases the overall speed of the algorithm. Experimental results show that the method achieves more accurate multiperson pose estimation and has better robustness and generalization ability than other methods.