Multiperson pose estimation with a pruned YOLOv3 model

Cai Zhedong, Ying Na, Guo Chunsheng, Guo Rui, Yang Peng (School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018)

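Abstract
Objective To address the localization and recognition problems of multiperson pose estimation in complex environments, improve its accuracy, reduce the large number of redundant parameters in the algorithm, and increase the running speed of pose estimation, a multiperson pose estimation algorithm based on channel pruning of batch normalization (BN) layers, the YOLOv3 prune pose estimator (YLPPE), is proposed. Method Building on the target detection algorithm YOLOv3 (you only look once v3) and the stacked hourglass network (SHN), the YOLOv3 anchor boxes are modified with an overlap-based K-means algorithm to better suit pedestrian detection, and the Trimming-YOLOv3 network is trained. The scaling factors of the BN layers are used to perform cyclic, iterative channel pruning on the Trimming-YOLOv3 network; by setting the pruning threshold and scaling factor, a fairly effective pruning result is obtained, and training yields the Trim-Prune-YOLOv3 network. To connect to the single-person pose estimation network, images are resized to 256×256 pixels (nonsquare images are zero-padded), and four Hourglass subnetworks are then cascaded to form the stacked hourglass network, improving overall pose estimation accuracy. Result Experiments on the MPII human pose dataset (MPII dataset) of the Max Planck Institute for Informatics show that the proposed algorithm reaches a pose estimation accuracy of 83.9%; the time complexity is O(n²), and the number of model parameters is 42.9% lower than that of the unpruned original YOLOv3. Conclusion The multiperson pose estimation method combined with the YOLOv3 pruning algorithm effectively reduces the negative effect of complex environments on human pose estimation, achieves multiperson pose estimation in complex environments with improved accuracy, effectively reduces redundant model parameters, and increases the overall running speed of the algorithm. It achieves fairly accurate multiperson pose estimation with good robustness and generalization ability.
Keywords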
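The anchor-box step in the Method part can be illustrated with a minimal sketch of K-means clustering that uses overlap (intersection over union, IoU) instead of Euclidean distance. This is not the authors' code: the function names `iou_wh` and `kmeans_anchors`, the random seeding, and the mean-based centroid update are assumptions made for illustration, and boxes are assumed to be (width, height) pairs aligned at a common corner.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) boxes, assuming they share a top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) boxes into k anchors using highest IoU as 'nearest'.

    Hypothetical helper: seeding and stopping rule are illustrative choices.
    """
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Assign each box to the centroid it overlaps most (distance = 1 - IoU).
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:
                # Keep an empty cluster's centroid unchanged.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Return anchors ordered from smallest to largest area.
    return sorted(centroids, key=lambda c: c[0] * c[1])


if __name__ == "__main__":
    # Toy pedestrian-like boxes (taller than wide), in pixels.
    boxes = [(10, 30), (12, 32), (11, 28), (40, 100), (42, 110), (38, 95)]
    print(kmeans_anchors(boxes, k=2))
```

Using IoU as the similarity measure keeps large boxes from dominating the clustering, which is the usual motivation for this choice when fitting anchors to tall, narrow pedestrian boxes.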
Research on multiperson pose estimation combined with YOLOv3 pruning model

Cai Zhedong, Ying Na, Guo Chunsheng, Guo Rui, Yang Peng (School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China)

Abstract
Objective Human pose estimation has long been an active research direction in computer vision. Pose estimation for multiple people against a complex background is much more difficult than single-person pose estimation (SPPE) against a simple background. Factors such as cluttered backgrounds, the need to distinguish multiple people, and occlusion between bodies make accurate multiperson pose estimation considerably harder. Multiperson pose estimation algorithms fall mainly into "top-down" and "bottom-up" frameworks. The "top-down" framework proceeds from the whole to the parts and back to the whole: it first detects a bounding box for each human body and then estimates the pose independently within each box to complete multiperson pose estimation. The "bottom-up" framework proceeds from the parts to the whole: it first detects body parts independently and then assembles the detected parts into human poses. Both frameworks have advantages and disadvantages. The "top-down" framework is susceptible to redundant bounding boxes, and its pose estimation accuracy depends mainly on the quality of the human bounding boxes. With the "bottom-up" framework, when two or more people are very close together, the poses that are detected and assembled become ambiguous because the framework is based on local evidence and lacks global control, so it is more prone to pose assembly errors when applied to multiperson pose estimation in complex environments. Our goal is more accurate multiperson pose estimation that retains a global view of the scene. Therefore, a multiperson pose estimation method combining the YOLOv3 pruning model and SPPE is proposed to solve the localization and identification problems of multiperson pose estimation in complex environments and to improve its accuracy.
The YOLOv3 algorithm is an end-to-end target detection algorithm proposed in 2018. It uses multiple residual units for feature extraction and a feature pyramid network for feature fusion. As a result, YOLOv3 greatly improves detection accuracy while maintaining real-time performance. However, the YOLOv3 model has many redundant parameters that considerably reduce its running speed and overall performance. Model pruning ranks parameters by importance and removes the redundant ones, reducing overall model complexity and increasing running speed. The stacked hourglass network, introduced in 2016, consists of multiple hourglass subnetworks and is highly flexible. Each hourglass subnetwork is built from residual modules, whose feature extraction and feature combination capabilities are used to extract features from images or video. The design of a main network composed of subnetworks makes the stacked hourglass network easy to reconfigure. Stacking multiple subnetworks lets each subnetwork use the information extracted by the previous one, improving the accuracy with which the overall network predicts human joint points. Method The algorithm is based on the target detection algorithm YOLOv3 and the stacked hourglass network. The YOLOv3 anchor boxes are modified with an overlap-based K-means algorithm to better suit pedestrian detection, and the Trimming-YOLOv3 network is trained. The scaling factors of the batch normalization (BN) layers are then used to perform cyclic, iterative channel pruning on the Trimming-YOLOv3 network; by setting the pruning threshold and scaling factor, a fairly effective pruning result is obtained, and training yields the Trim-Prune-YOLOv3 network.
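As an illustration of the channel selection behind BN-based pruning, the following minimal sketch computes one global threshold over the absolute BN scaling factors and marks the channels to keep in each layer. It is a sketch under stated assumptions, not the authors' implementation: the function name `prune_masks`, the `prune_ratio` parameter, the layer names, and the rule of always keeping at least one channel per layer are all hypothetical. Only one selection round is shown; the fine-tuning and repetition that make the procedure cyclic and iterative are omitted.

```python
def prune_masks(bn_gammas, prune_ratio):
    """Select channels to keep from BN scaling factors (gamma).

    bn_gammas: dict mapping a layer name to its list of BN scaling factors.
    prune_ratio: fraction of all channels (across layers) to mark for removal.
    Returns (threshold, masks), where masks[name][i] is True if channel i is kept.
    Hypothetical helper written for illustration only.
    """
    all_abs = sorted(abs(g) for gammas in bn_gammas.values() for g in gammas)
    cut = int(len(all_abs) * prune_ratio)
    # Global threshold: channels with |gamma| below it are pruned.
    threshold = all_abs[cut] if cut < len(all_abs) else float("inf")

    masks = {}
    for name, gammas in bn_gammas.items():
        mask = [abs(g) >= threshold for g in gammas]
        if not any(mask):
            # Assumption: never remove a whole layer; keep its strongest channel.
            best = max(range(len(gammas)), key=lambda i: abs(gammas[i]))
            mask[best] = True
        masks[name] = mask
    return threshold, masks


if __name__ == "__main__":
    # Toy scaling factors for two hypothetical convolution layers.
    gammas = {"conv1": [0.9, 0.01, 0.5, 0.02], "conv2": [0.03, 0.04]}
    threshold, masks = prune_masks(gammas, prune_ratio=0.5)
    print(threshold, masks)
```

In a real network the kept-channel masks would then be used to rebuild each convolution with fewer filters and to slice the input channels of the layer that follows, after which the smaller model is fine-tuned before the next round.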
To connect to the SPPE network, each image is resized to 256×256 pixels (nonsquare images are zero-padded), and four hourglass subnetworks are then cascaded to form the stacked hourglass network, improving overall pose estimation accuracy. Result The method was validated on the MPII human pose dataset (MPII dataset) of the Max Planck Institute for Informatics, one of the most authoritative datasets in the field of human pose estimation. The MPII dataset is a very challenging multiperson pose dataset that contains 3 844 training groups and 1 758 test groups, including occluded and overlapping people. Each person is annotated with 16 body joints, covering the head, shoulders, elbows, wrists, hips, knees, and ankles. On the MPII dataset, the accuracy of the multiperson pose estimation algorithm reaches 83.9%, the time complexity is O(n²), and the number of model parameters decreases by 42.9% compared with the unpruned original YOLOv3. Conclusion The multiperson pose estimation method combined with the YOLOv3 pruning algorithm effectively reduces the negative effect of complex environments on human pose estimation and achieves multiperson pose estimation with improved accuracy in complex environments, while model pruning effectively reduces redundant model parameters and improves the overall speed of the algorithm. Experimental results show that the method achieves more accurate multiperson pose estimation and has better robustness and generalization ability than other methods.
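The 256×256 resizing with zero padding mentioned in the Method part can be sketched as a geometry calculation. The sketch below is an assumption-laden illustration rather than the authors' procedure: it assumes the aspect ratio is preserved, the padding is split evenly on both sides, and predicted joint coordinates must later be mapped back to the original crop. The function names `letterbox_params` and `to_original` are hypothetical.

```python
def letterbox_params(width, height, target=256):
    """Geometry for fitting a width x height crop into a target x target square.

    The longer side is scaled to `target`; the shorter side is zero-padded.
    Centered padding is an assumption made for this sketch.
    Returns (scale, new_w, new_h, pad_left, pad_top).
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top


def to_original(x, y, params):
    """Map a joint predicted in the padded square back to crop coordinates."""
    scale, _, _, pad_left, pad_top = params
    return (x - pad_left) / scale, (y - pad_top) / scale


if __name__ == "__main__":
    # A 512 x 256 person crop: scaled by 0.5, then padded above and below.
    params = letterbox_params(512, 256)
    print(params)
    print(to_original(128, 128, params))
```

Padding instead of stretching keeps body proportions unchanged, which matters for a pose network that was trained on square inputs.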
Keywords
