Lightweight high-resolution human pose estimation combined with densely connected network

Gao Kun; Li Wanggen; Shu Yang; Ge Yingkui; Wang Zhige

Published: 2024-05-20
DOI: 10.11834/jig.230228
2024 | Volume 29 | Number 5

Gao Kun, Li Wanggen, Shu Yang, Ge Yingkui, Wang Zhige (School of Computer and Information, Anhui Normal University, Wuhu 241002, China)

Abstract

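Objective: To achieve better lightweight human pose estimation, that is, higher detection performance under the very limited resources of a lightweight model, this study builds on the high-resolution network (HRNet) and proposes a lightweight high-resolution human pose estimation network combined with a densely connected network (lightweight high-resolution human estimation combined with densely connected network, LDHNet).

Method: By redesigning the stage branch structure in HRNet and proposing a new lightweight feature extraction module, a lightweight and efficient feature extraction unit is constructed. The feature fusion between branches is also made lightweight, which further reduces model complexity. Together, these changes greatly reduce the number of parameters and the amount of computation, meeting the lightweight design goal while preserving model performance.

Result: Experiments show that on the MPII (Max Planck Institute for Informatics) test set, compared with the top-down lightweight human pose estimation model LiteHRNet, LDHNet improves average prediction accuracy by 1.5% while adding only a small number of parameters and a small amount of computation, and by 0.9% compared with DiteHRNet, an improved variant of LiteHRNet. On the COCO (common objects in context) validation set, LDHNet improves average detection accuracy by 3.4% over LiteHRNet and by 2.3% over DiteHRNet. Compared with HRFormer, which incorporates Transformer, LDHNet achieves comparable detection performance with fewer parameters and less computation. LDHNet also performs stably in real-world scenes, and in the same environment its inference speed is higher than that of the baseline HRNet and of LiteHRNet, among others.

Conclusion: The model effectively achieves a lightweight design while maintaining prediction performance.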

Keywords

human pose estimation; lightweight network; densely connected network; high-resolution network; multi-branch structure

Lightweight high-resolution human pose estimation combined with densely connected network

Gao Kun, Li Wanggen, Shu Yang, Ge Yingkui, Wang Zhige (School of Computer and Information, Anhui Normal University, Wuhu 241002, China)

Abstract

Objective: Human pose estimation is a technology with wide application in daily life. In recent years, many excellent high-precision methods have been proposed, but they are often accompanied by a very large model scale and therefore run into computing power bottlenecks in application. Large models require considerable computing power for both training and deployment, whereas most devices have little computing power. Likewise, in everyday scenes, devices place higher demands on the applicability and detection speed of the model, which large models struggle to meet. Given such requirements, lightweight human pose estimation has become an active research field. The main problem is how to achieve high detection accuracy and fast detection speed under extremely limited resources. Lightweight models are inevitably at a disadvantage in detection accuracy compared with large models. Fortunately, many studies in recent years show that lightweight models can also achieve high detection accuracy, so a good balance between model size and accuracy can be reached.

Method: Based on the high-resolution network (HRNet), a lightweight high-resolution human pose estimation network combined with a densely connected network (LDHNet) is proposed. First, dense connections and multi-scale processing are integrated to construct a lightweight and efficient feature extraction unit by redesigning the stage branch structure in HRNet and proposing a new lightweight feature extraction module. The feature extraction module is composed of modules similar to a pyramid structure: dilated convolutions at three scales capture a wide range of feature information from the feature map, and multiple layers of these modules are stacked with the output of each layer fused. The output feature maps of the feature extraction modules are concatenated so that feature maps are reused and the information they contain is fully extracted.
These two points compensate for the insufficient use of feature information that may occur in lightweight models and allow limited resources to deliver high feature extraction performance. Second, the original HRNet structure contains extensive cross-branch feature information interaction, including feature fusion and the generation of new branches in each stage. In this process, feature maps are resized by convolutional downsampling or by upsampling and then added to the feature maps of other branches; LDHNet replaces the convolutions involved with depthwise separable convolutions. This further reduces the number of model parameters with almost no loss of detection performance. In addition, LDHNet improves the original data preprocessing module and uses a double-branch form to fully extract information from the original image. Experiments show that rich information from the original image greatly helps to improve the detection performance of the model. LDHNet also uses coordinate attention to reinforce spatial location information.

Result: After the above improvements, the size of the model is reduced to less than one-tenth of the original HRNet. Although the model is still larger than LiteHRNet, the smallest current lightweight model, lightweight model design is not concerned with model size alone. This study mainly compares LDHNet with mainstream lightweight models. Experimental verification on two mainstream human pose estimation datasets, MPII and common objects in context (COCO), together with comparison against current mainstream methods, leads to the following conclusions. On the MPII test set, compared with the top-down lightweight human pose estimation model LiteHRNet, LDHNet improves average prediction accuracy by 1.5% while adding only a small number of parameters and a small amount of computation.
The results on the COCO validation set show that, compared with LiteHRNet, LDHNet improves average detection accuracy by 3.4%. Compared with DiteHRNet, an improved variant of LiteHRNet, the detection accuracy of LDHNet is improved by 2.3%. Compared with HRFormer, which incorporates Transformer, LDHNet achieves comparable detection accuracy at a smaller scale. The experimental results on public datasets show that LDHNet achieves excellent results in model lightweighting and a very good balance between lightweight design and detection performance. For lightweight human pose estimation, the performance of LDHNet is similar to that of Transformer-based models. In addition to experimental verification on public datasets, this study also tests the inference speed and detection accuracy of the model in actual scenes. Compared with the original HRNet, LDHNet shows a significant improvement in detection speed under GPU acceleration. Compared with other lightweight methods such as LiteHRNet, LDHNet makes full use of hardware resources. The actual tests show that LDHNet performs stably in actual scenes. These experimental results show that LDHNet meets its design expectations. Moreover, compared with other lightweight human pose estimation models under extremely limited resources, LDHNet makes full use of the available computing power regardless of the level of hardware, with greatly improved detection accuracy and significantly improved inference speed.

Conclusion: LDHNet still has problems to be solved, chiefly that the inference speed of the model has not reached a level matching its gains in lightweight design. Follow-up work can focus on how to improve the inference speed of the model, make the training and inference of the model glass, and draw on re-parameterization methods to further improve the model. Thus, LDHNet can better meet the needs of actual production and life.

Keywords

human pose estimation; lightweight network; densely connected network; high resolution network; multibranch structure
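
The abstract states that LDHNet replaces the convolutions used in cross-branch feature fusion with depthwise separable convolutions and uses dilated convolutions at three scales. As a rough illustration of why these choices suit a lightweight model, the following Python sketch counts convolution weights and computes the span of a dilated kernel. The channel widths (32 and 64), the 3 × 3 kernel, and the dilation rates (1, 2, 3) are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def dilated_kernel_span(k, dilation):
    """Side length of the input window covered by one k x k kernel with the given dilation."""
    return dilation * (k - 1) + 1


# Assumed example: a fusion convolution mapping 32 channels to 64 channels.
standard = conv_params(32, 64, 3)                   # 32 * 64 * 9 = 18432
separable = depthwise_separable_params(32, 64, 3)   # 32 * 9 + 32 * 64 = 2336
print(standard, separable, round(separable / standard, 4))

# Assumed dilation rates 1, 2, 3 for the three scales of a 3 x 3 kernel.
print([dilated_kernel_span(3, d) for d in (1, 2, 3)])
```

Under these assumed sizes the separable form needs about one-eighth of the weights of the standard convolution, while dilation widens the window a 3 × 3 kernel covers without adding any weights.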