One-stage detectors combining lightweight backbone and multi-scale fusion

Huang Jianchen1,2, Wang Han1,2, Lu Hao1,2 (1. School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China; 2. National Forestry and Grassland Administration Engineering Research Center for Forestry-oriented Intelligent Information Processing, Beijing 100083, China)

Abstract
Objective One-stage object detection networks based on convolutional neural networks offer high real-time performance and high detection accuracy, but they typically suffer from two problems: 1) a large amount of redundant convolution computation in the model, and 2) extra computational overhead introduced by the multi-scale feature fusion structure. As a result, one-stage detectors require substantial computing resources and are difficult to apply on devices with limited computing power. To address these problems, this paper proposes a lightweight one-stage object detection architecture based on YOLOv5 (you only look once version 5), called E-YOLO (efficient-YOLO).

Method Two models of different sizes, E-YOLOm (efficient-YOLO medium) and E-YOLOs (efficient-YOLO small), are built on the E-YOLO architecture. First, several more efficient feature extraction modules are designed to reduce redundant convolution computation, and the most costly feature maps in the model are made lightweight through down-sampling, feature extraction, channel expansion and reduction, and pyramid pooling. Second, to remove the redundant overhead of multi-scale feature fusion, an efficient multi-scale feature fusion structure is proposed: a weighted multi-scale fusion scheme reduces the cost of channel dimension reduction, and a long skip connection on mid-level features alleviates feature loss.

Result Experiments show that, compared with YOLOv5m and YOLOv5s, E-YOLOm and E-YOLOs reduce the number of parameters by 71.5% and 61.6% and the computation (FLOPs) by 67.3% and 49.7%, respectively. On the VOC (visual object classes) dataset, the average precision (AP) of E-YOLOm is only 2.3% lower than that of YOLOv5m, while E-YOLOs is 3.4% higher than YOLOv5s. Moreover, E-YOLOm has 15.5% fewer parameters and 1.7% fewer FLOPs than YOLOv5s, while its mAP@0.5 and AP are 3.9% and 11.1% higher, giving it a smaller computational cost and higher detection efficiency.

Conclusion The proposed E-YOLO architecture significantly reduces the redundant convolution computation and multi-scale fusion overhead of one-stage object detection networks, exhibits good robustness, outperforms the compared lightweight network schemes, and is of practical value in environments with low computing power.
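The abstract describes the weighted multi-scale fusion only at a high level, and the page does not give its exact formulation. Below is a minimal PyTorch sketch of one common form of learnable weighted fusion (the fast normalized fusion popularized by BiFPN), assuming the input maps have already been resized to a common resolution; the module name WeightedFusion and the normalization details are illustrative assumptions, not E-YOLO's actual implementation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    # Learnable weighted sum of same-shape feature maps (fast normalized
    # fusion): weights are kept non-negative with ReLU and normalized to
    # sum to 1, so the fused map stays on the same scale as its inputs.
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, xs):
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)  # normalize to a convex combination
        return sum(wi * xi for wi, xi in zip(w, xs))

# Example: fuse the backbone's P4 map with the top-down P4 map,
# both of shape (N, C, H, W):
#   fuse = WeightedFusion(2)
#   p4_out = fuse([p4_backbone, p4_topdown])
```

Because the weights are scalars per input branch, this kind of fusion adds almost no computation compared with concatenation followed by a 1 x 1 convolution, which is consistent with the stated goal of reducing channel-reduction overhead.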
Keywords
One-stage detectors combining lightweight backbone and multi-scale fusion

Huang Jianchen1,2, Wang Han1,2, Lu Hao1,2 (1. School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China; 2. National Forestry and Grassland Administration Engineering Research Center for Forestry-oriented Intelligent Information Processing, Beijing 100083, China)

Abstract
Objective Object detection based on computer vision has been widely used in public security, clinical medicine, autonomous driving, and other contexts. Current convolutional neural network based (CNN-based) object detectors are divided into one-stage and two-stage methods according to their detection pipeline. The two-stage method first uses a feature extraction network to extract multiple candidate regions and then applies additional convolution modules to perform bounding-box regression and object classification on those regions. The one-stage method uses a single convolutional model to extract features directly from the original image and regresses the number, position, and size of detection boxes in one pass, which gives it high real-time performance. One-stage object detectors such as the single shot multibox detector (SSD) and you only look once (YOLO) achieve both high real-time performance and high detection accuracy. However, these models require a large amount of computing resources and are difficult to deploy in embedded scenarios such as autonomous driving, automated production, urban monitoring, face recognition, and mobile terminals. Two problems remain to be resolved in one-stage object detection networks. 1) Redundant convolution calculations exist in the feature extraction and feature fusion parts of the network. Conventional object detection models usually optimize the width of the model by reducing the number of feature channels in the convolution layers of the feature extraction part, and reduce the depth of the model by stacking fewer convolution layers. However, this does not remove the redundant calculations inside each convolution layer and causes a severe loss of detection accuracy. 2) One-stage models often use feature pyramid network (FPN) or path aggregation network (PANet) modules for multi-scale feature fusion, which introduces additional computation cost.

Method First, we design and construct a variety of efficient lightweight modules. The GhostBottleneck layer is used to adjust the channel dimension and down-sample the feature maps at the same time, which reduces the computational cost and enhances the feature extraction capability of the backbone. The GhostC3 module is designed for feature extraction and multi-scale feature fusion at different stages; it is cost-effective while preserving feature extraction capability. An attention module, local channel and spatial (LCS), is proposed to enhance the local information of regions and channels, increasing the model's attention to the regions and channels of interest at a small cost. The efficient spatial pyramid pooling (ESPP) module is designed, in which GhostConv reduces the large cost of channel dimension reduction in the deep layers of the network and the redundant computation of multiple pooling operations is optimized. To address the extra cost caused by multi-scale feature fusion, a more efficient and lightweight structure, efficient PANet (EPANet), is designed: a weighted multi-scale feature fusion scheme reduces the overhead of channel dimension reduction, and a long skip connection on mid-level features alleviates the feature loss in PANet. Based on YOLOv5, we build a lightweight one-stage object detector framework called Efficient-YOLO and use it to construct two networks of different sizes, E-YOLOm and E-YOLOs.
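GhostConv and GhostBottleneck are named after the Ghost module of GhostNet, a variant of which also ships with the YOLOv5 code base: a standard convolution produces half of the output channels ("intrinsic" features), and a cheap depthwise convolution derives the remaining "ghost" channels from them. A minimal PyTorch sketch follows; the 5 x 5 cheap-operation kernel, the SiLU activation, and the 50/50 channel split are assumptions based on the common YOLOv5-style implementation, not details confirmed by this abstract.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    # Ghost convolution: a standard conv produces half of the output
    # channels ("intrinsic" features); a cheap 5x5 depthwise conv derives
    # the other half ("ghost" features) from them, roughly halving the
    # cost of an equivalent full convolution.
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        c_ = c_out // 2  # intrinsic channels (assumed 50/50 split)
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_),
            nn.SiLU(),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),  # depthwise
            nn.BatchNorm2d(c_),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# Example: GhostConv(64, 128, 3, 1) maps (1, 64, 80, 80)
# to (1, 128, 80, 80) at roughly half the cost of a full 3x3 conv.
```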
Our methods are implemented on Ubuntu 18.04 with the PyTorch deep learning framework, on top of the YOLOv5 project. The default parameter settings of YOLOv5 v5.0 are used during training. On the visual object classes (VOC) dataset, no pre-trained weights are loaded and the models are trained from scratch; on the GlobalWheat2020 dataset, the weights pre-trained on VOC are used to fine-tune the same network structures.

Result Compared with YOLOv5m and YOLOv5s, the numbers of parameters of E-YOLOm and E-YOLOs are reduced by 71.5% and 61.6%, and their FLOPs are reduced by 67.3% and 49.7%, respectively. For average precision (AP), E-YOLOm on the generic object detection dataset VOC is 2.3% lower than YOLOv5m, while E-YOLOs is 3.4% higher than YOLOv5s. E-YOLOm also has 15.5% fewer parameters and 1.7% fewer FLOPs than YOLOv5s, while its mAP@0.5 and AP are 3.9% and 11.1% higher, yielding a smaller computational cost and higher detection efficiency. On GlobalWheat2020, the AP of E-YOLOm and E-YOLOs decreases by only 1.4% and 0.4% compared with YOLOv5m and YOLOv5s, which indicates that Efficient-YOLO is also robust for detecting small objects. Similarly, the AP of E-YOLOm is 0.3% higher than that of YOLOv5s, showing that Efficient-YOLO remains more efficient at detecting small objects. At the same time, the lightweight backbone proposed for Efficient-YOLO outperforms backbones built on recent lightweight CNN architectures such as ShuffleNetv2 and MobileNetv3. In addition, the GhostBottleneck layer with a stride of 2 is used to adjust channel dimensions and down-sample features in the backbone, and GhostConv is used to reduce the channel dimension in ESPP. These choices effectively reduce the parameter and computation costs of the model and noticeably improve detection accuracy. The results indicate that GhostConv reduces the number of redundant convolution kernels and increases the information content of the output feature maps.

Conclusion Experiments show that our Efficient-YOLO framework effectively removes redundant convolution computation and multi-scale fusion overhead in one-stage object detection networks, and it exhibits good robustness. At the same time, our lightweight feature extraction blocks and attention module can further improve detector performance.
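The stride-2 GhostBottleneck mentioned above both down-samples the feature map and changes its channel count, so the residual shortcut must do the same. The sketch below, reusing the GhostConv module from the previous sketch, follows the YOLOv5-style GhostBottleneck layout as an assumption; the normalization and activation placement in E-YOLO may differ.

```python
import torch
import torch.nn as nn

class GhostBottleneck(nn.Module):
    # Stride-2 GhostBottleneck (YOLOv5-style layout, assumed): expand with
    # GhostConv, down-sample with a depthwise stride-2 conv, project back
    # with GhostConv; a depthwise + pointwise shortcut brings the residual
    # to the new resolution and channel count. Reuses GhostConv from above.
    def __init__(self, c1: int, c2: int, k: int = 3, s: int = 2):
        super().__init__()
        c_ = c2 // 2
        self.conv = nn.Sequential(
            GhostConv(c1, c_, 1, 1),  # expand
            nn.Conv2d(c_, c_, k, s, k // 2, groups=c_, bias=False)
            if s == 2 else nn.Identity(),  # depthwise down-sampling
            GhostConv(c_, c2, 1, 1),  # project
        )
        self.shortcut = (
            nn.Sequential(
                nn.Conv2d(c1, c1, k, s, k // 2, groups=c1, bias=False),
                nn.Conv2d(c1, c2, 1, 1, bias=False),
            )
            if s == 2
            else nn.Identity()
        )

    def forward(self, x):
        return self.conv(x) + self.shortcut(x)

# Example: GhostBottleneck(128, 256, 3, 2) maps (N, 128, H, W)
# to (N, 256, H/2, W/2) while keeping a residual path.
```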
Keywords
