Published: 2022-12-16
DOI: 10.11834/jig.211028
2022 | Volume 27 | Number 12

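Image Understanding and Computer Vision

One-stage detectors combining lightweight backbone and multi-scale fusion

Huang Jianchen, Wang Han, Lu Hao

1. School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China;

2. National Forestry and Grassland Administration Engineering Research Center for Forestry-oriented Intelligent Information Processing, Beijing 100083, China

Received: 2021-11-04; Revised: 2022-03-16; Preprint: 2022-03-23

Supported by: National Key R&D Program of China (2020YFE0200800); Fundamental Research Funds for the Central Universities (BLX201720); National Natural Science Foundation of China (61902201)

About the authors: Huang Jianchen, male, master's student. His main research interests are computer vision and image and video analysis. E-mail: huangjianchen96@163.com
Wang Han, female, associate professor and master's supervisor. Her main research interests are image analysis, video content understanding, and machine learning. E-mail: wanghan@bjfu.edu.cn
Lu Hao, corresponding author, male, lecturer. His main research interests are remote sensing information technology and intelligent remote sensing data processing. E-mail: luhao@bjfu.edu.cn
*Corresponding author: Lu Hao luhao@bjfu.edu.cn

CLC number: TP391.41; TP183

Document code: A

Article ID: 1006-8961(2022)12-3596-12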

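Abstract

Objective One-stage object detection networks based on convolutional neural networks offer high real-time performance and high detection accuracy, but they usually have two problems: 1) the model contains many redundant convolution calculations; 2) the multi-scale feature fusion structure incurs extra computational cost. As a result, one-stage detectors require substantial computing resources and are difficult to apply on devices with limited computing power. To address these problems, this paper proposes a lightweight one-stage object detection architecture based on the structure of YOLOv5 (you only look once version 5), called E-YOLO (efficient-YOLO).
Method Two models of different sizes, E-YOLOm (efficient-YOLO medium) and E-YOLOs (efficient-YOLO small), are built with the E-YOLO architecture. First, several more efficient feature extraction modules are designed to reduce redundant convolution calculations, and the costly feature-map operations in the model, namely down-sampling, feature extraction, channel expansion and reduction, and pyramid pooling, are given lightweight designs. Second, to reduce the redundant cost of multi-scale feature fusion, an efficient multi-scale feature fusion structure is proposed, which uses weighted multi-scale feature fusion to reduce the cost of channel reduction and a long skip connection for middle-level features to alleviate feature loss.
Result Experiments show that, compared with YOLOv5m and YOLOv5s, E-YOLOm and E-YOLOs reduce the number of parameters by 71.5% and 61.6% and the FLOPs by 67.3% and 49.7%, respectively. On the VOC (visual object classes) dataset, the average precision (AP) of E-YOLOm is only 2.3% lower than that of YOLOv5m, and the AP of E-YOLOs is 3.4% higher than that of YOLOv5s. Moreover, E-YOLOm has 15.5% fewer parameters and 1.7% fewer FLOPs than YOLOv5s, while its mAP@0.5 and AP are 3.9% and 11.1% higher, giving lower computational cost and higher detection efficiency.
Conclusion The proposed E-YOLO architecture significantly reduces the redundant convolution calculations and multi-scale fusion cost in one-stage object detection networks, is robust, and outperforms the compared lightweight schemes, which makes it of practical value in environments with low computing performance.

Keywords

convolutional neural network (CNN); object detection; model lightweighting; attention module; multi-scale fusion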

One-stage detectors combining lightweight backbone and multi-scale fusion

Huang Jianchen, Wang Han, Lu Hao

1. School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China;

2. National Forestry and Grassland Administration Engineering Research Center for Forestry-oriented Intelligent Information Processing, Beijing 100083, China

Supported by: National Key R&D Program of China (2020YFE0200800); Fundamental Research Funds for the Central Universities (BLX201720); National Natural Science Foundation of China (61902201)

Abstract

Objective Object detection is widely used in computer vision applications such as public security, medical care, and autonomous driving. Current convolutional neural network (CNN) based object detectors are divided into one-stage and two-stage methods. A two-stage method first uses a feature extraction network to extract multiple candidate regions, and then uses additional convolution modules to perform bounding-box regression and object classification on those regions. A one-stage method uses a single convolutional model to extract features directly from the original image and regress the number, position, and size of the detection boxes, which gives it good real-time performance. One-stage object detectors such as the single shot multibox detector (SSD) and you only look once (YOLO) offer both high real-time performance and high detection accuracy. However, these models require a large amount of computing resources and are difficult to deploy in embedded scenarios such as autonomous driving, automated production, urban monitoring, face recognition, and mobile terminals. Two problems remain in one-stage object detection networks: 1) There are redundant convolution calculations in the feature extraction and feature fusion parts of the network. Conventional object detection models are usually made lighter by reducing the width of the model (the number of feature channels in the convolution layers of the feature extraction part) and the depth of the model (the number of stacked convolution layers). However, this does not remove the redundant calculations inside the convolution layers and causes a considerable loss of detection accuracy. 2) One-stage models often use feature pyramid network (FPN) or path aggregation network (PANet) modules for multi-scale feature fusion, which adds further computational cost.
Method First, we design and build several efficient lightweight modules. The GhostBottleneck layer changes the channel dimension and down-samples the feature maps at the same time, which reduces the computational cost and strengthens the feature extraction capability of the backbone. The GhostC3 module is designed for feature extraction and multi-scale feature fusion at different stages; it lowers the cost of feature extraction while preserving feature extraction capability. An attention module, local channel and spatial (LCS), is proposed to enhance the local information of regions and channels, so that the model pays more attention to the regions and channels of interest at low cost. The efficient spatial pyramid pooling (ESPP) module is designed, in which GhostConv reduces the large cost of channel reduction in the deep layers of the network and the redundant computation of multiple pooling operations is removed. To address the extra cost of multi-scale feature fusion, a more efficient and lightweight efficient PANet (EPANet) structure is designed: weighted multi-scale feature fusion reduces the overhead of channel reduction, and a long skip connection for middle-level features alleviates the feature loss problem in PANet. On this basis, we build a lightweight one-stage object detector framework based on YOLOv5, called Efficient-YOLO, and use it to construct two networks of different sizes, E-YOLOm and E-YOLOs. Our methods are implemented on Ubuntu 18.04 with the PyTorch deep learning framework and the YOLOv5 project. The default parameter settings of YOLOv5 v5.0 are used during training. On the visual object classes (VOC) dataset, the models are trained from scratch without pre-trained weights. On the GlobalWheat2020 dataset, the weights pre-trained on VOC with the same network structure are fine-tuned.
Result Compared with YOLOv5m and YOLOv5s, E-YOLOm and E-YOLOs have 71.5% and 61.6% fewer parameters and 67.3% and 49.7% fewer FLOPs, respectively. On the generic object detection dataset VOC, the average precision (AP) of E-YOLOm is 2.3% lower than that of YOLOv5m, and the AP of E-YOLOs is 3.4% higher than that of YOLOv5s. E-YOLOm also has 15.5% fewer parameters and 1.7% fewer FLOPs than YOLOv5s, while its mAP@0.5 and AP are 3.9% and 11.1% higher, which means lower computational cost and higher detection efficiency. On GlobalWheat2020, the AP of E-YOLOm and E-YOLOs is only 1.4% and 0.4% lower than that of YOLOv5m and YOLOv5s, respectively, which indicates that Efficient-YOLO is also robust when detecting small objects. Similarly, the AP of E-YOLOm is 0.3% higher than that of YOLOv5s, which shows that Efficient-YOLO remains more efficient for small objects. The lightweight backbone of Efficient-YOLO also outperforms recent lightweight CNN architectures such as ShuffleNetV2 and MobileNetV3. In addition, the GhostBottleneck layer with a stride of 2 is used to raise the channel dimension and down-sample the feature maps in the backbone, and GhostConv is used to reduce the channel dimension in ESPP. This effectively reduces the parameters and computation of the model and improves detection accuracy. The results indicate that GhostConv reduces the number of redundant convolution kernels and improves the information content of the output feature maps. Conclusion Experiments show that the Efficient-YOLO framework reduces the cost of redundant convolution computation and multi-scale fusion in one-stage object detection networks and is robust. The lightweight feature extraction blocks and the attention module further improve detector performance.

Key words

convolutional neural network(CNN); object detection; lightweight models; attention module; multi-scale fusion

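0 Introduction

Object detection has long been an active and important research area in computer vision, with wide applications in security, medical care, and autonomous driving (Liu et al., 2020). Object detection methods based on convolutional neural networks (CNNs) (Qin et al., 2018; Zhang et al., 2021) can effectively extract image features and perform detection. Unlike two-stage methods (Girshick et al., 2014; Ren et al., 2015), which obtain candidate regions through region of interest (ROI) extraction, one-stage methods (Liu et al., 2016; Redmon and Farhadi, 2018; Ultralytics, 2020) use a single convolutional model to extract features from the image and regress the number, position, and size of the detection boxes, which gives good real-time performance (Zhao et al., 2020). However, models with high detection accuracy have too many parameters and operations, which makes them difficult to deploy and apply in environments with limited computing resources. How to make a model lightweight while maintaining detection accuracy has therefore become an important research direction.

Existing one-stage object detection networks usually build lightweight models by reducing the depth and width of the network (Liu et al., 2016; Ultralytics, 2020). However, this approach clearly weakens the feature extraction capability of the model, so detection accuracy drops significantly compared with the original model. Manually designing simplified network structures and lightweight modules is one of the most widely used lightweight schemes. VGGNet (Visual Geometry Group network) (Simonyan and Zisserman, 2015) replaces large convolution kernels with stacked 3×3 convolutions; ResNet (He et al., 2016) proposes the Bottleneck structure to reduce the number of channels of the large convolution; MobileNet (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019), GhostNet (Han et al., 2020), and ShuffleNet (Zhang et al., 2018; Ma et al., 2018) use depthwise separable convolutions to design more efficient feature extraction modules and build lightweight backbone networks. However, because an object detection network needs a larger receptive field and high-quality features at different levels to localize objects accurately, directly introducing a lightweight backbone into a one-stage detector reduces detection accuracy to some extent. The primary goal in designing an efficient detector is therefore to design a lightweight backbone that suits the detection task and maintains performance.

Among typical one-stage object detection networks, SSD (single shot multibox detector) (Liu et al., 2016) regresses directly on the feature maps extracted at different stages of the backbone. However, because the shallow features of the network carry weak semantic information, SSD does not perform well enough on small objects. YOLOv3 (you only look once version 3) (Redmon and Farhadi, 2018) uses FPN (feature pyramid network) (Lin et al., 2017) for multi-scale fusion, passing deep feature information back to the shallow layers to strengthen shallow features. YOLOv4 (Bochkovskiy et al., 2020) and YOLOv5 (Ultralytics, 2020) use the more complex PANet (path aggregation network) (Liu et al., 2018), which passes the fused shallow features back to the deep layers so that shallow texture information reaches the deep layers and improves the precise localization of large objects. Multi-scale fusion modules effectively promote information flow between feature maps at different levels (Chen et al., 2021), but they incur considerable extra cost. Because feature maps at different levels differ in channel number and size, a multi-scale fusion module needs many channel expansion and reduction operations and many up-sampling and down-sampling operations, and each feature fusion also needs a feature extraction module to integrate features from different sources. As a result, multi-scale fusion modules such as FPN and PANet significantly increase model complexity.

To address these problems of one-stage object detection networks, this paper first designs and builds several efficient lightweight CNN modules, optimizes the large number of redundant convolution calculations in the network, and builds a lightweight backbone. For the extra cost of multi-scale feature fusion, this paper designs a more efficient and lightweight EPANet (efficient PANet) structure, proposes weighted multi-scale feature fusion to reduce the cost of channel reduction, and adds a long skip connection for middle-level features to alleviate the feature loss problem in PANet. Based on the YOLOv5 network structure, the E-YOLO (efficient-YOLO) network architecture is designed, and two models of different sizes, E-YOLOm and E-YOLOs, are built on it for scenarios with strict requirements on model size. Experiments show that the proposed feature extraction modules significantly reduce the cost of the feature extraction part of the network and improve detection accuracy; the improved multi-scale feature fusion module EPANet achieves a fusion effect close to that of the original PANet at a much lower cost; and compared with the original YOLOv5 architecture, E-YOLO is more cost-effective, matching or even exceeding the detection accuracy of the original architecture on several datasets at a very low model cost, and outperforming other CNN lightweight methods.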

1 E-YOLO

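1.1 Overall structure

The overall structure of E-YOLO is shown in Fig. 1. The Focus layer is the YOLOv5 module that uses slicing to expand the number of channels while down-sampling. GhostBottleneck, GhostC3, and ESPP (efficient spatial pyramid pooling) are the lightweight modules proposed in this paper for down-sampling, feature extraction, and pyramid pooling, respectively. In the EPANet structure, the deep-to-shallow path uses weighted fusion to connect features, and a long-skip weighted fusion from C4 to P4 is added (dashed line in the figure).

Fig. 1 Structure of the Efficient-YOLO network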

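1.2 Lightweight modules

1.2.1 Ghost modules

GhostNet (Han et al., 2020) is a lightweight CNN proposed by a Huawei team. Its core Ghost module (GhostConv) generates part of the feature maps with an ordinary convolution and the rest with a cheap operation, which can be a linear transformation of the other feature maps or a depthwise convolution applied to the output of the ordinary convolution to generate similar feature maps. The structure of GhostConv is shown in Fig. 2(a), where C1 and C2 are the numbers of input and output channels. Half of the output feature maps come from one regular convolution, and the other half are generated from that result by a 5×5 depthwise convolution. Compared with an ordinary convolution layer, GhostConv achieves the same or even better feature extraction at lower complexity. CNNs generally contain many redundant convolution calculations and intermediate feature maps, and Ghost convolution forces the network to learn useful features from half of the convolution kernels. In addition, the "ghost" feature maps are generated by depthwise convolution, and the 5×5 depthwise kernel gives them a larger receptive field than the first convolution, which increases the overall information content of the feature maps.

Fig. 2 Structures of four Ghost modules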

((a) GhostConv; (b) GhostBottleneck(stride=1); (c) GhostBottleneck(stride=2); (d) GhostC3)
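The saving described for GhostConv can be checked with a small weight count. The sketch below is only illustrative: it assumes bias-free layers, ignores normalization layers, and uses hypothetical helper names (`conv_params`, `ghost_conv_params`) and channel sizes that do not come from the paper.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias and BatchNorm ignored)."""
    return k * k * c_in * c_out


def ghost_conv_params(c_in, c_out, k, dw_k=5):
    """Weights of a GhostConv as described in the text: a primary k x k
    convolution produces half of the output channels, and a dw_k x dw_k
    depthwise convolution generates the other half from that result."""
    half = c_out // 2
    primary = k * k * c_in * half
    cheap = dw_k * dw_k * half  # depthwise: one dw_k x dw_k filter per channel
    return primary + cheap


# Hypothetical channel sizes, chosen only for illustration.
for k in (1, 3):
    std = conv_params(64, 128, k)
    ghost = ghost_conv_params(64, 128, k)
    print(f"k={k}: standard={std}, ghost={ghost}, ratio={ghost / std:.2f}")
```

Under these assumptions, the ratio approaches one half as the primary kernel grows, because the depthwise part costs the same regardless of the input channel count.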

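The GhostBottleneck layer is a lightweight version of the Bottleneck layer in ResNet (He et al., 2016) built with Ghost convolution. Fig. 2(b) shows the structure with a stride of 1, which consists of two stacked 1×1 Ghost convolutions and a residual connection. The hyperparameter exp is the expansion rate of the GhostBottleneck layer: the first GhostConv increases or decreases the number of channels, and the second restores it. When the expansion rate is greater than 1, the first 1×1 GhostConv expands the number of channels, and the depthwise convolution then extracts features with a large receptive field in the wider channel space. This achieves the same effect as a 3×3 convolution at a lower channel number while avoiding the computational cost of the 3×3 convolution. Fig. 2(c) shows the structure with a stride of 2, in which a depthwise convolution is added between the two GhostConv layers to down-sample the feature maps. To keep the features of the residual branch dimensionally consistent with those of the main branch, the residual branch down-samples with a depthwise convolution and then changes the number of channels with an ordinary 1×1 convolution. The stride-2 GhostBottleneck layer extracts features through the stacked GhostConv layers while down-sampling the feature maps and changing the channel dimension, and its residual connection alleviates the vanishing gradient problem. Compared with directly using a 3×3 convolution layer or a pooling layer, it offers better feature extraction and gradient propagation.

This paper uses the above three Ghost modules to make the convolution layers of the network lightweight. GhostConv alone is used for channel reduction in EPANet; the GhostBottleneck with a stride of 1 is used to build the GhostC3 feature extraction module; and the GhostBottleneck with a stride of 2 is used for feature-map down-sampling and channel expansion in the backbone.

YOLOv5 uses 3×3 convolution layers with a stride of 2 in the backbone for channel expansion and feature-map down-sampling, but performing both operations with a single convolution layer is very inefficient. This paper uses the GhostBottleneck module with a stride of 2 for the same operations. This effectively reduces the extra computational cost of the 3×3 convolution layer and lets the network learn useful information from the original feature maps and discard useless information, which reduces the number of parameters without weakening feature extraction. The residual connection in the GhostBottleneck module also helps shallow features propagate to deeper layers and effectively alleviates the vanishing gradient problem of ordinary convolution. The PANet in YOLOv5 uses 1×1 pointwise convolution for channel reduction, which likewise ignores the useless computation caused by many redundant feature maps. This paper replaces it with a 1×1 GhostConv. Because GhostConv contains a depthwise convolution, it also promotes the flow of spatial information to some extent.

YOLOv5 uses the C3 module for feature extraction at each stage of the backbone and for multi-scale feature fusion in PANet. The C3 module is a feature extraction module that YOLOv5 designed with reference to CSPNet (cross stage partial network) (Wang et al., 2020a), and its stack of multiple Bottleneck layers is very costly. To reduce this main cost of the network, this paper designs the GhostC3 module, which extracts features more efficiently. As shown in Fig. 2(d), the Bottleneck layers are replaced with GhostBottleneck layers with an expansion rate of 0.5. After a 1×1 convolution reduces the number of channels in the main branch of GhostC3, the GhostBottleneck layer with an expansion rate below 1 forces the network to extract higher-quality feature maps from fewer channels, and the GhostConv then restores the number of channels.

To improve the feature extraction efficiency of GhostC3, this paper designs the LCS (local channel and spatial) module, placed after the GhostBottleneck and before the residual connection, which enhances the feature maps extracted by the GhostBottleneck in both the spatial and channel domains. At very low cost, it assigns higher weights to the more important regions of all feature maps and to the more informative channels, which helps feature extraction. The structure of LCS is shown in Fig. 3. It consists of a spatial attention (SA) module and a channel attention (CA) module. The SA module is part of CBAM (convolutional block attention module) (Woo et al., 2018); its cost is the same for feature maps with any number of channels, and it strengthens the foreground representation in all feature maps. It uses average pooling (channel average pool, CAP) and max pooling (channel maximum pool, CMP) along the channel dimension to generate aggregated features of all feature maps, applies a 7×7 convolution to obtain the regions of interest, and multiplies the result element-wise with the input features after Sigmoid activation. The ECA (efficient channel attention) module (Wang et al., 2020b) is an efficient local channel attention mechanism. Its global average pooling (global average pool, GAP) produces a 1D vector representing all channels, and a 1D convolution then learns the relationship between each channel and its neighboring channels to adaptively identify the more important channel feature maps. This paper uses the ECA module as the CA part of LCS. Because SA and CA only convolve in low-dimensional spaces, they add only negligible computational overhead.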
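As a rough check of the cost argument, the following sketch counts the weights of a stride-1 GhostBottleneck (two 1×1 GhostConv layers with a 5×5 depthwise part) against a plain 3×3 convolution at the same channel number. It assumes bias-free layers and ignores normalization; the function names and the channel size 128 are hypothetical.

```python
def ghost_conv_params(c_in, c_out, k=1, dw_k=5):
    """Weights of a GhostConv: primary k x k conv to half of the output
    channels plus a dw_k x dw_k depthwise conv for the other half."""
    half = c_out // 2
    return k * k * c_in * half + dw_k * dw_k * half


def ghost_bottleneck_params(c, exp):
    """Stride-1 GhostBottleneck: the first GhostConv maps c -> c * exp
    channels, the second restores c. The residual add has no weights."""
    hidden = int(c * exp)
    return ghost_conv_params(c, hidden) + ghost_conv_params(hidden, c)


def conv3x3_params(c):
    """Weights of a plain 3 x 3 convolution with c input and c output channels."""
    return 9 * c * c


c = 128  # hypothetical channel number
for exp in (0.5, 1, 2):
    print(exp, ghost_bottleneck_params(c, exp), conv3x3_params(c))
```

In this count, even an expansion rate of 2 stays well below the 3×3 convolution, which is consistent with the motivation given in the text.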

Fig. 3 Structure of the LCS attention module
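The channel attention part can be illustrated with a minimal pure-Python sketch of the ECA idea: global average pooling, a 1D convolution across neighboring channels, Sigmoid, and channel-wise scaling. This is not the authors' implementation. The function name `eca`, the zero padding at the channel borders, and the example kernel are assumptions made for illustration. The weight count is independent of the number of channels (here, the kernel length), which is why this kind of attention is cheap; the same holds for a bias-free 7×7 spatial-attention convolution on the two pooled maps (7 × 7 × 2 = 98 weights).

```python
import math


def eca(feature, kernel):
    """Efficient-channel-attention sketch.

    feature: list of C channels, each an H x W nested list of floats.
    kernel:  odd-length list of 1D convolution weights (hypothetical values).
    Returns the re-weighted feature and the per-channel weights.
    """
    c = len(feature)
    # Global average pooling: one scalar per channel.
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    pad = len(kernel) // 2
    weights = []
    for i in range(c):
        acc = 0.0
        # 1D convolution over neighboring channels, zero padded at the ends.
        for j, kw in enumerate(kernel):
            idx = i + j - pad
            if 0 <= idx < c:
                acc += kw * gap[idx]
        weights.append(1.0 / (1.0 + math.exp(-acc)))  # Sigmoid
    scaled = [[[v * w for v in row] for row in ch]
              for ch, w in zip(feature, weights)]
    return scaled, weights


# Three constant 2 x 2 channels with values 1, 2 and 3.
feature = [[[float(v)] * 2 for _ in range(2)] for v in (1, 2, 3)]
scaled, weights = eca(feature, [0.0, 1.0, 0.0])
print(weights)
```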

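1.2.2 Efficient spatial pyramid pooling (ESPP) module

ESPP is a lightweight version of SPP (spatial pyramid pooling) (He et al., 2015); its structure is shown in Fig. 4. ESPP uses three 5×5 pooling layers in series as an equivalent replacement for the three parallel pooling layers in SPP, which removes redundant pooling operations. Fusing feature maps pooled with different receptive fields helps when object sizes vary widely and improves the final detection results to some extent. In SPP, two 1×1 convolution layers are used for dimension reduction and for fusing information across channels. Because SPP is usually placed at the last stage of the backbone, where the channel dimension is high, these layers incur a large convolution cost. Therefore, two 1×1 GhostConv layers replace the 1×1 convolutions. The two GhostConv layers in ESPP can also be regarded as a GhostBottleneck module without a residual connection, which likewise promotes the flow of spatial information.

Fig. 4 Structures of SPP and ESPP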

((a) SPP; (b) ESPP)
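The equivalence between serial and parallel pooling can be verified numerically. The sketch below assumes stride-1 max pooling with "same" padding (windows clipped at the borders) and assumes the parallel kernel sizes are 5, 9, and 13, as in the YOLOv5 SPP layer; the text itself only states that there are three parallel pooling layers. The helper name `max_pool_same` is hypothetical.

```python
import random


def max_pool_same(grid, k):
    """Stride-1 k x k max pooling with 'same' padding.
    Windows are clipped at the borders, which is equivalent to padding
    with negative infinity."""
    h, w, r = len(grid), len(grid[0]), k // 2
    return [[max(grid[y][x]
                 for y in range(max(0, i - r), min(h, i + r + 1))
                 for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]


random.seed(0)
grid = [[random.random() for _ in range(12)] for _ in range(10)]

p5 = max_pool_same(grid, 5)          # first 5 x 5 pool
p9_serial = max_pool_same(p5, 5)     # second 5 x 5 pool on top of the first
p13_serial = max_pool_same(p9_serial, 5)  # third 5 x 5 pool

# Serial 5 x 5 pools reproduce the larger parallel pools exactly.
assert p9_serial == max_pool_same(grid, 9)
assert p13_serial == max_pool_same(grid, 13)
```

The serial form reuses each intermediate result, so the larger receptive fields come from repeated small windows rather than separate 9×9 and 13×13 scans.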

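1.3 Efficient multi-scale fusion structure EPANet

This paper proposes the efficient multi-scale fusion structure EPANet to reduce the redundant cost of PANet; its structure is shown in Fig. 5. EPANet first introduces an efficient weighted fusion method that avoids the doubling of the channel number caused by the Concatenate operation in the upward (deep-to-shallow) fusion path of PANet. It is computed as

$ \boldsymbol{F}=\boldsymbol{F}_1 \cdot \mathrm{Sigmoid}(w)+\boldsymbol{F}_2 \cdot(1-\mathrm{Sigmoid}(w)) $

(1)

Fig. 5 Structure of EPANet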

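This method introduces a learnable parameter $w$ so that the model adaptively learns the importance weights of features from different stages ($\boldsymbol{F}_1$ and $\boldsymbol{F}_2$). The Sigmoid activation function keeps each weight between 0 and 1 and makes the two weights sum to 1, and the fused feature $\boldsymbol{F}$ is computed from them. With this weighting, the features of the current stage obtain adaptive attention enhancement from the features of another stage, instead of being stacked along the channel dimension by Concatenate and then reduced by a convolution layer, which incurs unnecessary cost.

The downward fusion of EPANet down-samples the feature maps with a 3×3 max pooling layer with a stride of 2. Because down-sampling is accompanied by information loss, using a convolution layer alone for it is very inefficient. Compared with a convolution layer, a pooling layer is more cost-effective for down-sampling.

In the original structure of EPANet, unlike the P4 features, the original features that the P3 and P5 stages receive from the backbone undergo only one Concatenate or weighted fusion before being passed directly to the GhostC3 module of the final integration stage. In a multi-scale fusion structure, the shortest path from the backbone to the neck for features of the same level is called the "trunk", and the other paths are called "branches". The branches apply multi-scale attention enhancement to the trunk features, while the trunk features retain the original feature information from the backbone. In PANet, the C4-stage features pass through two fusion operations on the trunk, so the original features are fused repeatedly with deep and shallow features and their influence weakens. To alleviate this "feature loss", EPANet adds a long skip connection from the C4 features in the backbone to the P4 features in the neck and uses weighted fusion to adaptively strengthen the original information in the P4 features. Compared with the original PANet, both the first and the second stage of the improved EPANet can directly obtain the original features extracted by the backbone, without adding extra computational cost.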
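Eq. (1) can be sketched in a few lines. The example treats each feature as a flat list of floats and `w` as a plain scalar; in a real network `w` would be a trainable parameter and the features would be tensors of equal shape. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weighted_fusion(f1, f2, w):
    """Eq. (1): F = F1 * Sigmoid(w) + F2 * (1 - Sigmoid(w)).
    f1 and f2 must have the same length; the output keeps that length,
    so no channel reduction is needed afterwards."""
    a = sigmoid(w)
    return [x * a + y * (1.0 - a) for x, y in zip(f1, f2)]


f1, f2 = [2.0, 4.0], [0.0, 0.0]
print(weighted_fusion(f1, f2, 0.0))  # w = 0 gives equal weights of 0.5
print(len(f1 + f2), len(weighted_fusion(f1, f2, 0.0)))  # concatenation vs fusion size
```

With `w = 0` the two inputs are averaged; a large positive `w` favors `f1`, and a large negative `w` favors `f2`. Unlike concatenation, the fused output has the same size as each input.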

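2 Experiments

2.1 Experimental settings

The proposed method is implemented on Ubuntu 18.04 with the PyTorch deep learning framework and the YOLOv5 framework (Ultralytics, 2020). Training uses the default parameter settings of YOLOv5 v5.0. On the VOC (visual object classes) dataset, models are trained from scratch without loading pre-trained weights; on the GlobalWheat2020 dataset, the weights of the same network structure pre-trained on VOC are fine-tuned.

Determining the numbers of feature channels of Efficient-YOLO in the backbone and EPANet by manual design or neural architecture search (NAS) (Tang et al., 2021) is time-consuming. Therefore, this paper largely follows the per-stage channel numbers of YOLOv5m and YOLOv5s, the two smallest models that the YOLOv5 framework builds by adjusting network depth and width, and uses the Efficient-YOLO architecture to build two models of different sizes, E-YOLOm and E-YOLOs. The number of stride-1 GhostBottleneck modules in GhostC3 is set equal to the number of Bottleneck modules in the corresponding C3 module of YOLOv5, and the feature-map sizes at each stage are also the same. Built this way, both E-YOLOm and E-YOLOs have fewer FLOPs and parameters than YOLOv5s, the smallest model provided by the YOLOv5 framework. The E-YOLO configuration provides a basic design for reference. Specifically, E-YOLOs has fewer parameters and FLOPs than E-YOLOm and suits environments that require extremely low complexity.

2.2 Datasets

1) VOC dataset. The PASCAL Visual Object Classes dataset (Everingham et al., 2010) is a widely used image object detection dataset containing 16 551 training images and 4 952 test images, with objects in 4 major categories and 20 classes. This paper uses the VOC dataset to verify detection performance in general, everyday scenes.

2) GlobalWheat2020 dataset. GlobalWheat2020 (David et al., 2020) contains more than 3 000 images. The dataset has a single class, wheat head, with more than 20 wheat heads per image on average. This paper uses it to verify the small-object detection ability of the models and selects 1/10 of the data as the test set.

2.3 Evaluation metrics and methods

Two detection accuracy metrics, mAP@0.5 and AP (average precision), are used to evaluate the effectiveness of the proposed method. mAP@0.5 is the mean average precision at an intersection over union (IoU) threshold of 0.5; it reflects the miss rate, the precision, and the classification accuracy of an object detection method. AP, also written mAP@0.5:0.95, is the mean of the average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05; it mainly evaluates how closely the regressed bounding boxes match the ground-truth labels. In addition, the total number of floating point operations (FLOPs) and the total number of parameters are used to evaluate how lightweight Efficient-YOLO is.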
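The two accuracy metrics differ only in the IoU thresholds at which a prediction counts as correct. The sketch below shows the IoU of two axis-aligned boxes and the ten thresholds behind mAP@0.5:0.95. It is not a full AP evaluator: the helper `average_over_thresholds` takes a hypothetical callable `ap_at` that is assumed to return the mean average precision at one threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@0.5:0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_over_thresholds(ap_at):
    """Mean of ap_at(t) over the ten thresholds.
    ap_at is a hypothetical function returning mAP at IoU threshold t."""
    return sum(ap_at(t) for t in THRESHOLDS) / len(THRESHOLDS)


pred, truth = (0, 0, 2, 2), (1, 1, 3, 3)
print(iou(pred, truth))                                # overlap 1, union 7
print([t for t in THRESHOLDS if iou(pred, truth) >= t])  # thresholds this box passes
```

A box that barely passes the 0.5 threshold contributes to mAP@0.5 but to few of the stricter thresholds, which is why AP is more sensitive to localization quality.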

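2.4 Quantitative results

As shown in Table 1, the expansion rate (Exp) of each GhostBottleneck layer that E-YOLO uses for down-sampling and channel expansion is set according to the feature-map size of the corresponding stage, which prevents very high channel dimensions from inflating the FLOPs and parameters of the convolution layers. Compared with the original architecture, E-YOLOm needs less than one third of the FLOPs and parameters of YOLOv5m, and E-YOLOs needs about half of the FLOPs and less than 40% of the parameters of YOLOv5s. The cost of E-YOLOm is even lower than that of YOLOv5s, which indicates a very high model compression ratio. On the VOC dataset, the mAP@0.5 and AP of E-YOLOs are both slightly higher than those of YOLOv5s. The AP and mAP@0.5 of E-YOLOm are clearly higher than those of YOLOv5s and very close to those of YOLOv5m. These results show that, at the same parameter and FLOP budget, the E-YOLO networks achieve significantly higher detection accuracy than the original architecture on the general object detection dataset VOC and are more cost-effective. On the GlobalWheat2020 wheat detection dataset, which contains many small objects, the two proposed networks do not surpass the networks of the same scale in the original architecture, but their detection ability is very close. E-YOLOm still achieves higher detection accuracy than YOLOv5s on this dataset, which shows that at a similar cost it detects more accurately in tasks dominated by small objects.

Table 1 Comparison of mAP and model size of different models on VOC and GlobalWheat2020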


Model	Exp Rate	FLOPs/G	Params/M	VOC-mAP@0.5/%	VOC-AP/%	GW2020-mAP@0.5/%	GW2020-AP/%
YOLOv5m	-	51.4	21.4	82.9	60.4	95.2	54.1
YOLOv5s	-	17.1	7.3	79.0	53.1	94.6	53.1
E-YOLOm	1/1/1/0.5	16.8	6.1	82.1	59.0	94.8	53.3
E-YOLOs	1.5/1.5/1.5/1	8.6	2.8	79.6	54.9	94.3	52.7
Note: GW2020 denotes the GlobalWheat2020 dataset; Exp Rate is the expansion rate of the stride-2 GhostBottleneck layers used in E-YOLO; "-" means the setting does not apply.
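The percentages quoted in the abstract are relative changes, which can be recomputed from the rounded values in Table 1. The sketch below does this for each lightweight model against its YOLOv5 counterpart; the dictionary simply copies the FLOPs, parameter, and VOC columns of the table, and the function name is hypothetical.

```python
def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# (FLOPs/G, Params/M, VOC mAP@0.5/%, VOC AP/%) copied from Table 1.
table1 = {
    "YOLOv5m": (51.4, 21.4, 82.9, 60.4),
    "YOLOv5s": (17.1, 7.3, 79.0, 53.1),
    "E-YOLOm": (16.8, 6.1, 82.1, 59.0),
    "E-YOLOs": (8.6, 2.8, 79.6, 54.9),
}

for light, base in (("E-YOLOm", "YOLOv5m"), ("E-YOLOs", "YOLOv5s")):
    flops, params, _, ap = (relative_change(n, o)
                            for n, o in zip(table1[light], table1[base]))
    print(f"{light} vs {base}: FLOPs {flops:+.1f}%, "
          f"params {params:+.1f}%, AP {ap:+.1f}%")
```

The results reproduce the reductions of 67.3% and 49.7% in FLOPs and 71.5% and 61.6% in parameters, as well as the relative AP changes of -2.3% and +3.4% on VOC.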

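Fig. 6 shows the detection results of different models on the VOC and GlobalWheat2020 datasets. The detection results of the E-YOLO networks are very close to those of the models before lightweighting.

Fig. 6 Detection results of different models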

((a) YOLOv5m; (b) E-YOLOm; (c) YOLOv5s; (d) E-YOLOs)

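2.5 Ablation study

To verify the effect of the proposed methods on model size and detection accuracy, a series of ablation experiments is designed. As shown in Table 2, the original YOLOv5s and YOLOv5m networks are used as baselines, and the effect of each proposed method on the final result is verified on the VOC dataset.

Table 2 Ablation study on VOC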


Model	GhostBottleneck down-sampling	GhostC3 (no LCS)	GhostC3	ESPP	Simple fusion	Weighted fusion	GhostConv channel reduction	Pooling down-sampling	Long skip connection	FLOPs/G	Params/M	mAP@0.5/%	AP/%
YOLOv5s	×	×	×	×	×	×	×	×	×	17.1	7.3	79.0	53.1
E-YOLOs	√	×	×	×	×	×	×	×	×	16.2	6.2	81.2	56.5
	√	√	×	×	×	×	×	×	×	10.3	4.1	78.7	55.3
	√	×	√	×	×	×	×	×	×	10.3	4.1	79.5	55.5
	√	×	√	√	×	×	×	×	×	10.1	3.8	80.3	55.9
	√	×	√	√	√	×	×	×	×	9.7	3.7	79.9	55.7
	√	×	√	√	×	√	×	×	×	9.7	3.7	80.4	56.0
	√	×	√	√	×	√	√	√	×	8.6	2.8	79.5	54.7
	√	×	√	√	×	√	√	√	√	8.6	2.8	79.6	54.9
YOLOv5m	×	×	×	×	×	×	×	×	×	51.4	21.4	82.9	60.4
E-YOLOm	√	×	×	×	×	×	×	×	×	47.2	18.7	83.9	61.3
	√	√	×	×	×	×	×	×	×	20.7	8.9	82.6	60.6
	√	×	√	×	×	×	×	×	×	20.7	8.9	83.1	60.8
	√	×	√	√	×	×	×	×	×	20.1	8.2	83.3	61.0
	√	×	√	√	√	×	×	×	×	19.1	8.0	82.2	59.1
	√	×	√	√	×	√	×	×	×	19.1	8.0	82.4	59.6
	√	×	√	√	×	√	√	√	×	16.8	6.1	81.8	58.8
	√	×	√	√	×	√	√	√	√	16.8	6.1	82.1	59.0
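Note: simple fusion means fusing features directly by element-wise addition; √ means used and × means not used.

The experiments show that using GhostBottleneck layers for channel expansion and feature-map down-sampling significantly improves detection accuracy while reducing model size, with a more pronounced gain in object localization (for YOLOv5s, AP rises from 53.1% to 56.5% while FLOPs fall from 17.1 G to 16.2 G). ESPP also reduces model size somewhat and slightly improves accuracy. The LCS module, weighted fusion, and the long skip connection add no extra cost and improve detection accuracy. Using GhostC3 for feature extraction significantly reduces the parameters and FLOPs of the model with only a slight loss of detection accuracy. The remaining methods reduce parameters and FLOPs by a small amount at the cost of a slight loss of accuracy.

The fact that GhostBottleneck down-sampling and ESPP improve detection performance indicates that replacing regular convolutions with Ghost modules effectively improves model performance, which suggests that Ghost modules reduce the redundancy of convolution layers to some extent and improve feature quality.

2.6 Comparison of different backbone networks

This paper also compares the proposed lightweight model with other lightweight CNNs. Current lightweight networks mainly improve the feature extraction part for image classification, so a fixed multi-scale fusion neck is used and only the performance of different backbones is compared. Existing lightweight backbones generally have a number of channels comparable to YOLOv5s at each stage, so all models use the PANet of the original YOLOv5s for multi-scale fusion. The backbone of each model is divided by feature-map size into three stages, C3, C4, and C5, which are connected to PANet to compare the effect of different backbones on accuracy.

As shown in Table 3, the mainstream lightweight structures all effectively reduce the parameters and FLOPs of the model, but with some loss of accuracy. In contrast, the backbone lightweight method of E-YOLO significantly reduces FLOPs and parameters while improving accuracy, which shows that it uses model complexity more efficiently than the model before lightweighting. The 6.0% relative improvement in AP on the VOC dataset (from 53.1% to 56.3%) indicates that the E-YOLO backbone clearly helps the network localize objects precisely, that it distinguishes foreground features from background features more effectively, and that its feature extraction capability is clearly higher than that of the other mainstream lightweight networks.
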
Table 3 Comparative experiment of different backbone networks on VOC and GlobalWheat2020


Backbone	FLOPs/G	Params/M	VOC-mAP@0.5/%	VOC-AP/%	GW2020-mAP@0.5/%	GW2020-AP/%
YOLOv5s-Backbone (baseline)	17.1	7.3	79.0	53.1	94.6	53.1
ShuffleNetV2 (Ma et al., 2018)	8.4	4.0	74.2↓	49.2↓	92.8↓	50.8↓
MobileNetV3-large (Howard et al., 2019)	10.8	5.4	77.2↓	51.3↓	94.1↓	52.3↓
GhostNet (Han et al., 2020)	11.8	7.0	76.3↓	45.8↓	93.4↓	51.2↓
E-YOLOs-Backbone	12.1	4.7	80.5↑	56.3↑	94.7↑	53.4↑
Note: GW2020 denotes the GlobalWheat2020 dataset; ↑ indicates an improvement over the baseline and ↓ a decrease relative to the baseline.

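3 Conclusion

To better balance model complexity and accuracy in one-stage object detectors and improve model efficiency, this paper builds Efficient-YOLO, a lightweight architecture based on the YOLOv5 detection framework that greatly reduces parameters and FLOPs. Carefully designed feature extraction modules yield a lightweight backbone that improves accuracy while reducing complexity, and the multi-scale fusion structure, which contained redundant operations, is optimized into the EPANet structure. The Efficient-YOLO framework provides two models of different sizes, E-YOLOm and E-YOLOs, for scenarios with different complexity requirements.

The study shows that efficient convolution modules built on the Ghost module effectively improve detector performance and reduce complexity; the LCS attention module improves the feature extraction efficiency of the backbone at very low computational cost; in the multi-scale fusion structure, merging features from different stages by weighted fusion does not sacrifice performance and avoids channel reduction; and the long skip connection alleviates feature loss in multi-scale fusion.

The experimental results show that Efficient-YOLO achieves a good balance between model complexity and accuracy. With fewer parameters and FLOPs, it can surpass the accuracy of a pre-lightweighting model on the VOC and GlobalWheat2020 datasets (E-YOLOm versus YOLOv5s in Table 1), and the significant improvement in AP indicates better precise localization of objects. The comparison with other lightweight backbones shows that the backbone design of Efficient-YOLO reduces complexity without sacrificing accuracy.

Future work will further study whether the feature extraction modules and attention module designed here on the basis of the Ghost module, and the backbone they form, can be applied to other object detection architectures such as two-stage architectures, and whether the proposed weighted fusion scheme can replace feature fusion in other CNN tasks such as semantic segmentation.

References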

Bochkovskiy A, Wang C Y and Liao H Y M. 2020. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. [2021-09-08]. https://arxiv.org/pdf/2004.10934.pdf

Chen K Q, Zhu Z L, Deng X M, Ma C X, Wang H A. 2021. Deep learning for multi-scale object detection: a survey. Journal of Software, 32(4): 1201-1227 (陈科圻, 朱志亮, 邓小明, 马翠霞, 王宏安. 2021. 多尺度目标检测的深度学习研究综述. 软件学报, 32(4): 1201-1227) [DOI:10.13328/j.cnki.jos.006166]

David E, Madec S, Sadeghi-Tehran P, Aasen H, Zheng B Y, Liu S Y, Kirchgessner N, Ishikawa G, Nagasawa K, Badhon M A, Pozniak C, de Solan B, Hund A, Chapman S C, Baret F, Stavness I, Guo W. 2020. Global wheat head detection (GWHD) dataset: a large and diverse dataset of high-resolution RGB-labelled images to develop and benchmark wheat head detection methods. Plant Phenomics, 2020: #3521852 [DOI:10.34133/2020/3521852]

Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A. 2010. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303-338 [DOI:10.1007/s11263-009-0275-4]

Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587 [DOI: 10.1109/CVPR.2014.81]

Han K, Wang Y H, Tian Q, Guo J Y, Xu C J and Xu C. 2020. GhostNet: more features from cheap operations//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1577-1586 [DOI: 10.1109/CVPR42600.2020.00165]

He K M, Zhang X Y, Ren S Q, Sun J. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1904-1916 [DOI:10.1109/TPAMI.2015.2389824]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]

Howard A G, Sandler M, Chen B, Wang W J, Chen L C, Tan M X, Chu G, Vasudevan V, Zhu Y K, Pang R M, Adam H and Le Q. 2019. Searching for MobileNetV3//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea(South): IEEE: 1314-1324 [DOI: 10.1109/ICCV.2019.00140]

Howard A G, Zhu M L, Chen B, Kalenichenko D, Wang W J, Weyand T, Andreetto M and Adam H. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications[EB/OL]. [2021-09-08]. https://arxiv.org/pdf/1704.04861.pdf

Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]

Liu L, Ouyang W L, Wang X G, Fieguth P, Chen J, Liu X W, Pietikäinen M. 2020. Deep learning for generic object detection: a survey. International Journal of Computer Vision, 128(2): 261-318 [DOI:10.1007/s11263-019-01247-4]

Liu S, Qi L, Qin H F, Shi J P and Jia J Y. 2018. Path aggregation network for instance segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8759-8768 [DOI: 10.1109/CVPR.2018.00913]

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]

Ma N N, Zhang X Y, Zheng H T and Sun J. 2018. ShuffleNet V2: practical guidelines for efficient CNN architecture design//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 122-138 [DOI: 10.1007/978-3-030-01264-9_8]

Qin Z W, Yu F X, Liu C C, Chen X. 2018. How convolutional neural networks see the world-A survey of convolutional neural network visualization methods. Mathematical Foundations of Computing, 1(2): 149-180 [DOI:10.3934/mfc.2018008]

Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL]. [2021-09-08]. https://arxiv.org/pdf/1804.02767.pdf

Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 91-99

Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNetV2: inverted residuals and linear bottlenecks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/CVPR.2018.00474]

Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2021-09-08]. https://arxiv.org/pdf/1409.1556.pdf

Tang L, Li H X, Yan C Q, Zheng X W, Ji R R. 2021. Survey on neural architecture search. Journal of Image and Graphics, 26(2): 245-264 (唐浪, 李慧霞, 颜晨倩, 郑侠武, 纪荣嵘. 2021. 深度神经网络结构搜索综述. 中国图象图形学报, 26(2): 245-264) [DOI:10.11834/jig.200202]

Ultralytics. 2020. Yolov5[EB/OL]. [2021-09-08]. https://github.com/ultralytics/yolov5

Wang C Y, Liao H Y M, Wu Y H, Chen P Y, Hsieh J W and Yeh I H. 2020a. CSPNet: a new backbone that can enhance learning capability of CNN//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, USA: IEEE: 1571-1580 [DOI: 10.1109/CVPRW50498.2020.00203]

Wang Q L, Wu B G, Zhu P F, Li P H, Zuo W M and Hu Q H. 2020b. ECA-Net: efficient channel attention for deep convolutional neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11531-11539 [DOI: 10.1109/CVPR42600.2020.01155]

Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]

Zhang K, Feng X H, Guo Y R, Su Y K, Zhao K, Zhao Z B, Ma Z Y, Ding Q L. 2021. Overview of deep convolutional neural networks for image classification. Journal of Image and Graphics, 26(10): 2305-2325 (张珂, 冯晓晗, 郭玉荣, 苏昱坤, 赵凯, 赵振兵, 马占宇, 丁巧林. 2021. 图像分类的深度卷积神经网络模型综述. 中国图象图形学报, 26(10): 2305-2325) [DOI:10.11834/jig.200302]

Zhang X Y, Zhou X Y, Lin M X and Sun J. 2018. ShuffleNet: an extremely efficient convolutional neural network for mobile devices//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6848-6856 [DOI: 10.1109/CVPR.2018.00716]

Zhao Y Q, Rao Y, Dong S P, Zhang J Y. 2020. Survey on deep learning object detection. Journal of Image and Graphics, 25(4): 629-654 (赵永强, 饶元, 董世鹏, 张君毅. 2020. 深度学习目标检测方法综述. 中国图象图形学报, 25(4): 629-654) [DOI:10.11834/jig.190307]