One-stage detectors combining lightweight backbone and multi-scale fusion
Vol. 27, Issue 12, Pages 3596-3607 (2022)
Published: 16 December 2022
Accepted: 23 March 2022
DOI: 10.11834/jig.211028
Jianchen Huang, Han Wang, Hao Lu. One-stage detectors combining lightweight backbone and multi-scale fusion[J]. Journal of Image and Graphics, 27(12): 3596-3607 (2022)
Objective
One-stage object detection networks based on convolutional neural networks offer high real-time performance and high detection accuracy, but they usually suffer from two problems: 1) a large amount of redundant convolution computation in the model; 2) extra computational overhead caused by the multi-scale feature fusion structure. As a result, one-stage detectors require substantial computing resources and are difficult to apply on devices with limited computational power. To address these problems, this paper proposes a lightweight one-stage object detection architecture based on YOLOv5 (you only look once version 5), called E-YOLO (efficient-YOLO).
Method
Two models of different sizes, E-YOLOm (efficient-YOLO medium) and E-YOLOs (efficient-YOLO small), are built with the E-YOLO architecture. First, several more efficient feature extraction modules are designed to reduce redundant convolution computation, and the costly feature maps in the model are made lightweight through down-sampling, feature extraction, channel dimension adjustment, and pyramid pooling. Second, to remove the redundant overhead of multi-scale feature fusion, an efficient multi-scale feature fusion structure is proposed, in which a multi-scale weighted feature fusion scheme reduces the cost of channel dimension reduction and a long skip connection of middle-level features alleviates feature loss.
Result
Experiments show that, compared with YOLOv5m and YOLOv5s, the parameters of E-YOLOm and E-YOLOs decrease by 71.5% and 61.6% and their FLOPs decrease by 67.3% and 49.7%, respectively. On the VOC (visual object classes) dataset, the average precision (AP) of E-YOLOm is only 2.3% lower than that of YOLOv5m, while the AP of E-YOLOs is 3.4% higher than that of YOLOv5s. Moreover, the parameters and FLOPs of E-YOLOm are 15.5% and 1.7% lower than those of YOLOv5s, while its mAP@0.5 and AP are 3.9% and 11.1% higher, giving it a smaller computational cost and higher detection efficiency.
Conclusion
The proposed E-YOLO architecture significantly reduces the redundant convolution computation and the multi-scale fusion overhead of one-stage object detection networks, shows good robustness, and outperforms the compared lightweight network schemes, which makes it of practical value in environments with low computing power.
Objective
Object detection based on computer vision has been widely used in public security, clinical scenarios, autonomous driving, and other contexts. Current convolutional neural network based (CNN-based) object detectors are divided into one-stage and two-stage according to their detection pipeline. The two-stage method first uses a feature extraction network to extract multiple candidate regions, and then additional convolution modules perform bounding-box regression and object classification on the candidate regions. The one-stage method uses a single convolution model to regress detection results directly from the original image and outputs information such as the number, position, and size of detection boxes, which yields real-time performance. One-stage object detectors like the single shot multibox detector (SSD) and you only look once (YOLO) offer both high real-time performance and high detection accuracy. However, these models require a huge amount of computing resources and are difficult to deploy in embedded scenarios such as automatic driving, automatic production, urban monitoring, human face recognition, and mobile terminals. Two problems remain to be resolved in one-stage object detection networks. 1) Redundant convolution computation in the feature extraction and feature fusion parts of the network. Conventional object detection models are usually slimmed by reducing the number of feature channels in the convolution layers of the feature extraction part (the width of the model) and by stacking fewer convolution layers (the depth of the model). However, this cannot remove the redundant computation inside each convolution layer and causes a severe loss of detection accuracy. 2) One-stage models often adopt feature pyramid network (FPN) or path aggregation network (PANet) modules for multi-scale feature fusion, which leads to extra computational cost.
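To make the width/depth trade-off concrete, here is a back-of-the-envelope sketch (our illustration, not part of the paper) of the standard FLOPs estimate for a convolution layer, the quantity that both pruning strategies try to shrink:

```python
def conv_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of a k x k convolution producing an
    h x w x c_out feature map (counting a multiply-add as 2 FLOPs)."""
    return 2 * h * w * c_in * c_out * k * k

# Example: a 3x3 convolution on an 80x80 feature map, 128 -> 256 channels.
print(f"{conv_flops(80, 80, 128, 256, 3) / 1e9:.2f} GFLOPs")
# Halving both channel counts quarters the cost but can hurt accuracy,
# which is why E-YOLO instead targets redundancy inside the layer.
print(f"{conv_flops(80, 80, 64, 128, 3) / 1e9:.2f} GFLOPs")
```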
Method
First, we design and construct a variety of efficient lightweight modules. The GhostBottleneck layer is used to adjust the channel dimension and down-sample the feature maps at the same time, which reduces the computational cost and enhances the feature extraction capability of the backbone. The GhostC3 module is designed for feature extraction and multi-scale feature fusion at different stages; it is cost-effective in feature extraction while keeping the feature extraction capability.
which is cost-effective in feature extraction and keeps the feature extraction capability. An attention module local channel and spatial(LCS) is proposed to enhance the local information of regions and channels
so as to increase the attention of the model to the regions and channels of interest with smaller cost. The efficient spatial pyramid pooling (ESPP) module is designed
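The abstract does not detail the LCS design. As a rough illustration of combining local channel attention with spatial attention at low cost, here is a sketch in the spirit of ECA-Net and CBAM; it is entirely our assumption and not the authors' implementation:

```python
import torch
import torch.nn as nn

class LocalChannelSpatial(nn.Module):
    """Hypothetical low-cost channel + spatial attention, in the spirit of
    ECA-Net (local 1D conv over pooled channels) and CBAM's spatial branch.
    The real LCS module may differ; this only illustrates the idea."""
    def __init__(self, k=3):
        super().__init__()
        self.channel = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
        self.spatial = nn.Conv2d(2, 1, 7, padding=3, bias=False)

    def forward(self, x):
        # Channel attention: global average pool, local 1D conv, sigmoid gate.
        w = x.mean(dim=(2, 3))                        # (B, C)
        w = self.channel(w.unsqueeze(1)).squeeze(1)   # local cross-channel mix
        x = x * torch.sigmoid(w)[:, :, None, None]
        # Spatial attention: pool over channels, 7x7 conv, sigmoid gate.
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], 1)
        return x * torch.sigmoid(self.spatial(s))
```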
An efficient spatial pyramid pooling (ESPP) module is also designed, in which GhostConv is used to reduce the huge cost of channel dimension reduction in the deep layers of the network, and the redundant computation of repeated pooling is optimized.
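The abstract does not give the exact ESPP layout. One common way to remove redundant pooling, used by YOLOv5's SPPF, is to cascade small max-pools instead of running large ones in parallel; a standalone sketch under that assumption:

```python
import torch
import torch.nn as nn

class ESPPSketch(nn.Module):
    """Sketch of an efficient SPP block. Cascaded 5x5 max-pools reuse each
    other's results: pooling twice/thrice matches the receptive fields of
    9x9/13x13 pools without recomputing them. The layout is our assumption;
    in E-YOLO the two 1x1 convolutions here would be GhostConv layers
    (see the earlier sketch) to cheapen the channel reduction."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.fuse = nn.Conv2d(c_mid * 4, c_out, 1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)   # receptive field of a 9x9 pool
        p3 = self.pool(p2)   # receptive field of a 13x13 pool
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```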
To handle the extra cost caused by multi-scale feature fusion, a more efficient and lightweight PANet variant, efficient PANet (EPANet), is designed: a multi-scale weighted feature fusion scheme weakens the overhead of channel dimension reduction, and a long skip connection of middle-level features is added to alleviate the feature loss in PANet.
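A minimal sketch of learnable weighted fusion of same-shape feature maps, using BiFPN-style normalized weights; whether EPANet normalizes this way is our assumption, as the abstract only states that weighted fusion is used:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse n feature maps of identical shape with learnable non-negative
    weights, normalized so they sum to one (as in BiFPN)."""
    def __init__(self, n=2, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n))
        self.eps = eps

    def forward(self, *features):
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * f for wi, f in zip(w, features))

# Usage: fuse an up-sampled deep feature with a same-scale backbone feature.
fuse = WeightedFusion(2)
out = fuse(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
```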
On this basis, a lightweight one-stage object detector framework, called Efficient-YOLO, is built on YOLOv5. We use the Efficient-YOLO framework to construct two networks of different sizes, E-YOLOm and E-YOLOs. Our methods are implemented on Ubuntu 18.04 with the PyTorch deep learning framework and the YOLOv5 project. The default parameter settings of YOLOv5 (version v5.0) are used during training. No pre-trained weights are loaded when training from scratch on the visual object classes (VOC) dataset; the weights pre-trained on VOC are then used to fine-tune the same network structure on the GlobalWheat2020 dataset.
Result
The numbers of parameters of E-YOLOm and E-YOLOs decrease by 71.5% and 61.6% in comparison with YOLOv5m and YOLOv5s, and their FLOPs decrease by 67.3% and 49.7%, respectively. For average precision (AP) on the generic object detection dataset VOC, E-YOLOm is only 2.3% lower than YOLOv5m, and E-YOLOs is 3.4% higher than YOLOv5s. With respect to computational cost and detection efficiency, the parameters and FLOPs of E-YOLOm are 15.5% and 1.7% lower than those of YOLOv5s, while its mAP@0.5 and AP are 3.9% and 11.1% higher.
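As an aside, a minimal sketch of how parameter counts and GFLOPs of a PyTorch model are commonly measured; we assume the thop package here, which the YOLOv5 project itself uses for its FLOPs readout:

```python
import torch
from thop import profile  # pip install thop

def count_params_flops(model, img_size=640):
    """Return (millions of parameters, GFLOPs) for one forward pass,
    counting one multiply-accumulate as 2 FLOPs."""
    dummy = torch.randn(1, 3, img_size, img_size)
    macs, params = profile(model, inputs=(dummy,), verbose=False)
    return params / 1e6, 2 * macs / 1e9

# Example with a torchvision backbone as a stand-in for a detector:
from torchvision.models import mobilenet_v3_small
print(count_params_flops(mobilenet_v3_small()))
```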
Compared with YOLOv5m and YOLOv5s, the AP of E-YOLOm and E-YOLOs decreases by only 1.4% and 0.4%, respectively, on GlobalWheat2020, which indicates that Efficient-YOLO is also robust in detecting small objects. Similarly, the AP of E-YOLOm is 0.3% higher than that of YOLOv5s, reflecting that Efficient-YOLO remains the more efficient detector for small objects. At the same time, the lightweight backbone proposed in Efficient-YOLO outperforms backbones built on the latest lightweight CNN architectures such as ShuffleNetv2 and MobileNetv3. In addition, the GhostBottleneck layer with a stride of 2 is used to adjust the channel dimension and down-sample the features in the backbone, and GhostConv is used to reduce the channel dimension in ESPP; this effectively reduces the parameter and computation costs of the model and clearly improves detection accuracy. The results indicate that GhostConv can reduce the number of redundant convolution kernels and increase the information content of the output feature maps.
Conclusion
Experiments show that our Efficient-YOLO framework effectively reduces the redundant convolution computation and the multi-scale fusion cost of one-stage object detection networks, and it shows good robustness. In addition, our lightweight feature extraction blocks and attention module can further improve the performance of the detectors.
Keywords: convolutional neural network (CNN); object detection; lightweight model; attention module; multi-scale fusion
Bochkovskiy A, Wang C Y and Liao H Y M. 2020. YOLOv4: optimal speed and accuracy of object detection [EB/OL]. [2021-09-08]. https://arxiv.org/pdf/2004.10934.pdf
Chen K Q, Zhu Z L, Deng X M, Ma C X and Wang H A. 2021. Deep learning for multi-scale object detection: a survey. Journal of Software, 32(4): 1201-1227 [DOI: 10.13328/j.cnki.jos.006166]
David E, Madec S, Sadeghi-Tehran P, Aasen H, Zheng B Y, Liu S Y, Kirchgessner N, Ishikawa G, Nagasawa K, Badhon M A, Pozniak C, de Solan B, Hund A, Chapman S C, Baret F, Stavness I and Guo W. 2020. Global wheat head detection (GWHD) dataset: a large and diverse dataset of high-resolution RGB-labelled images to develop and benchmark wheat head detection methods. Plant Phenomics, 2020: #3521852 [DOI: 10.34133/2020/3521852]
Everingham M, Van Gool L, Williams C K I, Winn J and Zisserman A. 2010. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303-338 [DOI: 10.1007/s11263-009-0275-4]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587 [DOI: 10.1109/CVPR.2014.81]
Han K, Wang Y H, Tian Q, Guo J Y, Xu C J and Xu C. 2020. GhostNet: more features from cheap operations//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1577-1586 [DOI: 10.1109/CVPR42600.2020.00165]
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1904-1916 [DOI: 10.1109/TPAMI.2015.2389824]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Howard A G, Sandler M, Chen B, Wang W J, Chen L C, Tan M X, Chu G, Vasudevan V, Zhu Y K, Pang R M, Adam H and Le Q. 2019. Searching for MobileNetV3//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 1314-1324 [DOI: 10.1109/ICCV.2019.00140]
Howard A G, Zhu M L, Chen B, Kalenichenko D, Wang W J, Weyand T, Andreetto M and Adam H. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications [EB/OL]. [2021-09-08]. https://arxiv.org/pdf/1704.04861.pdf
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Liu L, Ouyang W L, Wang X G, Fieguth P, Chen J, Liu X W and Pietikäinen M. 2020. Deep learning for generic object detection: a survey. International Journal of Computer Vision, 128(2): 261-318 [DOI: 10.1007/s11263-019-01247-4]
Liu S, Qi L, Qin H F, Shi J P and Jia J Y. 2018. Path aggregation network for instance segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8759-8768 [DOI: 10.1109/CVPR.2018.00913]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Ma N N, Zhang X Y, Zheng H T and Sun J. 2018. ShuffleNet V2: practical guidelines for efficient CNN architecture design//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 122-138 [DOI: 10.1007/978-3-030-01264-9_8]
Qin Z W, Yu F X, Liu C C and Chen X. 2018. How convolutional neural networks see the world: a survey of convolutional neural network visualization methods. Mathematical Foundations of Computing, 1(2): 149-180 [DOI: 10.3934/mfc.2018008]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement [EB/OL]. [2021-09-08]. https://arxiv.org/pdf/1804.02767.pdf
Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 91-99
Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNetV2: inverted residuals and linear bottlenecks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/CVPR.2018.00474]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2021-09-08]. https://arxiv.org/pdf/1409.1556.pdf
Tang L, Li H X, Yan C Q, Zheng X W and Ji R R. 2021. Survey on neural architecture search. Journal of Image and Graphics, 26(2): 245-264 [DOI: 10.11834/jig.200202]
Ultralytics. 2020. YOLOv5 [EB/OL]. [2021-09-08]. https://github.com/ultralytics/yolov5
Wang C Y, Liao H Y M, Wu Y H, Chen P Y, Hsieh J W and Yeh I H. 2020a. CSPNet: a new backbone that can enhance learning capability of CNN//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, USA: IEEE: 1571-1580 [DOI: 10.1109/CVPRW50498.2020.00203]
Wang Q L, Wu B G, Zhu P F, Li P H, Zuo W M and Hu Q H. 2020b. ECA-Net: efficient channel attention for deep convolutional neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11531-11539 [DOI: 10.1109/CVPR42600.2020.01155]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Zhang K, Feng X H, Guo Y R, Su Y K, Zhao K, Zhao Z B, Ma Z Y and Ding Q L. 2021. Overview of deep convolutional neural networks for image classification. Journal of Image and Graphics, 26(10): 2305-2325 [DOI: 10.11834/jig.200302]
Zhang X Y, Zhou X Y, Lin M X and Sun J. 2018. ShuffleNet: an extremely efficient convolutional neural network for mobile devices//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6848-6856 [DOI: 10.1109/CVPR.2018.00716]
Zhao Y Q, Rao Y, Dong S P and Zhang J Y. 2020. Survey on deep learning object detection. Journal of Image and Graphics, 25(4): 629-654 [DOI: 10.11834/jig.190307]