融合策略优选和双注意力的单阶段目标检测
Single stage object detection algorithm based on fusing strategy optimization selection and dual attention mechanism
2022, Vol. 27, No. 8, Pages 2430-2443
Print publication date: 2022-08-16
Accepted: 2021-06-02
DOI: 10.11834/jig.210204
戴坤, 许立波, 黄世旸, 李鋆铃. 融合策略优选和双注意力的单阶段目标检测[J]. 中国图象图形学报, 2022,27(8):2430-2443.
Kun Dai, Libo Xu, Shiyang Huang, Yunling Li. Single stage object detection algorithm based on fusing strategy optimization selection and dual attention mechanism[J]. Journal of Image and Graphics, 2022,27(8):2430-2443.
目的
特征融合是改善模糊图像、小目标以及受遮挡物体等目标检测困难的有效手段之一,为了更有效地利用特征融合来整合不同网络层次的特征信息,显著表达其中的重要特征,本文提出一种基于融合策略优选和双注意力机制的单阶段目标检测算法FDA-SSD(fusion double attention single shot multibox detector)。
方法
设计融合策略优化选择方法,结合特征金字塔(feature pyramid network,FPN)来确定最优的多层特征图组合及融合过程,之后连接双注意力模块,通过对各个通道和空间特征的权重再分配,提升模型对通道特征和空间信息的敏感性,最终产生包含丰富语义信息和凸显重要特征的特征图组。
结果
本文在公开数据集PASCAL VOC2007(pattern analysis, statistical modelling and computational learning visual object classes)和TGRS-HRRSD-Dataset(high resolution remote sensing detection)上进行对比实验,结果表明,在输入为300×300像素的PASCAL VOC2007测试集上,FDA-SSD模型的精度达到79.8%,比SSD(single shot multibox detector)、RSSD(rainbow SSD)、DSSD(de-convolution SSD)、FSSD(feature fusion SSD)模型分别高了2.6%、1.3%、1.2%、1.0%,在Titan X上的检测速度为47帧/s(frame per second,FPS),与SSD算法相当,分别高于RSSD和DSSD模型12 FPS和37.5 FPS。在输入像素为300×300的TGRS-HRRSD-Dataset测试集上的精度为84.2%,在Tesla V100上的检测速度高于SSD模型10%的情况下,准确率提高了1.5%。
结论
通过在单阶段目标检测模型中引入融合策略选择和双注意力机制,使得预测的速度和准确率同时得到提升,并且对于小目标、受遮挡以及模糊图像等难目标的检测能力也得到较大提升。
Objective
Object detection is a fundamental task in computer vision and deep learning and has been widely applied in industrial inspection, intelligent transportation, face recognition, and related contexts. Mainstream object detection algorithms fall into two categories. The first consists of two-stage algorithms such as the region-based convolutional neural network (R-CNN), Fast R-CNN, online hard example mining (OHEM), Faster R-CNN, and Mask R-CNN, which first generate candidate boxes and then classify and regress them. The second consists of single-stage algorithms such as you only look once (YOLO) and the single shot multibox detector (SSD). In addition, anchor-free models such as the corner network (CornerNet) and center network (CenterNet) discard anchor boxes and detect and match objects through key points; they achieve quite good results, but a gap remains compared with anchor-based detectors. In practical applications of single-stage detection, the main challenges are difficult targets such as blurred images, small targets, and occluded objects, together with the trade-off between accuracy and efficiency. Feature fusion, which combines deep and shallow features of the network, can effectively improve the detection of such difficult targets and is commonly used in improved SSD models. However, most improved models apply feature fusion directly and seldom examine the specific fusion strategy, that is, which feature maps to select and how to process them after fusion. In addition, attention mechanisms can give a feature map a "focusing" effect by weighting its dimensions; how to combine attention mechanisms effectively with single-stage detection remains an open problem worth exploring.
Method
The shallow Visual Geometry Group (VGG) network in the original SSD algorithm is replaced by a deep residual network as the backbone. First, an optimized fusion-strategy selection method is designed following the idea of the feature pyramid network (FPN). FPN is applied to four output layers of the backbone to enrich detailed feature information; during the subsequent enhanced fusion, lower-level features are retained through downsampling while the size of the largest feature map stays unchanged, taking both speed and performance into account. In the operation sequence, FPN is applied first and enhanced fusion second, which is equivalent to a one-step reversed FPN and outperforms the alternative of enhancing first and applying FPN afterwards; finally, the r_c5_conv4 layer, which duplicates the r_c5_conv3 layer, is removed to reduce interference. The resulting feature maps combine the detailed features of high-resolution maps with the rich semantics of low-resolution maps, describing the target object better.
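The fusion sequence just described — an FPN top-down pass over the four backbone outputs, followed by an enhanced bottom-up pass that downsamples while the largest map keeps its size — can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions, not the paper's implementation: the lateral 1×1 convolutions are omitted, all maps are assumed to share one channel count, and `upsample2x`/`downsample2x` are hypothetical stand-ins for the model's learned resampling layers.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor upsampling by 2 in both spatial dimensions."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """2x2 average pooling, halving both spatial dimensions."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def fpn_top_down(feats):
    """FPN pass: propagate semantics from the smallest map downward."""
    out = [feats[-1]]
    for f in reversed(feats[:-1]):
        out.append(f + upsample2x(out[-1]))
    return out[::-1]          # reorder from largest map to smallest

def enhanced_bottom_up(feats):
    """Reversed-FPN pass: re-inject detail upward via downsampling."""
    out = [feats[0]]          # the largest map keeps its size
    for f in feats[1:]:
        out.append(f + downsample2x(out[-1]))
    return out

# Four hypothetical backbone outputs with halving resolutions.
feats = [np.random.rand(8, 2 ** k, 2 ** k) for k in (5, 4, 3, 2)]
fused = enhanced_bottom_up(fpn_top_down(feats))
assert [f.shape for f in fused] == [f.shape for f in feats]
```

A real model would replace the element-wise additions with learned convolutions; the sketch only shows how the two passes traverse the pyramid in opposite directions while preserving every map's resolution.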
Drawing on the ideas of the bottleneck attention module (BAM) and the squeeze-and-excitation network (SENet), this research designs a parallel dual attention mechanism that integrates the channel and spatial information of each feature map. Channel attention and spatial attention jointly sharpen the feature map's focus on the shape and position of the target: the two attention branches are computed in parallel and added for each feature map, strengthening the supervision of key features while weakening redundant, interfering ones. At the same time, the spatial information in the feature map is transformed to extract key positional information. Finally, feature map groups with rich semantic information and prominent key features are obtained.
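A minimal NumPy sketch of this parallel dual attention, in the spirit of BAM and SENet: the learned convolutional and fully connected layers are replaced here with hypothetical fixed weights, and BAM's residual connection is omitted, so the code only illustrates how the two branches are computed in parallel, added, and used to reweight the feature map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_branch(fmap, reduction=4):
    """SE-style channel logits: global average pool -> two-layer bottleneck."""
    c, _, _ = fmap.shape
    squeezed = fmap.mean(axis=(1, 2))             # (c,) global average pool
    # Hypothetical fixed weights stand in for the two learned FC layers.
    w1 = np.ones((c // reduction, c)) / c
    w2 = np.ones((c, c // reduction)) / (c // reduction)
    logits = w2 @ np.maximum(w1 @ squeezed, 0.0)  # FC -> ReLU -> FC
    return logits.reshape(c, 1, 1)                # broadcastable over (h, w)

def spatial_branch(fmap):
    """BAM-style spatial logits: collapse the channel axis per location."""
    return fmap.mean(axis=0, keepdims=True)       # (1, h, w)

def dual_attention(fmap):
    """Add the parallel branches, squash to weights, reweight the input."""
    weights = sigmoid(channel_branch(fmap) + spatial_branch(fmap))
    return fmap * weights

fmap = np.random.rand(8, 4, 4)
out = dual_attention(fmap)
assert out.shape == fmap.shape
```

Broadcasting the (c, 1, 1) channel logits against the (1, h, w) spatial logits yields one weight per channel-position pair, which is how the two branches jointly emphasize both what the target is and where it lies.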
Result
Comparative experiments are carried out on the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC 2007) dataset and the IEEE Transactions on Geoscience and Remote Sensing high-resolution remote sensing dataset (TGRS-HRRSD-Dataset). The experimental results show that on the PASCAL VOC 2007 test set with 300×300 input pixels, the fusion double attention SSD (FDA-SSD) model reaches an accuracy of 79.8%, which is 2.6%, 1.3%, 1.2%, and 1.0% higher than the SSD, rainbow single shot detector (RSSD), de-convolutional single shot detector (DSSD), and feature fusion single shot detector (FSSD) models, respectively. Its detection speed on a Titan X is 47 frames per second (FPS), on par with the SSD algorithm and 12 FPS and 37.5 FPS faster than the RSSD and DSSD models, respectively. With 512×512 input pixels, the proposed algorithm reaches 81.6% accuracy on the PASCAL VOC 2007 test set at 18 FPS on a Titan X, which is better than most algorithms in terms of both accuracy and speed. On the TGRS-HRRSD-Dataset with 300×300 input pixels, the accuracy is 84.2%; on a Tesla V100, the detection speed is 10% higher than that of the SSD model while the accuracy improves by 1.5%. The algorithm performs well on both conventional and aerial datasets, reflecting its stability and portability.
Conclusion
This research proposes an optimized feature-map selection method and a dual attention mechanism. Compared with many existing improved SSD models, the proposed model has advantages in both accuracy and speed, and it also performs well on small, occluded, blurred, and other difficult targets. Although the FDA-SSD model performs well within the SSD framework, the analysis focuses mainly on optimizing the feature maps; the generation of prediction boxes and non-maximum suppression methods remain to be studied further.
单阶段目标检测;SSD算法;特征金字塔(FPN);特征融合;注意力机制
single-stage object detection; single shot multibox detector (SSD); feature pyramid network (FPN); feature fusion; attention mechanism
Bae S H. 2019. Object detection based on region decomposition and assembly. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1): 8094-8101[DOI: 10.1609/aaai.v33i01.33018094]
Dai J F, Li Y, He K M and Sun J. 2016. R-FCN: object detection via region-based fully convolutional network[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1605.06409.pdf
Everingham M, Van Gool L, Williams C K I, Winn J and Zisserman A. 2010. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303-338[DOI: 10.1007/s11263-009-0275-4]
Fang L P, He H J and Zhou G M. 2018. Research overview of object detection methods. Computer Engineering and Applications, 54(13): 11-18, 33
方路平, 何杭江, 周国民. 2018. 目标检测算法研究综述. 计算机工程与应用, 54(13): 11-18, 33 [DOI: 10.3778/j.issn.1002-8331.1804-0167]
Fu C Y, Liu W, Ranga A, Tyagi A and Berg A C. 2017. DSSD: deconvolutional single shot detector[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1701.06659.pdf
Ge B Y, Zuo X Z and Hu Y J. 2018. Review of visual object tracking technology. Journal of Image and Graphics, 23(8): 1091-1107
葛宝义, 左宪章, 胡永江. 2018. 视觉目标跟踪方法研究综述. 中国图象图形学报, 23(8): 1091-1107 [DOI: 10.11834/jig.170604]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587[DOI: 10.1109/CVPR.2014.81]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1440-1448[DOI: 10.1109/ICCV.2015.169]
He K M, Gkioxari G, Dollar P and Girshick R. 2020. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 386-397[DOI: 10.1109/TPAMI.2018.2844175]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]
Hu J, Shen L, Albanie S, Sun G and Wu E H. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141[DOI: 10.1109/CVPR.2018.00745]
Jeong J, Park H and Kwak N. 2017. Enhancement of SSD by concatenating feature maps for object detection//Proceedings of the British Machine Vision Conference. London, UK: BMVA Press: 76.1-76.12[DOI: 10.5244/C.31.76]
Law H and Deng J. 2020. CornerNet: detecting objects as paired keypoints. International Journal of Computer Vision, 128(3): 642-656[DOI: 10.1007/s11263-019-01204-1]
Li Z X and Zhou F Q. 2017. FSSD: feature fusion single shot multibox detector[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1712.00960.pdf
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S J. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 936-944[DOI: 10.1109/CVPR.2017.106]
Lin T Y, Goyal P, Girshick R, He K M and Dollár P. 2020. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 318-327[DOI: 10.1109/TPAMI.2018.2858826]
Liu T and Wang X L. 2020. Single-stage object detection using filter pyramid and atrous convolution. Journal of Image and Graphics, 25(1): 102-112
刘涛, 汪西莉. 2020. 采用卷积核金字塔和空洞卷积的单阶段目标检测. 中国图象图形学报, 25(1): 102-112 [DOI: 10.11834/jig.190166]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37[DOI: 10.1007/978-3-319-46448-0_2]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 779-788[DOI: 10.1109/CVPR.2016.91]
Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6517-6525[DOI: 10.1109/CVPR.2017.690]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL]. [2021-04-02]. https://arxiv.org/pdf/1804.02767.pdf
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149[DOI: 10.1109/TPAMI.2016.2577031]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252[DOI: 10.1007/s11263-015-0816-y]
Shen Z Q, Liu Z, Li J G, Jiang Y G, Chen Y R and Xue X Y. 2020. Object detection from scratch with deep supervision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 398-412[DOI: 10.1109/TPAMI.2019.2922181]
Shrivastava A, Gupta A and Girshick R. 2016. Training region-based object detectors with online hard example mining//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 761-769[DOI: 10.1109/CVPR.2016.89]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1409.1556.pdf
Tang Q K and Hu Y. 2020. PosNeg-balanced anchors with aligned features for single-shot object detection. Journal of Computer-Aided Design and Computer Graphics, 32(11): 1773-1783
唐乾坤, 胡瑜. 2020. 基于正负锚点框均衡及特征对齐的单阶段目标检测算法. 计算机辅助设计与图形学学报, 32(11): 1773-1783 [DOI: 10.3724/SP.J.1089.2020.18175]
Wang H, Wang Q L, Gao M Q, Li P H and Zuo W M. 2018. Multi-scale location-aware kernel representation for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1248-1257[DOI: 10.1109/CVPR.2018.00136]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19[DOI: 10.1007/978-3-030-01234-2_1]
Wu X W, Sahoo D, Zhang D X, Zhu J K and Hoi S C H. 2020. Single-shot bidirectional pyramid networks for high-quality object detection. Neurocomputing, 401: 1-9[DOI: 10.1016/j.neucom.2020.02.116]
Yin H P, Chen B, Chai Y and Liu Z D. 2016. Vision-based object detection and tracking: a review. Acta Automatica Sinica, 42(10): 1466-1489
尹宏鹏, 陈波, 柴毅, 刘兆栋. 2016. 基于视觉的目标检测与跟踪综述. 自动化学报, 42(10): 1466-1489 [DOI: 10.16383/j.aas.2016.c150823]
Zhang H L, Hu S Q and Yang G S. 2015. Video object tracking based on appearance models learning. Journal of Computer Research and Development, 52(1): 177-190
张焕龙, 胡士强, 杨国胜. 2015. 基于外观模型学习的视频目标跟踪方法综述. 计算机研究与发展, 52(1): 177-190 [DOI: 10.7544/issn1000-1239.2015.20130995]
Zhang S F, Wen L Y, Bian X, Lei Z and Li S Z. 2018. Single-shot refinement neural network for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4203-4212[DOI: 10.1109/CVPR.2018.00442]
Zhang Y, Yuan Y, Feng Y and Lu X Q. 2019. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Transactions on Geoscience and Remote Sensing, 57(8): 535-554[DOI: 10.1109/TGRS.2019.2900302]
Zheng P, Bai H Y, Li W and Guo H W. 2020. Small target detection algorithm in complex background. Journal of Zhejiang University (Engineering Science), 54(9): 1777-1784
郑浦, 白宏阳, 李伟, 郭宏伟. 2020. 复杂背景下的小目标检测算法. 浙江大学学报(工学版), 54(9): 1777-1784 [DOI: 10.3785/j.issn.1008-973X.2020.09.014]
Zhou P, Ni B B, Geng C, Hu J G and Xu Y. 2018. Scale-transferrable object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 528-537[DOI: 10.1109/CVPR.2018.00062]
Zhou X Y, Wang D Q and Krähenbühl P. 2019. Objects as points[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1904.07850.pdf