Multiscale feature map fusion algorithm for target detection
Journal of Image and Graphics, 2019, Vol. 24, No. 11, pp. 1918-1931
Received: 2019-01-30
Revised: 2019-05-12
Accepted: 2019-05-19
Published in print: 2019-11-16
DOI: 10.11834/jig.190021
Objective
In natural scene images, the quality of feature extraction is the key factor determining target-detection performance. Most detection algorithms exploit the powerful learning ability of convolutional neural networks (CNNs) to obtain prior knowledge of the target and detect targets on the basis of that knowledge. However, the low-level features of a CNN lack representational power, whereas its high-level features are weak at detecting small-scale targets.
Method
The original SSD (single shot multibox detector) network is used to extract feature maps, which are unified to 256 channels by a 1×1 convolution layer. A deconvolution operation increases the spatial resolution of the top-down feature maps, and the feature maps from the two directions are fused by element-wise addition. Each fused feature map is then convolved with a 3×3 kernel to reduce the aliasing effect introduced by fusion. These steps build feature maps with strong semantic information while retaining the detail of the original feature maps. Finally, the predicted boxes are aggregated, and non-maximum suppression (NMS) produces the final detections.
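The fusion steps can be sketched numerically. The NumPy sketch below is an illustrative stand-in, not the paper's implementation: the 1×1 convolution is modeled as a per-pixel channel mixing, nearest-neighbour upsampling stands in for the learned deconvolution, and all shapes and weights are assumptions chosen for clarity.

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution: a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

def upsample2x(x):
    # Nearest-neighbour 2x upsampling; stands in for the learned
    # deconvolution of the paper (assumption: stride-2 scale gap).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv3x3(x, w):
    # 3x3 "same" convolution used to reduce the aliasing left by fusion.
    # x: (C, H, W), w: (C_out, C, 3, 3)
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for co in range(w.shape[0]):
        for ci in range(C):
            for i in range(3):
                for j in range(3):
                    out[co] += w[co, ci, i, j] * xp[ci, i:i + H, j:j + W]
    return out

def fuse(shallow, deep, w1, w3):
    # Unify channels on the bottom-up branch, upsample the deeper map,
    # add element-wise, then smooth with a 3x3 convolution.
    lateral = conv1x1(shallow, w1)   # bottom-up branch -> unified channels
    topdown = upsample2x(deep)       # top-down branch at matching resolution
    fused = lateral + topdown        # element-wise addition
    return conv3x3(fused, w3)
```

In the paper the unified width is 256 channels; the toy widths here keep the sketch fast to run.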
Result
In experiments on the PASCAL VOC 2007 and PASCAL VOC 2012 data sets, the model achieves mAP (mean average precision) values of 78.9% and 76.7%, improvements of 1.4 and 0.9 percentage points over the classical SSD algorithm. In addition, on small-scale targets the proposed method improves mAP by 8.3 percentage points over the classical SSD model.
Conclusion
A multiscale feature-map fusion target-detection algorithm is proposed that propagates semantic information in a top-down manner and constructs feature maps with strong semantics for accurate target detection.
Objective
The development of science and technology has made it possible to obtain numerous images from imaging equipment, the Internet, or image databases, and has raised people's requirements for image processing. Consequently, image-processing technology has developed deeply, widely, and rapidly. Target detection is an important research topic in computer vision. Rapid and accurate localization and recognition of specific targets in uncontrolled natural scenes is a vital functional basis of many artificial-intelligence applications. However, several major difficulties remain in target detection. First, many small objects are widely distributed in visual scenes, and their existence challenges the agility and reliability of detection algorithms. Second, detection accuracy and speed are linked, and many technical bottlenecks must be overcome to balance these two factors. Finally, large-scale model parameters are an important obstacle to deploying deep networks on chips; compressing model size while ensuring detection accuracy is a meaningful and urgent problem. Targets with a simple background, sufficient illumination, and no occlusion are relatively easy to detect, whereas targets with a background mixed with the target, occlusion near the target, excessively weak illumination, or diverse poses are difficult to detect. In natural scene images, the quality of feature extraction is the key factor determining target-detection performance. Decades of research have produced increasingly robust detection algorithms, and deep learning has achieved great breakthroughs in computer vision in recent years. Target-detection frameworks based on deep learning have become mainstream, from which two main branches have been derived: algorithms based on candidate regions and algorithms based on regression. Most current detection algorithms use the powerful learning ability of convolutional neural networks (CNNs) to obtain prior knowledge of the target and perform detection according to that knowledge. The low-level features of a CNN are characterized by high resolution, low abstract semantics, limited position information, and weak representational power; high-level features are characterized by high discriminability, low resolution, and a weak ability to detect small-scale targets. Therefore, in this study, contextual semantic information is transmitted by combining high- and low-level feature maps to make the semantic information complete and evenly distributed.
Method
While balancing detection speed and accuracy, the multiscale feature-map fusion target-detection algorithm in this study takes the single-shot multibox detector (SSD) network as its base network and adds a feature-fusion module to obtain feature maps with rich, evenly distributed semantic information. The fusion structure transmits semantic information of feature maps at different levels from top to bottom to reduce the semantic gap between levels. The original SSD network is first used to extract feature maps, which are unified to 256 channels through a 1×1 convolution layer. The spatial resolution of the top-down feature maps is then increased by deconvolution, so that the feature maps from the two directions have the same spatial resolution. Feature maps from both directions are fused by element-wise addition to obtain maps with complete, evenly distributed semantic information. Each fused feature map is convolved with a 3×3 kernel to reduce the aliasing effect of fusion. Feature maps with strong semantic information are thus constructed while the details of the original feature maps are retained. Lastly, the predicted bounding boxes are aggregated, and non-maximum suppression is used to produce the final detection results.
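The non-maximum suppression step above can be sketched as follows: a minimal NumPy version of greedy NMS. The [x1, y1, x2, y2] box format and the 0.5 overlap threshold are common conventions assumed here, not details taken from the paper.

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes,
    # all given as [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, discard boxes that
    # overlap it by more than `thresh`, and repeat on the remainder.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the higher-scoring one, while a distant third box survives.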
Result
Key problems in the practical application of target-detection algorithms and the difficulties of related detection tasks are analyzed according to the research progress and task requirements of visual target detection, and current solutions are given. The multiscale feature-map fusion algorithm in this study achieves good results on weak targets, multiple targets, cluttered backgrounds, occlusion, and other detection difficulties. Experimental tests are performed on the PASCAL VOC 2007 and 2012 data sets. The mean average precision (mAP) values of the proposed model are 78.9% and 76.7%, which are 1.4 and 0.9 percentage points higher than those of the classical SSD algorithm, respectively. In addition, the proposed method improves mAP by 8.3 percentage points over the classical SSD model when detecting small-scale targets.
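For reference, mAP on PASCAL VOC is the mean over classes of the average precision (AP); VOC 2007 uses the 11-point interpolated AP. A minimal sketch, with illustrative precision/recall arrays rather than the paper's actual curves:

```python
import numpy as np

def voc_ap_11point(recall, precision):
    # 11-point interpolated AP (PASCAL VOC 2007 protocol):
    # average, over t in {0.0, 0.1, ..., 1.0}, the maximum precision
    # observed at any recall >= t. mAP is the mean of AP over classes.
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        above = precision[recall >= t]
        ap += (above.max() if above.size else 0.0) / 11.0
    return ap
```

With a toy curve that holds precision 1.0 up to recall 0.5 and 0.5 thereafter, this yields (6 × 1.0 + 5 × 0.5) / 11 ≈ 0.773.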
Conclusion
The multiscale feature-map fusion target-detection algorithm proposed in this study uses a convolutional neural network to extract features instead of the traditional hand-crafted feature-extraction process, thereby avoiding the feature-selection problem of traditional detection and benefiting from the stronger expressive power of deep convolutional features. Semantic information is expanded in a top-down manner to construct feature maps with strong semantics, and the model can be applied to new scene images in demanding visual tasks. The final detection model is obtained through repeated iterative training on the basis of the SSD network and performs well on small-scale target-detection tasks. While realizing end-to-end training, the model also improves robustness to various complex scenes and the accuracy of target detection; thus accurate target detection is achieved. This study provides a general and concise way to address the problem of small-scale target detection.