Multiscale feature map fusion algorithm for target detection
Journal of Image and Graphics, 2019, Vol. 24, No. 11, pp. 1918-1931
Received: 2019-01-30
Revised: 2019-05-12
Accepted: 2019-05-19
Published in print: 2019-11-16
DOI: 10.11834/jig.190021
Objective
In natural scene images, the quality of feature extraction is the key factor determining target-detection performance. Most detection algorithms exploit the powerful learning ability of convolutional neural networks (CNNs) to obtain prior knowledge of the target and detect targets on the basis of that knowledge. However, the low-level features of a CNN lack representational power, whereas its high-level features are weak at detecting small-scale targets.
Method
The original SSD (single shot multibox detector) network is used to extract feature maps, which are unified to 256 channels by a 1×1 convolution layer. A deconvolution operation increases the spatial resolution of the top-down feature maps, and the feature maps from the two directions are fused by element-wise addition. Each fused feature map is then convolved with a 3×3 kernel to reduce the aliasing effect introduced by fusion. These steps build feature maps with strong semantic information while retaining the detail of the original feature maps. Finally, the predicted boxes are aggregated, and non-maximum suppression (NMS) produces the final detections.
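The fusion steps can be sketched numerically. The NumPy sketch below is an illustrative stand-in, not the paper's implementation: the 1×1 convolution is modeled as a per-pixel channel mixing, nearest-neighbour upsampling stands in for the learned deconvolution, and all shapes and weights are assumptions chosen for clarity.

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution: a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

def upsample2x(x):
    # Nearest-neighbour 2x upsampling; stands in for the learned
    # deconvolution of the paper (assumption: stride-2 scale gap).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv3x3(x, w):
    # 3x3 "same" convolution used to reduce the aliasing left by fusion.
    # x: (C, H, W), w: (C_out, C, 3, 3)
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for co in range(w.shape[0]):
        for ci in range(C):
            for i in range(3):
                for j in range(3):
                    out[co] += w[co, ci, i, j] * xp[ci, i:i + H, j:j + W]
    return out

def fuse(shallow, deep, w1, w3):
    # Unify channels on the bottom-up branch, upsample the deeper map,
    # add element-wise, then smooth with a 3x3 convolution.
    lateral = conv1x1(shallow, w1)   # bottom-up branch -> unified channels
    topdown = upsample2x(deep)       # top-down branch at matching resolution
    fused = lateral + topdown        # element-wise addition
    return conv3x3(fused, w3)
```

In the paper the unified width is 256 channels; the toy widths here keep the sketch fast to run.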
Result
In experiments on the PASCAL VOC 2007 and PASCAL VOC 2012 data sets, the model achieves mAP (mean average precision) values of 78.9% and 76.7%, improvements of 1.4 and 0.9 percentage points over the classical SSD algorithm. In addition, on small-scale targets the proposed method improves mAP by 8.3 percentage points over the classical SSD model.
Conclusion
A multiscale feature-map fusion target-detection algorithm is proposed that propagates semantic information in a top-down manner and constructs feature maps with strong semantics for accurate target detection.
Objective
The development of science and technology has made it possible to obtain numerous images from imaging equipment, the Internet, or image databases, and has raised people's requirements for image processing. Consequently, image-processing technology has developed deeply, widely, and rapidly. Target detection is an important research topic in computer vision. Rapid and accurate localization and recognition of specific targets in uncontrolled natural scenes is a vital functional basis of many artificial-intelligence applications. However, several major difficulties remain in target detection. First, many small objects are widely distributed in visual scenes, and their existence challenges the agility and reliability of detection algorithms. Second, detection accuracy and speed are linked, and many technical bottlenecks must be overcome to balance these two factors. Finally, large-scale model parameters are an important obstacle to deploying deep networks on chips; compressing model size while ensuring detection accuracy is a meaningful and urgent problem. Targets with a simple background, sufficient illumination, and no occlusion are relatively easy to detect, whereas targets with a background mixed with the target, occlusion near the target, excessively weak illumination, or diverse poses are difficult to detect. In natural scene images, the quality of feature extraction is the key factor determining target-detection performance. Decades of research have produced increasingly robust detection algorithms, and deep learning has achieved great breakthroughs in computer vision in recent years. Target-detection frameworks based on deep learning have become mainstream, from which two main branches have been derived: algorithms based on candidate regions and algorithms based on regression. Most current detection algorithms use the powerful learning ability of convolutional neural networks (CNNs) to obtain prior knowledge of the target and perform detection according to that knowledge. The low-level features of a CNN are characterized by high resolution, low abstract semantics, limited position information, and weak representational power; high-level features are characterized by high discriminability, low resolution, and a weak ability to detect small-scale targets. Therefore, in this study, contextual semantic information is transmitted by combining high- and low-level feature maps to make the semantic information complete and evenly distributed.
Method
While balancing detection speed and accuracy, the multiscale feature-map fusion target-detection algorithm in this study takes the single-shot multibox detector (SSD) network as its base network and adds a feature-fusion module to obtain feature maps with rich, evenly distributed semantic information. The fusion structure transmits semantic information of feature maps at different levels from top to bottom to reduce the semantic gap between levels. The original SSD network is first used to extract feature maps, which are unified to 256 channels through a 1×1 convolution layer. The spatial resolution of the top-down feature maps is then increased by deconvolution, so that the feature maps from the two directions have the same spatial resolution. Feature maps from both directions are fused by element-wise addition to obtain maps with complete, evenly distributed semantic information. Each fused feature map is convolved with a 3×3 kernel to reduce the aliasing effect of fusion. Feature maps with strong semantic information are thus constructed while the details of the original feature maps are retained. Lastly, the predicted bounding boxes are aggregated, and non-maximum suppression is used to produce the final detection results.
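The non-maximum suppression step above can be sketched as follows: a minimal NumPy version of greedy NMS. The [x1, y1, x2, y2] box format and the 0.5 overlap threshold are common conventions assumed here, not details taken from the paper.

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes,
    # all given as [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, discard boxes that
    # overlap it by more than `thresh`, and repeat on the remainder.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the higher-scoring one, while a distant third box survives.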
Result
Key problems in the practical application of target-detection algorithms and the difficulties of related detection tasks are analyzed according to the research progress and task requirements of visual target detection, and current solutions are given. The multiscale feature-map fusion algorithm in this study achieves good results on weak targets, multiple targets, cluttered backgrounds, occlusion, and other detection difficulties. Experimental tests are performed on the PASCAL VOC 2007 and 2012 data sets. The mean average precision (mAP) values of the proposed model are 78.9% and 76.7%, which are 1.4 and 0.9 percentage points higher than those of the classical SSD algorithm, respectively. In addition, the proposed method improves mAP by 8.3 percentage points over the classical SSD model when detecting small-scale targets.
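For reference, mAP on PASCAL VOC is the mean over classes of the average precision (AP); VOC 2007 uses the 11-point interpolated AP. A minimal sketch, with illustrative precision/recall arrays rather than the paper's actual curves:

```python
import numpy as np

def voc_ap_11point(recall, precision):
    # 11-point interpolated AP (PASCAL VOC 2007 protocol):
    # average, over t in {0.0, 0.1, ..., 1.0}, the maximum precision
    # observed at any recall >= t. mAP is the mean of AP over classes.
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        above = precision[recall >= t]
        ap += (above.max() if above.size else 0.0) / 11.0
    return ap
```

With a toy curve that holds precision 1.0 up to recall 0.5 and 0.5 thereafter, this yields (6 × 1.0 + 5 × 0.5) / 11 ≈ 0.773.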
Conclusion
The multiscale feature-map fusion target-detection algorithm proposed in this study uses a convolutional neural network to extract features instead of the traditional hand-crafted feature-extraction process, thereby avoiding the feature-selection problem of traditional detection and benefiting from the stronger expressive power of deep convolutional features. Semantic information is expanded in a top-down manner to construct feature maps with strong semantics, and the model can be applied to new scene images in demanding visual tasks. The final detection model is obtained through repeated iterative training on the basis of the SSD network and performs well on small-scale target-detection tasks. While realizing end-to-end training, the model also improves robustness to various complex scenes and the accuracy of target detection; thus accurate target detection is achieved. This study provides a general and concise way to address the problem of small-scale target detection.