Single-stage object detection using filter pyramid and atrous convolution
2020, Vol. 25, No. 1, Pages: 102-112
Received: 2019-05-08; Revised: 2019-07-08; Published in print: 2020-01-16
DOI: 10.11834/jig.190166
Objective
In deep learning-based object detection models, shallow feature maps contain rich detail but lack semantic information, while deep feature maps show the opposite behavior. To exploit the complementary strengths of feature maps at different depths and, on that basis, to address the multi-scale problem in object detection, this paper proposes a single-stage object detection model based on a convolution filter pyramid and atrous convolution.
Method
The proposed model fuses feature information in multiple ways. It first fuses multi-layer feature maps of different sizes through pixel-wise addition, then concatenates the feature maps of different stages along the channel dimension, forming a fused feature layer with rich semantic and detailed information that serves as the model's prediction layer. The model introduces a convolution filter pyramid structure into the anchor mechanism to address the multi-scale detection problem, adopts atrous convolution to offset the parameter growth of large filters, and reasonably reduces the number of anchors.
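For reference, the parameter saving follows from the standard receptive-field arithmetic of dilated (atrous) convolution, a general fact rather than a result specific to this paper: a $k \times k$ filter with dilation rate $d$ has an effective kernel size of

$$k_{\text{eff}} = k + (k-1)(d-1),$$

so a 3×3 filter with $d=2$ covers a 5×5 region while keeping only 9 weights per channel.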
Result
Experimental results show that on the PASCAL VOC2007 test set, the proposed detector achieves 79.3% mAP (mean average precision) on 300×300-pixel inputs, which is 1.8% higher than SSD (single shot multibox detector) and 0.9% higher than DSSD (deconvolutional single shot detector). On the UCAS-AOD remote sensing test set, its detection accuracy exceeds SSD and DSSD by 2.8% and 1.9%, respectively. In terms of speed, the proposed model runs at 21 frames per second on a Titan X GPU, faster than DSSD.
Conclusion
The proposed model fuses feature information in two stages and improves the anchor mechanism. It not only achieves fast detection with high accuracy, but also effectively alleviates the difficulty of detecting small and overlapping objects.
Objective
Object detection is a fundamental topic in computer vision. Deep learning-based object detection networks consist of two basic parts: a feature extraction module and an object detection module. Convolutional neural networks (CNNs) are used to extract image features. On the one hand, deep feature maps are rich in semantic information, sensitive to category information, lacking in detailed information, insensitive to position, translation, and rotation information, and widely used in classification tasks. On the other hand, shallow feature maps are rich in detailed information, sensitive to location, translation, and rotation information, lacking in semantic information, and insensitive to category information. The two main subtasks of object detection are classification and localization. The former classifies the candidate regions and requires the semantic information of the object, whereas the latter locates the candidate regions and requires detailed information (e.g., location). In the anchor mechanism of faster region-based CNN (Faster R-CNN), each anchor point of the predicted feature map corresponds to nine anchors with different sizes and ratios. A 1×1 convolution filter is used to predict the positions and confidence scores (i.e., the probability that the object contained in the anchor box belongs to a certain category) of the multiple anchors with different sizes. Therefore, for the anchors of different sizes that correspond to the same anchor point, the same feature region on the feature map is used for prediction. This results in a mismatch between the feature region used for prediction and the corresponding anchor. To utilize the advantages of feature maps with different depths and overcome the mismatch problem in the anchor mechanism so as to accurately solve the multi-scale object detection problem, we present a single-stage object detection model using a convolution filter pyramid and atrous convolution.
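To make the mismatch concrete, the following minimal PyTorch sketch (an illustration of the Faster R-CNN-style head described above, not the authors' Caffe code) shows how 1×1 convolutions predict all nine anchors at each location, so anchors of every size and ratio share one feature region:

```python
import torch
import torch.nn as nn

# Illustrative sketch: one pair of 1x1 convolutions emits scores and box
# offsets for all k anchors at each anchor point, so anchors of every size
# and ratio are predicted from the SAME feature region -- the mismatch
# this paper targets. Channel counts and class count are assumptions.
class SharedAnchorHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9, num_classes=21):
        super().__init__()
        # A 1x1 filter sees an identical receptive field for all 9 anchors.
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=1)
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feat):                    # feat: (N, C, H, W)
        return self.cls(feat), self.reg(feat)   # per-location predictions

head = SharedAnchorHead()
scores, offsets = head(torch.randn(1, 256, 38, 38))
print(scores.shape, offsets.shape)  # (1, 189, 38, 38), (1, 36, 38, 38)
```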
Method
Feature information is fused in a variety of ways. First, multiple convolutional layers are added to the feature extraction network. The feature information in these layers is fused layer by layer (from the deep layers to the shallow ones) through pixel-by-pixel addition, thereby forming feature maps with rich semantic and detailed information. Second, to further enhance the fusion of feature information, feature maps from different stages are concatenated with the fused feature maps obtained in the previous step (see the sketch below).
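The following minimal PyTorch sketch makes the two-stage fusion concrete. It is written in PyTorch rather than the authors' Caffe, and all layer names, channel counts, and resolutions are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1: deep-to-shallow element-wise addition.
# Stage 2: channel concatenation of the fused stages into one prediction layer.
class TwoStageFusion(nn.Module):
    def __init__(self, channels=(512, 1024, 512)):  # assumed backbone channels
        super().__init__()
        # 1x1 convs align channel counts so element-wise addition is valid.
        self.align = nn.ModuleList(nn.Conv2d(c, 256, kernel_size=1) for c in channels)

    def forward(self, feats):  # feats: shallow -> deep, decreasing resolution
        aligned = [a(f) for a, f in zip(self.align, feats)]
        fused = [aligned[-1]]
        for f in reversed(aligned[:-1]):
            # Stage 1: upsample the deeper map and add it pixel by pixel.
            up = F.interpolate(fused[-1], size=f.shape[-2:], mode="bilinear",
                               align_corners=False)
            fused.append(f + up)
        fused = fused[::-1]  # back to shallow -> deep order
        # Stage 2: bring all fused maps to one resolution and concatenate
        # along the channel dimension to form the prediction layer.
        target = fused[0].shape[-2:]
        resized = [F.interpolate(f, size=target, mode="bilinear",
                                 align_corners=False) for f in fused]
        return torch.cat(resized, dim=1)

feats = [torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19),
         torch.randn(1, 512, 10, 10)]
print(TwoStageFusion()(feats).shape)  # (1, 768, 38, 38)
```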
To address the mismatch between the feature region used for prediction and the corresponding anchor, this study introduces a convolution filter pyramid structure into the anchor mechanism to detect objects of different sizes. Consequently, the convolution filters corresponding to anchors of different sizes are distinct, whereas those corresponding to anchors of equal size but different ratios are the same, which alleviates the mismatch problem. In addition, because a large convolution filter increases the number of parameters and the time complexity should be reduced, the model uses atrous convolution to design filters with different receptive fields. Under the action of convolution filters of different sizes, prediction tensors (i.e., feature maps) of different resolutions are generated on the feature maps rich in semantic and detailed information. The model determines the number of anchors according to the generated prediction tensors: the number of small anchors, which correspond to small objects, is large, whereas the number of large anchors is small, thereby reducing the total number of anchors.
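The filter pyramid itself can be sketched in the same hedged way. Below, anchors of three scales get 3×3 filters with dilation rates 1, 2, and 3 (effective receptive fields 3×3, 5×5, and 7×7 at a constant 9 weights per channel), and increasing strides yield prediction tensors of decreasing resolution, so small anchors outnumber large ones. The dilation rates, strides, and channel counts are assumptions for illustration, not the paper's exact settings:

```python
import torch
import torch.nn as nn

# Illustrative filter-pyramid head: one branch per anchor scale, each with a
# different receptive field via dilation, at no extra parameter cost.
class FilterPyramidHead(nn.Module):
    def __init__(self, in_channels=768, num_classes=21, ratios=3):
        super().__init__()
        out = ratios * (num_classes + 4)  # scores + box offsets per anchor
        self.heads = nn.ModuleList([
            # small anchors: dilation 1, stride 1 -> high-res tensor, many anchors
            nn.Conv2d(in_channels, out, kernel_size=3, stride=1, padding=1, dilation=1),
            # medium anchors: dilation 2, stride 2 -> 5x5 field, same 9 weights
            nn.Conv2d(in_channels, out, kernel_size=3, stride=2, padding=2, dilation=2),
            # large anchors: dilation 3, stride 4 -> 7x7 field, few anchors
            nn.Conv2d(in_channels, out, kernel_size=3, stride=4, padding=3, dilation=3),
        ])

    def forward(self, feat):
        # Each branch emits a prediction tensor for one anchor scale; the
        # anchor count per scale follows the tensor's spatial resolution.
        return [h(feat) for h in self.heads]

preds = FilterPyramidHead()(torch.randn(1, 768, 38, 38))
print([p.shape for p in preds])
# [(1, 75, 38, 38), (1, 75, 19, 19), (1, 75, 10, 10)]
# -> 38*38*3 = 4 332 small anchors vs. 10*10*3 = 300 large anchors.
```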
Result
The proposed method was tested and evaluated on the PASCAL visual object classes (VOC) and UCAS-AOD remote sensing datasets. The code was implemented on the Caffe deep learning framework, where some components of the open-source Caffe libraries of the single-shot multibox detector (SSD) and the deconvolutional single-shot detector (DSSD) were utilized. All experiments were performed on an HP workstation with a Titan X GPU. SSD was used as the pre-training model of the proposed method. The model was fine-tuned on PASCAL VOC and UCAS-AOD, and its performance was evaluated using the mean average precision (mAP) on the VOC2007 and UCAS-AOD test sets. The proposed method was then compared with other advanced deep learning object detection methods in terms of mAP and detection speed. Experimental results show that on the PASCAL VOC2007 test set, the proposed model achieves 79.3% mAP for an input size of 300×300 pixels, which is higher than SSD and DSSD by 1.8% and 0.9%, respectively. On the UCAS-AOD remote sensing dataset, the proposed model obtains 91.0% mAP, which is 2.8% and 1.9% higher than SSD and DSSD, respectively. The model runs at 21 frames per second on a Titan X GPU, which is much faster than DSSD.
Conclusion
In this study, a single-stage object detection model using a convolution filter pyramid and atrous convolution is proposed. First, feature information is merged through pixel-by-pixel addition and channel concatenation to form a feature map with rich semantic and detailed information, which is used as the prediction feature map to provide rich feature information for predicting bounding-box categories and locations. Then, the convolution filter pyramid structure is introduced into the anchor mechanism to overcome the mismatch between the feature region and the corresponding anchor, as well as to accurately detect multi-scale objects. At the same time, atrous convolution is introduced to increase the receptive field of the convolution filter without increasing the number of parameters, and the number of anchors is determined according to the generated prediction tensors to reduce the time complexity. Owing to the effective information fusion and the introduction of the filter pyramid structure into the anchor mechanism, the proposed model exhibits faster detection and higher accuracy than current advanced methods, especially in detecting small and overlapped objects. Although the proposed method achieves good results in speed and accuracy, its detection accuracy can still be improved relative to two-stage algorithms, because the work on the feature fusion part remains limited. In the future, further research on feature fusion will be conducted to improve the detection accuracy of the algorithm.
Bell S, Zitnick C L and Bala K. 2016. Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 2874-2883 [DOI: 10.1109/CVPR.2016.314]
Cai Z W, Fan Q F and Feris R S. 2016. A unified multi-scale deep convolutional neural network for fast object detection//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 354-370 [DOI: 10.1007/978-3-319-46493-0_22]
Cao G M, Xie X M and Yang W Z. 2018. Feature-fused SSD: fast detection for small objects//Proceedings of the 9th International Conference on Graphic and Image Processing. Qingdao, China: SPIE: 10615 [DOI: 10.1117/12.2304811]
Chen L C, Papandreou G and Kokkinos I. 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI: 10.1109/TPAMI.2017.2699184]
Dai J F, Li Y and He K M. 2016. R-FCN: object detection via region-based fully convolutional networks//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 379-387
Fu C Y, Liu W and Ranga A. 2017. DSSD: deconvolutional single shot detector[EB/OL]. 2017-01-23[2019-05-01]. https://arxiv.org/pdf/1701.06659.pdf
Gidaris S and Komodakis N. 2015. Object detection via a multi-region and semantic segmentation-aware CNN model//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1134-1142 [DOI: 10.1109/ICCV.2015.135]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1440-1448 [DOI: 10.1109/ICCV.2015.169]
He K M, Gkioxari G and Dollár P. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2980-2988 [DOI: 10.1109/ICCV.2017.322]
He K M, Zhang X Y and Ren S Q. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu P Y and Ramanan D. 2017. Finding tiny faces//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 951-959 [DOI: 10.1109/CVPR.2017.166]
Jeong J, Park H and Kwak N. 2017. Enhancement of SSD by concatenating feature maps for object detection[EB/OL]. 2017-05-26[2019-04-23]. https://arxiv.org/pdf/1705.09587.pdf
Kong T, Yao A B and Chen Y R. 2016. HyperNet: towards accurate region proposal generation and joint object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 845-853 [DOI: 10.1109/CVPR.2016.98]
Li Z X and Zhou F Q. 2017. FSSD: feature fusion single shot multibox detector[EB/OL]. 2017-12-04[2018-04-23]. https://arxiv.org/pdf/1712.00960.pdf
Lin T Y, Dollár P and Girshick R. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Liu W, Anguelov D and Erhan D. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Redmon J, Divvala S and Girshick R. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 7263-7271 [DOI: 10.1109/CVPR.2017.690]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL]. 2018-04-08[2018-04-23]. https://arxiv.org/pdf/1804.02767.pdf
Ren S Q, He K M and Girshick R. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 91-99
Shen Z Q, Liu Z and Li J G. 2017. DSOD: learning deeply supervised object detectors from scratch//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1937-1945 [DOI: 10.1109/ICCV.2017.212]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition[EB/OL]. 2014-09-04[2015-04-23]. https://arxiv.org/pdf/1409.1556.pdf
Singh B and Davis L S. 2018. An analysis of scale invariance in object detection-SNIP//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 3578-3587 [DOI: 10.1109/CVPR.2018.00377]
Wang P Q, Chen P F and Yuan Y. 2018. Understanding convolution for semantic segmentation//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, NV, USA: IEEE: 1451-1460 [DOI: 10.1109/WACV.2018.00163]
Xu M L, Cui L S and Lv P. 2018. MDSSD: multi-scale deconvolutional single shot detector for small objects[EB/OL]. 2018-05-18[2018-08-19]. https://arxiv.org/pdf/1805.07009.pdf
Zhu H G, Chen X G and Dai W Q. 2015. Orientation robust object detection in aerial images using deep convolutional neural network//Proceedings of 2015 IEEE International Conference on Image Processing. Quebec City, Canada: IEEE: 3735-3739 [DOI: 10.1109/ICIP.2015.7351502]