Published: 2020-01-16
DOI: 10.11834/jig.190166
2020 | Volume 25 | Number 1

Image Analysis and Recognition

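Single-stage object detection using filter pyramid and atrous convolution

Liu Tao, Wang Xili

School of Computer Science, Shaanxi Normal University, Xi'an 710119, China

Received: 2019-05-08; Revised: 2019-07-08

Supported by: National Natural Science Foundation of China (41471280, 61701290, 61701289)

First author: Liu Tao, male, born in 1994, master's student. His research interests include deep learning, computer vision, and object detection. E-mail: 935165996@qq.com

CLC number: TP391.4

Document code: A

Article ID: 1006-8961(2020)01-0102-11

Abstract

Objective In deep learning-based object detection models, shallow feature maps contain more detail but lack semantic information, whereas deep feature maps are the opposite. To exploit the strengths of feature maps at different depths and, on that basis, address the multi-scale problem of detection targets, this paper proposes a single-stage object detection model based on a convolution filter pyramid and atrous convolution. Method The proposed model fuses feature information in several ways. It first fuses multiple feature maps of different sizes by pixel-wise addition and then concatenates feature maps from different stages along the channel dimension, forming an information-fused feature layer, rich in both semantic and detailed information, that serves as the prediction layer. A convolution filter pyramid is introduced into the anchor mechanism to handle multi-scale targets, atrous convolution is used to avoid the extra parameters that large filters would add, and the number of anchors is reduced in a reasoned way. Result On the PASCAL VOC2007 test set, the proposed framework reaches 79.3% mAP (mean average precision) with 300×300 pixel input, 1.8 percentage points higher than SSD (single shot multibox detector) and 0.9 percentage points higher than DSSD (deconvolutional single shot detector). On the UCAS-AOD remote sensing test set, its detection accuracy is 2.8 and 1.9 percentage points higher than that of SSD and DSSD, respectively. The model runs at 21 frames/s on a Titan X GPU, faster than DSSD. Conclusion By fusing feature information in two stages and improving the anchor mechanism, the model achieves fast detection and high accuracy, and it copes better with small targets and overlapping targets that are otherwise hard to detect.

Extended abstract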

Objective Object detection is a fundamental topic in computer vision. Deep learning-based object detection networks consist of two basic parts: a feature extraction module and an object detection module. Convolutional neural networks (CNNs) are used to extract image features. Deep feature maps are rich in object semantic information and sensitive to category information, but they lack detailed information and are insensitive to position, translation, and rotation; they are widely used in classification tasks. Shallow feature maps, in contrast, are rich in detailed information and sensitive to location, translation, and rotation, but they lack semantic information and are insensitive to category information. The two main subtasks of object detection are classification and localization. The former classifies the candidate regions and requires the semantic information of the object, whereas the latter locates the candidate regions and requires detailed information (e.g., location). In the anchor mechanism of the faster region-based CNN (Faster R-CNN), each anchor point of the prediction feature map corresponds to nine anchors with different sizes and ratios. A 1×1 convolution filter is used to predict the positions and confidence scores (i.e., the probability that the object contained in the anchor box belongs to a certain category) of these anchors. Therefore, the same feature region on the feature map is used to predict all anchors at an anchor point, regardless of their sizes, which results in a mismatch between the feature region used in prediction and the corresponding anchor. To utilize the advantages of feature maps with different depths and to overcome the mismatch problem in the anchor mechanism so that multi-scale objects are detected more accurately, we present a single-stage object detection model using a convolution filter pyramid and atrous convolution.

Method Feature information is fused in two ways. First, multiple convolutional layers are added after the feature extraction network, and the feature information in these layers is fused layer by layer (from the deep layers to the shallow ones) through pixel-by-pixel addition, thereby forming feature maps with rich semantic and detailed information. Second, to further enhance the fusion, feature maps from different stages are concatenated with the fused feature maps obtained in the previous step. To address the mismatch between the feature region used for prediction and the corresponding anchor, this study introduces a convolution filter pyramid structure into the anchor mechanism to detect objects of different sizes. Consequently, anchors of different sizes are predicted by convolution filters of different sizes, whereas anchors of equal size but different ratios share the same filter size. This alleviates the mismatch problem. In addition, because large convolution filters increase the number of parameters, the model uses atrous convolution to design filters with different receptive fields and thus keep the time complexity down. Under the action of convolution filters of different sizes, prediction tensors (i.e., feature maps) of different resolutions are generated from the feature maps with rich semantic and detailed information. The model determines the number of anchors according to the generated prediction tensors: small anchors, which correspond to small objects, are numerous, whereas large anchors, which correspond to large objects, are few, thereby reducing the total number of anchors.

Result The proposed method was tested and evaluated on the PASCAL visual object classes (VOC) and UCAS-AOD remote sensing datasets. The code was implemented on the Caffe deep learning framework, using some components of the open-source Caffe implementations of the single-shot multibox detector (SSD) and the deconvolutional single-shot detector (DSSD). All experiments were performed on an HP workstation with a Titan X GPU. SSD was used as the pre-training model of the proposed method. The model was fine-tuned on PASCAL VOC and UCAS-AOD, and its performance was evaluated using the mean average precision (mAP) on the VOC2007 and UCAS-AOD test sets. The proposed method was then compared with other advanced deep learning object detection methods in terms of mAP and detection speed. Experimental results show that on the PASCAL VOC2007 test set, the proposed model achieves 79.3% mAP for an input size of 300×300 pixels, which is higher than SSD and DSSD by 1.8 and 0.9 percentage points, respectively. On the UCAS-AOD remote sensing dataset, the proposed model obtains 91.0% mAP, which is 2.8 and 1.9 percentage points higher than SSD and DSSD, respectively. The testing speed of the model is 21 frames per second on a Titan X GPU, which is faster than DSSD.

Conclusion In this study, a single-stage object detection model using a convolution filter pyramid and atrous convolution is proposed. First, feature information is merged through pixel-by-pixel addition and channel concatenation to form a feature map with rich semantic and detailed information. This feature map is used as the prediction feature map, providing rich feature information for predicting bounding box categories and locations. Then, the convolution filter pyramid structure is introduced into the anchor mechanism to overcome the mismatch between the feature region and the corresponding anchor and to detect multi-scale objects more accurately. At the same time, atrous convolution is introduced to increase the receptive field of the convolution filter without increasing the number of parameters. The number of anchors is determined according to the generated prediction tensors to reduce the time complexity. Owing to the effective information fusion and the convolution filter pyramid in the anchor mechanism, the proposed model offers fast detection and high accuracy compared with current advanced methods, and it handles small objects and overlapping objects particularly well. Although the proposed method performs well in both speed and accuracy, its accuracy can still be improved relative to two-stage algorithms, mainly because the feature fusion part has been explored only to a limited extent. Future work will investigate feature fusion further to improve the detection accuracy.

Key words

single-stage object detection; feature fusion; convolution filter pyramid; anchor box; atrous convolution

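0 Introduction

Object detection is an active research direction in computer vision, and current work is dominated by deep learning-based methods. A deep learning-based object detection network consists of two basic parts: a feature extraction module and an object detection module. When a convolutional neural network (CNN) is used to extract image features, deep feature maps carry rich object semantic information and are sensitive to category information but lack detail; they are used mostly in classification tasks. Shallow feature maps carry rich detail and are sensitive to position, translation, and rotation but lack semantic information. Object detection involves both classifying and localizing objects: the former classifies candidate regions and needs object semantic information, whereas the latter localizes candidate regions and needs detailed information such as position. To improve detection performance, feature information from different depths is therefore often fused to benefit both classification and localization.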

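According to whether feature information is fused, object detection networks fall into two classes: those without feature fusion and those with it. Among networks without feature fusion, one group predicts on a single feature map, such as the two-stage methods Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015) and the single-stage methods YOLO (you only look once) (Redmon et al., 2016) and YOLOv2 (Redmon and Farhadi, 2017). Another group predicts on multiple feature maps, such as SSD (single-shot multibox detector) (Liu et al., 2016) and MS-CNN (multi-scale CNN) (Cai et al., 2016). Among networks with feature fusion, one group predicts on a single fused feature map, such as HyperNet (Kong et al., 2016) and ION (inside-outside net) (Bell et al., 2016), which fuse features from different levels by concatenation. Another group predicts on multiple fused feature maps: DSSD (deconvolutional single-shot detector) (Fu et al., 2017) fuses information by pixel-wise multiplication, whereas YOLOv3 (Redmon and Farhadi, 2018), FPN (feature pyramid network) (Lin et al., 2017), and Mask R-CNN (He et al., 2017) fuse information by pixel-wise addition.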

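To exploit the strengths of feature maps at different depths, this paper fuses feature information in two stages. First, several convolutional layers are added after the feature extraction network, and their feature information is fused layer by layer, from deep to shallow, by pixel-wise addition, forming feature maps that are rich in both semantic and detailed information. Second, to strengthen the fusion further, feature maps from different stages are concatenated along the channel dimension with the fused feature map obtained in the first step, forming a feature map with even richer semantic and detailed information.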

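Existing solutions to the difficult problem of multi-scale object detection fall into three main categories:

The first uses an image pyramid network (features are extracted from images of different sizes). The object detection algorithm SNIP (scale normalization for image pyramids) (Singh and Davis, 2018) and the face detection algorithm HR (hybrid-resolution) (Hu and Ramanan, 2017) both use image pyramids and achieve good results. The drawback is high time complexity; to reduce it, a sparse image pyramid with only three input image sizes can be used.

The second uses an anchor mechanism on a single feature map to handle multi-scale objects. For example, Faster R-CNN uses an RPN (region proposal network) on the deepest feature map to extract candidate regions; to detect objects at different scales, the RPN predicts nine anchors of different sizes and ratios at every anchor point of the feature map. R-FCN (region-based fully convolutional network) (Dai et al., 2016) and YOLOv2 also use an anchor mechanism for prediction.

The third predicts on a feature map pyramid. The SSD detection network predicts objects of different scales on different feature maps, forming a rudimentary feature map pyramid. DSSD and FPN both predict on feature map pyramids to handle multi-scale objects.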

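All three approaches are effective ways of handling multi-scale objects; this paper improves the anchor mechanism for that purpose. In the RPN anchor mechanism, every anchor point on the prediction feature map corresponds to nine anchors of different sizes and ratios, and a 1×1 convolution filter predicts the positions and confidence scores (i.e., the probability that the object contained in an anchor belongs to a certain category) of these anchors. For the differently sized anchors at one anchor point, the prediction therefore uses the same feature region of the feature map, so the feature region used by the RPN does not match the corresponding anchor region. To address this, this paper introduces a convolution filter pyramid into the anchor mechanism to detect objects of different sizes: anchors of different sizes are assigned filters of different sizes, whereas anchors of the same size but different ratios share the same filter size, which alleviates the mismatch. In addition, because large filters add parameters, the model uses atrous convolution (Chen et al., 2018) to design filters with receptive fields of different sizes and thereby reduce the time complexity. Under filters of different sizes, prediction tensors (i.e., feature maps) of different resolutions are generated from the feature map rich in semantic and detailed information. The model sets the number of anchors according to the generated prediction tensors, so that small objects correspond to small, numerous anchors and large objects to large, sparse anchors, which reduces the number of anchors in a reasoned way.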

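The proposed single-stage object detection model based on a convolution filter pyramid and atrous convolution (AFP-SSD) makes two contributions. First, it fuses feature information by pixel-wise addition and channel concatenation to form a feature map rich in semantic and detailed information; used as the prediction feature map, it supplies sufficient feature information for prediction. Second, it introduces a convolution filter pyramid into the anchor mechanism to resolve the mismatch between anchors and feature regions and thus detect multi-scale objects more accurately. Atrous convolution enlarges the receptive field of the filters without adding parameters, and the number of anchors is determined from the generated prediction tensors, which reduces the time complexity.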

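1 AFP-SSD model

Deep learning-based object detection models use the feature extraction module of a classification network as the base network. Strong classification networks include VGGNet (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016). Weighing performance against speed, this paper selects VGG16 as the base network of AFP-SSD.

The proposed object detection model is shown in Fig. 1. The blue feature maps are the original SSD prediction layers, and the blue arrows denote bilinear interpolation that increases the resolution of a feature map; + denotes fusion with the preceding feature map by pixel-wise addition; the green feature maps and arrows denote the concatenation of feature maps from different stages; F denotes the filter size and D the atrous (dilation) coefficient.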

Fig. 1 AFP-SSD object detection framework

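1.1 Feature information fusion module

The fusion process of AFP-SSD starts from the deepest of the original SSD prediction layers. The resolution of the feature map is first increased by bilinear interpolation, and the result is then fused with the preceding (shallower) feature map by pixel-wise addition. This upsample-and-add step is repeated layer by layer up to the shallowest original SSD prediction layer, yielding a feature map that contains both detailed and semantic information; this step corresponds to the blue feature maps and their connections in the feature information fusion module of Fig. 1. In addition, feature maps from different stages are concatenated to further enrich the semantic and detailed information of the prediction feature map; this step corresponds to the green feature maps and their connections in Fig. 1. The feature maps fused in these different ways are then concatenated along the channel dimension into the final prediction feature map. Because it better preserves both detailed and semantic information, this feature map not only benefits the detection of large objects but also strengthens the model's ability to detect small objects.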
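The data flow of the two fusion steps can be sketched in plain Python. This is only an illustration under simplifying assumptions: the maps are single-channel, their resolutions are the usual SSD300 prediction-layer sizes (38, 19, 10, 5, 3, 1), the convolutions that a real network needs to align channel counts are omitted, the interpolation maps corner pixels onto corner pixels, and the helper names (`bilinear_resize`, `top_down_fuse`, `concat_channels`) and the extra `stage_map` are hypothetical. It is not the authors' Caffe implementation.

```python
def bilinear_resize(fmap, size):
    """Resize a square 2-D map (list of rows) to size x size by bilinear interpolation."""
    src = len(fmap)
    if src == 1:  # a 1x1 map is simply broadcast
        return [[fmap[0][0]] * size for _ in range(size)]
    scale = (src - 1) / (size - 1) if size > 1 else 0.0
    out = []
    for i in range(size):
        y = i * scale
        y0 = min(int(y), src - 2)
        wy = y - y0
        row = []
        for j in range(size):
            x = j * scale
            x0 = min(int(x), src - 2)
            wx = x - x0
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x0 + 1] * wx
            bottom = fmap[y0 + 1][x0] * (1 - wx) + fmap[y0 + 1][x0 + 1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def add_maps(a, b):
    """Pixel-wise addition of two maps with the same resolution."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(pyramid):
    """pyramid: maps ordered from shallow to deep; returns the fused shallowest map."""
    fused = pyramid[-1]
    for shallower in reversed(pyramid[:-1]):
        fused = add_maps(bilinear_resize(fused, len(shallower)), shallower)
    return fused


def concat_channels(*maps):
    """Channel concatenation: same-resolution maps become a list of channels."""
    return list(maps)


SIZES = [38, 19, 10, 5, 3, 1]  # SSD300 prediction-layer resolutions (assumed)
pyramid = [[[1.0] * s for _ in range(s)] for s in SIZES]
fused = top_down_fuse(pyramid)                 # step 1: upsample and add
stage_map = [[0.5] * 38 for _ in range(38)]    # hypothetical map from another stage
prediction = concat_channels(fused, stage_map) # step 2: channel concatenation
print(len(prediction), len(fused), len(fused[0]))
```

With all-ones inputs every pixel of `fused` accumulates one unit per level, which makes it easy to see that each of the six layers contributes to the 38×38 prediction map.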

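1.2 Object detection module

1.2.1 Filter pyramid based on atrous convolution

To detect objects of different scales on the feature map rich in semantic and detailed information, a convolution filter pyramid is adopted: filters of different sizes are convolved with the feature map to generate prediction tensors of different sizes, from which the class confidence and position of objects are predicted. Convolving the feature map with filters of different sizes corresponds to convolving receptive fields of different sizes on the original image, which helps detect objects of different sizes. Because large filters bring many parameters and greatly increase the computation of the model, atrous convolution is used in the filter pyramid so that the receptive field grows without increasing the number of parameters. The filter pyramid module based on atrous convolution is shown in Fig. 2. The prediction feature map passes through the different filters to generate prediction tensors; on each group of output tensors, two convolutions are applied first, and the class and position of the bounding boxes are then predicted separately, as shown by the light red and light blue rectangles in the figure.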

Fig. 2 Filter pyramid based on atrous convolution

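The final prediction feature map has a resolution of 38×38 pixels. Several groups of filters of different sizes are designed to cover receptive fields of different sizes on the original image as uniformly as possible, so that objects of different scales are predicted better. The smallest filter size is ${k_{\min }}$ = 3 and the largest is ${k_{\max }}$ = 38. Using $n$ groups of filters of different sizes yields $n$ groups of tensors of different sizes ($n$ = 6). To keep the filter sizes uniformly distributed between 3 and 38, the size of the $m$-th group of filters is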

$ {k_{\min }} + \frac{{{k_{\max }} - {k_{\min }}}}{{n - 1}} \times \left({m - 1} \right) $

(1)
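As a quick worked instance of Eq. (1), the following sketch evaluates the formula for the values stated in the text ($k_{\min}$ = 3, $k_{\max}$ = 38, $n$ = 6). The function name `filter_sizes` is a hypothetical helper introduced here for illustration.

```python
def filter_sizes(k_min=3, k_max=38, n=6):
    """Evaluate Eq. (1) for m = 1..n."""
    step = (k_max - k_min) / (n - 1)
    return [k_min + step * (m - 1) for m in range(1, n + 1)]


sizes = filter_sizes()
print(sizes)  # [3.0, 10.0, 17.0, 24.0, 31.0, 38.0]
```

The ideal sizes are spaced 7 apart. The equivalent filter sizes actually adopted in Table 1 (3, 11, 17, 23, 29, 38) are close to, but not identical with, these values.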

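The resolution of each group of output prediction tensors should likewise satisfy the uniform-distribution requirement, so that the numbers of anchors of different sizes are more effective and reasonable. The size of the $m$-th output prediction tensor is therefore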

$ \left\lceil {38/{2^{m - 1}}} \right\rceil $

(2)

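where $\left\lceil {} \right\rceil $ denotes rounding up. Based on the requirements on the distribution of filter sizes and on the resolution of the output prediction tensors, and verified by experiments, a scheme for defining the filter sizes was designed, as shown in Table 1. $r$ and $d$ denote the actual filter size and the atrous coefficient; $s$ denotes the stride; $p$ denotes the padding; $e$ denotes the size of the ordinary filter that is equivalent to the atrous filter; and $o$ denotes the resolution of the prediction tensor. The value of $e$ satisfies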

$ e = r + \left({r + 1} \right) \times \left({d - 1} \right) $

(3)

The value of $o$ satisfies

$ o = \left\lfloor {\frac{{38 - e + 2p + s}}{s}} \right\rfloor $

(4)

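where $\left\lfloor {} \right\rfloor $ denotes rounding down. Experimental results show that this design meets the requirement that the filters uniformly cover receptive fields of different sizes on the original image. Compared with structures that apply atrous convolutions several times in series, the proposed structure uses several different atrous convolutions in parallel, so the backbone contains only one atrous convolution stage and the gridding effect of atrous convolution is not pronounced. Moreover, the atrous coefficients used here are small and differ from one another, and their greatest common divisor is no greater than 1, which matches the main properties of the HDC (hybrid dilated convolution) module of Wang et al. (2018) and further reduces the gridding effect.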

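Table 1 Designed atrous filter mechanism

Filter	Actual filter size $r$	Atrous coefficient $d$	Stride $s$	Padding $p$	Equivalent ordinary filter size $e$	Prediction tensor resolution $o$
1	3	0	1	1	3	38
2	5	2	2	4	11	18
3	5	3	2	0	17	11
4	5	4	3	0	23	6
5	5	5	3	0	29	3
6	-	-	-	-	38	1
Note: "-" indicates that the parameter does not apply because global average pooling is used.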
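Eqs. (3) and (4) can be evaluated directly on the rows of Table 1; the sketch below does so with two hypothetical helpers, `equivalent_size` and `output_resolution`. It follows Eq. (3) exactly as printed, with the factor $(r+1)$, which is the form that reproduces the $e$ column for rows 2 to 5; the more common convention for dilated filters is $r + (r-1)(d-1)$. Row 1 lists an atrous coefficient of 0, for which Eq. (3) does not return the listed $e$ = 3, so the sketch skips Eq. (3) for that row and uses the listed $e$. Row 6 is global average pooling and is omitted.

```python
# Rows of Table 1 as listed: (filter, r, d, s, p, e, o)
TABLE1 = [
    (1, 3, 0, 1, 1, 3, 38),
    (2, 5, 2, 2, 4, 11, 18),
    (3, 5, 3, 2, 0, 17, 11),
    (4, 5, 4, 3, 0, 23, 6),
    (5, 5, 5, 3, 0, 29, 3),
]


def equivalent_size(r, d):
    """Eq. (3) as printed in the paper."""
    return r + (r + 1) * (d - 1)


def output_resolution(e, s, p, in_size=38):
    """Eq. (4): floor((in_size - e + 2p + s) / s)."""
    return (in_size - e + 2 * p + s) // s


for idx, r, d, s, p, e, o in TABLE1:
    e_calc = equivalent_size(r, d) if d >= 1 else None  # row 1 is not dilated
    o_calc = output_resolution(e, s, p)
    print(f"filter {idx}: e (Eq. 3) = {e_calc}, listed {e}; "
          f"o (Eq. 4) = {o_calc}, listed {o}")
```

Under these assumptions Eq. (4) reproduces the listed resolutions for rows 1 to 4 and yields 4 rather than the listed 3 for row 5. The sketch only evaluates the printed formulas and does not attempt to explain that difference.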

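1.2.2 Anchor mechanism

In SSD, the number of anchors is determined by the resolution of the prediction feature maps: if a feature map has resolution $n \times n$, the input image is divided into $n \times n$ grid cells, and each cell generates four or six anchors of different ratios. If the same scheme were adopted here, every filter would generate the same number of anchors of several ratios on the original image, giving 46 208 (38×38×(2×4+4×6)) anchors for a 300×300 pixel input. As YOLOv2 shows, the mean detection precision actually drops when there are too many anchors. This paper therefore determines the number of anchors from the resolution of the generated prediction tensors in order to reduce it.

The filter pyramid and the anchor mechanism are combined to handle multi-scale objects: a filter of a given size corresponds to several anchors that have the same size but different ratios, and filters of different sizes correspond to anchors of different sizes. Specifically, the same feature map produces prediction tensors of different resolutions under filters of different sizes, and the number of grid cells into which the original image is divided matches the resolution of the prediction tensor. Small objects thus correspond to small, numerous anchors and large objects to large, sparse anchors, which both handles objects of different sizes and reduces the number of anchors in a reasoned way. With this mechanism, a 300×300 pixel input yields 8 576 anchors (38×38×4+18×18×6+10×10×6+6×6×6+3×3×6+1×1×4), close to the 8 732 anchors of SSD, which is reasonable.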
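The anchor counts quoted above are sums of (resolution² × anchors per cell) over the prediction tensors, which the following sketch evaluates. The SSD300 layout is the standard configuration of the SSD paper and is an assumption brought in for comparison; `AFP_AS_PRINTED` copies the terms of the bracketed expression in the text; `total_anchors` is a hypothetical helper.

```python
def total_anchors(layout):
    """layout: list of (resolution, anchors_per_cell) pairs."""
    return sum(res * res * per_cell for res, per_cell in layout)


# Standard SSD300 prediction layers (assumed from the SSD paper).
SSD300 = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
# Terms of the expression printed in the text for AFP-SSD.
AFP_AS_PRINTED = [(38, 4), (18, 6), (10, 6), (6, 6), (3, 6), (1, 4)]

naive = 38 * 38 * (2 * 4 + 4 * 6)  # every filter predicting on a 38x38 grid
print(naive)                          # 46208
print(total_anchors(SSD300))          # 8732
print(total_anchors(AFP_AS_PRINTED))  # 8594
```

The first two values agree with the text. Evaluated term by term, the printed AFP-SSD expression gives 8 594, slightly different from the stated total of 8 576; the sketch does not try to decide which figure is intended. Either way the count is roughly one fifth of the 46 208 anchors of the naive scheme.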

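The anchor sizes are determined from the filter sizes and from experiments. For a filter of size $k$ and an anchor of ratio $r$ (here $r$ denotes the aspect ratio, not the filter size of Table 1), the width $W$ and height $H$ of the anchor are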

$ W{\rm{ = }}k \times \alpha \times \sqrt r $

(5)

$ H{\rm{ = }}k \times \alpha /\sqrt r $

(6)

where $\alpha$ is a hyperparameter set according to the actual situation. For a ratio of 1, one additional anchor is added, namely

$ W{\rm{ = }}H{\rm{ = }}\alpha \sqrt {k \times \left({k + 7} \right)} $

(7)

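Consequently, filters of different sizes have anchors of different sizes, whereas the anchors of a given filter share the same size and differ only in ratio.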
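Eqs. (5) to (7) translate into a few lines of Python. In this sketch the value of `ALPHA`, the ratio list, and the helper name `anchor_shapes` are hypothetical choices made for illustration; the paper only says that $\alpha$ is set according to the actual situation.

```python
import math


def anchor_shapes(k, ratios, alpha):
    """Return (W, H) pairs for a filter of size k, following Eqs. (5)-(7)."""
    shapes = []
    for r in ratios:
        # Eqs. (5) and (6): same area (k * alpha)^2, aspect ratio W / H = r.
        shapes.append((k * alpha * math.sqrt(r), k * alpha / math.sqrt(r)))
        if r == 1:
            # Eq. (7): one extra square anchor for ratio 1.
            side = alpha * math.sqrt(k * (k + 7))
            shapes.append((side, side))
    return shapes


ALPHA = 8.0  # hypothetical value of the hyperparameter alpha
shapes = anchor_shapes(k=10, ratios=[1, 2, 0.5], alpha=ALPHA)
for w, h in shapes:
    print(f"{w:7.2f} x {h:7.2f}")
```

All anchors produced by Eqs. (5) and (6) for one filter have the same area and differ only in aspect ratio. The constant 7 in Eq. (7) equals the spacing between consecutive filter sizes given by Eq. (1), so the extra anchor resembles the geometric mean of two neighbouring scales used in SSD; this reading is an interpretation rather than something the paper states.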

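1.3 Model training

AFP-SSD uses SSD as the pre-training model to initialize its parameters. As in SSD, data augmentation is applied to improve detection accuracy and robustness. When anchors are matched to ground-truth labels, each ground-truth label is matched to any anchor whose IOU (intersection over union) with it exceeds 0.5; if a ground-truth label has no match, it is matched to the anchor with the largest IOU. Among the unmatched anchors, those ranked highest by predicted confidence are selected as negative samples so that the ratio of negative to positive samples is 3:1. The loss function is defined as in SSD, as the sum of a smooth L1 localization loss and a softmax classification loss.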
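The matching rule and the 3:1 negative sampling can be sketched as follows. The box format `(x_min, y_min, x_max, y_max)`, the helper names, the tie-breaking when one anchor overlaps several ground-truth boxes (the first one wins here), and the meaning of the ranking score (assumed to be higher for harder negatives) are all assumptions of this sketch, not details given in the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """Return {anchor_index: gt_index} for the positive anchors."""
    matches = {}
    for g, gt in enumerate(gt_boxes):
        overlaps = [iou(a, gt) for a in anchors]
        hits = [i for i, v in enumerate(overlaps) if v > threshold]
        if not hits:  # no anchor above the threshold: take the best one
            hits = [max(range(len(anchors)), key=overlaps.__getitem__)]
        for i in hits:
            matches.setdefault(i, g)
    return matches


def hard_negatives(scores, positives, ratio=3):
    """Pick the top-ranked unmatched anchors, at most ratio x number of positives."""
    negatives = [i for i in range(len(scores)) if i not in positives]
    negatives.sort(key=lambda i: scores[i], reverse=True)
    return negatives[: ratio * len(positives)]


anchors = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30),
           (0, 0, 4, 4), (40, 40, 50, 50), (21, 21, 29, 29)]
gt_boxes = [(1, 1, 10, 10), (22, 22, 40, 40)]
scores = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6]  # hypothetical ranking scores

matches = match_anchors(anchors, gt_boxes)
print(matches)                          # {0: 0, 2: 1}
print(hard_negatives(scores, matches))  # [3, 5, 1, 4]
```

The first ground-truth box is matched through the IOU threshold (IOU 0.81 with anchor 0), while the second has no anchor above 0.5 and falls back to its best anchor, which is the second branch of the rule described above.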

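2 Experimental results and analysis

The proposed method was tested and evaluated on PASCAL VOC and on the remote sensing dataset UCAS-AOD (Zhu et al., 2015). The code was implemented in the Caffe deep learning framework and uses some components of the open-source Caffe implementations of SSD and DSSD. All experiments were run on an HP workstation with one Titan X GPU. With SSD as the pre-training model, the model was fine-tuned on PASCAL VOC and on UCAS-AOD separately, and performance was evaluated with the mAP (mean average precision) metric on the VOC2007 and UCAS-AOD test sets. The method is compared with other current advanced deep learning object detection methods in terms of mAP and detection speed.

Precision is defined as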

$ P = \frac{{TP}}{{TP + FP}} $

(8)

and recall as

$ R = \frac{{TP}}{{TP + FN}} $

(9)

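where $TP$ is the number of positives classified as positive, $FN$ the number of positives classified as negative, $FP$ the number of negatives classified as positive, and $TN$ the number of negatives classified as negative. $AP$ is defined as the mean of the maximum precision at 11 recall levels [0, 0.1, 0.2, …, 1]: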

$ AP = \frac{1}{{11}}\sum\limits_{r \in \left\{ {0, 0.1, \cdots, 1} \right\}} {{P_{\max }}\left(r \right)} $

(10)

where ${{P_{\max }}\left(r \right)}$ is the maximum precision at recall $r$. $AP$ evaluates a single object class, and mAP is the mean of $AP$ over all object classes.
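Eq. (10) can be sketched as below. The sketch reads $P_{\max}(r)$ as the maximum precision over all operating points whose recall is at least $r$, which is the convention of the PASCAL VOC2007 protocol; the paper's wording is terser, so this reading is an assumption. The function names and the small precision-recall curve are hypothetical.

```python
def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP for one class, following Eq. (10)."""
    total = 0.0
    for i in range(11):
        level = i / 10
        # Maximum precision among points with recall >= level (0 if none).
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.8, 0.6, 0.5, 0.4]
ap = average_precision_11pt(recalls, precisions)
print(round(ap, 4))  # 0.6909
```

For this toy curve the 11 sampled precisions are 1.0 (three times), then 0.8, 0.6, 0.5, and 0.4 (twice each), so AP = 7.6/11.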

2.1 PASCAL VOC 2007

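Because DSSD uses ResNet-101 as its base network for feature extraction, a VGG-16 version of DSSD was built for comparison and trained with the training strategy of the original DSSD paper. To allow comparison with other advanced algorithms on the same training and test sets, AFP-SSD was trained on the union of the PASCAL VOC2007 and PASCAL VOC2012 training sets and evaluated on the PASCAL VOC2007 test set. Training first freezes all weights of the original SSD model and trains only the added network parts, with a learning rate of 0.001 for the first 70 k iterations and 0.000 1 for the next 30 k iterations. The whole network is then fine-tuned with a learning rate of 0.001 for the first 20 k iterations and 0.000 1 for the last 20 k iterations.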
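The two-stage schedule described above can be written down as a small lookup table. The sketch assumes that iterations are counted cumulatively across both stages (140 k in total); the paper restarts the count for the fine-tuning stage, and the names `SCHEDULE` and `phase_at` are hypothetical.

```python
# (trainable part, number of iterations, learning rate)
SCHEDULE = [
    ("added layers only (SSD weights frozen)", 70_000, 1e-3),
    ("added layers only (SSD weights frozen)", 30_000, 1e-4),
    ("whole network", 20_000, 1e-3),
    ("whole network", 20_000, 1e-4),
]


def phase_at(iteration):
    """Return (trainable part, learning rate) for a cumulative iteration index."""
    start = 0
    for name, length, lr in SCHEDULE:
        if iteration < start + length:
            return name, lr
        start += length
    raise ValueError("iteration beyond the end of the schedule")


print(phase_at(0))        # ('added layers only (SSD weights frozen)', 0.001)
print(phase_at(100_000))  # ('whole network', 0.001)
```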

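Table 2 compares the number of parameters of part of the model with and without atrous convolution (the backbone is the same in both cases and its parameters are not counted). Using atrous convolution reduces the number of parameters markedly, from 123.68×10⁶ to 28.67×10⁶.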

Table 2 Number of model parameters comparison

Network	Atrous convolution used	Number of parameters/10⁶
AFP	No	123.68
AFP	Yes	28.67

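Table 3 lists the detection results of the proposed algorithm and of current advanced algorithms on the PASCAL VOC2007 test set. All algorithms in the table use VGG-16 as the base network; the results for SSD300, DSSD300, and AFP-SSD were obtained in our experiments, and the other results are taken from the original papers. With 300×300 pixel input and the VGG-16 base network, AFP-SSD reaches 79.3% mAP. Compared with two-stage methods, AFP-SSD is 9.3 percentage points higher than Fast R-CNN (Girshick, 2015), which predicts on a single feature map; 6.1 points higher than Faster R-CNN (Ren et al., 2015), which uses an anchor mechanism; 3.7 points higher than ION (Bell et al., 2016), which fuses feature information by channel concatenation; and 1.1 points higher than MR-CNN (Gidaris and Komodakis, 2015). Compared with single-stage methods with the same input size and base network, AFP-SSD is 1.8 points higher than SSD (Liu et al., 2016), which predicts on multiple feature maps, and 0.9 points higher than DSSD (Fu et al., 2017), which predicts on a feature map pyramid. AFP-SSD also outperforms several improved variants of SSD and DSSD: it is 0.7 and 0.4 points higher than MDSSD300 (Xu et al., 2018) and Feature-fused SSD (Cao et al., 2018), respectively, both of which fuse information by pixel-wise addition and predict on multiple feature maps; on average 0.6 points higher than FSSD300 (Li and Zhou, 2017) and RSSD300 (Jeong et al., 2017); and 1.6 points higher than DSOD300 (Shen et al., 2017). These results indicate that the proposed detection model performs better.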

Table 3 Results of AFP-SSD and other advanced algorithms on PASCAL VOC2007 test set

Method	mAP	aero	bike	bird	boat	bottle	bus	car	cat	chair	cow	table	dog	horse	mbike	person	plant	sheep	sofa	train	tv
Fast R-CNN	70.0	77.0	78.1	69.3	59.4	38.3	81.6	78.6	86.7	42.8	78.8	68.9	84.7	82.0	76.6	69.9	31.8	70.1	74.8	80.4	70.4
Faster R-CNN	73.2	76.5	79.0	70.9	65.5	52.1	83.1	84.7	86.4	52.0	81.9	65.7	84.8	84.6	77.5	76.7	38.8	73.6	73.9	83.0	72.6
ION	75.6	79.2	83.1	77.6	65.6	54.9	85.4	85.1	87.0	54.4	80.6	73.8	85.3	82.2	82.2	74.4	47.1	75.8	72.7	84.2	80.4
MR-CNN	78.2	80.3	84.1	78.5	70.8	68.5	88.0	85.9	87.8	60.3	85.2	73.7	87.2	86.5	85.0	76.4	48.5	76.3	75.5	85.0	81.0
SSD300	77.5	79.5	83.9	76.0	69.6	50.5	87.0	85.7	88.1	60.3	81.5	77.0	86.1	87.5	83.9	79.4	52.3	77.9	79.5	87.6	76.8
DSSD300	78.4	80.2	87.1	78.0	71.8	50.3	86.0	86.3	88.3	62.2	81.9	78.0	86.4	87.2	86.1	79.1	54.6	79.8	79.0	87.5	78.6
Feature-Fused	78.9	82.0	86.5	78.0	71.7	52.9	86.6	86.9	88.3	63.2	83.0	76.8	86.1	88.5	87.5	80.4	53.9	80.6	79.5	88.2	77.9
MDSSD300	78.6	86.5	87.6	78.9	70.6	55.0	86.9	87.0	88.1	58.5	84.8	73.4	84.8	89.2	88.1	78.0	52.3	78.6	74.5	86.8	80.7
AFP-SSD	79.3	82.1	85.4	77.9	68.8	51.6	88.0	86.6	88.6	64.8	83.7	80.2	87.1	87.5	85.9	79.4	53.8	82.2	83.7	89.0	79.7
Note: bold indicates the best result.
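Since mAP is the mean of the per-class AP values, the mAP column of Table 3 can be recomputed from the 20 class columns. The short check below does this for the AFP-SSD row; the variable names are hypothetical and the numbers are copied from the table.

```python
# Per-class AP of AFP-SSD from Table 3 (aero ... tv).
AFP_SSD_AP = [82.1, 85.4, 77.9, 68.8, 51.6, 88.0, 86.6, 88.6, 64.8, 83.7,
              80.2, 87.1, 87.5, 85.9, 79.4, 53.8, 82.2, 83.7, 89.0, 79.7]
SSD300_MAP = 77.5  # mAP of SSD300 from Table 3

m_ap = sum(AFP_SSD_AP) / len(AFP_SSD_AP)
print(round(m_ap, 1))               # 79.3
print(round(m_ap - SSD300_MAP, 1))  # 1.8
```

The recomputed value agrees with the reported 79.3% mAP and with the 1.8-point margin over SSD300.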

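Table 4 lists the detection results of the ResNet-101 versions of several current advanced algorithms on the PASCAL VOC2007 test set; these results are taken from the original papers. AFP-SSD is 2.9 percentage points higher than Faster R-CNN (Ren et al., 2015), 2.2 points higher than SSD (Liu et al., 2016), and 0.7 points higher than DSSD (Fu et al., 2017). The results show that a deeper model helps detection accuracy, but once the network reaches a certain depth, adding further layers brings no additional gain. AFP-SSD is slightly less accurate than R-FCN (Dai et al., 2016) and the ResNet-101 version of DSSD with 512×512 pixel input, but the test-time comparison in Fig. 3 shows that R-FCN runs at 9 frames/s and DSSD at 5.5 frames/s, far below the 21 frames/s of the proposed model.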

Table 4 Results of other advanced methods of ResNet-101 on PASCAL VOC2007 test set

Method	mAP	aero	bike	bird	boat	bottle	bus	car	cat	chair	cow	table	dog	horse	mbike	person	plant	sheep	sofa	train	tv
Faster R-CNN	76.4	79.8	80.7	76.2	68.3	55.9	85.1	85.3	89.8	56.7	87.8	69.4	88.3	88.9	80.9	78.4	41.7	78.6	79.8	85.3	72.0
R-FCN	80.5	79.9	87.2	81.5	72.0	69.8	86.8	88.5	89.8	67.0	88.1	74.5	89.8	90.6	79.9	81.2	53.7	81.8	81.5	85.9	79.9
SSD321	77.1	76.3	84.6	79.3	64.6	47.2	85.4	84.0	88.8	60.1	82.6	76.9	86.7	87.2	85.4	79.1	50.8	77.2	82.6	87.3	76.6
DSSD321	78.6	81.9	84.9	80.5	68.4	53.9	85.6	86.2	88.9	61.1	83.5	78.7	86.7	88.7	86.7	79.7	51.7	78.0	80.9	87.2	79.4
Note: bold indicates the best result.

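The test speed of AFP-SSD was evaluated on 1 000 images with batch_size set to 1, as shown in Fig. 3. The figure also compares several current advanced methods, whose results are taken from the original papers; all results were obtained on Titan X GPUs. All methods were trained on the union of the PASCAL VOC2007 and VOC2012 training sets and tested on the PASCAL VOC2007 test set. With 300×300 pixel input, AFP-SSD runs at 21 frames/s, which is faster than the two-stage methods and DSSD but slower than SSD, and its detection accuracy is higher than that of SSD, DSSD, and the related improved single-stage detectors.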

Fig. 3 Comparison of accuracy and speed on PASCAL VOC2007 test set

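2.2 UCAS-AOD dataset

The UCAS-AOD remote sensing dataset contains 1 000 images with aircraft, all with a resolution of 1 280×659 pixels, which were randomly split into training and test sets at a ratio of 7:3. When the data were converted to the lmdb format readable by Caffe, the images and their bounding box labels were rescaled so that every model input is 300×300 pixels. SSD, DSSD, and AFP-SSD were all initialized with parameters trained on the VOC dataset and fine-tuned on the remote sensing training set, with a learning rate of 0.001 for the first 60 k iterations and 0.000 1 for the next 20 k iterations. The detection results of each method are given in Table 5. The AP (average precision) of AFP-SSD is 2.8 and 1.9 percentage points higher than that of SSD (Liu et al., 2016) and DSSD (Fu et al., 2017), respectively.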

Table 5 Results of AFP-SSD and comparison methods on remote sensing dataset

Method	Network	Input image size/pixels	AP
SSD300	VGG	300×300	88.2
DSSD300	VGG	300×300	89.1
AFP-SSD	VGG	300×300	91.0
Note: bold indicates the best result.

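Figure 4 shows some detection results of SSD and AFP-SSD on the PASCAL VOC2007 test set; columns 1 and 3 are the results of SSD, and columns 2 and 4 are those of AFP-SSD. Figure 5 shows some detection results of the two models on the UCAS-AOD remote sensing dataset, where Fig. 5(a) and (b) are the results of SSD and AFP-SSD, respectively. Only detections with confidence above 0.8 are marked. AFP-SSD evidently handles overlapping objects and small objects better than SSD.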

Fig. 4 Results of SSD and AFP-SSD on PASCAL VOC2007 test set

Fig. 5 Results of SSD and AFP-SSD in UCAS-AOD dataset ((a) SSD; (b) AFP-SSD)

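3 Conclusion

This paper proposes a single-stage object detection model that uses a convolution filter pyramid and atrous convolution. Feature information is first fused by pixel-wise addition and channel concatenation to form a feature map rich in semantic and detailed information, which serves as the prediction feature map and supplies rich feature information for predicting the class and position of bounding boxes. A convolution filter pyramid is then introduced into the anchor mechanism to overcome the mismatch between anchors and their feature regions and thus detect multi-scale objects more accurately. Atrous convolution enlarges the receptive field of the filters without adding parameters, and the number of anchors is determined from the generated prediction tensors, which reduces the time complexity. Owing to the effective information fusion and the filter pyramid in the anchor mechanism, the model offers fast detection and high accuracy compared with current advanced methods and, in particular, copes better with hard-to-detect small objects and overlapping objects.

Although the proposed algorithm performs well in both detection speed and accuracy, its accuracy still leaves room for improvement compared with two-stage algorithms, mainly because its treatment of feature fusion is relatively limited. Future work will study feature fusion further to improve detection accuracy.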

References

Bell S, Zitnick C L and Bala K. 2016. Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 2874-2883[DOI:10.1109/CVPR.2016.314]

Cai Z W, Fan Q F and Feris R S. 2016. A unified multi-scale deep convolutional neural network for fast object detection//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 354-370[DOI:10.1007/978-3-319-46493-0_22]

Cao G M, Xie X M and Yang W Z. 2018. Feature-fused SSD: fast detection for small objects//Proceedings of the 9th International Conference on Graphic and Image Processing. Qingdao, China: SPIE: 10615[DOI:10.1117/12.2304811]

Chen L C, Papandreou G and Kokkinos I. 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848[DOI:10.1109/TPAMI.2017.2699184]

Dai J F, Li Y and He K M. 2016. R-FCN: object detection via region-based fully convolutional networks//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc: 379-387

Fu C Y, Liu W and Ranga A. 2017. DSSD: deconvolutional single shot detector. 2017-01-23[2019-05-01]. https://arxiv.org/pdf/1701.06659.pdf

Gidaris S and Komodakis N. 2015. Object detection via a multi-region and semantic segmentation-aware CNN model//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1134-1142[DOI:10.1109/ICCV.2015.135]

Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE[DOI:10.1109/ICCV.2015.169]

He K M, Gkioxari G and Dollár P. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2980-2988[DOI:10.1109/ICCV.2017.322]

He K M, Zhang X Y and Ren S Q. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 770-778[DOI:10.1109/CVPR.2016.90]

Hu P Y and Ramanan D. 2017. Finding tiny faces//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 951-959[DOI:10.1109/CVPR.2017.166]

Jeong J, Park H and Kwak N. 2017. Enhancement of SSD by concatenating feature maps for object detection. 2017-05-26[2019-04-23]. https://arxiv.org/pdf/1705.09587.pdf

Kong T, Yao A B and Chen Y R. 2016. HyperNet: towards accurate region proposal generation and joint object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 845-853[DOI:10.1109/CVPR.2016.98]

Li Z X and Zhou F Q. 2017. FSSD: feature fusion single shot multibox detector. 2017-12-04[2018-04-23]. https://arxiv.org/pdf/1712.00960.pdf

Lin T Y, Dollár P and Girshick R. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 936-944[DOI:10.1109/CVPR.2017.106]

Liu W, Anguelov D and Erhan D. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 21-37[DOI:10.1007/978-3-319-46448-0_2]

Redmon J, Divvala S and Girshick R. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 779-788[DOI:10.1109/CVPR.2016.91]

Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 7263-7271[DOI:10.1109/CVPR.2017.690]

Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL].2018-04-08[2018-04-23]. https://arxiv.org/pdf/1804.02767.pdf

Ren S Q, He K M and Girshick R. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 91-99

Shen Z Q, Liu Z and Li J G. 2017. DSOD: learning deeply supervised object detectors from scratch//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1937-1945[DOI:10.1109/ICCV.2017.212]

Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition[EB/OL]. 2014-09-04[2015-04-23]. https://arxiv.org/pdf/1409.1556.pdf

Singh B and Davis L S. 2018. An analysis of scale invariance in object detection-SNIP//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 3578-3587[DOI:10.1109/CVPR.2018.00377]

Wang P Q, Chen P F and Yuan Y. 2018. Understanding convolution for semantic segmentation//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, NV, USA: IEEE: 1451-1460[DOI:10.1109/WACV.2018.00163]

Xu M L, Cui L S and Lv P. 2018. MDSSD: multi-scale deconvolutional single shot detector for small objects. 2018-05-18[2018-08-19]. https://arxiv.org/pdf/1805.07009.pdf

Zhu H G, Chen X G and Dai W Q. 2015. Orientation robust object detection in aerial images using deep convolutional neural network//Proceedings of 2015 IEEE International Conference on Image Processing. Quebec City, Canada: IEEE: 3735-3739[DOI:10.1109/ICIP.2015.7351502]