Single-stage object detection using a convolution kernel pyramid and atrous convolution

Liu Tao, Wang Xili (School of Computer Science, Shaanxi Normal University, Xi'an 710119)

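Abstract
Objective In deep learning-based object detection models, shallow feature maps contain more detail but lack semantic information, whereas deep feature maps are the opposite. To exploit the strengths of feature maps at different depths and, on that basis, to address the multi-scale nature of detection targets, this paper proposes a single-stage object detection model based on a convolution kernel pyramid and atrous convolution. Method The proposed model fuses feature information in several ways. It first fuses multiple layers of feature maps of different sizes by pixel-wise addition, and then concatenates feature maps from different stages along the channel dimension, forming an information-fused feature layer, rich in both semantic and detailed information, that serves as the prediction layer of the model. The model introduces a convolution kernel pyramid structure into the anchor mechanism to address the multi-scale problem of detection targets, uses atrous convolution to offset the parameters added by large convolution kernels, and reasonably reduces the number of anchors. Result Experimental results show that, on the PASCAL VOC2007 test set, the proposed detection framework achieves a detection accuracy of 79.3% mAP (mean average precision) with a 300×300-pixel input, 1.8% higher than SSD (single shot multibox detector) and 0.9% higher than DSSD (deconvolutional single shot detector). On the UCAS-AOD remote sensing test set, the detection accuracy of the proposed model is 2.8% and 1.9% higher than that of SSD and DSSD, respectively. In terms of detection speed, the proposed model reaches 21 frames/s on a Titan X GPU, faster than DSSD. Conclusion The proposed model fuses feature information in two stages and improves the anchor mechanism. It achieves both fast detection and high accuracy, and it better handles small objects and overlapping objects that are otherwise difficult to detect.
Keywords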
Single-stage object detection using filter pyramid and atrous convolution

Liu Tao, Wang Xili (School of Computer Science, Shaanxi Normal University, Xi'an 710119, China)

Abstract
Objective Object detection is a fundamental topic in computer vision. Deep learning-based object detection networks consist of two basic parts: a feature extraction module and an object detection module. Convolutional neural networks (CNNs) are used to extract image features. Deep feature maps are rich in semantic information and sensitive to category, but they lack detail and are insensitive to position, translation, and rotation; they are therefore widely used in classification tasks. Shallow feature maps, by contrast, are rich in detail and sensitive to location, translation, and rotation, but they lack semantic information and are insensitive to category. The two main subtasks of object detection are classification and localization. The former classifies candidate regions and requires the semantic information of the object, whereas the latter locates candidate regions and requires detailed information (e.g., location). In the anchor mechanism of Faster region-based CNN (Faster R-CNN), each anchor point of the prediction feature map corresponds to nine anchors with different sizes and ratios. A 1×1 convolution filter predicts the positions and confidence scores (i.e., the probability that the object in an anchor box belongs to a certain category) of these anchors. Consequently, all anchors at an anchor point, whatever their size, are predicted from the same feature region of the feature map, which creates a mismatch between the feature region used in prediction and the corresponding anchor. To exploit the advantages of feature maps at different depths and to overcome this mismatch so that multi-scale objects are detected accurately, we present a single-stage object detection model using a convolution filter pyramid and atrous convolution.
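
The nine-anchors-per-point layout described above can be sketched as follows. This is a minimal illustration and not the authors' code: the three scales (128, 256, 512 pixels) and three aspect ratios (1:2, 1:1, 2:1) are assumed values commonly quoted for Faster R-CNN, and the function name `anchors_at` is hypothetical.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x_min, y_min, x_max, y_max) boxes centred on one anchor point.

    scales and ratios are assumed defaults; ratio = height / width.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s * s while changing the shape.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

boxes = anchors_at(150.0, 150.0)
for x0, y0, x1, y1 in boxes:
    print(f"w={x1 - x0:7.1f}  h={y1 - y0:7.1f}")
```

All nine boxes share one centre, so a 1×1 filter applied at that feature-map location sees the same features whether it is scoring the smallest or the largest anchor. That is the mismatch the abstract refers to.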
Method Feature information is fused in two ways. First, multiple convolutional layers are added to the feature extraction network, and the feature information in these layers is fused layer by layer, from the deep layers to the shallow ones, through pixel-by-pixel addition, forming feature maps rich in both semantic and detailed information. Second, to further strengthen the fusion, the fused feature maps obtained in the first step are concatenated across different stages. To address the mismatch between the feature region used for prediction and the corresponding anchor, this study introduces a convolution filter pyramid structure into the anchor mechanism to detect objects of different sizes. Anchors of different sizes are predicted by convolution filters of different sizes, whereas anchors of the same size but different ratios share the same filter, which alleviates the mismatch. Because large convolution filters increase the number of parameters and the time complexity, the model uses atrous convolution to build filters with different receptive fields. Applying filters of different sizes to the feature maps rich in semantic and detailed information generates prediction tensors (i.e., feature maps) of different resolutions. The model determines the number of anchors from these prediction tensors: many small anchors are assigned to small objects and few large anchors to large objects, which reduces the total number of anchors.
Result The proposed method was tested and evaluated on the PASCAL visual object classes (VOC) and UCAS-AOD remote sensing datasets. The code was implemented in the Caffe deep learning framework and reused some components of the open-source Caffe implementations of single-shot multibox detector (SSD) and deconvolutional single-shot detector (DSSD). All experiments were performed on an HP workstation with a Titan X GPU. SSD was used as the pre-trained model for the proposed method. The model was fine-tuned on PASCAL VOC and UCAS-AOD, and its performance was evaluated using the mean average precision (mAP) on the VOC2007 and UCAS-AOD test sets. The proposed method was then compared with other advanced deep learning object detection methods in terms of mAP and detection speed. Experimental results show that, on the PASCAL VOC2007 test set, the proposed model achieves 79.3% mAP for an input size of 300×300, which is higher than SSD and DSSD by 1.8% and 0.9%, respectively. On the UCAS-AOD remote sensing dataset, the proposed model obtains 91.0% mAP, which is 2.8% and 1.9% higher than SSD and DSSD, respectively. The testing speed of the model is 21 frames per second on a Titan X GPU, which is faster than DSSD.
Conclusion In this study, a single-stage object detection model using a convolution filter pyramid and atrous convolution is proposed. First, feature information is merged through pixel-by-pixel addition and channel concatenation to form a feature map rich in semantic and detailed information. This feature map is used as the prediction feature map and provides rich feature information for predicting bounding box categories and locations. Then, the convolution filter pyramid structure is introduced into the anchor mechanism to overcome the mismatch between the feature region and the corresponding anchor and to detect multi-scale objects accurately. At the same time, atrous convolution is introduced to enlarge the receptive field of the convolution filter without increasing the number of parameters. The number of anchors is determined from the generated prediction tensors to reduce the time complexity. Owing to the effective information fusion and the convolution filter pyramid structure in the anchor mechanism, the proposed model exhibits faster detection and higher detection accuracy than current advanced methods, particularly for small objects and overlapping objects. Although the proposed method performs well in detection speed and accuracy, its accuracy still leaves room for improvement relative to two-stage algorithms, because the feature fusion part has been studied only to a limited extent. Future research will focus on the feature fusion part to improve the detection accuracy of the algorithm.
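
The claim that atrous convolution enlarges the receptive field without adding parameters follows from how dilation spaces out the filter taps: a k×k filter with dilation rate d spans k + (k − 1)(d − 1) positions per side while still holding only k×k weights per channel pair. The sketch below only does this arithmetic. The channel counts (256 in, 256 out) and the function names are illustrative assumptions, not values from the paper.

```python
def effective_kernel_size(k, dilation):
    """Side length covered by a k x k filter whose taps are `dilation` apart."""
    return k + (k - 1) * (dilation - 1)

def conv_weight_count(k, c_in, c_out):
    """Number of weights in a k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out

C_IN, C_OUT = 256, 256  # assumed channel counts, for illustration only

for d in (1, 2, 3):
    side = effective_kernel_size(3, d)
    atrous = conv_weight_count(3, C_IN, C_OUT)    # 3x3 filter with dilation d
    dense = conv_weight_count(side, C_IN, C_OUT)  # ordinary filter of equal extent
    print(f"dilation={d}: covers {side}x{side}, "
          f"{atrous} weights (dense {side}x{side} filter: {dense})")
```

Under these assumptions, a 3×3 filter with dilation 2 covers the same 5×5 extent as an ordinary 5×5 filter while using 9 rather than 25 weights per channel pair.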
Keywords
