One-stage fully convolutional oblique object detection in remote sensing imagery

Zhou Yuan, Yang Qingqing, Ma Qiang, Xue Bowei, Kong Xiangnan (Geovis Spatial Technology Co., Ltd., Xi'an 710000, China)

Abstract
Objective The recognition accuracy of mainstream deep-learning object detection techniques on natural images depends on how well the anchors are configured, and these techniques represent object locations with axis-aligned (horizontal) boxes. Objects in remote sensing imagery, however, vary widely in size, are densely distributed, have extreme aspect ratios and appear in arbitrary orientations, so their locations are better represented by oriented boxes aligned with the objects. This paper combines anchor-free and oriented-box detection techniques to achieve high-accuracy object recognition in remote sensing imagery.

Method Oriented annotations adhere more tightly to object boundaries and effectively reduce interference in recognition. Building on the single-stage anchor-free detector FCOS (fully convolutional one-stage object detector), we introduce a gliding-vertex structure to achieve efficient and accurate oriented object detection in remote sensing imagery. Unlike FCOS, the improved detector adds two branches for oriented detection: it regresses gliding-vertex ratios on two adjacent sides of the horizontal box to produce the oriented box, and it predicts the area ratio between the oriented and horizontal boxes to reduce detection errors in extreme cases.

Result We evaluate the method on DOTA (object detection in aerial images), currently the largest and most complex oriented remote-sensing detection dataset. With ResNet50 as the backbone, the method reaches a mean average precision (mAP) of 74.84%, which is 33.02% higher than the original horizontal-box FCOS, 38.82% more efficient than YOLOv3 (you only look once), and 1.53% more accurate than the oriented detector R3Det (refined rotation RetinaNet).

Conclusion The experimental results show that the improved FCOS algorithm adapts well to oblique object recognition in high-resolution remote sensing scenes.
Keywords
Improved one-stage fully convolutional network for oblique object detection in remote sensing imagery

Zhou Yuan, Yang Qingqing, Ma Qiang, Xue Bowei, Kong Xiangnan(Geovis Spatial Technology Co., Ltd., Xi'an 710000, China)

Abstract
Objective Most object detection techniques identify potential regions through well-designed anchors, and their recognition accuracy depends heavily on how those anchors are configured. Without fine-tuning, anchor-based detectors often produce sub-optimal results when applied to unseen scenarios because of the domain gap. The use of anchors thus constrains the generalization ability of object detection techniques on aerial imagery and increases the cost of model training and parameter tuning. Moreover, object detection approaches designed for natural scenes represent objects with axis-aligned rectangles (horizontal boxes), which are inadequate for aerial images because objects may take arbitrary orientations when observed from overhead. A horizontal bounding box often encloses multiple object instances and redundant background, which can confuse the learning algorithm and reduce recognition accuracy on aerial imagery. A better option is to use oblique rectangles (oriented boxes): they are more compact than horizontal boxes because they share the objects' orientations and adhere closely to object boundaries. We propose a novel object detection approach that is anchor-free and generates oriented bounding boxes by gliding the vertices of horizontal ones. Our algorithm is built on the anchor-free detector FCOS (fully convolutional one-stage object detector). FCOS achieves accuracy comparable to anchor-based methods while entirely eliminating the need to calibrate anchors and the complex pre- and post-processing associated with them; it also requires less memory and can exploit more positive samples than its anchor-based counterparts. Because FCOS was originally designed for object detection in natural scenes, we adopt it as our baseline and extend it for oblique object detection in aerial images. Our contributions are 1) extending FCOS for oblique object detection, 2) mitigating the shape distortion issue of the gliding-vertex representation of oriented boxes, and 3) benchmarking the extended FCOS on the DOTA (object detection in aerial images) dataset.

Method Our method integrates FCOS with the gliding vertex approach to realize anchor-free oblique object detection. We describe it in three aspects: network architecture, parameterization of oriented boxes, and the experiments conducted to evaluate the proposed network. The network consists of a backbone for feature extraction, a feature pyramid network for feature fusion, and multiple detection heads for object recognition. Instead of using an orientation angle to represent box direction, we adopt the gliding vertex representation for simplicity and robustness. Like FCOS, we use ResNets as the backbone. The feature pyramid network fuses multi-level features from the backbone convolutional neural network (CNN) to detect objects of various scales. Specifically, the C3, C4 and C5 feature maps are turned into P3, P4 and P5 by 1×1 convolutions and lateral connections, and P5 is fed into two subsequent convolutional layers with stride 2 to obtain P6 and P7. Unlike FCOS, we fuse features by concatenating feature maps along the channel dimension, followed by a 1×1 convolution and batch normalization. For each location on P3 through P7, the network predicts whether an object exists there as well as the object's category.
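A minimal PyTorch sketch of the pyramid just described follows: lateral 1×1 convolutions, the concatenation-plus-1×1-convolution-and-batch-normalization fusion that replaces FCOS's element-wise addition, and the stride-2 convolutions that derive P6 and P7 from P5. The channel counts (ResNet50's C3 to C5) and the exact point in the top-down pathway where concatenation happens are our assumptions; the abstract does not specify them.

```python
# Sketch of the described feature pyramid (PyTorch). Channel counts and
# module layout are illustrative assumptions; the abstract only states
# that fusion uses channel-wise concatenation followed by a 1x1
# convolution and batch normalization, and that P6 and P7 come from P5
# via stride-2 convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral connections for C3, C4 and C5
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # concat(lateral, upsampled) -> 1x1 conv + BN, for P4 and P3
        self.fuse = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2 * out_channels, out_channels, kernel_size=1),
                nn.BatchNorm2d(out_channels),
            )
            for _ in range(2)
        )
        # stride-2 convolutions deriving P6 and P7 from P5
        self.p6 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p5 = self.laterals[2](c5)
        # top-down pathway: upsample, concatenate along channels, fuse
        up4 = F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p4 = self.fuse[0](torch.cat([self.laterals[1](c4), up4], dim=1))
        up3 = F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        p3 = self.fuse[1](torch.cat([self.laterals[0](c3), up3], dim=1))
        p6 = self.p6(p5)
        p7 = self.p7(F.relu(p6))
        return p3, p4, p5, p6, p7
```

With ResNet50 inputs, ConcatFPN()(c3, c4, c5) returns five maps whose strides are 8, 16, 32, 64 and 128, matching P3 through P7 of FCOS.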
For oriented box regression, we parameterize a box with a 7D real vector (l, t, r, b, α1, α2, k). Here l, t, r and b are the distances from the location to the four sides of the object's horizontal box; together these four parameters determine the size and position of the horizontal bounding box. (α1, α2) denote the gliding offsets on the top and left sides of the horizontal bounding box, from which the coordinates of the first and second vertices of the oriented object are derived. k is the obliquity factor, the area ratio between an oriented object and its horizontal bounding box; it describes how tilted an object is and guides the network to approximate nearly horizontal objects with horizontal boxes. With this design, we generate horizontal and oriented bounding boxes simultaneously with minimal increase in computing time and complexity. Note that we predict gliding distances on only two sides of the horizontal bounding box rather than four, under the assumption that the predicted boxes are parallelograms rather than arbitrary quadrilaterals (see the decoding sketch at the end of this abstract). Consistent with FCOS, we use fully convolutional sub-networks for category classification and location regression. The detection heads are implemented with four convolutional layers and take the feature maps produced by the feature pyramid network as input. The network outputs are decoded to obtain classification scores as well as box locations.

Result To illustrate the effectiveness of the proposed approach, we evaluated the extended FCOS on the challenging oriented object detection dataset DOTA with various backbones and inference strategies. Without bells and whistles, the proposed network outperforms the horizontal detection baseline by 33.02% in mean average precision (mAP). Compared with YOLOv3 (you only look once), it achieves a 38.82% boost in frames per second (FPS), and compared with R3Det (refined rotation RetinaNet) it improves detection accuracy by 1.53% mAP. Using ResNet50, we reach an mAP of 74.84% on DOTA, higher than most one-stage and two-stage detectors.

Conclusion The proposed method demonstrates the potential to outperform single-stage and two-stage detectors in both recognition accuracy and time efficiency.
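As a worked illustration of the 7D parameterization referenced in the Method paragraph, the NumPy sketch below decodes one prediction (l, t, r, b, α1, α2, k) at a feature-map location into a horizontal box or an oriented parallelogram, and recomputes the obliquity factor. The gliding convention (the first vertex sliding right along the top edge, the second sliding up along the left edge, the remaining two mirrored through the box center) and the fallback threshold k_thresh are assumptions of this sketch, not the paper's exact choices.

```python
# Illustrative decoding of the 7D parameterization (l, t, r, b, a1, a2, k)
# into a horizontal box or an oriented parallelogram. The gliding
# convention and the fallback threshold are assumptions of this sketch.
import numpy as np

def decode_box(px, py, l, t, r, b, a1, a2, k, k_thresh=0.95):
    """Decode one prediction at feature-map location (px, py)."""
    # horizontal box from the distances to its four sides
    x_min, y_min, x_max, y_max = px - l, py - t, px + r, py + b
    w, h = x_max - x_min, y_max - y_min

    # nearly horizontal object: fall back to the axis-aligned box,
    # which the obliquity factor encourages for small tilts
    if k >= k_thresh:
        return np.array([[x_min, y_min], [x_min, y_max],
                         [x_max, y_max], [x_max, y_min]], dtype=float)

    # assumed convention: v1 glides right from the top-left corner along
    # the top edge, v2 glides up from the bottom-left corner along the
    # left edge; a1 = a2 = 0 recovers the horizontal box itself
    v1 = np.array([x_min + a1 * w, y_min])
    v2 = np.array([x_min, y_max - a2 * h])
    # parallelogram assumption: v3 and v4 mirror v1 and v2 through the center
    center = np.array([(x_min + x_max) / 2.0, (y_min + y_max) / 2.0])
    v3, v4 = 2.0 * center - v1, 2.0 * center - v2
    return np.stack([v1, v2, v3, v4])

def obliquity(quad):
    """Area ratio between a quadrilateral and its horizontal bounding box."""
    x, y = quad[:, 0], quad[:, 1]
    # shoelace formula for the quadrilateral's area
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    hbox = (x.max() - x.min()) * (y.max() - y.min())
    return area / hbox
```

For α1 = α2 = 0.5, decode_box returns the diamond inscribed in its horizontal box and obliquity gives 0.5; as the offsets shrink toward zero the ratio approaches 1 and the decoder falls back to the horizontal box, matching the abstract's use of k to approximate nearly horizontal objects with horizontal boxes.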
Keywords
