Segmentation of high-resolution remote sensing image by collaborating with edge loss enhancement

Chen Qin1, Zhu Lei1,2, Lyu Suidong1, Wu Jin1 (1. School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China; 2. WISDRI Continuous Casting Technology Engineering Co., Ltd., Wuhan 430223, China)

Abstract
Objective To address the problems of low segmentation accuracy and blurred object boundaries that are common in the semantic segmentation of high-resolution remote sensing images, this paper proposes an edge loss enhanced semantic segmentation method that comprehensively exploits boundary information and multiscale network features. Method For a single high-resolution remote sensing image, side output structures are first introduced into the VGG-16 (visual geometry group 16-layer net) network to extract rich feature details from the image. Then, a deeply supervised short connection structure combines the side outputs from deep to shallow layers to achieve multilevel and multiscale feature fusion. Finally, an edge loss enhancement structure is added to obtain clearer object boundaries and improve the accuracy and completeness of the segmentation results. Result To verify the effectiveness of the proposed method, remote sensing images of planted greenhouses in northern China and of photovoltaic panel assemblies from Google Earth were manually annotated to build the experimental datasets. On these two datasets, the proposed method was compared with several commonly used semantic segmentation methods. The experimental results show that the precision of the proposed method stays above 0.8 for recall rates between 0 and 0.9, and that its mean absolute errors on the two datasets are 0.079 1 and 0.036 2, respectively. An ablation study further analyzes the contribution of each functional module to the final result. Conclusion Compared with state-of-the-art methods, the proposed edge loss enhanced segmentation method can extract target regions from the complex backgrounds of remote sensing images more precisely, yielding segmented targets with clearer edges.
Segmentation of high-resolution remote sensing image by collaborating with edge loss enhancement

Chen Qin1, Zhu Lei1,2, Lyu Suidong1, Wu Jin1(1.School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China;2.WISDRI Continuous Casting Technology Engineering Company Ltd., Wuhan 430223, China)

Abstract
Objective Semantic analysis of remote sensing (RS) images has long been an important research topic in the computer vision community. It is widely used in related fields such as military surveillance, mapping and navigation, and urban planning. By exploring and analyzing the semantic information of RS images, researchers can easily obtain various informative features for subsequent decision making. However, the richer, finer visual information in high-resolution RS images also places higher demands on image segmentation techniques. Traditional segmentation methods usually employ low-level visual features such as grayscale, color, spatial texture, and geometric shape to divide an image into several disjoint regions. Such features are generally called hand-crafted features; they are empirically defined and may carry little semantic meaning. Compared with traditional segmentation methods, semantic segmentation approaches based on deep convolutional neural networks (CNNs) can learn hierarchical visual features that represent images at different semantic levels. Typical CNN-based semantic segmentation approaches mainly focus on mitigating semantic ambiguity by providing rich information. However, RS images have more complex backgrounds than natural scene images. For example, they usually contain many types of geometric objects and cover massive redundant background areas. Simply employing a single type of feature, even a CNN-based one, may not be sufficient in such cases. Taking the single-category object extraction task in RS images as an example, on the one hand, negative objects may have visual appearances similar to the expected target. This redundant, noisy semantic information may confuse the network and ultimately degrade segmentation performance.
On the other hand, CNN-based features are good at encoding the context information of an image rather than its fine details, making it difficult for CNN-based models to obtain precise predictions of object boundaries. Therefore, to address these problems in high-resolution RS image segmentation, this paper proposes an edge loss enhanced network for semantic segmentation that comprehensively utilizes boundary information and hierarchical deep features. Method The backbone of the proposed model is a fully convolutional network derived from the visual geometry group 16-layer net (VGG-16) by removing all fully connected layers and the fifth pooling layer. A side output structure is introduced for each convolutional layer of the backbone to extract rich, informative features from the input image. The side output structure starts with a (1×1, 1) convolutional layer (a convolutional layer is denoted as (n×n, c), where n and c are the size and number of kernels, respectively), followed by an element-wise summation layer that accumulates features at each scale. A further (1×1, 1) convolutional layer then condenses the hybrid features. The side output structure makes full use of the features of each convolutional layer of the backbone and helps the network capture the fine details of the image. The side output features are then gradually aggregated from deep layers to shallow layers by a deeply supervised short connection structure, which strengthens the connections between features across scales. To this end, each side output feature is first encoded by a residual convolution unit and then, with the necessary upsampling, introduced into the side output of the adjacent shallower stage. The short connection structure enables multilevel, multiscale fusion during feature encoding and is proven effective in the experiments.
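The deep-to-shallow fusion described above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the function names (`fuse_side_outputs`, `upsample2x`, `conv1x1`), the nearest-neighbour upsampling, the random toy weights, and the feature sizes are all assumptions made for the example; in the actual network the projections are learned convolutions inside a CNN framework.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(feat, weights):
    """A (1x1, c_out) convolution is a per-pixel linear map over channels."""
    # feat: (H, W, C_in), weights: (C_in, C_out)
    return feat @ weights

def fuse_side_outputs(side_feats, proj_weights):
    """Aggregate side outputs from the deepest stage toward the shallowest.

    side_feats: list of (H_i, W_i, C_i) maps ordered shallow -> deep,
    each stage having half the spatial size of the previous one.
    proj_weights: per-stage (C_i, 1) weights for the (1x1, 1) projection.
    """
    # Project each stage to a single channel, as in the (1x1, 1) layers.
    projected = [conv1x1(f, w) for f, w in zip(side_feats, proj_weights)]
    fused = projected[-1]                      # start at the deepest stage
    for shallow in reversed(projected[:-1]):   # walk toward the input
        # Short connection: upsample the deeper result and add it
        # element-wise to the shallower side output.
        fused = shallow + upsample2x(fused)
    return fused

# Toy example: three stages derived from a 16x16 input.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(16 // 2**i, 16 // 2**i, 8)) for i in range(3)]
weights = [rng.normal(size=(8, 1)) for _ in range(3)]
out = fuse_side_outputs(feats, weights)
print(out.shape)  # (16, 16, 1): fused map at the resolution of the shallowest stage
```

The element-wise summation keeps the fused map single-channel at every step, so each shallower stage only refines the upsampled coarse prediction rather than replacing it.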
Finally, for each fused side output feature, a (3×3, 128) convolutional layer first unifies the number of feature channels, and the result is sent to two parallel branches, namely an edge loss enhancement branch and an ordinary segmentation branch. In each edge loss enhancement branch, a Laplace operator coupled with a residual convolution unit is adopted to obtain the target boundary. The detected boundary is supervised by a ground truth generated by directly computing the gradient of the existing semantic annotation of the training samples, so no additional manual edge labeling is required. Experimental results show that the edge loss enhancement branch helps refine the target boundary and maintain the integrity of the target region. Result First, two human-annotated datasets, comprising RS images of planted greenhouses in northern China and of photovoltaic panels collected from Google Earth, are organized to evaluate the effectiveness of the proposed method. Then, visual and numerical comparisons are conducted between the proposed method and several popular semantic segmentation methods. In addition, an ablation study illustrates the contribution of the essential components of the proposed architecture. The experimental results show that our method outperforms the competing approaches on both datasets in terms of precision-recall curves and mean absolute error (MAE). The precision achieved by our method remains above 0.8 when the recall rate is in the range of 0 to 0.9. The MAE achieved by our method is 0.079 1 and 0.036 2 on the two datasets, respectively, the best among all evaluated methods. The ablation study also clearly illustrates the effectiveness of each individual functional block. First, the baseline of the proposed architecture obtains a poor result, with an MAE of 0.204 4 on the northern greenhouse dataset.
Then, the residual convolution units help reduce the MAE by 31%, and the value further drops to 0.084 8 when the short connection structure is added to fuse the multiscale features of the network. Finally, the edge loss enhancement structure lowers the MAE to 0.079 1, a 61% decrease compared with the baseline model. These results indicate that all components are necessary for a good segmentation result. Conclusion In summary, compared with the competing methods, the proposed method extracts the target region more accurately from the complex background of RS images and produces clearer target boundaries.
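Two ideas from the abstract lend themselves to a short illustration: edge ground truth obtained by differentiating the existing segmentation annotation (so no extra labeling is needed), and the MAE metric used in the evaluation. The NumPy sketch below is only an assumed reading of those ideas, not the authors' code; the naive convolution, the thresholding, and the function names (`edge_label`, `mae`) are illustrative choices.

```python
import numpy as np

# 3x3 Laplacian kernel; its response is nonzero only where the mask value
# changes, i.e. along the boundary of the annotated region.
LAPLACE = np.array([[0,  1, 0],
                    [1, -4, 1],
                    [0,  1, 0]], dtype=float)

def conv2d_same(img, kernel):
    """Naive 'same'-size 2-D filtering with zero padding (clarity over speed)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2,), (kw // 2,)), mode="constant")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def edge_label(mask):
    """Binary edge ground truth derived from a segmentation mask."""
    return (np.abs(conv2d_same(mask.astype(float), LAPLACE)) > 0).astype(float)

def mae(pred, gt):
    """Mean absolute error between a predicted map and the ground truth."""
    return float(np.mean(np.abs(pred - gt)))

# Toy 6x6 mask with a 2x2 foreground square.
mask = np.zeros((6, 6))
mask[2:4, 2:4] = 1.0
edges = edge_label(mask)
print(edges.astype(int))   # nonzero only around the square's boundary
print(mae(mask, mask))     # 0.0 for a perfect prediction
```

Deriving the edge label directly from the mask is what lets the edge branch be supervised without any manual boundary annotation.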
