Semantic segmentation of urban remote sensing images optimized by a tree-structured convolutional neural network

Hu Wei, Gao Bochuan, Huang Zhenhang, Li Ruirui (College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China)

Abstract
Objective High-resolution remote sensing images usually contain complex semantic information and easily confused targets, making their semantic segmentation an important and challenging task. Based on the DeepLab V3+ architecture combined with a tree-structured neural network module, we design a semantic segmentation network for high-resolution remote sensing images. Method The proposed network not only modifies DeepLab V3+ to suit multiscale, multimodal data but also appends a tree-structured neural network module. By building a confusion matrix, extracting a confusion graph, and partitioning that graph, the tree structure distinguishes easily confused pixels better and yields more accurate segmentation results. Result Experiments were conducted on remote sensing image sets of two different cities provided by the International Society for Photogrammetry and Remote Sensing (ISPRS). The model performed best in overall accuracy (OA), reaching 90.4% and 90.7% on the Vaihingen and Potsdam datasets, respectively, improvements of 10.3% and 17.4% over its baseline results, and it also clearly surpasses three state-of-the-art methods listed on the official ISPRS website. Conclusion The proposed convolutional neural network combining DeepLab V3+ with a tree structure effectively improves the overall semantic segmentation accuracy of high-resolution remote sensing images, with especially notable gains on easily confused classes. In high-resolution remote sensing images containing complex semantic information, reducing pixel-level errors between easily confused classes also substantially raises the overall segmentation accuracy of the tree-structured network model.
Semantic segmentation of urban remote sensing image based on optimized tree structure convolutional neural network

Hu Wei, Gao Bochuan, Huang Zhenhang, Li Ruirui (College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China)

Abstract
Objective High-resolution remote sensing image segmentation refers to the task of assigning a semantic label to each pixel in an image. Recently, with the rapid development of remote sensing technology, we can easily obtain very-high-resolution remote sensing images with a ground sampling distance of 5 cm to 10 cm. However, the highly heterogeneous appearance of objects such as buildings, streets, trees, and cars in very-high-resolution data makes this task challenging: intra-class variance is high while inter-class variance is low. A research hotspot is detailed 2D semantic segmentation that assigns labels to multiple object categories. Traditional image processing methods depend on vectorization-model extraction techniques, for example, those based on region segmentation, line analysis, and shadow analysis. Another mainstream line of study relies on supervised classifiers with manually designed features. These models generalize poorly when dealing with high-resolution remote sensing images. Recently, deep learning-based technology has helped explore the high-level semantic information in images and provides an end-to-end approach for semantic segmentation. Method Based on DeepLab V3+, we propose an adaptively constructed neural network that contains two connected modules, namely, the segmentation module and the tree module. When segmenting remote sensing images, which contain multiscale objects, understanding the context is important. To handle the problem of segmenting objects at multiple scales, DeepLab V3+ employs atrous convolution in cascade or in parallel, capturing multiscale context by adopting multiple atrous rates. We adopt a similar idea in designing the segmentation module. This module uses an encoder-decoder architecture. The encoder is composed of four structures: EntryFlow, MiddleFlow, ExitFlow, and atrous spatial pyramid pooling (ASPP). The decoder is composed of two layers of SeparableConv blocks.
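The multi-rate atrous convolution idea behind ASPP can be illustrated with a minimal 1-D sketch in pure Python. This is not the paper's implementation (which operates on 2-D feature maps inside the network); it only shows the mechanism: spacing kernel taps `rate` positions apart enlarges the receptive field to (k-1)·rate+1 without adding parameters, and ASPP runs several such rates in parallel over the same input.

```python
def dilated_conv1d(x, w, rate):
    """Atrous (dilated) 1-D convolution: the k kernel taps are spaced
    `rate` apart, covering a receptive field of (k - 1) * rate + 1."""
    k = len(w)
    span = (k - 1) * rate
    return [sum(w[j] * x[i + j * rate] for j in range(k))
            for i in range(len(x) - span)]

def aspp_1d(x, w, rates=(1, 6, 12, 18)):
    """ASPP-style parallel branches: the same input is filtered at
    several dilation rates, each branch seeing a different context
    scale. Rates (6, 12, 18) follow the DeepLab convention."""
    return {r: dilated_conv1d(x, w, r) for r in rates}

# A box filter [1, 1, 1] at rate 1 sums 3 adjacent samples; at rate 2
# it sums samples spaced 2 apart, widening the context it sees:
# dilated_conv1d([0, 1, 2, 3, 4], [1, 1, 1], 1) → [3, 6, 9]
# dilated_conv1d([0, 1, 2, 3, 4], [1, 1, 1], 2) → [6]
```

In the real module, the outputs of the parallel branches are concatenated and fused by a 1×1 convolution; here the dictionary of per-rate outputs stands in for that concatenation.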
The middle flow has two Xception blocks, which are linear stacks of depthwise-separable convolutional layers with residual connections. The segmentation module captures the multiscale features in the context well. However, these features pay little attention to easily confused classes. The other core contribution of the proposed method is the tree module, which is constructed adaptively during training. In each round, the method computes the confusion matrix on the evaluation data and calculates the confusion degree between every two classes. A graph can be constructed from the confusion matrix, and a tree structure is obtained through a minimum-cut algorithm. According to this tree structure, we build the tree module, in which each node is a ResNeXt unit; the nodes are joined by concatenated connections. The tree module helps distinguish pixels of easily confused classes by adding several neural layers to process their features. The segmentation model is implemented on the MXNet framework and trained on two Nvidia GeForce GTX 1080 Ti graphics cards. The input size of an image block is 640×640 pixels due to memory limitations. We set the momentum to 0.9 and the initial learning rate to 0.01, reduce the learning rate to 0.001 halfway through training, and reduce it to 0.0001 at three quarters of training. Because the ISPRS (International Society for Photogrammetry and Remote Sensing) remote sensing datasets are small, we perform data augmentation before training. For each piece of raw data, we rotate the image about its center in 10° steps and cut out the largest square tile, so each training image yields 36 rotated variants. In addition, the original training images are far too large to feed into the network whole.
Thus, each image is cropped into blocks of 640×640 pixels. We apply an overlap-tile strategy to ensure that no obvious seams appear in the segmentation map after splicing. Result The proposed model performed best in overall accuracy (OA), reaching 90.4% and 90.7% on the Vaihingen and Potsdam datasets, respectively, indicating that it achieves high segmentation accuracy. Moreover, for easily confused categories such as low shrub vegetation (low_veg) and trees, the F1 scores improved greatly. On the Vaihingen dataset, the F1 scores for low shrub vegetation and trees reached 83.6% and 89.6%, respectively; on the Potsdam dataset, they reached 86.8% and 87.1%. The average F1 scores of the model are 89.3% on Vaihingen and 92.0% on Potsdam, much higher than those of the other latest methods, indicating that the model is the best both in overall segmentation of remote sensing images and in average per-category performance. Additionally, compared with the model without the tree module, the proposed method segments every category more accurately: the tree module increased OA by 1.1% and the average F1 score by 0.6% on the Vaihingen dataset, and by 1.3% and 0.9%, respectively, on the Potsdam dataset. These results show that the tree module does not merely target a certain category but improves the overall segmentation accuracy. Conclusion The proposed network effectively improves the overall semantic segmentation accuracy of high-resolution remote sensing images. The experimental results show that the segmentation module combined with the tree module improves greatly owing to the reduction of errors on easily confused pixels.
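The adaptive tree construction described in the Method section (confusion matrix, pairwise confusion degrees, partition into a tree of class groups) can be sketched as follows. This is a simplified stand-in: the symmetric confusion-degree formula below is a hypothetical choice, and a greedy agglomerative merge replaces the paper's minimum-cut graph partitioning, so only the overall shape of the procedure matches the text.

```python
def confusion_degree(C, i, j):
    """Hypothetical symmetric confusion degree between classes i and j:
    mutual misclassifications normalised by the two classes' totals."""
    mutual = C[i][j] + C[j][i]
    total = sum(C[i]) + sum(C[j])
    return mutual / total if total else 0.0

def leaves(node):
    """Flatten a (nested-tuple) tree node into its leaf class indices."""
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def build_tree(C):
    """Greedy agglomerative stand-in for the min-cut step: repeatedly
    merge the two most mutually confused groups, producing a binary
    tree (nested tuples) whose leaves are class indices. Each tree
    node then corresponds to a ResNeXt unit in the tree module."""
    nodes = list(range(len(C)))

    def group_degree(a, b):
        return max(confusion_degree(C, i, j)
                   for i in leaves(a) for j in leaves(b))

    while len(nodes) > 1:
        _, ai, bi = max((group_degree(a, b), ai, bi)
                        for ai, a in enumerate(nodes)
                        for bi, b in enumerate(nodes) if ai < bi)
        merged = (nodes[ai], nodes[bi])
        nodes = [n for k, n in enumerate(nodes)
                 if k not in (ai, bi)] + [merged]
    return nodes[0]

# Classes 0 and 1 confuse each other most, so they are grouped first:
# build_tree([[90, 8, 2], [10, 85, 5], [1, 4, 95]]) → (2, (0, 1))
```

In the actual method this tree is rebuilt each round from the confusion matrix on evaluation data, so the module's structure adapts as training progresses.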
The proposed method in this study is universal and suitable for a wide range of application scenarios.
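The preprocessing steps described in the Method section (rotation augmentation with the largest square crop, and overlap-tile cropping into 640×640 blocks) can be sketched in pure Python. The tile-offset logic is an assumed but common realisation of an overlap-tile scheme (the paper does not state its stride); the inscribed-square formula is a standard geometric fact for an axis-aligned crop inside a rotated square image.

```python
import math

def inscribed_square_side(side, deg):
    """Side of the largest axis-aligned square fully contained in a
    square image of side `side` after rotating it by `deg` degrees
    about its centre: side / (cos t + sin t) for t in [0, 45 deg]."""
    t = math.radians(deg % 90)
    t = min(t, math.pi / 2 - t)  # symmetry of the square
    return side / (math.cos(t) + math.sin(t))

def tile_offsets(length, tile=640, overlap=64):
    """Top-left offsets along one axis for overlap-tile cropping:
    stride = tile - overlap, with the last tile snapped to the image
    border so the whole image is covered without gaps. The overlap
    lets spliced predictions avoid visible seams; 64 px is an assumed
    value, not taken from the paper."""
    stride = tile - overlap
    offs = list(range(0, max(length - tile, 0) + 1, stride))
    if offs[-1] + tile < length:
        offs.append(length - tile)
    return offs

# 36 rotated variants per image, as in the text: 360 / 10 = 36 steps.
rotations = [k * 10 for k in range(36)]
```

For example, a 2000-pixel-wide image with 640-pixel tiles and a 100-pixel overlap needs offsets [0, 540, 1080, 1360], the last tile being shifted back so it ends exactly at the border.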