Objective High-resolution remote sensing images usually contain complex semantic information and easily confused targets, so their semantic segmentation is an important and challenging task. Based on the DeepLab V3+ network architecture combined with a tree-structured neural network module, this paper designs a semantic segmentation network for high-resolution remote sensing images. Method The proposed network not only modifies DeepLab V3+ to make it suitable for multi-scale, multi-modal data, but also appends and connects a tree-structured neural network module. The tree structure is constructed by building a confusion matrix, extracting a confusion graph, and performing graph partitioning; it can better distinguish easily confused pixels and produce more accurate segmentation results. Result Experiments were conducted on two ISPRS remote sensing image datasets of different cities. The proposed model performs best in overall accuracy (OA), reaching 90.4% and 90.7% on the Vaihingen and Potsdam datasets respectively, improvements of 10.3% and 17.4% over the baseline results, and it also improves significantly over a variety of other state-of-the-art methods such as SVL, DST, and UZ. Conclusion The proposed convolutional neural network combining DeepLab V3+ and a tree structure effectively improves the overall semantic segmentation accuracy of high-resolution remote sensing images, with particularly notable gains on easily confused classes. On high-resolution remote sensing images containing complex semantic information, the overall segmentation accuracy of the proposed tree-structured network also improves considerably, thanks to fewer errors on pixels of easily confused classes. The proposed method is general and suitable for a wide range of application scenarios.
Semantic Segmentation of Urban Remote Sensing Images Based on an Optimized Tree-Structure Convolutional Neural Network
Hu W, Gao BC, Huang ZH, Li RR (College of Information Science and Technology, Beijing University of Chemical Technology, Beijing)
Objective High-resolution remote sensing image segmentation refers to the task of assigning a semantic label to each pixel in an image. Recently, with the rapid development of remote sensing technology, very-high-resolution remote sensing images with a ground sampling distance (GSD) of 5 to 10 cm have become easy to obtain. What makes this task challenging is the very heterogeneous appearance of objects such as buildings, streets, trees, and cars in very-high-resolution data, which leads to high intra-class variance while the inter-class variance is low. A current research hotspot is detailed 2D semantic segmentation that assigns labels to multiple object categories. Traditional image processing methods depend on vectorization-model extraction techniques based on, for example, region segmentation, line analysis, and shadow analysis. Another mainstream line of study relies on supervised classifiers with manually designed features. These models generalize poorly when dealing with high-resolution remote sensing images. Recently, deep-learning-based techniques have helped exploit the high-level semantic information in images and provide an end-to-end approach to semantic segmentation. Method In this paper, based on DeepLab V3+, we propose an adaptively constructed neural network that contains two connected modules: the segmentation module and the tree module. When segmenting remote sensing images that contain multi-scale objects, it is important to understand the context. To handle the problem of segmenting objects at multiple scales, DeepLab V3+ employs atrous convolution, in cascade or in parallel, to capture multi-scale context by adopting multiple atrous rates. We adopt a similar idea in designing the segmentation module. It uses an encoder-decoder architecture. The encoder is composed of four structures: Entry Flow, Middle Flow, Exit Flow, and ASPP. The decoder is composed of two layers of SeparableConv blocks.
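The multiple-atrous-rate idea above can be illustrated with a minimal single-channel NumPy sketch (an illustrative toy, not the paper's MXNet implementation): a dilated kernel samples the input with gaps between taps, enlarging the receptive field without adding parameters.

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """Valid-mode 2-D atrous (dilated) cross-correlation, single channel.

    A k x k kernel with dilation `rate` covers an effective window of
    k + (k - 1) * (rate - 1) pixels.
    """
    kh, kw = kernel.shape
    eh = kh + (kh - 1) * (rate - 1)  # effective kernel height
    ew = kw + (kw - 1) * (rate - 1)  # effective kernel width
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the window with step `rate`, skipping the "holes"
            window = x[i:i + eh:rate, j:j + ew:rate]
            out[i, j] = np.sum(window * kernel)
    return out

# An ASPP-style head would run several rates (e.g. 1, 6, 12, 18) in
# parallel on the same feature map and concatenate the results.
```

With a 3×3 kernel and rate 2, the effective window is 5×5 while only nine weights are learned, which is what lets the cascaded/parallel branches see context at several scales cheaply.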
The Middle Flow has two Xception blocks, which are linear stacks of depthwise-separable convolutional layers with residual connections. The segmentation module captures multi-scale features in the context well, but these features pay little attention to easily confused classes. The other core contribution of the proposed method is the tree module, which is constructed adaptively during training. In each round, the method computes the confusion matrix on the evaluation data and calculates the confusion degree between every pair of classes. A graph can be constructed according to the confusion matrix, and a tree structure is then obtained through the minimum-cut algorithm. According to this tree structure, we build the tree module, in which each node is a ResNeXt unit, and the nodes are connected by concatenation. The tree module helps distinguish pixels of easily confused classes by adding more neural layers to process their features. To implement the proposed method, the segmentation model is built on the MXNet framework and trained on two Nvidia GeForce GTX 1080 Ti graphics cards. Due to memory limitations, the input size of each image block is 640×640 pixels. We set the momentum to 0.9 and the initial learning rate to 0.01, reduce the learning rate to 0.001 halfway through training, and reduce it to 0.0001 at three quarters of training. Because the ISPRS remote sensing datasets are small, we perform data augmentation before training: for each piece of raw data, we rotate the image about its center by 10° at a time and crop out the largest inscribed square tile, so that each training image yields 36 rotated versions. In addition, because the original training images are very large, the entire image cannot be fed directly into the network for training and must be cropped into image blocks of 640×640 pixels.
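The confusion-matrix-to-tree step can be sketched as follows. The exact confusion-degree formula and partitioning procedure are not spelled out in this abstract, so this sketch assumes a symmetric, row-normalized confusion degree and a brute-force minimum cut (feasible for the handful of ISPRS land-cover classes):

```python
import numpy as np
from itertools import combinations

def confusion_degrees(cm):
    """Symmetric confusion degree between every pair of classes.

    cm[i, j] = pixels of true class i predicted as class j.  The
    row-normalized, symmetrized definition here is an assumption
    made for illustration.
    """
    rates = cm.astype(float) / np.maximum(cm.sum(axis=1, keepdims=True), 1e-12)
    deg = rates + rates.T          # i confused as j, plus j confused as i
    np.fill_diagonal(deg, 0.0)
    return deg

def min_cut_split(classes, deg):
    """Bipartition `classes` so the total confusion crossing the cut is
    minimal (brute force over all bipartitions).  Applying this split
    recursively to each side yields a tree over the classes."""
    best, best_cut = None, float("inf")
    for k in range(1, len(classes) // 2 + 1):
        for left in combinations(classes, k):
            right = [c for c in classes if c not in left]
            cut = sum(deg[i][j] for i in left for j in right)
            if cut < best_cut:
                best, best_cut = (list(left), right), cut
    return best
```

The intuition is that heavily confused classes (say, low vegetation and trees) stay on the same side of the cut as long as possible, so they share deeper dedicated layers in the resulting tree module.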
We apply an overlap-tile strategy to ensure that there are no obvious seams in the segmentation map after the tiles are stitched back together. Result Our model performed best in terms of overall accuracy (OA), reaching 90.4% and 90.7% on the Vaihingen and Potsdam datasets respectively, indicating that it can achieve high segmentation accuracy. In addition, the F1 scores of easily confused categories, for example low shrub vegetation (low_veg) and trees, improved considerably. On the Vaihingen dataset, the F1 scores of the low-vegetation and tree classes reached 83.6% and 89.6%, respectively; on the Potsdam dataset, they reached 86.8% and 87.1%. As for the average F1 score, the model reached 89.3% and 92.0% on the Vaihingen and Potsdam datasets, respectively, which is much higher than other recent methods. This indicates that our model is the best both in overall remote sensing image segmentation and in average performance across categories. Additionally, compared with the model without the tree module, the proposed method achieves higher segmentation accuracy in every category. With the tree module, the overall accuracy (OA) increased by 1.1% and the average F1 score by 0.6% on the Vaihingen dataset; on the Potsdam dataset, the overall accuracy and average F1 score increased by 1.3% and 0.9%. This shows that the tree module does not merely target a particular category but improves overall segmentation accuracy. Conclusion The proposed network effectively improves the overall semantic segmentation accuracy of high-resolution remote sensing images. The experimental results show that the accuracy of the segmentation module combined with the tree module improves greatly thanks to the reduction of errors on easily confused pixels. The proposed method is general and suitable for a wide range of application scenarios.
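The overlap-tile stitching can be sketched as below: tiles are cut with overlap, and predictions are averaged where tiles overlap, which suppresses seam artifacts at tile borders. The function names and the 512-px stride are illustrative assumptions, not the paper's code:

```python
import numpy as np

def tile_starts(length, tile, stride):
    """Start offsets so that `tile`-px tiles stepped by `stride` px
    cover all `length` px, with a final tile snapped to the border."""
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:   # make sure the border is covered
        starts.append(length - tile)
    return starts

def stitch(pred_fn, image, tile=640, stride=512):
    """Run `pred_fn` (tile -> per-pixel score map of the same shape) on
    overlapping tiles and average the scores in the overlap regions."""
    H, W = image.shape[:2]
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for y in tile_starts(H, tile, stride):
        for x in tile_starts(W, tile, stride):
            acc[y:y + tile, x:x + tile] += pred_fn(image[y:y + tile, x:x + tile])
            cnt[y:y + tile, x:x + tile] += 1
    return acc / cnt                 # every pixel is covered at least once
```

Because every border pixel still sits inside a full 640×640 tile, no pixel is predicted only from a truncated context, which is what removes the visible cracks after stitching.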