
Published: 2020-12-16
DOI: 10.11834/jig.190557
2020 | Volume 25 | Number 12





Residual dense spatial pyramid network for urban remote sensing image segmentation
Han Binbin1,2,3, Zhang Yueting1,2, Pan Zongxu1,2, Tai Xianqing1,2, Li Fangfang1,2
1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China;
2. Key Laboratory of Technology in Geo-Spatial Information Processing and Application Systems, Beijing 100190, China;
3. University of Chinese Academy of Sciences, Beijing 100049, China
Supported by: National Key Research and Development Program of China (2016YFF0202700); National Natural Science Foundation of China (61701478)

Abstract

Objective Remote sensing image semantic segmentation, in which each pixel in an image is classified according to its land cover type, is an important research direction in the field of remote sensing image processing. However, accurately segmenting remote sensing images and extracting their features is difficult due to the wide coverage of these images and the large scale differences and complex boundaries among ground objects. Meanwhile, traditional remote sensing image processing methods are inefficient, inaccurate, and require much expertise. Convolutional neural networks are deep learning networks suited to processing data with grid structures, such as 1D data with time-series features (e.g., speech) and image data with 2D pixel grids. Given its multi-layer structure, a convolutional neural network can automatically learn features at different levels. The network also has two properties that facilitate image processing. First, a convolutional neural network exploits the 2D characteristics of an image during feature extraction. Given the high correlation among adjacent pixels in an image, the neuron nodes in the network do not need to connect to all pixels; a local connection suffices to extract features. Second, convolution kernel parameters are shared when the network performs convolution operations: features at different positions of an image are computed with the same convolution kernel, thereby greatly reducing the number of model parameters. In this paper, a fully convolutional neural network based on a residual dense spatial pyramid is applied to urban remote sensing image segmentation to achieve accurate semantic segmentation of high-resolution remote sensing images. Method To improve the semantic segmentation precision of high-resolution urban remote sensing images, we first take a 101-layer residual convolutional network as the backbone for extracting remote sensing image feature maps. When features are extracted with classic convolutional neural networks, the repeated combination of max-pooling and striding at consecutive layers significantly reduces the spatial resolution of the feature maps, typically by a factor of 32 in each direction in general deep convolutional neural networks (DCNNs), leading to spatial information loss. Semantic segmentation is a pixel-to-pixel mapping task in which classification must be performed at the pixel level, so reducing the spatial resolution of feature maps causes spatial information loss that is not conducive to the semantic segmentation of remote sensing images. To avoid such loss, the proposed model introduces atrous convolution into the residual convolutional neural network. Compared with ordinary convolution, atrous convolution uses a parameter r to control the receptive field of the convolution kernel during the computation. A convolutional neural network with atrous convolution can expand the receptive field of the feature map while keeping the feature map size unchanged, thereby significantly improving the semantic segmentation performance of the proposed model. Objects in remote sensing images often exhibit large scale variations and complex texture features, both of which challenge the accurate encoding of multi-scale high-level features.
To accurately extract multi-scale features from these images, the proposed model cascades the branches of a spatial pyramid structure through a dense connection mechanism, which allows each branch to output denser receptive field information. In the semantic segmentation of remote sensing images, the high-level semantic features extracted by the convolutional neural network are required to determine the category of each pixel correctly, while low-level texture features are required to delineate the edges of targets; low-level texture features benefit the reconstruction of object edges during semantic segmentation. Our model therefore uses a simple decoder to exploit both the high-level semantic features and the low-level texture features in the network: skip connections fuse cross-layer information and combine the high-level semantic features with the underlying texture features. After fusing high- and low-level information, we use two 3 × 3 convolutions to integrate the information among channels and recover spatial information. We eventually feed the extracted feature map to a softmax classifier for pixel-level classification and obtain the semantic segmentation results. Result Extensive experiments are performed on the ISPRS (International Society for Photogrammetry and Remote Sensing) remote sensing dataset of the Vaihingen area. We use intersection over union (IoU) and F1 as indicators for evaluating the segmentation performance of the proposed model, and we build and train our models on an NVIDIA Tesla P100 platform with the Tensorflow deep learning framework. The complexity of the models in the experiments increases at each stage. Experimental results show that the proposed model obtains mean IoU (MIoU) and mean F1 values of 69.88% and 81.39% over six types of land cover, respectively, a clear improvement over a residual convolutional network without atrous convolution. Our method also outperforms SegNet, Res-shuffling-Net, and SDFCN (symmetrical dense-shortcut fully convolutional network) in terms of quantitative metrics and outperforms pix2pix in terms of visual quality, confirming its validity. We then apply the model to the remote sensing image data of the Potsdam area and obtain MIoU and mean F1 values of 74.02% and 83.86%, respectively, demonstrating the robustness of our model. Conclusion We build an end-to-end deep learning model for the semantic segmentation of remote sensing images of high-resolution urban areas. By applying an improved spatial pyramid pooling network based on atrous convolution and dense connections, the proposed model effectively extracts multi-scale features from remote sensing images and fuses the high-level semantic information and low-level texture information of the network, which in turn improves the accuracy of the model in the remote sensing image segmentation of urban areas. Experimental results show that the proposed model achieves excellent performance both quantitatively and visually and has high application value in the semantic segmentation of high-resolution remote sensing images.

Key words

semantic segmentation; remote sensing images; multiscale; residual convolutional network; dense connection

0 Introduction

With the rapid development of remote sensing technology, the spatial, temporal, and spectral resolutions of remote sensing images have greatly improved. Remote sensing image semantic segmentation groups pixels according to the semantic information expressed in an image, yielding a segmentation map with per-pixel semantic annotations, as shown in Fig. 1. It has been widely applied in environmental monitoring, agriculture, forestry, urban planning, and other fields (Feng, 2017) and is an important component of remote sensing image applications.

Fig. 1 Examples of semantic segmentation

Researchers have studied remote sensing image semantic segmentation extensively, with two main families of methods. The first consists of traditional methods based on hand-crafted features, including threshold-based, edge-detection, and region-based methods (Chen et al., 2018). These traditional methods are inefficient and inaccurate on the one hand and require substantial expert knowledge on the other. The second family is based on convolutional neural networks (CNNs). With the successful application of CNNs in computer vision, researchers have gradually begun to study their application to remote sensing image semantic segmentation. Li et al. (2019) used atrous convolution to propose a multi-scale semantic segmentation model based on a deep residual network. Audebert et al. (2016) studied the application of fully convolutional networks (FCNs) to pixel-wise scene labeling of Earth observation images. Liu et al. (2017) used a fully convolutional neural network as a feature extractor combined with conditional random fields to semantically segment high-resolution remote sensing images. Blomley and Weinmann (2017) used deep fully convolutional neural networks to process multi-modal data of high-resolution remote sensing images for semantic segmentation. Chen et al. (2018a) studied the symmetrical normal-shortcut FCN (SNFCN) and symmetrical dense-shortcut FCN (SDFCN) frameworks for the segmentation of very-high-resolution remote sensing images. Marmanis et al. (2016) used two Siamese networks to separately extract features from remote sensing images and digital surface models (DSMs) for the semantic segmentation of high-resolution aerial remote sensing images. Sherrah (2016) proposed a convolutional neural network without downsampling, but it is computationally inefficient and requires substantial GPU resources.

The main contribution of this paper is the application of a densely connected spatial pyramid pooling network to the semantic segmentation of high-resolution urban remote sensing images. The network fuses multi-scale semantic features and low-level texture features of remote sensing images, effectively extracting targets of different scales and their edges, and can thus segment high-resolution remote sensing images with complex textures more effectively. Extensive experiments are conducted on the ISPRS (International Society for Photogrammetry and Remote Sensing) high-resolution remote sensing dataset (Rottensteiner et al., 2012). Experimental results show that the proposed model outperforms SegNet, pix2pix, Res-shuffling-Net, and SDFCN in terms of mean intersection over union and F1 value, proving the effectiveness of the algorithm.

1 Method

1.1 Network architecture

The proposed model consists of three parts, as shown in Fig. 2: 1) a residual network improved with atrous convolution (the deep convolutional neural network (DCNN) in Fig. 2), used to extract features; 2) an atrous spatial pyramid pooling (ASPP) module improved with dense connections, used to extract and fuse multi-scale features; and 3) a decoder, which fuses high- and low-level information through skip connections for simple decoding and outputs the semantic segmentation map.

Fig. 2 Network architecture diagram

1.2 Residual network with atrous convolution

We use the 101-layer residual network (ResNet-101) (He et al., 2016) as the backbone for feature extraction. During feature extraction, a CNN enlarges the receptive field through downsampling operations, which causes a loss of spatial information and is detrimental to remote sensing image semantic segmentation. We therefore replace ordinary convolution with atrous convolution, computed as

$ \boldsymbol{Y}[i, j]=\sum\limits_{m} \sum\limits_{n} \boldsymbol{X}[i+r \cdot m, j+r \cdot n] \cdot \boldsymbol{W}[m, n] $ (1)

where $\boldsymbol{X}$ denotes the input feature map, $\boldsymbol{W}$ denotes the convolution kernel, and the parameter $r$ is the sampling rate of the atrous convolution, which controls the receptive field of the kernel. As eq. (1) shows, atrous convolution enlarges the receptive field of the feature map while keeping the feature map size unchanged, effectively avoiding the loss of spatial information (Yu and Koltun, 2015; Chen et al., 2017).
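As an illustration, the following minimal NumPy sketch implements eq. (1) for a single-channel input. The function name, the zero padding that keeps the output the same size as the input, and the explicit loops are our choices for clarity, not details from the paper.

```python
import numpy as np

def atrous_conv2d(x, w, r):
    # Y[i, j] = sum_m sum_n X[i + r*m, j + r*n] * W[m, n]  (eq. (1))
    kh, kw = w.shape
    # Zero-pad by r*(k-1)/2 so the output keeps the input's spatial size;
    # the effective size of an atrous kernel is r*(k-1)+1.
    ph, pw = r * (kh - 1) // 2, r * (kw - 1) // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    y = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            for m in range(kh):
                for n in range(kw):
                    y[i, j] += xp[i + r * m, j + r * n] * w[m, n]
    return y

# With r = 1 this is ordinary "same" convolution; with r = 2 the same
# 3 x 3 kernel covers a 5 x 5 neighborhood without adding parameters.
x = np.arange(36.0).reshape(6, 6)
w = np.ones((3, 3))
print(atrous_conv2d(x, w, 1).shape, atrous_conv2d(x, w, 2).shape)  # (6, 6) (6, 6)
```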

The structure of the residual convolutional network with atrous convolution (RNA) is shown in Table 1. Whether the conv2_x to conv4_x stages use atrous convolution is set according to the output stride, as discussed in detail in the experiments below; the 3 × 3 convolutions in the three residual bottleneck units of the conv5_x stage are set to atrous convolutions with sampling rates of 1, 2, and 4, respectively (a code sketch follows Table 1).

Table 1 Network structure of the feature extractor

ResNet-101
conv1 7×7, 64
3×3, max pool
conv2_x $\left[ {\begin{array}{*{20}{l}} {1 \times 1, 64}\\ {3 \times 3, 64}\\ {1 \times 1, 256} \end{array}} \right] \times 3$
conv3_x $\left[ {\begin{array}{*{20}{l}} {1 \times 1, 128}\\ {3 \times 3, 128}\\ {1 \times 1, 512} \end{array}} \right] \times 4$
conv4_x $\left[\begin{array}{l}1 \times 1, 256 \\ 3 \times 3, 256 \\ 1 \times 1, 1024\end{array}\right] \times 23$
conv5_x $\left[\begin{array}{l}1 \times 1, 512 \\ 3 \times 3, 512 \\ 1 \times 1, 2048\end{array}\right] \times 3$
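The sketch below shows, under our own simplifications (TensorFlow 2.x Keras API, batch normalization omitted, function names ours), how a residual bottleneck unit keeps stride 1 and uses an atrous 3 × 3 convolution instead of downsampling, and how conv5_x applies the rates (1, 2, 4) described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck(x, filters, rate=1):
    # 1x1 reduce -> atrous 3x3 (stride 1, sampling rate) -> 1x1 expand.
    y = layers.Conv2D(filters, 1, activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same',
                      dilation_rate=rate, activation='relu')(y)
    y = layers.Conv2D(4 * filters, 1)(y)
    shortcut = x
    if x.shape[-1] != 4 * filters:  # project the shortcut when widths differ
        shortcut = layers.Conv2D(4 * filters, 1)(x)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def conv5_x(x):
    # Three bottleneck units whose 3x3 convolutions use rates 1, 2 and 4.
    for rate in (1, 2, 4):
        x = bottleneck(x, 512, rate=rate)
    return x

inputs = tf.keras.Input(shape=(50, 50, 1024))  # e.g. conv4_x output at stride 8
print(conv5_x(inputs).shape)                   # (None, 50, 50, 2048): size kept
```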

1.3 Atrous spatial pyramid pooling (ASPP)

Remote sensing images contain targets of different scales. We first adopt the ASPP structure (Chen et al., 2017), which uses a set of parallel atrous convolutions with different sampling rates to extract multi-scale features from remote sensing images. The ASPP structure is shown in Fig. 3 and computed as

$ \begin{array}{c} \boldsymbol{Y}={Concat}\left(\boldsymbol{I}_{\text {pooling }}(\boldsymbol{X}), H_{1, 3}(\boldsymbol{X}), \right. \\ \left.H_{6, 3}(\boldsymbol{X}), H_{12, 3}(\boldsymbol{X}), H_{18, 3}(\boldsymbol{X})\right) \end{array} $ (2)

Fig. 3 Structure of ASPP

where $Concat$(·) denotes concatenation of the feature maps along the first (channel) dimension, $H_{r, n}(\boldsymbol{X})$ denotes an atrous convolution with sampling rate $r$ and kernel size $n$, and $\boldsymbol{I}_{\text{pooling}}$ denotes the image-level features of the image pooling branch in Fig. 3, i.e., the average-pooled features of the input feature map.
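A minimal Keras sketch of eq. (2) follows, assuming a fixed input size and TensorFlow 2.x; the 256-channel width and the trailing 1 × 1 fusion convolution are common ASPP choices (Chen et al., 2017) rather than details stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(1, 6, 12, 18)):
    h, w = x.shape[1], x.shape[2]
    # I_pooling(X): global average pool, 1x1 conv, bilinear upsample.
    pool = layers.GlobalAveragePooling2D(keepdims=True)(x)
    pool = layers.Conv2D(filters, 1, activation='relu')(pool)
    pool = layers.UpSampling2D((h, w), interpolation='bilinear')(pool)
    branches = [pool]
    for r in rates:  # H_{r,3}(X): parallel atrous 3x3 branches of eq. (2)
        branches.append(layers.Conv2D(filters, 3, padding='same',
                                      dilation_rate=r, activation='relu')(x))
    y = layers.Concatenate()(branches)  # Concat(...) in eq. (2)
    return layers.Conv2D(filters, 1, activation='relu')(y)

inputs = tf.keras.Input(shape=(50, 50, 2048))
print(aspp(inputs).shape)  # (None, 50, 50, 256)
```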

1.4 Decoder

As described in Section 1.2, the backbone network extracts features, which reduces the feature map size. We use a simple decoding network (Chen et al., 2018c) to recover the feature map size. In the decoder, the feature map is first upsampled by a factor of 2 through bilinear interpolation; the result is then fused with the low-level features of the corresponding size from the backbone (the output of conv3 in the second convolution group of conv2_x); finally, 3 × 3 convolutions integrate the information among channels and output the semantic segmentation map.
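The decoder can be sketched as follows (our Keras sketch: the 48- and 256-channel widths come from Section 2.3.3, while the final upsampling back to the input resolution and the example shapes are assumptions).

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder(high, low, num_classes=6):
    # Reduce the low-level conv2_x features to 48 channels to balance
    # them against the high-level features (Sec. 2.3.3).
    low = layers.Conv2D(48, 1, activation='relu')(low)
    # 2x bilinear upsampling of the high-level feature map (Sec. 1.4).
    high = layers.UpSampling2D(2, interpolation='bilinear')(high)
    y = layers.Concatenate()([high, low])
    # Two 3x3 convolutions integrate channel information (Sec. 2.3.3).
    y = layers.Conv2D(256, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(256, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(num_classes, 1)(y)
    # Recover the input resolution and classify each pixel.
    y = layers.UpSampling2D(4, interpolation='bilinear')(y)
    return layers.Softmax()(y)

high = tf.keras.Input(shape=(50, 50, 256))    # ASPP output at stride 8
low = tf.keras.Input(shape=(100, 100, 256))   # conv2_x features at stride 4
print(decoder(high, low).shape)               # (None, 400, 400, 6)
```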

1.5 Dense atrous spatial pyramid pooling (Dense ASPP)

Targets in high-resolution remote sensing images have complex texture features and a wide range of scale variation. ASPP extracts features in parallel, which helps address the multi-scale problem to some extent, but its resolution along the scale axis is insufficient to extract target features in remote sensing images precisely. We therefore replace ASPP with the densely connected ASPP (Dense ASPP) proposed by Yang et al. (2018). In Dense ASPP, the branches are cascaded in a densely connected manner with gradually increasing atrous sampling rates; the input of each branch is the concatenation of the outputs of all previous branches, so the output of each branch carries denser receptive field information. The structure is shown in Fig. 4. The output of each layer in Fig. 4 is

$ \boldsymbol{Y}_{l}=\left\{\begin{array}{ll} \boldsymbol{I}_{\text {pooling }}(\boldsymbol{X}) & l=0 \\ H_{r, n}\left({ Concat }\left(\boldsymbol{Y}_{0}, \boldsymbol{Y}_{1}, \cdots, \boldsymbol{Y}_{l-1}\right)\right) & l \neq 0 \end{array}\right. $ (3)

Fig. 4 Structure of Dense ASPP

where $\boldsymbol{Y}_{l}$ denotes the output of the $l$-th layer.
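Following eq. (3), the dense cascading can be sketched in Keras as below; the rate list and branch width are illustrative (Yang et al. (2018) give the exact configuration).

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_aspp(x, filters=256, rates=(6, 12, 18)):
    h, w = x.shape[1], x.shape[2]
    # Y_0 = I_pooling(X): image-level features (eq. (3), l = 0).
    y0 = layers.GlobalAveragePooling2D(keepdims=True)(x)
    y0 = layers.Conv2D(filters, 1, activation='relu')(y0)
    y0 = layers.UpSampling2D((h, w), interpolation='bilinear')(y0)
    outputs = [y0]
    # Y_l = H_{r,3}(Concat(Y_0, ..., Y_{l-1})): each branch sees the
    # concatenation of all previous outputs, densifying receptive fields.
    for r in rates:
        inp = outputs[0] if len(outputs) == 1 else layers.Concatenate()(outputs)
        outputs.append(layers.Conv2D(filters, 3, padding='same',
                                     dilation_rate=r, activation='relu')(inp))
    return layers.Concatenate()(outputs)

inputs = tf.keras.Input(shape=(50, 50, 2048))
print(dense_aspp(inputs).shape)  # (None, 50, 50, 1024): 4 stacked branches
```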

2 Experiments

2.1 Dataset and evaluation metrics

We experiment on the high-resolution remote sensing data of the Vaihingen area from ISPRS (Rottensteiner et al., 2012). The dataset contains 33 very-high-resolution remote sensing images, of which 16 are used for training and 17 for testing. The image sizes range from 1 281 to 3 816 pixels, with a spatial resolution of 0.09 m. Ground truth is available and is divided into six common land cover classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. Samples are shown in Fig. 5. The data are first preprocessed through cropping and augmentation: 1) the images in the dataset are uniformly cropped into patches of 400 × 500 pixels; 2) the resulting training images are augmented, i.e., each image in the training set is flipped vertically and horizontally and rotated by 90°, 180°, and 270°. The final training set contains 2 718 images, and the test set contains 517 images.
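For illustration, a NumPy sketch of the augmentation step follows (the cropping step is omitted; whether rotations are also applied to the flipped copies is not specified above, so this sketch produces six variants from the original patch only).

```python
import numpy as np

def augment(image, label):
    # Six variants per training patch: original, vertical flip, horizontal
    # flip, and 90/180/270-degree rotations, applied identically to the label.
    pairs = [(image, label),
             (np.flipud(image), np.flipud(label)),
             (np.fliplr(image), np.fliplr(label))]
    for k in (1, 2, 3):
        pairs.append((np.rot90(image, k), np.rot90(label, k)))
    return pairs

patch = np.zeros((500, 400, 3), dtype=np.uint8)   # a 400 x 500-pixel tile
mask = np.zeros((500, 400), dtype=np.uint8)
print(len(augment(patch, mask)))  # 6
```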

Fig. 5 Samples of the ISPRS dataset ((a) remote sensing images; (b) labels)

We use intersection over union (IoU) and the F1 value as evaluation metrics. IoU is defined as the ratio of the number of pixels labeled as a given class in both the prediction and the ground truth to the number of pixels labeled as that class in either the prediction or the ground truth; the F1 value is defined as the harmonic mean of precision and recall. They are computed as

$ I o U=\frac{p_{i i}}{\sum\limits_{j=0}^{k} p_{i j}+\sum\limits_{j=0}^{k} p_{j i}-p_{i i}} $ (4)

$ F_{1}=2 \cdot \frac{P \cdot R}{P+R} $ (5)

where $k$ denotes the number of classes, $p_{ij}$ denotes the number of pixels of true class $i$ predicted as class $j$, and $P$ and $R$ denote precision and recall, respectively.
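Both metrics follow directly from a confusion matrix. The sketch below (our code) computes per-class IoU and F1 from a k × k matrix whose entry (i, j) counts pixels of true class i predicted as class j, exactly as in eqs. (4) and (5).

```python
import numpy as np

def per_class_metrics(conf):
    tp = np.diag(conf).astype(float)     # p_ii
    fn = conf.sum(axis=1) - tp           # sum_j p_ij - p_ii
    fp = conf.sum(axis=0) - tp           # sum_j p_ji - p_ii
    iou = tp / (tp + fn + fp)            # eq. (4)
    precision = tp / (tp + fp)           # P
    recall = tp / (tp + fn)              # R
    f1 = 2 * precision * recall / (precision + recall)  # eq. (5)
    return iou, f1

conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [5, 0, 45]])
iou, f1 = per_class_metrics(conf)
print(iou.mean(), f1.mean())  # MIoU and mean F1
```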

2.2 Experimental environment and hyperparameters

The experiments run on a Windows Server system with the Tensorflow 1.9.0 deep learning framework on an NVIDIA Tesla P100 hardware platform. We define $S_{\text{ratio}}$ as the ratio of the input image size to the size of the feature map extracted by ResNet-101. The learning rate follows a polynomial schedule, updated as $\eta_0 \times (1-s/s_{\max})^{p}$, where the initial learning rate $\eta_0$ is 0.001, $s$ is the current iteration, $s_{\max}$ is the maximum number of iterations, and $p$ is the schedule exponent, set to 0.9 in our experiments. The training batch size is 4, and the number of iterations is 25 000.
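The polynomial schedule is easy to reproduce; a minimal sketch (our code, matching the constants above):

```python
def poly_lr(step, max_steps=25000, base_lr=1e-3, power=0.9):
    # eta_0 * (1 - s / s_max)^p, decaying from 1e-3 toward 0 over 25 000 steps.
    return base_lr * (1.0 - step / float(max_steps)) ** power

print(poly_lr(0), poly_lr(12500), poly_lr(24999))
```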

2.3 Model parameter analysis

2.3.1 Output stride setting

To explore the influence of the output stride on model performance, we train the RNA model in four comparison experiments, with output strides of 4, 8, 16, and 32. Once the feature map has shrunk to the specified output stride, the subsequent residual blocks (conv2_x to conv4_x) use atrous convolution; otherwise, the blocks are identical to ResNet-101. For example, with an output stride of 4, all 3 × 3 convolutions in the conv2_x to conv4_x stages are set to atrous convolutions with a sampling rate of 2; with an output stride of 8, the 3 × 3 convolutions in the conv3_x and conv4_x stages are set to atrous convolutions. When the output stride of RNA is 4, the batch size is set to 1 due to GPU memory limits, with a corresponding 100 000 training iterations. The resulting mean intersection over union (MIoU) values are shown in Table 2. As Table 2 shows, the performance of RNA is negatively correlated with the output stride. When the output stride decreases to 4, model performance does not improve noticeably, but the computational cost increases sharply. To balance performance and computation, the output stride is uniformly set to 8 in subsequent experiments.

Table 2 MIoU of RNA with different $S_{\text{ratio}}$ /%

$S_{\text{ratio}}$ | MIoU
4 | 67.47
8 | 67.24
16 | 66.27
32 | 62.12

2.3.2 ASPP parameter settings

As described in Section 1.3, we use the ASPP structure to extract multi-scale features from remote sensing images. To explore the influence of the sampling rates of the atrous convolution branches in ASPP on performance, we design three comparison experiments with the rates set to $R_{\text{group}}$ = (2, 4, 6, 8), $R_{\text{group}}$ = (6, 12, 18, 24), and $R_{\text{group}}$ = (6, 12, 18). The results are shown in Table 3: the model segments best when the rate group is set to (6, 12, 18). In subsequent experiments, the ASPP rate group is therefore set to (6, 12, 18) to ensure the best model performance.

Table 3 MIoU of RNA + ASPP with different rates /%

$R_{\text{group}}$ | MIoU
(6, 12, 18, 24) | 67.53
(2, 4, 6, 8) | 67.32
(6, 12, 18) | 67.67

2.3.3 Dense ASPP and decoder

In this set of experiments, the decoder and Dense ASPP (Yang et al., 2018) are added to RNA. In the decoder, to balance the weights of the high-level and low-level features, a 1 × 1 convolution with 48 output channels first reduces the number of channels of the low-level features; when fusing the high- and low-level semantic features, two consecutive 3 × 3 convolutions with 256 channels integrate the high-level semantic information and low-level texture information. To verify the effect of the decoder, we add it on top of the configuration of Section 2.3.2 and retrain the model for comparison. Detailed results of the models trained in each stage of Section 2.3 on the ISPRS test set are shown in Tables 4 and 5. As the tables show, the proposed model improves both F1 and IoU, confirming the effectiveness of the decoder and Dense ASPP.

Table 4 Comparison of IoU by different models on the ISPRS test set /%

Model | Impervious surfaces | Buildings | Low vegetation | Trees | Cars | Clutter/background | Mean
RNA+ASPP | 80.04 | 86.69 | 66.27 | 76.18 | 60.18 | 36.70 | 67.67
RNA+ASPP+decoder | 80.64 | 87.62 | 66.96 | 76.24 | 62.16 | 36.28 | 68.31
Ours | 81.13 | 87.67 | 66.78 | 76.25 | 63.57 | 43.86 | 69.88
Note: bold font indicates the best result in each column.

Table 5 Comparison of F1 by different models on the ISPRS test set /%

Model | Impervious surfaces | Buildings | Low vegetation | Trees | Cars | Clutter/background | Mean
RNA+ASPP | 88.91 | 92.87 | 79.71 | 86.48 | 75.14 | 53.69 | 79.47
RNA+ASPP+decoder | 89.29 | 93.40 | 80.21 | 86.52 | 76.66 | 53.24 | 79.89
Ours | 89.58 | 93.42 | 80.08 | 86.52 | 77.73 | 60.97 | 81.39
Note: bold font indicates the best result in each column.

2.4 Model performance analysis

The ablation results are shown in Table 6. Taking RNA as the baseline, the proposed model improves the MIoU value by 2.64 percentage points, demonstrating its effectiveness.

Table 6 Results of models with different components

RNA | ASPP | Dense ASPP | decoder | MIoU/%
✓ |   |   |   | 67.24
✓ | ✓ |   |   | 67.67
✓ | ✓ |   | ✓ | 68.31
✓ |   | ✓ | ✓ | 69.88

For a horizontal comparison, we train SegNet (Badrinarayanan et al., 2017) and pix2pix (Isola et al., 2017) on the Vaihingen dataset and collect the results of Res-shuffling-Net (Chen et al., 2018b) and SDFCN (Chen et al., 2018a) together with those of our model; the figures are given in Table 7. As Table 7 shows, the MIoU of our model is superior to SegNet, Res-shuffling-Net, and SDFCN. Table 7 does not include pix2pix because it is a generative adversarial network (GAN): the GAN mechanism drives the generator to produce results as close as possible to the ground truth rather than exactly reproducing the original label values, so its accuracy is very low from a quantitative standpoint, although the pix2pix model can generate visually pleasing results. Fig. 6 shows the output images of SegNet, pix2pix, and our model. As Table 7 and Fig. 6 show, our model performs well both quantitatively and visually.

Table 7 Results by different models on the Vaihingen test set /%

Method | MIoU
SegNet | 44.51
Res-shuffling-Net | 60.48
SDFCN | 62.38
Ours | 69.88
Note: bold font indicates the best result.
Fig. 6 Visual results of the models ((a) source images; (b) SegNet; (c) pix2pix; (d) ours; (e) labels)

2.5 Experimental results on the Potsdam area

To verify the generalization ability of the model, we apply it to remote sensing images of the Potsdam area. The dataset contains 38 remote sensing images, 24 for training and 14 for testing, each 6 000 × 6 000 pixels with a resolution of 0.05 m. The data are preprocessed with the method of Section 2.1, yielding 25 920 training images and 2 520 test images. The results are shown in Table 8. On the Potsdam dataset, the MIoU reaches 74.02% and the mean F1 reaches 83.86%, demonstrating the effectiveness of the model for the semantic segmentation of high-resolution urban remote sensing images.

Table 8 Experimental results on the Potsdam area /%

Metric | Impervious surfaces | Buildings | Low vegetation | Trees | Cars | Background | Mean
IoU | 83.00 | 90.89 | 73.40 | 75.78 | 81.81 | 39.23 | 74.02
F1 | 90.71 | 95.23 | 84.66 | 86.22 | 89.99 | 56.35 | 83.86

3 Conclusion

To improve the semantic segmentation accuracy of urban remote sensing images, we designed a fully convolutional neural network suitable for the semantic segmentation of high-resolution urban remote sensing images. The model has two main characteristics:

1) A residual convolutional network improved with atrous convolution serves as the backbone for extracting remote sensing image features. The backbone can control the size of the extracted feature maps through a hyperparameter, avoiding the loss of spatial information during feature extraction and accurately and effectively extracting high-level features of remote sensing images.

2) A multi-scale feature extraction module built on a dense connection mechanism is introduced. Based on dense connections, the module makes full use of the multi-scale features in the network and can effectively segment targets of different scales in remote sensing images.

We built the network model on the Tensorflow framework and conducted staged experiments on the ISPRS dataset, progressively studying RNA, RNA + ASPP, RNA + ASPP + decoder, and RNA + Dense ASPP + decoder. We also compared the segmentation results of our model with those of SegNet, pix2pix, Res-shuffling-Net, and SDFCN. Experimental results show that our model outperforms the other algorithms both quantitatively and visually, proving its effectiveness.

A limitation of the model is that its multi-scale module is based on dense connections, which greatly increases the computational cost and degrades real-time segmentation performance. Future work will further optimize the model, improving the network structure without reducing accuracy and investigating a real-time urban remote sensing image semantic segmentation model that combines precision and speed.

References

• Audebert N, Le Saux B and Lefèvre S. 2016. Semantic segmentation of earth observation data using multimodal and multi-scale deep networks//Proceedings of the 13th Asian Conference on Computer Vision. Taipei, China: Springer: 180-196[DOI: 10.1007/978-3-319-54181-5_12]
  • Badrinarayanan V, Kendall A, Cipolla R. 2017. SegNet:a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI:10.1109/TPAMI.2016.2644615]
  • Blomley R and Weinmann M. 2017. Using multi-scale features for the 3D semantic labeling of airborne laser scanning data//ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Wuhan, China: ISPRS: 43-50[DOI: 10.5194/isprs-annals-IV-2-W4-43-2017]
  • Chen G Z, Zhang X D, Wang Q, Dai F, Gong Y F, Zhu K. 2018a. Symmetrical dense-shortcut deep fully convolutional networks for semantic segmentation of very-high-resolution remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(5): 1633-1644 [DOI:10.1109/JSTARS.2018.2810320]
  • Chen K, Weinmann M, Gao X, Yan M, Hinz S, Jutzi B and Weinmann M. 2018b. Residual shuffling convolutional neural networks for deep semantic image segmentation using multi-modal data//ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Riva del Garda, Italy: ISPRS: 65-72[DOI: 10.5194/isprs-annals-IV-2-65-2018]
  • Chen L C, Papandreou G, Schroff F and Adam H. 2017. Rethinking atrous convolution for semantic image segmentation[EB/OL].[2019-09-30]. https://arxiv.org/pdf/1706.05587.pdf
  • Chen L C, Zhu Y K, Papandreou G, Schroff F and Adam H. 2018c. Encoder-decoder with atrous separable convolution for semantic image segmentation[EB/OL].[2019-09-30]. https://arxiv.org/pdf/1802.02611v1.pdf
• Chen T H, Zheng S Q, Yu J C. 2018. Remote sensing image segmentation based on improved DeepLab network. Measurement and Control Technology, 37(11): 34-39 [DOI:10.19708/j.ckjs.2018.11.008]
• Feng L Y. 2017. Research on Construction Land Information Extraction from High Resolution Images with Deep Learning Technology. Hangzhou: Zhejiang University
  • He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]
  • Isola P, Zhu Y J, Zhou T H and Efros A A. 2017. Image-to-image translation with conditional adversarial networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5967-5976[DOI: 10.1109/CVPR.2017.632]
• Li X, Tang W L, Yang B. 2019. Semantic segmentation of high-resolution remote sensing image based on deep residual network. Journal of Applied Sciences-Electronics and Information Engineering, 37(2): 282-290 [DOI:10.3969/j.issn.0255-8297.2019.02.013]
• Liu Y S, Piramanayagam S, Monteiro S T and Saber E. 2017. Dense semantic labeling of very-high-resolution aerial imagery and LiDAR with fully-convolutional neural networks and higher-order CRFs//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA: IEEE: 1561-1570[DOI: 10.1109/CVPRW.2017.200]
• Marmanis D, Wegner J D, Galliani S, Schindler K, Datcu M and Stilla U. 2016. Semantic segmentation of aerial images with an ensemble of CNNs//ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Prague, Czech Republic: ISPRS: 473-480[DOI: 10.5194/isprs-annals-III-3-473-2016]
  • Rottensteiner F, Sohn G, Jung J, Gerke M, Baillard C, Bénitez S and Breitkopf U. 2012. The ISPRS benchmark on urban object classification and 3D building reconstruction//ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Melbourne, Australia: ISPRS: 293-298[DOI: 10.5194/isprsannals-I-3-293-2012]
  • Sherrah J. 2016. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery[EB/OL].[2019-09-30]. https://arxiv.org/pdf/1606.02585.pdf
  • Yang M K, Yu K, Zhang C, Li Z W and Yang K Y. 2018. DenseASPP for semantic segmentation in street scenes//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3684-3692[DOI: 10.1109/CVPR.2018.00388]
  • Yu F and Koltun V. 2015. Multi-scale context aggregation by dilated convolutions[EB/OL].[2019-09-30]. https://arxiv.org/pdf/1511.07122.pdf