Multi-source features adaptation fusion network for semantic segmentation in high-resolution remote sensing images
2022, Vol. 27, No. 8, Pages: 2516-2526
Print publication date: 2022-08-16
Accepted: 2021-04-06
DOI: 10.11834/jig.210054
Wenkai Zhang, Wenjie Liu, Xian Sun, Guangluan Xu, Kun Fu. Multi-source features adaptation fusion network for semantic segmentation in high-resolution remote sensing images[J]. Journal of Image and Graphics, 2022, 27(8): 2516-2526.
Objective
In the semantic segmentation of high-resolution remote sensing images, it is difficult to distinguish regions with similar spectral features (such as lawns and trees, or roads and buildings) using visible images alone; introducing elevation information can significantly improve the classification results. However, the feature distributions of visible images and elevation data differ considerably, and simple fusion by concatenation or element-wise addition cannot effectively handle the noise generated when the two modalities are fused, which leads to poor fusion results. How to fuse multi-modal features effectively has therefore become a key problem in remote sensing semantic segmentation. To address this problem, this paper proposes a multi-source features adaptation fusion model.
Method
Modal features are fused dynamically according to the target category and context information of each pixel, which weakens the influence of fusion noise and exploits the complementary information of multi-modal data. The model consists of three parts: a dual encoder extracts the features of the spectral and elevation modalities; a modality adaptation fusion module processes the multi-modal features jointly, dynamically using the elevation information to strengthen the spectral features according to the target category and context of each pixel, so that the network can select features from a specific modality for a particular object category or spatial location; and a global context aggregation module models the global context from both the spatial and channel perspectives to obtain richer feature representations.
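The abstract gives no implementation details for this fusion step. As a minimal sketch (an assumption, not the authors' code), the PyTorch block below predicts per-pixel, per-channel gates from both modalities and uses them to inject only the useful elevation responses into the spectral stream; the class name and gate design are hypothetical.

import torch
import torch.nn as nn

class ModalityAdaptiveFusion(nn.Module):
    # Hypothetical sketch: gate the elevation features with weights
    # predicted from both modalities, so the network can favor one
    # modality per object category or spatial location.
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),  # per-pixel, per-channel weights in [0, 1]
        )

    def forward(self, spectral, elevation):
        # Predict fusion weights from the concatenated modal features,
        # then add the re-weighted elevation features to the spectral ones.
        w = self.gate(torch.cat([spectral, elevation], dim=1))
        return spectral + w * elevation

# Toy usage with 256-channel feature maps from the dual encoder.
fused = ModalityAdaptiveFusion(256)(torch.randn(1, 256, 32, 32),
                                    torch.randn(1, 256, 32, 32))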
Result
The experimental results are evaluated both qualitatively and quantitatively. Qualitatively, the proposed method produces finer segmentation results. Quantitatively, the model is evaluated on the ISPRS (International Society for Photogrammetry and Remote Sensing) Vaihingen and GID (Gaofen Image Dataset) datasets, achieving overall accuracies of 90.77% and 82.1%, respectively, clearly outperforming algorithms such as DeepLab V3+ and PSPNet (pyramid scene parsing network).
Conclusion
The experimental results show that the proposed multi-source features adaptation fusion network fuses modal features effectively, models global context relationships more efficiently, and can be widely applied in the remote sensing field.
Objective
In the semantic segmentation of high-resolution remote sensing images, it is difficult to distinguish regions with similar spectral features (such as lawns and trees, or roads and buildings) using visible images alone. Most existing neural-network-based methods focus on extracting spectral and contextual features through a single encoder-decoder network, while geometric features are often not fully mined. Introducing elevation information can significantly improve the classification results. However, the feature distributions of visible images and elevation data are quite different. Simply cascading the multi-modal feature streams fails to exploit the complementary information of multi-modal data in the early, intermediate, and late stages of the network, and simple fusion by concatenation or addition cannot suppress the noise generated by multi-modal fusion, which degrades the results. In addition, high-resolution remote sensing images usually cover a large area, and the target objects vary widely in size and are unevenly distributed; current research therefore models long-range relationships to extract contextual features.
Method
We propose a multi-source features adaptation fusion network (MSFAFNet). To dynamically recalibrate the scene-context feature maps, a modal adaptive fusion block explicitly models the correlations between the two modal feature maps. To reduce the influence of fusion noise and exploit the complementary information of multi-modal data effectively, modal features are fused dynamically according to the target categories and context information of pixels. Meanwhile, a global context aggregation module improves the feature representation ability of the fully convolutional network by modeling long-range relationships between pixels. The model consists of three parts: 1) a dual encoder extracts the features of the spectral modality and the elevation modality; 2) a modality adaptation fusion block processes the multi-modal features jointly, dynamically enhancing the spectral features with elevation information; and 3) a global context aggregation module models the global context from the spatial and channel perspectives.
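The authors do not publish the module's internals in this abstract. As a hedged sketch, assuming the standard dual-attention formulation of Fu et al. (2019) cited below, the following PyTorch code aggregates global context from a spatial view (every position attends to every other position) and a channel view (every channel map attends to every other channel map); all names and dimensions are illustrative.

import torch
import torch.nn as nn

class GlobalContextAggregation(nn.Module):
    # Illustrative sketch: global context from spatial and channel views.
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 8
        self.q = nn.Conv2d(channels, inter, 1)     # query projection
        self.k = nn.Conv2d(channels, inter, 1)     # key projection
        self.v = nn.Conv2d(channels, channels, 1)  # value projection
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        # Spatial branch: n x n affinity between all positions.
        q = self.q(x).view(b, -1, n).permute(0, 2, 1)   # b, n, c'
        k = self.k(x).view(b, -1, n)                    # b, c', n
        attn = self.softmax(torch.bmm(q, k))            # b, n, n
        v = self.v(x).view(b, c, n)                     # b, c, n
        spatial = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        # Channel branch: c x c affinity between channel maps.
        f = x.view(b, c, n)
        chan = self.softmax(torch.bmm(f, f.permute(0, 2, 1)))  # b, c, c
        channel = torch.bmm(chan, f).view(b, c, h, w)
        # Fuse both context-enriched views with the input.
        return x + spatial + channel

out = GlobalContextAggregation(256)(torch.randn(1, 256, 32, 32))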
Result
Our efficient unimodal segmentation architecture (EUSA) is evaluated on the International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and Gaofen Image Dataset (GID) validation sets, achieving overall accuracies of 90.64% and 82.1%, respectively. Specifically, on the ISPRS Vaihingen test set, EUSA improves the overall accuracy and mean intersection over union over the baseline by 1.55% and 3.05%, respectively, while introducing only a small number of additional parameters and little extra computation. The proposed modal adaptive fusion block increases the overall accuracy and mean intersection over union on the ISPRS Vaihingen test set by 1.32% and 2.33%, respectively. Our MSFAFNet performs best in the ISPRS Vaihingen test set evaluation, achieving an overall accuracy of 90.77%.
Conclusion
Our experimental results show that the efficient unimodal segmentation framework EUSA can model long-range contextual relationships between pixels. To improve the segmentation of regions in shadow or with similar textures, we propose MSFAFNet, which extracts more effective features from the elevation information.
semantic segmentation; remote sensing images; multi-modal data; modality adaptation fusion; global context aggregation
Audebert N, Le Saux B, Lefèvre S. 2018. Beyond RGB: very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing, 140: 20-32 [DOI: 10.1016/j.isprsjprs.2017.11.011]
Badrinarayanan V, Kendall A, Cipolla R. 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Cao Z Y, Fu K, Lu X D, Diao W H, Sun H, Yan M L, Yu H F, Sun X. 2019. End-to-end DSM fusion networks for semantic segmentation in high-resolution aerial images. IEEE Geoscience and Remote Sensing Letters, 16(11): 1766-1770 [DOI: 10.1109/LGRS.2019.2907009]
Chen L C, Zhu Y K, Papandreou G, Schroff F, Adam H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 833-851
Fu J, Liu J, Tian H J, Li Y, Bao Y J, Fang Z W, Lu H Q. 2019. Dual attention network for scene segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3141-3149
He K M, Zhang X Y, Ren S Q, Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778
Hu J, Shen L, Albanie S, Sun G, Wu E H. 2020. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8): 2011-2023 [DOI: 10.1109/TPAMI.2019.2913372]
Huang Z L, Wang X G, Huang L C, Huang C, Wei Y C, Liu W Y. 2019. CCNet: criss-cross attention for semantic segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 603-612
LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, Jackel L D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4): 541-551 [DOI: 10.1162/neco.1989.1.4.541]
Liu Z W, Li X X, Luo P, Loy C C, Tang X O. 2015. Semantic image segmentation via deep parsing network//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1377-1385
Long J, Shelhamer E, Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440
Marcos D, Volpi M, Kellenberger B, Tuia D. 2018. Land cover mapping at very high resolution with rotation equivariant CNNs: towards small yet accurate models. ISPRS Journal of Photogrammetry and Remote Sensing, 145: 96-107 [DOI: 10.1016/j.isprsjprs.2018.01.021]
Marmanis D, Wegner J D, Galliani S, Schindler K, Datcu M, Stilla U. 2016. Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, III-3: 473-480
Qin R J, Fang W. 2014. A hierarchical building detection method for very high resolution remotely sensed images combined with DSM using graph cut optimization. Photogrammetric Engineering and Remote Sensing, 80(9): 873-883 [DOI: 10.14358/PERS.80.9.000]
Wang X L, Girshick R, Gupta A, He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803
Zhao H S, Shi J P, Qi X J, Wang X G, Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6230-6239
Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z Z, Du D L, Huang C, Torr P H S. 2015. Conditional random fields as recurrent neural networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1529-1537