Cross-layer detail perception and group attention-guided semantic segmentation of remote sensing images

Li Linjuan; He Yun; Xie Gang; Zhang Haoxue; Bai Yanhong

Published: 2024-05-20
DOI: 10.11834/jig.230653
2024 | Volume 29 | Number 5

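Li Linjuan^1,2, He Yun^1, Xie Gang^1,2, Zhang Haoxue^1,2, Bai Yanhong^1 (1. School of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024; 2. Shanxi Key Laboratory of Advanced Control and Equipment Intelligence, Taiyuan 030024)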

Abstract

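Objective Semantic segmentation is one of the key tasks in the intelligent interpretation of remote sensing images. Remote sensing images cover wide areas, have complex and interleaved backgrounds, and contain ground objects of widely varying sizes. Existing methods segment multiscale ground objects poorly against complex backgrounds, and the resulting regions are fragmented with discontinuous boundaries. To address these problems, a semantic segmentation model guided by cross-layer detail perception and group attention is proposed for parsing high-resolution remote sensing images. Method First, the ConvNeXt backbone network, which has a novel structure, encodes features of the input image at each level. Second, a group collaborative attention module is designed that models feature dependencies along the channel and spatial dimensions in parallel within groups; channel attention and spatial attention jointly strengthen the feature information of important channels and regions. Next, a self-attention mechanism is introduced to build a cross-layer detail perception module, which uses the rich detail information in low-level features to guide high-level feature layers in learning spatial details, ensuring regional integrity and boundary continuity in the segmentation results. Finally, with Taiyuan City, Shanxi Province as the study area, a high-resolution remote sensing Taiyuan urban land cover dataset (TULCD) is constructed, and the proposed method accomplishes fine-grained land cover classification of the Taiyuan urban area. Result Experiments compare the proposed method with five recent algorithms on the self-made TULCD dataset and the public Vaihingen dataset. On the two datasets, the proposed method achieves a mean pixel accuracy (mPA) of 74.23% and 87.26%, a mean intersection over union (mIoU) of 58.91% and 77.02%, and a mean F1 score (mF1) of 72.24% and 86.35%, respectively, all better than the compared algorithms. Conclusion The proposed semantic segmentation model for high-resolution remote sensing images has strong spatial and detail perception capabilities, discriminates well between adjacent ground objects with small inter-class differences, and achieves high overall segmentation accuracy.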
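The abstract names its three evaluation metrics but does not define them. The sketch below assumes the commonly used definitions, computed from a per-class confusion matrix: mPA as the mean per-class recall, mIoU as the mean per-class intersection over union, and mF1 as the mean per-class F1 score. The function name `segmentation_metrics` is hypothetical, and the paper's own evaluation code may differ in details such as how absent classes are handled.

```python
import numpy as np

def segmentation_metrics(conf):
    """Mean pixel accuracy, mean IoU and mean F1 from a confusion matrix.

    conf[i, j] = number of pixels of true class i predicted as class j.
    Assumption: standard per-class definitions, averaged over all classes.
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)           # correctly classified pixels per class
    gt = conf.sum(axis=1)        # pixels per ground-truth class
    pred = conf.sum(axis=0)      # pixels per predicted class
    eps = 1e-12                  # guards against division by zero

    pa = tp / np.maximum(gt, eps)                  # per-class pixel accuracy (recall)
    iou = tp / np.maximum(gt + pred - tp, eps)     # per-class intersection over union
    precision = tp / np.maximum(pred, eps)
    f1 = 2 * precision * pa / np.maximum(precision + pa, eps)

    return {"mPA": pa.mean(), "mIoU": iou.mean(), "mF1": f1.mean()}

# Toy two-class example (made-up counts, not data from the paper).
scores = segmentation_metrics([[3, 1], [1, 5]])
print({name: round(float(value), 4) for name, value in scores.items()})
```

In this simplified version a class that never appears in either the labels or the predictions contributes a score of zero to each mean; an evaluation script would normally exclude such classes.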

Keywords

remote sensing images; semantic segmentation; fully convolutional network (FCN); attention mechanism; group convolution

Cross-layer detail perception and group attention-guided semantic segmentation network for remote sensing images

Li Linjuan^1,2, He Yun^1, Xie Gang^1,2, Zhang Haoxue^1,2, Bai Yanhong^1 (1. School of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China; 2. Shanxi Key Laboratory of Advanced Control and Equipment Intelligence, Taiyuan 030024, China)

Abstract

Objective Semantic segmentation plays a crucial role in the intelligent interpretation of remote sensing images. With the rapid advancement of remote sensing technology and big data mining, semantic segmentation of remote sensing images has become increasingly important in applications such as natural resource surveys, mineral exploration, water quality monitoring, and vegetation ecological assessment. Remote sensing images cover wide areas, have complex and interleaved backgrounds, and contain ground objects of widely varying sizes, all of which make the task difficult. Existing methods achieve limited segmentation accuracy, particularly for multiscale objects in complex backgrounds, and the resulting segmentation boundaries are often fuzzy and discontinuous. Thus, a cross-layer detail perception and group attention-guided semantic segmentation network (CDGCANet) is proposed for high-resolution remote sensing images.

Method First, the ConvNeXt backbone network is used to encode features of the input image at each level. ConvNeXt draws on both the popular Transformer architecture and the classic convolutional neural network architecture: it adopts the design strategies of Swin Transformer to improve the structure of ResNet50. Second, a group collaborative attention module is designed to model the feature dependencies of the channel and spatial dimensions in parallel, thereby capturing the spatial and channel relationships of multiscale features and promoting information interaction between channels. Channel attention and spatial attention jointly enhance the feature information of important channels and regions, which improves the network's ability to discriminate multiscale features, especially small targets.
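As a rough illustration of the grouped, parallel channel and spatial attention idea described above, the following NumPy sketch splits the channels into groups and gates each group with a channel branch and a spatial branch. Everything here is an assumption made for illustration: the function name `grouped_attention`, the parameter-free gates (global average pooling and channel-wise mean followed by a sigmoid), and the multiplicative combination. The abstract does not specify the module's layers, and the actual module presumably uses learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grouped_attention(x, groups=4):
    """Illustrative grouped channel + spatial gating (not the paper's module).

    x: feature map of shape (C, H, W); C must be divisible by `groups`.
    """
    channels = x.shape[0]
    assert channels % groups == 0, "channels must divide evenly into groups"
    out = np.empty_like(x, dtype=np.float64)

    for idx in np.split(np.arange(channels), groups):
        feat = x[idx]                                   # (C / groups, H, W)
        # Channel branch: one gate per channel from global average pooling.
        channel_gate = sigmoid(feat.mean(axis=(1, 2)))[:, None, None]
        # Spatial branch: one gate per pixel from the mean over channels.
        spatial_gate = sigmoid(feat.mean(axis=0))[None, :, :]
        # Both gates are computed from the same input, then applied jointly.
        out[idx] = feat * channel_gate * spatial_gate
    return out

features = np.random.default_rng(0).standard_normal((8, 4, 4))
attended = grouped_attention(features, groups=2)
print(attended.shape)
```

Because both gates lie strictly between 0 and 1, this toy version can only attenuate activations; a trained module would learn which channels and regions to emphasize.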
Next, a self-attention mechanism is introduced to construct the cross-layer detail-aware module (CDM), which uses the rich detail information in low-level features to guide high-level feature layers in learning spatial details and to ensure the regional integrity and boundary continuity of the segmentation results. During encoding in a semantic segmentation network, shallow features carry strong detail information but have poor semantic consistency because of their limited receptive field, whereas deep features are spatially coarse because of their low resolution and cannot recover the detail information. This leads to problems such as missing and discontinuous segmentation edges. The CDM uses the spatial information of earlier layers to guide the learning of detailed features in deeper layers and thus ensures semantic consistency between low-level and high-level features. Finally, with Taiyuan City, Shanxi Province as the research area, a high-resolution remote sensing Taiyuan urban land cover dataset, termed TULCD, is constructed. Its original remote sensing image is extracted from the 1 m-level data source of the domestic Gaofen-2 satellite; the image measures 56 251 × 52 654 pixels and occupies 12.7 GB. An overlap tiling strategy crops the large image with a sliding window of 512 pixels and a step size of 256 pixels to produce 512 × 512 pixel images. A total of 6 607 images are obtained and divided at a ratio of 8:2 into 5 285 training images and 1 322 validation images. The proposed method accomplishes fine-grained land cover classification of the urban area of Taiyuan City.

Result The proposed method is compared with five recent algorithms (UNet, PSPNet, DeeplabV3+, A2-FPN, and Swin Transformer) on the self-made TULCD dataset and the public Vaihingen dataset, and three evaluation metrics are used to assess model performance. CDGCANet outperforms the other algorithms on the TULCD dataset, with a mean pixel accuracy (mPA), mean intersection over union (mIoU), and mF1 of 74.23%, 58.91%, and 72.24%, respectively, exceeding the second-ranked model, PSPNet, by 4.61% in mPA, 1.58% in mIoU, and 1.63% in mF1. On the Vaihingen dataset, CDGCANet achieves 83.22%, 77.62%, and 86.26% for mPA, mIoU, and mF1, respectively, exceeding the second-ranked model, DeeplabV3+, by 1.86%, 2.62%, and 2.06%. The visualized results show that the model correctly identifies ground objects, producing complete segmentation areas, clear details, and continuous edges. In addition, the neural network visualization tool Grad-CAM is used to view the category heat maps output by the model. The experimental results show that the attention mechanism helps the model focus on key areas and ground objects and enhances its feature expression ability.

Conclusion The semantic segmentation model for high-resolution remote sensing images proposed in this study has strong spatial and detail perception capabilities. It improves segmentation accuracy and yields more satisfactory results on complex remote sensing images. Further optimization and in-depth research are expected to advance the practical application of the model and contribute to progress in remote sensing image interpretation.
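The overlap tiling and 8:2 split described in the Method part can be sketched as follows. The window size (512), step size (256), split ratio, and image counts come from the abstract; the function names `tile_origins` and `split_dataset`, the random shuffle, and the choice to drop partial windows at the image border are assumptions made for illustration. The abstract does not say how border remainders are handled or how the 6 607 retained tiles were selected from the full image.

```python
import random

def tile_origins(height, width, window=512, stride=256):
    """Top-left (row, col) corners of every full window inside the image.

    Assumption: windows that would extend past the border are dropped.
    """
    rows = range(0, height - window + 1, stride)
    cols = range(0, width - window + 1, stride)
    return [(r, c) for r in rows for c in cols]

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle and split; the training count is rounded down."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_ratio)
    return items[:n_train], items[n_train:]

# Small toy image: 4 window positions vertically, 3 horizontally.
origins = tile_origins(1280, 1024)

# Splitting 6 607 tile indices at 8:2 reproduces the counts in the abstract.
train, val = split_dataset(range(6607))
print(len(origins), len(train), len(val))  # prints "12 5285 1322"
```

With a step size equal to half the window, neighbouring tiles share half of their area along each axis, so every interior pixel appears in up to four tiles.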

Keywords

remote sensing images; semantic segmentation; fully convolutional network (FCN); attention mechanisms; group convolution