Remote sensing image semantic segmentation guided by cross-layer detail perception and grouped attention

Li Linjuan, Xie Gang, He Yun, Zhang Haoxue (Taiyuan University of Science and Technology)

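Abstract
Objective Semantic segmentation is one of the key tasks in the intelligent interpretation of remote sensing images. Remote sensing images cover wide areas, their backgrounds are intricately interleaved, and ground objects vary greatly in size. Existing methods segment multi-scale ground objects poorly against complex backgrounds, and the segmented regions are fragmented with discontinuous boundaries. To address these problems, a semantic segmentation model guided by cross-layer detail perception and grouped attention is proposed for parsing high-resolution remote sensing images. Method First, the ConvNeXt backbone network, which has a novel structure, encodes the features of the input image at each level of the network. Second, a group collaborative attention module is designed to model feature dependencies along the channel and spatial dimensions in parallel groups, with channel attention and spatial attention jointly strengthening the feature information of important channels and regions. Next, a self-attention mechanism is introduced to build a cross-layer detail-aware module, which uses the rich detail information in low-level features to guide high-level feature layers in learning spatial details, ensuring the regional integrity and boundary continuity of the segmentation results. Finally, taking Taiyuan City, Shanxi Province as the study area, a high-resolution remote sensing Taiyuan urban land cover dataset (TULCD) was built in-house, and the proposed method accomplishes fine-grained land cover classification of the Taiyuan urban area. Results Experiments compared the method with five recent algorithms on the self-built TULCD dataset and the public Vaihingen dataset, using three evaluation metrics to assess model performance. On the two datasets, the proposed method achieves a mean pixel accuracy (mPA) of 74.23% and 87.26%, a mean intersection over union (mIoU) of 58.91% and 77.02%, and a mean F1 score (mF1) of 72.24% and 86.35%, all better than the compared algorithms. Conclusion The proposed semantic segmentation model for high-resolution remote sensing images has strong spatial and detail perception capabilities, discriminates well between adjacent ground objects with small inter-class differences, and achieves high overall segmentation accuracy.
Keywords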
Cross-layer detail perception and group attention-guided semantic segmentation network for remote sensing

Li Linjuan, Xie Gang, He Yun, Zhang Haoxue (Taiyuan University of Science and Technology)

Abstract
Objective Semantic segmentation plays a crucial role in the intelligent interpretation of remote sensing images. With the rapid advancement of remote sensing technology and big data mining, the semantic segmentation of remote sensing images has become increasingly important in applications such as natural resource surveys, mineral exploration, water quality monitoring, and vegetation ecological assessment. The wide coverage of remote sensing images, their intricately interleaved backgrounds, and the large variation in ground object sizes make the task difficult. Existing methods achieve limited segmentation accuracy, particularly for multi-scale objects in complex backgrounds, and the resulting segmentation boundaries are often fuzzy and discontinuous. To address these problems, a cross-layer detail perception and group attention-guided semantic segmentation network (CDGCANet) is proposed for high-resolution remote sensing images. Method First, the ConvNeXt backbone network, which has a novel structure, is used to encode the features of the input image at each level of the network. ConvNeXt draws on both the popular transformer architecture and the classic convolutional neural network architecture: it applies the design strategies of Swin Transformer to improve the structure of ResNet50. Second, to model the spatial and channel relationships of multi-scale features and to promote information interaction between channels, the group collaborative attention module (GCAM) is designed to model feature dependencies along the channel and spatial dimensions in parallel.
Channel attention and spatial attention collaboratively enhance the feature information of important channels and regions, which improves the network's ability to discriminate multi-scale features, especially small targets. Next, a self-attention mechanism is introduced to construct the cross-layer detail-aware module (CDM), which uses the rich detail information in low-level features to guide high-level feature layers in learning spatial details and ensures the regional integrity and boundary continuity of the segmentation results. During encoding in a semantic segmentation network, the limited receptive field means that shallow features carry stronger detail information but poorer semantic consistency, while deeper features are spatially coarser because of their lower resolution and cannot restore detail information. This leads to problems such as missing and discontinuous segmentation edges. The CDM uses the spatial information of the shallower layers to guide the learning of detailed features in the deeper layers, ensuring semantic consistency between low-level and high-level features. Finally, taking Taiyuan City, Shanxi Province as the study area, a high-resolution remote sensing Taiyuan urban land cover dataset (TULCD) is built in-house. The source imagery comes from domestic satellites at 1-meter-level resolution; the original image measures 56251 × 52654 pixels and occupies 12.7 GB. An overlapping tiling strategy crops the large image with a sliding window of 512 pixels and a stride of 256 pixels, producing 512 × 512 pixel tiles. A total of 6607 images are obtained and divided at an 8:2 ratio into 5285 training images and 1322 validation images. The proposed method accomplishes fine-grained land cover classification of the Taiyuan urban area.
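The overlapping tiling and 8:2 split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (tile_origins, tile_boxes, split_train_val), the edge handling (shifting a final window back so the border is covered), and the seeded random shuffle are all assumptions, since the abstract gives only the window size, the stride, and the split ratio.

```python
import random


def tile_origins(length, window=512, stride=256):
    """Start offsets of sliding windows along one image axis.

    Assumption: if the regular grid leaves an uncovered border, one extra
    window is shifted back so that it ends exactly at the image edge.
    """
    if length <= window:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] + window < length:
        origins.append(length - window)
    return origins


def tile_boxes(height, width, window=512, stride=256):
    """(top, left, bottom, right) boxes for every window x window tile."""
    return [
        (top, left, top + window, left + window)
        for top in tile_origins(height, window, stride)
        for left in tile_origins(width, window, stride)
    ]


def split_train_val(items, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed (an assumption) and split into two lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # A toy 1100 x 1024 image: rows need one extra edge-aligned window.
    boxes = tile_boxes(1100, 1024)
    print(len(boxes), boxes[0], boxes[-1])

    # Splitting 6607 tile indices at 8:2 reproduces the reported set sizes.
    train, val = split_train_val(range(6607))
    print(len(train), len(val))
```

With a stride of half the window size, adjacent tiles overlap by 256 pixels, so objects cut by one tile boundary appear whole in a neighboring tile. Truncating 0.8 × 6607 gives the 5285 training images reported above, leaving 1322 for validation.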
Results The proposed method was compared with five recent algorithms (UNet, PSPNet, DeepLabV3+, A2-FPN, and Swin Transformer) on the self-built TULCD dataset and the public Vaihingen dataset, and three evaluation metrics were used to assess model performance. On the TULCD dataset, CDGCANet outperforms the other algorithms, with a mean pixel accuracy (mPA), mean intersection over union (mIoU), and mean F1 score (mF1) of 74.23%, 58.91%, and 72.24%, respectively, exceeding the second-ranked model, PSPNet, by 4.61% in mPA, 1.58% in mIoU, and 1.63% in mF1. On the Vaihingen dataset, CDGCANet achieves an mPA of 83.22%, an mIoU of 77.62%, and an mF1 of 86.26%, exceeding the second-ranked model, DeepLabV3+, by 1.86%, 2.62%, and 2.06%, respectively. The visualized results show that the model correctly identifies ground objects, with complete segmented regions, clear details, and continuous edges. In addition, the neural network visualization tool GradCAM is used to inspect the class heat maps output by the model. The experimental results show that the attention mechanism helps the model focus on key areas and ground objects and enhances its feature expression ability. Conclusion The semantic segmentation model for high-resolution remote sensing images proposed in this paper has strong spatial and detail perception capabilities, which not only improve segmentation accuracy but also yield more satisfactory results on complex remote sensing images. Future work will pursue further optimization and in-depth research to advance the practical application of the model in remote sensing image interpretation.
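For readers unfamiliar with the three metrics reported in the Results, the sketch below shows one common way to compute them from a confusion matrix. The abstract does not give the exact definitions used, so this assumes mPA is the mean of per-class pixel accuracy (per-class recall), mIoU is the mean of per-class intersection over union, and mF1 is the mean of per-class F1 scores, each averaged with equal weight over all classes. The function names are illustrative only.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return mPA, mIoU and mF1 as fractions in [0, 1].

    Assumption: a class with an empty denominator contributes 0.0 to the mean.
    """
    n = len(cm)
    pa, iou, f1 = [], [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class otherwise
        pa.append(tp / (tp + fn) if tp + fn else 0.0)
        iou.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        f1.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return {"mPA": sum(pa) / n, "mIoU": sum(iou) / n, "mF1": sum(f1) / n}


if __name__ == "__main__":
    # Four flattened pixels, two classes; one class-0 pixel is mislabeled as class 1.
    y_true = [0, 0, 1, 1]
    y_pred = [0, 1, 1, 1]
    metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, 2))
    print({name: round(value, 4) for name, value in metrics.items()})
```

Because each class counts equally in these means, a small class that is segmented poorly lowers the score as much as a large one, which is why class-averaged metrics are often preferred for land cover data with uneven class sizes.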
Keywords
