DP-SAM: Efficient Semantic Segmentation of Remote Sensing Images by Fine-tuning the SAM Model
Pages: 1-13 (2025)
Published Online: 17 February 2025
Accepted: 13 February 2025
DOI: 10.11834/jig.240540
Liu Siyong, Zhao Yili. DP-SAM: Efficient Semantic Segmentation of Remote Sensing Images by Fine-tuning the SAM Model [J]. Journal of Image and Graphics.
Objective
SAM (Segment Anything Model) has become a large-model benchmark for zero-shot segmentation of natural images. However, because remote sensing images are complex and their scenes highly variable, and because SAM is a segmentation model that requires prompt information, directly applying this "foundation" model to remote sensing image segmentation leads to over-segmentation and demands a large amount of manual prompt input. To address these problems, this paper proposes an efficient method that adapts SAM to remote sensing semantic segmentation through fine-tuning.
Method
First, the image encoder of the original SAM is retained and its parameters are fine-tuned, and a new CNN encoder path is introduced alongside it. Second, the decoder adopts a fine-tuned prompt-free approach, removing the need to supply prompts when applying SAM to image segmentation. The CNN and Transformer paths output two independent prediction masks, and the segmentation result is obtained from these two masks. We name this finely fine-tuned dual-path model DP-SAM.
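As a concrete illustration of the dual-path encoder, the following PyTorch sketch pairs SAM's image encoder with a ResNet18 path. The class name DPSamEncoder and the exact wiring are illustrative assumptions rather than the authors' released code, and whether the SAM encoder is frozen or fine-tuned is exposed as a flag.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class DPSamEncoder(nn.Module):
    """Illustrative dual-path encoder: SAM ViT features plus CNN skip features."""
    def __init__(self, sam_image_encoder: nn.Module, finetune_sam: bool = True):
        super().__init__()
        self.sam_encoder = sam_image_encoder           # Transformer path
        for p in self.sam_encoder.parameters():        # fine-tuned or frozen
            p.requires_grad = finetune_sam
        cnn = resnet18(weights="IMAGENET1K_V1")        # lightweight CNN path
        self.cnn_stages = nn.ModuleList([
            nn.Sequential(cnn.conv1, cnn.bn1, cnn.relu, cnn.maxpool, cnn.layer1),
            cnn.layer2, cnn.layer3, cnn.layer4,
        ])

    def forward(self, x: torch.Tensor):
        vit_feat = self.sam_encoder(x)                 # one deep feature map
        cnn_feats, f = [], x
        for stage in self.cnn_stages:                  # shallow-to-deep features,
            f = stage(f)                               # kept for decoder skips
            cnn_feats.append(f)
        return vit_feat, cnn_feats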
Results
DP-SAM is evaluated on two annotated remote sensing datasets, Potsdam and Vaihingen, and ablation experiments examine how the prediction mask should be generated from the outputs of the two decoder paths. The experiments show that DP-SAM segments remote sensing images efficiently, reaching 86.2% mIoU and a 92.7% F1 score on the Potsdam dataset and 85.9% mIoU and a 92.4% F1 score on the Vaihingen dataset.
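For reference, the two reported metrics can be computed from a per-class confusion matrix as below; this is a generic NumPy sketch with illustrative function names, not the paper's evaluation code.

import numpy as np

def confusion_matrix(pred, target, num_classes):
    # Accumulate a num_classes x num_classes confusion matrix from label maps.
    valid = (target >= 0) & (target < num_classes)
    idx = num_classes * target[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_f1(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                   # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp                   # pixels of class c that were missed
    iou = tp / (tp + fp + fn + 1e-10)          # per-class IoU
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-10)   # per-class F1 (Dice)
    return iou.mean(), f1.mean()               # mIoU, mean F1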
Conclusion
The proposed SAM fine-tuning method performs well and brings a large model to semantic segmentation in the remote sensing domain, outperforming the compared deep learning and SAM fine-tuning methods. The source code of this work will be available at https://github.com/Jacky-Android/DP-SAM.
Objective
SAM (Segment Anything Model) has become a large-model benchmark for zero-shot segmentation of natural images. However, remote sensing images contain complex, highly variable scenes and rich semantic information, and SAM is a segmentation model that requires prompt information. Directly applying this complex macro model to remote sensing image segmentation therefore suffers from over-segmentation and requires a large amount of professional manual prompting; worse, supplying such prompts for high-resolution images itself consumes substantial computing resources. In response to these problems, this paper proposes an effective SAM fine-tuning method to cope with the complexity of remote sensing image segmentation. The method combines the advantages of CNNs and Transformers and, by carefully adjusting the parameters and structure of the SAM model, aims to reduce the risk of over-segmentation and the reliance on manual prompt input, thereby improving the model's applicability and effectiveness in this field. In this way, the method adapts better to the characteristics of remote sensing images, improves the accuracy and efficiency of the segmentation results, and provides a more reliable solution for remote sensing image segmentation tasks.
Methods
First, when the input image passes through the SAM image encoder, the encoder is kept frozen so that SAM's original prior-knowledge weights are preserved, and a new lightweight CNN encoder path based on ResNet18 is introduced. This exploits the strong priors of SAM's original image encoder while fine-tuning through the CNN path, allowing DP-SAM to use the SAM weights and, at the same time, keep learning through the CNN path so as to avoid forgetting. Second, the decoder also has two paths. The mask decoder adopts a fine-tuned prompt-free approach, which eliminates the dependence on the prompt encoder module and makes the model more flexible and versatile. The CNN decoder path concatenates the shallow-to-deep semantic features of the CNN encoder, so that more feature information can be fused. The CNN decoder and the mask decoder thus output two prediction masks, and comparing their fused output against the two separate outputs is an informative experiment for deriving the optimal segmentation strategy. We name this fine-tuned two-path model DP-SAM (Dual-Path Segment Anything Model). The improved model not only raises segmentation accuracy but also remains lightweight and versatile, offering a more comprehensive and efficient solution for remote sensing image segmentation via SAM fine-tuning.
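A common way to make SAM prompt-free, matching the description above in spirit (though the authors' exact implementation may differ), is to call the prompt encoder with no points, boxes, or masks, so that only its learned "no prompt" embeddings reach the mask decoder, and then fine-tune the decoder on dense labels. A sketch using the official segment_anything package:

import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

image = torch.randn(1, 3, 1024, 1024)             # SAM's expected input size
with torch.no_grad():                              # encoder frozen in this sketch
    image_embeddings = sam.image_encoder(image)    # [1, 256, 64, 64]

# With all prompts set to None, the prompt encoder returns empty sparse
# embeddings and its learned dense "no mask" embedding.
sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=None, masks=None)

low_res_masks, iou_pred = sam.mask_decoder(        # this decoder is fine-tuned
    image_embeddings=image_embeddings,
    image_pe=sam.prompt_encoder.get_dense_pe(),
    sparse_prompt_embeddings=sparse_emb,
    dense_prompt_embeddings=dense_emb,
    multimask_output=False,
)                                                  # low_res_masks: [1, 1, 256, 256]
# Multi-class semantic output would additionally require widening the decoder's
# mask head to one channel per class; that adaptation is assumed, not shown.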
Results
We evaluate DP-SAM on two publicly available labeled datasets, Potsdam and Vaihingen. Both consist of high-resolution images of complex scenes, including dense streets and large building complexes. Such scenes challenge remote sensing segmentation models, demanding good generalization and sensitivity to detail, so evaluating on these two datasets gives a comprehensive picture of how DP-SAM handles high-resolution, complex-scene remote sensing imagery. Experimental results show that DP-SAM performs well in semantic segmentation: it achieves 86.11% mIoU and a 92.7% F1 score on the Potsdam dataset, and 85.9% mIoU and a 92.4% F1 score on the Vaihingen dataset. These metrics highlight the strong performance and robustness of DP-SAM in remote sensing image segmentation. We also conduct ablation experiments on whether the prediction masks generated by the two decoder paths should be fused or output separately when training the fine-tuned model; these experiments identify the optimal generation strategy for the dual-path masks of the DP-SAM decoder.
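The fusion strategies such an ablation might compare can be expressed compactly. The variants below (logit averaging, per-pixel maximum, and separate supervision with fusion only at inference) are illustrative assumptions, since the abstract does not spell out the exact rules tested.

import torch

def fuse_masks(mask_cnn, mask_sam, mode="mean"):
    # Both inputs are logits of shape [B, C, H, W] from the two decoder paths.
    if mode == "mean":                        # average the logits
        return (mask_cnn + mask_sam) / 2
    if mode == "max":                         # per-pixel maximum response
        return torch.maximum(mask_cnn, mask_sam)
    raise ValueError(f"unknown fusion mode: {mode}")

def dual_path_loss(criterion, mask_cnn, mask_sam, target):
    # Separate-output variant: supervise each path with its own loss during
    # training and fuse (or pick one path) only at evaluation time.
    return criterion(mask_cnn, target) + criterion(mask_sam, target)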
Conclusion
The method proposed in this paper brings large-scale models to the semantic segmentation scenarios required in remote sensing. DP-SAM not only captures the semantic information in images effectively but also produces highly accurate segmentation results. This is of great significance for remote sensing image processing and analysis: it provides reliable technical support and solutions for practical applications such as map production and environmental monitoring, and it promotes the wider adoption of remote sensing technology in practice. Through this method, we demonstrate the strong performance of large models in complex scenes, bringing higher accuracy and efficiency to remote sensing image analysis. The source code of this work will be available at https://github.com/Jacky-Android/DP-SAM.