Medical image segmentation with vision Mamba and adaptive multiscale loss fusion

Liu Jianming; Cao Shenghao; Zhang Zhipeng

doi:10.11834/jig.250224

Medical Image Processing

Vol. 31, Issue 1, Pages: 335-348 (2026)
Received: 22 May 2025
Revised: 23 June 2025
Accepted: 9 July 2025
Published: 16 January 2026

Liu Jianming, Cao Shenghao, Zhang Zhipeng. 2026. Medical image segmentation with vision Mamba and adaptive multiscale loss fusion. Journal of Image and Graphics (中国图象图形学报), 31(1): 0335-0348. DOI: 10.11834/jig.250224.

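Abstract (translated from the Chinese original)

Objective

In medical image segmentation, conventional models based on convolutional neural networks (CNNs) are inherently limited in capturing long-range dependencies, while in models based on the vision Transformer (ViT) the computational complexity of self-attention grows quadratically with image size, which makes them hard to deploy in resource-constrained real-world settings. To address these problems, a medical image segmentation method that combines vision Mamba with an adaptive multiscale loss, VMAML-UNet (medical image segmentation with vision Mamba and adaptive multi-scale loss), is proposed.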

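Method

VMAML-UNet adopts an encoder-decoder architecture. In the encoding stage, a vision Mamba block fused with wavelet convolution is designed to extract precise features of lesion regions with linear complexity and to enlarge the receptive field, and downsampling is performed by block merging. The decoding stage likewise uses vision Mamba blocks fused with wavelet convolution and performs upsampling by block expansion. In the skip connections, a wavelet convolution attention aggregation module is proposed to extract and fuse image features at different scales. In addition, a multiscale weighted loss regulated by a Kolmogorov-Arnold network (KAN) is designed to dynamically adjust the loss weight of each level.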
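To make the link between wavelet decomposition and a larger receptive field concrete, the following is a minimal sketch in plain Python, assuming a single-level Haar transform. It is not the paper's block: the function names `haar_dwt2` and `input_footprint` are hypothetical, and the abstract does not state which wavelet or how many levels the method uses.

```python
def haar_dwt2(img):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four half-resolution sub-bands: a low-pass band (ll) and three
    detail bands (lh, hl, hh). Band naming conventions vary between libraries.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(img), 2):
        rows = ([], [], [], [])
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2)  # scaled local average
            rows[1].append((a - b + c - d) / 2)  # difference across columns
            rows[2].append((a + b - c - d) / 2)  # difference across rows
            rows[3].append((a - b - c + d) / 2)  # diagonal difference
        for band, row in zip((ll, lh, hl, hh), rows):
            band.append(row)
    return ll, lh, hl, hh


def input_footprint(kernel_size, levels):
    """Side length, in input pixels, spanned by a k x k kernel applied to the
    low-pass band after `levels` Haar decompositions. Each level halves the
    resolution, so one coefficient summarizes a 2**levels-wide patch."""
    return kernel_size * 2 ** levels


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    low, *_ = haar_dwt2(image)
    print(len(low), "x", len(low[0]), "low-pass band from a 4 x 4 input")
    print("3 x 3 kernel footprint after 2 levels:", input_footprint(3, 2))
```

Because each low-pass coefficient already summarizes a 2 x 2 patch, a small kernel applied to that band sees a wider area of the original image at the same kernel cost.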

Result

Experiments on three markedly heterogeneous medical image datasets, BUSI (breast ultrasound images dataset), GlaS (gland segmentation in histology images challenge dataset), and CVC (CVC-ClinicDB dataset), show notable performance gains over mainstream Mamba-based medical image segmentation methods such as VM-UNet (vision Mamba UNet). On BUSI, intersection over union (IoU) and F1 score improve by 2.72% and 2.02%, respectively; on GlaS, by 3.38% and 1.89%; and on CVC, by 2.51% and 1.42%.

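Conclusion

By combining linear-complexity long-range dependency modeling based on vision Mamba with a KAN-based dynamic loss optimization mechanism, the proposed VMAML-UNet significantly reduces computational cost while improving segmentation accuracy on complex medical images. Its strong performance on the three datasets demonstrates its broad applicability and efficiency across different medical imaging scenarios.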

Abstract

Objective

Medical image segmentation identifies anatomical structures and regions of interest in medical images and plays a critical role in diagnosis and treatment planning. Although traditional convolutional neural network (CNN)-based models have shown notable success, they often struggle to capture long-range dependencies, resulting in suboptimal feature extraction and segmentation performance. This limitation is particularly problematic in medical imaging, where accurate and detailed segmentation is necessary for reliable diagnoses. Transformer-based models that use the self-attention mechanism excel at global context modeling, but their computational complexity is quadratic in image size, which raises their cost for dense medical image segmentation and hinders efficient real-world deployment. Recent studies indicate that state-space models such as Mamba can model long-range dependencies with linear complexity. Furthermore, Kolmogorov-Arnold networks (KAN) possess powerful nonlinear modeling capabilities suited to complex medical image features. However, traditional static loss-weighting strategies adapt poorly to the dynamic nature of medical image data. To address these challenges, VMAML-UNet is proposed, a novel medical image segmentation framework that combines visual Mamba for efficient long-range dependency modeling with a KAN-regulated multiscale weighted loss.
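The contrast between quadratic and linear complexity can be illustrated with a minimal sketch, assuming a scalar state-space recurrence with fixed parameters. It is not Mamba itself, whose parameters depend on the input and whose scan is hardware-aware; the names `ssm_scan` and `pairwise_interactions` and the values of `a`, `b`, and `c` are hypothetical.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    One pass over the input, so the cost grows linearly with len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the state carries information from all earlier inputs
        ys.append(c * h)
    return ys


def pairwise_interactions(n):
    """Number of token pairs that full self-attention scores for n tokens."""
    return n * n


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 4
    print(ssm_scan(impulse, a=0.5))  # the response decays by a factor of a per step
    for side in (32, 64):
        n = side * side  # tokens for a side x side feature map
        print(n, "scan steps vs", pairwise_interactions(n), "attention pairs")
```

Doubling the side of the feature map multiplies the number of scan steps by 4 and the number of attention pairs by 16, which is the scaling gap the abstract refers to.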

Method

VMAML-UNet adopts an encoder-decoder architecture, a widely used design for image segmentation. In the encoding stage, a visual Mamba block that incorporates wavelet convolutions (WCVM block) extracts precise, localized features from lesion regions with linear computational complexity. The wavelet convolutions enlarge the receptive field, which is critical for capturing long-range dependencies within the image, and the block strengthens the representation of critical areas, addressing insufficient feature capture. Downsampling in the encoder is performed through block merging, which reduces data dimensionality while retaining important features. In the decoding stage, WCVM blocks are reused and block expansion performs upsampling, so that the segmentation mask is reconstructed accurately and fine details are preserved. The skip connections between the encoder and decoder transfer critical information from low to high layers of the network. For these connections, this study introduces the wavelet convolution attention aggregation (WCAA) module, which fuses and refines multiscale features both spatially and across channels, allowing the model to capture more complex, multidimensional patterns. The module is particularly useful when the regions of interest are surrounded by similar tissue and are therefore hard to differentiate. Additionally, a KAN-regulated multiscale weighted loss module dynamically captures the nonlinear features and inter-layer dependencies among the outputs of different stages of the model. This module addresses the limitation of traditional static weighting strategies, which cannot adapt to the changing feature representations extracted at different layers. Specifically, the KAN module applies KAN convolutions to the final three decoder layers to generate multiscale segmentation masks, from which hierarchical losses are computed. These losses are combined with the corresponding encoder outputs to form the multiscale weighted loss. Finally, this loss is integrated with the loss computed from the true labels and predicted masks for backpropagation and model training.
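The idea of combining hierarchical losses with dynamic weights can be sketched as follows. This is only an illustration under stated assumptions: a softmax over per-scale scores stands in for the KAN, a soft Dice loss stands in for the unspecified per-scale loss, and the way encoder outputs enter the weighting is not modeled. All function names and the example numbers are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for flat lists of probabilities and binary labels
    (an assumed stand-in for the per-scale loss)."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def multiscale_weighted_loss(main_loss, scale_losses, scores):
    """Add dynamically weighted auxiliary losses to the main loss.

    `scores` would come from a learned module (a KAN in the paper); here
    they are plain numbers, which is an assumption of this sketch.
    """
    weights = softmax(scores)
    aux = sum(w * l for w, l in zip(weights, scale_losses))
    return main_loss + aux, weights


if __name__ == "__main__":
    target = [1, 1, 0, 0]
    main = dice_loss([0.9, 0.8, 0.1, 0.2], target)
    # Three hypothetical decoder scales, from coarse to fine.
    per_scale = [
        dice_loss([0.6, 0.5, 0.4, 0.5], target),
        dice_loss([0.7, 0.7, 0.3, 0.3], target),
        dice_loss([0.8, 0.8, 0.2, 0.2], target),
    ]
    total, weights = multiscale_weighted_loss(main, per_scale, scores=[0.2, 0.5, 1.0])
    print("weights:", [round(w, 3) for w in weights])
    print("total loss:", round(total, 4))
```

With static weighting the `scores` would be constants chosen by hand; letting a learned module produce them is what allows the contribution of each scale to change during training.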

Result

To evaluate the proposed VMAML-UNet model, experiments were conducted on three heterogeneous medical image datasets: BUSI, GlaS, and CVC. These datasets were selected because they represent different types of medical images with varying complexity and noise levels. Experimental results show that VMAML-UNet outperforms other segmentation methods, such as VM-UNet, which also employs VSS blocks. On the BUSI dataset, VMAML-UNet improved intersection over union (IoU) and F1 score by 2.72% and 2.02%, respectively, compared with VM-UNet. BUSI contains breast ultrasound images and is challenging because of noise and variability in image quality, yet the proposed model showed notable improvements under these conditions. On the GlaS dataset, which contains histology images for gland segmentation, VMAML-UNet achieved improvements of 3.38% in IoU and 1.89% in F1 score. The strong performance on this dataset highlights the model's capability to capture fine details in medical images. Similarly, on the CVC dataset, which comprises colonoscopy images, the model improved IoU by 2.51% and F1 score by 1.42%. These results further confirm that VMAML-UNet substantially improves segmentation performance across different types of medical images.
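For reference, the two reported metrics can be computed from binary masks as in the sketch below. It assumes the standard pixel-wise definitions, and the function name `iou_and_f1` and the toy masks are hypothetical. For binary segmentation, the F1 score coincides with the Dice coefficient.

```python
def iou_and_f1(pred, target):
    """Pixel-wise IoU and F1 for two binary masks given as 2D lists of 0/1.

    IoU = TP / (TP + FP + FN)
    F1  = 2 * TP / (2 * TP + FP + FN)
    """
    tp = fp = fn = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [1, 0]]
    iou, f1 = iou_and_f1(prediction, ground_truth)
    print("IoU:", round(iou, 4), "F1:", round(f1, 4))
```

The two metrics are tied by F1 = 2 IoU / (1 + IoU), so F1 is never lower than IoU for the same pair of masks.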

Conclusion

By integrating wavelet convolution-enhanced visual state-space (VSS) blocks, the proposed VMAML-UNet notably reduces computational cost and effectively addresses the limitations of CNN- and Transformer-based models in medical image segmentation. Its superior performance across three datasets highlights its broad applicability and efficiency in various medical imaging scenarios and offers useful insights for developing efficient and robust medical image segmentation methods.

