Image Denoising via Swin Transformer V2 and Feature Fusion U-Net
2025, pp. 1-11
Received: 2024-10-30; Revised: 2025-03-07; Accepted: 2025-03-10; Published online: 2025-03-12
DOI: 10.11834/jig.240659
Objective
Pure Transformer networks are highly effective for image denoising, but further quality improvements demand a large increase in training and inference resources; in addition, the original Swin Transformer adapts poorly to high-resolution inputs. To address these issues, we design a U-Net image denoising network based on Swin Transformer V2.
Method
In the downsampling stage, the network uses a Transformer block in which a Swin Transformer V2 branch and a convolution branch extract features in parallel; in the upsampling stage, a feature fusion mechanism strengthens the network's feature learning. To better suit the denoising task, the Transformer block moves the position of layer normalization and adopts a mirror padding mechanism, improving the adaptability of the Swin Transformer V2 block.
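The mirror padding for incomplete windows can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the function name and one-sided padding placement are assumptions:

```python
import numpy as np

def mirror_pad_to_window(feat, window):
    """Reflect-pad an H x W feature map so both dimensions become
    multiples of the window size, letting edge pixels form complete
    attention windows instead of being masked out."""
    h, w = feat.shape
    pad_h = (window - h % window) % window
    pad_w = (window - w % window) % window
    # Mirror (reflect) padding in place of masked incomplete windows.
    return np.pad(feat, ((0, pad_h), (0, pad_w)), mode="reflect")

x = np.arange(30.0).reshape(5, 6)   # 5 x 6 map, window size 4
y = mirror_pad_to_window(x, 4)
print(y.shape)                      # (8, 8): both sides now divisible by 4
```

With reflect mode, the padded rows mirror the interior rows, so the padded region stays statistically consistent with the image edge.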
Result
Denoising experiments are conducted on four common test sets: CBSD68 (Color Berkeley Segmentation Dataset), Kodak24, McMaster, and color Urban100, with peak signal-to-noise ratio (PSNR) as the evaluation metric. At noise level 50, the average PSNR values are 28.59, 29.87, 30.27, and 29.88, respectively. Compared with several popular convolution-based and Transformer-based denoising methods, our algorithm outperforms the convolution-based methods, and against a Transformer-based method of comparable performance it requires only 26.12% of the floating-point operations.
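PSNR, the metric used above, is derived from the mean squared error between the clean and denoised images. A minimal sketch for 8-bit images:

```python
import numpy as np

def psnr(clean, noisy, peak=255.0):
    """Peak signal-to-noise ratio in dB for images in [0, peak]."""
    mse = np.mean((clean.astype(np.float64) - noisy.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

clean = np.zeros((8, 8))
noisy = clean + 16.0                 # uniform error of 16 -> MSE = 256
print(round(psnr(clean, noisy), 2))  # 24.05
```

Higher PSNR means the denoised output is closer to the ground-truth image; a gain of even a few tenths of a dB is considered meaningful on these benchmarks.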
Conclusion
Both the Swin Transformer V2 blocks and the feature fusion mechanism effectively improve denoising quality. Compared with existing methods, our method matches or improves denoising performance while greatly reducing the computational resources needed for training and inference.
Objective
Image denoising represents a fundamental challenge in the field of image processing, with the primary goal of recovering clear images from their noise-degraded counterparts. Throughout the image acquisition and formation processes, multiple factors such as suboptimal lighting conditions, temperature fluctuations, and imaging system imperfections can significantly contribute to the presence of noise in the final images. The impact of image noise extends beyond mere visual perception degradation, substantially affecting the accuracy of advanced image processing tasks, including image segmentation and object recognition. Traditional denoising approaches, which require manual tuning of numerous parameters, are both complex and time-consuming. CNN-based (convolutional neural network) denoising methods have demonstrated promising results, yet they are inherently constrained by their convolution kernel sizes, which limits their ability to exploit global image information. Conversely, pure Transformer-based methods effectively leverage global image information but demand rapidly increasing computational resources for enhanced detail restoration. Additionally, the original Swin Transformer lacks good adaptability to high-resolution image inputs. In response to these challenges, we have developed a U-Net image denoising method based on Swin Transformer V2, which integrates Transformer features with conventional convolutional features and achieves strong denoising performance and visual quality across standard image denoising datasets.
Method
We present a novel image denoising network based on Swin Transformer V2. The network consists of downsampling and upsampling stages. During downsampling, images undergo feature extraction in progressively deeper feature spaces. Each encoder layer contains a different number of DB-Transformer blocks and Transformer blocks. In each DB-Transformer block, parallel Transformer and local convolution branches independently extract Transformer feature maps and local convolution feature maps, respectively, and these features interact before being passed to the next block. During upsampling, the network reconstructs images from the extracted features. The upsampling decoders contain only Transformer blocks, with each decoder preceded by a feature fusion module that receives features from both the downsampling and upsampling stages. The feature fusion module incorporates a global average pooling component and a multilayer perceptron, which, through a softmax function, generate dynamic weights that enable the network to adaptively select more informative features from different feature maps. Long-skip connections are employed before the final output, as noisy and clean images share considerable information, and these connections prevent gradient vanishing. To enhance the adaptability and denoising performance of Swin Transformer V2 blocks within our network, we position layer normalization before the self-attention computation, accelerating network convergence and suiting small-scale model training. Furthermore, instead of utilizing masked shifted window-based multi-head self-attention, we implement mirror padding for incomplete window sections, enhancing the contribution of edge pixels to training in recognition of their equal importance in image denoising tasks. We train on a combined dataset of BSD500, DIV2K, Flickr2K, and WaterlooED, with random patch selection from each image.
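The feature fusion module's weighting scheme (global average pooling, a multilayer perceptron, and a softmax over branches) can be sketched as below. The MLP shapes and wiring here are illustrative assumptions with random placeholders for learned parameters, not the paper's exact configuration:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(f_enc, f_dec, w1, w2):
    """Fuse encoder and decoder feature maps (C x H x W) with dynamic
    per-channel branch weights: GAP -> 2-layer MLP -> softmax over the
    two branches. w1 and w2 stand in for learned MLP parameters."""
    c = f_enc.shape[0]
    g = (f_enc + f_dec).mean(axis=(1, 2))       # global average pooling -> (C,)
    h = np.maximum(w1 @ g, 0.0)                 # hidden layer with ReLU
    logits = (w2 @ h).reshape(2, c)             # one logit per branch per channel
    a = softmax(logits, axis=0)                 # branch weights sum to 1
    return a[0][:, None, None] * f_enc + a[1][:, None, None] * f_dec

rng = np.random.default_rng(0)
c = 4
f_enc, f_dec = rng.normal(size=(2, c, 8, 8))
w1 = rng.normal(size=(c // 2, c))               # placeholder MLP parameters
w2 = rng.normal(size=(2 * c, c // 2))
out = fuse(f_enc, f_dec, w1, w2)
print(out.shape)                                # (4, 8, 8)
```

Because the weights are a softmax output, each fused channel is a convex combination of the two branches, letting the network lean toward whichever feature map is more informative.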
Our experiments utilize the Charbonnier loss function and progressive training mechanism, conducted on a single NVIDIA GeForce RTX 4070Ti Super GPU.
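The Charbonnier loss is a smooth, differentiable approximation of the L1 loss, commonly used in image restoration. A minimal sketch (the eps value is a common default, assumed here rather than taken from the paper):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier loss: sqrt(diff^2 + eps^2) averaged over all pixels.
    Behaves like L1 for large errors but stays differentiable at zero."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

x = np.zeros((4, 4))
loss_same = charbonnier(x, x)          # equals eps when pred == target
loss_off = charbonnier(x, x + 3.0)     # ~3.0 for a uniform error of 3
```

Compared with plain L2, the near-L1 behavior penalizes outliers less heavily, which tends to preserve edges and fine detail in the restored image.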
Result
To validate the model's effectiveness, we conducted comprehensive testing on four widely recognized datasets in the image denoising domain: CBSD68 (Color Berkeley Segmentation Dataset), Kodak24, McMaster, and color Urban100. Employing peak signal-to-noise ratio (PSNR) as the primary evaluation metric, our denoising experiments achieved average PSNR values of 28.59, 29.87, 30.27, and 29.88, respectively, across these datasets at noise level 50. Compared to traditional algorithms, our approach demonstrates significantly enhanced denoising effects and visual perception, with PSNR metrics surpassing those of CNN-based denoising methods. Notably, while achieving performance comparable to Transformer-based methods, our denoising algorithm requires only 26.12% of the floating-point operations. Additionally, we conducted extensive ablation studies to verify the effectiveness of our proposed method, examining the number of convolution blocks, the feature fusion modules, and the Transformer block improvements. The experimental results demonstrate that our approach effectively balances training efficiency with image denoising performance.
Conclusion
We have developed and implemented a U-Net deep learning network based on Swin Transformer V2 for image denoising, establishing the viability of Swin Transformer V2 in this domain. Our network architecture effectively combines the strengths of local convolution and the Transformer, efficiently extracting valuable information from both types of feature maps while achieving superior training efficiency. The experimental results demonstrate that the proposed architecture offers significant advantages in both detail restoration and operational efficiency.