通道注意力嵌入的Transformer图像超分辨率重构
Image super-resolution with channel-attention-embedded Transformer
2023年28卷第12期 页码:3744-3757
纸质出版日期: 2023-12-16
DOI: 10.11834/jig.221033
熊巍, 熊承义, 高志荣, 陈文旗, 郑瑞华, 田金文. 2023. 通道注意力嵌入的Transformer图像超分辨率重构. 中国图象图形学报, 28(12):3744-3757
Xiong Wei, Xiong Chengyi, Gao Zhirong, Chen Wenqi, Zheng Ruihua, Tian Jinwen. 2023. Image super-resolution with channel-attention-embedded Transformer. Journal of Image and Graphics, 28(12):3744-3757
目的
基于深度学习的图像超分辨率重构研究取得了重大进展,如何在更好提升重构性能的同时,有效降低重构模型的复杂度,以满足低成本及实时应用的需要,是该领域研究关注的重要问题。为此,提出了一种基于通道注意力(channel attention,CA)嵌入的Transformer图像超分辨率深度重构方法(image super-resolution with channel-attention-embedded Transformer,CAET)。
方法
提出将通道注意力自适应地嵌入Transformer变换特征及卷积运算特征,不仅可充分利用卷积运算与Transformer变换在图像特征提取的各自优势,而且将对应特征进行自适应增强与融合,有效改进网络的学习能力及超分辨率性能。
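作为示意(以下公式为根据上文描述给出的假设形式,并非原文公式),通道注意力与不同层级特征的自适应线性加权融合可表示为

F_{\text{fused}} = \alpha_{1}\,\mathrm{CA}(F_{\text{conv}}) + \alpha_{2}\,F_{\text{conv}} + \alpha_{3}\,F_{\text{in}}

其中 F_{\text{conv}} 为卷积级联提取的特征,\mathrm{CA}(\cdot) 为通道注意力运算,F_{\text{in}} 为模块输入特征,\alpha_{i} 为可学习的加权系数;融合后的特征再送入 Swin Transformer 层作进一步的深层特征提取。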
结果
基于5个开源测试数据集,与6种代表性方法进行了实验比较,结果显示本文方法在不同放大倍数情形下均有最佳表现。具体地,在4倍放大因子时,与先进的SwinIR(image restoration using Swin Transformer)方法相比,峰值信噪比指标在Urban100数据集上提升了0.09 dB,在Manga109数据集上提升了0.30 dB,且主观视觉质量有明显改善。
结论
提出的通道注意力嵌入的Transformer图像超分辨率方法,通过融合卷积特征与Transformer特征,并自适应嵌入通道注意力进行特征增强,可以在较好地保持网络模型轻量化的同时,有效提升图像超分辨率性能。在多个公共实验数据集上的测试结果验证了本文方法的有效性。
Objective
Research on single image super-resolution reconstruction based on deep learning has made great progress in recent years. However, to improve reconstruction performance, previous studies have mostly focused on building complex networks with a large number of parameters. How to effectively reduce model complexity while improving reconstruction performance, so as to meet the needs of low-cost and real-time applications, has become an important research direction. State-of-the-art lightweight super-resolution methods are mainly based on convolutional neural networks, and only a few have been designed with Transformers, which have shown excellent performance in image restoration tasks. To address these problems, we propose a lightweight super-resolution network called image super-resolution with channel-attention-embedded Transformer (CAET), which achieves excellent super-resolution performance with a small number of parameters.
Method
CAET involves four stages, namely, shallow feature extraction, hierarchical feature extraction, multi-level feature fusion, and image reconstruction. The hierarchical feature extraction stage is performed by a basic building block called the channel-attention-embedded Transformer block (CAETB), which adaptively embeds channel attention (CA) into Transformer and convolutional features, thereby not only taking full advantage of convolution and Transformer operations in image feature extraction but also adaptively enhancing and fusing the corresponding features. Convolutional layers provide stable optimization and extraction results during early visual feature processing, and convolution layers with spatially invariant filters can enhance the translational equivariance of the network. Stacking convolutional layers also effectively increases the receptive field of the network. Therefore, three cascaded convolutional layers, activated by the LeakyReLU function, are placed at the front of CAETB to receive the features output by the previous module. The features extracted by the convolution layers are embedded with channel attention. To adjust the channel attention contribution effectively, we adopt a linear weighting scheme that combines channel attention with features from different levels. These features are then fed into Swin Transformer layers for further deep feature extraction. Given that performance saturates as network depth increases, we set the number of CAETBs to 4 to balance model complexity against super-resolution performance. The hierarchical information produced at different stages is helpful for the final reconstruction. Therefore, CAET combines all the low- and high-level information from the deep feature extraction and multi-level feature fusion stages. In the image reconstruction phase, we use a convolution layer and a pixel-shuffle layer to upsample the features to the dimensions of the high-resolution image. During training, we use 800 images from the DIV2K dataset and augment all training images by randomly flipping them vertically and horizontally to increase the diversity of the training data. For each mini-batch, we randomly crop image patches of 64 × 64 pixels as our low-resolution (LR) inputs. We optimize the network with the Adam algorithm and use the L1 loss as the loss function.
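As an illustration of the block just described, the following is a minimal PyTorch sketch, not the authors' released implementation: the class names, the exact fusion wiring, and the swin_body placeholder are assumptions. It cascades three LeakyReLU-activated convolutions, applies squeeze-and-excitation-style channel attention, linearly weights the attention-enhanced, convolutional, and input features with learnable scalars, and passes the result to a Transformer stage.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention (Hu et al., 2018):
    # global average pooling followed by a bottleneck that produces
    # per-channel scaling weights in (0, 1).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))

class CAETB(nn.Module):
    # Channel-attention-embedded Transformer block (illustrative sketch).
    # swin_body stands in for the Swin Transformer layers; any module
    # mapping (N, C, H, W) -> (N, C, H, W) can be plugged in.
    def __init__(self, channels, swin_body, negative_slope=0.2):
        super().__init__()
        # Three cascaded 3 x 3 convolutions, each activated by LeakyReLU.
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(negative_slope, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(negative_slope, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(negative_slope, inplace=True),
        )
        self.ca = ChannelAttention(channels)
        # Learnable scalars for the linear weighting of features from
        # different levels (the exact fusion form is an assumption).
        self.alpha = nn.Parameter(torch.full((3,), 1.0 / 3.0))
        self.swin = swin_body

    def forward(self, x):
        conv_feat = self.convs(x)
        fused = (self.alpha[0] * self.ca(conv_feat)
                 + self.alpha[1] * conv_feat
                 + self.alpha[2] * x)
        # Deep feature extraction with the Transformer stage, plus a
        # residual connection back to the block input.
        return self.swin(fused) + x

For a quick shape check, CAETB(64, nn.Identity())(torch.randn(1, 64, 32, 32)) returns a tensor of the same shape; in the full network, the reconstruction head then applies a convolution followed by nn.PixelShuffle to reach the high-resolution size.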
Result
We conduct experiments on five public datasets, namely, Set5, Set14, Berkeley segmentation dataset (BSD100), Urban100, and Manga109, to compare the performance of our proposed method with that of six state-of-the-art models, including the super-resolution convolutional neural network (SRCNN), the cascading residual network (CARN), the information multi-distillation network (IMDN), super-resolution with lattice block (LatticeNet), and image restoration using Swin Transformer (SwinIR). We measure the performance of these methods using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics. Given that humans are highly sensitive to image brightness, we compute these metrics on the Y channel. Experimental results show that the proposed method obtains the highest PSNR and SSIM values and recovers more detailed information and more accurate textures than the state-of-the-art methods at the ×2, ×3, and ×4 upscaling factors. At the ×4 upscaling factor, the PSNR of the proposed method is improved by 0.09 dB on the Urban100 dataset and by 0.30 dB on the Manga109 dataset compared with that of SwinIR. In terms of model complexity, CAET achieves better performance with fewer parameters and multiply-accumulate operations than SwinIR, which also uses a Transformer backbone. Although CAET consumes more parameters and multiply-accumulate operations than IMDN and LatticeNet, it achieves significantly higher PSNR and SSIM.
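For reference, the evaluation protocol described above can be sketched as follows; the RGB-to-Y constants follow the ITU-R BT.601 conversion commonly used in super-resolution benchmarks, which is an assumption here since the paper does not state its exact constants.

import numpy as np

def rgb_to_y(img):
    # Map an 8-bit RGB image to the luminance (Y) channel using the
    # ITU-R BT.601 full-range conversion (as in MATLAB's rgb2ycbcr).
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    # PSNR in dB between a super-resolved image and its ground truth,
    # computed on the Y channel only.
    y_sr = rgb_to_y(sr.astype(np.float64))
    y_hr = rgb_to_y(hr.astype(np.float64))
    mse = np.mean((y_sr - y_hr) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)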
Conclusion
The proposed CAET effectively improves image super-resolution reconstruction performance by fusing convolution and Transformer features and by adaptively embedding channel attention for feature enhancement, while keeping the complexity of the whole network under control. Experimental results on several public datasets verify the effectiveness of our method.
超分辨率(SR);Transformer;卷积神经网络(CNN);通道注意力(CA);深度学习
super-resolution (SR); Transformer; convolutional neural network (CNN); channel attention (CA); deep learning
Agustsson E and Timofte R. 2017. NTIRE 2017 challenge on single image super-resolution: dataset and study//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA: IEEE: 1122-1131 [DOI: 10.1109/CVPRW.2017.150]
Ahn N, Kang B and Sohn K A. 2018. Fast, accurate, and lightweight super-resolution with cascading residual network//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 256-272 [DOI: 10.1007/978-3-030-01249-6_16]
Bevilacqua M, Roumy A, Guillemot C and Morel M L A. 2012. Low-complexity single-image super-resolution based on nonnegative neighbor embedding//Proceedings of the 23rd British Machine Vision Conference. Surrey, UK: BMVA Press: 1-12 [DOI: 10.5244/c.26.135]
Chen H T, Wang Y H, Guo T Y, Xu C, Deng Y P, Liu Z H, Ma S W, Xu C J, Xu C and Gao W. 2021. Pre-trained image processing Transformer//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 12294-12305 [DOI: 10.1109/CVPR46437.2021.01212]
Dong C, Loy C C, He K M and Tang X O. 2014. Learning a deep convolutional network for image super-resolution//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 184-199 [DOI: 10.1007/978-3-319-10593-2_13]
Dong C, Loy C C and Tang X O. 2016. Accelerating the super-resolution convolutional neural network//Proceedings of the 16th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 391-407 [DOI: 10.1007/978-3-319-46475-6_25]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2022-10-10]. https://doi.org/10.48550/arXiv.2010.11929
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Huang J B, Singh A and Ahuja N. 2015. Single image super-resolution from transformed self-exemplars//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 5197-5206 [DOI: 10.1109/CVPR.2015.7299156]
Hui Z, Gao X B, Yang Y C and Wang X M. 2019. Lightweight image super-resolution with information multi-distillation network//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 2024-2032 [DOI: 10.1145/3343031.3351084]
Jiang M J, Qian W H, Xu D, Wu H and Liu C Y. 2022. Gradual model reconstruction of Dongba painting based on residual dense structure. Journal of Image and Graphics, 27(4): 1084-1096
蒋梦洁, 钱文华, 徐丹, 吴昊, 柳春宇. 2022. 残差密集结构的东巴画渐进式重建. 中国图象图形学报, 27(4): 1084-1096 [DOI: 10.11834/jig.200523]
Kim J, Lee J K and Lee K M. 2016a. Accurate image super-resolution using very deep convolutional networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 1646-1654 [DOI: 10.1109/CVPR.2016.182]
Kim J, Lee J K and Lee K M. 2016b. Deeply-recursive convolutional network for image super-resolution//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1637-1645 [DOI: 10.1109/CVPR.2016.181]
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2022-10-10]. https://arxiv.org/pdf/1412.6980.pdf
Lai W S, Huang J B, Ahuja N and Yang M H. 2017. Deep Laplacian pyramid networks for fast and accurate super-resolution//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 5838-5843 [DOI: 10.1109/CVPR.2017.618]
Lei P C, Liu C, Tang J G and Peng D L. 2020. Hierarchical feature fusion attention network for image super-resolution reconstruction. Journal of Image and Graphics, 25(9): 1773-1786
雷鹏程, 刘丛, 唐坚刚, 彭敦陆. 2020. 分层特征融合注意力网络图像超分辨率重建. 中国图象图形学报, 25(9): 1773-1786 [DOI: 10.11834/jig.190607]
Lei S, Shi Z W and Mo W J. 2022. Transformer-based multistage enhancement for remote sensing image super-resolution. IEEE Transactions on Geoscience and Remote Sensing, 60: #5615611 [DOI: 10.1109/TGRS.2021.3136190]
Liang J Y, Cao J Z, Sun G L, Zhang K, Van Gool L and Timofte R. 2021. SwinIR: image restoration using Swin Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Montreal, Canada: IEEE: 1833-1844 [DOI: 10.1109/ICCVW54120.2021.00210]
Lim B, Son S, Kim H, Nah S and Lee K M. 2017. Enhanced deep residual networks for single image super-resolution//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA: IEEE: 1132-1140 [DOI: 10.1109/CVPRW.2017.151]
Liu J, Tang J and Wu G S. 2020. Residual feature distillation network for lightweight image super-resolution//Proceedings of 2020 European Conference on Computer Vision. Glasgow, UK: Springer: 41-55 [DOI: 10.1007/978-3-030-67070-2_2]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Lu Z S, Li J C, Liu H, Huang C Y, Zhang L L and Zeng T Y. 2022. Transformer for single image super-resolution//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). New Orleans, USA: IEEE: 456-465 [DOI: 10.1109/CVPRW56347.2022.00061]
Luo X T, Xie Y, Zhang Y L, Qu Y Y, Li C H and Fu Y. 2020. LatticeNet: towards lightweight image super-resolution with lattice block//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 272-289 [DOI: 10.1007/978-3-030-58542-6_17]
Matsui Y, Ito K, Aramaki Y, Fujimoto A, Ogawa T, Yamasaki T and Aizawa K. 2017. Sketch-based manga retrieval using Manga109 dataset. Multimedia Tools and Applications, 76(20): 21811-21838 [DOI: 10.1007/s11042-016-4020-z]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Shi W Z, Caballero J, Huszár F, Totz J, Aitken A P, Bishop R, Rueckert D and Wang Z H. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1874-1883 [DOI: 10.1109/CVPR.2016.207]
Tai Y, Yang J and Liu X M. 2017. Image super-resolution via deep recursive residual network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 2790-2798 [DOI: 10.1109/CVPR.2017.298]
Timofte R, De Smet V and Van Gool L. 2014. A+: adjusted anchored neighborhood regression for fast super-resolution//Proceedings of the 12th Asian Conference on Computer Vision. Singapore, Singapore: Springer: 111-126 [DOI: 10.1007/978-3-319-16817-3_8]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang Z D, Cun X D, Bao J M, Zhou W G, Liu G Z and Li H Q. 2022. Uformer: a general U-shaped Transformer for image restoration//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 17683-17693 [DOI: 10.1109/CVPR52688.2022.01716]
Wang Z H, Chen J and Hoi S C H. 2021. Deep learning for image super-resolution: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10): 3365-3387 [DOI: 10.1109/TPAMI.2020.2982166]
Xiong C Y, Shi X D, Gao Z R and Wang G. 2021. Attention augmented multi-scale network for single image super-resolution. Applied Intelligence, 51(2): 935-951 [DOI: 10.1007/s10489-020-01869-z]
Yu J H, Fan Y C, Yang J C, Xu N, Wang Z W, Wang X C and Huang T. 2018. Wide activation for efficient and accurate image super-resolution [EB/OL]. [2022-10-10]. https://arxiv.org/pdf/1808.08718v1.pdf
Zeyde R, Elad M and Protter M. 2010. On single image scale-up using sparse-representations//Proceedings of the 7th International Conference on Curves and Surfaces. Avignon, France: Springer: 711-730 [DOI: 10.1007/978-3-642-27413-8_47]
Zhang X D, Zeng H and Zhang L. 2021. Edge-oriented convolution block for real-time super resolution on mobile devices//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM: 4034-4043 [DOI: 10.1145/3474085.3475291]
Zhang Y L, Li K P, Li K, Wang L C, Zhong B N and Fu Y. 2018. Image super-resolution using very deep residual channel attention networks//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 294-310 [DOI: 10.1007/978-3-030-01234-2_18]