Xiong Wei1, Xiong Chengyi2, Gao Zhirong3, Chen Wenqi1, Zheng Ruihua1, Tian Jinwen4 (1. School of Electronic and Information Engineering, South-Central Minzu University; 2. School of Electronic and Information Engineering and Hubei Key Laboratory of Intelligent Wireless Communication, South-Central Minzu University; 3. School of Computer Science, South-Central Minzu University; 4. State Key Laboratory of Multispectral Information Processing Technology, Huazhong University of Science and Technology)
Objective Deep learning based image super-resolution reconstruction has made great progress in recent years. How to effectively reduce model complexity while further improving reconstruction performance, so as to meet the needs of low-cost and real-time applications, is an important open problem in this field. To this end, this paper proposes a Transformer-based deep image super-resolution method with embedded channel attention (image super-resolution with channel attention embedded Transformer, CAET). Method We propose to adaptively embed channel attention into both the Transformer features and the convolutional features. This not only exploits the complementary strengths of convolution and the Transformer in image feature extraction, but also adaptively enhances and fuses the corresponding features, effectively improving the learning ability and super-resolution performance of the network. Result On five open test datasets, the proposed method is compared with six representative methods and performs best at all upscaling factors. Specifically, at the ×4 upscaling factor, compared with the state-of-the-art SwinIR, the PSNR is improved by 0.09 dB on Urban100 and by 0.30 dB on Manga109, with a clear improvement in subjective visual quality. Conclusion By fusing convolutional features with Transformer features and adaptively embedding channel attention for feature enhancement, the proposed CAET effectively improves image super-resolution performance while keeping the model lightweight; test results on several public datasets verify the effectiveness of our method.
Image super-resolution with channel attention embedded Transformer
Xiong Wei1, Xiong Chengyi2, Gao Zhirong3, Chen Wenqi1, Zheng Ruihua1, Tian Jinwen4 (1. School of Electronic and Information Engineering, South-Central Minzu University; 2. School of Electronic and Information Engineering and Hubei Key Laboratory of Intelligent Wireless Communication, South-Central Minzu University; 3. School of Computer Science, South-Central Minzu University; 4. State Key Laboratory of Multispectral Information Processing Technology, Huazhong University of Science and Technology)
Objective Single-image super-resolution reconstruction based on deep learning has made great progress in recent years. However, to obtain better reconstruction performance, most existing research focuses on building complex networks with large numbers of parameters. How to effectively reduce model complexity while improving reconstruction performance, so as to meet the needs of low-cost and real-time applications, has therefore become an important research direction. State-of-the-art lightweight super-resolution methods are mainly based on convolutional neural networks, and few have been designed with the Transformer, which shows excellent performance on image restoration tasks. To address these problems, we propose a lightweight super-resolution network named image super-resolution with channel attention embedded Transformer (CAET), which achieves excellent super-resolution performance with a small number of parameters. Method The proposed CAET is composed of four stages: shallow feature extraction, hierarchical feature extraction, multi-layer feature fusion, and image reconstruction. Specifically, the hierarchical feature extraction stage is built from a basic block named the channel attention embedded Transformer block (CAETB). CAETB adaptively embeds channel attention into both Transformer features and convolutional features, which not only exploits the respective strengths of convolution and the Transformer in image feature extraction, but also adaptively enhances and fuses the corresponding features. Convolutional layers provide more stable optimization and better results in early visual feature processing, and convolution with spatially invariant filters strengthens the translation equivariance of the network. Stacking convolutional layers also effectively enlarges the receptive field of the network.
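The channel-attention embedding at the heart of CAETB is not spelled out above; a minimal squeeze-and-excitation style sketch in NumPy may help illustrate the mechanism. The layer shapes, reduction ratio, and random weights below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style channel attention:
    squeeze (global average pool), excite (two FC layers),
    then rescale each channel by its learned gate.
    features: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    squeeze = features.mean(axis=(1, 2))            # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze, 0.0)          # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gate in (0, 1)
    return features * gate[:, None, None]           # per-channel rescale

# toy usage: 8 channels, assumed reduction ratio r = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
y = channel_attention(x, w1, w2)                    # same shape as x
```

Because the gate lies in (0, 1), each output channel is a damped copy of its input; in CAET such gated features are then linearly combined with the Transformer branch.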
Therefore, three cascaded convolutional layers, activated by the LeakyReLU function, are placed at the front of the CAETB to receive the features output by the previous module. The features extracted by these convolutional layers are then embedded with channel attention. To better adjust the contribution of channel attention, we adopt a linear weighting scheme to combine the channel attention with features from different levels. The resulting features are fed into Swin Transformer layers for further deep feature extraction. Because increasing the network depth leads to performance saturation, the number of CAETBs is set to 4 to balance model complexity and super-resolution performance. Hierarchical information from different stages is helpful for the final reconstruction, so in the multi-level feature fusion stage CAET combines all the low-level and high-level information from the deep feature extraction stage. In the image reconstruction stage, a convolutional layer and a pixel shuffle layer upsample the features to the dimensions of the high-resolution image. During training, 800 images from the DIV2K dataset are used to train the proposed CAET, and all training images are augmented by random vertical and horizontal flipping to increase the diversity of the training data. For each mini-batch, image patches of 64 × 64 pixels are randomly cropped as LR inputs. The network is optimized with the Adam algorithm, using the L1 loss as the loss function. Result Five public datasets are used as test sets to evaluate the performance of the proposed method: Set5, Set14, Berkeley segmentation dataset (BSD100), Urban100, and Manga109.
We compare our work with six state-of-the-art models, including the super-resolution convolutional neural network (SRCNN), cascading residual network (CARN), information multi-distillation network (IMDN), super-resolution with lattice block (LatticeNet), and image restoration using Swin Transformer (SwinIR). Performance is measured by peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). Because humans are more sensitive to image brightness, these metrics are computed on the Y channel of the image. Experiments show that the proposed method achieves the highest PSNR and SSIM values and recovers more detailed information and more accurate textures than most of the compared methods at the ×2, ×3, and ×4 amplification factors. In particular, at the ×4 amplification factor, the PSNR is improved by 0.09 dB on the Urban100 dataset and 0.30 dB on the Manga109 dataset compared with the current advanced method SwinIR. In terms of model complexity, CAET achieves better performance with fewer parameters and multiply-accumulate operations than SwinIR, which also uses the Transformer as the backbone of the network. Although CAET requires more parameters and multiply-accumulate operations than IMDN and LatticeNet, it achieves significantly higher PSNR and SSIM. Conclusion The proposed image super-resolution method with channel attention embedded Transformer (CAET) effectively improves image super-resolution reconstruction performance by fusing convolutional features with Transformer features and adaptively embedding channel attention for feature enhancement. CAET effectively improves super-resolution performance while controlling the complexity of the whole network. Test results on several public experimental datasets verify the effectiveness of our method.
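The Y-channel PSNR used in the comparison can be sketched as follows. The text does not specify the exact RGB-to-Y conversion; the sketch assumes the ITU-R BT.601 "studio" luma coefficients commonly used in super-resolution evaluation:

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma in [16, 235] from 8-bit RGB (assumed convention)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)

# toy usage: a uniform +1 error on the Y channel gives MSE = 1
y = rgb_to_y(np.full((8, 8, 3), 120.0))
score = psnr(y, y + 1.0)   # 10 * log10(255^2 / 1) ≈ 48.13 dB
```

SSIM would be computed on the same Y channel with local window statistics; a full implementation is available in standard image-processing libraries.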