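Sun Yang (孙阳), Ding Jianwei (丁建伟), Zhang Qi (张琪), Deng Qiyao (邓琪瑶) (School of Information Network Security, People's Public Security University of China)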
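Abstract: Objective In recent years, super-resolution (SR) reconstruction methods that partition the image into windows and apply self-attention within them for feature extraction have achieved remarkable results and have become a focus of SR research. However, window partitioning limits the range over which image information is aggregated and restricts the model's ability to model feature information. To address this, we use a transposed self-attention mechanism to build a global information modeling network that captures the global dependencies of the image. Method A lightweight baseline model first performs simple relational modeling of the features. Self-attention is then moved from the spatial dimension to the channel dimension, and long-range dependencies between pixels are built by computing a cross-covariance matrix. Next, a channel attention block is introduced to supply the local information needed for image reconstruction. Finally, a dual gating mechanism controls the flow of information through the model, improving its feature modeling capability and robustness. Result The method was compared with mainstream methods on five benchmark datasets (Set5, Set14, BSD100, Urban100, and Manga109) and obtained the best or second-best results in SR tasks at different scale factors. Compared with SwinIR on the ×2 SR task, the peak signal-to-noise ratio on these five datasets improved by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB, respectively; structural similarity also improved markedly, and the gain in perceived visual quality is clear. Conclusion The proposed network models the global relationships of feature information more fully without losing local feature relationships. The reconstructed images are of clearly higher quality with richer detail, which demonstrates the effectiveness and advancement of the proposed method.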
Research on super-resolution image reconstruction based on transposed self-attention mechanism
Sun Yang, Ding Jianwei, Zhang Qi, Deng Qiyao (School of Information Network Security, People's Public Security University of China)
Objective Research on super-resolution image reconstruction based on deep learning has made remarkable progress in recent years. In particular, after the development of traditional convolutional neural networks reached a bottleneck, the Transformer, which performs extremely well in natural language processing, was introduced to super-resolution image reconstruction. However, the computational complexity of the Transformer grows with the square of the number of pixels (H×W) in the input image, so it cannot be migrated directly to low-level computer vision tasks. Recent work such as SwinIR achieves very good performance by dividing the image into windows, performing self-attention within each window, and exchanging information between windows. However, the computational burden of this approach increases as the window size increases. Moreover, window partitioning cannot fully model the global information of the image, resulting in partial loss of information. To solve these problems, we model the long-range dependencies of images by constructing a Transformer block while keeping the number of parameters moderate. Excellent super-resolution reconstruction performance is achieved by constructing global dependencies of features. Method The proposed super-resolution reconstruction model based on the transposed self-attention mechanism (SRTSA) consists of four main stages: a shallow feature extraction module, a deep feature extraction module, an image upsampling module, and an image reconstruction module. The shallow feature extraction module consists of a 3×3 convolution. The deep feature extraction module consists mainly of the global and local information extraction block (GLIEB). The proposed GLIEB performs simple relational modeling through a sufficiently lightweight NAFBlock.
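To make the cost argument in the Objective concrete, the following minimal sketch counts attention-map entries for three schemes: full spatial self-attention, window-based self-attention, and channel-wise (transposed) self-attention. The function names, the window sizes, and the channel count of 64 are illustrative assumptions rather than values taken from the paper; only the 48×48 patch size comes from the training setup described in the Method. Multi-head splitting and projection costs are ignored.

```python
def spatial_attention_entries(height: int, width: int) -> int:
    """Entries in a full spatial attention map of shape (H*W) x (H*W)."""
    n = height * width
    return n * n


def window_attention_entries(height: int, width: int, window: int) -> int:
    """Total entries when attention runs inside non-overlapping window x window tiles.

    Assumes height and width are divisible by `window` (hypothetical simplification).
    """
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


def transposed_attention_entries(channels: int) -> int:
    """Entries in a channel-wise attention map of shape C x C (independent of H and W)."""
    return channels * channels


# Illustrative comparison on a 48x48 patch with an assumed 64 channels.
h, w, c = 48, 48, 64
print("full spatial :", spatial_attention_entries(h, w))
print("window = 8   :", window_attention_entries(h, w, 8))
print("window = 16  :", window_attention_entries(h, w, 16))
print("transposed   :", transposed_attention_entries(c))
```

Under these assumptions, the spatial map grows quadratically with the pixel count, the windowed map grows with the window area, and the channel-wise map stays fixed as the image grows.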
Although dropout can improve the robustness of the model, we discard the dropout layer so that no information is lost before the feature information is modeled globally. When modeling feature information globally with the transposed self-attention mechanism, we replace the softmax activation function in self-attention with the ReLU activation function, which keeps the features that contribute positively to image reconstruction and discards those that contribute negatively, making the reconstructed global dependencies more robust. Because an image contains both global and local information, a residual channel attention module is used to supplement the local information and enhance the expressive ability of the model. Finally, a new dual-channel gating mechanism is introduced to control the flow of information in the model, improving both its feature modeling capability and its robustness. The image upsampling module uses sub-pixel convolution to expand the features to the target dimension, and the reconstruction module uses a 3×3 convolution to obtain the final reconstruction result. Although many loss functions have been proposed to optimize model training, we use the same L1 loss function as SwinIR to supervise training in order to demonstrate the advancement and effectiveness of our model; the L1 loss also provides a stable gradient that allows the model to converge quickly. In the training phase, 800 images from the DIV2K dataset are used. To expand the dataset, the 800 training images are randomly rotated or horizontally flipped. In each iteration, 16 LR image patches of size 48×48 are used as input, and the model is trained with the Adam optimizer.
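The transposed self-attention step described in the Method can be sketched in NumPy as follows. This is a single-head illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the projection matrices `w_q`, `w_k`, and `w_v` stand in for the model's learned layers, and the L2 normalisation of queries and keys is an assumption borrowed from common channel-attention designs. What the sketch does take from the text is that attention is computed across channels through a cross-covariance matrix and that ReLU replaces softmax.

```python
import numpy as np


def transposed_relu_attention(x, w_q, w_k, w_v):
    """Channel-wise (transposed) self-attention with ReLU instead of softmax.

    x: feature map of shape (C, H, W).
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical stand-ins
    for learned 1x1 convolutions).
    Returns the attended features (C, H, W) and the (C, C) attention map.
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)  # each channel becomes one token of length H*W

    q = w_q @ tokens
    k = w_k @ tokens
    v = w_v @ tokens

    # Assumption: L2-normalise each channel so the scores stay in [-1, 1].
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)

    attn = q @ k.T                # (C, C) cross-covariance between channels
    attn = np.maximum(attn, 0.0)  # ReLU: negatively correlated channels get zero weight

    out = attn @ v                # every output channel mixes all spatial positions
    return out.reshape(c, h, w), attn


# Usage with random data: the attention map is C x C regardless of H and W.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 6, 6))
weights = [rng.standard_normal((8, 8)) for _ in range(3)]
attended, attention_map = transposed_relu_attention(features, *weights)
print(attended.shape, attention_map.shape)
```

Because the attention map is C×C, its size does not depend on the spatial resolution, which is what allows every pixel to take part in the aggregation without window partitioning.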
Result We test on five datasets commonly used in super-resolution tasks, namely Set5, Set14, Berkeley segmentation dataset (BSD) 100, Urban100, and Manga109, to demonstrate the effectiveness and robustness of the proposed method. We compare with the SRCNN, VDSR, EDSR, RCAN, SAN, HAN, NLSA, and SwinIR networks in terms of objective metrics; these networks are also supervised using only the L1 loss function during training. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are calculated on the Y channel of the YCbCr space of the output image to measure reconstruction quality. A higher PSNR value indicates better reconstruction, and an SSIM value closer to 1 indicates that the SR image is closer to the HR image. The experimental results show that the proposed method obtains the best or second-best PSNR and SSIM values across the different scale factors. In the ×2 super-resolution task, compared with SwinIR, the PSNR is improved by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB on the five datasets, respectively, and the SSIM is improved by 0.0004, 0.0016, 0.0009, and 0.0027 on the four datasets other than Manga109. The visual results show that SRTSA recovers more detail and texture structure than most methods. In addition, attribution analysis of the model using local attribution maps (LAM) shows that SRTSA uses a larger range of pixels in the reconstruction process than methods such as SwinIR, which illustrates the global modeling capability of SRTSA.
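The PSNR evaluation on the Y channel can be sketched as below. This is a minimal illustration, not the authors' evaluation script: the function names are hypothetical, the luma conversion uses the ITU-R BT.601 coefficients commonly adopted in super-resolution benchmarks, and the optional `border` cropping is an assumed convention that the abstract does not specify.

```python
import numpy as np


def rgb_to_y(img):
    """Convert an (H, W, 3) RGB image with values in [0, 255] to BT.601 luma in [16, 235]."""
    img = np.asarray(img, dtype=np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr, hr, border=0):
    """PSNR in dB between the Y channels of a reconstructed (SR) and a reference (HR) image.

    border: pixels to crop on each side before comparison
    (assumed convention; defaults to no cropping).
    """
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if border > 0:
        y_sr = y_sr[border:-border, border:-border]
        y_hr = y_hr[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)


# Usage with synthetic data: a reference image and a slightly perturbed copy.
rng = np.random.default_rng(0)
hr_image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
sr_image = np.clip(hr_image + rng.normal(0.0, 2.0, size=hr_image.shape), 0, 255)
print(psnr_y(sr_image, hr_image))
```

On this scale, the gains reported over SwinIR (0.03 dB to 0.29 dB) correspond to small reductions in mean squared error, which is why they are reported alongside SSIM and visual comparisons.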
Conclusion The proposed super-resolution image reconstruction algorithm based on the transposed self-attention mechanism converts global relationship modeling from the spatial dimension to the channel dimension. It therefore models the global relationships of feature information more fully without losing local feature relationships, and the resulting features contain both global and local information, which effectively improves super-resolution reconstruction performance. The excellent PSNR and SSIM on the five datasets, together with the significantly higher quality of the reconstructed images, which have richer details and sharper edges, demonstrate the effectiveness and advancement of the proposed method.