Hierarchical feature fusion attention network for image super-resolution reconstruction

Lei Pengcheng, Liu Cong, Tang Jiangang, Peng Dunlu (School of Optoelectronic Information and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

Abstract
Objective Deep convolutional neural networks have achieved great success in single-image super-resolution. From the three-layer super-resolution convolutional neural network (SRCNN) to the residual channel attention network (RCAN) with more than 300 layers, the depth and overall performance of such networks have improved significantly. However, although deep networks improve the quality of reconstructed images, their heavy computation and poor real-time performance make them unsuitable for real-world scenarios. To address this problem, this paper proposes a lightweight hierarchical feature fusion spatial attention network that quickly reconstructs the high-frequency details of an image. Method The network consists of a shallow feature extraction layer, a hierarchical feature fusion part, an up-sampling layer, and a reconstruction layer. The shallow feature extraction layer uses one convolution layer to extract shallow features and expand the number of feature channels. The hierarchical feature fusion part performs local and global feature fusion: the whole network contains nine residual attention blocks (RAB), every three of which form a residual attention group, and local and global feature fusion are performed within and between groups, respectively. Inside each residual attention block, convolution layers first extract features, and a spatial attention module then assigns different weights to the different spatial positions of the feature maps, increasing the attention paid to high-frequency regions so that high-frequency details can be recovered quickly. The up-sampling layer uses subpixel convolution to upsample the feature maps to the size of the target image, and the reconstruction layer uses one convolution layer to produce the reconstructed high-resolution image. Result Tests are conducted on the Set5, Set14, BSD (Berkeley segmentation dataset) 100, Urban100, and Manga109 test datasets. With a scale factor of 4, the peak signal-to-noise ratios are 31.98 dB, 28.40 dB, 27.45 dB, 25.77 dB, and 29.37 dB, respectively, a clear improvement over other networks of comparable size. Conclusion By combining the advantages of the spatial attention module and the hierarchical feature fusion structure, the proposed hierarchical feature fusion attention network can quickly recover the high-frequency details of an image with low computational complexity.
Keywords
Hierarchical feature fusion attention network for image super-resolution reconstruction

Lei Pengcheng, Liu Cong, Tang Jiangang, Peng Dunlu(School of Optoelectronic Information and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

Abstract
Objective Single-image super-resolution (SISR) techniques aim to reconstruct a high-resolution image from a single low-resolution image. Because high-resolution images contain substantial useful information, SISR has been widely used in medical imaging, face authentication, public relations, security monitoring, and other tasks. With the rapid development of deep learning, convolutional neural network (CNN)-based SISR methods have achieved remarkable success. From the super-resolution CNN (SRCNN) to the residual channel attention network (RCAN), both the depth and the performance of such networks have improved considerably. However, several problems remain. 1) Increasing the depth of a network effectively improves reconstruction performance, but it also increases the computational complexity of the network and leads to poor real-time performance. 2) An image contains large amounts of high- and low-frequency information, and regions with high-frequency information should be treated as more important than regions with low-frequency information; however, most recent CNN-based methods treat these two kinds of regions equally and thus lack flexibility. 3) Feature maps at different depths carry receptive-field information at different scales, and integrating these feature maps can enhance the information flow between convolution layers, yet most current CNN-based methods consider feature maps at only a single scale. To solve these problems, we propose a lightweight hierarchical feature fusion spatial attention network to learn additional useful high-frequency information. Method The proposed network is composed of four parts, namely, the shallow feature extraction, hierarchical feature fusion, up-sampling, and reconstruction parts. In the shallow feature extraction part, a convolution layer is used to extract shallow features and expand the number of channels.
The hierarchical feature fusion part comprises nine residual attention blocks, evenly divided into three residual attention groups of three blocks each. The feature maps at different depths are fused by local and global feature fusion strategies: the local strategy fuses the feature maps produced by the three residual attention blocks within each group, and the global strategy fuses the feature maps produced by the three groups. Together, these two strategies integrate feature maps at different scales and enhance the information flow between different depths of the network. This study focuses on the residual attention block, which is composed of a residual block module and a spatial attention module. In each residual attention block, two 3×3 convolution layers first extract feature maps, and a spatial attention module then assigns different weights to the different spatial positions of these feature maps. The core problem is how to obtain an appropriate set of weights. According to our analysis, pooling along the channel axis can effectively highlight the importance of regions with high-frequency information. Hence, we first apply average and maximum pooling along the channel axis to generate two representative feature descriptors. Afterward, a 5×5 and a 1×1 convolution layer are used to fuse the information at each position with that of its neighboring positions. The spatial attention value of each position is finally obtained by a sigmoid function. The third part is the up-sampling part, which uses subpixel convolution to upsample the low-resolution (LR) feature maps and obtain large-scale feature maps.
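The channel-wise pooling and sigmoid weighting described above can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the learned 5×5 neighborhood convolution is omitted for brevity, and the 1×1 convolution over the two pooled descriptors is replaced by a placeholder two-element weight vector `w`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w=None, b=0.0):
    """Reweight a feature tensor of shape (C, H, W) by a spatial attention map.

    Average and max pooling along the channel axis yield two (H, W)
    descriptors; a 1x1 convolution over these two channels (here just the
    weighted sum with placeholder weights `w`) fuses them, and a sigmoid
    maps the result to attention weights in (0, 1).
    """
    avg_desc = feat.mean(axis=0)          # (H, W): channel-wise average pooling
    max_desc = feat.max(axis=0)           # (H, W): channel-wise max pooling
    if w is None:
        w = np.array([0.5, 0.5])          # untrained placeholder fusion weights
    fused = w[0] * avg_desc + w[1] * max_desc + b
    att = sigmoid(fused)                  # attention weight per spatial position
    return feat * att                     # broadcast over the channel axis

# toy usage: a 4-channel 8x8 feature map
feat = np.random.randn(4, 8, 8)
out = spatial_attention(feat)
```

Because the sigmoid output lies strictly in (0, 1), every spatial position is attenuated according to its pooled response, so positions with strong (typically high-frequency) activations retain more of their magnitude than flat regions.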
Lastly, in the reconstruction part, the number of channels is compressed to the target number by a 3×3 convolution layer, yielding the reconstructed high-resolution image. During the training stage, the DIVerse 2K (DIV2K) dataset is used to train the proposed network, and 32 000 image patches with a size of 48×48 pixels are obtained as LR inputs by random cropping. The L1 loss is used as the loss function and is optimized with the Adam algorithm. Result We compare our network with several existing methods, including bicubic interpolation, SRCNN, very deep super-resolution convolutional networks (VDSR), deep recursive residual networks (DRRN), residual dense networks (RDN), and RCAN. Five datasets, namely, Set5, Set14, Berkeley segmentation dataset (BSD) 100, Urban100, and Manga109, are used as test sets to evaluate the performance of the proposed method. Two indices, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are used to evaluate the reconstruction results of the proposed method and the compared methods. The average PSNR and SSIM values are reported for the different methods on the five test datasets with different scale factors, and four test images at different scales are used to illustrate the reconstruction results of the different methods. In addition, the proposed method is compared with enhanced deep residual networks (EDSR) in terms of the convergence curve. Experiments show that the proposed method recovers more detailed information and clearer edges than most of the compared methods. Conclusion We propose a hierarchical feature fusion attention network in this study. With the help of the spatial attention module and the hierarchical feature fusion structure, this network can quickly recover high-frequency details and thus produce reconstructions with more detailed textures.
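The PSNR index used in the evaluation above can be computed as follows; this is a minimal sketch assuming 8-bit images (peak value 255) compared over all pixels, whereas super-resolution papers often evaluate PSNR on the luminance (Y) channel only.

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image `ref`
    and a reconstruction `rec`: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")               # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# toy usage: every pixel off by exactly one grey level, so MSE = 1
ref = np.full((32, 32), 100, dtype=np.uint8)
rec = np.full((32, 32), 101, dtype=np.uint8)
value = psnr(ref, rec)                    # equals 20 * log10(255), about 48.13 dB
```

Casting to float64 before subtracting avoids the unsigned-integer wraparound that would otherwise corrupt the squared-error computation.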
Keywords
