发布时间: 2020-09-16
DOI: 10.11834/jig.190607
2020 | Volume 25 | Number 9

图像处理和编码

分层特征融合注意力网络图像超分辨率重建

雷鹏程, 刘丛, 唐坚刚, 彭敦陆

上海理工大学光电信息与计算机工程学院, 上海 200093

收稿日期: 2019-11-22; 修回日期: 2020-03-20; 预印本日期: 2020-03-27

基金项目: 国家自然科学基金项目（61703278，61772342）

第一作者简介: 雷鹏程, 1995年生, 男, 硕士研究生, 主要研究方向为图像超分辨率重建。E-mail:13027510953@163.com;
唐坚刚, 男, 副教授, 主要研究方向为图像处理。E-mail:tangjg@usst.edu.cn;
彭敦陆, 男, 教授, 主要研究方向为大数据处理。E-mail:dunlu_peng@163.com.

中图法分类号: TP391

文献标识码: A

文章编号: 1006-8961(2020)09-1773-14

摘要

目的深层卷积神经网络在单幅图像超分辨率任务中取得了巨大成功。从3个卷积层的超分辨率重建卷积神经网络（super-resolution convolutional neural network，SRCNN）到超过300层的残差注意力网络（residual channel attention network，RCAN），网络的深度和整体性能有了显著提高。然而，尽管深层网络方法提高了重建图像的质量，但因计算量大、实时性差等问题并不适合真实场景。针对该问题，本文提出轻量级的层次特征融合空间注意力网络来快速重建图像的高频细节。方法网络由浅层特征提取层、分层特征融合层、上采样层和重建层组成。浅层特征提取层使用1个卷积层提取浅层特征，并对特征通道进行扩充；分层特征融合层由局部特征融合和全局特征融合组成，整个网络包含9个残差注意力块（residual attention block，RAB），每3个构成一个残差注意力组，分别在组内和组间进行局部特征融合和全局特征融合。在每个残差注意力块内部，首先使用卷积层提取特征，再使用空间注意力模块对特征图的不同空间位置分配不同的权重，提高高频区域特征的注意力，以快速恢复高频细节信息；上采样层使用亚像素卷积对特征图进行上采样，将特征图放大到目标图像的尺寸；重建层使用1个卷积层进行重建，得到重建后的高分辨率图像。结果在Set5、Set14、BSD（Berkeley segmentation dataset）100、Urban100和Manga109测试数据集上进行测试。当放大因子为4时，峰值信噪比分别为31.98 dB、28.40 dB、27.45 dB、25.77 dB和29.37 dB。本文算法比其他同等规模的网络在测试结果上有明显提升。结论本文提出的多层特征融合注意力网络，通过结合空间注意力模块和分层特征融合结构的优势，可以快速恢复图像的高频细节并且具有较小的计算复杂度。

关键词

超分辨率重建; 卷积神经网络; 分层特征融合; 残差学习; 注意力机制

Hierarchical feature fusion attention network for image super-resolution reconstruction

Lei Pengcheng, Liu Cong, Tang Jiangang, Peng Dunlu

School of Optoelectronic Information and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Supported by: National Natural Science Foundation of China (61703278, 61772342)

Abstract

Objective Single-image super-resolution (SISR) aims to reconstruct a high-resolution image from a single low-resolution image. Because high-resolution images carry rich detail, SISR has been widely used in medical imaging, face recognition, public security surveillance, and other tasks. With the rapid development of deep learning, convolutional neural network (CNN)-based methods have achieved remarkable success in SISR. From the super-resolution CNN (SRCNN) to the residual channel attention network (RCAN), both the depth and the performance of networks have improved considerably. However, several problems remain. 1) Increasing the depth of a network improves reconstruction performance, but it also increases computational complexity and leads to poor real-time performance. 2) An image contains both high- and low-frequency information, and areas with high-frequency information should be more important than areas with low-frequency information; most recent CNN-based methods nevertheless treat the two equally and thus lack flexibility. 3) Feature maps at different depths carry receptive-field information at different scales, and integrating them can enhance the information flow between convolution layers; most current CNN-based methods nevertheless consider feature maps at a single scale only. To solve these problems, we propose a lightweight hierarchical feature fusion spatial attention network that learns more useful high-frequency information.
Method The proposed network is composed of four parts: shallow feature extraction, hierarchical feature fusion, up-sampling, and reconstruction. In the shallow feature extraction part, a convolution layer extracts the shallow features and expands the number of channels. The hierarchical feature fusion part comprises nine residual attention blocks, which are divided evenly into three residual attention groups of three blocks each. Feature maps at different depths are fused by local and global feature fusion strategies: local feature fusion fuses the feature maps produced by the three residual attention blocks within each group, and global feature fusion fuses the feature maps produced by the three groups. Together, the two strategies integrate feature maps of different scales and enhance the information flow between different depths of the network. The key component is the residual attention block, which is composed of a residual block and a spatial attention module. In each residual attention block, two 3×3 convolution layers first extract feature maps, and a spatial attention module then assigns different weights to different spatial positions of the feature maps. The core problem is how to obtain an appropriate set of weights. According to our analysis, pooling along the channel axis effectively highlights areas with high-frequency information. Hence, we first apply average and maximum pooling along the channel axis to generate two feature descriptors. A 5×5 and a 1×1 convolution layer then fuse the information at each position with that of its neighboring positions, and a sigmoid function finally gives the spatial attention value of each position. The up-sampling part uses sub-pixel convolution to upsample the low-resolution (LR) feature maps into large-scale feature maps. Lastly, in the reconstruction part, a 3×3 convolution layer compresses the number of channels to the target number, giving the reconstructed high-resolution image. For training, the DIVerse 2K (DIV2K) dataset is used, and 32 000 image patches of 48×48 pixels are obtained as LR images by random cropping. L1 loss is used as the loss function and is optimized with the Adam algorithm.
Result We compare our network with several existing methods: bicubic interpolation, SRCNN, very deep super-resolution convolutional networks (VDSR), deep recursive residual networks (DRRN), residual dense networks (RDN), and RCAN. Five datasets, Set5, Set14, Berkeley segmentation dataset (BSD) 100, Urban100, and Manga109, are used as test sets. Two indices, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are used to evaluate the reconstruction results of the proposed method and the compared methods, and average PSNR and SSIM values are reported on the five test sets for different scale factors. Four test images at different scales illustrate the reconstruction results of the different methods. In addition, the convergence curve of the proposed method is compared with that of enhanced deep residual networks (EDSR). Experiments show that the proposed method recovers more detail and clearer edges than most of the compared methods.
Conclusion We propose a hierarchical feature fusion attention network. With the help of the spatial attention module and the hierarchical feature fusion structure, the network quickly recovers high-frequency details and produces reconstructions with more detailed texture.

Key words

super-resolution reconstruction; convolution neural network (CNN); hierarchical feature fusion; residual learning; attention mechanism

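0 Introduction

With the rapid development of computer vision and image processing, high-resolution images have attracted wide attention because they carry rich detail and texture. How to obtain a high-resolution image quickly and accurately is an active research topic in image processing. Single image super-resolution (SISR) reconstructs a high-resolution image from a single low-resolution image and has been applied successfully in medical imaging, face recognition, and public security surveillance (Ying and Long, 2019).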

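Image super-resolution (SR) is an ill-posed inverse problem: it predicts many unknown pixels from a small number of known pixels, so the solution is not unique. Existing approaches fall into three categories: interpolation-based, reconstruction-based, and learning-based methods. Interpolation-based methods predict unknown pixel values from neighboring pixels; common examples are nearest-neighbor and bicubic interpolation. They are simple and work well in smooth regions, but perform poorly on images with complex regions such as rich textures and edges. Reconstruction-based methods use prior information within the image to recover the high-frequency information lost in the degraded image; the iterative back-projection algorithm of Yang et al. (2015) belongs to this category. Like interpolation, these methods cannot handle images with complex structures. Learning-based methods learn the mapping from low-resolution (LR) images to high-resolution (HR) images from a large number of training samples and use it to recover the HR image. Typical examples include algorithms based on neighborhood embedding (Fang et al., 2017), sparse coding (Yang et al., 2012), and convolutional neural networks (CNN). Owing to their strong learning ability, CNN-based SR algorithms have achieved remarkable success.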

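Dong et al. (2014) proposed the super-resolution convolutional neural network (SRCNN), which upsamples the LR image by bicubic interpolation and then uses a CNN to learn the mapping between LR and HR images. It improved reconstruction accuracy considerably over traditional non-deep-learning algorithms. Dong et al. (2016) then improved SRCNN with the fast super-resolution convolutional neural network (FSRCNN), which speeds up training: small (3×3) convolution kernels first extract features from the LR image, and a deconvolution then upsamples the feature maps to obtain the HR image. Because feature extraction operates on small feature maps, the computational cost is greatly reduced. Shi et al. (2016) proposed the efficient sub-pixel convolutional neural network (ESPCN), which also extracts features on small feature maps but, unlike FSRCNN, replaces deconvolution with sub-pixel convolution for upsampling and obtains better reconstruction results. Many current algorithms upsample in this way.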

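These algorithms mainly use shallow networks to learn the mapping between LR and HR images. Increasing depth improves a network's representational power, but as layers are added, problems such as vanishing and exploding gradients make the network hard to train. To address this, He et al. (2016) proposed the residual network (ResNet), whose core idea is to add the output features of each convolution block to its input features and feed the sum to the next layer; the resulting structure is called a residual unit. By stacking many residual units, ResNet reaches great depth while still converging. Because deep networks have stronger feature extraction and representation ability, ResNet achieved great success in image recognition. Inspired by ResNet, researchers began to use residual learning to build deeper SR networks. Kim et al. (2016a) proposed the very deep super-resolution convolutional network (VDSR), which learns the global residual between the LR and HR images instead of the full HR image to accelerate convergence, with good results. Building on VDSR, Tai et al. (2017) designed the deep recursive residual network (DRRN) to counter vanishing gradients in deep networks; recursive learning shares parameters, which reduces the parameter count, and its reconstruction results improve further on VDSR. SR networks then continued to deepen: the enhanced deep residual network (EDSR) stacks a large number of residual blocks to reach a depth of nearly 70 layers and won the 2017 super-resolution challenge (Lim et al., 2017). Beyond depth, the importance of different channels has also been studied. Zhang et al. (2018a) argued that most CNN-based methods treat the features of different channels equally and therefore lack flexibility when handling different types of information. To better distinguish channel information, they proposed the residual channel attention network (RCAN), which adaptively learns the importance of each channel and chains many residual blocks inside a residual-in-residual (RIR) structure, so that a network of more than 300 layers still converges, achieving the best reconstruction results at the time.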

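Although CNN-based reconstruction algorithms perform well, several problems remain: 1) HR reconstruction quality is usually improved by deepening the network, which greatly increases the number of parameters and the amount of computation and leads to poor real-time performance at inference; 2) images contain both high-frequency and low-frequency information, and the high-frequency information is usually more valuable, yet existing networks treat the two equally; 3) feature maps at different depths carry receptive-field information at different scales, and fusing them lets the network attend to both local and global features and thus recover more high-frequency detail, yet most current networks consider only a single scale.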

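To solve these problems, this paper proposes a lightweight hierarchical feature fusion attention network. The body of the network consists of a series of residual attention blocks (RAB). Inside each RAB, two convolution layers first extract features, and a spatial attention module then strengthens attention on regions carrying high-frequency information so that the network learns more high-frequency detail. To fuse feature maps from different depths, a hierarchical mechanism fuses the features extracted by the RABs. Finally, sub-pixel convolution upsamples the feature maps to obtain the reconstructed image.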

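The main contributions are: 1) a spatial attention mechanism that automatically assigns more attention to image regions carrying high-frequency information, so that the network learns more high-frequency detail; 2) a hierarchical feature fusion framework that fuses feature maps of different depths through local and global fusion structures, strengthening the information flow between features of different depths and supplying more detail for reconstruction; 3) a lightweight SR network that, with the spatial attention and hierarchical feature fusion mechanisms, outperforms SR networks of the same scale.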

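1 Related work

1.1 Attention mechanism

The attention mechanism considered here was introduced into convolutional neural networks by SENet (squeeze-and-excitation network) (Hu et al., 2019). With it, a network automatically learns the importance of each feature channel and applies that importance to the channels, emphasizing useful channels and suppressing less useful ones. SENet greatly improved the efficiency of feature extraction and won the ImageNet 2017 image classification task.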

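Inspired by SENet, Zhang et al. (2018a) designed the channel attention network RCAN and applied it to image super-resolution. Fig. 1 shows the channel attention module. The input feature map has size $H \times W \times C$, where $H$, $W$, and $C$ are the height, width, and number of channels. First, global average pooling ${\mathit{\boldsymbol{H}}_{{\rm{GP}}}}$ is applied to each channel and the pooled value serves as that channel's descriptor, giving a $C$-dimensional vector. Second, a two-layer perceptron fuses information across channels, where ${\mathit{\boldsymbol{W}}_{\rm{D}}}$ reduces the number of channels and ${\mathit{\boldsymbol{W}}_{\rm{U}}}$ expands it again, both by the ratio $r$; this yields a new $C$-dimensional vector. A sigmoid activation $f$ is then applied to this $C$-dimensional vector to obtain the weight of each channel. Finally, the weight vector is multiplied with the input features to obtain the feature map with channel attention.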
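The channel attention steps described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `channel_attention`, the toy weights, and the omission of biases are assumptions made here, and the ReLU between the two perceptron layers follows the usual RCAN formulation rather than anything stated in this section.

```python
import math

def channel_attention(feat, w_down, w_up):
    """feat: C channels, each an H x W nested list.
    w_down: (C/r) x C weights, w_up: C x (C/r) weights (biases omitted, hypothetical)."""
    # Global average pooling: one descriptor per channel.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Channel reduction W_D followed by ReLU.
    hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w_down]
    # Channel expansion W_U followed by sigmoid: one weight per channel.
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w_up]
    # Rescale every channel by its weight.
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feat)]
    return weights, scaled

feat = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0, mean 1
        [[4.0, 0.0], [0.0, 4.0]]]   # channel 1, mean 2
w_down = [[-1.0, 1.0]]              # C = 2, r = 2 -> one hidden unit
w_up = [[2.0], [-2.0]]
weights, scaled = channel_attention(feat, w_down, w_up)
```

With these toy weights the first channel is emphasized and the second suppressed; in a trained network the weights are learned.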

图 1 RCAN网络中通道注意力模块结构图(Zhang等, 2018a)

Fig. 1 Architecture of channel attention module in RCAN (Zhang et al., 2018a)

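The success of RCAN (Zhang et al., 2018a) shows that attention mechanisms are feasible for super-resolution. With the channel attention module, the network automatically adjusts the importance of different channels and uses the extracted features more efficiently. However, RCAN's attention is limited to the channel dimension; in the spatial dimension, it does not highlight positions that carry high-frequency information.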

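1.2 Feature fusion mechanism

Features at different depths of a network usually have receptive fields of different scales: shallow features have small receptive fields and deep features have large ones. Fusing features of different depths strengthens the information flow between layers and provides more detail for vision tasks. For example, the classic feature pyramid network (FPN) fuses feature maps of different depths by connecting, top-down, high-level features with low resolution and strong semantics to low-level features with high resolution and weak semantics, so that features at every scale carry rich semantic information; it is therefore widely used in object detection and image segmentation (Lin et al., 2017). The network is shown in Fig. 2.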

图 2 特征金字塔网络(Lin等，2017)

Fig. 2 Feature pyramid network (Lin et al., 2017)

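Few feature fusion designs exist for image super-resolution. The classic SRCNN (Dong et al., 2014), ESPCN (Shi et al., 2016), VDSR (Kim et al., 2016a), and EDSR (Lim et al., 2017) networks consider only single-scale features. Some recent algorithms have begun to design feature fusion structures, mainly local feature fusion and global feature fusion.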

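A representative local feature fusion algorithm is the residual dense network (RDN) (Zhang et al., 2018b), which introduces the residual dense block (RDB). Within each RDB, the input of every layer is concatenated with the outputs of all preceding layers, and 3×3 convolutions fuse the features of different levels. At the end of the dense block, a 1×1 convolution fuses the output features of all the preceding layers. The RDB structure is shown in Fig. 3. By fusing the input and output features of each layer inside the RDB, information flow and feature reuse between layers are strengthened, providing more detail for reconstruction. However, such layer-by-layer fusion greatly increases the computation and is unsuitable for a lightweight network.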

图 3 密集残差块(Zhang等，2018b)

Fig. 3 Residual dense block(Zhang et al., 2018b)

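The multi-scale residual network (MSRN) uses a global feature fusion structure (Li et al., 2018), which fuses the outputs of residual blocks at different depths, as shown in Fig. 4, where ${\mathit{\boldsymbol{M}}_n}$ is the output of the $n$-th residual block, ${\mathit{\boldsymbol{I}}_{{\rm{LR}}}}$ is the input low-resolution image, and ${\mathit{\boldsymbol{I}}_{{\rm{HR'}}}}$ is the reconstructed high-resolution image. Global feature fusion requires less computation than local feature fusion and fuses features of different depths from a global perspective to improve the quality of the reconstructed image.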

图 4 全局特征融合(Li等，2018)

Fig. 4 Global feature fusion (Li et al., 2018)

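2 Proposed method

Inspired by attention and feature fusion mechanisms, this paper proposes a lightweight hierarchical feature fusion attention network for image super-resolution reconstruction. The algorithm has four parts: 1) shallow feature extraction; 2) hierarchical feature fusion; 3) upsampling; 4) image reconstruction. The overall structure is shown in Fig. 5.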

图 5 层次特征融合空间注意力网络

Fig. 5 Hierarchical feature fusion spatial attention network

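2.1 Shallow feature extraction

One convolution layer extracts the shallow feature ${\mathit{\boldsymbol{F}}_{\rm{s}}}$ of the LR image and expands the number of feature channels: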

$ {\mathit{\boldsymbol{F}}_{\rm{s}}} = \sigma (\mathit{\boldsymbol{W}}_{\rm{s}}^{3 \times 3} \times {\mathit{\boldsymbol{I}}_{{\rm{LR}}}} + {\mathit{\boldsymbol{b}}_{\rm{s}}}) $

(1)

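where ${\mathit{\boldsymbol{I}}_{{\rm{LR}}}}$ is the input low-resolution image, $\mathit{\boldsymbol{W}}_{\rm{s}}^{3 \times 3}$ is the convolution kernel used for shallow feature extraction (here and below, $\times$ between a kernel and a feature denotes convolution), ${\mathit{\boldsymbol{b}}_{\rm{s}}}$ is the bias, and $\sigma \left(\cdot \right)$ is the ReLU activation function, defined as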

$ \sigma (\mathit{\boldsymbol{x}}) = {\rm{max}}(0,\mathit{\boldsymbol{x}}) $

(2)

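2.2 Hierarchical feature fusion

Hierarchical feature fusion is the core of the proposed method. Because feature maps at different depths carry receptive-field information at different scales, fusing multi-layer feature maps improves reconstruction but also increases time complexity. This paper therefore proposes the hierarchical feature fusion structure shown in Fig. 5. At the cost of a small amount of extra computation, it strengthens information flow and feature reuse between layers so that the network can reconstruct more detail.

As shown in Fig. 5, the hierarchical feature fusion (HFF) structure comprises global feature fusion (GFF) and local feature fusion (LFF). The whole network contains nine residual attention blocks (RAB), which are divided evenly into three residual attention block groups (RABG), denoted RABG1, RABG2, and RABG3. Each RABG contains three RABs, denoted RAB1, RAB2, and RAB3.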

2.2.1 Global feature fusion

Global feature fusion (GFF) fuses the global features extracted by the three residual attention groups. The three groups operate as

$ {{\mathit{\boldsymbol{F}}_{{\rm{G1}}}} = R{G_1}({\mathit{\boldsymbol{F}}_{\rm{s}}})} $

(3)

$ {{\mathit{\boldsymbol{F}}_{{\rm{G2}}}} = R{G_2}({\mathit{\boldsymbol{F}}_{{\rm{G1}}}})} $

(4)

$ {{\mathit{\boldsymbol{F}}_{{\rm{G3}}}} = R{G_3}({\mathit{\boldsymbol{F}}_{{\rm{G2}}}})} $

(5)

where $R{G_1}\left(\cdot \right)$, $R{G_2}\left(\cdot \right)$, and $R{G_3}\left(\cdot \right)$ denote the feature extraction performed by the three residual attention groups. On the right-hand side, ${\mathit{\boldsymbol{F}}_{\rm{s}}}$, ${\mathit{\boldsymbol{F}}_{{\rm{G}}1}}$, and ${\mathit{\boldsymbol{F}}_{{\rm{G}}2}}$ are the three input features; on the left-hand side, ${\mathit{\boldsymbol{F}}_{{\rm{G}}1}}$, ${\mathit{\boldsymbol{F}}_{{\rm{G}}2}}$, and ${\mathit{\boldsymbol{F}}_{{\rm{G}}3}}$ are the three output features. A 1×1 convolution then fuses the initial input feature ${{\mathit{\boldsymbol{F}}_{\rm{s}}}}$ with the three output features ${\mathit{\boldsymbol{F}}_{{\rm{G}}1}}$, ${\mathit{\boldsymbol{F}}_{{\rm{G}}2}}$, and ${\mathit{\boldsymbol{F}}_{{\rm{G}}3}}$:

$ {\mathit{\boldsymbol{F}}_{{\rm{GFF}}}} = \sigma (\mathit{\boldsymbol{W}}_{{\rm{GFF}}}^{1 \times 1} \times [{\mathit{\boldsymbol{F}}_{\rm{s}}};{\mathit{\boldsymbol{F}}_{{\rm{G1}}}};{\mathit{\boldsymbol{F}}_{{\rm{G2}}}};{\mathit{\boldsymbol{F}}_{{\rm{G3}}}}] + {\mathit{\boldsymbol{b}}_{{\rm{GFF}}}}) $

(6)

where $\mathit{\boldsymbol{W}}_{{\rm{GFF}}}^{{\rm{1 \times 1}}}$ and ${\mathit{\boldsymbol{b}}_{{\rm{GFF}}}}$ are the 1×1 convolution kernel and bias in GFF, and $\left[ {{\mathit{\boldsymbol{F}}_{\rm{s}}};{\mathit{\boldsymbol{F}}_{{\rm{G}}1}};{\mathit{\boldsymbol{F}}_{{\rm{G}}2}};{\mathit{\boldsymbol{F}}_{{\rm{G}}3}}} \right]$ denotes the concatenation of the four features ${\mathit{\boldsymbol{F}}_{\rm{s}}}$, ${\mathit{\boldsymbol{F}}_{{\rm{G}}1}}$, ${\mathit{\boldsymbol{F}}_{{\rm{G}}2}}$, and ${\mathit{\boldsymbol{F}}_{{\rm{G}}3}}$.
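To make the fusion in Eq. (6) concrete, the following minimal sketch concatenates feature maps along the channel axis and applies a 1×1 convolution followed by ReLU. The helper name `conv1x1_relu` and the toy weights are hypothetical; in the network, each feature would have 64 channels and the weights would be learned.

```python
def conv1x1_relu(feature_groups, weight, bias):
    """Concatenate feature maps along the channel axis, then apply 1x1 conv + ReLU.

    feature_groups: list of features, each a list of channels (H x W nested lists).
    weight: out_channels x total_in_channels; bias: out_channels."""
    # [F_s; F_G1; F_G2; F_G3]: channel-wise concatenation.
    channels = [ch for group in feature_groups for ch in group]
    height, width = len(channels[0]), len(channels[0][0])
    out = []
    for w_row, b in zip(weight, bias):
        # A 1x1 convolution mixes the channels independently at every pixel.
        out.append([[max(0.0, b + sum(w * ch[y][x] for w, ch in zip(w_row, channels)))
                     for x in range(width)] for y in range(height)])
    return out

# Toy example: four single-channel 1x2 features fused into one channel.
f_s, f_g1 = [[[1.0, 2.0]]], [[[0.5, 0.5]]]
f_g2, f_g3 = [[[-1.0, 3.0]]], [[[2.0, -4.0]]]
fused = conv1x1_relu([f_s, f_g1, f_g2, f_g3],
                     weight=[[0.25, 0.25, 0.25, 0.25]], bias=[0.0])
```

Because the kernel is 1×1, the cost of the fusion grows only with the number of concatenated channels, which is consistent with the small overhead the text attributes to this structure.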

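2.2.2 Local feature fusion

Local feature fusion (LFF) fuses the local features of the three residual attention blocks RAB1, RAB2, and RAB3 within each residual attention group. Taking the first group, RABG1, as an example, its three residual attention blocks operate as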

$ {{\mathit{\boldsymbol{F}}_{{\rm{B1}}}} = R{L_1}({\mathit{\boldsymbol{F}}_{\rm{s}}})} $

(7)

$ {{\mathit{\boldsymbol{F}}_{{\rm{B2}}}} = R{L_2}({\mathit{\boldsymbol{F}}_{{\rm{B1}}}})} $

(8)

$ {{\mathit{\boldsymbol{F}}_{{\rm{B3}}}} = R{L_3}({\mathit{\boldsymbol{F}}_{{\rm{B2}}}})} $

(9)

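where $R{L_1}\left(\cdot \right)$, $R{L_2}\left(\cdot \right)$, and $R{L_3}\left(\cdot \right)$ denote the local feature extraction performed by the three residual attention blocks. After the three blocks, a 1×1 convolution fuses the four features, including the input feature: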

$ {\mathit{\boldsymbol{F}}_{{\rm{LFF}}}} = \sigma (\mathit{\boldsymbol{W}}_{{\rm{LFF}}}^{1 \times 1} \times [{\mathit{\boldsymbol{F}}_{\rm{s}}};{\mathit{\boldsymbol{F}}_{{\rm{B1}}}};{\mathit{\boldsymbol{F}}_{{\rm{B2}}}};{\mathit{\boldsymbol{F}}_{{\rm{B3}}}}] + {\mathit{\boldsymbol{b}}_{{\rm{LFF}}}}) $

(10)

where $\mathit{\boldsymbol{W}}_{{\rm{LFF}}}^{{\rm{1 \times 1}}}$ and ${\mathit{\boldsymbol{b}}_{{\rm{LFF}}}}$ are the 1×1 convolution kernel and bias in LFF, and $\left[ {{\mathit{\boldsymbol{F}}_{\rm{s}}};{\mathit{\boldsymbol{F}}_{{\rm{B}}1}};{\mathit{\boldsymbol{F}}_{{\rm{B}}2}};{\mathit{\boldsymbol{F}}_{{\rm{B}}3}}} \right]$ denotes the concatenation of ${{\mathit{\boldsymbol{F}}_{\rm{s}}}}$, ${{\mathit{\boldsymbol{F}}_{{\rm{B}}1}}}$, ${{\mathit{\boldsymbol{F}}_{{\rm{B}}2}}}$, and ${{\mathit{\boldsymbol{F}}_{{\rm{B}}3}}}$.
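The wiring of the nine blocks, three local fusions, and one global fusion can be traced with stand-in functions. The sketch below does no real computation; it only records the order of operations. It assumes that the output of each group is the output of its local fusion, which the text does not state explicitly, and all function names are hypothetical.

```python
def make_rab(name, trace):
    """Stand-in for a residual attention block: logs its name, passes the feature on."""
    def rab(x):
        trace.append(name)
        return x + [name]   # a 'feature' here is just the list of blocks it went through
    return rab

def fuse(features, label, trace):
    """Stand-in for concatenation followed by a 1x1 convolution."""
    trace.append(f"{label}({len(features)} inputs)")
    return [label]

def group(x, gi, trace):
    feats = [x]                                  # the group input is fused as well
    for bi in range(1, 4):
        x = make_rab(f"RABG{gi}.RAB{bi}", trace)(x)
        feats.append(x)
    return fuse(feats, f"LFF{gi}", trace)        # assumption: group output = LFF output

def hff(f_s, trace):
    feats, x = [f_s], f_s
    for gi in range(1, 4):
        x = group(x, gi, trace)
        feats.append(x)
    return fuse(feats, "GFF", trace)             # fuses F_s, F_G1, F_G2, F_G3

trace = []
out = hff(["F_s"], trace)
```

Printing `trace` lists the thirteen steps in order: three blocks and one local fusion per group, followed by the global fusion over four inputs.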

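2.2.3 Residual attention block

The core idea of the residual attention block (RAB) is to add an attention mechanism to the residual block so that different spatial attention is learned for different positions. The residual block (RB) follows the structure used in EDSR; the RB and RAB structures are shown in Fig. 6. The RAB combines the RB with a spatial attention (SA) block and proceeds as follows. For the input feature ${\mathit{\boldsymbol{F}}_{{\rm{in}}}}$ of the current RAB, the first convolution layer extracts features and an activation function is applied to obtain the activated feature ${{\mathit{\boldsymbol{F}}_1}}$; the second convolution layer then extracts features to obtain ${{\mathit{\boldsymbol{F}}_2}}$: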

$ {{\mathit{\boldsymbol{F}}_1} = \sigma (\mathit{\boldsymbol{W}}_1^{3 \times 3} \times {\mathit{\boldsymbol{F}}_{{\rm{in}}}} + {\mathit{\boldsymbol{b}}_1})} $

(11)

$ {{\mathit{\boldsymbol{F}}_2} = \mathit{\boldsymbol{W}}_2^{3 \times 3} \times {\mathit{\boldsymbol{F}}_1} + {\mathit{\boldsymbol{b}}_2}} $

(12)

图 6 残差块以及注意力残差块

Fig. 6 Residual block and residual attention block

((a) residual block; (b) residual attention block)

where $\mathit{\boldsymbol{W}}_1^{{\rm{3 \times 3}}}$, $\mathit{\boldsymbol{W}}_2^{{\rm{3 \times 3}}}$ and ${\mathit{\boldsymbol{b}}_1}$, ${\mathit{\boldsymbol{b}}_2}$ are the 3×3 convolution kernels and biases of the first and second convolution layers.

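Next, the attention module computes the spatial attention values ${\mathit{\boldsymbol{A}}_{{\rm{sa}}}}$ of the feature ${{\mathit{\boldsymbol{F}}_2}}$, which are used to weight the different spatial positions of ${{\mathit{\boldsymbol{F}}_2}}$, giving ${\mathit{\boldsymbol{F}}_{{\rm{sa}}}}$: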

$ {\mathit{\boldsymbol{A}}_{{\rm{sa}}}} = S({\mathit{\boldsymbol{F}}_2}) $

(13)

$ {\mathit{\boldsymbol{F}}_{{\rm{sa}}}} = {\mathit{\boldsymbol{A}}_{{\rm{sa}}}} \otimes {\mathit{\boldsymbol{F}}_2} $

(14)

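where $S$ (·) denotes the spatial attention (SA) module, which computes the spatial attention values of the input feature (see Section 2.2.4), and $ \otimes $ denotes element-wise multiplication. Because the two operands differ in size, ${\mathit{\boldsymbol{A}}_{{\rm{sa}}}}$ is broadcast before the operation so that it has the same size as the input feature ${{\mathit{\boldsymbol{F}}_2}}$. ${\mathit{\boldsymbol{F}}_{{\rm{sa}}}}$ is the feature map weighted by spatial attention.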

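In addition, a short skip connection is added to the attention block; it passes the features of the previous layer more smoothly to the next layer and thus helps predict dense pixel values. Finally, local residual learning gives the output feature ${\mathit{\boldsymbol{F}}_{{\rm{out}}}}$ of the RAB: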

$ {\mathit{\boldsymbol{F}}_{{\rm{ out }}}} = {\mathit{\boldsymbol{F}}_{{\rm{sa}}}} + {\mathit{\boldsymbol{F}}_2} + {\mathit{\boldsymbol{F}}_{{\rm{in}}}} $

(15)
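Eqs. (14) and (15) can be checked numerically with a small sketch. It assumes a single-channel $H \times W$ attention map that is broadcast over the channels of ${\mathit{\boldsymbol{F}}_2}$; the function name `rab_output` and the toy values are illustrative only.

```python
def rab_output(f_in, f_2, a_sa):
    """F_sa = A_sa (broadcast over channels) * F_2;  F_out = F_sa + F_2 + F_in.

    f_in, f_2: C x H x W nested lists; a_sa: H x W attention map with values in (0, 1)."""
    out = []
    for ch_in, ch_2 in zip(f_in, f_2):
        # The same attention map a_sa is reused for every channel (broadcasting).
        out.append([[a * v2 + v2 + vin
                     for a, v2, vin in zip(a_row, row_2, row_in)]
                    for a_row, row_2, row_in in zip(a_sa, ch_2, ch_in)])
    return out

f_in = [[[1.0, 1.0]], [[2.0, 2.0]]]    # two channels, 1 x 2 pixels
f_2 = [[[4.0, 4.0]], [[-2.0, -2.0]]]
a_sa = [[0.5, 0.0]]                    # attention 0.5 at the left pixel, 0 at the right
f_out = rab_output(f_in, f_2, a_sa)
```

Where the attention is 0, the block reduces to the plain residual block output ${\mathit{\boldsymbol{F}}_2} + {\mathit{\boldsymbol{F}}_{{\rm{in}}}}$; larger attention values add a further scaled copy of ${\mathit{\boldsymbol{F}}_2}$.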

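2.2.4 Spatial attention module

A low-resolution image contains a large amount of low-frequency information and a small amount of valuable high-frequency information. Low-frequency information generally lies in smooth regions, which are easy to reconstruct, whereas high-frequency information usually lies in edges and textures, which are harder to reconstruct. Existing deep networks usually assign the same weight to both kinds of region, which tends to weaken the importance of high-frequency information. This paper therefore designs a model that increases attention on high-frequency regions to obtain better reconstruction. The following analyzes how to assign different attention to different spatial positions.

In a residual image, the residual in low-frequency regions tends toward 0, whereas the residual in high-frequency regions is somewhat larger. On this basis, this paper argues that pooling a feature map along the channel axis effectively highlights regions carrying high-frequency information. The algorithm therefore first computes average pooling and max pooling along the channel axis and uses the two sets of pooled values as descriptors of the different positions. Convolution then fuses the value at each position of the descriptors with the values at neighboring positions. Finally, a sigmoid function produces the attention value of each spatial position. Fig. 7 shows the flowchart of the spatial attention module; the computation is detailed below.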

图 7 空间注意力模块示意图

Fig. 7 Overview of spatial attention module

The input feature map of the spatial attention module is ${\mathit{\boldsymbol{F}}_2} = \left[ {\mathit{\boldsymbol{F}}_2^1, \cdots, \mathit{\boldsymbol{F}}_2^C} \right]$, where $C$ is the number of channels. The $c$-th channel is ${\mathit{\boldsymbol{F}}_2^c}$, with $c \in \left\{ {1, \cdots, C} \right\}$. It contains $H \times W$ spatial positions, and the feature at each position is $\mathit{\boldsymbol{F}}_2^c\left({h, w} \right)$, where $h \in \left\{ {1, \cdots, H} \right\}, w \in \left\{ {1, \cdots, W} \right\}$.

1) Average pooling and max pooling. Average pooling and max pooling of ${\mathit{\boldsymbol{F}}_2}$ along the channel axis yield two spatial descriptors, $\mathit{\boldsymbol{F}}_{{\rm{Avg}}}^{\rm{s}}$ and $\mathit{\boldsymbol{F}}_{{\rm{Max}}}^{\rm{s}}$, which hold the two pooled values at every position; $\mathit{\boldsymbol{F}}_{{\rm{Avg}}}^{\rm{s}}\left({h, w} \right)$ and $\mathit{\boldsymbol{F}}_{{\rm{Max}}}^{\rm{s}}\left( {h,w} \right)$ are the average- and max-pooled values at position $\left({h, w} \right)$:

$ {F_{{\rm{Avg}}}^{\rm{s}}(h,w) = \frac{1}{C}\sum\limits_{c = 1}^C {F_2^c(h,w)}} $

(16)

$ {F_{{\rm{ Max }}}^{\rm{s}}(h,w) = \mathop {{\rm{max}}}\limits_{c = \{ 1, \cdots ,C\} } F_2^c(h,w)} $

(17)

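2) Concatenation and fusion. The two descriptors are first concatenated. A 5×5 convolution then fuses the value at each position of the descriptors with the values at neighboring positions: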

$ {\mathit{\boldsymbol{A}}_{\rm{d}}} = \sigma (\mathit{\boldsymbol{W}}_{{\rm{sa}}}^{5 \times 5} \times [\mathit{\boldsymbol{F}}_{{\rm{Avg}}}^{\rm{s}};\mathit{\boldsymbol{F}}_{{\rm{Max}}}^{\rm{s}}] + \mathit{\boldsymbol{b}}_{{\rm{sa}}}^1) $

(18)

where $\mathit{\boldsymbol{W}}_{{\rm{sa}}}^{5 \times 5}$ and $\mathit{\boldsymbol{b}}_{{\rm{sa}}}^1$ are the 5×5 convolution kernel and bias used for the fusion, $\left[ {\mathit{\boldsymbol{F}}_{{\rm{Avg}}}^{\rm{s}};\mathit{\boldsymbol{F}}_{{\rm{Max}}}^{\rm{s}}} \right]$ denotes the concatenation of ${\mathit{\boldsymbol{F}}_{{\rm{Avg}}}^{\rm{s}}}$ and $\mathit{\boldsymbol{F}}_{{\rm{Max}}}^{\rm{s}}$, and ${\mathit{\boldsymbol{A}}_{\rm{d}}}$ is the fused attention map, which has two channels.

3) Attention values. A 1×1 convolution first compresses the two fused descriptor channels into one channel. A sigmoid activation then gives the spatial attention values ${\mathit{\boldsymbol{A}}_{{\rm{sa}}}}$:

$ {\mathit{\boldsymbol{A}}_{{\rm{sa}}}} = f(\mathit{\boldsymbol{W}}_{{\rm{sa}}}^{1 \times 1} \times {\mathit{\boldsymbol{A}}_{\rm{d}}} + \mathit{\boldsymbol{b}}_{{\rm{sa}}}^2) $

(19)

where $\mathit{\boldsymbol{W}}_{{\rm{sa}}}^{{\rm{1 \times 1}}}$ and $\mathit{\boldsymbol{b}}_{{\rm{sa}}}^2$ are the 1×1 convolution kernel and bias, and $f\left(\cdot \right)$ is the sigmoid activation function:

$ f(x) = \frac{1}{{1 + {{\rm{e}}^{ - x}}}} $

(20)
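The three steps of the spatial attention module, Eqs. (16) to (20), can be put together as a minimal sketch in plain Python. The helper `conv2d` is a naive zero-padded convolution written for this illustration, and the kernels below are fixed, hypothetical values (identity 5×5 kernels and a summing 1×1 kernel) standing in for learned weights.

```python
import math

def conv2d(channels, weight, bias):
    """Zero-padded 'same' convolution. channels: C_in x H x W;
    weight: C_out x C_in x k x k (k odd); bias: C_out."""
    h, w, k = len(channels[0]), len(channels[0][0]), len(weight[0][0])
    p = k // 2
    out = []
    for w_out, b in zip(weight, bias):
        plane = [[b for _ in range(w)] for _ in range(h)]
        for ch, kern in zip(channels, w_out):
            for y in range(h):
                for x in range(w):
                    for i in range(k):
                        for j in range(k):
                            yy, xx = y + i - p, x + j - p
                            if 0 <= yy < h and 0 <= xx < w:
                                plane[y][x] += kern[i][j] * ch[yy][xx]
        out.append(plane)
    return out

def spatial_attention(f2, w5, b5, w1, b1):
    c, h, w = len(f2), len(f2[0]), len(f2[0][0])
    avg = [[sum(ch[y][x] for ch in f2) / c for x in range(w)] for y in range(h)]  # Eq. (16)
    mx = [[max(ch[y][x] for ch in f2) for x in range(w)] for y in range(h)]       # Eq. (17)
    a_d = [[[max(0.0, v) for v in row] for row in plane]                          # Eq. (18)
           for plane in conv2d([avg, mx], w5, b5)]
    logits = conv2d(a_d, w1, b1)[0]                                               # Eq. (19)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]          # Eq. (20)

# Hypothetical fixed kernels: the 5x5 kernels copy each descriptor, the 1x1 kernel sums them.
delta = [[1.0 if (i, j) == (2, 2) else 0.0 for j in range(5)] for i in range(5)]
zero = [[0.0] * 5 for _ in range(5)]
w5, b5 = [[delta, zero], [zero, delta]], [0.0, 0.0]
w1, b1 = [[[[1.0]], [[1.0]]]], [0.0]

# Two 3x3 channels that respond only at the center, mimicking a high-frequency position.
f2 = [[[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]],
      [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]]]
att = spatial_attention(f2, w5, b5, w1, b1)
```

In this toy case the center position, where both channels respond strongly, receives an attention value close to 1, while the flat surroundings stay at 0.5.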

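2.3 Upsampling

Sub-pixel convolution upsamples the globally fused feature ${\mathit{\boldsymbol{F}}_{{\rm{GFF}}}}$. Before upsampling, one convolution layer expands the number of channels of the feature map by a factor equal to the square of the scale factor: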

$ {\mathit{\boldsymbol{F}}_{\rm{e}}} = \mathit{\boldsymbol{W}}_{\rm{e}}^{3 \times 3} \times {\mathit{\boldsymbol{F}}_{{\rm{GFF}}}} + {\mathit{\boldsymbol{b}}_{\rm{e}}} $

(21)

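where $\mathit{\boldsymbol{W}}_{\rm{e}}^{3 \times 3}$ and ${\mathit{\boldsymbol{b}}_{\rm{e}}}$ are the 3×3 convolution kernel and bias used for channel expansion, and ${\mathit{\boldsymbol{F}}_{\rm{e}}}$ is the feature map after channel expansion. Sub-pixel convolution then upsamples ${\mathit{\boldsymbol{F}}_{\rm{e}}}$ to obtain the upsampled feature map ${\mathit{\boldsymbol{F}}_{{\rm{up}}}}$: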

$ {\mathit{\boldsymbol{F}}_{{\rm{up}}}} = {H_{{\rm{up}}}}({\mathit{\boldsymbol{F}}_{\rm{e}}}) $

(22)

where ${H_{{\rm{up}}}}\left(\cdot \right)$ denotes the upsampling operation performed by sub-pixel convolution.
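The rearrangement at the heart of sub-pixel convolution can be sketched as follows. The function `pixel_shuffle` is a plain-Python illustration that turns a feature map with $C \cdot r^2$ channels of size $H \times W$ into $C$ channels of size $rH \times rW$; the channel ordering follows the convention commonly used in deep learning frameworks, which is assumed here rather than taken from the paper.

```python
def pixel_shuffle(feat, r):
    """Rearrange (C*r*r) x H x W into C x (H*r) x (W*r)."""
    c_out = len(feat) // (r * r)
    h, w = len(feat[0]), len(feat[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h):
            for x in range(w):
                for i in range(r):
                    for j in range(r):
                        # Channel c*r*r + i*r + j supplies the sub-pixel at offset (i, j).
                        out[c][y * r + i][x * r + j] = feat[c * r * r + i * r + j][y][x]
    return out

# Four 1x2 channels become one 2x4 channel for r = 2.
feat = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
up = pixel_shuffle(feat, 2)
```

This is why the preceding convolution expands the channels by the square of the scale factor: each group of $r^2$ channels fills one $r \times r$ cell of the enlarged map.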

2.4 Reconstruction

One convolution layer reconstructs the upsampled feature map ${\mathit{\boldsymbol{F}}_{{\rm{up}}}}$ into the reconstructed image ${\mathit{\boldsymbol{I}}_{{\rm{HR'}}}}$:

$ {\mathit{\boldsymbol{I}}_{{\rm{H}}{{\rm{R}}^\prime }}} = \mathit{\boldsymbol{W}}_{{\rm{rec}}}^{3 \times 3} \times {\mathit{\boldsymbol{F}}_{{\rm{up}}}} + {\mathit{\boldsymbol{b}}_{{\rm{rec}}}} $

(23)

where $\mathit{\boldsymbol{W}}_{{\rm{rec}}}^{3 \times 3}$ and ${\mathit{\boldsymbol{b}}_{{\rm{rec}}}}$ are the 3×3 convolution kernel and bias used for reconstruction, and ${\mathit{\boldsymbol{I}}_{{\rm{HR'}}}}$ is the reconstructed high-resolution image.

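2.5 Loss function

The L1 function is used as the loss function to optimize the network. For a given dataset $\left\{ {\mathit{\boldsymbol{I}}_{{\rm{LR}}}^i, \mathit{\boldsymbol{I}}_{{\rm{HR}}}^i} \right\}_{i = 1}^N$ containing $N$ low-resolution images and their corresponding high-resolution images, the loss is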

$ L(\mathit{\boldsymbol{\theta }}) = \frac{1}{N}\sum\limits_{i = 1}^N {{{\left\| {\mathit{\boldsymbol{I}}_{{\rm{H}}{{\rm{R}}^\prime }}^i - \mathit{\boldsymbol{I}}_{{\rm{HR}}}^i} \right\|}_1}} $

(24)

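where $\mathit{\boldsymbol{\theta }}$ is the set of parameters to be learned, ${\mathit{\boldsymbol{I}}_{{\rm{HR}}}^i}$ is the original high-resolution image, and $\mathit{\boldsymbol{I}}_{{\rm{HR'}}}^i$ is the high-resolution image reconstructed by the proposed algorithm.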
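Eq. (24) can be written out directly as a short sketch, with each image flattened to a list of pixel values. The function name `l1_loss` is illustrative. Note that the equation sums absolute differences within each image and averages over the $N$ images; framework implementations often average over pixels as well, which changes the loss only by a constant factor for fixed patch sizes.

```python
def l1_loss(pred_batch, target_batch):
    """Mean over the N image pairs of the L1 norm of (reconstruction - ground truth).

    Each image is given as a flat list of pixel values."""
    total = 0.0
    for pred, target in zip(pred_batch, target_batch):
        total += sum(abs(p - t) for p, t in zip(pred, target))
    return total / len(pred_batch)

# Two tiny 'images' of three pixels each.
loss = l1_loss([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]],
               [[1.0, 0.0, 3.0], [1.0, 1.0, 1.0]])
```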

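3 Experimental results and analysis

3.1 Datasets

The DIV2K dataset is used for training. It contains 1 000 high-definition natural images, numbered 1 to 1 000, and the corresponding low-resolution images (obtained by interpolation). Each image is rich in detail and texture and is well suited to training natural-image super-resolution. Images 1~900 form the training set and images 901~1 000 the validation set. Set5, Set14 (Zeyde et al., 2010), BSD (Berkeley segmentation dataset) 100 (Martin et al., 2002), Urban100 (Huang et al., 2015), and Manga109 (Matsui et al., 2017) serve as test sets. The proposed algorithm reconstructs the three RGB channels directly. For training, 32 000 image patches of 48×48 pixels are randomly cropped as LR images, and the corresponding HR patches of 96×96, 144×144, and 192×192 pixels are cropped for scale factors 2, 3, and 4. Each scale thus has 32 000 training samples.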
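The relation between an LR patch and its HR counterpart can be illustrated with a small cropping sketch. The function `crop_pair` and its arguments are hypothetical; it assumes that the HR image is exactly `scale` times the size of the LR image, so that an LR window at `(top, left)` corresponds to the HR window at `(top * scale, left * scale)`.

```python
import random

def crop_pair(lr, hr, scale, patch=48, rng=random):
    """Randomly crop an aligned pair of patches.

    lr: H x W nested list; hr: (H*scale) x (W*scale) nested list."""
    h, w = len(lr), len(lr[0])
    top = rng.randrange(h - patch + 1)
    left = rng.randrange(w - patch + 1)
    lr_patch = [row[left:left + patch] for row in lr[top:top + patch]]
    # The HR window covers the same scene region, scaled by `scale`.
    hr_patch = [row[left * scale:(left + patch) * scale]
                for row in hr[top * scale:(top + patch) * scale]]
    return lr_patch, hr_patch

# Synthetic images whose 'pixels' store their own LR coordinates, to check alignment.
scale = 2
lr = [[(y, x) for x in range(64)] for y in range(60)]
hr = [[(y // scale, x // scale) for x in range(64 * scale)] for y in range(60 * scale)]
lr_patch, hr_patch = crop_pair(lr, hr, scale, rng=random.Random(0))
```

With `scale` set to 2, 3, or 4, the HR patch has side length 96, 144, or 192, matching the sizes given above.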

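3.2 Training details

During training, the width of the intermediate feature maps is set to $C$ = 64. The model parameters are optimized with the Adam algorithm, with ${\beta _1} = 0.9$, ${\beta _2} = 0.999$, and $\varepsilon = {10^{ - 8}}$. The model is trained for 300 epochs with an initial learning rate of 0.000 1, which is halved at epoch 200. The experiments were run on a computer with an Intel i5-9400F CPU, a GeForce GTX 1080 GPU with 8 GB of memory, 16 GB of RAM, Ubuntu 18.04.3, and the PyTorch deep learning framework.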
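The learning rate schedule described above amounts to a single step decay. A minimal sketch follows; the function name `learning_rate` is illustrative, and whether epochs are counted from 0 or 1 is an assumption (0-based here), since the text does not say.

```python
def learning_rate(epoch, base_lr=1e-4, halve_at=200):
    """Step schedule: the initial rate until epoch `halve_at`, half of it afterwards."""
    return base_lr * 0.5 if epoch >= halve_at else base_lr

# Learning rate for each of the 300 training epochs (0-based epoch index assumed).
schedule = [learning_rate(e) for e in range(300)]
```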

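3.3 Experimental results

3.3.1 Benchmark tests

To verify the effectiveness of the proposed algorithm, it is compared with bicubic interpolation, SRCNN (Dong et al., 2014), VDSR (Kim et al., 2016a), DRRN (Tai et al., 2017), RDN (Zhang et al., 2018b), and RCAN (Zhang et al., 2018a). These algorithms perform well, are representative, and are related to the proposed algorithm. Bicubic interpolation is the representative interpolation method; SRCNN was the first to apply a CNN to image super-resolution; VDSR was the first to build a deep SR network with residual learning; DRRN combines residual learning and recursive learning in a deep recursive network; RDN fuses local features with residual dense blocks (RDB) and achieves good reconstruction results; RCAN adds channel attention to the residual block to build the residual channel attention block (RCAB) and achieves very good reconstruction results. Because the original RDN and RCAN models are tens of times larger than the proposed model, both are reduced for a fair comparison: the number of RDBs in RDN is set to 3 and the number of RCABs in RCAN is set to 9, equal to the number of RABs in the proposed algorithm. The reduced models are similar in scale to the proposed model, and both are trained with the same training data and training strategy as the proposed algorithm.

1) Objective metrics. Two common objective metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are used to evaluate image quality, and the number of parameters (params) and floating-point operations (FLOPs) describe model size and computational complexity. Tables 1 and 2 list the results of the algorithms for different scale factors on the benchmark test sets, and Table 3 lists the parameters and FLOPs of each algorithm at scale ×2 with a reconstructed image of size 3×96×96.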
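For reference, PSNR follows directly from the mean squared error between the reference and the reconstructed image. The sketch below uses the standard definition on flattened pixel lists with a peak value of 255; the function name `psnr` is illustrative, and details such as whether the paper evaluates on RGB or on the luminance channel, or crops image borders, are not specified here and are left out.

```python
import math

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

identical = psnr([10, 20, 30], [10, 20, 30])   # no error at all
off_by_one = psnr([0, 0, 0, 0], [1, 1, 1, 1])  # every pixel differs by one gray level
```

Because the scale is logarithmic, a gain of a few hundredths of a dB, as in the tables below, corresponds to a small relative change in mean squared error.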

表 1 超分辨率算法在5个测试集上的PSNR平均值
Table 1 Comparison of average PSNR by different super-resolution algorithms on five datasets

Unit: dB
Method	Set5 ×2	Set5 ×3	Set5 ×4	Set14 ×2	Set14 ×3	Set14 ×4	BSD100 ×2	BSD100 ×3	BSD100 ×4	Urban100 ×2	Urban100 ×3	Urban100 ×4	Manga109 ×2	Manga109 ×3	Manga109 ×4
Bicubic interpolation	33.66	30.39	28.42	30.24	27.55	26.00	29.56	27.21	25.96	26.88	24.46	23.14	30.82	26.96	24.91
SRCNN (Dong et al., 2014)	36.66	32.75	30.48	32.45	29.30	27.50	31.36	28.41	26.90	29.50	26.24	24.52	35.74	30.59	27.66
VDSR (Kim et al., 2016a)	37.53	33.66	31.35	33.03	29.77	28.01	31.90	28.82	27.29	30.76	27.14	25.18	37.22	32.01	28.83
DRRN (Tai et al., 2017)	37.74	34.03	31.68	33.23	29.96	28.21	32.05	28.95	27.38	31.23	27.53	25.44	37.60	32.42	29.18
RDN (Zhang et al., 2018b)	37.78	34.05	31.87	33.32	30.03	28.32	32.05	28.89	27.36	31.61	27.51	25.54	37.71	32.47	29.32
RCAN (Zhang et al., 2018a)	37.82	34.10	31.98	33.28	30.07	28.32	32.01	28.93	27.36	31.55	27.73	25.60	37.65	32.50	29.40
Ours	37.87	34.26	31.98	33.37	30.13	28.40	32.07	28.97	27.45	31.68	27.80	25.77	37.79	32.57	29.39
Note: in the original table, the best result in each column is set in bold.

表 2 超分辨率算法在5个测试集上的SSIM平均值
Table 2 Comparison of average SSIM by different super-resolution algorithms on five datasets

Dataset	Scale	Bicubic interpolation	SRCNN (Dong et al., 2014)	VDSR (Kim et al., 2016a)	DRRN (Tai et al., 2017)	RDN (Zhang et al., 2018b)	RCAN (Zhang et al., 2018a)	Ours
Set5	×2	0.929 9	0.954 2	0.958 7	0.959 1	0.959 8	0.959 9	0.960 1
Set5	×3	0.868 2	0.909 0	0.921 3	0.924 4	0.924 6	0.925 1	0.926 1
Set5	×4	0.810 4	0.862 8	0.883 8	0.888 8	0.890 6	0.891 4	0.892 8
Set14	×2	0.868 8	0.906 7	0.912 4	0.913 6	0.914 9	0.914 8	0.915 7
Set14	×3	0.774 2	0.821 5	0.831 4	0.834 9	0.835 7	0.836 1	0.838 2
Set14	×4	0.702 7	0.751 3	0.767 4	0.772 1	0.774 0	0.774 4	0.776 7
BSD100	×2	0.843 1	0.887 9	0.896 0	0.897 3	0.898 2	0.897 5	0.898 5
BSD100	×3	0.738 5	0.786 3	0.797 6	0.800 4	0.799 7	0.801 1	0.802 3
BSD100	×4	0.667 5	0.710 1	0.725 1	0.728 4	0.729 5	0.729 6	0.732 4
Urban100	×2	0.840 3	0.894 6	0.914 0	0.918 8	0.923 3	0.922 2	0.924 0
Urban100	×3	0.734 9	0.798 9	0.827 9	0.837 8	0.837 2	0.842 7	0.845 2
Urban100	×4	0.657 7	0.722 1	0.752 4	0.763 8	0.767 6	0.770 2	0.775 5
Manga109	×2	0.933 2	0.966 1	0.976 9	0.973 6	0.974 2	0.973 9	0.975 1
Manga109	×3	0.855 5	0.910 7	0.931 0	0.935 9	0.934 1	0.934 5	0.935 5
Manga109	×4	0.782 6	0.850 5	0.880 9	0.891 4	0.894 0	0.895 2	0.895 3
Note: in the original table, the best result in each row is set in bold.

表 3 不同算法使用的参数量和浮点运算量
Table 3 Number of parameters and FLOPs used by different algorithms

Method	Parameters/K	FLOPs/G
Bicubic interpolation	0	0
SRCNN (Dong et al., 2014)	20.1	0.19
VDSR (Kim et al., 2016a)	668.3	6.17
DRRN (Tai et al., 2017)	301.8	68.01
RDN (Zhang et al., 2018b)	959.1	2.22
RCAN (Zhang et al., 2018a)	898.1	1.98
Ours	919.6	2.13

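Tables 1 and 2 show that the proposed algorithm achieves the best PSNR and SSIM for almost all test sets and scales. The exceptions are Set5 at ×4, where its PSNR equals that of RCAN, and Manga109, where RCAN is 0.01 dB higher in PSNR at ×4 and VDSR and DRRN are higher in SSIM at ×2 and ×3, respectively. Taking Set14 as an example and comparing with the reduced RCAN network of similar scale, PSNR and SSIM improve by 0.09 dB and 0.000 9 at ×2, by 0.06 dB and 0.002 1 at ×3, and by 0.08 dB and 0.002 3 at ×4.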
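The Set14 gains over the reduced RCAN can be recomputed from the Set14 entries of Tables 1 and 2, which is a quick way to check such comparisons. The dictionaries below simply copy those table entries; the variable names are illustrative.

```python
# Set14 rows of Table 1 (PSNR, dB) and Table 2 (SSIM), keyed by scale factor.
psnr_rcan = {2: 33.28, 3: 30.07, 4: 28.32}
psnr_ours = {2: 33.37, 3: 30.13, 4: 28.40}
ssim_rcan = {2: 0.9148, 3: 0.8361, 4: 0.7744}
ssim_ours = {2: 0.9157, 3: 0.8382, 4: 0.7767}

# Gain of the proposed method over the reduced RCAN at each scale: (PSNR gain, SSIM gain).
gains = {s: (round(psnr_ours[s] - psnr_rcan[s], 2),
             round(ssim_ours[s] - ssim_rcan[s], 4))
         for s in (2, 3, 4)}
```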

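Table 3 shows that the proposed algorithm has more parameters than VDSR (919.6 K versus 668.3 K), but its PSNR is about 0.15~0.92 dB higher than that of VDSR across the scales and test sets in Table 1. DRRN uses recursive learning and has far fewer parameters, but its computation is much higher than that of the proposed algorithm, and its PSNR is lower at every scale and on every test set. Compared with the reduced RDN network, the proposed algorithm obtains better reconstruction with fewer parameters. Compared with the strong RCAN network, the proposed algorithm has slightly more parameters because of the hierarchical feature fusion structure, but its reconstruction results are generally better.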
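As a rough sanity check on the model size, the parameters of the layers described in Section 2 can be tallied by hand. The sketch below assumes 64 feature channels, a bias in every convolution, scale ×2, and exactly the layers named in the text; these are assumptions, and the helper `conv_params` is illustrative.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Number of weights (plus optional biases) of a k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

C, SCALE = 64, 2
# One residual attention block: two 3x3 convs plus the 5x5 and 1x1 convs of the SA module.
rab = 2 * conv_params(3, C, C) + conv_params(5, 2, 2) + conv_params(1, 2, 1)
parts = {
    "shallow 3x3 conv (3 -> 64)": conv_params(3, 3, C),
    "9 residual attention blocks": 9 * rab,
    "3 local fusions (1x1, 256 -> 64)": 3 * conv_params(1, 4 * C, C),
    "global fusion (1x1, 256 -> 64)": conv_params(1, 4 * C, C),
    "channel expansion (3x3, 64 -> 64*s^2)": conv_params(3, C, C * SCALE ** 2),
    "reconstruction (3x3, 64 -> 3)": conv_params(3, C, 3),
}
total = sum(parts.values())
gap_to_table3 = 919_604 - total   # 919.6 K is the figure reported in Table 3
```

Under these assumptions the described layers sum to about 882.7 K parameters, and the spatial attention modules contribute only 105 parameters per block. The remaining difference of roughly 36.9 K from the 919.6 K in Table 3 is about the size of one additional 3×3 convolution with 64 input and output channels; the text does not describe such a layer, so this is only an observation and not a statement about the authors' implementation.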

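2) Visual comparison. Besides objective metrics, the reconstruction results are analyzed visually. Fig. 8 shows enlarged details of the reconstructions of the image baboon from Set14 at scale ×2; compared with the other algorithms, the proposed algorithm reconstructs the baboon's whiskers with clearer, more complete lines that are closer to the original high-resolution image. Fig. 9 shows enlarged details for the image img046 from Urban100 at scale ×3; the building lines reconstructed by the proposed algorithm are straighter and sharper than those of the other algorithms and look better. Fig. 10 shows enlarged details for the image 119082 from BSD100 at scale ×4; the contours of the car are clearer in the reconstruction by the proposed algorithm. Fig. 11 shows enlarged details for the image Garakutayamanta from Manga109 at scale ×4; the lines of the cartoon character are clearer and more complete in the reconstruction by the proposed method.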

图 8 不同算法对Set14中baboon在尺度为×2时重建效果对比图

Fig. 8 Comparison of reconstructed HR images of baboon in Set14 by different SR algorithms with the scale factor ×2

((a) baboon×2;(b) HR; (c) bicubic interpolation; (d) SRCNN; (e) VDSR; (f) DRRN; (g) RDN; (h) RCAN; (i) ours)

图 9 不同算法对Urban100中img046在尺度为×3时重建效果对比图

Fig. 9 Comparison of reconstructed HR images of img046 in Urban100 by different SR algorithms with the scale factor ×3

((a) img046×3;(b) HR; (c) bicubic interpolation; (d) SRCNN; (e) VDSR; (f) DRRN; (g) RDN; (h) RCAN; (i) ours)

图 10 不同算法对B100中119082在尺度为×4时重建效果对比图

Fig. 10 Comparison of reconstructed HR images of 119082 in B100 by different SR algorithms with the scale factor ×4

((a) 119082×4;(b) HR; (c) bicubic interpolation; (d) SRCNN; (e) VDSR; (f) DRRN; (g) RDN; (h) RCAN; (i) ours)

图 11 不同算法对Manga109中Garakutayamanta在尺度为×4时重建效果对比图

Fig. 11 Comparison of reconstructed HR images of Garakutayamanta in Manga109 by different SR algorithms with the scale factor ×4

((a) Garakutayamanta×4;(b) HR; (c) bicubic interpolation; (d) SRCNN; (e) VDSR; (f) DRRN; (g) RDN; (h) RCAN; (i) ours)

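3.3.2 Model feasibility evaluation

The model feasibility evaluation verifies the effect of the spatial attention (SA) module and the hierarchical feature fusion (HFF) structure on the reconstruction results. Because the proposed algorithm combines the residual block (RB), the SA module, and the HFF structure, it is compared with the EDSR network (Lim et al., 2017), which contains only residual blocks. To give the two algorithms a similar scale, the EDSR network is built from nine residual blocks.

1) Convergence analysis. Fig. 12(a) shows the convergence curves of the proposed algorithm and the EDSR model during training at scale factor 2. The curves show that with the SA and HFF modules the network converges faster and performs better. Fig. 12(b) shows the loss curves of the two algorithms during training. Fig. 12 shows that between epochs 250 and 300 the PSNR stops rising and the loss stops falling, indicating that the network has converged on the training set.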

图 12 关于SA模块和HFF模块的收敛分析

Fig. 12 Convergence analysis on SA and HFF

((a) PSNR-epochs curves; (b) loss-epochs curves)

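2) Visual analysis. Fig. 13 shows the reconstructions of the image img093 from Urban100 at scale ×4 by the proposed algorithm and the EDSR network. Fig. 13(b) is the high-resolution image, Fig. 13(c) is the reconstruction by the EDSR model with nine residual blocks, and Fig. 13(d) is the reconstruction by the proposed algorithm. In Fig. 13(c) the reconstructed lines deviate completely from the true high-resolution image, whereas Fig. 13(d) recovers lines closer to the true high-resolution image.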

图 13 本文算法和EDSR模型在Urban100中img093上重建效果对比图

Fig. 13 Comparison of reconstructed HR images of img093 in Urban100 by EDSR and our method

((a) img093×4; (b) HR; (c) EDSR; (d) ours)

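3) Sub-module analysis. To show the contributions of the SA and HFF modules separately, the RB module, the SA module, the channel attention (CA) module, and the HFF module are combined into different models. At scale ×2, the following models are evaluated on each test set: RB only (i.e., the EDSR network), RB with HFF, RB with SA, RB with CA (i.e., the RCAN network), and RB with HFF and SA (i.e., the proposed algorithm). The results are listed in Tables 4 and 5. Adding either the SA module or the HFF module to the RB module generally gives better results than the RB-only EDSR network, and on four of the five test sets (all except Set5) adding the SA module improves the results more than adding the CA module.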

表 4 不同模块组合的模型在5个测试集上的PSNR值
Table 4 Comparison of average PSNR by different module combinations on five benchmark test sets

Unit: dB
Method	Set5	Set14	BSD100	Urban100	Manga109
RB (EDSR)	37.75	33.30	32.02	31.56	37.67
RB + HFF	37.76	33.32	32.03	31.60	31.71
RB + SA	37.80	33.32	32.03	31.58	37.74
RB + CA (RCAN)	37.82	33.28	32.01	31.55	37.65
RB + SA + HFF (ours)	37.87	33.37	32.07	31.68	37.79
Note: in the original table, the best result in each column is set in bold.

表 5 不同模块组合的模型在5个测试集上的SSIM值
Table 5 Comparison of average SSIM by different module combinations on five benchmark test sets

Method	Set5	Set14	BSD100	Urban100	Manga109
RB (EDSR)	0.959 8	0.914 8	0.897 6	0.922 6	0.974 2
RB + HFF	0.959 9	0.915 1	0.898 1	0.923 2	0.974 5
RB + SA	0.955 9	0.915 4	0.897 9	0.922 9	0.974 4
RB + CA (RCAN)	0.959 9	0.914 8	0.897 5	0.922 2	0.973 9
RB + SA + HFF (ours)	0.960 1	0.915 7	0.898 5	0.924 0	0.975 1
Note: in the original table, the best result in each column is set in bold.

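These experiments show that, in both objective metrics and subjective perception, the proposed SA and HFF modules improve the reconstruction results, which supports the feasibility and effectiveness of the proposed algorithm.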

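4 Conclusion

This paper proposes a lightweight spatial attention residual network with a hierarchical feature fusion structure. First, a spatial attention module lets the network adaptively assign more attention to regions carrying high-frequency information, helping it recover high-frequency detail more quickly. Second, a hierarchical feature fusion structure between the residual attention blocks strengthens information flow and feature reuse between layers. The module analysis experiments show that the proposed hierarchical feature fusion structure and spatial attention module effectively improve reconstruction quality. Experiments on the standard test sets Set5, Set14, BSD100, Urban100, and Manga109 show that the proposed algorithm outperforms other algorithms of the same scale in both subjective visual evaluation and objective quantitative evaluation.

Because the proposed spatial attention module is designed on low-resolution feature maps, it improves the reconstruction results noticeably at small scale factors but less at large scale factors. Future work will study how to extract suitable spatial attention on high-resolution feature maps to further improve the results.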

References

Dong C, Loy C C, He K M and Tang X O. 2014. Learning a deep convolutional network for image super-resolution//Proceedings of the 13th European Conference on Computer Vision. Zurich: Springer: 184-199[DOI:10.1007/978-3-319-10593-2]

Dong C, Loy C C and Tang X O. 2016. Accelerating the super-resolution convolutional neural network//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer: 391-407[DOI:10.1007/978-3-319-46475-6_25]

Fang B W, Huang Z Q, Li Y and Wang Y. 2017. υ-support vector machine based on discriminant sparse neighborhood preserving embedding. Pattern Analysis and Applications, 20(4): 1077-1089[DOI:10.1007/s10044-016-0547-x]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 770-778[DOI:10.1109/CVPR.2016.90]

Hu J, Shen L, Albanie S, Sun G and Wu E H. 2019. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence: 2011-2023[DOI:10.1109/TPAMI.2019.2913372]

Huang J B, Singh A and Ahuja N. 2015. Single image super-resolution from transformed self-exemplars//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE: 5197-5206[DOI:10.1109/CVPR.2015.7299156]

Kim J, Lee J K and Lee K M. 2016a. Accurate image super-resolution using very deep convolutional networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 1646-1654[DOI:10.1109/CVPR.2016.182]

Li J C, Fang F M, Mei K F and Zhang G X. 2018. Multi-scale residual network for image super-resolution//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer: 527-542[DOI:10.1007/978-3-030-01237-3_32]

Lim B, Son S, Kim H, Nah S and Lee K M. 2017. Enhanced deep residual networks for single image super-resolution//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu: IEEE: 1132-1140[DOI:10.1109/CVPRW.2017.151]

Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 2117-2125[DOI:10.1109/CVPR.2017.106]

Martin D, Fowlkes C, Tal D and Malik J. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics//Proceedings of the 8th IEEE International Conference on Computer Vision. Vancouver: IEEE: 416-423[DOI:10.1109/ICCV.2001.937655]

Matsui Y, Ito K, Aramaki Y, Fujimoto A, Ogawa T, Yamasaki T and Aizawa K. 2017. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20): 21811-21838[DOI:10.1007/s11042-016-4020-z]

Shi W Z, Caballero J, Huszár F, Totz J, Aitken A P, Bishop R, Rueckert D and Wang Z H. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 1874-1883[DOI:10.1109/CVPR.2016.207]

Tai Y, Yang J and Liu X M. 2017. Image super-resolution via deep recursive residual network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 2790-2798[DOI:10.1109/CVPR.2017.298]

Yang J C, Wang Z W, Lin Z, Cohen S and Huang T. 2012. Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing, 21(8): 3467-3478[DOI:10.1109/TIP.2012.2192127]

Yang X, Zhang Y, Zhou D and Yang R G. 2015. An improved iterative back projection algorithm based on ringing artifacts suppression. Neurocomputing, 162: 171-179[DOI:10.1016/j.neucom.2015.03.055]

Ying Z L and Long X. 2019. Single-image super-resolution construction based on multi-scale dense residual network. Journal of Image and Graphics, 24(3): 410-419 (应自炉, 龙祥. 2019. 多尺度密集残差网络的单幅图像超分辨率重建. 中国图象图形学报, 24(3): 410-419)[DOI:10.11834/jig.180431]

Zeyde R, Elad M and Protter M. 2010. On single image scale-up using sparse-representations//Proceedings of the 7th International Conference on Curves and Surfaces. Avignon: Springer: 711-730[DOI:10.1007/978-3-642-27413-8_47]

Zhang Y L, Li K P, Li K, Wang L C, Zhong B E and Fu Y. 2018a. Image super-resolution using very deep residual channel attention networks//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer: 294-310[DOI:10.1007/978-3-030-01234-2_18]

Zhang Y L, Tian Y P, Kong Y, Zhong B E and Fu Y. 2018b. Residual dense network for image super-resolution//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 2472-2481[DOI:10.1109/CVPR.2018.00262]