Published: 2018-04-16
DOI: 10.11834/jig.170361
2018 | Volume 23 | Number 4

GDC 2017 conference special column

图像超分辨率重建中的细节互补卷积模型

李浪宇^1,2,3, 苏卓^1,2, 石晓红^4, 黄恩博^1,2, 罗笑南^5

1. 中山大学数据科学与计算机学院, 广州 510006;

2. 中山大学国家数字家庭工程技术研究中心, 广州 510006;

3. 中山大学深圳研究院, 深圳 518057;

4. 中山大学新华学院信息科学学院, 广州 510520;

5. 桂林电子科技大学计算机与信息安全学院, 桂林 541004

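Received: 2017-07-10; Revised: 2017-09-28

Supported by: National Natural Science Foundation of China (61320106008, 61502541, 61772140); Natural Science Foundation of Guangdong Province, Doctoral Start-up Fund (2016A030310202); Fundamental Research Funds for the Central Universities, Sun Yat-sen University Young Teachers Training Fund (16lgpy39); Science and Technology Planning Project of Guangdong Province (2015B010129008)

First author: Li Langyu (1993-), male, master's student in computer technology at Sun Yat-sen University. His main research interest is digital image processing. E-mail: lily43@mail2.sysu.edu.cn.

CLC number: TP391

Document code: A

Article ID: 1006-8961(2018)04-0572-11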

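Abstract

Objective To reconstruct high-resolution images well, existing super-resolution convolutional neural networks require ever deeper architectures and ever more training. As a result, they depend heavily on the number of training samples, have so many parameters that training becomes difficult, need a large number of training iterations, and place high demands on hardware. To address these problems, this paper proposes an improved super-resolution reconstruction network model. Method Unlike the traditional single-input model, we adopt a dual-input mutual-detail network model, which adds a new input alongside the feature extraction and mapping network of the original single-input SRCNN model. Exploiting the local similarity of images, we build a detail network to supplement the image features, and use one convolutional layer to fuse the features from the detail network with those from the feature extraction network and reconstruct the high-resolution image. Result We compare our method with other mainstream methods both subjectively and objectively. At a network depth similar to that of SRCNN, the PSNR of our method at an upscaling factor of 3 is 0.17 dB and 0.08 dB higher than that of SRCNN on Set5 and Set14, respectively. Subjectively, our method restores image edges and texture details well. Conclusion The experiments show that the proposed mutual-detail network model obtains effective reconstructions and preserves more image details with less training and a relatively shallow network.

Keywords

super-resolution reconstruction; deep learning; convolutional neural network; nonlinear mapping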

Mutual-detail convolution model for image super-resolution reconstruction

Li Langyu^1,2,3, Su Zhuo^1,2, Shi Xiaohong^4, Huang Enbo^1,2, Luo Xiaonan^5

1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China;

2. National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou 510006, China;

3. Research Institute of Sun Yat-sen University in Shenzhen, Shenzhen 518057, China;

4. School of Information Science, Xinhua College of Sun Yat-sen University, Guangzhou 510520, China;

5. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China

Supported by: National Natural Science Foundation of China (61320106008, 61502541, 61772140)

Abstract

Objective Single-image super-resolution (SR) is a classical problem in computer vision. High-resolution images carry more useful information and are desired in many applications, such as medical imaging, remote sensing, video surveillance, and entertainment. However, in some scenes, such as long-distance shooting, only low-resolution images of specific objects can be obtained because of the limitations of physical devices. SR has therefore attracted considerable attention from the computer vision community in the past decades. We address the problem of generating a high-resolution image from a single low-resolution image, which is commonly referred to as single-image SR. Early methods include bicubic interpolation, Lanczos resampling, statistical priors, neighbor embedding, and sparse coding. In recent years, a series of convolutional neural network (CNN) models has been proposed for single-image SR. Deep learning attempts to learn layered, hierarchical representations of high-dimensional data. However, the classical CNN for SR is a single-input model, which limits its performance. To recover images with good details, these CNNs require deep networks, costly training, and a large number of sample images. These requirements lead to numerous parameters, a large number of training iterations, and high hardware demands. In view of these problems, an improved super-resolution reconstruction network model is proposed. Method Unlike the traditional single-input model, we adopt a mutual-detail convolution model with two inputs. Combining paths of different scales enables the model to cover a wide range of receptive fields. The features of image blocks of different sizes complement one another at different scales. Low-dimensional and high-dimensional features are combined to supplement the details of the restored images and thus improve the quality and detail of the reconstruction. 
Traditional self-similarity-based methods can also be combined with neural networks. The entire convolution model can be divided into three parts: the F1, F2, and F3 networks. F1 is the feature extraction and nonlinear mapping network with three layers, which use filters with spatial sizes of 9×9 and 3×3. F2 is the detail network used to complement the features of F1; it consists of two layers with filters of spatial sizes 11×11 and 5×5. F3 is the reconstruction network. We use the mean squared error as the loss function and minimize it using stochastic gradient descent (SGD) with standard backpropagation. The network takes an original low-resolution image and an interpolated low-resolution image (upscaled to the desired size) as inputs and predicts the image details. Our method adds a new input to supplement the high-frequency information that is lost during the reconstruction process. As shown in the literature, deep learning generally benefits from big-data training. We use a training set of 500 images from BSD500, and the flipped and rotated versions of the training images are considered. We rotate the original images by 90° and 270°. In consideration of training time and storage complexity, the training images are split into 33×33 and 39×39 sub-images with a stride of 14. We set the SGD mini-batch size to 64 and the momentum parameter to 0.9. Result We use Set5 and Set14 as the validation sets. Following previous work, we take the conventional approach to super-resolving color images: we transform the color images into the YCbCr space and apply the SR algorithm only to the Y channel, whereas the Cb and Cr channels are upscaled by bicubic interpolation. We show the quantitative and qualitative results of our method in comparison with those of state-of-the-art methods. 
Compared with traditional methods and SRCNN, our method obtains better peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) values on the Set5 and Set14 datasets. For the upscaling factor 3, the average PSNR of our method is 0.17 dB and 0.08 dB higher than that of the next best approach, SRCNN, on the two datasets, respectively. A similar trend is observed when we use SSIM as the performance metric. Compared with SRCNN, the number of training iterations of our approach is reduced by about two orders of magnitude. With a lightweight structure, our method achieves performance superior to that of state-of-the-art methods. Conclusion The experiments show that the proposed method can effectively reconstruct images with considerable details with little training and a relatively shallow network. However, compared with the result of a very deep neural network, the result of our method is not sufficiently precise, and the network structure is relatively simple. In future work, we will consider using deeper layers to acquire more image features at different levels and extending our model to other image tasks.

Key words

super-resolution reconstruction; deep learning; convolutional neural network; nonlinear mapping

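0 Introduction

Super-resolution reconstruction^[1] recovers a high-resolution image from one or more low-resolution frames of the same scene. It is an ill-posed problem, that is, its solution is not unique^[2]: a single low-resolution image corresponds to many possible high-resolution images. The key issue is how to find a high-quality high-resolution image among them. Current super-resolution methods fall into three main categories: interpolation-based, reconstruction-based, and learning-based methods^[3]. Interpolation-based methods map the pixels of the low-resolution image onto the high-resolution grid and estimate the missing pixels from the known ones; classical examples are new edge-directed interpolation (NEDI)^[4] and interpolation methods that adapt to local edges^[5-6]. Reconstruction-based methods mine the high-frequency information in the low-resolution image, combine it with image priors, and solve the inverse of the low-resolution imaging process to recover the high-frequency information; examples of such priors include the similarity redundancy prior^[7-8] and the gradient profile prior^[9]. With the progress of machine learning in recent years, learning-based methods have become a research hotspot^[10-12]. Instead of seeking a general prior formula to constrain the solution space, learning-based methods build a training set from a large number of samples and learn the mapping between low-resolution and high-resolution images; the prior information is implicit in this mapping.

This paper uses a convolutional neural network to learn the mapping between low-resolution and high-resolution images. In contrast to traditional end-to-end neural networks, our method is inspired by the residual network (ResNet)^[13] and assumes that the features produced by the lower layers of a network also contribute to the final result. However, current residual networks have a very large number of layers and are very expensive to train, so we instead use a larger input image block to supplement the lost details. The model structure is shown in Fig. 1. The experimental results show that, with few samples and few iterations, our method outperforms the super-resolution convolutional neural network (SRCNN)^[10].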

Fig. 1 The pipeline of our model. After color and size conversion of the low-resolution image, the mutual-detail network reconstructs the high-resolution version of the Y component (the whole convolution model can be divided into three parts: $ {F_1} $ is the feature extraction and nonlinear mapping network; $ {F_2} $ is the detail network; $ {F_3} $ is the image reconstruction network)

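The main contributions of this paper are as follows:

1) We design and implement a mutual-detail convolutional neural network model that uses the detail features of image blocks of different sizes to supplement the details of the reconstructed high-resolution image.

2) Our method achieves good results with very few samples. It also has a low cost, and the mapping it learns is accurate.

3) We demonstrate the effectiveness of our method and analyze the differences between it and representative mainstream methods from both subjective and objective perspectives.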

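1 Related work

Learning-based super-resolution methods build prior information from the relationship between the high-resolution and low-resolution versions of a set of sample images, or of the image itself, and thereby reconstruct images efficiently. With the development of deep learning, learning-based super-resolution methods can be further divided into methods based on classical learning (manifold learning, sparse learning, etc.)^[14] and methods based on deep network representations^[12, 15-16]. The two classes are introduced below.

1) Super-resolution methods based on classical learning. The main idea of this class is that the mapping between high-resolution and low-resolution image patches can be learned and then used to recover the most likely high-resolution image. Example-based learning methods have been shown to break through the limits of traditional super-resolution reconstruction methods^[17-18]. In 2004, Chang et al.^[19] proposed the neighbor embedding super-resolution algorithm: following the idea of manifold learning, it finds for each low-resolution patch its $k$ most similar low-resolution patches and their weights, transfers the weights to the $k$ corresponding high-resolution patches, and obtains the high-resolution patch as their weighted combination. In 2010, Yang et al.^[20] combined sparse representation with dictionary learning to solve the super-resolution problem. In 2011, Dong et al.^[21] proposed an adaptive sparse domain selection algorithm, which alleviates the edge reconstruction noise found in most super-resolution methods. In 2013, Timofte et al.^[22] combined neighbor embedding with sparse representation and proposed anchored neighborhood regression, which learns paired high-resolution and low-resolution dictionaries from sample images and solves the super-resolution problem effectively by sharing weights between the two dictionaries. In 2014, Timofte et al.^[23] proposed an improved version of anchored neighborhood regression. Another group of classical learning methods learns from image self-similarity. Self-similarity means that a local patch has approximate counterparts in non-local regions or across multiple scales, which provides additional information for recovering the high-resolution image. Glasner et al.^[24] first proposed, in 2009, a feasible way to recover high-resolution images using image self-similarity and verified its effectiveness. In 2011, Freedman et al.^[25] proposed a super-resolution algorithm based on the local self-similarity of images. In 2015, Huang et al.^[26] proposed a super-resolution algorithm that searches for similar patches by detecting perspective geometry.

2) Image reconstruction with deep learning frameworks. With the development of machine learning, various deep-learning-based methods for super-resolution reconstruction have appeared in recent years. Unlike classical learning methods, these algorithms use a large number of image samples and build deep network models to learn the mapping between low-resolution and high-resolution images. In 2014, Dong et al.^[10] proposed using a convolutional neural network for single-image super-resolution. In 2015, Kim et al.^[12] proposed a 20-layer deep convolutional neural network to learn the mapping between low-resolution and high-resolution images. In the same year, Kim et al.^[27] proposed the deeply-recursive convolutional network (DRCN), which enlarges the receptive field of the network while effectively reducing the number of parameters. In 2016, Ledig et al.^[11] proposed a network model that uses a generative adversarial network for super-resolution. All of these methods handle super-resolution well, but they suffer from networks that are too deep to train easily and from the need for many iterations and large amounts of sample data.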

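2 Mutual-detail network model

Let a single low-resolution image be $ y \in {{\boldsymbol{\rm{R}}}^{h \times w}} $, where $h$ and $w$ are the height and width of the image. The low-resolution image is first upscaled to the required size by bicubic interpolation^[28]; the upscaled image is denoted $Y$, with $ Y \in {{\boldsymbol{\rm{R}}}^{rh \times rw}} $, where $r$ is the upscaling factor. The proposed mutual-detail network model then learns the mapping between the bicubic-upscaled image $Y$ and the ground-truth image $X$. Bicubic interpolation is commonly used in super-resolution^[10-12]. Its advantages are that it is faster than other methods and does not introduce much extraneous information.

The mutual-detail network model consists of three parts. The first part is the network for feature extraction and nonlinear mapping. The second part is the detail network, which supplements the feature details of the reconstructed image. The third part is the reconstruction network, which performs the final high-resolution image reconstruction.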

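2.1 Feature extraction and nonlinear mapping

The first task is to extract the useful feature information contained in the low-resolution image. Some existing feature extraction methods obtain edge features and the like by convolving the image with convolution kernels^[29]. A convolutional neural network likewise convolves the image with various kernels to produce the input of the next layer. Therefore, to extract low-dimensional image features and to learn the mapping from them to high-dimensional image features, we use a three-layer network, which we call $ {F_1} $.

The first layer $ {F_{1, 1}} $ mainly extracts features from the low-resolution image and is formally defined as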

$ {F_{1, 1}} = \max \left( {0, {\mathit{\boldsymbol{W}}_{1, 1}} \otimes Y + {\mathit{\boldsymbol{B}}_{1, 1}}} \right) $

(1)

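where $ {\mathit{\boldsymbol{W}}_{1, 1}} $ and $ {\mathit{\boldsymbol{B}}_{1, 1}} $ denote the filters and the biases, respectively, and ⊗ denotes the convolution operation. $ {\mathit{\boldsymbol{W}}_{1, 1}} $ has $ c \times {f_{1, 1}} \times {f_{1, 1}} \times {n_{1, 1}} $ parameters, where $c$ is the number of channels of the input image, $ {f_{1, 1}} $ is the spatial size of a filter, and $ {n_{1, 1}} $ is the number of filters. Intuitively, $ {\mathit{\boldsymbol{W}}_{1, 1}} $ consists of $ {n_{1, 1}} $ filters, each of size $ c \times {f_{1, 1}} \times {f_{1, 1}} $. $ {\mathit{\boldsymbol{B}}_{1, 1}} $ is an $ {n_{1, 1}} $-dimensional vector. We use the $ {\rm{ReLU}} $ function^[30] as the activation function of the network. It is defined as

$ {\rm{ReLU}}\left( x \right) = \max \left( {0, x} \right) $

(2)

Compared with traditional activation functions, $ {\rm{ReLU}} $ is easier to optimize and faster to compute. The traditional sigmoid function saturates at both ends and therefore tends to lose information during propagation^[31]. For this reason, we choose $ {\rm{ReLU}} $ as the activation function.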
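As a small numerical illustration of the saturation argument, the following NumPy sketch compares the derivatives of ReLU and sigmoid at a few inputs. The function names and the sample inputs are chosen here for illustration and do not come from the paper.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), as in Eq. (2)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, otherwise 0."""
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print("ReLU gradient:   ", relu_grad(x))
print("sigmoid gradient:", sigmoid_grad(x))
```

For large positive or negative inputs the sigmoid derivative becomes very small, whereas the ReLU derivative stays at 1 for every positive input.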

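After the first convolutional layer, an $ {n_{1, 1}} $-dimensional feature has been extracted for every image patch. The next step is to map the low-resolution image features extracted by the first layer to high-dimensional image features through a nonlinear transformation. In SRCNN^[10], the authors first used 1×1 kernels for this step and then changed the kernel size to 3×3 and 5×5. The SRCNN experiments showed that a larger kernel yields a better nonlinear mapping, but a larger kernel also has more parameters, which makes the layer harder to train. We therefore do not use very large kernels. Related experiments indicate that a deeper network can learn more features and form a better mapping. We thus use two layers, $ {F_{1, 2}} $ and $ {F_{1, 3}} $, to implement the nonlinear mapping, that is,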

$ {F_{1, 2}} = \max \left( {0, {\mathit{\boldsymbol{W}}_{1, 2}} \otimes {F_{1, 1}}\left( \mathit{\boldsymbol{Y}} \right) + {\mathit{\boldsymbol{B}}_{1, 2}}} \right) $

(3)

$ {F_{1, 3}} = \max \left( {0, {\mathit{\boldsymbol{W}}_{1, 3}} \otimes {F_{1, 2}}\left( \mathit{\boldsymbol{Y}} \right) + {\mathit{\boldsymbol{B}}_{1, 3}}} \right) $

(4)

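where $ {\mathit{\boldsymbol{W}}_{1, 2}} $ consists of $ {n_{1, 2}} $ convolution kernels of size $ {n_{1, 1}} \times {f_{1, 2}} \times {f_{1, 2}} $, $ {\mathit{\boldsymbol{W}}_{1, 3}} $ consists of $ {n_{1, 3}} $ filters of size $ {n_{1, 2}} \times {f_{1, 3}} \times {f_{1, 3}} $, and $ {\mathit{\boldsymbol{B}}_{1, 2}} $ and $ {\mathit{\boldsymbol{B}}_{1, 3}} $ are $ {n_{1, 2}} $-dimensional and $ {n_{1, 3}} $-dimensional vectors, respectively. Through $ {F_{1, 2}} $ and $ {F_{1, 3}} $, the features extracted from the low-resolution image are nonlinearly mapped to features of the high-resolution image.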

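2.2 Detail network

$ {F_1} $ alone can hardly guarantee that enough detail features, such as textures and edges, are extracted. Simply adding layers to $ {F_1} $ increases the hardware cost, makes the network harder to train, and does not guarantee better results^[14]. We hold that, as in residual networks, the inputs of intermediate layers also influence the final result and help obtain better results^[7]. In addition, the reconstruction method of Suetake et al.^[32] exploits the local self-similarity of images, which is likewise important for obtaining high-resolution results. Combining the ideas of local self-similarity and residual networks, we introduce a detail network $ {F_2} $. The input of $ {F_2} $ is the input of $ {F_1} $ together with its surrounding image region, so that the larger input block provides more detail features. $ {F_2} $ consists of two convolutional layers, $ {F_{2, 1}} $ and $ {F_{2, 2}} $. $ {F_{2, 1}} $ extracts the detail features. Similarly to $ {F_{1, 1}} $, $ {\mathit{\boldsymbol{W}}_{2, 1}} $ consists of $ {n_{2, 1}} $ filters of size $ c \times {f_{2, 1}} \times {f_{2, 1}} $, and $ {\mathit{\boldsymbol{B}}_{2, 1}} $ is an $ {n_{2, 1}} $-dimensional vector. $ {F_{2, 1}} $ yields features different from those obtained by $ {F_{1, 1}} $. One more convolutional layer is then applied to the features extracted by $ {F_{2, 1}} $ to obtain detail features closer to those of the high-resolution image. The process can be formalized as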

$ {F_{2, 1}} = \max \left( {0, {\mathit{\boldsymbol{W}}_{2, 1}} \otimes \mathit{\boldsymbol{Y}} + {\mathit{\boldsymbol{B}}_{2, 1}}} \right) $

(5)

$ {F_{2, 2}} = \max \left( {0, {\mathit{\boldsymbol{W}}_{2, 2}} \otimes {F_{2, 1}}\left( \mathit{\boldsymbol{Y}} \right) + {\mathit{\boldsymbol{B}}_{2, 2}}} \right) $

(6)

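where $ {\mathit{\boldsymbol{W}}_{2, 2}} $ consists of $ {n_{2, 2}} $ filters, each of size $ {n_{2, 1}} \times {f_{2, 2}} \times {f_{2, 2}} $, and $ {\mathit{\boldsymbol{B}}_{2, 2}} $ is an $ {n_{2, 2}} $-dimensional vector. The $ {F_2} $ network thus produces image features different from those of the $ {F_1} $ network. These features, together with those obtained by $ {F_1} $, are passed through the reconstruction network to produce the final high-resolution image.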

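2.3 Reconstruction network

In traditional methods, the final image is obtained as a weighted average of feature maps; neighbor embedding, for example, uses such a weighted average^[19]. The $ {F_1} $ and $ {F_2} $ networks produce the high-resolution image features and the supplementary detail features of an image, denoted $ {{F_1}\left( \mathit{\boldsymbol{Y}} \right)} $ and $ {{F_2}\left( \mathit{\boldsymbol{Y}} \right)} $. Let $ {\mathit{\boldsymbol{\tilde Y}}} $ denote $ \left\{ {{F_1}\left( \mathit{\boldsymbol{Y}} \right), {F_2}\left( \mathit{\boldsymbol{Y}} \right)} \right\} $, that is, all the features obtained by the two networks. Following classical models such as SRCNN, we use one convolutional layer to implement the weighted averaging and call it the reconstruction network $ {F_3} $. The layer is modeled as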

$ {F_3} = {\mathit{\boldsymbol{W}}_3} \otimes \mathit{\boldsymbol{\tilde Y}} + {\mathit{\boldsymbol{B}}_3} $

(7)

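where $ {\mathit{\boldsymbol{W}}_3} $ is a set of convolution kernels of size $ \left( {{n_{2, 2}} + {n_{1, 3}}} \right) \times {f_3} \times {f_3} \times c $ and $ {\mathit{\boldsymbol{B}}_3} $ is a $c$-dimensional vector. Here $c$ is the number of channels of the final reconstructed image. This last layer works in essentially the same way as the reconstruction layer of SRCNN^[10]; likewise, $ {\mathit{\boldsymbol{W}}_3} $ is a set of linear filters.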
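The three parts defined by Eqs. (1) and (3)-(7) can be sketched as a forward pass in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the filter sizes and counts are the ones reported in Section 3.2, the weights are random, the layers use unpadded ("valid") convolutions, and the $F_2$ feature maps are center-cropped to the size of the $F_1$ feature maps before concatenation. The paper does not state how the two feature maps are aligned, so the padding and cropping choices, as well as all function names, are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, w, b):
    """Unpadded convolution layer in the deep-learning sense (cross-correlation).
    x: (C, H, W), w: (N, C, f, f), b: (N,) -> (N, H-f+1, W-f+1)."""
    f = w.shape[-1]
    windows = sliding_window_view(x, (f, f), axis=(1, 2))  # (C, H', W', f, f)
    out = np.einsum("chwij,ncij->nhw", windows, w, optimize=True)
    return out + b[:, None, None]

def relu(x):
    return np.maximum(0.0, x)

def init_layer(rng, n_out, n_in, f):
    # Small random weights and zero biases; the initialization is an assumption.
    return rng.normal(0.0, 1e-3, (n_out, n_in, f, f)), np.zeros(n_out)

def center_crop(x, h, w):
    top = (x.shape[1] - h) // 2
    left = (x.shape[2] - w) // 2
    return x[:, top:top + h, left:left + w]

def forward(params, y_small, y_large):
    f1 = y_small
    for w, b in params["F1"]:          # Eqs. (1), (3), (4)
        f1 = relu(conv2d(f1, w, b))
    f2 = y_large
    for w, b in params["F2"]:          # Eqs. (5), (6)
        f2 = relu(conv2d(f2, w, b))
    # Assumption: align the two feature maps by center-cropping F2's output.
    f2 = center_crop(f2, f1.shape[1], f1.shape[2])
    y_tilde = np.concatenate([f1, f2], axis=0)
    w3, b3 = params["F3"]
    return conv2d(y_tilde, w3, b3)     # Eq. (7), no activation

rng = np.random.default_rng(0)
params = {
    "F1": [init_layer(rng, 128, 1, 9), init_layer(rng, 64, 128, 3),
           init_layer(rng, 32, 64, 3)],
    "F2": [init_layer(rng, 64, 1, 11), init_layer(rng, 32, 64, 5)],
    "F3": init_layer(rng, 1, 32 + 32, 5),
}
y_small = rng.random((1, 33, 33))   # input block of F1
y_large = rng.random((1, 39, 39))   # larger block around it for F2
out = forward(params, y_small, y_large)
print(out.shape)
```

With unpadded convolutions the output is smaller than the input block; an implementation that pads each layer would keep the spatial size instead.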

2.4 Training

Given a training set $ \left\{ {{\mathit{\boldsymbol{Y}}^i}, {\mathit{\boldsymbol{X}}^i}} \right\}_{i = 1}^N $, where $ {\mathit{\boldsymbol{Y}}^i} $ is a low-resolution image, $ {\mathit{\boldsymbol{X}}^i} $ is the corresponding ground-truth image, and $N$ is the number of samples in the dataset, the goal of training is to find an optimal model $F$ whose parameters $ \mathit{\boldsymbol{\theta }} = \left\{ {{\mathit{\boldsymbol{W}}_{1, 1}}, \cdots, {\mathit{\boldsymbol{W}}_3}, {\mathit{\boldsymbol{B}}_{1, 1}}, \cdots, {\mathit{\boldsymbol{B}}_3}} \right\} $ minimize the error between $ F\left( {{\mathit{\boldsymbol{Y}}^i}, \mathit{\boldsymbol{\theta }}} \right) $ and the ground-truth image $ {\mathit{\boldsymbol{X}}^i} $. To this end, we use the mean squared error (MSE) as the loss function of our network model

$ L\left( \mathit{\boldsymbol{\theta }} \right) = \frac{1}{{2N}}\sum\limits_{i = 1}^N {{{\left\| {F\left( {{\mathit{\boldsymbol{Y}}^i}, \mathit{\boldsymbol{\theta }}} \right) - {\mathit{\boldsymbol{X}}^i}} \right\|}^2}} $

(8)

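Using ${\rm{MSE}}$ as the loss function favors a high peak signal-to-noise ratio (${\rm{PSNR}}$). A higher ${\rm{PSNR}}$ usually means a higher-quality reconstruction with a smaller error relative to the ground-truth image. ${\rm{PSNR}}$ is defined as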

$ {\rm{PSNR}} = 10 \times \lg \left( {\frac{{{{\left( {{2^n} - 1} \right)}^2}}}{{{\rm{MSE}}}}} \right) $

(9)

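where $n$ is the number of bits per pixel, so that $ {2^n} - 1 $ is the maximum pixel value. It follows that ${\rm{PSNR}}$ increases as ${\rm{MSE}}$ decreases.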
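Eqs. (8) and (9) can be written out as a short NumPy sketch. The function names and the `bits` parameter are illustrative choices made here; the loss follows Eq. (8) with its factor of 1/(2N), while the PSNR uses the per-pixel mean squared error, which is the usual convention.

```python
import numpy as np

def mse_loss(pred, target):
    """Eq. (8): L = 1/(2N) * sum_i ||F(Y^i) - X^i||^2 over N samples.
    pred, target: arrays of shape (N, ...)."""
    n = pred.shape[0]
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return np.sum(diff ** 2) / (2.0 * n)

def psnr(pred, target, bits=8):
    """Eq. (9): PSNR = 10 * log10((2^n - 1)^2 / MSE), MSE averaged per pixel."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    peak = 2 ** bits - 1
    return 10.0 * np.log10(peak ** 2 / mse)

x = np.full((4, 4), 100, dtype=np.uint8)
y = x + 1  # every pixel is off by one gray level
print(psnr(y, x))
```

Because the logarithm is monotonic, halving the MSE always raises the PSNR by the same amount, about 3 dB.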

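3 Experiments and results

This section describes in detail how the experimental data were obtained, how the parameters were chosen, and how the model was trained. It then quantitatively compares the results of the proposed method with those of other representative methods.

3.1 Datasets

In deep learning, the performance of the trained model depends strongly on the dataset used. In general, the more samples the training set contains, the closer the final network model comes to the optimal model. The authors of SRCNN^[10] trained on 395 909 images selected from the ImageNet^[33] dataset, whereas we train our network on the BSD500^[34] dataset and use Set5^[36] and Set14^[36] as test sets. Some of the test images are shown in Fig. 2.

Fig. 2 Some of the test images used for effectiveness validation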

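3.2 Training process

The low-resolution image is first converted from the RGB color space to the YCrCb color space. Human vision is more sensitive to luminance than to color^[37], and SRCNN has shown experimentally that learning the mapping on the ${\rm{Y}}$ channel alone does not degrade the final image quality^[10]. We therefore train only on the ${\rm{Y}}$ channel, that is, the luminance channel, to obtain the mapping, and upsample the two chroma channels by bicubic interpolation. This reduces the computational cost while maintaining image quality.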
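The color conversion step can be sketched as follows. The paper does not say which YCbCr variant it uses, so this example assumes the full-range ITU-R BT.601 form used by JPEG/JFIF; other variants (for example the limited-range form) differ in scale and offset. The function name is hypothetical.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range BT.601 (JPEG/JFIF) conversion, assumed here for illustration.
    rgb: (H, W, 3) array with values in [0, 255]; returns (H, W, 3) as Y, Cb, Cr."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

# One white pixel and one pure red pixel.
img = np.array([[[255, 255, 255], [255, 0, 0]]], dtype=np.uint8)
ycc = rgb_to_ycbcr(img)
y_channel = ycc[..., 0]  # the only channel passed to the network
print(y_channel)
```

Only `y_channel` would be fed to the network; the Cb and Cr planes would be upscaled separately by bicubic interpolation and recombined with the reconstructed Y channel.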

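Because the mutual-detail network extracts features with two networks, $ {F_1} $ and $ {F_2} $, the two networks take inputs of different sizes when the training samples are split into image patches: the input of $ {F_2} $ contains the context of the input of $ {F_1} $. The training patches for $ {F_1} $ are 33×33 pixels and those for $ {F_2} $ are 39×39 pixels. Both are extracted with a sliding stride of 14 pixels so that the two patches for the same target share the same center coordinates, which ensures that the two input sub-images cover the same region. To obtain more training data, we rotate each of the 500 training images by 0°, 90°, and 270°. In total, this yields 6.4×10⁵ pairs of training patches.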
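The paired patch extraction described here can be sketched as below. The patch sizes, the stride, and the rotation angles come from the text; the function names, the border handling, and the toy image are assumptions made for illustration, and the extraction of the matching ground-truth patches is not shown.

```python
import numpy as np

F1_SIZE, F2_SIZE, STRIDE = 33, 39, 14
MARGIN = (F2_SIZE - F1_SIZE) // 2  # 3 pixels of context on each side

def augment(img):
    """Return the image rotated by 0, 90, and 270 degrees."""
    return [img, np.rot90(img, 1), np.rot90(img, 3)]

def extract_pairs(img):
    """Yield (F1 patch, F2 patch) pairs that share the same center pixel."""
    h, w = img.shape
    for top in range(0, h - F2_SIZE + 1, STRIDE):
        for left in range(0, w - F2_SIZE + 1, STRIDE):
            large = img[top:top + F2_SIZE, left:left + F2_SIZE]
            # The 33x33 patch is the center of the 39x39 patch.
            small = large[MARGIN:MARGIN + F1_SIZE, MARGIN:MARGIN + F1_SIZE]
            yield small, large

# Toy single-channel image standing in for an upscaled Y channel.
img = np.arange(81 * 67, dtype=np.float64).reshape(81, 67)
pairs = [p for view in augment(img) for p in extract_pairs(view)]
print(len(pairs))
```

Slicing the small patch out of the large one guarantees that the two inputs stay aligned, whatever stride is used.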

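The filters in $ {F_{1, 1}} $ and $ {F_{2, 1}} $ have sizes $ c \times {f_{1, 1}} \times {f_{1, 1}} \times {n_{1, 1}} $ and $ c \times {f_{2, 1}} \times {f_{2, 1}} \times {n_{2, 1}} $, respectively. Because our method first converts the image from the RGB color model to the YCrCb color model and transforms only the ${\rm{Y}}$ channel, while the other two channels are upsampled by bicubic interpolation, $c$ is set to 1. The filter sizes and numbers of the layers are set in the experiments to $ {f_{1, 1}} = 9 $, $ {f_{1, 2}} = 3 $, $ {f_{1, 3}} = 3 $, $ {f_{2, 1}} = 11 $, $ {f_{2, 2}} = 5 $, $ {f_3} = 5 $, $ {n_{1, 1}} = 128 $, $ {n_{1, 2}} = 64 $, $ {n_{1, 3}} = 32 $, $ {n_{2, 1}} = 64 $, $ {n_{2, 2}} = 32 $, and $ {n_3} = 1 $. The parameters of $ {F_1} $ follow those of SRCNN. Because $ {F_2} $ has only two layers, its filters are larger than those of $ {F_1} $ to obtain a larger receptive field, and the number of filters decreases from layer to layer so that the image features are integrated progressively. The learning rate of the network is set to 0.001. Training runs on an NVIDIA GTX 1080 GPU and an Intel Core i7-4790 CPU for a total of 1.5×10⁷ iterations.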
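The hyperparameters listed above determine the number of trainable parameters of each layer. The short sketch below works out that arithmetic; the resulting counts are derived here from the reported settings and are not stated in the paper.

```python
# (layer, input channels, filter size, number of filters), from the settings above
layers = [
    ("F1,1", 1, 9, 128),
    ("F1,2", 128, 3, 64),
    ("F1,3", 64, 3, 32),
    ("F2,1", 1, 11, 64),
    ("F2,2", 64, 5, 32),
    ("F3", 32 + 32, 5, 1),   # input is the concatenation of F1 and F2 features
]

def count_params(layers):
    """Sum weights (c_in * f * f * n) and biases (n) over all layers."""
    total = 0
    for name, c_in, f, n in layers:
        weights = c_in * f * f * n
        biases = n
        print(f"{name}: {weights} weights + {biases} biases")
        total += weights + biases
    return total

total = count_params(layers)
print("total:", total)
```

Most of the parameters sit in the second layers of $F_1$ and $F_2$, where both the input and output channel counts are large.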

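3.3 Experimental analysis and comparison

We demonstrate the effectiveness of the proposed method and compare it with other mainstream methods from both subjective and objective perspectives.

Fig. 3 compares the ground-truth images with the images restored by our method at $r$ =3. Fig. 4 shows the log-gradient histograms of the ground-truth image and of the high-resolution image restored by our method, which objectively compare the gradients of the two. Visually, the edge details produced by our method are essentially the same as those of the ground-truth image. The log-gradient histograms show, however, that our method loses some high-frequency information in smooth regions, which blurs much of the image detail. The reason is that downsampling discards too much of the high-frequency information and detail in smooth regions, and the network is too shallow to recover them. Fig. 5 shows how the performance of the model varies with the number of iterations. The performance improves steadily as the number of iterations grows. Convergence slows at 1.3×10⁷ iterations, and the network has essentially converged at 1.5×10⁷ iterations. By comparison, SRCNN trained on ImageNet needs 1.2×10⁹ iterations to reach its best performance, and our model trained for 1.5×10⁷ iterations outperforms SRCNN on the test sets.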

Fig. 3 Global and local comparison of the original images and the images restored by our method at $r$ =3

((a) original images; (b) the result of our method)

Fig. 4 Comparison of the original image and the result of our method in subjective visual quality and in log-gradient histograms at $r$ =3

((a) original image; (b) the result of our method)

Fig. 5 ${\rm{PSNR}}$ and SSIM curves of our method on the Set5 dataset as the number of iterations increases, at $r$ =3

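To verify the quality of our results, we carry out qualitative and quantitative comparisons. We compare the ${\rm{PSNR}}$ of Bicubic, SC^[20], KSVD^[36], NE+NNLS^[35], NE+LLE^[19], ANR^[22], and our method on Set5 and Set14 at $r$ =3. As Tables 1 and 2 show, our method restores most individual images with higher quality than these representative methods. On the whole datasets, we compare the average ${\rm{PSNR}}$ and SSIM of Bicubic, A+^[23], RFL^[38], SelfEx^[26], SRCNN^[10], and our method; see Table 3. The results show that our method outperforms the other methods on both image quality measures.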

Table 1 ${\rm{PSNR}}$ (dB) of our method and other methods on individual images of the Set5 dataset with upscaling factor 3

Set5	Scale	Bicubic	SC^[20]	KSVD^[36]	NE+NNLS^[35]	NE+LLE^[19]	ANR^[22]	Ours
baby	3	33.91	34.29	35.08	34.77	35.06	35.13	35.08
bird	3	32.58	34.11	34.57	34.26	34.56	34.60	35.45
butterfly	3	24.04	25.58	25.94	25.61	25.75	25.90	28.92
head	3	32.88	33.17	33.56	33.45	33.60	33.63	33.71
woman	3	28.56	29.94	30.37	29.89	30.22	30.33	31.41
Average	3	30.39	31.42	31.90	31.60	31.84	31.92	32.92
Note: bold numbers mark the best result and italic numbers the second best.

Table 2 ${\rm{PSNR}}$ (dB) of our method and other methods on individual images of the Set14 dataset with upscaling factor 3

Set14	Scale	Bicubic	SC^[20]	KSVD^[36]	NE+NNLS^[35]	NE+LLE^[19]	ANR^[22]	Ours
baboon	3	23.21	23.47	23.52	23.49	23.55	23.56	23.68
barbara	3	26.25	26.39	26.76	26.67	26.74	26.69	26.48
bridge	3	24.40	24.82	25.02	24.86	24.98	25.01	25.25
coastguard	3	26.55	27.02	27.15	27.00	27.07	27.08	27.33
comic	3	23.12	23.90	23.96	23.83	23.98	24.04	24.73
face	3	32.82	33.11	33.53	33.45	33.56	33.62	33.73
flowers	3	27.23	28.25	28.42	28.21	28.30	28.49	29.49
foreman	3	31.18	32.64	33.19	32.87	33.21	33.21	34.06
Lena	3	31.68	32.64	33.00	32.82	33.01	33.08	33.69
man	3	27.01	27.76	27.90	27.20	27.87	27.92	28.46
monarch	3	29.43	30.71	31.10	30.76	30.95	31.09	33.63
pepper	3	32.39	33.32	34.07	33.56	33.80	33.82	34.70
ppt3	3	23.71	24.98	25.23	24.81	24.94	25.03	27.06
zebra	3	26.63	27.95	28.49	28.12	28.31	28.43	28.83
Average	3	27.54	28.31	28.67	28.44	28.60	28.65	29.37
Note: bold numbers mark the best result and italic numbers the second best.

Table 3 Average ${\rm{PSNR}}$ (dB)/SSIM of our method and other methods on the Set5 and Set14 datasets with upscaling factor 3

Dataset	Scale	Bicubic	A+^[23]	SRCNN^[10]	RFL^[38]	SelfEx^[26]	Ours
Set5	3	30.39/0.868	32.58/0.909	32.75/0.909	32.43/0.906	32.58/0.909	32.92/0.912
Set14	3	27.55/0.774	29.13/0.819	29.28/0.821	29.05/0.816	29.16/0.820	29.36/0.824
Note: bold numbers mark the best result and italic numbers the second best.

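To verify the effectiveness of the detail network, we compare the ${\rm{PSNR}}$ on the Set5 dataset with and without the $ {F_2} $ detail network; see Table 4. The results show that the model with the detail network restores better high-resolution images than the model without it.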

Table 4 ${\rm{PSNR}}$ (dB) on the Set5 dataset of the $ {F_1} $-Net, which has no detail network, and of our method

Set5	Scale	$ {F_1} $-Net	Ours
baby	3	35.03	35.08
bird	3	35.39	35.45
butterfly	3	28.81	28.92
head	3	33.70	33.71
woman	3	31.27	31.41
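The per-image gain from adding the detail network can be read off Table 4 with a few lines of arithmetic. The values below are copied from the table; the gains and their mean are computed here and are not reported in the paper.

```python
# PSNR values (dB) copied from Table 4.
f1_net = {"baby": 35.03, "bird": 35.39, "butterfly": 28.81, "head": 33.70, "woman": 31.27}
ours = {"baby": 35.08, "bird": 35.45, "butterfly": 28.92, "head": 33.71, "woman": 31.41}

# Gain of the full model over the F1-only network, per image.
gains = {name: round(ours[name] - f1_net[name], 2) for name in f1_net}
mean_gain = sum(ours[name] - f1_net[name] for name in f1_net) / len(f1_net)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB")
print(f"mean gain: {mean_gain:.3f} dB")
```

The gain is positive for every image and is largest for the two images with the most texture and edge content in this table, butterfly and woman.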

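The experimental results show that our method performs very well under both the ${\rm{PSNR}}$ and SSIM measures compared with Bicubic, SC^[20], KSVD^[36], NE+LLE^[19], NE+NNLS^[35], ANR^[22], A+^[23], RFL^[38], SelfEx^[26], and SRCNN^[10]. In particular, for images rich in details and textures, as shown in Figs. 3 and 6, the images restored by our network have sharper edges and textures, whereas the other methods blur edges and fail to fully recover textures. Compared with SRCNN with the 9-5-5 structure, our method reconstructs high-resolution images with higher ${\rm{PSNR}}$ and SSIM, even though it uses far fewer iterations and a far smaller dataset.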

Fig. 6 Comparison of several methods and their ${\rm{PSNR}}$ values at $r$ =3

((a) original images; (b) Bicubic; (c) SC method; (d) NE+LLE; (e) KSVD; (f) NE+NNLS; (g) ANR; (h) A+; (i) SRCNN; (j) ours)

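4 Conclusion

This paper proposes a mutual-detail convolutional neural network model that reconstructs low-resolution images by supplementing details. It takes two inputs of different scales and uses filters of different sizes to obtain multiple image features, combining low-dimensional and high-dimensional features to supplement the details of the restored image. The experimental results show that, with few samples and relatively few iterations, our method effectively restores image details and textures, and obtains higher-quality high-resolution images than methods such as SRCNN^[10]. The method also has several limitations. Its results still fall short of those of very deep neural networks such as that of Ref. [12], and the current form of detail complementation is relatively simple, so the way complementary detail features are fused needs further study. In addition, although the number of iterations is lower than that of SRCNN and similar methods, it is still fairly high and should be reduced in future work. Nevertheless, at a comparable network depth, our method performs well. Some problems with optimizing the network also remained during the experiments. In future work, we will consider using deeper network models to obtain image features at different levels and thus restore image details better, adopting better optimization methods to speed up training and improve its quality, and improving reconstruction in regions with dense texture details.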

References

[1] Van Ouwerkerk J D. Image super-resolution survey[J]. Image and Vision Computing, 2006, 24(10): 1039–1052. [DOI:10.1016/j.imavis.2006.02.026]

[2] Irani M, Peleg S. Improving resolution by image registration[J]. CVGIP:Graphical Models and Image Processing, 1991, 53(3): 231–239. [DOI:10.1016/1049-9652(91)90045-L]

[3] Tian J, Ma K K. A survey on super-resolution imaging[J]. Signal, Image and Video Processing, 2011, 5(3): 329–342. [DOI:10.1007/s11760-010-0204-6]

[4] Li X, Orchard M T. New edge-directed interpolation[J]. IEEE Transactions on Image Processing, 2001, 10(10): 1521–1527. [DOI:10.1109/83.951537]

[5] Leu J G. Image enlargement based on a step edge model[J]. Pattern Recognition, 2000, 33(12): 2055–2073. [DOI:10.1016/S0031-3203(99)00184-3]

[6] Cha Y, Kim S. Edge-forming methods for color image zooming[J]. IEEE Transactions on Image Processing, 2006, 15(8): 2315–2323. [DOI:10.1109/TIP.2006.875182]

[7] Tai Y W, Liu S C, Brown M S, et al. Super resolution using edge prior and single image detail synthesis[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 2400-2407. [DOI:10.1109/CVPR.2010.5539933]

[8] Zhang K B, Gao X B, Tao D C, et al. Single image super-resolution with multiscale similarity learning[J]. IEEE Transactions on Neural Networks and Learning Systems, 2013, 24(10): 1648–1659. [DOI:10.1109/TNNLS.2013.2262001]

[9] Sun J, Xu Z B, Shum H Y. Image super-resolution using gradient profile prior[C]//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE, 2008: 1-8. [DOI:10.1109/CVPR.2008.4587659]

[10] Dong C, Loy C C, He K M, et al. Image super-resolution using deep convolutional networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(2): 295–307. [DOI:10.1109/TPAMI.2015.2439281]

[11] Ledig C, Theis L, Huszar F, et al. Photo-realistic single image super-resolution using a generative adversarial network[J]. Computer Vision and Pattern Recognition, 2016: 4681–4690. [DOI:10.1109/CVPR.2017.19]

[12] Kim J, Lee J K, Lee K M. Accurate image super-resolution using very deep convolutional networks[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 1646-1654. [DOI:10.1109/CVPR.2016.182]

[13] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778. [DOI:10.1109/CVPR.2016.90]

[14] Freeman W T, Jones T R, Pasztor E C. Example-based super-resolution[J]. IEEE Computer Graphics and Applications, 2002, 22(2): 56–65. [DOI:10.1109/38.988747]

[15] Kim J, Lee J K, Lee K M. Accurate image super-resolution using very deep convolutional networks[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 1646-1654. [DOI:10.1109/CVPR.2016.182]

[16] Wang Z W, Liu D, Yang J C, et al. Deep networks for image super-resolution with sparse prior[J]. International Conference on Computer Vision, 2015: 370–378. [DOI:10.1109/ICCV.2015.50]

[17] Wang Q, Tang X O, Shum H. Patch based blind image super resolution[C]//Proceedings of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE, 2005: 709-716. [DOI:10.1109/ICCV.2005.186]

[18] Lin Z C, Shum H Y. Fundamental limits of reconstruction-based super-resolution algorithms under local translation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(1): 83–97. [DOI:10.1109/TPAMI.2004.1261081]

[19] Chang H, Yeung D Y, Xiong Y M. Super-resolution through neighbor embedding[C]//Proceedings of 2004 Computer Society Conference on Computer Vision and Pattern Recognition. Washington DC, USA: IEEE, 2004: I-275-I-282. [DOI:10.1109/CVPR.2004.1315043]

[20] Yang J C, Wright J, Huang T S, et al. Image super-resolution via sparse representation[J]. IEEE Transactions on Image Processing, 2010, 19(11): 2861–2873. [DOI:10.1109/TIP.2010.2050625]

[21] Dong W S, Zhang L, Shi G M, et al. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization[J]. IEEE Transactions on Image Processing, 2011, 20(7): 1838–1857. [DOI:10.1109/TIP.2011.2108306]

[22] Timofte R, De Smet V, Van Gool L. Anchored neighborhood regression for fast example-based super-resolution[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013: 1920-1927. [DOI:10.1109/ICCV.2013.241]

[23] Timofte R, De Smet V, Van Gool L. A+: adjusted anchored neighborhood regression for fast super-resolution[C]//Computer Vision——ACCV 2014. Cham: Springer, 2015: 111-126. [DOI:10.1007/978-3-319-16817-3_8]

[24] Glasner D, Bagon S, Irani M. Super-resolution from a single image[C]//Proceedings of the IEEE 12th International Conference on Computer Vision. Kyoto, Japan: IEEE, 2009: 349-356. [DOI:10.1109/ICCV.2009.5459271]

[25] Freedman G, Fattal R. Image and video upscaling from local self-examples[J]. ACM Transactions on Graphics, 2011, 30(2): 12. [DOI:10.1145/1944846.1944852]

[26] Huang J B, Singh A, Ahuja N. Single image super-resolution from transformed self-exemplars[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 5197-5206. [DOI:10.1109/CVPR.2015.7299156]

[27] Kim J, Lee J K, Lee K M. Deeply-recursive convolutional network for image super-resolution[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 1637-1645. [DOI:10.1109/CVPR.2016.181]

[28] Keys R. Cubic convolution interpolation for digital image processing[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981, 29(6): 1153–1160. [DOI:10.1109/TASSP.1981.1163711]

[29] Bertasius G, Shi J B, Torresani L. DeepEdge: a multi-scale bifurcated deep network for top-down contour detection[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 4380-4389. [DOI:10.1109/CVPR.2015.7299067]

[30] Nair V, Hinton G E. Rectified linear units improve restricted boltzmann machines[C]//Proceedings of the 27th International Conference on Machine Learning. Haifa, Israel: ACM, 2010: 807-814.

[31] Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks[C]//Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. Fort Lauderdale, USA: [s. n. ], 2011: 315-323.

[32] Suetake N, Sakano M, Uchino E. Image super-resolution based on local self-similarity[J]. Optical Review, 2008, 15(1): 26–30. [DOI:10.1007/s10043-008-0005-0]

[33] Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009: 248-255. [DOI:10.1109/CVPR.2009.5206848]

[34] Sugano Y, Matsushita Y, Sato Y, et al. Graph-based joint clustering of fixations and visual entities[J]. ACM Transactions on Applied Perception, 2013, 10(2): 10.

[35] Bevilacqua M, Roumy A, Guillemot C, et al. Low-complexity single-image super-resolution based on nonnegative neighbor embedding[C]//Proceedings of British Machine Vision Conference. Surrey, UK: BMVC, 2012.

[36] Zeyde R, Elad M, Protter M. On single image scale-up using sparse-representations[C]//Proceedings of the 7th International Conference on Curves and Surfaces. Avignon, France: Springer-Verlag, 2010: 711-730. [DOI:10.1007/978-3-642-27413-8_47]

[37] Huang J B, Singh A, Ahuja N. Single image super-resolution from transformed self-exemplars[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 5197-5206. [DOI:10.1109/CVPR.2015.7299156]

[38] Schulter S, Leistner C, Bischof H. Fast and accurate image upscaling with super-resolution forests[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3791-3799. [DOI:10.1109/CVPR.2015.7299003]