Published: 2020-08-16
DOI: 10.11834/jig.190614
2020 | Volume 25 | Number 8

Image Understanding and Computer Vision

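Multi-focus image fusion with a self-learning fusion rule

Liu Ziwen¹, Luo Xiaoqing¹, Zhang Zhancheng²

1. School of Internet of Things, Jiangnan University, Wuxi 214122, China;

2. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China

Received: 2019-12-16; Revised: 2020-02-23; Preprint: 2020-03-02

Supported by: National Natural Science Foundation of China (61772237); Six Talent Peaks Project of Jiangsu Province (XYDXX-030)

First author: Liu Ziwen, born in 1994, male, master's student; his main research interest is image fusion. E-mail: 1186924365@qq.com;
Luo Xiaoqing, female, associate professor; her main research interests are image fusion and computer vision. E-mail: xqluo@jiangnan.edu.cn.

Chinese Library Classification: TP391.4

Document code: A

Article ID: 1006-8961(2020)08-1637-12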

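Abstract

Objective Deep-learning-based multi-focus image fusion methods mainly use a convolutional neural network (CNN) to classify pixels as focused or defocused. The supervised learning process usually relies on synthetic datasets, and the accuracy of the label data directly affects the classification accuracy, which in turn affects the accuracy of the subsequent handcrafted fusion rules and the quality of the fused all-in-focus image. To let the fusion network adjust its fusion rule adaptively, a multi-focus image fusion algorithm based on a self-learned fusion rule is proposed. Method An autoencoder architecture is adopted to extract features while learning the fusion rule and the reconstruction rule at the same time, which yields an unsupervised end-to-end fusion network. The initial decision map of the multi-focus images is fed in as a prior so that the network learns rich image details. A local strategy comprising the structural similarity index measure (SSIM) and the mean squared error (MSE) is added to the loss function to ensure more accurate image restoration. Result The model is evaluated subjectively and objectively on public datasets such as Lytro to verify the soundness of the fusion algorithm design. Subjectively, the model fuses the focused regions well, effectively avoids artifacts in the fused image, and retains sufficient detail, giving a natural and clear visual effect. Objectively, in a quantitative comparison between the images fused by the model and those of other mainstream multi-focus image fusion algorithms, the average values of entropy, Q_w, correlation coefficient, and visual information fidelity are all the best, at 7.457 4, 0.917 7, 0.978 8, and 0.890 8, respectively. Conclusion A fusion algorithm for multi-focus images is proposed that can learn and adjust its fusion rule by itself and produces fused images comparable to those of existing methods, which helps to further understand the mechanism of deep-learning-based multi-focus image fusion.

Key words

multi-focus image fusion; autoencoder; self-learning; end-to-end; structural similarity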

Multi-focus image fusion with a self-learning fusion rule

Liu Ziwen¹, Luo Xiaoqing¹, Zhang Zhancheng²

1. School of Internet of Things, Jiangnan University, Wuxi 214122, China;

2. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China

Supported by: National Natural Science Foundation of China (61772237)

Abstract

Objective Existing deep-learning-based multi-focus image fusion approaches treat a convolutional neural network (CNN) as a classifier. They use the CNN to classify pixels as focused or defocused and design fusion rules according to the classified pixels. The expected all-in-focus image therefore depends mainly on handcrafted labels and fusion rules and is constructed from the learned feature maps, and training is driven by pixel labels. However, manually labeling pixels as focused or defocused is arduous and may lead to inaccurate focus prediction. Existing multi-focus datasets are constructed by adding Gaussian blur to parts of all-in-focus images, which makes the training data unrealistic. To solve these issues and enable the CNN to adjust fusion rules adaptively, a novel multi-focus image fusion algorithm based on self-learned fusion rules is proposed.
Method Autoencoders are unsupervised learning networks whose hidden layer can be regarded as a feature representation of the input samples. Multi-focus images are usually captured from the same scene and therefore contain common scene information and private focus information, so the paired images should be encoded in common and private feature spaces, respectively. This study uses joint convolutional autoencoders (JCAEs) to learn structured features. JCAEs consist of common and private branches. The common branches share weights to obtain the common encoding features of the input images, and the private branches acquire the private encoding features. A fusion layer with a concatenation operation is designed to obtain a self-learned fusion rule and to constrain the entire fusion network to work end to end. The initial focus map is used as a prior input so that the network can learn precise details. Current deep-learning-based multi-focus image fusion algorithms train networks by applying data augmentation to datasets and use various techniques to tune the networks. The design of the fusion rule is important. Fusion rules generally comprise direct concatenation fusion and pixel-level fusion. Concatenation fusion stacks multiple inputs and blends them in the next convolutional layer, which helps the network obtain rich image features. Pixel-level fusion rules include the maximum, sum, and mean rules, which can be selected according to the characteristics of the dataset. The mean rule is introduced on top of concatenation fusion so that the network can adjust the fusion rule autonomously during training. The fusion rules of the JCAEs are discussed quantitatively and qualitatively to identify how they work. Image entropy is used to represent the amount of information contained in the aggregated grayscale distribution of an image. The fusion rules are justified by calculating the information retained by the feature maps in the fusion layer. A pair of multi-focus images is fed into the network, and the feature maps of the fusion-layer convolution are output as images. The fusion rules can be interpreted visually by comparing the information content of the images with the learned weight values. Instead of using a basic loss function to train the CNN, the model adds a local strategy to the loss function, comprising the structural similarity index measure and the mean squared error. This strategy effectively drives the fusion unit to learn pixel-wise features and ensures accurate image restoration. More accurate and abstract features can be obtained when source images pass through deep rather than shallow networks. However, problems such as vanishing gradients and long convergence times occur in the back-propagation stage of deep networks. A residual network skips a few layers through skip connections, or shortcuts, and can learn residual images more easily than the original input image. Therefore, the short connection strategy is used to improve the feature learning ability of the JCAEs.
Result The model is trained with the Keras framework on top of TensorFlow. It is tested on the Lytro dataset, which is widely used in multi-focus image fusion research, and compared subjectively and objectively with existing multi-focus fusion algorithms to verify the performance of the proposed method. Key areas, such as the boundary between focused and defocused regions in the fused image, are magnified to show the differences between fused images in detail. Subjectively, the model fuses the focused areas effectively and avoids artifacts in the fused image. Detailed information is preserved, so the visual effect is natural and clear. Objectively, a comparison of the images fused by the model with those of other mainstream multi-focus image fusion algorithms shows that the average values of entropy, Q_w, correlation coefficient, and visual information fidelity are the best, at 7.457 4, 0.917 7, 0.978 8, and 0.890 8, respectively.
Conclusion Most deep-learning-based multi-focus image fusion methods follow one pattern: a CNN classifies pixels as focused or defocused, fusion rules are designed manually according to the classified pixels, and a fusion operation is conducted in the original spatial domain or on the learned feature maps to obtain the all-in-focus image. This pipeline ignores considerable useful information in the intermediate layers and relies heavily on labeled data. To solve these problems, this study proposes a multi-focus image fusion algorithm in a self-learning style. A fusion layer is designed based on JCAEs. The network structure, the design of the loss function, and a way to embed pixel-wise prior knowledge are discussed. In this way, the network can output vivid fused images. A geometric interpretation of the learnable fusion operation is also provided at the quantitative and qualitative levels. The experiments demonstrate that the model is reasonable and effective: it not only achieves self-learning of fusion rules but also performs well in subjective visual perception and on objective evaluation metrics. This work offers a new idea for the fusion of multi-focus images, which will help to further understand the mechanism of deep-learning-based multi-focus image fusion and motivates the development of interpretable image fusion methods with popular neural networks.

Key words

multi-focus image fusion; auto-encoders; self-learning; end-to-end; structural similarity

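0 Introduction

With a limited depth of field, the focal plane of a camera cannot capture a globally sharp image of targets throughout a deep scene, so defocus and blur readily occur. Multi-focus image fusion merges several images of the same scene taken at different focus positions into one all-in-focus image that carries more information (Zhao et al., 2015). According to the fusion strategy, current multi-focus image fusion algorithms can be divided into transform-domain methods (Yang et al., 2014), spatial-domain methods (Zhao et al., 2013), and deep-learning-based methods (Zhang et al., 2020).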

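Transform-domain methods generally decompose the source images into multi-level coefficient representations with a decomposition tool, design a fusion rule for each level according to the characteristics of its coefficients, and finally apply the inverse multi-scale transform to the fused coefficients to obtain the fused image. The design of the transform tool and the design of the fusion rules both strongly affect fusion performance in these methods. Common transform tools include the curvelet transform (CVT) (Mahyari and Yazdi, 2009), the non-subsampled contourlet transform (NSCT) (Zhang and Guo, 2009), the Laplacian pyramid (LP) (Yan et al., 2012; Zhang et al., 2014), the ratio of low-pass pyramid (RP) (Liu et al., 2015a), and the gradient pyramid (GP) (Li et al., 2014). Fusion rules include choose-max, weighted averaging, saliency-based methods, and activity-level measurement. Multi-focus image fusion methods based on feature spaces have attracted growing attention, including robust principal component analysis (RPCA) (Wan et al., 2013), sparse representation (SR) (Yang and Li, 2010), and higher-order singular value decomposition (HOSVD) (Liang et al., 2012).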

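Spatial-domain methods fall into three classes according to the object on which focus is measured: pixel-based (Li et al., 2017), block-based, and region-based (Luo et al., 2012). Pixel-based methods extract feature information from the source images and preserve their original information to the greatest extent, giving high accuracy and strong robustness. They include methods based on guided filtering (GF) (Li et al., 2013a), image matting (IM) (Li et al., 2013b), and the dense scale-invariant feature transform (DSIFT) (Liu et al., 2015b). Block-based and region-based methods divide the source images into blocks or regions with a segmentation strategy and then use a focus measure to select the better-focused blocks or regions as parts of the fused image. Common focus measures include the image gradient and the spatial frequency. In these methods the block size and the segmentation algorithm directly affect the visual quality of the fused image, and blocking artifacts appear easily. Both transform-domain and spatial-domain methods require handcrafted fusion criteria and fusion rules, whereas complex image scenes limit the expressive power of the features and the robustness of the fusion rules.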

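To improve the expressive power of features and the robustness of fusion rules, deep learning has been introduced into multi-focus image fusion. These methods usually exploit the strong feature extraction ability of convolutional neural networks (CNNs) to cast focus discrimination as a classification problem. Liu et al. (2017) used the validation set of the 2012 ImageNet large scale visual recognition challenge (ILSVRC) as training data and blurred random regions of the grayscale images with multi-scale Gaussian filters of different standard deviations to simulate multi-focus images. The model is trained with supervision to classify the image pixel by pixel as focused or defocused, which yields a focus map of the same size as the input image. A choose-max operation and consistency verification are then applied to the focus map to generate the focus decision map. Finally, the fused image is obtained in the spatial domain by a pixel-wise weighted average guided by the decision map. Tang et al. (2018) proposed a multi-focus image fusion method based on a pixel-wise convolutional neural network (P-CNN). The model is trained on Cifar-10 and learns three kinds of pixels from neighboring-pixel information: focused, defocused, and unknown. Each source image is first scored pixel by pixel by the P-CNN, producing a score matrix that represents the focus level of each pixel. A decision map is then obtained by comparing the score matrices of the two source images. Finally, the two input images are averaged with weights given by the final decision map after threshold filtering to produce the fused image. This model performs well in both speed and fusion quality, but supervised learning is limited by the lack of accurate label data for image fusion. To further separate the private and common features of multi-focus images, Luo et al. (2018) proposed a joint convolutional autoencoder network that derives the focus map from the image features learned by the private branches and obtains the all-in-focus fused image with a pixel-level weighted-average rule. This method is unsupervised, needs no manually labeled focus annotations, and achieves satisfactory results in subjective evaluation and on several objective metrics. However, all these methods exploit only the feature extraction and classification abilities of CNNs and still rely on handcrafted fusion rules, so the models cannot adapt the fusion strategy to the application scenario.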

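To realize self-learning of the fusion rule, make full use of the feature extraction ability of CNNs, and incorporate the prior knowledge carried by handcrafted features, this paper builds on the joint convolutional autoencoder network (Luo et al., 2018) and designs a multi-focus image fusion network whose fusion rule is learned. The multi-focus images and their initial decision map serve as the network input so that the network can learn more precise details. The local structural similarity index measure (SSIM) and the local mean squared error (MSE) are used as the loss function to drive the fusion unit to learn the fusion rule.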

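1 Proposed method

This section first introduces the network structure used for multi-focus image fusion, then describes the fusion unit of the network in detail, and finally discusses the design of the loss function.

1.1 Feature extraction network based on joint convolutional autoencoders

Fig. 1 shows the joint convolutional autoencoder network used in this paper. The network consists of an input layer, an encoding layer, a fusion layer, a decoding layer, and an output layer. The input layer takes multi-focus image $\mathit{\boldsymbol{A}}$, multi-focus image $\mathit{\boldsymbol{B}}$, and the initial decision map of image $\mathit{\boldsymbol{A}}$. The encoding layer contains nine trainable convolutional layers with 3×3 kernels, each followed by a ReLU layer. It is divided into a private branch PriA and a common branch ComA for image $\mathit{\boldsymbol{A}}$, and a private branch PriB and a common branch ComB for image $\mathit{\boldsymbol{B}}$. PriA and PriB extract the private features of their respective input images, whereas ComA and ComB share weights and extract the features common to the input images. The fusion layer concatenates the feature maps output by PriA and PriB along the channel dimension and feeds the concatenated feature maps to a trainable convolutional layer with 1×1 kernels. The feature maps output by ComA and ComB are processed in the same way. The decoding layer contains four trainable convolutional layers with 3×3 kernels, the last of which reconstructs the all-in-focus image. Short connections are added to the common branches to counter the vanishing gradients that arise during training. Compared with the previous network (Luo et al., 2018), this network adds a fusion unit and uses short connections to make feature learning more robust.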

Fig. 1 Structure of the joint convolutional autoencoder network

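1.2 Fusion layer design

In deep-learning-based multi-focus image fusion, the fusion layer of a network usually fuses the convolutional features of multiple inputs in one of two ways: 1) the convolutional features of the inputs are concatenated along the channel dimension and then fused by the next convolutional layer; 2) the convolutional features of the inputs are fused with a pixel-level fusion rule. Concatenation stacks the inputs so that the network can learn sufficient feature information. Pixel-level fusion rules include the sum, max, and mean rules, and the fusion strategy can be chosen according to the characteristics of the dataset. In multi-focus images the pixel values indicate how salient the information is, so the proposed method introduces the mean rule on top of concatenation fusion to keep feature learning both diverse and precise. The fusion layer design consists of weight initialization and weight constraint.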

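(1) Weight initialization. The weights are initialized to mimic the weighted-average fusion rule; assigning suitable initial values to the fusion-layer weights lets the layer fuse the features extracted by the encoding layer accurately. The output feature maps of the encoding layers PriA and PriB, and likewise of ComA and ComB, are concatenated along the channel dimension and followed by a trainable 1×1 convolutional layer, and the $I$-th and $(I + p)$-th weights of the $k$-th channel of this 1×1 convolutional layer are initialized to 0.5, that is,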

$ \begin{array}{*{20}{l}} {W_k^I = W_k^{I + p} = 0.5}\\ {I, k = 0, 1, 2, \cdots, 127} \end{array} $

(1)

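where $k$ indexes the output channels of the convolution, $I$ indexes the filter weights of the $k$-th channel, and $p$ is 128 and can be adjusted as required. $W_k^I$ is the $I$-th weight of the $k$-th channel.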
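The initialization in Eq. (1) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' Keras code: it assumes that for output channel $k$ only the pair with $I = k$ is set to 0.5 and every other weight starts at 0, so that each fused channel begins as the mean of the two matching input channels. The function names `init_fusion_weights` and `fuse_pixel` are hypothetical.

```python
def init_fusion_weights(p):
    """Return a p x 2p weight matrix for a 1x1 fusion convolution.

    Row k holds the 2p weights of output channel k. Assumption (not stated
    explicitly in the text): only the pair (I, I + p) with I == k is set
    to 0.5, and all other weights start at 0.
    """
    weights = [[0.0] * (2 * p) for _ in range(p)]
    for k in range(p):
        weights[k][k] = 0.5        # channel k of the first branch
        weights[k][k + p] = 0.5    # channel k of the second branch
    return weights


def fuse_pixel(weights, features):
    """Apply the 1x1 convolution to the 2p concatenated features of one pixel."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


# Tiny example with p = 2: the first branch outputs (2, 4), the second (6, 8).
w = init_fusion_weights(2)
fused = fuse_pixel(w, [2.0, 4.0, 6.0, 8.0])
print(fused)  # each fused channel is the mean of the matching input channels
```

With this starting point the layer reproduces the mean rule exactly, and training is then free to move the weights away from 0.5.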

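(2) Weight constraint. Because the weights may go out of range as the network iterates, a constraint is added to each weight so that it varies within a valid range. Under the mean rule of image fusion, the fusion coefficients of the two images sum to 1. However, because the network uses ReLU activations, for the $k$-th channel it easily happens that $\sum\limits_{I = 0}^{p - 1} {W_k^I} + \sum\limits_{I = p}^{2p - 1} {W_k^I} > 1$. A min/max-norm weight constraint is therefore applied to the $2p$ weights of the $k$-th channel of the fusion layer.

First, the L2 norm of the $2p$ weights of the $k$-th channel is computed

$ {S_k} = \sqrt {\sum\limits_{I = 0}^{p - 1} {{{(W_k^I)}^2}} + \sum\limits_{I = p}^{2p - 1} {{{(W_k^I)}^2}} } $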

(2)

Then ${S_k}$ is clipped to the range $\left[ {{S_{\min }}, {S_{\max }}} \right]$, that is,

$ {S_t} = \left\{ {\begin{array}{*{20}{l}} {{S_{{\rm{min}}}}}&{{S_k} < {S_{{\rm{min}}}}}\\ {{S_k}}&{{S_{{\rm{min}}}} \le {S_k} \le {S_{{\rm{max}}}}}\\ {{S_{{\rm{max}}}}}&{{S_k} > {S_{{\rm{max}}}}} \end{array}} \right. $

(3)

where ${{S_{\min }}}$ is the minimum L2 norm allowed for the incoming weights and ${{S_{\max }}}$ is the maximum.

Finally, each weight of the $k$-th channel is rescaled

$ {W_k^m = W_k^m \times {Z_k}, \quad m = 0, 1, 2, \cdots, 2p - 1} $

(4)

$ {{Z_k} = \frac{{\alpha \times {S_t} + (1 - \alpha) \times {S_k}}}{{\gamma + {S_k}}}} $

(5)

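where $W_k^m$ is the $m$-th weight of the $k$-th channel and ${Z_k}$ is the rescaling factor that enforces the constraint. $\alpha $ is the rate at which the constraint is enforced: a value of 1 enforces the constraint strictly, whereas a value below 1 rescales the weights at every step so that they move gradually toward the valid range. To prevent a zero denominator and the resulting gradient explosion, $\gamma $ is set to 1E-3.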
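Eqs. (2) to (5) can be traced with a small standard-library sketch. The bounds `s_min` and `s_max` used in the example are illustrative values chosen here, not values reported in the paper, and the function name `min_max_norm` is hypothetical.

```python
import math


def min_max_norm(channel_weights, s_min, s_max, alpha=1.0, gamma=1e-3):
    """Rescale the weights of one output channel following Eqs. (2)-(5)."""
    # Eq. (2): L2 norm of all weights of the channel.
    s_k = math.sqrt(sum(w * w for w in channel_weights))
    # Eq. (3): clip the norm to [s_min, s_max].
    s_t = min(max(s_k, s_min), s_max)
    # Eq. (5): blend clipped and unclipped norms, then normalize.
    z_k = (alpha * s_t + (1.0 - alpha) * s_k) / (gamma + s_k)
    # Eq. (4): apply the same factor to every weight of the channel.
    return [w * z_k for w in channel_weights]


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


weights = [3.0, 4.0]                                   # L2 norm is 5
strict = min_max_norm(weights, s_min=0.0, s_max=1.0)   # alpha = 1
soft = min_max_norm(weights, s_min=0.0, s_max=1.0, alpha=0.5)
print(l2_norm(strict), l2_norm(soft))
```

With $\alpha = 1$ the norm is pulled just inside the upper bound in one step; with $\alpha = 0.5$ it moves only halfway, which illustrates the gradual enforcement described above.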

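After weight initialization and weight constraint, the rule of the fusion layer finally becomes

$ {\widehat f_k}(x, y) = W_k^I{f_I}(x, y) + W_k^{I + p}{f_{I + p}}(x, y) $

(6)

where ${f_I}\left({x, y} \right)$ is the $I$-th feature map output by the encoding layer and ${\widehat f_k}\left({x, y} \right)$ is the $k$-th feature map of the fusion layer.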

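1.3 Loss function design

To ensure that the network learns the features of the input images accurately and effectively, a local strategy comprising the local structural similarity and the local mean squared error is added to the loss function.

1) Local structural similarity. The human visual system is sensitive to structural loss and deformation, so the structural similarity index measure (SSIM) can be used to compare the structural information of a distorted image and a reference image in a more intuitive way (Cui et al., 2014). SSIM consists of three components, namely correlation, luminance, and contrast, and their product gives the evaluation result of the fused image: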

$ \begin{array}{*{20}{c}} { SSIM (\mathit{\boldsymbol{X}}, \mathit{\boldsymbol{F}}) = }\\ {\sum\limits_{\mathit{\boldsymbol{x}}, \mathit{\boldsymbol{f}}} {\frac{{(2{\mu _x}{\mu _f} + {C_1})(2{\sigma _x}{\sigma _f} + {C_2})({\sigma _{xf}} + {C_3})}}{{(\mu _x^2 + \mu _f^2 + {C_1})(\sigma _x^2 + \sigma _f^2 + {C_2})({\sigma _x}{\sigma _f} + {C_3})}}} } \end{array} $

(7)

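where $SSIM\left({\mathit{\boldsymbol{X}}, \mathit{\boldsymbol{F}}} \right)$ is the structural similarity between source image $\mathit{\boldsymbol{X}}$ and fused image $\mathit{\boldsymbol{F}}$; $\mathit{\boldsymbol{x}}$ and $\mathit{\boldsymbol{f}}$ are image patches in the source image and the fused image, respectively; ${\mu _x}$ and ${\sigma _x}$ are the mean and standard deviation of the source patch; ${\mu _f}$ and ${\sigma _f}$ are the mean and standard deviation of the fused patch; ${\sigma _{xf}}$ is the covariance between the source and fused patches; and ${C_1}$, ${C_2}$, and ${C_3}$ are constants that stabilize the computation.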
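The summand of Eq. (7) for a single pair of patches can be written out as below. This is a sketch under assumptions: pixel values are taken to lie in [0, 1], and the constants follow the common convention $C_1 = 0.01^2$, $C_2 = 0.03^2$, $C_3 = C_2/2$, which the paper does not specify. The function name `ssim_patch` is hypothetical.

```python
import math


def ssim_patch(x, f, c1=1e-4, c2=9e-4):
    """SSIM of two equally sized patches given as flat lists of floats in [0, 1].

    c1, c2 and c3 = c2 / 2 are conventional stabilizing constants and are an
    assumption of this sketch, not values taken from the paper.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_f = sum(f) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_f = sum((b - mu_f) ** 2 for b in f) / n
    cov = sum((a - mu_x) * (b - mu_f) for a, b in zip(x, f)) / n
    sd_x, sd_f = math.sqrt(var_x), math.sqrt(var_f)
    c3 = c2 / 2.0
    luminance = (2 * mu_x * mu_f + c1) / (mu_x ** 2 + mu_f ** 2 + c1)
    contrast = (2 * sd_x * sd_f + c2) / (var_x + var_f + c2)
    correlation = (cov + c3) / (sd_x * sd_f + c3)
    return luminance * contrast * correlation


patch = [0.1, 0.2, 0.3, 0.4]
same = ssim_patch(patch, patch)            # identical patches
flipped = ssim_patch(patch, patch[::-1])   # same statistics, reversed structure
print(same, flipped)
```

The reversed patch keeps the mean and standard deviation but inverts the structure, so only the correlation term changes sign, which is why SSIM reacts to structure rather than to intensity statistics alone.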

Building on SSIM, the initial decision map ${\mathit{\boldsymbol{X}}_{\rm{m}}}$ of input image $\mathit{\boldsymbol{X}}$ is used to extract the corresponding region of $\mathit{\boldsymbol{X}}$

$ \mathit{\boldsymbol{\bar X}} = {\rm{min}}({\mathit{\boldsymbol{X}}_{\rm{m}}}, \mathit{\boldsymbol{X}}) $

(8)

The initial decision maps of input images $\mathit{\boldsymbol{A}}$ and $\mathit{\boldsymbol{B}}$ are ${\mathit{\boldsymbol{X}}_A}$ and ${\mathit{\boldsymbol{X}}_B}$, respectively. Applying Eq. (8) to images $\mathit{\boldsymbol{A}}$ and $\mathit{\boldsymbol{B}}$ and to the fused image $\mathit{\boldsymbol{F}}$ gives the corresponding regions $\mathit{\boldsymbol{\overline A}} $, $\overline {\mathit{\boldsymbol{AF}}} $, $\mathit{\boldsymbol{\overline B}} $, and $\overline {\mathit{\boldsymbol{BF}}} $. $SSIM\left({\mathit{\boldsymbol{\overline A}}, \overline {\mathit{\boldsymbol{AF}}} } \right)$ and $SSIM\left({\mathit{\boldsymbol{\overline B}}, \overline {\mathit{\boldsymbol{BF}}} } \right)$ are then obtained from Eq. (7).

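2) Local mean squared error. The mean squared error measures the difference between the source image and the fused image. The smaller its value, the higher the quality of the fused image. It is computed as

$ MSE (\mathit{\boldsymbol{X}}, \mathit{\boldsymbol{F}}) = \frac{1}{{MN}}\sum\limits_{i = 0}^{M - 1} {\sum\limits_{j = 0}^{N - 1} {{{(X(i, j) - F(i, j))}^2}} } $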

(9)

where $MSE\left({\mathit{\boldsymbol{X}}, \mathit{\boldsymbol{F}}} \right)$ is the error between input image $\mathit{\boldsymbol{X}}$ and fused image $\mathit{\boldsymbol{F}}$, and $M$ and $N$ are the numbers of rows and columns of the images. $MSE\left({\mathit{\boldsymbol{\overline A}}, \overline {\mathit{\boldsymbol{AF}}} } \right)$ and $MSE\left({\mathit{\boldsymbol{\overline B}}, \overline {\mathit{\boldsymbol{BF}}} } \right)$ are obtained from Eq. (9).
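A small sketch of Eqs. (8) and (9) shows how the decision map localizes the error. It assumes images normalized to [0, 1] and a binary decision map with values in {0, 1}, so that the element-wise minimum acts as a mask; the sample values and the names `local_region` and `mse` are illustrative.

```python
def local_region(decision_map, image):
    """Eq. (8): element-wise minimum of the decision map and the image."""
    return [[min(m, v) for m, v in zip(m_row, v_row)]
            for m_row, v_row in zip(decision_map, image)]


def mse(x, f):
    """Eq. (9): mean squared error between two images of equal size."""
    rows, cols = len(x), len(x[0])
    total = sum((x[i][j] - f[i][j]) ** 2
                for i in range(rows) for j in range(cols))
    return total / (rows * cols)


# Hypothetical 2x2 example: image A is in focus in the left column only.
x_a = [[1.0, 0.0], [1.0, 0.0]]      # initial decision map of A
a = [[0.8, 0.3], [0.6, 0.2]]        # source image A
fused = [[0.8, 0.9], [0.5, 0.7]]    # fused image F

a_bar = local_region(x_a, a)
af_bar = local_region(x_a, fused)
local_error = mse(a_bar, af_bar)    # compares only the focused region of A
global_error = mse(a, fused)        # also penalizes the defocused column
print(local_error, global_error)
```

The local error ignores the right column, where A is out of focus and the fused image is expected to follow B instead.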

The final loss function of the network is

$ \begin{array}{*{20}{l}} {L = {\lambda _1}(SSIM (\mathit{\boldsymbol{\bar A}}, \overline {\mathit{\boldsymbol{AF}}}) + SSIM (\mathit{\boldsymbol{\bar B}}, \overline {\mathit{\boldsymbol{BF}}})) + }\\ {\qquad {\lambda _2}(MSE (\mathit{\boldsymbol{\bar A}}, \overline {\mathit{\boldsymbol{AF}}}) + MSE (\mathit{\boldsymbol{\bar B}}, \overline {\mathit{\boldsymbol{BF}}}))} \end{array} $

(10)

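where ${\lambda _1}$ and ${\lambda _2}$ are the weights of the local structural similarity and the local mean squared error, respectively. ${\lambda _1}$ adjusts the similarity between the fused image and the source images: the larger ${\lambda _1}$, the higher the similarity. ${\lambda _2}$ reinforces the focused regions of the source images in the fused image: the larger ${\lambda _2}$, the more prominent the focused regions. Based on extensive experiments, ${\lambda _1}$ and ${\lambda _2}$ are set to 10 and 1, respectively.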

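2 Experimental results and analysis

The proposed method is first compared subjectively with seven multi-focus image fusion methods: boundary-finding-based multi-focus image fusion (boundary finding, BF) (Zhang et al., 2017), image matting for fusion of multi-focus images in dynamic scenes (IM) (Li et al., 2013b), multi-focus image fusion with the dense scale-invariant feature transform (DSIFT) (Liu et al., 2015b), fast multi-focus image fusion combining the wavelet transform and adaptive blocks (discrete wavelet transform and adaptive block, DWTDE) (Liu and Wang, 2013), multi-scale weighted gradient-based fusion (MWGF) (Zhou et al., 2014), image fusion based on a convolutional neural network (CNN) (Liu et al., 2017), and multi-focus image fusion based on the joint convolutional autoencoder network (joint convolutional auto-encoders, JCAE) (Luo et al., 2018). All methods use the default parameters given in the respective papers. A detailed objective comparison and analysis is then carried out on several pairs of multi-focus images. Finally, the learned fusion rule is illustrated more intuitively.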

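2.1 Experimental setup

To verify the effectiveness of the proposed joint convolutional autoencoder network, experiments are carried out on 40 pairs of multi-focus images, 20 of which come from the open Lytro dataset (Nejati et al., 2015); the other 20 pairs have been widely used in multi-focus image fusion research. Each image in the dataset is divided into $M$ patches of 224×224 pixels with a sliding window of stride 14. The initial decision map is obtained in three stages: segmentation, mapping, and post-processing. First, each image in the dataset is split into blocks of 4×4 pixels and the spatial frequency of each block is computed. Next, the spatial-frequency matrix is mapped back to the size of the source image, averaging where blocks overlap, to give a spatial-frequency map, and comparing the spatial-frequency values gives a binary map. Finally, consistency verification and guided filtering are applied to the binary map to give the initial decision map of the network (Liu et al., 2017). The network is trained with the Keras framework on a TensorFlow backend, on a GTX 1080Ti with 64 GB of RAM. The comparison experiments run in MATLAB R2016b on an Intel Core i3-4150 CPU at 3.5 GHz with 8 GB of RAM.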
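The patch extraction and the block-wise focus measure can be sketched as follows. The spatial frequency is written in its usual form, the square root of the summed squared row and column differences divided by the block size; the paper does not print the formula, so this definition is an assumption, as are the function names and the 520×520 image size used in the example.

```python
import math


def patch_origins(height, width, size=224, stride=14):
    """Top-left corners of all size x size patches taken with the given stride."""
    return [(r, c)
            for r in range(0, height - size + 1, stride)
            for c in range(0, width - size + 1, stride)]


def spatial_frequency(block):
    """Spatial frequency sqrt(RF^2 + CF^2) of a 2-D block (list of rows)."""
    rows, cols = len(block), len(block[0])
    # Squared horizontal differences (row frequency term).
    rf = sum((block[i][j] - block[i][j - 1]) ** 2
             for i in range(rows) for j in range(1, cols))
    # Squared vertical differences (column frequency term).
    cf = sum((block[i][j] - block[i - 1][j]) ** 2
             for i in range(1, rows) for j in range(cols))
    return math.sqrt((rf + cf) / (rows * cols))


# Number of patches M for an assumed 520 x 520 image.
origins = patch_origins(520, 520)
print(len(origins))

# A flat 4x4 block versus a checkerboard block: the sharper block scores higher.
flat = [[0.5] * 4 for _ in range(4)]
checker = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
print(spatial_frequency(flat), spatial_frequency(checker))
```

Comparing the spatial frequency of co-located blocks in the two source images is one simple way to obtain the binary map described above before consistency verification and guided filtering are applied.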

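2.2 Comparison with other methods

The Disk pair is taken as an example to verify the proposed fusion method; the fused images produced by the seven other methods and by the proposed method are shown in Fig. 2. For a more intuitive comparison, a small region on the left contour of the alarm clock is marked with a red rectangle in every fused image in Fig. 2, and a magnified view of the rectangular region is also given.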

Fig. 2 Fusion results of various algorithms on Disk ((a) source image $\mathit{\boldsymbol{A}}$; (b) source image $\mathit{\boldsymbol{B}}$; (c) IM; (d) BF; (e) DSIFT; (f) DWTDE; (g) MWGF; (h) CNN; (i) JCAE; (j) ours)

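As Fig. 2 shows, all of the above methods produce all-in-focus images of good subjective visual quality. DSIFT and DWTDE introduce spurious information such as artifacts at the edge of the alarm clock. IM fuses well but shows some Gibbs phenomenon in the desk region and loses part of the detail. BF shows blur distortion in the magnified region because the method concentrates on finding boundaries and evaluates the focus measure within single blocks. MWGF, CNN, and JCAE give very good results, but with a slight dent on the left boundary of the alarm clock. By comparison, the overall visual quality of the proposed method is on a par with that of the other methods, and the magnified regions in Fig. 2 show that it handles details well; in particular, the edge of the alarm clock is smooth and natural. Because the initial decision map of the multi-focus images is fed to the network and the local strategy is used in the loss function, the fused image of the proposed method retains key information well and suits human visual perception. Fig. 3 gives the fusion results of the other methods and the proposed method on five more pairs of multi-focus images. All methods fuse the multi-focus images well to some extent, and the proposed method achieves better fusion results than the others.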

Fig. 3 Fusion results of 5 pairs of multi-focus images ((a) source image $\mathit{\boldsymbol{A}}$; (b) source image $\mathit{\boldsymbol{B}}$; (c) IM; (d) BF; (e) DSIFT; (f) DWTDE; (g) MWGF; (h) CNN; (i) JCAE; (j) ours)

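2.3 Objective evaluation metrics

To evaluate the results of each fusion method objectively, four metrics are used: entropy (EN) (Ma et al., 2019), the metric ${Q_W}$ proposed by Piella and Heijmans (2003), the correlation coefficient (CC) (Ma et al., 2019), and visual information fidelity (VIFF) (Han et al., 2013). Entropy is an information-theoretic metric that reflects the amount of information in an image; a larger entropy indicates that the fused image contains more information. ${Q_W}$ is a variant of the universal image quality index that assigns high weights to visually salient regions in order to account for the location and magnitude of distorted pixels. The larger ${Q_W}$, the better the fusion. The correlation coefficient measures the correlation between the source images and the fused image, and larger values indicate better fusion. Visual information fidelity simulates subjective human vision to measure the fidelity of the fused image and involves four steps: block partitioning, evaluation, computing the sub-band fidelity, and computing the overall fidelity. The larger the VIFF, the smaller the distortion between the fused image and the source images. To keep the objective evaluation fair, all metrics use the default parameters given in the respective papers.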
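Two of these metrics are simple enough to sketch with the standard library. Entropy is the Shannon entropy of the gray-level histogram. For the correlation coefficient, the sketch assumes the fused image is correlated with each source image separately and the two Pearson coefficients are averaged; the paper cites Ma et al. (2019) for the definition and does not restate it, so that averaging and the function names are assumptions of this example.

```python
import math


def entropy(pixels, levels=256):
    """Shannon entropy in bits of a flat list of integer gray levels."""
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    n = len(pixels)
    return -sum((h / n) * math.log2(h / n) for h in hist if h > 0)


def pearson(x, y):
    """Pearson correlation between two flat lists of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def correlation_coefficient(src_a, src_b, fused):
    """Assumed CC: mean of the correlations of the fused image with each source."""
    return (pearson(src_a, fused) + pearson(src_b, fused)) / 2.0


# Toy data: a two-level image has 1 bit of entropy, a four-level image 2 bits.
print(entropy([0, 0, 255, 255]), entropy([0, 1, 2, 3]))
print(correlation_coefficient([1, 2, 3, 4], [1, 2, 4, 3], [1, 2, 3, 4]))
```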

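Table 1 lists the objective metrics of the fusion results of the proposed method and the other methods on five pairs of multi-focus images. As Table 1 shows, the proposed method has a clear advantage over the other methods on several metrics. On the Disk and Flower images it is second only to MWGF in VIFF, possibly because the model concentrates on raising the contrast of the fused image and thereby introduces local noise that affects the block partitioning, evaluation, and computation steps of VIFF. On the Girl image it is second only to DWTDE in entropy, by a small margin. The average values of the proposed method on EN, ${Q_W}$, CC, and VIFF are all the best, at 7.457 4, 0.917 7, 0.978 8, and 0.890 8, respectively. Overall, the proposed method achieves the best results on ${Q_W}$, the correlation coefficient, and the average values, and good results on entropy and visual information fidelity, which shows that it is an effective fusion method.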

Table 1 Objective evaluation of various fusion methods on five pairs of multi-focus images

Image	Metric	IM	BF	DSIFT	DWTDE	MWGF	CNN	JCAE	Ours	Rank
Disk	EN	7.283 7	7.269 9	7.291 2	7.289 6	7.281 8	7.280 3	7.279 7	7.296 2	1
	${Q_W}$	0.926 5	0.920 8	0.930 7	0.929 6	0.931 9	0.932 0	0.931 8	0.932 7	1
	CC	0.976 1	0.976 2	0.976 1	0.976 2	0.976 2	0.976 7	0.976 7	0.978 8	1
	VIFF	0.878 1	0.867 6	0.882 0	0.871 2	0.885 3	0.880 0	0.878 0	0.883 1	2

Wine	EN	7.448 2	7.415 2	7.420 3	7.567 7	7.483 7	7.442 7	7.459 5	7.644 5	1
	${Q_W}$	0.897 0	0.891 9	0.891 9	0.889 9	0.892 0	0.894 3	0.894 9	0.903 5	1
	CC	0.973 7	0.972 9	0.972 9	0.973 7	0.972 9	0.973 5	0.973 7	0.976 1	1
	VIFF	0.874 5	0.861 3	0.866 3	0.846 7	0.866 5	0.869 1	0.866 2	0.882 0	1

Book	EN	7.266 2	7.269 4	7.272 8	7.281 3	7.271 5	7.270 2	7.262 4	7.295 0	1
	${Q_W}$	0.926 7	0.928 7	0.928 4	0.928 5	0.928 4	0.929 7	0.930 5	0.932 3	1
	CC	0.983 0	0.982 8	0.982 7	0.982 8	0.982 7	0.982 8	0.982 9	0.983 5	1
	VIFF	0.938 0	0.936 8	0.939 4	0.936 8	0.941 1	0.938 5	0.942 6	0.960 8	1

Flower	EN	7.180 5	7.179 2	7.182 4	7.176 0	7.185 6	7.180 1	7.181 3	7.192 6	1
	${Q_W}$	0.924 0	0.923 2	0.925 2	0.924 4	0.925 7	0.926 7	0.927 0	0.934 7	1
	CC	0.962 5	0.962 3	0.961 9	0.964 2	0.962 3	0.963 1	0.963 1	0.971 1	1
	VIFF	0.926 1	0.923 2	0.931 1	0.911 2	0.942 3	0.931 0	0.931 0	0.941 4	2

Girl	EN	7.855 7	7.854 9	7.855 1	7.859 3	7.856 3	7.855 1	7.855 1	7.858 4	2
	${Q_W}$	0.864 3	0.873 1	0.873 0	0.854 1	0.873 4	0.874 5	0.873 5	0.885 3	1
	CC	0.980 9	0.981 3	0.981 3	0.982 5	0.981 1	0.981 7	0.981 8	0.984 6	1
	VIFF	0.729 7	0.739 7	0.739 5	0.748 7	0.743 8	0.742 4	0.746 0	0.786 6	1
Note: bold marks the best result in each row and underlining marks the second best.
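The averages quoted in the text can be checked against the proposed method's column of Table 1. The short script below copies those five values per metric and compares their means with the reported averages; the tolerance allows for the four-decimal rounding of the table entries.

```python
# Values of the proposed method on Disk, Wine, Book, Flower and Girl (Table 1).
ours = {
    "EN":   [7.2962, 7.6445, 7.2950, 7.1926, 7.8584],
    "Q_W":  [0.9327, 0.9035, 0.9323, 0.9347, 0.8853],
    "CC":   [0.9788, 0.9761, 0.9835, 0.9711, 0.9846],
    "VIFF": [0.8831, 0.8820, 0.9608, 0.9414, 0.7866],
}

# Averages reported in the text.
reported = {"EN": 7.4574, "Q_W": 0.9177, "CC": 0.9788, "VIFF": 0.8908}

means = {name: sum(values) / len(values) for name, values in ours.items()}
for name in ours:
    print(name, round(means[name], 4), reported[name])
```

The recomputed means agree with the reported averages to within rounding of the table entries.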

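2.4 Geometric interpretation of the learned fusion rule

To give a sound interpretation of the learned fusion rule, the feature maps of the encoding and fusion layers are visualized and the fusion rule is discussed in detail, covering visualization of the encoding-layer features and verification of the fusion rule.

1) Visualization of encoding-layer features. The pair of multi-focus images in Fig. 4(a)(d) is fed into the joint convolutional autoencoder network, and the feature maps output by the encoding layer are saved as images. Fig. 4(b)(c) and Fig. 4(e)(f) show one of the 128 private feature maps and one of the 128 common feature maps output by the encoding layer for source images $\mathit{\boldsymbol{A}}$ and $\mathit{\boldsymbol{B}}$, respectively. The private features clearly discriminate focused from blurred pixels, and the common features express the redundancy between the images to be fused well. Source image $\mathit{\boldsymbol{A}}$ is sharp in the foreground and blurred in the background; its private features are strongly activated in the foreground, whereas its common features are activated over the whole image. The feature maps of source image $\mathit{\boldsymbol{B}}$ show consistent activation behavior.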

Fig. 4 Feature visualization of JCAE network encoder layer ((a) source image $\mathit{\boldsymbol{A}}$; (b) private features of $\mathit{\boldsymbol{A}}$; (c) common features of $\mathit{\boldsymbol{A}}$; (d) source image $\mathit{\boldsymbol{B}}$; (e) private features of $\mathit{\boldsymbol{B}}$; (f) common features of $\mathit{\boldsymbol{B}}$)

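2) Verification of the fusion rule. The one-dimensional entropy of an image is used to represent how rich its information is, and the fusion rule is justified by examining how much information the fusion-layer feature maps retain. First, the information entropy of the feature maps produced by the fusion-layer convolution is computed and sorted; the $i$-th map ${\mathit{\boldsymbol{M}}_i}$ is selected, its weights ${W_j}$ are obtained and sorted, and the indices ${I_t}$ of the 10 largest weights are taken. Then the information entropies ${H_{pA}}$, ${H_{cA}}$, ${H_{pB}}$, and ${H_{cB}}$ of the feature maps output by the encoding layers PriA, ComA, PriB, and ComB are computed and sorted. Next, the values and ranks of the feature maps indexed by ${I_t}$ within ${H_{pA}}$, ${H_{cA}}$, ${H_{pB}}$, and ${H_{cB}}$ are examined. Finally, each feature map indexed by ${I_t}$ is displayed so that its information richness can be compared subjectively.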

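To interpret the learned fusion rule, an arbitrary feature map ${\mathit{\boldsymbol{F}}_x}$ is selected from the 128 feature maps of the trainable 1×1 convolutional layer in the fusion layer. ${\mathit{\boldsymbol{F}}_x}$ has 256 weights $W$. Suppose ${W_i} > {W_j}$, where ${W_i}$ corresponds to feature map ${\mathit{\boldsymbol{P}}_i}$ of the previous layer and ${W_j}$ to ${\mathit{\boldsymbol{P}}_j}$, and compare the one-dimensional entropies ${E_i}$ and ${E_j}$ of ${\mathit{\boldsymbol{P}}_i}$ and ${\mathit{\boldsymbol{P}}_j}$. If ${E_i} > {E_j}$, the information content of a feature map is positively correlated with its weight, which further supports the effectiveness of the network and the soundness of the fusion rule.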
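The pairwise test described above can be turned into a single number: the fraction of weight pairs with ${W_i} > {W_j}$ for which ${E_i} > {E_j}$ also holds. The sketch below computes this fraction for made-up weights and entropies; the sample values and the function name `concordance` are illustrative and are not taken from the trained network.

```python
def concordance(weights, entropies):
    """Fraction of ordered pairs with W_i > W_j for which E_i > E_j also holds."""
    agree = 0
    total = 0
    n = len(weights)
    for i in range(n):
        for j in range(n):
            if weights[i] > weights[j]:
                total += 1
                if entropies[i] > entropies[j]:
                    agree += 1
    return agree / total if total else 0.0


# Hypothetical weights of one fusion channel and entropies of the matching
# feature maps of the previous layer.
weights = [0.9, 0.1, 0.5, 0.3]
entropies = [6.8, 2.1, 5.0, 5.5]
score = concordance(weights, entropies)
print(score)
```

A score of 1 means the ordering of the weights fully matches the ordering of the entropies, and a score clearly above 0.5 indicates the positive relation that the paper reports.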

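The pair of multi-focus images in Fig. 5(a)(b) is fed into the joint convolutional autoencoder network, and the feature maps of the fusion-layer convolution are output as images. Fig. 5(c)-(e) show the 21st feature map of the Pri branch of the fusion layer together with the preceding feature maps that correspond to its 17th and 25th weights $W$. The preceding feature map of the 17th $W$ is visibly richer in information than that of the 25th $W$. Fig. 5(f)-(h) show the 117th feature map of the Com branch of the fusion layer together with the preceding feature maps that correspond to its 3rd and 109th $W$. The map of the 3rd $W$ contains more detail than that of the 109th $W$. Fig. 6 gives, for each channel of the fusion layer, the correlation between its 256 weights $W$ and the entropies of the 256 preceding feature maps. The information content of the feature maps is essentially positively correlated with $W$.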

Fig. 5 Fusion rule visualization ((a) source image $\mathit{\boldsymbol{A}}$; (b) source image $\mathit{\boldsymbol{B}}$; (c) Pri21 feature map; (d) 17th $W$ of Pri21; (e) 25th $W$ of Pri21; (f) Com117 feature map; (g) 3rd $W$ of Com117; (h) 109th $W$ of Com117)

Fig. 6 Correlation between the information volume of feature map and weight value

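3 Conclusion

Deep-learning-based multi-focus image fusion methods mainly use a CNN as one part of the fusion algorithm and usually exploit its strong feature extraction ability to cast focus discrimination as a classification problem. However, supervised learning cannot obtain accurate label data for image fusion. Moreover, these methods design the fusion strategy from the output of the last network layer only and lose useful information from the intermediate layers. To address these problems, this paper proposes an end-to-end unsupervised multi-focus image fusion algorithm based on the joint convolutional autoencoder network. Prior knowledge about the multi-focus images is incorporated so that the network can learn precise image details. A suitable weight initialization and weight constraint are designed for the fusion layer, and a geometric interpretation of the fusion rule is given at the quantitative and qualitative levels. Local structural similarity and local mean squared error are applied in the loss function to drive the fusion unit to learn the fusion rule effectively.

The experimental results show that the proposed method not only learns the fusion rule by itself during fusion but also achieves good results in subjective visual quality and objective evaluation, which is significant for further understanding the mechanism of deep-learning-based multi-focus image fusion and for studying a general multi-modal framework for image fusion.

The datasets in the current experiments are limited to multi-source, single-modality images. Future work will make the model more robust and general so that it can fuse multi-source, multi-modal images.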

References

Cui Y, Xiong B L, Jiang Y M, Kuang G Y. 2014. Multi-scale approach based on structure similarity for change detection in SAR images. Journal of Image and Graphics, 19(10): 1507-1513 (崔莹, 熊博莅, 蒋咏梅, 匡纲要. 2014. 结合结构相似度的自适应多尺度SAR图像变化检测. 中国图象图形学报, 19(10): 1507-1513) [DOI:10.11834/jig.20141013]

Han Y, Cai Y Z, Cao Y, Xu X M. 2013. A new image fusion performance metric based on visual information fidelity. Information Fusion, 14(2): 127-135 [DOI:10.1016/j.inffus.2011.08.002]

Li M J, Dong Y B, Wang X L. 2014. Image fusion algorithm based on gradient pyramid and its performance evaluation. Applied Mechanics and Materials, 525: 715-718 [DOI:10.4028/www.scientific.net/AMM.525.715]

Li S T, Kang X D, Fang L Y, Hu J W, Yin H T. 2017. Pixel-level image fusion:a survey of the state of the art. Information Fusion, 33: 100-112 [DOI:10.1016/j.inffus.2016.05.004]

Li S T, Kang X D, Hu J W. 2013a. Image fusion with guided filtering. IEEE Transactions on Image Processing, 22(7): 2864-2875 [DOI:10.1109/TIP.2013.2244222]

Li S T, Kang X D, Hu J W, Yang B. 2013b. Image matting for fusion of multi-focus images in dynamic scenes. Information Fusion, 14(2): 147-162 [DOI:10.1016/j.inffus.2011.07.001]

Liang J L, He Y, Liu D, Zeng X J. 2012. Image fusion using higher order singular value decomposition. IEEE Transactions on Image Processing, 21(5): 2898-2909 [DOI:10.1109/TIP.2012.2183140]

Liu Y, Wang Z F. 2013. Multi-focus image based on wavelet transform and adaptive block. Journal of Image and Graphics, 18(11): 1435-1444 (刘羽, 汪增福. 2013. 结合小波变换和自适应分块的多聚焦图像快速融合. 中国图象图形学报, 18(11): 1435-1444) [DOI:10.11834/jig.20131106]

Liu Y, Chen X, Peng H, Wang Z F. 2017. Multi-focus image fusion with a deep convolutional neural network. Information Fusion, 36: 191-207 [DOI:10.1016/j.inffus.2016.12.001]

Liu Y, Liu S P, Wang Z F. 2015a. A general framework for image fusion based on multi-scale transform and sparse representation. Information Fusion, 24: 147-164 [DOI:10.1016/j.inffus.2014.09.004]

Liu Y, Liu S P, Wang Z F. 2015b. Multi-focus image fusion with dense SIFT. Information Fusion, 23: 139-155 [DOI:10.1016/j.inffus.2014.05.004]

Luo X Q, Xiong M Y and Zhang Z C. 2018. Multi-focus image fusion method based on the joint convolutional auto-encoders network[EB/OL].[2019-12-01] http://kns.cnki.net/kcms/detail/21.1124.TP.20190318.1051.001.html (罗晓清, 熊梦渔, 张战成. 2018.基于联合卷积自编码网络的多聚焦图像融合方法[EB/OL].[2019-12-01]http://kns.cnki.net/kcms/detail/21.1124.TP.20190318.1051.001.html)

Luo X Y, Zhang J, Dai Q H. 2012. A regional image fusion based on similarity characteristics. Signal Processing, 92(5): 1268-1280 [DOI:10.1016/j.sigpro.2011.11.021]

Ma J Y, Ma Y, Li C. 2019. Infrared and visible image fusion methods and applications:a survey. Information Fusion, 45: 153-178 [DOI:10.1016/j.inffus.2018.02.004]

Mahyari A G and Yazdi M. 2009. A novel image fusion method using curvelet transform based on linear dependency test//Proceedings of 2009 International Conference on Digital Image Processing. Bangkok, Thailand: IEEE: 351-354[DOI: 10.1109/ICDIP.2009.67]

Nejati M, Samavi S, Shirani S. 2015. Multi-focus image fusion using dictionary-based sparse representation. Information Fusion, 25: 72-84 [DOI:10.1016/j.inffus.2014.10.004]

Piella G and Heijmans H. 2003. A new quality metric for image fusion//Proceedings of the International Conference on Image Processing. Barcelona, Spain: IEEE: 173-176[DOI: 10.1109/ICIP.2003.1247209]

Tang H, Xiao B, Li W S, Wang G Y. 2018. Pixel convolutional neural network for multi-focus image fusion. Information Sciences, 433-434: 125-141 [DOI:10.1016/j.ins.2017.12.043]

Wan T, Zhu C C, Qin Z C. 2013. Multifocus image fusion based on robust principal component analysis. Pattern Recognition Letters, 34(9): 1001-1008 [DOI:10.1016/j.patrec.2013.03.003]

Yan C M, Guo B L, Yi M. 2012. Multifocus image fusion method based on improved LP and adaptive PCNN. Control and Decision, 27(5): 703-707, 712 (严春满, 郭宝龙, 易盟. 2012. 基于改进LP变换及自适应PCNN的多聚焦图像融合方法. 控制与决策, 27(5): 703-707, 712) [DOI:10.13195/j.cd.2012.05.66.yanchm.016]

Yang B, Li S T. 2010. Multifocus image fusion and restoration with sparse representation. IEEE Transactions on Instrumentation and Measurement, 59(4): 884-892 [DOI:10.1109/tim.2009.2026612]

Yang Y, Zheng W J, Huang S Y, Fang Z J, Yuan F N. 2014. Multi-focus image fusion based on human visual perception characteristic in non-subsampled contourlet transform domain. Journal of Image and Graphics, 19(3): 447-455 (杨勇, 郑文娟, 黄淑英, 方志军, 袁非牛. 2014. 人眼视觉感知特性的非下采样Contourlet变换域多聚焦图像融合. 中国图象图形学报, 19(3): 447-455) [DOI:10.11834/jig.20140215]

Zhang J, He H, Zhan X S, Xiao J. 2014. Three dimensional face reconstruction via feature adaptation and Laplace deformation. Journal of Image and Graphics, 19(9): 1349-1359 (张剑, 何骅, 詹小四, 肖俊. 2014. 结合特征适配与拉普拉斯形变的3维人脸重建. 中国图象图形学报, 19(9): 1349-1359) [DOI:10.11834/jig.20140912]

Zhang Q, Guo B L. 2009. Multifocus image fusion using the nonsubsampled contourlet transform. Signal Processing, 89(7): 1334-1346 [DOI:10.1016/j.sigpro.2009.01.012]

Zhang Y, Bai X Z, Wang T. 2017. Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure. Information Fusion, 35: 81-101 [DOI:10.1016/j.inffus.2016.09.006]

Zhang Y, Liu Y, Sun P, Yan H, Zhao X L, Zhang L. 2020. IFCNN:a general image fusion framework based on convolutional neural network. Information Fusion, 54: 99-118 [DOI:10.1016/j.inffus.2019.07.011]

Zhao H J, Shang Z W, Tang Y Y, Fang B. 2013. Multi-focus image fusion based on the neighbor distance. Pattern Recognition, 46(3): 1002-1011 [DOI:10.1016/j.patcog.2012.09.012]

Zhao Y L, Zhou Y, Xu D. 2015. Multi-focus image capture and fusion system for macro photography. Journal of Image and Graphics, 20(4): 544-550 (赵毅力, 周屹, 徐丹. 2015. 微距摄影的多聚焦图像拍摄和融合. 中国图象图形学报, 20(4): 544-550) [DOI:10.11834/jig.20150411]

Zhou Z Q, Li S, Wang B. 2014. Multi-scale weighted gradient-based fusion for multi-focus images. Information Fusion, 20: 60-72 [DOI:10.1016/j.inffus.2013.11.005]