Multi-focus image fusion with a self-learning fusion rule

Liu Ziwen1, Luo Xiaoqing1, Zhang Zhancheng2 (1. School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China; 2. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China)

Abstract
Objective Deep learning-based multi-focus image fusion methods mainly use a convolutional neural network (CNN) to classify pixels as focused or defocused. The supervised training process usually relies on artificially constructed datasets, and the accuracy of the label data directly affects the classification accuracy, which in turn affects the accuracy of the subsequently handcrafted fusion rule and the quality of the fused all-in-focus image. To allow the fusion network to adjust its fusion rule adaptively, a multi-focus image fusion algorithm based on a self-learned fusion rule is proposed.

Method An autoencoder architecture is adopted to extract features while learning the fusion rule and the reconstruction rule simultaneously, yielding an unsupervised, end-to-end fusion network. The initial decision map of the multi-focus images is fed in as a prior so that the network learns rich image detail. A local strategy, combining the structural similarity index measure (SSIM) and the mean squared error (MSE), is added to the loss function to ensure more accurate image reconstruction.

Result The model is evaluated subjectively and objectively on public datasets such as Lytro to verify the rationality of the fusion algorithm design. Subjectively, the model not only fuses the focused regions well and effectively avoids artifacts in the fused image, but also preserves sufficient detail, producing a naturally clear visual result. Objectively, quantitative comparison of the fused images with those of other mainstream multi-focus image fusion algorithms shows that the average scores on entropy, Qw, correlation coefficient, and visual information fidelity are all the best, at 7.457 4, 0.917 7, 0.978 8, and 0.890 8, respectively.

Conclusion A fusion algorithm for multi-focus images is proposed that not only learns and adjusts the fusion rule by itself, but also produces fused images comparable to those of existing methods, which helps to further understand the mechanism of deep learning-based multi-focus image fusion.
Multi-focus image fusion with a self-learning fusion rule

Liu Ziwen1, Luo Xiaoqing1, Zhang Zhancheng2 (1. School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China; 2. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China)

Abstract
Objective Existing multi-focus image fusion approaches based on deep learning mostly treat a convolutional neural network (CNN) as a classifier. These methods use the CNN to classify pixels as focused or defocused, and the corresponding fusion rules are designed by hand according to the classified pixels. The expected all-in-focus image therefore depends mainly on the handcrafted fusion rule and the labeled data and is constructed from the learned feature maps. The training process is supervised by pixel-level labels. However, manually labeling a pixel as focused or defocused is an arduous task and may lead to inaccurate focus prediction. Existing multi-focus datasets are constructed by adding Gaussian blur to parts of all-in-focus images, which makes the training data unrealistic. To address these issues and allow the network to adjust its fusion rule adaptively, a novel multi-focus image fusion algorithm based on a self-learned fusion rule is proposed.

Method Autoencoders are unsupervised learning networks whose hidden layers can be regarded as a feature representation of the input samples. Multi-focus images are usually captured from the same scene and carry public scene information as well as private focus information, so the paired images should be encoded in common and private feature spaces, respectively. This study uses joint convolutional autoencoders (JCAEs) to learn such structured features. JCAEs consist of public and private branches: the public branches share weights to extract the encoding features common to the input images, and the private branches acquire the private encoding features of each image. A fusion layer based on a concatenation operation is designed to obtain a self-learned fusion rule and to constrain the entire fusion network to work in an end-to-end manner. The initial focus map is fed in as a prior input so that the network can learn precise details. Current deep learning-based multi-focus fusion algorithms train networks on heavily augmented datasets and rely on various tricks to tune the networks, so the design of the fusion rule is significant. Fusion rules generally fall into direct cascading (concatenation) fusion and pixel-level fusion. Cascading fusion stacks multiple inputs and blends them in the next convolutional layer, which helps the network obtain rich image features. Pixel-level fusion rules are built from maximum, sum, and mean rules and are selected according to the characteristics of the dataset. Here, the mean rule is introduced on top of cascading fusion so that the network can adjust the fusion rule autonomously during training. The fusion rule learned by the JCAEs is analyzed quantitatively and qualitatively to reveal how it works. Image entropy, which reflects the amount of information aggregated in the grayscale distribution of an image, is used to measure how much information the feature maps in the fusion layer retain and thus to demonstrate the rationality of the learned rule. A pair of multi-focus images is fed into the network, the network is trained to produce fused images, and the feature maps of the convolution in the fusion layer are examined; the fusion rule can then be interpreted visually by comparing the information content of the feature maps with the learned weight values. Instead of training the network with a basic global loss, the model adds a local strategy to the loss function that combines the structural similarity index measure (SSIM) and the mean squared error (MSE).
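The abstract contains no code, so the following is only a minimal Keras sketch of how a joint convolutional autoencoder of this kind might be wired: weight-shared public branches, separate private branches, the initial decision map as a prior input, short connections, and a concatenation-plus-1×1-convolution fusion layer whose weights play the role of a self-learned weighted-mean rule. All layer names, sizes, and the exact placement of the prior are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): a joint convolutional
# autoencoder with a weight-shared public branch, per-image private branches,
# and a learnable concatenation-based fusion layer. Sizes are illustrative.
from tensorflow.keras import layers, Model

def conv_block(x, filters, name):
    return layers.Conv2D(filters, 3, padding="same", activation="relu", name=name)(x)

def build_jcae():
    # Source images A/B plus the initial decision map used as a prior input.
    img_a = layers.Input((None, None, 1), name="image_a")
    img_b = layers.Input((None, None, 1), name="image_b")
    prior = layers.Input((None, None, 1), name="initial_decision_map")

    # Public branch: shared weights extract scene content common to A and B.
    shared1 = layers.Conv2D(32, 3, padding="same", activation="relu", name="public_conv1")
    shared2 = layers.Conv2D(32, 3, padding="same", activation="relu", name="public_conv2")
    pub_a, pub_b = shared2(shared1(img_a)), shared2(shared1(img_b))

    # Private branches: separate weights capture each image's own focus information.
    priv_a = conv_block(img_a, 32, "private_a")
    priv_b = conv_block(img_b, 32, "private_b")

    # Short (skip) connections: keep shallow features alongside deeper ones.
    feat_a = layers.Concatenate(name="feat_a")([pub_a, priv_a])
    feat_b = layers.Concatenate(name="feat_b")([pub_b, priv_b])

    # Fusion layer: concatenate both feature stacks and the prior, then mix them
    # with a 1x1 convolution whose weights realize a self-learned weighted-mean
    # rule that is adjusted automatically during training.
    fused = layers.Concatenate(name="fusion_concat")([feat_a, feat_b, prior])
    fused = layers.Conv2D(64, 1, padding="same", activation="relu", name="fusion_1x1")(fused)

    # Decoder: reconstruct the all-in-focus image from the fused features.
    x = conv_block(fused, 64, "dec_conv1")
    x = conv_block(x, 32, "dec_conv2")
    out = layers.Conv2D(1, 3, padding="same", activation="sigmoid", name="fused_image")(x)
    return Model([img_a, img_b, prior], out, name="jcae_fusion")

model = build_jcae()
model.summary()
```

Because the 1×1 fusion weights are trained jointly with the reconstruction objective, the rule is learned rather than handcrafted, which is the behavior described in the Method section.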
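The Method section justifies the learned rule by measuring how much information the fusion-layer feature maps retain; Shannon entropy of the grayscale histogram is the standard way to express this. The helper below is a sketch of such a measurement under the assumption that each feature map is rescaled to [0, 1] before histogramming; it is not taken from the paper.

```python
import numpy as np

def image_entropy(feature_map, bins=256):
    """Shannon entropy (in bits) of a single-channel feature map,
    computed from its normalized grayscale histogram."""
    fm = np.asarray(feature_map, dtype=np.float64)
    # Rescale to [0, 1] so maps with different value ranges are comparable.
    fm = (fm - fm.min()) / (fm.max() - fm.min() + 1e-12)
    hist, _ = np.histogram(fm, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]  # ignore empty bins
    return float(-np.sum(p * np.log2(p)))

# Example: rank the channels of a fusion-layer activation of shape (H, W, C)
# by how much information each retains.
# entropies = [image_entropy(activations[..., c]) for c in range(activations.shape[-1])]
```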
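The SSIM-plus-MSE local loss can be written with TensorFlow's built-in `tf.image.ssim`, which is computed over sliding local windows; the weighting factor `alpha` and window size below are illustrative assumptions, not values reported in the paper.

```python
import tensorflow as tf

def ssim_mse_loss(alpha=0.84, max_val=1.0, filter_size=11):
    """Weighted sum of (1 - SSIM) and MSE; alpha balances the two terms."""
    def loss(y_true, y_pred):
        # tf.image.ssim already works on local windows, giving the structural term.
        ssim = tf.image.ssim(y_true, y_pred, max_val=max_val, filter_size=filter_size)
        ssim_term = 1.0 - tf.reduce_mean(ssim)
        mse_term = tf.reduce_mean(tf.square(y_true - y_pred))
        return alpha * ssim_term + (1.0 - alpha) * mse_term
    return loss

# Hypothetical usage with the sketch above:
# model.compile(optimizer="adam", loss=ssim_mse_loss())
```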
This local SSIM-and-MSE strategy effectively drives the fusion unit to learn pixel-wise features and ensures that the image is restored accurately. More accurate and abstract features can be obtained when the source images pass through deep rather than shallow networks; however, problems such as vanishing gradients and long convergence times arise during back-propagation in deep networks. A residual network skips a few layers through skip (shortcut) connections and learns residual mappings rather than the original input, which eases training. Therefore, a short-connection strategy is adopted to improve the feature learning ability of the JCAEs.

Result The model is trained with the Keras framework on top of TensorFlow. We test the model on the Lytro dataset, which is widely used in multi-focus image fusion research, and conduct subjective and objective comparisons with existing multi-focus fusion algorithms to verify the performance of the proposed method. Key areas, such as the boundary between focused and defocused regions, are magnified to illustrate the differences among the fused images in detail. From the perspective of subjective evaluation, the model can effectively fuse the focused regions and avoid artifacts in the fused image; detailed information is preserved, and the visual effect is natural and clear. From the perspective of objective evaluation, comparing the fused images of the model with those of other mainstream multi-focus image fusion algorithms shows that the average scores on entropy, Qw, correlation coefficient, and visual information fidelity are the best, at 7.457 4, 0.917 7, 0.978 8, and 0.890 8, respectively.

Conclusion Most deep learning-based multi-focus image fusion methods follow the same pattern: a CNN classifies pixels into focused and defocused ones, fusion rules are designed manually according to the classified pixels, and the fusion operation is carried out in the original spatial domain or on the learned feature maps to obtain an all-in-focus image. This pipeline ignores considerable useful information in the middle layers and relies heavily on labeled data. To solve these problems, this study proposes a multi-focus image fusion algorithm with a self-learned fusion rule. A fusion layer is designed on top of JCAEs, and we discuss the network structure, the design of the loss function, and how to embed pixel-wise prior knowledge so that the network can output vivid fused images. We also provide a reasonable geometric interpretation of the learnable fusion operation at quantitative and qualitative levels. The experiments demonstrate that the model is reasonable and effective: it not only achieves self-learning of the fusion rule but also performs well in both subjective visual perception and objective evaluation metrics. This work offers a new idea for multi-focus image fusion, which will help to further understand the mechanism of deep learning-based multi-focus image fusion and motivate the development of interpretable image fusion methods with popular neural networks.