
Published: 2021-10-16
DOI: 10.11834/jig.200113
2021 | Volume 26 | Number 10




Image Processing and Coding













Infrared-to-visible image translation based on a parallel generator network
Yu Peilun1, Shi Quan1,2, Wang Han1,2
1. School of Information Science and Technology, Nantong University, Nantong 226019, China;
2. School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China

Abstract

Objective To address the single-structure generator networks used in existing deep learning models for image-to-image translation, we improve the architecture of the conditional generative adversarial network (CGAN) and propose a parallel generator network model that fuses two different structures, a residual network (ResNet) and a dense network (DenseNet). Method Residual and dense generator branch networks are constructed. An input infrared image is passed through the residual and dense branches, each of which generates its own translated visible image, and a segmentation-based linear interpolation algorithm is proposed to fuse the outputs of the two branches into the final translated visible image. To prevent overfitting when training on a small dataset, dropout layers are inserted into the discriminator network, and an optimal-threshold segmentation objective function is designed to obtain the optimal fusion parameters during training of the parallel generator network. Result Tested on a public infrared-visible dataset, the proposed method achieves significant improvements in mean square error (MSE) and structural similarity index (SSIM) over existing deep learning translation models such as Pix2Pix and CycleGAN. Conclusion The parallel generator network effectively combines the advantages of its branch architectures and produces more accurate and realistic translation results.

Key words

modal translation; residual network (ResNet); dense network (DenseNet); linear interpolation fusion; parallel generator network

Infrared-to-visible image translation based on parallel generator network
Yu Peilun1, Shi Quan1,2, Wang Han1,2
1. School of Information Science and Technology, Nantong University, Nantong 226019, China;
2. School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China
Supported by: National Natural Science Foundation of China (61872425, 61771265)

Abstract

Objective Image-to-image translation involves the automated conversion of input data into a corresponding output image that differs in characteristics such as color and style. Examples include converting a photograph to a sketch or a visible image to a semantic label map. Translation has various applications in the field of computer vision, such as facial recognition, person identification, and image dehazing. In 2014, Goodfellow et al. proposed an image generation model based on generative adversarial networks (GANs). This algorithm uses a loss function to classify output images as authentic or fabricated while simultaneously training a generative model to minimize that loss. GANs have achieved impressive image generation results using this adversarial loss. For example, the image-to-image translation framework Pix2Pix was developed on a GAN architecture. Pix2Pix operates by learning a conditional generative model from input-output image pairs, which is better suited to translation tasks. In addition, a U-Net is often used as the generator network in place of a conventional encoder-decoder. While Pix2Pix provides a robust framework for image translation, acquiring sufficient quantities of paired input-output training data can be challenging. To solve this problem, cycle-consistent adversarial networks (CycleGANs) were developed by adding an inverse mapping and a cycle-consistency loss to enforce the relationship between generated and input images. ResNets have also been used as generators to enhance translated image quality. Pix2PixHD offers high-resolution (2 048×1 024 pixels) output using a modified multiscale generator network that includes an instance map in the training step. Although these algorithms have been used effectively for image-to-image translation and a variety of related applications, they typically adopt U-Net or ResNet generators. Such single-structure networks struggle to maintain high performance across multiple evaluation indicators. As such, this study presents a novel parallel stream-based generator network that increases robustness across multiple evaluation indicators. Unlike in previous studies, this model consists of two entirely different convolutional neural network (CNN) structures. The translated visible images output by the two streams are fused with a linear interpolation-based method that allows the parameters of both models to be optimized simultaneously. Method The proposed parallel generator network consists of one ResNet processing stream and one DenseNet processing stream, fused in parallel. The ResNet stream includes down-sampling and nine Res-Unit feature extraction networks. Each Res-Unit is a feedforward block with an elementwise-addition skip connection that bypasses two convolution layers. Similarly, the DenseNet stream includes down-sampling and nine Den-Unit feature extraction networks. Each Den-Unit is composed of three convolutional layers and two concatenation layers, so the Den-Units output a concatenation of the deep feature maps produced by all three convolutional layers. To exploit the advantages of both the ResNet and DenseNet streams, the two generated images are segmented into low- and high-intensity parts with an optimal intensity threshold. A linear interpolation method is then proposed to fuse the segmented output images of the two generator streams in the R, G, and B channels, respectively. We also design an intensity-threshold objective function to obtain the optimal parameters in the generator training process.
In addition, to avoid overfitting during training on a small dataset, we modify the discriminator structure to include four convolution-dropout pairs followed by a convolution layer. Result We compared our model with six state-of-the-art image translation models, including CRN (cascaded refinement networks), SIMS (semi-parametric image synthesis), Pix2Pix (pixel to pixel), CycleGAN (cycle generative adversarial networks), MUNIT (multimodal unsupervised image-to-image translation), and GauGAN (group adaptive normalization generative adversarial networks), on the public AAU (Aalborg University) RainSnow Traffic Surveillance dataset. The experimental dataset was composed of 22 five-minute video sequences acquired from traffic intersections in the Danish cities of Aalborg and Viborg. It was collected at seven different locations with a conventional RGB camera and a thermal camera, each with a resolution of 640×480 pixels, at 20 frames per second. The total experimental dataset consisted of 2 100 RGB-IR image pairs, and each scene was randomly divided into training and test sets in an 80%/20% split. Multi-perspective evaluation results were acquired using the mean square error (MSE), structural similarity index (SSIM), gray-intensity histogram correlation, and Bhattacharyya distance. The advantages of a parallel stream-based generator network were assessed by comparing the proposed parallel generator with a ResNet, a DenseNet, and a residual dense block (RDB)-based hybrid network. We evaluated the average MSE and SSIM values on the test data for the four generators (ParaNet, ResNet, DenseNet, and the RDB-based hybrid). The proposed method achieved an average MSE of 34.835 8, lower than that of the ResNet, DenseNet, and hybrid RDB networks. Simultaneously, the average SSIM value of the proposed method was 0.747 7, higher than that of the DenseNet, ResNet, and RDB networks. These results show that the proposed parallel structure produced more effective fusion results than the RDB-based hybrid network structure. Moreover, comparative experiments demonstrated that the parallel generator structure improves robustness across multi-perspective evaluations of infrared-to-visible image translation. Compared with the six conventional methods, the MSE (lower is better) decreased by at least 22.30%, and the SSIM (higher is better) increased by at least 8.55%. The experimental results show that the proposed parallel generator network-based infrared-to-visible image translation model achieves high performance in terms of both MSE and SSIM compared with conventional deep learning models such as CRN, SIMS, Pix2Pix, CycleGAN, MUNIT, and GauGAN. Conclusion A novel parallel stream architecture-based generator network was proposed for infrared-to-visible image translation. Unlike conventional models, the proposed parallel generator consists of two different network architectures: a ResNet and a DenseNet. Parallel linear combination-based fusion allows the model to incorporate the benefits of both networks simultaneously. The discriminator structure used in the conditional GAN framework was also improved for training and for identifying optimal ParaNet parameters. The experimental results showed that combining different networks leads to improvements in common assessment metrics.
The SSIM and intensity histogram similarity of the proposed parallel generator network were higher, and its MSE lower, than those of existing models. In the future, this algorithm will be applied to image dehazing.

Key words

modal translation; ResNet; DenseNet; linear interpolation fusion; parallel generator network

0 Introduction

Image-to-image translation refers to establishing a mapping from an input image to an output image so as to convert an image between different styles, such as portrait to photograph, sketch to oil painting, or infrared to visible image (Goodfellow et al., 2014; Liu et al., 2019; Isola et al., 2017). Such style translation is widely used in many fields, including image dehazing (Engin et al., 2018; Chen et al., 2019), face recognition (Yao et al., 2018; Zhang et al., 2018), pedestrian detection (Wei et al., 2018), image fusion (Yang et al., 2019), and image enhancement (Huang et al., 2019).

Image-to-image translation has long been an important and challenging problem in computing. In 2014, Goodfellow et al. (2014) proposed the generative adversarial network (GAN), a new framework that trains an image generation model through an adversarial process between two competing models, a generator and a discriminator. The generator model fits the distribution of the sample data, while the discriminator estimates whether an input sample comes from the real training data. During training, the generator G maps noise into the data space through a mapping function, while the discriminator D outputs the probability that the data came from the real training set. The introduction of GANs markedly improved the realism of computer-generated images, and related theory developed rapidly. The conditional generative adversarial network (CGAN) (Mirza and Osindero, 2014) extends the GAN by adding conditions to both the generator and the discriminator to constrain and guide the data generation process, making GANs better suited to cross-modal problems; the conditions are supplementary information such as class labels or data from other modalities.

Isola et al. (2017) proposed Pix2Pix (pixel to pixel), a conditional adversarial network that uses a U-shaped network (U-Net) as its generator. It can effectively synthesize photographs from label maps, reconstruct images from edge maps, and colorize images, and its PatchGAN discriminator can effectively judge, conditioned on the contour map, whether a generated image is real or fake. The core idea of Pix2Pix is to learn the mapping between pairs of training samples. For cross-modal image translation, however, it is difficult to collect and build sufficient paired training data. To perform image translation without paired training data, Zhu et al. (2017) proposed the cycle-consistent adversarial network (CycleGAN). The whole network has a dual structure with two generators and two discriminators. In addition to the adversarial loss of the basic GAN, the loss function includes a cycle-consistency loss (cycle-loss) that forces the generated image to retain the characteristics of the original image, so the dual network structure satisfies cycle consistency throughout training. To obtain high-resolution translation results, Wang et al. (2018) proposed Pix2PixHD, a new conditional adversarial network for generating high-resolution translated images. It builds multiscale generator and discriminator architectures from residual networks (ResNet), introduces an instance map during training, and outputs translated images at a resolution of 2 048 × 1 024 pixels. The method also adds two interactive visualization operations: 1) adding or deleting objects and changing object categories, and 2) allowing users to interactively edit the appearance of objects.

In summary, existing image translation methods are built on the conditional adversarial network (CGAN), and their generators G all adopt a single convolutional network structure such as U-Net or ResNet. Different network structures, however, have their own strengths: the U-Net structure effectively combines low-level and high-level information; ResNet alleviates the vanishing-gradient problem in deep convolutional structures (He et al., 2016); and the dense network (DenseNet) strengthens feature propagation, encourages feature reuse, and greatly reduces the number of parameters (Huang et al., 2017).

Fig. 1 shows the evaluation results of generators based on three network structures, U-Net, ResNet, and DenseNet, for infrared-to-visible image translation. The comparison shows that each structure exhibits its own advantages: the U-Net restores near-field regions well, the ResNet achieves the highest structural similarity index (SSIM), and the DenseNet achieves the smallest mean square error (MSE). None of the three networks alone achieves the best translation result across multiple evaluation metrics.

Fig. 1 Multi-metric comparison of images translated by generator networks with different structures ((a) infrared image; (b) U-Net; (c) ResNet; (d) DenseNet; (e) ground-truth image)

To exploit the characteristics of different network structures simultaneously in image translation, and in contrast to existing generator models, this paper proposes a parallel generator network built from two different network structures. Experiments show that the parallel generator network effectively combines the advantages of the different networks, and the fused translated images are clearly superior to those of existing methods in both MSE and SSIM.

1 Image translation method based on an improved CGAN

1.1 Parallel generator network structure

1.1.1 Overall structure

The proposed parallel generator consists of a residual-unit generator branch module, a dense-unit generator branch module, and an image fusion module, as shown in Fig. 2. The infrared image is fed into the two parallel generator branches, one built from residual units and one built from dense units. Each branch extracts features through convolution layers and then obtains its own translated image after three up-sampling operations. Finally, using the optimal prior knowledge obtained from the training set, the two translated images are linearly interpolated and fused to produce the final translated output.

Fig. 2 Structure of the proposed parallel generator network
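As a concrete illustration of the data flow in Fig. 2, the minimal PyTorch-style sketch below runs two branch generators on the same infrared input and returns both translated images for later fusion; the class name, the branch modules passed in, and the choice of PyTorch are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParallelGenerator(nn.Module):
    """Sketch of the parallel generator of Fig. 2: the same infrared input is
    fed to a residual-unit branch and a dense-unit branch, and the two
    translated images are fused afterwards by Eq. (2) in Section 1.1.4."""

    def __init__(self, res_branch: nn.Module, dense_branch: nn.Module):
        super().__init__()
        self.res_branch = res_branch      # branch built from residual units (Fig. 3)
        self.dense_branch = dense_branch  # branch built from dense units (Fig. 4)

    def forward(self, infrared: torch.Tensor):
        i1 = self.res_branch(infrared)    # translated visible image I1
        i2 = self.dense_branch(infrared)  # translated visible image I2
        return i1, i2                     # fused later with the learned thresholds
```

For a quick smoke test, any pair of modules mapping a 3-channel image to a 3-channel image (e.g. `nn.Conv2d(3, 3, 1)`) can stand in for the branches.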

1.1.2 Residual unit structure

The multilayer structure of a convolutional neural network automatically learns features from the input, and the learned features depend on the network depth: shallow receptive fields capture local features, while deeper receptive fields capture more complex and abstract features. A generator with more layers can therefore produce more detailed and realistic images. However, as the number of convolution layers increases, the training loss first decreases and then suddenly increases; once the depth exceeds a certain value, vanishing gradients appear and the network's performance degrades. The residual network effectively prevents vanishing gradients while improving network efficiency by introducing skip connections into the convolutional neural network.

The structure of the residual unit is shown in Fig. 3. Its output is $a^{2}(\boldsymbol{x}) = g\left(\left(\boldsymbol{w}^{2} * a^{1}(\boldsymbol{x}) + \boldsymbol{b}^{2}\right) + \boldsymbol{x}\right)$, where $a^{1}(\boldsymbol{x}) = g\left(\boldsymbol{w}^{1} * \boldsymbol{x} + \boldsymbol{b}^{1}\right)$ is the value after the preceding convolution in the residual unit, $a$ denotes the output of the activation function, $g$ is the ReLU activation function, and $\boldsymbol{x}$ is the input feature matrix. $\boldsymbol{w}^{2}$ is the weight matrix; if the gradient vanishes, $\boldsymbol{w}^{2} = 0$, and with the bias $b^{2} = 0$ the convolution term becomes zero, so the final output is $a^{2}(\boldsymbol{x}) = \boldsymbol{x}$. This shows that even if the two added layers of the residual unit learn nothing, the unit does not harm the network; conversely, if the residual branch introduces useful feature information, the network with the residual connection learns these effective features and the gradient does not vanish.

Fig. 3 Residual network unit structure
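A minimal sketch of this residual unit, assuming PyTorch, 256 feature channels, and the 3 × 3 convolutions listed in Table 2 (any normalization layers are omitted):

```python
import torch
import torch.nn as nn

class ResUnit(nn.Module):
    """Residual unit of Fig. 3: two 3x3 convolutions with ReLU plus an
    elementwise-addition skip, i.e. a^2(x) = g((w^2 * a^1(x) + b^2) + x)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a1 = self.relu(self.conv1(x))        # a^1(x) = g(w^1 * x + b^1)
        return self.relu(self.conv2(a1) + x) # a^2(x) = g((w^2 * a^1(x) + b^2) + x)
```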

1.1.3 Dense unit structure

Besides skip connections, another way to address the depth problem of convolutional neural networks is to concatenate the convolution layers from front to back in pairs. This is the unit structure of the dense network (DenseNet), shown in Fig. 4, where [ , ] denotes concatenation.

Fig. 4 Dense network unit structure

The dense unit consists of a dense block and a transition layer. The dense block is the densely connected part that realizes feature reuse, while the transition layer is a 1 × 1 convolution layer that compresses the number of channels without changing the feature-map size, so that the output has the same size as the input and transmission efficiency is improved.
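A corresponding sketch of a dense unit, again assuming PyTorch; the three convolutions, two concatenations, and the 1 × 1 transition follow the description above, but the exact channel widths are illustrative rather than the authors' values:

```python
import torch
import torch.nn as nn

class DenseUnit(nn.Module):
    """Dense unit of Fig. 4: three convolutions, two concatenations [ , ],
    and a 1x1 transition that compresses the channels back to the input width."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.transition = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.relu(self.conv1(x))          # conv N
        f2 = self.relu(self.conv2(f1))         # conv N+1
        c1 = torch.cat([f1, f2], dim=1)        # concatenation layer 1
        f3 = self.relu(self.conv3(c1))         # conv N+2
        c2 = torch.cat([c1, f3], dim=1)        # concatenation layer 2
        return self.relu(self.transition(c2))  # 1x1 transition keeps the spatial size
```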

1.1.4 Image fusion based on prior knowledge

Fig. 5 shows how well the intensity histograms of the residual-network and dense-network outputs fit the ground truth in each color channel. The comparison clearly shows that the image generated by the residual network fits better in the high-intensity range near 255, whereas the dense network fits better in the remaining low-intensity range.

Fig. 5 Intensity histogram fitting of the ResNet and DenseNet translated images against the ground truth in the R, G, and B channels ((a) ground-truth image; (b) residual network fitting result; (c) dense network fitting result)

To combine the advantages that the two network structures exhibit in the experiments, this paper proposes a linear interpolation fusion method based on per-channel threshold segmentation; the procedure is shown in Fig. 6.

Fig. 6 Linear interpolation image fusion method based on threshold segmentation

First, the input infrared image passes through the residual-unit network and the dense-unit network, which generate the translated images $\boldsymbol{I}_1$ and $\boldsymbol{I}_2$, respectively. Next, in the blue (B), green (G), and red (R) channels, the intensity thresholds $T_{\rm B}$, $T_{\rm G}$, and $T_{\rm R}$ are used to segment both $\boldsymbol{I}_1$ and $\boldsymbol{I}_2$ into a low-intensity component $\boldsymbol{I}_{\rm L}$ and a high-intensity component $\boldsymbol{I}_{\rm H}$. Then, in each channel, the intensity values of the high- and low-intensity components of the residual-unit image $\boldsymbol{I}_1$ and the dense-unit image $\boldsymbol{I}_2$ are fused by weighted linear combination. Finally, the fusion results of the R, G, and B channels are combined into a color image. To find the optimal segmentation thresholds and weight coefficients, an optimal-parameter objective function is proposed, specifically

$ \left\{\begin{array}{l} J\left(T_{i}\right)=\exp \left(\frac{M S E^{\prime}\left(T_{i}\right)-\mu_{i 1}}{\sigma_{i}}\right)^{2} \times \exp \left(\frac{N^{\prime}\left(T_{i}\right)-\mu_{i 2}}{\sigma_{i}}\right)^{2} \\ T_{i}=\operatorname{argmax}\left(J\left(T_{i}\right)\right), \quad T_{i} \in[0,255], \; i \in\{\mathrm{R}, \mathrm{G}, \mathrm{B}\} \end{array}\right. $ (1)

where $MSE'(T_i)$ is the average fusion fitting error over all training samples in channel $i$ ($i = \mathrm{R}, \mathrm{G}, \mathrm{B}$) when the segmentation threshold of that channel is $T_i$; $N'(T_i)$ is the number of training samples whose fused fitting error in channel $i$ under threshold $T_i$ is smaller than that of both the residual-unit network and the dense-unit network; $\mu_{i1}$ is the expected minimum fitting error of channel $i$; and $\mu_{i2}$ is the number of training samples.
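As an illustration of how the threshold search of Eq. (1) might be carried out for one channel, the sketch below assumes that the per-threshold statistics $MSE'(T)$ and $N'(T)$ have already been accumulated over the training set and stored in arrays indexed by $T$; the function and array names are hypothetical.

```python
import numpy as np

def optimal_threshold(mse_per_t: np.ndarray, n_per_t: np.ndarray,
                      mu1: float, mu2: float, sigma: float) -> int:
    """Return T_i = argmax J(T_i) of Eq. (1) for one color channel.

    mse_per_t[T] holds MSE'(T) and n_per_t[T] holds N'(T) for T = 0..255;
    mu1 is the expected minimum fitting error, mu2 the number of training
    samples, and sigma the scale parameter of Eq. (1)."""
    # Maximizing J(T) = exp(a^2) * exp(b^2) is equivalent to maximizing
    # a^2 + b^2, which avoids overflow of the exponentials.
    a = (mse_per_t - mu1) / sigma
    b = (n_per_t - mu2) / sigma
    return int(np.argmax(a ** 2 + b ** 2))
```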

By maximizing this objective function on the training set, the optimal segmentation threshold of each color channel is obtained. Linear interpolation fusion with the optimal thresholds then yields an output image that combines the advantages of the residual and dense networks. The linear interpolation fusion is computed as

$ \left\{\begin{array}{l} \boldsymbol{I}_{\mathrm{FL}}\left(T_{i}\right)=\lambda_{1} \boldsymbol{I}_{i \mathrm{L} 1}\left(T_{i}\right)+\left(1-\lambda_{1}\right) \boldsymbol{I}_{i \mathrm{L} 2}\left(T_{i}\right) \\ \boldsymbol{I}_{\mathrm{FH}}\left(T_{i}\right)=\lambda_{2} \boldsymbol{I}_{i \mathrm{H} 1}\left(T_{i}\right)+\left(1-\lambda_{2}\right) \boldsymbol{I}_{i \mathrm{H} 2}\left(T_{i}\right) \end{array}\right. $ (2)

where $\lambda_1$ and $\lambda_2$ are the fusion weights of the residual-unit image $\boldsymbol{I}_1$ in the low-intensity and high-intensity components, respectively; $\boldsymbol{I}_{\mathrm{FL}}(T_i)$ is the fusion result of the low-intensity component after segmentation with threshold $T_i$ in channel $i$; and $\boldsymbol{I}_{\mathrm{FH}}(T_i)$ is the fusion result of the corresponding high-intensity component.
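Written out as array operations, the fusion of Eq. (2) could look like the NumPy sketch below. It assumes 8-bit three-channel images, that the low/high split in each channel is taken from the intensity of $\boldsymbol{I}_1$, and that the thresholds and weights were already selected on the training set; these layout details are our assumptions.

```python
import numpy as np

def fuse_translated_images(i1: np.ndarray, i2: np.ndarray,
                           thresholds, lambdas) -> np.ndarray:
    """Fuse the ResNet output i1 and the DenseNet output i2 (H x W x 3, uint8)
    channel by channel according to Eq. (2).

    thresholds: per-channel segmentation thresholds (T_B, T_G, T_R).
    lambdas:    (lambda_1, lambda_2), weights of i1 in the low / high parts."""
    lam1, lam2 = lambdas
    i1f, i2f = i1.astype(np.float32), i2.astype(np.float32)
    fused = np.empty_like(i1f)
    for c, t in enumerate(thresholds):
        low = i1f[..., c] <= t  # low-intensity pixels of I1 in channel c (assumed criterion)
        fused[..., c] = np.where(
            low,
            lam1 * i1f[..., c] + (1.0 - lam1) * i2f[..., c],   # I_FL of Eq. (2)
            lam2 * i1f[..., c] + (1.0 - lam2) * i2f[..., c])   # I_FH of Eq. (2)
    return np.clip(fused, 0, 255).astype(np.uint8)
```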

1.2 Discriminator network structure

To prevent the parallel generator network, which is built from nine residual and nine dense unit modules, from overfitting when trained on a small dataset (Peng et al., 2019), a dropout layer is added after each of the first, second, third, and fourth convolution layers of the discriminator. The modified discriminator structure is shown in Fig. 7.

Fig. 7 Modified discriminator network structure

1.3 Loss function

The CGAN is optimized by alternately optimizing the generator and the discriminator. When optimizing the discriminator, the objective function is

$ \begin{gathered} L_{\mathrm{cGAN}}(G, D)=E_{x, y}[\log D(x, y)]+ \\ E_{x}[\log (1-D(x, G(x)))] \end{gathered} $ (3)

where $D$ denotes the discriminator and $G$ the generator. When the generator is fixed and the discriminator is optimized, the objective above is maximized so that $D(x, G(x)) \to 0$ and $D(x, y) \to 1$; the discriminator can then correctly distinguish real training images from generated images.

When optimizing the generator, the objective function can be simplified from Eq. (3) to

$ L_{\mathrm{cGAN}}(G, D)=E_{x}[\log (1-D(x, G(x)))] $ (4)

In this case, the objective in Eq. (4) is minimized so that $D(x, G(x)) \to 1$; the aim is to improve the quality of the generated image until it can fool the discriminator. Because the discriminator's response to real images is always 1 while the generator is being optimized, the term $E_{x,y}[\log D(x,y)]$ is numerically always 0, so the objective being optimized can still be regarded as Eq. (3). Through this mutual adversarial process, the generator learns to produce images whose authenticity the human eye cannot distinguish.

To ensure that the generated image is sharp and remains related in content to its corresponding infrared image, the L1 norm is retained in the objective function, namely

$ L_{\mathrm{L} 1}(G)=E_{x, y}\left[\|y-G(x)\|_{1}\right] $ (5)

Therefore, the final objective function of the generator is

$ G^{*}=\arg \min \limits_{G} L_{\mathrm{cGAN}}\left(G, D_{0}\right)+\lambda L_{\mathrm{L} 1}(G) $ (6)

where $D_0$ denotes the optimized discriminator.
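One way to realize the alternating optimization of Eqs. (3)-(6) is sketched below in PyTorch. The binary cross-entropy on the discriminator's patch outputs, the value of $\lambda$, and the assumption that $D$ takes the condition and the image as two arguments (concatenated internally) are illustrative choices, not the authors' exact training code.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # realizes the log terms of Eqs. (3) and (4)
l1_loss = nn.L1Loss()         # realizes the L1 term of Eq. (5)

def train_step(G, D, opt_G, opt_D, x, y, lam=100.0):
    """One alternating update: maximize Eq. (3) w.r.t. D, then minimize Eq. (6) w.r.t. G.
    x is the infrared input and y the paired visible ground truth."""
    # Discriminator update: push D(x, y) -> 1 and D(x, G(x)) -> 0.
    with torch.no_grad():
        fake = G(x)
    d_real = D(x, y)
    d_fake = D(x, fake)
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator update: push D(x, G(x)) -> 1 and keep G(x) close to y, as in Eq. (6).
    fake = G(x)
    d_fake = D(x, fake)
    loss_G = bce(d_fake, torch.ones_like(d_fake)) + lam * l1_loss(fake, y)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```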

2 Experimental results

2.1 Experimental procedure, data, and evaluation methods

To verify the correctness and feasibility of the proposed method, 2 100 visible-infrared image pairs were selected from a public multimodal traffic-intersection dataset (Jensen et al., 2018) for training, testing, and evaluating the infrared-to-visible translation model, with 80% used for training and 20% for testing.

The experimental procedure is as follows. First, all compared models are trained on the training samples; then the trained models are used to translate the test images; finally, the quality of the translated images is quantitatively evaluated and compared from multiple perspectives using MSE, SSIM, and the degree of gray-level histogram fitting (similarity).
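The evaluation measures used in this paper (MSE, SSIM, and histogram similarity in terms of correlation and the Bhattacharyya distance, cf. Section 2.3) can be computed with standard libraries. The sketch below, which assumes uint8 BGR images and the availability of OpenCV and scikit-image, shows one possible implementation for a single translated/ground-truth pair; it is not claimed to reproduce the exact settings used in the paper.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def evaluate_pair(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compute MSE, SSIM, and gray-level histogram similarity for one pair
    of H x W x 3 uint8 images."""
    mse = float(np.mean((pred.astype(np.float64) - truth.astype(np.float64)) ** 2))
    ssim = structural_similarity(pred, truth, channel_axis=2)  # multichannel SSIM

    pred_gray = cv2.cvtColor(pred, cv2.COLOR_BGR2GRAY)
    truth_gray = cv2.cvtColor(truth, cv2.COLOR_BGR2GRAY)
    h1 = cv2.calcHist([pred_gray], [0], None, [256], [0, 256])
    h2 = cv2.calcHist([truth_gray], [0], None, [256], [0, 256])
    cv2.normalize(h1, h1)
    cv2.normalize(h2, h2)
    corr = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)         # higher is better
    bhat = cv2.compareHist(h1, h2, cv2.HISTCMP_BHATTACHARYYA)  # lower is better
    return {"MSE": mse, "SSIM": float(ssim),
            "correlation": corr, "Bhattacharyya": bhat}
```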

2.2 Network structure parameters

The improved CGAN consists of a generator G and a discriminator D. The discriminator D modifies the PatchGAN proposed by Isola et al. (2017), as shown in Fig. 7; its structural parameters are listed in Table 1.

Table 1 Parameters of the modified discriminator network

Layer | Kernel size / stride / padding | Output dimensions
Conv layer 1 | 4×4 / 2 / 1 | 128×128×64
Dropout layer 1 | keep probability 0.8 | 128×128×64
Conv layer 2 | 4×4 / 2 / 1 | 64×64×128
Dropout layer 2 | keep probability 0.8 | 64×64×128
Conv layer 3 | 4×4 / 2 / 1 | 32×32×256
Dropout layer 3 | keep probability 0.8 | 32×32×256
Conv layer 4 | 4×4 / 1 / 1 | 31×31×512
Dropout layer 4 | keep probability 0.8 | 31×31×512
Conv layer 5 | 4×4 / 1 / 1 | 30×30×1
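Table 1 maps almost directly onto a small PyTorch module. The sketch below follows the listed kernel/stride/padding values and the keep probability of 0.8 (i.e. a drop probability of 0.2); the 6-channel input (an infrared-visible pair concatenated along the channel axis) and the LeakyReLU activations are our assumptions.

```python
import torch
import torch.nn as nn

class ModifiedDiscriminator(nn.Module):
    """PatchGAN-style discriminator of Table 1: four conv + dropout pairs
    followed by a final convolution producing a 30 x 30 patch score map."""

    def __init__(self, in_channels: int = 6, drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Dropout2d(drop),                                  # 128 x 128 x 64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Dropout2d(drop),                                  # 64 x 64 x 128
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Dropout2d(drop),                                  # 32 x 32 x 256
            nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.LeakyReLU(0.2),
            nn.Dropout2d(drop),                                  # 31 x 31 x 512
            nn.Conv2d(512, 1, 4, stride=1, padding=1),           # 30 x 30 x 1
        )

    def forward(self, condition: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # The condition (infrared image) and the candidate visible image are
        # concatenated along the channel axis, as in Pix2Pix-style setups.
        return self.net(torch.cat([condition, image], dim=1))
```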

The inserted dropout layers effectively prevent overfitting during training on a small dataset and improve both the accuracy and the robustness of the discrimination. The parallel generator structure of the generator G is shown in Fig. 2; the residual and dense generator branches each consist of nine units, whose structures are shown in Figs. 3 and 4. The structural parameters of the parallel generator network are listed in Table 2.

Table 2 Parameters of the parallel generator network structure

Stage | Structure | Kernel / stride / padding | Output dimensions
Down-sampling | Conv layer 1 | 7×7 / 1 / 3 | 256×256×64
Down-sampling | Conv layer 2 | 3×3 / 2 / 0.5 | 128×128×128
Down-sampling | Conv layer 3 | 3×3 / 2 / 0.5 | 64×64×256
$N$-th residual unit | Conv layer $N$ | 3×3 / 1 / 1 | 64×64×256
$N$-th residual unit | Conv layer $N$+1 | 3×3 / 1 / 1 | 64×64×256
$N$-th residual unit | Elementwise addition layer | - | 64×64×256
$N$-th dense unit | Conv layer $N$ | 3×3 / 1 / 1 | 64×64×256
$N$-th dense unit | Conv layer $N$+1 | 3×3 / 1 / 1 | 64×64×512
$N$-th dense unit | Concatenation layer 1 | - | 64×64×256
$N$-th dense unit | Conv layer $N$+2 | 3×3 / 1 / 1 | 64×64×768
$N$-th dense unit | Concatenation layer 2 | - | 64×64×512
$N$-th dense unit | Transition layer | 1×1 / 1 / 0 | 64×64×256
Up-sampling | Deconv layer 1 | 3×3 / 2 / 0.5 | 128×128×128
Up-sampling | Deconv layer 2 | 3×3 / 2 / 0.5 | 256×256×64
Up-sampling | Conv layer | 7×7 / 1 / 3 | 256×256×3
Note: $N$ is the index of the residual or dense unit; in the experiments $N \in [1, 9]$. "-" indicates that the parameter does not apply.
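Following Table 2, one generator branch could be assembled as in the sketch below, where `unit_factory` supplies either residual or dense units (such as the `ResUnit` and `DenseUnit` sketches above). The interpretation of the stride-2 padding entries, the ReLU activations, and the final Tanh are our assumptions.

```python
import torch.nn as nn

def make_branch(unit_factory, n_units: int = 9) -> nn.Sequential:
    """Assemble one generator branch following Table 2: three down-sampling
    convolutions, nine feature-extraction units, and an up-sampling tail.
    unit_factory() must return a module mapping 64x64x256 features to the
    same shape (e.g. a residual or dense unit)."""
    layers = [
        nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3), nn.ReLU(True),     # 256x256x64
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(True),   # 128x128x128
        nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(True),  # 64x64x256
    ]
    layers += [unit_factory() for _ in range(n_units)]                           # nine units
    layers += [
        nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),  # 128x128x128
        nn.ReLU(True),
        nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),   # 256x256x64
        nn.ReLU(True),
        nn.Conv2d(64, 3, kernel_size=7, stride=1, padding=3), nn.Tanh(),         # 256x256x3
    ]
    return nn.Sequential(*layers)
```

The residual branch would then be `make_branch(lambda: ResUnit(256))` and the dense branch `make_branch(lambda: DenseUnit(256))`, with the two classes sketched in Sections 1.1.2 and 1.1.3.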

2.3 Effectiveness of the discriminator network structure

To verify the effectiveness of the improved discriminator, the parallel network is trained with the discriminator before and after the modification, and the modality translation results are compared using MSE, SSIM, and gray-level histogram fitting (Bhattacharyya distance and correlation).

Fig. 8 compares the evaluation results of the parallel network trained and tested with the discriminator before and after the modification. After the dropout layers are added, the results of the parallel network trained on the small training set improve significantly on all four evaluation criteria.

Fig. 8 Performance of the modified discriminator network ((a) MSE; (b) SSIM; (c) Bhattacharyya distance; (d) correlation)

2.4 Analysis of image translation examples

Using intensity histogram fitting, MSE, and SSIM as quantitative metrics, the proposed parallel generator network is comprehensively evaluated from multiple perspectives and compared with image translation methods such as Pix2Pix, CycleGAN, and MUNIT (multimodal unsupervised image-to-image translation) (Huang et al., 2018).

Fig. 9 shows infrared-to-visible translation examples for four methods. Figs. 9(a) and (b) are the ground-truth visible image and the input far-infrared image; Figs. 9(c)-(f) are the outputs of Pix2Pix, CycleGAN, MUNIT, and the parallel generator network; Figs. 9(g)-(j) are the gray-level histogram fitting results of the Pix2Pix, CycleGAN, MUNIT, and parallel-network translations against the real image, which allow a more intuitive comparison of the translation performance of the different methods.

Fig. 9 Comparison examples of different methods and the proposed method ((a) target image; (b) infrared image; (c) Pix2Pix; (d) CycleGAN; (e) MUNIT; (f) proposed; (g) Pix2Pix histogram; (h) CycleGAN histogram; (i) MUNIT histogram; (j) proposed histogram)

As Fig. 9 shows, the CycleGAN translation is slightly distorted overall: compared with the ground truth, the details in the yellow-boxed region are heavily smeared, and considerable detail is lost in the red- and green-boxed regions. The Pix2Pix translation is quite similar to the target image, with good restoration of the near-field yellow region, but it also loses much detail in the distant red- and green-boxed regions. The MUNIT translation restores the distant scene well, but the near-field yellow-boxed region is heavily smeared. The image produced by the proposed parallel network is clearly sharper overall than the other methods and closely matches the target image in detail, with only slight distortion of the vehicle body in the green region; its visual quality is also better than that of the other three methods.

The degree of gray-level histogram fitting reflects the similarity between the corresponding images: the higher the histogram similarity and the smaller the fitting difference, the more similar the two images. Over the whole gray-level range, Pix2Pix deviates considerably in the high-intensity range; the histogram of the CycleGAN translation differs clearly from the real image around the intensity peak, with detail lost in the yellow-boxed region; and MUNIT fits better than Pix2Pix and CycleGAN but shows an obvious leftward shift at the intensity peak. The gray-level histogram of the image translated by the proposed method fits well over the whole range, so the translated image is closer to the target image.

2.5 Comparative experiments and performance evaluation

To further quantify the performance of the proposed method against the other methods, Table 3 lists the MSE and SSIM results over all test samples for each method. MSE reflects the difference in intensity between the translated image and the target image; smaller values indicate smaller differences. The MSE between the parallel-network translation and the target image is 36.64, the lowest among all methods and at least 22.3% lower than that of the other models. SSIM is an important measure of the structural similarity between two images; larger values indicate higher similarity. The SSIM of the parallel-network translation is 78.82%, at least 8.55% higher than that of the other methods. Both the MSE and the SSIM of the parallel-network translation are clearly better than those of the compared methods. Fig. 10 shows several randomly selected examples of infrared-to-visible translation results of the proposed method on the public test set.

Table 3 MSE and SSIM of different methods

Method | MSE | SSIM
CRN (Chen and Koltun, 2017) | 75.942 2 | 0.308 9
SIMS (Qi et al., 2018) | 73.025 3 | 0.326 1
Pix2Pix (Isola et al., 2017) | 72.979 7 | 0.367 1
CycleGAN (Zhu et al., 2017) | 83.043 5 | 0.536 4
MUNIT (Huang et al., 2018) | 50.742 6 | 0.690 8
GauGAN (Park et al., 2019) | 47.163 2 | 0.726 1
Proposed | 36.644 2 | 0.788 2
Note: bold indicates the best result in each column. CRN: cascaded refinement networks; SIMS: semi-parametric image synthesis; GauGAN: group adaptive normalization generative adversarial networks.
Fig. 10 Examples of infrared-to-visible image translation by the proposed method ((a) target images; (b) images translated by the proposed model)

3 Conclusion

To address the single-structure generator networks of existing CGAN-based image translation methods, and based on a comparative analysis of experimental results, a parallel generator network structure is proposed. The experimental results show that, at the level of visual perception, the images translated by the proposed method are more realistic and retain most of the details of the original image; at the level of quantitative metrics, the parallel generator structure achieves significant improvements in pixel-intensity MSE, SSIM, and gray-level histogram fitting over methods such as Pix2Pix, CycleGAN, and MUNIT. In addition, the improvement of the discriminator network effectively enhances the robustness of the parallel network under small-sample training conditions. The results demonstrate that a parallel image generator network can effectively fuse the advantages of different network structures and perform translation between different image styles (modalities). In future work, we will study the application of this model to multimodal image dehazing based on visible-infrared images.

Acknowledgements: The experimental data are the 2018 infrared-visible image pairs of urban intersections provided by Morten of Aalborg University, Denmark, to whom we express our thanks.

References

  • Chen Q F and Koltun V. 2017. Photographic image synthesis with cascaded refinement networks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 1520-1529[DOI: 10.1109/ICCV.2017.168]
  • Chen W, Li Z W, Yin Z. 2019. Image deblurring algorithm based on generative adversarial network. Information and Control, 48(6): 707-714, 722 (陈玮, 李正旺, 尹钟. 2019. 基于生成对抗网络的图像去雾算法. 信息与控制, 48(6): 707-714, 722) [DOI:10.13976/j.cnki.xk.2019.9078]
  • Engin D, Genç A and Ekenel H K. 2018. Cycle-Dehaze: enhanced CycleGAN for single image Dehazing//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR). Salt Lake City, USA: IEEE: 825-833[DOI: 10.1109/CVPRW.2018.00127]
  • Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 2672-2680
  • He K M, Zhang Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]
  • Huang G, Liu Z, van der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 12[DOI: 10.1109/CVPR.2017.243]
  • Huang H, Tao H J, Wang H F. 2019. Low-illumination image enhancement using a conditional generative adversarial network. Journal of Image and Graphics, 24(12): 2149-2158 (黄鐄, 陶海军, 王海峰. 2019. 条件生成对抗网络的低照度图像增强方法. 中国图象图形学报, 24(12): 2149-2158) [DOI:10.11834/jig.190145]
  • Huang X, Liu M Y, Belongie S and Kautz J. 2018. Multimodal unsupervised image-to-image translation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 179-196[DOI: 10.1007/978-3-030-01219-9_11]
  • Isola P, Zhu J Y, Zhou T H and Efros A A. 2017. Image-to-image translation with conditional adversarial networks[EB/OL]. [2016-11-21]. https://arxiv.org/pdf/1611.07004.pdf
  • Jensen M B, Bahnsen C H, Lahrmann H S, Madsen T K O, Moeslund T B. 2018. Collecting traffic video data using portable poles: survey, proposal, and analysis. Journal of Transportation Technologies, 8(4): 376-400 [DOI:10.4236/jtts.2018.84021]
  • Liu Z L, Zhu W, Yuan Z Y. 2019. Image instance style transfer combined with fully convolutional network and CycleGAN. Journal of Image and Graphics, 24(8): 1283-1291 (刘哲良, 朱玮, 袁梓洋. 2019. 结合全卷积网络与CycleGAN的图像实例风格迁移. 中国图象图形学报, 24(8): 1283-1291) [DOI:10.11834/jig.180624]
  • Mirza M and Osindero S. 2014. Conditional generative adversarial nets[EB/OL]. [2014-11-06]. https://arxiv.org/pdf/1411.1784.pdf
  • Park T, Liu M Y, Wang T C and Zhu J Y. 2019. Semantic image synthesis with spatially-adaptive normalization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 2332-2341[DOI: 10.1109/CVPR.2019.00244]
  • Peng Z M, Li Z C, Zhang J G, Li Y, Qi G J and Tang J H. 2019. Few-shot image recognition with knowledge transfer//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 441-449[DOI: 10.1109/ICCV.2019.00053]
  • Qi X J, Chen Q F, Jia J Y and Koltun V. 2018. Semi-parametric Image Synthesis//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8808-8816[DOI: 10.1109/CVPR.2018.00918]
  • Wang T C, Liu M Y, Zhu J Y, Tao A, Kautz J and Catanzaro B. 2018. High-resolution image synthesis and semantic manipulation with conditional GANs//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8798-8807[DOI: 10.1109/CVPR.2018.00917]
  • Wei L H, Zhang S L, Gao W and Tian Q. 2018. Person transfer GAN to bridge domain gap for person re-identification//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 79-88[DOI: 10.1109/CVPR.2018.00016]
  • Yang X L, Lin S Z, Lu X F, Wang L F, Li D W, Wang B. 2019. Multimodal image fusion based on generative adversarial networks. Laser and Optoelectronics Progress, 56(16): #161004 (杨晓莉, 蔺素珍, 禄晓飞, 王丽芳, 李大威, 王斌. 2019. 基于生成对抗网络的多模态图像融合. 激光与光电子学进展, 56(16): #161004) [DOI:10.3788/LOP56.161004]
  • Yao N M, Guo Q P, Qiao F C, Chen H, Wang H A. 2018. Robust facial expression recognition with generative adversarial networks. Acta Automatica Sinica, 44(5): 865-877 (姚乃明, 郭清沛, 乔逢春, 陈辉, 王宏安. 2018. 基于生成式对抗网络的鲁棒人脸表情识别. 自动化学报, 44(5): 865-877) [DOI:10.16383/j.aas.2018.c170477]
  • Zhang H, Han H, Cui J Y, Shan S G and Chen X L. 2018. RGB-D face recognition via deep complementary and common feature learning//Proceedings of the 13th IEEE International Conference on Automatic Face and Gesture Recognition. Xi'an, China: IEEE: 8-15[DOI: 10.1109/FG.2018.00012]
  • Zhu J Y, Park T, Isola P and Efros A A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2242-2251[DOI: 10.1109/ICCV.2017.244]