发布时间: 2019-12-16
图像分析和识别
收稿日期: 2019-04-12; 修回日期: 2019-07-04; 预印本日期: 2019-07-11
基金项目: 国家自然科学基金项目(61472037,61433003)
第一作者简介:
闫美阳, 1994年生, 女, 硕士研究生, 主要研究方向为计算机视觉。E-mail:13752593085@163.com.
中图法分类号: TP301.6
文献标识码: A
文章编号: 1006-8961(2019)12-2243-12
摘要
目的 针对深度学习严重依赖大样本的问题,提出多源域混淆的双流深度迁移学习方法,提升传统深度迁移学习中迁移特征的适用性。方法 采用多源域的迁移策略,增大源域对目标域迁移特征的覆盖率;提出两阶段适配学习的方法,获得域不变的深层特征表示和域间分类器相似的识别结果;将自然光图像2维特征和深度图像3维特征进行融合,在提高小样本数据特征维度的同时抑制复杂背景对目标识别的干扰。此外,为改善小样本机器学习中分类器的识别性能,在传统的softmax损失中引入中心损失,增强分类损失函数的惩罚监督能力。结果 在公开的少量手势样本数据集上进行对比实验,结果表明,相对于传统的识别模型和迁移模型,基于本文模型的识别准确率更高,在以DenseNet-169为预训练网络的模型中,识别率达到了97.17%。结论 利用多源域数据集、两阶段适配学习、双流卷积融合以及复合损失函数,构建了多源域混淆的双流深度迁移学习模型。所提模型可增大源域和目标域的数据分布匹配率、丰富目标样本特征维度、提升损失函数的监督性能,改善小样本场景下迁移特征的适用性。
关键词
小样本; 迁移学习; 多源域; 双流卷积融合; 域混淆
Abstract
Objective Feature extraction can be completed automatically by using the nonlinear network structures of deep learning. Thus, multi-dimensional features can be obtained through the distributed expression of features. Deep convolutional neural networks rely on a large volume of valid data. However, obtaining a large volume of effectively labeled data is often labor-intensive and time-consuming. Hence, deep learning that depends on large labeled datasets remains a challenge. Presently, deep convolutional neural networks on few-shot datasets have become a popular research topic in deep learning, and deep learning combined with transfer learning is the latest approach to the problem of data poverty. In this paper, two-stream deep transfer learning with multi-source domain confusion is proposed to address the limited adaptation of the general features that the source model extracts on the target data. Method The proposed deep transfer learning network is based on the confusion-domain deep transfer learning model. First, a multi-source domain transfer strategy is used to increase the coverage of target-domain transfer features by the source domains. Second, a two-stage adaptive learning method is proposed to achieve domain-invariant deep feature representations and similar recognition results from the inter-domain classifiers. Third, a data fusion strategy for natural light images with two-dimensional features and depth images with three-dimensional features is proposed to enrich the feature dimensions of few-shot datasets and suppress the influence of complex backgrounds. Finally, a composite loss function is presented with the softmax and center loss functions to improve the recognition performance of the classifier in few-shot deep learning; intra- and inter-class distances are shortened and expanded, respectively.
The proposed method increases the recognition rate by improving the feature extraction and loss function of the deep convolutional neural network. Regarding feature extraction, the efficiency of feature transfer is enhanced, and the feature parameters of few-shot datasets are enriched by multi-source deep transfer features and feature fusion. The efficiency of multi-source domain feature transfer is improved with three kinds of loss functions. The inter- and intra-class feature distances are adjusted by introducing the center loss function. To extract the deep adaptation features, the difference loss of the domain-invariant deep feature representation is calculated, and the inter-domain features are aligned with one another. In addition, the mutual adaptation of different domain classifiers is designed with the difference loss function. A two-stream deep transfer learning model with multi-source domain confusion is developed by combining the above methods. The model enhances the characterization of targets in complex contexts while improving the applicability of transfer features. Gesture recognition experiments are conducted on public datasets to verify the validity of the proposed model. Quantitative analysis of comparative experiments shows that the performance of the proposed model is superior to that of other classical gesture recognition models. Result The two-stream deep transfer learning model with multi-source domain confusion demonstrates more effective gesture recognition performance on few-shot datasets than previous models. In the model with the DenseNet-169 pre-training network, the proposed network achieves 97.17% accuracy. Compared with other classic gesture recognition and transfer learning models, the two-stream deep transfer learning model with multi-source domain confusion has 2.34% higher accuracy. The recognition performance of the proposed model on a small gesture sample dataset is evaluated through comparison as follows.
First, compared with other transfer learning models, the proposed framework of the two-stream fusion model with multi-source domain confusion transfer learning can effectively complete the transfer of features. Second, the performance of the proposed fusion model is superior to that of the traditional two-stream information fusion model, which verifies that the proposed fusion model can improve recognition efficiency while effectively combining natural light and depth image features. Conclusion A deep transfer learning method with multi-source domain confusion is proposed. By studying the principle and mechanism of deep learning and transfer learning, a multi-source domain transfer method that covers the characteristics of the target domain is proposed. First, an adaptable feature is introduced to enhance the description capability of the transfer feature. Second, a two-stage adaptive learning method is proposed to represent the deep features of the invariant domain and reduce the prediction differences of inter-domain classifiers. Third, combined with the three-dimensional feature information of the depth image, a two-stream convolution fusion strategy that can realize the full use of scene information is proposed. Through the fusion of natural light imaging and depth information, the capability to segment the foreground and background in the image is improved, and the data fusion strategy realizes the recombination of the two types of modal information. Finally, the efficiency of multi-source domain feature transfer is improved by three kinds of loss functions. To improve the recognition performance of the classifier in few-shot datasets, the penalty performance of classifiers on inter- and intra-class features is adjusted by introducing center loss to softmax loss. The inter-domain features are adapted to one another by calculating the loss of the domain-invariant deep feature.
The mutual adaptation of different domain classifiers is designed with the difference loss function of inter-domain classifiers. The two-stream deep transfer learning model with multi-source domain confusion is generated through two-stage adaptive learning, which can facilitate the feature transfer from the source domain to the target domain. The model structure of the two-stream deep transfer learning with multi-source domain confusion is designed by combining the proposed deep transfer learning method and data fusion strategy with multi-source domain confusion. On the public gesture dataset, the superior performance of the proposed model is verified through comparisons from multiple angles. Experimental results prove that the proposed method can increase the matching rate of the source and target domains, enrich the feature dimension, and enhance the penalty supervision capability of the loss function. The proposed method can improve the recognition accuracy of the deep transfer network on few-shot datasets.
Key words
few-shot datasets; transfer learning; multi-source domain; two-stream convolution fusion; domain confusion
0 引言
近年来,深度学习模型在图像处理领域的能力得到了显著提升,已成为人工智能领域最为活跃的研究热点之一。深度学习中自主学习图像特征的方法,一般需要海量的标注数据做支撑,在小规模图像数据集上应用深度卷积神经网络,得到的效果与传统的人工提取特征法相比,并没有明显的提升,而获得大量的标注数据往往需要耗费极高的人力、物力以及时间成本。因此,如何有效地解决数据贫乏问题已经成为深度学习领域的热点研究问题。由于数据缺乏严重制约了深度学习的发展,学术界开始研究相关算法并引入到深度学习中,研究方向包括迁移学习和数据增强。
在迁移学习方面,具体有模型迁移学习、度量空间学习以及元数据学习3类形式。Oquab等人(2014)最早发表了模型迁移学习的研究成果,将ImageNet(Deng等,2009)大数据集上训练得到的AlexNet(Krizhevsky等,2012)模型应用到小样本数据集上,借助学习率和全连接层输出参数的调整,获得目标模型;Koch等人(2015)的孪生网络、Vinyals等人(2016)的匹配网络、Snell等人(2017)的原型网络以及Garcia等人(2017)提出的图神经网络,都是借助度量空间对样本特征间的距离分布进行建模,进而实现同类样本靠近、异类样本远离的目标;元学习即学会学习,通过学习大量的任务,获得内在的元知识,利用神经网络学会比较元知识与新知识的区别,快速处理同类的新任务。如Santoro等人(2016)和Ravi等人(2017)结合神经网络图灵机与长短期记忆网络形成的元分类器以及Finn等人(2017)通过梯度下降策略训练的元分类器实现了元知识的多任务适应。
在数据增强方面,具体有基于神经网络、特征映射以及图像处理3类形式。基于神经网络的数据增强,以生成对抗网络(GAN)为代表,Huang等人(2018)提出数据增强式GAN,利用生成和判别的联合损失函数,实现跨领域的数据增广;特征映射的数据增强方法,如Chen等人(2018)提出的语义增强手段,借助语义空间的丰富信息,通过编码器将视觉特征映射到语义空间,实现视觉特征在语义空间的特征增广,最后从语义空间映射回视觉空间获取增广后的图像样本。利用图像处理的数据增强方法,主要有颜色抖动、主成分抖动、随机剪切、尺度变换、水平或垂直翻转、旋转或仿射变换以及添加噪声等方法,如Liu等人(2017)借助样本平移、添加噪声以及线性组合等进行数据增广,解决了拉曼光谱数据集样本数量少和不均衡的问题,与传统的机器学习算法相比,识别准确度提高了20%~40%。
基于迁移学习和数据增强的算法在解决少量样本深度学习上都存在一些问题。其中,基于迁移学习算法的缺点主要有:1)由于提取特征的能力与模型的网络结构存在紧密联系,因此模型迁移方法的识别性能过分依赖于预训练模型;2)在度量空间学习中,源域与目标域的相似程度作为唯一的迁移学习手段,其度量距离标准的选择严重影响了最终的实验结果;3)元学习方法大幅度增加了迁移学习算法的实现难度,如长短期记忆网络(LSTM)是构建元关系的常用网络,该网络结构的运算存在一定的复杂度,其模型训练也不易于实现。另外,基于数据增强的算法也存在一些问题:1)数据增强即利用特定手段对数据进行扩增,将小样本数据变换成大数据集,其网络输入的原始数据实质仍是大样本数据,没有根本解决小样本深度学习问题;2)针对生成对抗网络的数据增广方法,因将原始数据与增广后数据的相似性作为训练标准,所以增广后的特征维度不会发生较大改变,因此其数据增强的效果并不明显;3)针对特征映射的图像增广方法,其映射规律尤为重要,否则会生成与原始数据集类别不一致的样本。
针对以上问题,本文提出了一种多源域混淆的双流深度迁移学习方法,借助多源域混淆的手段提高迁移特征对目标域特征的适应能力,通过双流卷积网络实现目标特征不同维度的融合,并利用复合损失函数增强识别模型的分类性能。该方法在未经数据增强的原始目标域数据集上进行验证,结果显示,相比于传统的迁移学习,所提方法在多个预训练模型上的准确率和训练收敛速率都有明显提高。
1 双流深度迁移学习模型构建
本文算法针对多源域混淆的双流深度迁移学习模型的构建阶段,主要由3个步骤组成:1)通过分析源域数量对迁移特征的影响,提出多源域与目标域的两阶段适配学习策略;2)通过结合多源特征迁移以及双流卷积特征融合的策略,设计目标域特征提取的操作流程;3)建立模型的复合损失函数。
1.1 多源域混淆的深度迁移学习策略
对于少量样本数据集,在源域和目标域数据样本相似度较低的情况下,通过深度神经网络和迁移学习的结合得到的识别效果并不理想。如图 1所示,相比于单源域,多源域的迁移学习大幅度提高了对目标域的覆盖率,增强了源域特征的迁移效果。
在多源域迁移学习中,当源域和目标域出现数据分布混淆时,将导致相应判别结构的错误对齐,如图 2所示,源域手势类别“w”错误地对齐了目标域手势类别“a”的特征分布。
本文采用深度域混淆的学习策略(DDC)(Tzeng等,2014),通过学习到深层域间共享特征,优化域间概率分布差异和分类误差,实现源域与目标域间的适配学习,提高模型对目标任务的适应度,其优化策略如图 3所示。
DDC主要借助核空间中源域与目标域概率分布均值嵌入的差异,即最大均值差异(MMD)(Borgwardt等,2006)来度量两域的分布距离。已知源域与目标域的样本集分别为
$ \boldsymbol{X}_{\mathrm{s}}=\left\{x_{1}^{\mathrm{s}}, x_{2}^{\mathrm{s}}, \cdots, x_{\left|\boldsymbol{X}_{\mathrm{s}}\right|}^{\mathrm{s}}\right\}, \boldsymbol{X}_{\mathrm{t}}=\left\{x_{1}^{\mathrm{t}}, x_{2}^{\mathrm{t}}, \cdots, x_{\left|\boldsymbol{X}_{\mathrm{t}}\right|}^{\mathrm{t}}\right\} $ | (1) |
MMD的数学表达式为
$ \begin{array}{l} \;\;\;\;\;\;\;\;\;{\mathop{\rm MMD}\nolimits} [F, p, q] = \\ \mathop {\sup }\limits_{\left\| f \right\| \le 1} \left({{E_{{x_i} \sim p}}\left[ {f\left({{x_i}} \right)} \right] - {E_{{x_j} \sim q}}\left[ {f\left({{x_j}} \right)} \right]} \right) \end{array} $ | (2) |
经验估计为
$ \operatorname{MMD}\left[F, \boldsymbol{X}_{\rm{s}}, \boldsymbol{X}_{\rm{t}}\right]=\left\|\sum\limits_{x_{i} \in {\mathit{\pmb{X}}}_{\rm{s}}}\frac{\phi\left(x_{i}\right)}{\left|\boldsymbol{X}_{\rm{s}}\right|}-\sum\limits_{x_{j} \in {\mathit{\pmb{X}}}_{\rm{t}}} \frac{\phi\left(x_{j}\right)}{\left|\boldsymbol{X}_{\rm{t}}\right|}\right\|_{\mathrm{H}} $ | (3) |
$ L_{\mathrm{sm}}\left(\mathit{\pmb{X}}_{\mathrm{t}}, \mathit{\pmb{X}}_{\mathrm{s}}\right)=\operatorname{MMD}^{2}\left(\mathit{\pmb{X}}_{\mathrm{t}}, \mathit{\pmb{X}}_{\mathrm{s}}\right) $ | (4) |
式中,φ(·)为非线性特征映射。将式(3)代入式(4),可得
$ L_{\mathrm{sm}}\left(\boldsymbol{X}_{\mathrm{t}}, \boldsymbol{X}_{\mathrm{s}}\right)=\left\|\sum\limits_{x_{i} \in {\mathit{\pmb{X}}}_{\mathrm{s}}} \frac{\phi\left(x_{i}\right)}{\left|\boldsymbol{X}_{\mathrm{s}}\right|}-\sum\limits_{x_{j} \in {\mathit{\pmb{X}}}_{\mathrm{t}}} \frac{\phi\left(x_{j}\right)}{\left|\boldsymbol{X}_{\mathrm{t}}\right|}\right\|_{\mathrm{H}}^{2} $ | (5) |
对于单源域,以网络深层特征H(·)作为映射,则MMD损失为
$ {L_{{\rm{sm}}}}\left({{{\mathit{\pmb{X}}}_{\rm{t}}}, {\mathit{\pmb{X}}_{\rm{s}}}} \right) = \left\| {\sum\limits_{{x_i} \in {\mathit{\pmb{X}}_{\rm{s}}}} {\frac{{H\left({{x_i}} \right)}}{{\left| {{\mathit{\pmb{X}}_{\rm{s}}}} \right|}}} - \sum\limits_{{x_j} \in {\mathit{\pmb{X}}_{\rm{t}}}} {\frac{{H\left({{x_j}} \right)}}{{\left| {{\mathit{\pmb{X}}_{\rm{t}}}} \right|}}} } \right\|_2^2 $ | (6) |
式中,多源域与目标域的样本集分别定义为
$ {{\mathit{\pmb{X}}}_{\rm{s}}} = \left\{ {\left({{\mathit{\pmb{X}}_{{\rm{s}}j}}, {{\mathit{\pmb{Y}}}_{{\rm{s}}j}}} \right)} \right\}_{j = 1}^N, {\mathit{\pmb{X}}_{\rm{t}}} = \left\{ {\left({x_i^{\rm{t}}, y_i^{\rm{t}}} \right)} \right\}_{i = 1}^{\left| {{\mathit{\pmb{X}}_{\rm{t}}}} \right|} $ | (7) |
随后对深层特征进行MMD损失计算,具体为
$ {L_{{\rm{MMD}}}} = \frac{1}{N}\sum\limits_{j = 1}^N {{L_{{\rm{sm}}}}} \left({F\left({{\mathit{\pmb{X}}_{{\rm{s}}j}}} \right), F\left({{\mathit{\pmb{X}}_{\rm{t}}}} \right)} \right) $ | (8) |
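式(5)和式(8)的MMD损失可用如下numpy代码示意(以恒等映射代替核空间映射φ仅作数值演示,函数名与变量名均为示意假设,并非原文实现):

```python
import numpy as np

def mmd_sq(xs, xt, phi=lambda x: x):
    # 式(5)的经验估计: 两域样本在映射phi下的均值嵌入之差的平方范数
    mu_s = np.mean([phi(x) for x in xs], axis=0)  # (1/|Xs|) * sum(phi(x_i))
    mu_t = np.mean([phi(x) for x in xt], axis=0)  # (1/|Xt|) * sum(phi(x_j))
    return float(np.sum((mu_s - mu_t) ** 2))

def l_mmd(source_domains, xt):
    # 式(8): 对N个源域分别计算与目标域的MMD损失后取平均
    return sum(mmd_sq(xs, xt) for xs in source_domains) / len(source_domains)
```

当源域与目标域分布一致时该损失为0,随两域分布差异增大而增大,从而驱动深层特征向域不变表示靠拢。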
另外,多源域迁移学习中,各源域分类器在目标域样本上的预测差异损失定义为
$ \begin{array}{*{20}{c}} {{L_{{\rm{disc}}}} = \frac{2}{{N \times (N - 1)}} \times }\\ {\sum\limits_{j = 1}^{N - 1} {\sum\limits_{i = j + 1}^N {{E_{x \sim {\mathit{\pmb{X}}_{\rm{t}}}}}} } \left[ {\left| {{C_i}\left({{H_i}\left({F\left(x \right)} \right)} \right) - {C_j}\left({{H_j}\left({F\left(x \right)} \right)} \right)} \right|} \right]} \end{array} $ | (9) |
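式(9)对所有分类器对的预测差异取期望再平均,系数2/(N(N−1))恰为分类器对数的倒数,可示意如下(输入形状与函数名为示意假设):

```python
import itertools
import numpy as np

def l_disc(classifier_probs):
    # classifier_probs: 形状(N, m, k), 即N个源域分类器对目标域m个样本的k类输出
    outs = np.asarray(classifier_probs, dtype=float)
    pairs = list(itertools.combinations(range(outs.shape[0]), 2))
    # 式(9): 对每一对分类器(i, j), 求目标样本上预测差绝对值的均值, 再对所有对取平均
    return sum(float(np.mean(np.abs(outs[i] - outs[j]))) for i, j in pairs) / len(pairs)
```

各分类器预测完全一致时损失为0,迫使不同源域的判别结构在目标域上相互对齐。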
多源域混淆的深度迁移学习策略的实现流程如图 4所示。由图可知,该流程由一个浅层公共特征提取器F、N个深层域特征提取器H_j以及N个域分类器C_j组成。
1.2 多源域迁移及双流融合的特征提取
1.2.1 多源域迁移的特征提取
1.2.2 双流融合的特征提取
由于自然光图像仅包含目标的2维颜色纹理等特征,对于解决实际场景中的目标识别问题具有一定的挑战性。本文采用特征级的数据融合策略,设计双流卷积网络(徐琳琳等,2019),将深度图像的空间3维信息融入到目标的特征信息表达中,通过2维和3维信息的相互补充和约束,提高现实生活中复杂背景下的目标识别效果。为有效融合两类模态信息并抑制网络复杂度的增加,采用卷积层特征级的数据融合方式,通过1×1的卷积核实现再卷积操作,使融合前后特征图的宽度、高度和通道数保持不变。
已知自然光通道和深度通道对应的特征图分别为x^a和x^b,其特征卷积融合过程为
$ \boldsymbol{y}_{i, j, 2d-1}=\boldsymbol{x}_{i, j, d}^{a}, \boldsymbol{y}_{i, j, 2d}=\boldsymbol{x}_{i, j, d}^{b} $ | (10) |
$ \boldsymbol{y}_{i, j, d}^{\mathrm{cvo}}=f_{\mathrm{cvo}}\left(\boldsymbol{x}_{i, j, d}^{a}, \boldsymbol{x}_{i, j, d}^{b}\right) $ | (11) |
上述方法是在合并融合(concatenation fusion)的基础上提出的,其融合方式可以保证网络自动学习到两类模态对应的特征关系,融合后特征的高度、宽度以及特征通道数保持不变,即
$ {{\mathit{\pmb{y}}}_{{\rm{cvo}}}} = {\mathit{\pmb{y}}_{{\rm{cat}}}}*{\mathit{\pmb{w}}} + {\mathit{\pmb{b}}} $ | (12) |
自然光图像和深度图像分别经过数次卷积与池化后,将获取的特征进行合并,经卷积核自动提取特征后,即可完成两类模态特征的相互学习。该融合方法的神经元表达式为
$ \begin{aligned} \alpha_{j}^{l} &=f_{\mathrm{F}}\left\{\boldsymbol{W}^{l}\left[\sum\limits_{i \in \boldsymbol{M}_{\mathrm{R}j}^{l}}\left(\alpha \cdot \boldsymbol{a}_{i}^{l-1} * \boldsymbol{k}_{i j}^{l}\right)+\right.\right.\\ &\left.\left.\sum\limits_{i \in \boldsymbol{M}_{\mathrm{D}j}^{l}}\left(\beta \cdot \boldsymbol{a}_{i}^{l-1} * \boldsymbol{k}_{i j}^{l}\right)+\boldsymbol{b}^{l}\right]\right\} \end{aligned} $ | (13) |
式中,自然光图像特征和深度图像特征的融合系数分别用α和β表示,其取值由两个单支模型的识别率决定,即
$ \frac{\alpha}{\beta}=\frac{R_{\mathrm{RGB}}}{R_{\mathrm{depth}}}, \alpha+\beta=1 $ | (14) |
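式(12)—(14)的融合过程可示意如下:1×1卷积在数值上等价于逐像素的通道线性变换,融合权重按式(14)由两个单支模型的识别率归一化得到(函数名、参数名均为示意假设):

```python
import numpy as np

def fusion_weights(r_rgb, r_depth):
    # 式(14): alpha/beta = R_RGB/R_depth 且 alpha + beta = 1
    alpha = r_rgb / (r_rgb + r_depth)
    return alpha, 1.0 - alpha

def conv1x1_fuse(feat_rgb, feat_depth, w, b, alpha, beta):
    # feat_*: (H, W, D)特征图; 先按式(13)的系数加权并沿通道合并(concatenation),
    # 再按式(12)用1x1卷积核 w:(2D, D) 将通道数压回D, 高度和宽度保持不变
    y_cat = np.concatenate([alpha * feat_rgb, beta * feat_depth], axis=-1)
    return y_cat @ w + b
```

卷积核w由网络训练得到,自动学习两类模态通道间的对应关系;此处以随意给定的w演示形状变化。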
1.2.3 多源域迁移及双流融合的特征提取
结合上述多源域迁移以及双流融合的特征提取方法,设计本文提取目标域特征的具体操作流程,如图 6所示。由图可知,首先,多源域混淆的深度迁移特征获得源域的通用性特征后,通过
1.3 复合分类损失函数
1.3.1 softmax损失函数
softmax损失函数(Xie等,2015)将上一层输出的特征参数映射到目标类别中,其定义为
$ {L_{\rm{s}}} = - \frac{1}{m}\left[ {\sum\limits_{i = 1}^m {\sum\limits_{j = 1}^k 1 } \left\{ {{y^{(i)}} = j} \right\}\lg \frac{{{{\rm{e}}^{{\mathit{\pmb{W}}}_j^{\rm{T}}{x^{(i)}}}}}}{{\sum\limits_{l = 1}^k {{{\rm{e}}^{{\mathit{\pmb{W}}}_l^{\rm{T}}{x^{(i)}}}}} }}} \right] $ | (15) |
式中,m为样本数,k为类别数,1{·}为指示函数。
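式(15)的softmax损失可按如下方式示意(工程实现通常以自然对数代替原文的lg并做数值稳定处理,此处即按该常规写法,属于示意假设):

```python
import numpy as np

def softmax_loss(logits, labels):
    # logits: (m, k), 即各类的W^T x; labels: 各样本的真实类别j
    z = logits - logits.max(axis=1, keepdims=True)        # 数值稳定化
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax概率
    m = len(labels)
    # 指示函数1{y^(i)=j}仅保留真实类别对应的对数概率
    return float(-np.mean(np.log(p[np.arange(m), labels])))
```

预测越接近真实类别,损失越接近0;各类输出相同时,二分类损失为ln2。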
1.3.2 中心损失函数
中心损失函数(center loss)(张延安等,2017)利用目标的特征中心,借助聚类思想搭建其损失函数的具体形式,实现流程为
$ {L_{\rm{c}}} = \frac{1}{{2m}}\sum\limits_{i = 1}^m {\left\| {{x_i} - {c_{yi}}} \right\|_2^2} $ | (16) |
$ \frac{{\partial {L_c}}}{{\partial {x_i}}} = {x_i} - {c_{yi}} $ | (17) |
$ \Delta {c_j} = \frac{{\sum\limits_{i = 1}^m \delta \left({{y_i} = j} \right) \cdot \left({{c_j} - {x_i}} \right)}}{{1 + \sum\limits_{i = 1}^m \delta \left({{y_i} = j} \right)}} $ | (18) |
同时,引入中心特征的迭代更新规则
$ c_j^{t + 1} = c_j^t + \Delta c_j^t $ | (19) |
使中心特征在小批次中完成更新。
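式(16)—(19)的中心损失及中心更新可示意如下(按中心损失的常规实现,引入假设的更新率lr控制中心移动幅度,并沿损失下降方向即朝类内样本均值更新;变量名为示意假设):

```python
import numpy as np

def center_loss(x, labels, centers):
    # 式(16): L_c = (1/2m) * sum ||x_i - c_{y_i}||^2
    diff = x - centers[labels]
    return float(np.sum(diff ** 2) / (2 * len(x)))

def update_centers(x, labels, centers, lr=0.5):
    # 式(18): delta_c_j = sum(delta(y_i=j)*(c_j - x_i)) / (1 + sum(delta(y_i=j)))
    # 式(19)按delta_c在小批次内迭代更新各类特征中心
    new_centers = centers.copy()
    for j in range(len(centers)):
        mask = labels == j
        delta = np.sum(centers[j] - x[mask], axis=0) / (1 + mask.sum())
        new_centers[j] = centers[j] - lr * delta
    return new_centers
```

更新后各类中心向类内样本均值靠拢,中心损失随之下降,从而压缩类内距离。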
1.3.3 构建复合分类损失函数
综合softmax和center loss的优点,构建复合分类损失函数。其中,softmax计算样本间的类间差异性,而center loss计算样本间的类内相似性。结合1.1节提到的MMD损失和域间分类器差异损失,模型的总损失函数为
$ \begin{array}{l} {L_{{\rm{total }}}} = {L_{{\rm{cls}}}} + \gamma {L_{{\rm{MMD}}}} + \eta {L_{{\rm{disc}}}} = \\ \;\;{L_{\rm{s}}} + \lambda {L_{\rm{c}}} + \gamma {L_{{\rm{MMD}}}} + \eta {L_{{\rm{disc}}}} \end{array} $ | (20) |
式中,λ、γ和η分别为中心损失、MMD损失和域间分类器差异损失的权重系数。
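式(20)的复合损失即四项损失的加权和,可示意为(各权重的默认取值仅为示意假设,实际取值需按实验调优):

```python
def total_loss(l_s, l_c, l_mmd, l_disc, lam=0.5, gamma=1.0, eta=1.0):
    # 式(20): L_total = L_s + lambda*L_c + gamma*L_MMD + eta*L_disc
    return l_s + lam * l_c + gamma * l_mmd + eta * l_disc
```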
2 实验结果与分析
为验证本文方法不依赖预训练模型的具体结构形式,实验选用AlexNet、VggNet-16(Simonyan等,2014)、ResNet-50(He等,2016)以及DenseNet-169(Huang等,2017)作为迁移学习中目标数据域的特征提取器。
2.1 实验平台和数据集
实验软件选用PyTorch深度学习框架作为实现深度神经网络的平台;实验硬件平台选用NVIDIA Quadro P5000的显卡进行加速计算。实验中的目标数据集是美国字母手势(ASL)(Pansare等,2012),共24类,每类手势包含自然光图像和对应的深度图像。实验中从ASL的每类手势图像中抽取自然光图像和深度图像各1 000幅作为实验数据,其中训练集600幅, 验证集和测试集各200幅。多源域数据由ImageNet数据集、自采集的简单背景和复杂背景手势彩色图像组成,对比实验中的单源域模型中的源域数据集是ImageNet数据集。自采集图像由10位参与者拍摄的1 920幅手势样本组成,每种背景下各960幅,经过数据增广后,每类背景下各包含4 800幅手势图像样本,如图 7和图 8所示。
2.2 实验超参数设置
实验的超参数设置如表 1所示。首先,为抑制单一样本输入引起的梯度震荡并提高训练速率,将训练样本按照批次(batch)的方式输入到网络中,每批次输入60个训练样本,共迭代24 000次。参数valid_interval和valid_iter表示模型每训练1 000次迭代进行一次验证,每次验证迭代600次;test batch_size表示每次输入16幅样本进行批量测试。其次,网络采用含动量的随机梯度下降(SGD)优化策略,学习率从初始的0.01开始,每迭代1 000次(步长)衰减为原来的0.1倍。最后,为抑制模型的过拟合,添加权值衰减的正则项系数;同时为防止梯度爆炸,设置梯度阈值,使权重更新保持在合理范围内。
表 1
超参数设置
Table 1
Settings of hyperparameters
参数 | 值 |
train batch_size | 60 |
epoch (max_iter) | 100(24 000) |
valid_interval | 1 000 |
valid_iter | 600 |
test batch_size | 16 |
动量 | 0.9 |
步长 | 1 000 |
base_lr | 0.01 |
权重衰减 | 0.000 5 |
梯度阈值 | 40 |
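表1中的学习率阶梯衰减与梯度阈值策略可用纯Python示意如下(实际实验在PyTorch中由优化器与学习率调度器完成,此处函数名为示意假设):

```python
import math

def stepped_lr(base_lr, gamma, step_size, iteration):
    # 表1策略: 初始学习率0.01, 每迭代step_size(1000)次乘以gamma(0.1)
    return base_lr * gamma ** (iteration // step_size)

def clip_by_norm(grads, max_norm):
    # 梯度阈值: 当梯度范数超过max_norm(表1中为40)时按比例缩放
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    return [g * max_norm / norm for g in grads]
```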
2.3 模型识别流程
模型的前向学习包括多源域数据的特征迁移和双流目标域数据的特征融合形成的目标域特征,最终经1.3节设计的复合损失函数完成分类任务;而反向传播过程借助预训练模型在目标域上的适配学习,通过不断地迭代更新目标模型参数。具体的识别流程如图 9所示。
2.4 实验结果
2.4.1 验证复合损失函数的有效性
为寻找中心损失最佳的权重值,实验以0.1为间隔改变权重因子,各训练100次epoch,得到4种网络效果最优的中心损失权重值及对应准确度,如表2所示。
表 2
中心损失的权重值
Table 2
Weights of center loss
模型 | 中心损失权重 | 准确度/%
AlexNet | 0.7 | 88.36 |
VggNet-16 | 0.6 | 84.96 |
ResNet-50 | 0.5 | 56.62 |
DenseNet-169 | 0.5 | 64.93 |
单支损失函数模型的识别结果如图 10所示。在最佳的
2.4.2 验证多源域混淆迁移学习的有效性
2.4.3 多源域混淆的双流深度迁移学习模型性能
以DenseNet-169预训练模型为例,具体说明双流融合策略对模型识别性能的影响。由式(14)可知,融合信息中两类模态流的权重由单支模型的准确度决定,因此融合权重分别为
与图 12相比可知:1)经双流特征融合后,模型的收敛速度明显提升;2)第10次epoch时,经双流特征融合后,识别率上升到97.17%,表明双流特征融合的网络结构可有效提升模型的收敛速度和预测准确度。为全面了解各个模型对每类手势的预测能力,本文通过精度和召回率两个指标展示识别结果,如图 14和图 15所示。
对比上述4类模型可以看出,首先,所提的多源域混淆的深度迁移学习模型对每类手势的预测结果中,正确预测所占的比例平均为0.976,明显高于其他模型的预测精度;其次,所提模型正确预测为正(负)样本占实际正(负)样本的比例达到0.968,优于其他模型,说明该模型具备较强的识别敏感度。
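文中的精度(预测为某类的样本中预测正确的比例)与召回率(某类真实样本中被正确识别的比例)按定义计算如下(示意实现):

```python
def precision_recall(preds, labels, cls):
    # 针对指定手势类别cls统计: 精度 = TP/(TP+FP), 召回率 = TP/(TP+FN)
    tp = sum(1 for p, y in zip(preds, labels) if p == cls and y == cls)
    fp = sum(1 for p, y in zip(preds, labels) if p == cls and y != cls)
    fn = sum(1 for p, y in zip(preds, labels) if p != cls and y == cls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```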
本文实验中的模型识别结果如表 3所示。由表可知,在不同的预训练模型下,利用多源域数据集、两阶段适配策略、双流融合方法以及复合损失函数,有效完成了深度神经网络在小样本数据集上的识别任务,且神经网络的特征描述能力和识别性能获得了明显提升。
表 3
少量样本在ASL数据集上的识别结果
Table 3
Recognition results on few-shot ASL samples
注:表中数值为识别准确度/%
模型 | 单支损失(单源域) | 复合损失(单源域) | 复合损失(多源域) | 复合损失(多源域双流)
AlexNet | 84.41 | 88.36 | 89.69 | 92.87
VggNet-16 | 82.19 | 84.96 | 90.18 | 94.13
ResNet-50 | 49.84 | 56.62 | 82.16 | 85.79
DenseNet-169 | 54.72 | 64.93 | 93.27 | 97.17
为比较本文所提模型的优越性,首先与未进行迁移学习的HSF+RDF(hue saturation value + random decision forest)模型(Pugeault等,2011)、SIFT+PLS(scale-invariant feature transform + partial least squares)模型(Estrela等,2013)和MPC(model predictive control)模型(Pansare等,2012)在ASL整体数据集上进行对比,然后与迁移学习中的DAN(deep adaptation network)(Long等,2015)、D-CORAL(deep CORAL)(Sun等,2016)和RevGrad(Ganin等,2014) 3类经典模型在相同的目标数据域上进行对比,结果如表 4所示。由表可知,多源域混淆的DenseNet-169双流深度迁移学习模型的识别率为97.17%,高于其他模型的识别准确度,证明本文所提方法具有一定的性能优越性和研究价值。
表 4
不同模型的识别率比较
Table 4
Comparison of recognition rates of different models
模型 | 数据量 | 准确度/% |
HSF+RDF | 120 000 | 75.21 |
SIFT+PLS | 120 000 | 71.51 |
MPC | 120 000 | 90.19 |
DAN | 48 000 | 92.4 |
D-CORAL | 48 000 | 91.37 |
RevGrad | 48 000 | 94.83 |
DenseNet-169(本文) | 48 000 | 97.17 |
3 结论
针对深度迁移学习中的源模型在目标数据集上抽取的通用性特征缺乏适用性的问题,提出了一种多源域混淆的双流深度迁移学习方法,在模型的特征提取和损失函数部分进行了改进。为增强源域特征迁移的高效性,丰富目标域的特征参数,抑制小样本深度学习中严重的过拟合问题,首先引入了多源深层特征迁移方法增强目标特征的表征能力;其次针对如何对齐源域与目标域表示特征的问题,提出了多源域混淆的迁移学习策略;最后结合深度图像的3维特征信息,提出双流卷积的特征融合策略,实现了目标的2维和3维模态信息的相互补充和约束。在域内分类损失部分,通过引入中心损失函数提高对类间及类内特征的惩罚监督性能;在深层特征适配损失部分,引入域不变深层特征表示的计算,进行域间特征的相互对齐;在域间分类器损失部分,引入差异损失函数,实现不同域分类器的相互适配。最后在ASL上抽取的少量样本数据集上,选用单支损失函数模型、复合损失函数模型、多源域混淆的迁移学习模型以及多源域混淆的双流迁移学习模型进行对比实验,证明了所提模型的优越性。
下一步的工作包括特征融合和模型优化两个方面。在特征融合方面,由于深度迁移学习中不同网络结构的源模型可以学习到不同的表示特征,如何充分融合这些有效的特征需要进行更加深入的探索;在模型优化方面,本文所提模型对少量样本深度学习具有一定的研究价值,但是模型的训练过程会出现一定程度的过拟合问题,如何对源模型进行有效的网络适配训练,提升模型的泛化能力,是一个很有意义的研究方向。
参考文献
-
Borgwardt K M, Gretton A, Rasch M J, Kriegel H P, Scholkopf B, Smola A J. 2006. Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics, 22(14): e49-e57 [DOI:10.1093/bioinformatics/btl242]
-
Chen Z T, Fu Y W, Zhang Y D, Jiang Y G, Xue X and Sigal L. 2018. Semantic feature augmentation in few-shot learning[EB/OL].[2019-03-28]. https://arxiv.org/pdf/1804.05298.pdf
-
Deng J, Dong W, Socher R, Li L J, Li K and Fei-Fei L. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 248-255[DOI: 10.1109/CVPR.2009.5206848]
-
Estrela B, Cámara-Chávez G, Campos M F M, Schwartz W R and Nascimento E R. 2013. Sign language recognition using partial least squares and RGB-D information//Proceedings of 2013 Conference on Workshop de Visão Computacional. Minas Gerais, Brazil: IEEE, 672-678
-
Finn C, Abbeel P and Levine S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks//Proceedings of 2017 IEEE Conference on Machine Learning. Sydney, Australia: IEEE, 1126-1135
-
Ganin Y and Lempitsky V. 2014. Unsupervised domain adaptation by backpropagation[EB/OL].[2019-03-28].https://arxiv.org/pdf/1409.7495.pdf
-
Garcia V and Bruna J. 2017. Few-shot learning with graph neural networks[EB/OL].[2019-03-28]. https://arxiv.org/pdf/1711.04043.pdf
-
Ghifary M, Kleijn W B and Zhang M J. 2014. Domain adaptive neural networks for object recognition//Proceedings of the 13th Pacific Rim International Conference on Artificial Intelligence. Gold Coast, QLD, Australia: Springer, 898-904[DOI: 10.1007/978-3-319-13560-1_76]
-
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 770-778[DOI: 10.1109/CVPR.2016.90]
-
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2261-2269[DOI: 10.1109/CVPR.2017.243]
-
Huang S W, Lin C T, Chen S P, Wu Y Y, Hsu P H and Lai S H. 2018. AugGAN: cross domain adaptation with GAN-based data augmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 718-731[DOI: 10.1007/978-3-030-01240-3_44]
-
Koch G, Zemel R and Salakhutdinov R. 2015. Siamese neural networks for one-shot image recognition//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: ACM, 212-217
-
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA: ACM, 1097-1105
-
Liu J C, Osadchy M, Ashton L, Foster M, Solomon C J, Gibson S J. 2017. Deep convolutional neural networks for Raman spectrum recognition:a unified solution. Analyst, 142(21): 4067-4074 [DOI:10.1039/C7AN01371J]
-
Long M S, Cao Y, Wang J M and Jordan M I. 2015. Learning transferable features with deep adaptation networks//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: ACM, 4891-4897
-
Oquab M, Bottou L, Laptev I and Sivic J. 2014. Learning and transferring mid-level image representations using convolutional neural networks//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA: IEEE, 1717-1724[DOI: 10.1109/CVPR.2014.222]
-
Pansare J R, Gawande S H, Ingle M. 2012. Real-time static hand gesture recognition for American Sign Language (ASL) in complex background. Journal of Signal and Information Processing, 3(3): 364-367 [DOI:10.4236/jsip.2012.33047]
-
Pugeault N and Bowden R. 2011. Spelling it out: real-time ASL fingerspelling recognition//Proceedings of 2011 International Conference on Computer Vision Workshops. Barcelona, Spain: IEEE, 1114-1119[DOI: 10.1109/ICCVW.2011.6130290]
-
Ravi S and Larochelle H. 2017. Optimization as a model for few-shot learning//Proceedings of 2017 International Conference on Machine Learning. New York, USA: IEEE, 1317-1325
-
Santoro A, Bartunov S, Botvinick M, Wierstra D and Lillicrap T. 2016. Meta-learning with memory-augmented neural networks//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: IEEE, 1842-1850
-
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition[EB/OL].[2019-03-28].https://arxiv.org/pdf/1409.1556.pdf
-
Snell J, Swersky K and Zemel R. 2017. Prototypical networks for few-shot learning//Proceedings of the 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA: ACM, 4077-4087
-
Sun B C and Saenko K. 2016. Deep coral: correlation alignment for deep domain adaptation//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, Netherlands: Springer, 443-450[DOI: 10.1007/978-3-319-49409-8_35]
-
Tzeng E, Hoffman J, Zhang N, Saenko K and Darrell T. 2014. Deep domain confusion: maximizing for domain invariance[EB/OL].[2019-03-28]. https://arxiv.org/pdf/1412.3474.pdf
-
Vinyals O, Blundell C, Lillicrap T and Wierstra D. 2016. Matching networks for one-shot learning//Proceedings of the 30th Conference on Neural Information Processing Systems. Barcelona, Spain: IEEE, 3630-3638
-
Xie S N and Tu Z W. 2015. Holistically-nested edge detection//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 1395-1403[DOI: 10.1109/ICCV.2015.164]
-
Xu L L, Zhang S M, Zhao J L. 2019. Expression recognition algorithm for parallel convolutional neural networks. Journal of Image and Graphics, 24(2): 227-236 (徐琳琳, 张树美, 赵俊莉. 2019. 构建并行卷积神经网络的表情识别算法. 中国图象图形学报, 24(2): 227-236) [DOI:10.11834/jig.180346]
-
Yu H P, Zhang P, Zhu J. 2017. Study on face recognition method based on deep transfer learning. Journal of Chengdu University:Natural Science, 36(2): 151-156 (余化鹏, 张朋, 朱进. 2017. 基于深度迁移学习的人脸识别方法研究. 成都大学学报:自然科学版, 36(2): 151-156) [DOI:10.3969/j.issn.1004-5422.2017.02.009]
-
Zhang Y A, Wang H Y, Xu F. 2017. Face recognition based on deep convolution neural network and center loss. Science Technology and Engineering, 17(35): 92-97 (张延安, 王宏玉, 徐方. 2017. 基于深度卷积神经网络与中心损失的人脸识别. 科学技术与工程, 17(35): 92-97) [DOI:10.3969/j.issn.1671-1815.2017.35.015]
-
Zheng Y, Chen Q Q, Zhang Y J. 2014. Deep learning and new progress in target and behavior recognition. Journal of Image and Graphics, 19(2): 175-184 (郑胤, 陈权崎, 章毓晋. 2014. 深度学习及其在目标和行为识别中的新进展. 中国图象图形学报, 19(2): 175-184) [DOI:10.11834/jig.20140202]