Published: 2021-11-16
DOI: 10.11834/jig.200419
2021 | Volume 26 | Number 11

Remote sensing image scene recognition based on adversarial learning
Li Tong1,2, Zhang Junping1
1. School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China;
2. Shanghai Institute of Satellite Engineering, Shanghai 201109, China
Supported by: National Natural Science Foundation of China (61871150)

Abstract

Objective In high-resolution remote sensing image scene recognition, classical supervised machine learning algorithms are effective under two conditions: 1) test samples must lie in the same feature space as the training samples, and 2) enough labeled samples must be available to train the model fully. Deep learning algorithms, which have achieved remarkable results in image classification and object detection over the past few years, generally require a large number of labeled samples to learn accurate parameters. Mainstream image classification methods select training and test samples randomly from the same dataset and adopt cross validation to verify the effectiveness of the model. However, obtaining scene labels for remote sensing images is time consuming and expensive. To deal with the insufficiency of labeled samples in remote sensing image scene recognition and the problem that labeled samples cannot be shared between datasets acquired by different sensors under complex illumination conditions, deep learning architectures and adversarial learning are investigated, and a feature transfer method based on an adversarial variational autoencoder (VAE) is proposed. Method The feature transfer architecture consists of three parts. The first part is the pretraining module. Given the limited number of samples with scene labels, the unsupervised VAE is adopted: it is trained without supervision on the source dataset, and its encoder is then fine-tuned together with a classifier network using the labeled samples of the source dataset. The second part is the adversarial learning module. In most previous work, adversarial learning is used to generate new samples; here the idea is used to transfer features from the source domain to the target domain. The parameters of the fine-tuned source encoder are used to initialize the target encoder. Following the adversarial training idea of generative adversarial networks (GAN), a discrimination network is introduced into the training of the target encoder. The goal of the target encoder is to extract target-domain features that are as similar as possible to the source-domain features, so that the discrimination network cannot tell whether a feature comes from the source or the target domain; the goal of the discrimination network is to optimize its parameters for better distinction. The training is called adversarial because these two goals conflict. By alternately training and updating the parameters of the target encoder and the discrimination network, the features extracted by the target encoder increasingly resemble those extracted by the source encoder. When the discrimination network can no longer differentiate source features from target features, the target encoder can be assumed to extract features similar to those of the source samples, and feature transfer from the source domain to the target domain is accomplished. The third part is the target fine-tuning and test module. A small number of labeled target-domain samples are used to fine-tune the target encoder and the source classifier, and the remaining samples are used for evaluation. Result Two remote sensing scene recognition datasets, UCMerced-21 and NWPU-RESISC45, are adopted to verify the effectiveness of the proposed feature transfer method.
SUN397, a natural scene recognition dataset, is also employed in an attempt at cross-view feature transfer. Eight scene types shared by the three datasets, namely baseball field, beach, farmland, forest, harbor, industrial area, overpass, and river/lake, are selected for the feature transfer task. Correlation alignment (CORAL) and balanced distribution adaptation (BDA) are used as comparison methods. In the adversarial learning experiments between the two remote sensing scene recognition datasets, the proposed method raises the recognition accuracy by about 10% compared with the network trained only on source-domain samples, and the improvement is larger when a few target-domain samples are involved. Compared with CORAL and BDA, the proposed method improves scene recognition accuracy by more than 3% when a few target-domain samples are used and by 10% to 40% when no target-domain samples are used. When the information of natural scene images is used, the improvement is not as large as with remote sensing images, but the scene recognition accuracy with the proposed feature transfer method still increases by approximately 6% after unsupervised feature transfer and by 36% after a small number of target-domain samples are involved in fine-tuning. Conclusion An adversarial VAE-based transfer learning network is proposed in this paper. The experimental results show that the proposed adversarial learning method makes the most of the sample information of another dataset when labeled samples are insufficient in the target domain. The proposed method achieves feature transfer between different datasets and effective scene recognition, and remarkably improves scene recognition accuracy.

Key words

scene recognition; remote sensing image; adversarial learning; transfer learning; variational autoencoder (VAE)

0 Introduction

A scene is an organic combination of multiple classes of ground objects, targets, and their background context. Remote sensing image scene recognition describes the theme of an image scene by analyzing the composition and statistical relationships of the ground objects and targets it contains; existing methods fall into mid-level-feature-based methods and deep learning methods. Mid-level-feature-based methods include the bag-of-visual-words (BOVW) model and its improvements (Zhang et al., 2016), Fisher vector coding (FVC) (Zhao et al., 2016), the vector of locally aggregated descriptors (VLAD) (Wang et al., 2017a), and combinations with sparse representation (Zhu et al., 2018). In recent years, deep learning methods have achieved great breakthroughs in remote sensing scene recognition. Chaib et al. (2017) proposed a fusion method based on VGG-Net (Visual Geometry Group net) and discriminant correlation analysis, which effectively improves scene recognition accuracy. Cheng et al. (2018) combined deep learning with metric learning and proposed a discriminative convolutional neural network, which well addresses the high inter-class similarity and large intra-class diversity of remote sensing scenes. Wang et al. (2019) and Tong et al. (2020) combined attention mechanisms with deep learning models, alleviating to some extent the overfitting problem of deep models in remote sensing image processing. However, the main premise of these supervised machine learning algorithms is that training and test samples must lie in the same feature space and follow the same distribution. The algorithms also require enough labeled samples to train the model fully; deep learning algorithms, which have performed remarkably in image classification and recognition tasks in recent years, need especially large numbers of labeled samples to train the network parameters. For remote sensing images, obtaining large numbers of class labels is time consuming and laborious, and multi-temporal remote sensing images acquired by different sensors cannot share labeled samples of the same class because of differences in optical components, imaging chains, and illumination conditions. Therefore, effective knowledge transfer could significantly improve the performance of machine learning models while avoiding costly manual annotation.

Transfer learning has attracted much attention in practical applications such as remote sensing image classification and recognition (Wu et al., 2019) because of its effectiveness in handling differences in data distributions and in making full use of existing labeled data. Most studies on transfer-learning-based scene-level classification of high-resolution images transfer the parameters of deep networks from the computer vision field. Li et al. (2019) pretrained a convolutional neural network on the ImageNet dataset and then fine-tuned it with a small number of labeled remote sensing scenes, realizing parameter transfer. Hu et al. (2015) extracted multi-scale remote sensing scene features on top of a pretrained convolutional neural network and performed global feature coding. Yao et al. (2017) used a pretrained deep network to extract remote sensing scene features and a random forest classifier for scene recognition. Gong et al. (2018) combined a pretrained convolutional neural network with visual saliency region extraction and injected random noise into the scenes to improve the noise robustness of the model. However, for such parameter-transfer methods, the network structure cannot be adjusted freely when parameters provided by others are reused, whereas building and training a network from scratch is very time consuming. Therefore, this paper considers sharing labeled samples across datasets from the perspective of feature transfer.

In optical remote sensing, most feature transfer algorithms target pixel-level classification of hyperspectral images; typical algorithms include joint distribution adaptation (Long et al., 2013), transfer component analysis (Matasci et al., 2015), correlation alignment (Sun et al., 2016), and balanced distribution adaptation (Wang et al., 2017b). These feature transfer algorithms share the idea of finding a transformation from the source domain to the target domain such that the transformed source and target features follow the same probability distribution or are highly correlated. With the same goal, this paper proposes an adversarial-learning-based remote sensing scene recognition method: a discrimination network is introduced to constrain the training of a variational autoencoder (VAE) so that the discrimination network cannot distinguish the source-domain scene features extracted by the VAE from the target-domain scene features, thereby achieving feature transfer between different scene datasets.

1 Variational autoencoder

In deep learning research, generative models have become a hot topic because they can automatically learn the distribution of the input data and generate similar synthetic data. The variational autoencoder (Kingma and Welling, 2014) extends the autoencoder (Hinton and Salakhutdinov, 2006) into a latent variable model and is a typical generative model. It combines variational inference with the autoencoder idea by introducing a latent variable $ \boldsymbol{z} $ between the encoder and the decoder. The input $ \boldsymbol{x} $ is passed through the encoder $ p\left({\boldsymbol{z}|\boldsymbol{x}} \right) $ to obtain the mean and variance of the latent distribution $ p\left(\boldsymbol{z}\right) $; a new feature $ \boldsymbol{z'} $ is sampled from this distribution, and the decoder $ p\left({\boldsymbol{x}|\boldsymbol{z}} \right) $ reconstructs the input. After the network is trained, one only needs to sample $ \boldsymbol{z} $ and pass it through the decoder to generate a new sample $ {\boldsymbol{\hat x}} $ with the same distribution as the input $ \boldsymbol{x} $. The network structure is shown in Fig. 1.

Fig. 1 The network structure of the variational autoencoder

Training the latent variable model by maximum likelihood estimation requires computing

$ \arg \max \limits_{\theta} E_{\boldsymbol{x} \sim p} \log p_{\theta}(\boldsymbol{x}) $ (1)

where $ E\left(\cdot \right) $ denotes the expectation over the distribution and $ \theta $ denotes the model parameters.

By Bayes' rule, $ p_{\theta}(\boldsymbol{x})=\frac{p_{\theta}(\boldsymbol{x}, \boldsymbol{z})}{p_{\theta}(\boldsymbol{z} \mid \boldsymbol{x})} $, so the problem becomes computing $ {p_\theta }\left({\boldsymbol{z}|\boldsymbol{x}} \right) $. To this end, variational inference is introduced: within a family of variational functions, we search for the distribution $ q_\phi(\boldsymbol{z}|\boldsymbol{x}) $ closest to the posterior $ {p_\theta }\left({\boldsymbol{z}|\boldsymbol{x}} \right) $, where $ \phi $ denotes the variational parameters. The Kullback-Leibler (KL) divergence $ {D_{{\rm{KL}}}} $ is used to measure the difference between the two distributions, namely

$ \begin{gathered} D_{\mathrm{KL}}\left(q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) \| p_{\theta}(\boldsymbol{z} \mid \boldsymbol{x})\right)= \\ E_{\boldsymbol{z} \sim q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})}\left(\log q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})-\log p_{\theta}(\boldsymbol{z} \mid \boldsymbol{x})\right)= \\ E_{q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})}\left(\log q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})-\log p_{\theta}(\boldsymbol{x}, \boldsymbol{z})+\log p_{\theta}(\boldsymbol{x})\right) \end{gathered} $ (2)

Since $ \log {p_\theta }\left(\boldsymbol{x} \right) $ does not depend on $ \boldsymbol{z} $, we have

$ \begin{gathered} \log p_{\theta}(\boldsymbol{x})=D_{\mathrm{KL}}\left(q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) \| p_{\theta}(\boldsymbol{z} \mid \boldsymbol{x})\right)+ \\ E_{q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})}\left(\log p_{\theta}(\boldsymbol{x}, \boldsymbol{z})-\log q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})\right) \end{gathered} $ (3)

where the second term is called the evidence lower bound (ELBO) or variational lower bound, denoted by $ L\left({\theta, \phi ;\boldsymbol{x}} \right) $. Because $ q_\phi\left({\boldsymbol{z}|\boldsymbol{x}} \right) $ should be as similar to $ p_\theta\left({\boldsymbol{z}|\boldsymbol{x}} \right) $ as possible, i.e., Eq. (2) should be minimized, the properties of the KL divergence give

$ D_{\mathrm{KL}}(q(\boldsymbol{z} \mid \boldsymbol{x}) \| p(\boldsymbol{z} \mid \boldsymbol{x})) \geqslant 0 $ (4)

Meanwhile, $ \log {p_\theta }\left(\boldsymbol{x} \right) $ is fixed over the space of $ \boldsymbol{z} $, so minimizing the KL divergence further reduces to maximizing $ L\left({\theta, \phi ;\boldsymbol{x}} \right) $. The ELBO can be transformed into

$ \begin{aligned} f_{\mathrm{ELBO}}=&-D_{\mathrm{KL}}\left(q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) \| p_{\theta}(\boldsymbol{z})\right)+\\ & E_{q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})}\left(\log p_{\theta}(\boldsymbol{x} \mid \boldsymbol{z})\right) \end{aligned} $ (5)

The loss function of the variational autoencoder, $ {L_{{\rm{VAE}}}} $, is defined as

$ \begin{aligned} L_{\mathrm{VAE}} &=-E_{q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})}\left(\log p_{\theta}(\boldsymbol{x} \mid \boldsymbol{z})\right)+\\ & D_{\mathrm{KL}}\left(q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) \| p_{\theta}(\boldsymbol{z})\right) \end{aligned} $ (6)
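As a minimal sketch of how the loss in Eq. (6) is typically implemented, assuming a PyTorch implementation with a Gaussian posterior $ q_\phi(\boldsymbol{z}|\boldsymbol{x}) $ and a standard normal prior (the framework, function names, and the pixel-wise reconstruction term are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO of Eq. (6): reconstruction term plus KL term."""
    # -E_q[log p_theta(x|z)], realized here as a pixel-wise reconstruction error
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form D_KL(N(mu, sigma^2) || N(0, I)) for a Gaussian posterior and prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def reparameterize(mu, logvar):
    """Sample z ~ q_phi(z|x) with the reparameterization trick so gradients can flow."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std
```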

2 Feature transfer based on adversarial learning

2.1 The idea of adversarial learning

The idea of adversarial learning originates from the generative adversarial network (GAN) proposed by Goodfellow et al. (2014). It consists of a generative network and a discrimination network: randomly distributed data are fed to the generative network to produce synthetic data, the real data and the synthetic data are both fed to the discrimination network, and the discriminator's output is fed back to update the generator's parameters. The learning is called adversarial because the goals of the two networks conflict: the generative network tries to make the synthetic data as close to the real data as possible so that the discrimination network cannot tell whether its input is real, while the discrimination network keeps optimizing its parameters so that it can separate real data from synthetic data.
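For reference, the two-player objective of the original GAN (Goodfellow et al., 2014), with generator $ G $, discriminator $ D $, data distribution $ p_{\text{data}} $, and noise prior $ p_{\boldsymbol{z}} $, can be written as

$ \min\limits_{G} \max\limits_{D} E_{\boldsymbol{x} \sim p_{\text{data}}}\left[\log D(\boldsymbol{x})\right]+E_{\boldsymbol{z} \sim p_{\boldsymbol{z}}}\left[\log \left(1-D(G(\boldsymbol{z}))\right)\right] $

Section 2.2 reuses this alternating scheme, with the target-domain encoder playing the role of the generator.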

2.2 Adversarial variational autoencoder

The proposed transfer learning method based on an adversarial variational autoencoder (adversarial autoencoder transfer, AAET) borrows the idea of adversarial learning to make the features extracted from target-domain scene images as similar as possible to those extracted from source-domain scene images. The workflow is shown in Fig. 2.

Fig. 2 The workflow of high-resolution remote sensing image scene recognition based on adversarial transfer learning

As Fig. 2 shows, the variational autoencoder is first trained thoroughly on the source-domain scene images $ {\boldsymbol{I}_S} $ according to Eq. (6). The encoder is retained as the feature mapping $ {\boldsymbol{M}_S} $, whose output is fed to a classifier $ C $, and the whole network is trained with the corresponding scene labels $ Y_S $, yielding the source-domain feature mapping $ {\boldsymbol{M}_S} $ and the classifier $ C $ and completing the initial training of the network. The objective function of the initial training, $ L_{\mathrm{MC}} $, is

$ L_{\mathrm{MC}}\left(\boldsymbol{I}_{S}, Y_{S}\right)=-E_{\left(\boldsymbol{I}_{S}, Y_{S}\right)} \sum\limits_{k=1}^{K} 1_{k=y_{S}} \log C\left(\boldsymbol{M}_{S}\left(\boldsymbol{I}_{S}^{k}\right)\right) $ (7)
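A rough sketch of this initial training stage (Eq. (7)), assuming PyTorch and that `encoder` returns the feature vector consumed by the classifier; the optimizer choice, learning rate, and helper name are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

def pretrain_source(encoder, classifier, source_loader, num_epochs=200, lr=1e-4):
    """Fine-tune the source encoder M_S and classifier C with labeled source scenes (Eq. (7))."""
    criterion = nn.CrossEntropyLoss()   # implements the -sum 1_{k=y} log C(.) term
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(classifier.parameters()), lr=lr)
    for _ in range(num_epochs):
        for images, labels in source_loader:       # mini-batches of (I_S, Y_S)
            logits = classifier(encoder(images))   # C(M_S(I_S))
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder, classifier
```

The same supervised loop can later be reused for the light target-domain fine-tuning described at the end of this section, only with the target encoder and a few labeled target samples.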

Adversarial learning is then performed. The source-domain feature mapping $ {\boldsymbol{M}_S} $ is fixed, and the target-domain feature mapping $ {\boldsymbol{M}_{{\rm{RS}}}} $ is initialized as $ \boldsymbol{M}_{\mathrm{RS}}=\boldsymbol{M}_{S} $. The source-domain images $ {\boldsymbol{I}_S} $ and the target-domain images $ \boldsymbol{I}_{\mathrm{RS}} $ are mapped by $ \boldsymbol{M}_{S} $ and $ \boldsymbol{M}_{\mathrm{RS}} $, respectively, and fed into the discrimination network $ D $, and the feature mapping $ \boldsymbol{M}_{\mathrm{RS}} $ and the discrimination network $ D $ are updated alternately.

The objective function of the discrimination network, $ L_D $, is

$ \begin{array}{c} L_{D}\left(\boldsymbol{I}_{S}, \boldsymbol{I}_{\mathrm{RS}}, \boldsymbol{M}_{S}, \boldsymbol{M}_{\mathrm{RS}}\right)=-E_{I \sim \boldsymbol{I}_{S}}\left[\log D\left(\boldsymbol{M}_{S}\left(\boldsymbol{I}_{S}\right)\right)\right]- \\ E_{I \sim \boldsymbol{I}_{\mathrm{RS}}}\left[\log \left(1-D\left(\boldsymbol{M}_{\mathrm{RS}}\left(\boldsymbol{I}_{\mathrm{RS}}\right)\right)\right)\right] \end{array} $ (8)

The objective function of the target-domain encoder, $ L_M $, is

$ L_{M}\left(\boldsymbol{I}_{\mathrm{RS}}, Y_{\mathrm{RS}}, D\right)=-E_{I \sim \boldsymbol{I}_{\mathrm{RS}}}\left[\log D\left(\boldsymbol{M}_{\mathrm{RS}}\left(\boldsymbol{I}_{\mathrm{RS}}\right)\right)\right] $ (9)

The target-domain feature mapping network is initialized with the parameters of the source-domain feature mapping network, and the parameters are further adjusted on this basis, namely

$ \psi_{l_{i}}\left(\boldsymbol{M}_{S}^{l_{i}}, \boldsymbol{M}_{\mathrm{RS}}^{l_{i}}\right)=\left(\boldsymbol{M}_{S}^{l_{i}}=\boldsymbol{M}_{\mathrm{RS}}^{l_{i}}\right) $ (10)

where $ \psi \left(\cdot \right) $ denotes the constraint between the source-domain and target-domain mappings, and $ l_i $ denotes the parameters of the $ i $-th layer. The overall loss of the network can be written as

$ \begin{gathered} \min \limits_{\boldsymbol{M}_{S}, C} L_{\mathrm{MC}}\left(\boldsymbol{I}_{S}, Y_{S}\right) \\ \min \limits_{D} L_{D}\left(\boldsymbol{I}_{S}, \boldsymbol{I}_{\mathrm{RS}}, \boldsymbol{M}_{S}, \boldsymbol{M}_{\mathrm{RS}}\right) \\ \min \limits_{\boldsymbol{M}_{S}, \boldsymbol{M}_{\mathrm{RS}}} L_{M}\left(\boldsymbol{I}_{\mathrm{RS}}, Y_{\mathrm{RS}}, D\right) \\ \text { s. t. } \quad \psi_{l_{i}}\left(\boldsymbol{M}_{S}^{l_{i}}, \boldsymbol{M}_{\mathrm{RS}}^{l_{i}}\right)=\left(\boldsymbol{M}_{S}^{l_{i}}=\boldsymbol{M}_{\mathrm{RS}}^{l_{i}}\right) \end{gathered} $ (11)
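A minimal sketch of the alternating updates of Eqs. (8)-(10), again assuming PyTorch; for brevity it uses a single-logit discriminator with binary cross-entropy, whereas the discriminator described in Section 3.2 has a two-node softmax output, and all names and hyperparameters are placeholders:

```python
import copy

import torch
import torch.nn as nn

def adversarial_transfer(source_encoder, discriminator, source_loader, target_loader,
                         num_epochs=100, lr=1e-4):
    """Alternately update the discriminator D (Eq. (8)) and the target encoder M_RS (Eq. (9))."""
    target_encoder = copy.deepcopy(source_encoder)   # M_RS initialized from M_S, Eq. (10)
    for p in source_encoder.parameters():            # M_S stays fixed during this stage
        p.requires_grad = False

    bce = nn.BCEWithLogitsLoss()                     # single-logit stand-in for the 2-node output
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    opt_t = torch.optim.Adam(target_encoder.parameters(), lr=lr)

    for _ in range(num_epochs):
        for (xs, _), (xt, _) in zip(source_loader, target_loader):
            # Discriminator step: source features labeled 1, target features labeled 0 (Eq. (8))
            fs = source_encoder(xs).detach()
            ft = target_encoder(xt).detach()
            d_loss = bce(discriminator(fs), torch.ones(fs.size(0), 1)) + \
                     bce(discriminator(ft), torch.zeros(ft.size(0), 1))
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()

            # Target-encoder step: fool D so target features are judged as source (Eq. (9))
            ft = target_encoder(xt)
            t_loss = bce(discriminator(ft), torch.ones(ft.size(0), 1))
            opt_t.zero_grad()
            t_loss.backward()
            opt_t.step()
    return target_encoder
```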

After the target-domain feature mapping is obtained, a small number of labeled target-domain scene images are used to lightly fine-tune the feature mapping $ \boldsymbol{M}_{\mathrm{RS}} $ and the classifier $ C $, which are finally used to recognize the remote sensing scene images of the target domain.

3 Experimental results and analysis

3.1 Experimental data

Experiments are conducted on two remote sensing scene recognition datasets, UCMerced-21 (Yang and Newsam, 2010) and NWPU-RESISC45 (Cheng et al., 2017), and on the natural scene dataset SUN397 (Xiao et al., 2016). The focus is feature transfer between the remote sensing scene datasets; feature transfer between natural scenes and remote sensing scenes, whose viewpoints differ greatly, is also attempted. Eight scene classes shared by the datasets are selected, namely baseball field, beach, farmland, forest, harbor, industrial area, overpass, and river/lake. Images in the UCMerced-21 dataset all have a spatial resolution of 0.3 m, while the spatial resolution of the NWPU-RESISC45 images ranges from 0.2 m to 30 m. Fig. 3 compares images of each scene class in the three datasets; in each group, the left image is from UCMerced-21, the middle one from NWPU-RESISC45, and the right one from SUN397.

Fig. 3 Experimental data ((a) baseball diamond; (b) beach; (c) farmland; (d) forest; (e) harbor; (f) industrial area; (g) overpass; (h) river/lake)

3.2 Scene transfer recognition from NWPU-RESISC45 to UCMerced-21

In the experiments, the variational autoencoder is first trained with convolution and deconvolution layers for feature extraction and image reconstruction. The encoder structure and parameter settings are shown in Fig. 4, with 500 nodes in the encoder output layer; the decoder mirrors the encoder. After the variational autoencoder is trained, the decoder is discarded, a classifier is appended to the encoder, and the encoder and classifier parameters are fine-tuned with the labeled samples of the source dataset for 200 training iterations, completing the initial training of the network.

Fig. 4 The architecture of the encoders for the source and target domains
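Because the exact layer configuration is given only in Fig. 4, the sketch below merely illustrates the kind of convolutional encoder described in the text: a stack of convolution layers followed by fully connected heads that output the mean and log-variance of the 500-dimensional latent code. The channel counts, kernel sizes, and pooling size are assumptions, not the values in the figure:

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Convolutional VAE encoder with a 500-dimensional latent output layer."""
    def __init__(self, latent_dim=500, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),   # collapse to a fixed 4x4 spatial size
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)      # mean of q_phi(z|x)
        self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)  # log-variance of q_phi(z|x)

    def forward(self, x):
        h = self.features(x)
        return self.fc_mu(h), self.fc_logvar(h)
```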

In the adversarial transfer learning stage, the target-domain encoder is initialized with the parameters of the source-domain encoder. The discrimination network contains two hidden layers with 500 nodes each and an output layer with two nodes, which indicates whether the input feature comes from the source or the target domain. The discrimination network is trained first so that it can separate source features (labeled 1) from target features (labeled 0); its parameters are then fixed and the target-domain encoder is trained to drive the discriminator output toward 1. These two steps are repeated in turn to realize adversarial learning, and the network is finally fine-tuned with a small number of target-domain samples. Correlation alignment (CORAL) and balanced distribution adaptation (BDA) are used as comparison methods, and overall accuracy (OA) and the Kappa coefficient are chosen as evaluation metrics. The results are listed in Table 1. In Table 1, AE (autoencoder) denotes the result of training an autoencoder directly on the target domain without source samples or transfer learning and then fine-tuning the autoencoder and classifier with a few labeled target samples; AAET-C (AAET-classifier) denotes fine-tuning only the classifier with a few target samples while keeping the encoder fixed; AAET-E (AAET-encoder) denotes fine-tuning both the target encoder and the classifier with a few target samples; "no transfer" denotes applying the source encoder and classifier directly to the target test images; "unsupervised" denotes the result after adversarial learning without any adjustment using target samples; and the remaining columns give the results when 5, 10, and 15 labeled scene images per class from the target dataset are used as training samples to adjust the network. Table 1 shows that the recognition results after feature transfer from NWPU-RESISC45 are clearly better than those obtained with only a few labeled samples from the target UCMerced-21 dataset; after adversarial feature transfer, the recognition accuracy on UCMerced-21 improves by about 12% compared with directly applying the NWPU-RESISC45 encoder and classifier, and a small number of UCMerced-21 samples then suffices to raise the accuracy markedly. Compared with CORAL and BDA, the adversarial feature transfer method achieves higher accuracy when a few target-domain samples participate in training and has a clear advantage in the unsupervised setting.

Table 1 Scene recognition accuracy and Kappa coefficient for transfer from NWPU-RESISC45 to UCMerced-21

Method | No transfer (OA/%, Kappa) | Unsupervised (OA/%, Kappa) | 5 samples/class (OA/%, Kappa) | 10 samples/class (OA/%, Kappa) | 15 samples/class (OA/%, Kappa)
AE     | -, -        | -, -         | 40.00, 0.314 | 47.38, 0.399 | 54.75, 0.483
CORAL  | -, -        | 12.63, 0.001 | 58.29, 0.523 | 68.19, 0.637 | 73.53, 0.697
BDA    | -, -        | 46.25, 0.001 | 62.50, 0.523 | 70.14, 0.637 | 74.12, 0.697
AAET-C | 45.75, 0.38 | 57.13, 0.51  | 63.63, 0.584 | 72.25, 0.683 | 75.63, 0.721
AAET-E | 45.75, 0.38 | 57.13, 0.51  | 63.88, 0.587 | 73.38, 0.696 | 77.13, 0.739
Note: bold indicates the best result in each column; "-" means the value was not computed.
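The OA and Kappa values reported in Tables 1-3 can be computed directly from the predicted and true labels; a small sketch using scikit-learn (the choice of library is an assumption, since the paper does not state its tooling):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

def evaluate(y_true, y_pred):
    """Overall accuracy (OA), Kappa coefficient, and the confusion matrix used in Figs. 5 and 7."""
    oa = accuracy_score(y_true, y_pred) * 100      # OA in percent
    kappa = cohen_kappa_score(y_true, y_pred)      # agreement corrected for chance
    cm = confusion_matrix(y_true, y_pred)
    return oa, kappa, cm
```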

Fig. 5 shows the confusion matrix for UCMerced-21 scene recognition when 15 training samples per class are used to fine-tune the network. The confusion matrix shows that industrial area remains a difficult class to distinguish, mainly because the appearance, size, and spatial layout of the main objects in industrial scenes (such as oil tanks) vary considerably, as Fig. 3(f) illustrates; this irregularity hinders model building. The overpass class suffers from the same problem. In contrast, classes that are consistent across different conditions, such as beach, farmland, forest, and harbor, achieve satisfactory recognition results.

Fig. 5 Confusion matrix of UCMerced-21 scene recognition after transfer from NWPU-RESISC45

To verify the stability of the algorithm, feature transfer and scene recognition experiments are conducted with 5 to 8 scene classes in turn. Fig. 6 shows the recognition accuracy of the proposed AAET method. The results show that the number of scene classes strongly affects the recognition accuracy when no transfer learning is used, whereas after transfer learning the accuracy remains stable within a certain range regardless of whether a few target-domain samples are used to fine-tune the network parameters.

Fig. 6 Scene recognition accuracy for different numbers of scene classes

3.3 Scene transfer recognition from UCMerced-21 to NWPU-RESISC45

The same network structure and parameter settings are used for feature transfer from UCMerced-21 to NWPU-RESISC45. For fine-tuning, 1%, 2%, and 3% of the scene images in NWPU-RESISC45 are selected as training samples; the results are listed in Table 2. Although the adversarial variational autoencoder again improves recognition accuracy substantially, as in the transfer from NWPU-RESISC45 to UCMerced-21, the results are much lower. The main reason is that all UCMerced-21 images have a spatial resolution of 0.3 m, so the composition of ground objects within a scene class is roughly similar, whereas the resolution of NWPU-RESISC45 images varies widely from sub-meter to tens of meters. Consequently, during feature transfer, NWPU-RESISC45 can cover the features of most UCMerced-21 scenes, but the features extracted from UCMerced-21 cannot fully describe the scenes in NWPU-RESISC45. Nevertheless, compared with using only the labeled samples of the target dataset, the recognition accuracy improves by more than 20%, and the results are also better than those of the comparison methods.

Table 2 Scene recognition accuracy and Kappa coefficient for transfer from UCMerced-21 to NWPU-RESISC45

Method | No transfer (OA/%, Kappa) | Unsupervised (OA/%, Kappa) | 1% target samples (OA/%, Kappa) | 2% target samples (OA/%, Kappa) | 3% target samples (OA/%, Kappa)
AE     | -, -         | -, -         | 25.29, 0.146 | 28.46, 0.182 | 31.43, 0.216
CORAL  | -, -         | 12.59, 0.001 | 42.37, 0.341 | 46.81, 0.392 | 50.40, 0.433
BDA    | -, -         | 15.14, 0.03  | 44.21, 0.362 | 47.08, 0.395 | 49.56, 0.424
AAET-C | 33.75, 0.243 | 42.77, 0.346 | 45.52, 0.377 | 47.88, 0.404 | 51.39, 0.444
AAET-E | 33.75, 0.243 | 42.77, 0.346 | 46.89, 0.393 | 49.46, 0.422 | 53.07, 0.464
Note: bold indicates the best result in each column; "-" means the value was not computed.

Fig. 7 shows the confusion matrix for NWPU-RESISC45 scene recognition when 3% of the images per class are used as training samples for fine-tuning. Unlike pixel-level classification based on spectral information, where water is easily separated from other land covers, the river scenes here differ greatly from one another: the shapes and courses of the rivers vary, and the land cover on the banks, which may be canyon or vegetation, also differs considerably. In addition, the varying spatial resolutions of the images in the experiment strongly affect the recognition accuracy of the river class.

Fig. 7 Confusion matrix of NWPU-RESISC45 scene recognition after transfer from UCMerced-21

3.4 Scene transfer recognition from SUN397 to UCMerced-21

Transfer learning between a natural scene dataset and a remote sensing scene dataset is also attempted, transferring features from the horizontal-view SUN397 dataset to the bird's-eye-view UCMerced-21 dataset. Because the image sizes in the SUN397 natural scene dataset are mostly inconsistent, all images of both datasets are resized to 256 × 256 pixels before the experiments. The recognition accuracies are listed in Table 3. Although the overall accuracy without transfer learning is about 23%, the confusion matrix shows that the network assigns almost all remote sensing scenes to the second and fourth classes and fails to extract the features of remote sensing scenes; this only improves as adversarial learning adjusts the network parameters. Owing to the large viewpoint difference between the two datasets, the adversarial autoencoder-based method still needs improvement for such cases.

Table 3 Scene recognition accuracy and Kappa coefficient for transfer from SUN397 to UCMerced-21

Method | No transfer (OA/%, Kappa) | Unsupervised (OA/%, Kappa) | 5 samples/class (OA/%, Kappa) | 10 samples/class (OA/%, Kappa) | 15 samples/class (OA/%, Kappa)
AE     | -, -         | -, -         | 40.00, 0.314 | 47.38, 0.399 | 54.75, 0.483
AAET-E | 23.38, 0.124 | 29.63, 0.196 | 45.13, 0.373 | 54.75, 0.483 | 59.38, 0.536
Note: bold indicates the best result in each column; "-" means the value was not computed.
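As noted above, the SUN397 and UCMerced-21 images are first resized to a common 256 × 256 size; a minimal preprocessing sketch, assuming torchvision is used (a tooling assumption):

```python
from torchvision import transforms

# Resize every SUN397 and UCMerced-21 image to 256 x 256 before training and testing
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
```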

4 Conclusion

To address the lack of labeled samples in remote sensing image scene recognition and the inability to share labeled samples across datasets, this paper proposes a remote sensing scene recognition method based on an adversarial variational autoencoder. On the one hand, the unsupervised variational autoencoder makes full use of the information in source-domain scene images; on the other hand, the idea of adversarial learning makes the target-domain features extracted by the encoder as similar as possible to the source-domain features, enabling feature-transfer-based scene recognition in the target domain. Experiments show that the method effectively improves scene recognition accuracy when labeled target-domain samples are scarce, demonstrating the effectiveness of feature transfer for remote sensing scene recognition. However, for feature transfer from horizontal-view natural scenes to bird's-eye-view remote sensing scenes, the method improves recognition accuracy only to a limited extent, and further research is needed to better handle feature transfer under large viewpoint differences.

References

  • Chaib S, Liu H, Gu Y F, Yao H X. 2017. Deep feature fusion for VHR remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(8): 4775-4784 [DOI:10.1109/TGRS.2017.2700322]
  • Cheng G, Han J W, Lu X Q. 2017. Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE, 105(10): 1865-1883 [DOI:10.1109/JPROC.2017.2675998]
  • Cheng G, Yang C Y, Yao X W, Guo L, Han J W. 2018. When deep learning meets metric learning: remote sensing image scene classification via learning discriminative CNNs. IEEE Transactions on Geoscience and Remote Sensing, 56(5): 2811-2821 [DOI:10.1109/TGRS.2017.2783902]
  • Gong X, Xie Z, Liu Y Y, Shi X G, Zheng Z. 2018. Deep salient feature based anti-noise transfer network for scene classification of remote sensing imagery. Remote Sensing, 10(3): #410 [DOI:10.3390/rs10030410]
  • Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2672-2680
  • Hinton G E, Salakhutdinov R R. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504-507 [DOI:10.1126/science.1127647]
  • Hu F, Xia G S, Hu J W, Zhang L P. 2015. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 7(11): 14680-14707 [DOI:10.3390/rs71114680]
  • Kingma D P and Welling M. 2014. Auto-encoding Variational Bayes[EB/OL]. [2020-07-28]. http://arxiv.org/pdf/1312.6114.pdf
  • Li G D, Zhang C J, Wang M K, Zhang X Y, Gao F. 2019. Transfer learning using convolutional neural network for scene classification within high resolution remote sensing image. Science of Surveying and Mapping, 44(4): 116-123, 174 (李冠东, 张春菊, 王铭恺, 张雪英, 高飞. 2019. 卷积神经网络迁移的高分影像场景分类学习. 测绘科学, 44(4): 116-123, 174) [DOI:10.16251/j.cnki.1009-2307.2019.04.018]
  • Long M S, Wang J M, Ding G G, Sun J G and Yu P S. 2013. Transfer feature learning with joint distribution adaptation//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 2200-2207[DOI: 10.1109/ICCV.2013.274]
  • Matasci G, Volpi M, Kanevski M, Bruzzone L, Tuia D. 2015. Semisupervised transfer component analysis for domain adaptation in remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 53(7): 3550-3564 [DOI:10.1109/TGRS.2014.2377785]
  • Sun B C, Feng J S, Saenko K. 2016. Return of frustratingly easy domain adaptation//Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix, USA: AAAI Press: 2058-2065
  • Tong W, Chen W, Han W, Li X, Wang L. 2020. Channel-attention-based densenet network for remote sensing image scene classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13: 4121-4132 [DOI:10.1109/JSTARS.2020.3009352]
  • Wang G L, Fan B, Xiang S M, Pan C H. 2017a. Aggregating rich hierarchical features for scene classification in remote sensing imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(9): 4104-4115 [DOI:10.1109/JSTARS.2017.2705419]
  • Wang J D, Chen Y Q, Hao S J, Feng W J and Shen Z Q. 2017b. Balanced distribution adaptation for transfer learning//Proceedings of 2017 IEEE International Conference on Data Mining (ICDM). New Orleans, USA: IEEE: 1129-1134[DOI: 10.1109/ICDM.2017.150]
  • Wang Q, Liu S T, Chanussot J, Li X L. 2019. Scene classification with recurrent attention of VHR remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 57(2): 1155-1167 [DOI:10.1109/TGRS.2018.2864987]
  • Wu R, Li Y, Han H, Chen X and Lin Y. 2019. Remote sensing image analysis based on transfer learning: a survey//Proceedings of International Conference on Advanced Hybrid Information Processing. Nanjing, China: Springer: 408-415[DOI: 10.1007/978-3-030-19086-6_45]
  • Xiao J X, Ehinger K A, Hays J, Torralba A, Oliva A. 2016. SUN database: exploring a large collection of scene categories. International Journal of Computer Vision, 119(1): 3-22 [DOI:10.1007/s11263-014-0748-y]
  • Yang Y and Newsam S. 2010. Bag-of-visual-words and spatial extensions for land-use classification//Proceeding of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. San Jose, USA: Association for Computing Machinery: 270-279[DOI: 10.1145/1869790.1869829]
  • Yao Y, Liang H, Li X, Zhang J and He J. 2017. Sensing urban land-use patterns by integrating Google Tensorflow and scene-classification models//The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. XLII-2/W7: 981-988[DOI: 10.5194/isprs-archives-XLII-2-W7-981-2017]
  • Zhang J P, Li T, Lu X C, Cheng Z. 2016. Semantic classification of high-resolution remote-sensing images based on mid-level features. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(6): 2343-2353 [DOI:10.1109/JSTARS.2016.2536943]
  • Zhao B, Zhong Y F, Zhang L P, Huang B. 2016. The Fisher kernel coding framework for high spatial resolution scene classification. Remote Sensing, 8(2): #157 [DOI:10.3390/rs8020157]
  • Zhu Q Q, Zhong Y F, Wu S Q, Zhang L P, Li D R. 2018. Scene classification based on the sparse homogeneous-heterogeneous topic feature model. IEEE Transactions on Geoscience and Remote Sensing, 56(5): 2689-2703 [DOI:10.1109/TGRS.2017.2781712]