Published: 2019-04-24
DOI: 10.11834/jig.180420
2019 | Volume 24 | Number 4

Remote Sensing Image Processing

面向遥感影像的深度语义哈希检索

陈诚¹, 邹焕新¹, 邵宁远¹, 孙嘉赤¹, 秦先祥²

1. 国防科技大学电子科学学院, 长沙 410073;

2. 空军工程大学信息与导航学院, 西安 710077

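Received: 2018-07-04; Revised: 2018-08-10

Funding: National Natural Science Foundation of China (61331015, 41601436)

First author: Chen Cheng, born in 1994, male, master's student, main research interest: remote sensing image retrieval. E-mail: cc233cc@foxmail.com;
Shao Ningyuan, female, master's degree, main research interest: change detection in multi-source remote sensing data. E-mail: ningyuanshao@163.com;
Sun Jiachi, male, master's degree, main research interest: object detection in optical remote sensing images. E-mail: 445219733@qq.com;
Qin Xianxiang, male, lecturer, Ph.D., main research interest: SAR image processing and applications. E-mail: qxxzhijia@126.com.

CLC number: TP391

Document code: A

Article ID: 1006-8961(2019)04-0655-09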

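Abstract

Objective Hashing retrieval maps high-dimensional data in a massive data space to compact binary Hash codes and computes the Hamming distance between any two codes rapidly using bit operations and XOR, so that large datasets can be searched efficiently while similarity is preserved. Remote sensing images, however, carry rich semantic information in addition to image features. Traditional Hashing methods, which extract image features and generate Hash codes, cannot make effective use of this semantic information, which limits retrieval accuracy. To exploit the semantic information in remote sensing images, a retrieval method based on deep semantic Hashing is proposed. Method Based on a training set of remote sensing images with multiple semantic labels, two deep convolutional networks with different configurations first extract image features and semantic features, respectively. The back-propagation algorithm is then used to learn the network parameters from the two types of features and to generate binary Hash codes, which effectively preserve the similarity of the original high-dimensional images. Result Experiments are conducted on the GF-2 and Google Earth remote sensing image dataset, the CIFAR-10 dataset, and the FLICKR-25K dataset, and the method is compared with several others. With 64-bit codes, the mAP (mean average precision) improves over DPSH (deep supervised Hashing with pairwise labels) by about 2%, 6% to 7%, and 0.6% on the three datasets, respectively. Conclusion The proposed end-to-end deep learning framework can use semantic features to effectively improve retrieval performance for remote sensing images with one or more semantic labels.

Keywords

Hashing; image retrieval; deep learning; semantic mining; remote sensing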

Deep semantic Hashing retrieval of remote sensing images

Chen Cheng¹, Zou Huanxin¹, Shao Ningyuan¹, Sun Jiachi¹, Qin Xianxiang²

1. College of Electronic Science, National University of Defense Technology, Changsha 410073, China;

2. School of Information and Navigation, Air Force Engineering University, Xi'an 710077, China

Supported by: National Natural Science Foundation of China(61331015, 41601436)

Abstract

Objective Hashing methods map high-dimensional data to compact binary Hash codes in Hamming space and compute Hamming distances rapidly with bit operations and XOR, which enables efficient similarity-preserving search and retrieval over big data. However, a massive number of remote sensing (RS) images are associated with semantic information. Traditional methods that extract image features and generate Hash codes cannot effectively use this semantic information, which limits the accuracy of remote sensing image retrieval. This study proposes an image retrieval method based on DSH (deep semantic Hashing) for mining the semantic information of remote sensing images with tags or other semantic annotations. The contributions of this study include introducing Hashing methods for RS images that encode high-dimensional image feature vectors into binary bits by using a limited number of labeled (annotated) images. Furthermore, DSH learns the discrete Hash codes directly, without the relaxation that would otherwise degrade the accuracy of the learned codes. Hence, DSH provides efficient (in terms of storage and speed) and accurate search capability within huge data archives. Method The DSH model performs feature learning and Hash code learning simultaneously in an end-to-end framework, which is organized into two main parts, namely feature learning and Hashing learning. In feature learning, we use two deep neural networks, one for images and one for semantic annotations. The deep neural network for images is a convolutional neural network (CNN) adapted from vgg_net. Specifically, the first seven layers are taken from a vgg_16 network pretrained on ImageNet, and the eighth layer is replaced with a fully connected layer whose output is the learned image feature. The first seven layers use the rectified linear unit (ReLU) as the activation function, and the eighth layer uses the identity function.
For semantic annotations, we use semantic vectors as the input to a deep neural network with two fully connected layers, which use ReLU and the identity function, respectively, as activation functions. In Hashing learning, let f(x_i; θ_x) denote the learned feature for image x_i, which corresponds to the output of the CNN for images, and let g(y_j; θ_y) denote the learned feature for semantic annotation y_j, which corresponds to the output of the deep neural network for semantic vectors. Here, θ_x is the network parameter of the CNN for images, and θ_y is the network parameter of the deep neural network for semantic vectors. For the binary codes B = {b_i} (i = 1, ..., n), we define the likelihood of the similarities and the resulting optimization function, and we learn the network parameters through an alternating learning strategy, which learns one parameter while fixing the others. Result We have conducted experiments on three archives. The first archive consists of 2 000 images acquired from the GF-2 satellite and Google Earth. Each image in the archive is a section of 224×224 pixels and is associated with several textual tags. In our experiments, tags with similar meanings are treated as one semantic annotation. The second archive is the CIFAR-10 dataset, a single-label dataset consisting of 60 000 color images of 32×32 pixels, each belonging to one of ten classes. The third archive is the FLICKR-25K dataset, which consists of 25 000 images of 224×224 pixels, each associated with several textual tags; as with the first archive, tags with similar meanings are treated as one semantic annotation. On the GF-2 satellite and Google Earth remote sensing image dataset, when the Hash code length is 64 bits, the mean average precision (mAP) improves by approximately 2% compared with DPSH (deep supervised Hashing with pairwise labels).
On the CIFAR-10 dataset, the proposed method improves mAP by 6% to 7% compared with DPSH when the Hash code length is 64 bits. On the FLICKR-25K dataset, the improvement over DPSH is approximately 0.6% at 64 bits. Conclusion In this study, we propose an end-to-end deep learning framework that considers both visual and semantic features of images and generates Hash functions and Hash codes by utilizing the semantic information, thereby providing high accuracy for RS image retrieval. Experimental results show that the proposed method improves the accuracy of image retrieval. Notably, the archives used in the experiments are benchmarks composed of a moderate number of images, whereas in many actual applications the search is expected to be applied to considerably larger archives.

Key words

Hashing; image retrieval; deep learning; semantic mining; remote sensing

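0 Introduction

In recent years, the launch of a series of satellites has provided a large amount of data for image retrieval. Large-scale remote sensing image datasets have three main characteristics: large data volume, high feature dimensionality, and a short required response time. Achieving fast and efficient remote sensing image retrieval has therefore become an increasingly challenging problem^[1]. Hashing is one of the key techniques for addressing these problems. Owing to its simple structure, high retrieval efficiency, low storage cost, easy extension, and robustness to the curse of dimensionality, Hashing has become an important technique for large-scale image retrieval^[2].

The basic idea of Hashing is to map data points from the original high-dimensional feature space to binary codes in a low-dimensional Hamming space, which greatly reduces storage cost and increases retrieval speed^[3]. As shown in Fig. 1, two images that are similar in the original space differ in only 1 bit after mapping, whereas points that are dissimilar in the original space have codes separated by a large Hamming distance^[4].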

Fig. 1 An example of Hashing retrieval
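The XOR-based distance computation described above can be sketched as follows. This is a minimal illustration only: the 8-bit codes and the function names `hamming_distance` and `rank_by_hamming` are hypothetical and do not come from the paper.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary codes stored as integers."""
    # XOR leaves a 1 exactly where the two codes differ; count those bits.
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query: int, database: list) -> list:
    """Return database indices sorted by Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Hypothetical 8-bit codes, for illustration only.
query = 0b10110100
database = [0b10110101, 0b01001011, 0b10110100, 0b11110000]
ranking = rank_by_hamming(query, database)
```

Storing codes as machine integers is what makes retrieval cheap: one XOR and one bit count per database item, regardless of the dimensionality of the original features.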

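However, traditional Hashing methods usually retrieve images using low-level features (color, texture, shape, etc.), whereas humans tend to understand images through high-level features such as text (semantics); this mismatch is known as the semantic gap^[5]. Traditional Hashing methods, which rely on clustering or classification, struggle to obtain semantic features that describe images more accurately, and they are not robust when applied to remote sensing image data.

Deep learning has become one of the effective means of extracting image features. Compared with traditional methods, methods based on deep convolutional neural networks have stronger learning capacity, can represent more complex functions, and are better suited to extracting semantic features. Consequently, many recent studies have explored using deep learning to generate Hash codes for image retrieval. To address the semantic gap, this paper proposes a deep-learning-based retrieval method oriented to the semantic features of remote sensing images.

The method first uses two deep convolutional networks with different configurations to extract image features and semantic features, respectively, from a training set of remote sensing images with multiple semantic labels. It then uses the back-propagation algorithm to learn the network parameters from the two types of extracted features, and finally generates binary Hash codes of the remote sensing images for retrieval.

The proposed method has the following characteristics: 1) feature extraction and Hash code generation are integrated into a single model that can be trained end to end; 2) discrete Hash codes are learned directly, which avoids relaxation and reduces quantization loss; 3) a deep convolutional network is used to extract the semantic features of images, which helps improve retrieval accuracy.

The method is evaluated on several datasets. The results show that the generated Hash codes preserve the similarity of the original high-dimensional remote sensing images and thereby effectively improve multi-label image retrieval performance.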

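1 Related work

Hashing methods can generally be divided into data-independent and data-dependent methods^[6].

Data-independent Hashing methods construct their Hash functions without using information from the data itself; instead, the Hash functions are generated randomly from a given probability distribution. These models are simple and easy to extend, are still widely used, and form the basis of many data-dependent Hash learning methods. LSH (locality-sensitive Hashing) is a typical data-independent method^[7]. LSH randomly generates a projection matrix from a given probability distribution (e.g., the normal distribution), maps each feature vector to a real-valued vector whose length equals the Hash code length, and then thresholds it to produce a binary code. The mapped data preserve the distance similarity of the original space, but relatively long Hash codes are needed to achieve satisfactory retrieval results.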
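As a rough sketch of the random-projection LSH described above (assuming Gaussian projections thresholded at zero; the feature dimension, code length, and synthetic vectors are illustrative choices, not values from the paper):

```python
import numpy as np


def lsh_encode(features: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Project features and threshold at zero to obtain {0, 1} codes."""
    return (features @ projection > 0).astype(np.uint8)


rng = np.random.default_rng(0)
dim, n_bits = 512, 32                             # illustrative sizes
projection = rng.standard_normal((dim, n_bits))   # random Gaussian hyperplanes

x = rng.standard_normal(dim)
x_near = x + 0.05 * rng.standard_normal(dim)      # small perturbation of x
x_far = rng.standard_normal(dim)                  # unrelated point

codes = lsh_encode(np.stack([x, x_near, x_far]), projection)
dist_near = int(np.count_nonzero(codes[0] != codes[1]))
dist_far = int(np.count_nonzero(codes[0] != codes[2]))
```

Because the hyperplanes ignore the data distribution, each bit carries little information on its own, which is consistent with the remark that LSH needs long codes to retrieve well.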

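Data-dependent Hashing methods use information from the data itself and construct Hash functions through machine learning. Typical examples include SH (spectral Hashing)^[8], SKLSH (shift-invariant kernelized locality-sensitive Hashing)^[9], and ITQ (iterative quantization)^[10]. SH describes the structure of the sample data with a spectral graph and models Hash code generation as a spectral graph partitioning problem, which can be solved by eigenvalue decomposition. SH can encode low-dimensional data effectively, but when the data dimensionality is high its performance degenerates to being equivalent to principal component analysis. In addition, SH assumes that the data are uniformly distributed, a condition that most real data do not necessarily satisfy. SKLSH measures the similarity between images by the Euclidean distance between features or by kernel values in the original feature space; it does not depend on the data distribution, only on the kernel function. The core idea of ITQ is to find a good rotation of zero-centered data such that the quantization error of mapping the data to the vertices of a zero-centered binary hypercube is minimized. ITQ is widely applicable: it can be combined with canonical correlation analysis as a supervised Hashing method for retrieval in labeled databases, or with PCA (principal component analysis) as an unsupervised Hashing method for retrieval in unlabeled databases.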
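The unsupervised PCA + ITQ variant mentioned above can be sketched as follows. This is a simplified illustration on synthetic data; the function name `itq`, the iteration count, and the data sizes are assumptions made here for the example.

```python
import numpy as np


def itq(data: np.ndarray, n_bits: int, n_iter: int = 50, seed: int = 0):
    """PCA projection followed by iterative quantization.

    Returns the {-1, +1} codes, the rotation, and the quantization error
    ||B - V R||_F^2 recorded after each iteration.
    """
    rng = np.random.default_rng(seed)
    centered = data - data.mean(axis=0)
    # PCA: keep the top n_bits principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = centered @ vt[:n_bits].T
    # Start from a random orthogonal rotation.
    rotation, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    errors = []
    for _ in range(n_iter):
        # Fix R, update B: nearest vertex of the binary hypercube.
        codes = np.where(v @ rotation >= 0, 1.0, -1.0)
        # Fix B, update R: orthogonal Procrustes problem solved by SVD.
        u, _, wt = np.linalg.svd(codes.T @ v)
        rotation = wt.T @ u.T
        errors.append(float(np.linalg.norm(codes - v @ rotation) ** 2))
    codes = np.where(v @ rotation >= 0, 1.0, -1.0)
    return codes, rotation, errors


rng = np.random.default_rng(1)
data = rng.standard_normal((200, 16))   # synthetic stand-in for image features
codes, rotation, errors = itq(data, n_bits=8)
```

Each half-step minimizes the same error over one variable with the other fixed, so the recorded quantization error cannot increase from one iteration to the next.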

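In recent years, some studies have begun to apply deep convolutional neural networks to Hashing retrieval, and Ref. [11] demonstrated the feasibility of learning Hash functions with deep learning. CNNH (convolutional neural networks Hashing)^[12] was the first to apply deep learning to Hashing. CNNH consists of two stages: in the first stage, coordinate descent is used to decompose the similarity matrix into Hash codes; in the second stage, a deep convolutional network learns the Hash functions and image representation from the generated Hash codes. However, the feature representation learned in the second stage cannot be fed back to the first stage to learn better Hash codes. The study by Lai et al.^[13] also showed that CNNH cannot perform feature learning and Hash code learning simultaneously. DPSH^[14] is an end-to-end learning framework that first learns the feature representation of images with a deep neural network, then maps the representation to Hash codes, and finally provides feedback through a loss function that measures the quality of the Hash codes. DSRH (deep semantic ranking Hashing)^[15] is an image retrieval method based on deep semantic ranking that learns Hash functions from multi-label semantic similarity. The proposed model is an end-to-end learning framework that performs feature learning and Hash code learning simultaneously and further mines the semantic information in remote sensing images to improve retrieval accuracy.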

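2 Model

Fig. 2 shows the basic framework of the proposed method, which consists of two parts: a feature learning module and a Hashing learning module.

Fig. 2 The main framework of proposed model

The feature learning module contains two different deep convolutional networks, used to extract the image features and the semantic features of remote sensing images, respectively. The Hashing learning module takes the extracted features, constructs an objective function that minimizes the quantization loss, and optimizes it to learn the Hash codes. Throughout learning, each module can provide feedback to the other, so the two are integrated into one end-to-end deep learning framework.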

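2.1 Feature learning module

For image features, the network model from vgg_net^[16] is adopted. Specifically, the first seven layers of feature learning use a network model pretrained on ImageNet^[17], and the eighth layer is replaced with a fully connected layer whose output is the learned image feature. The detailed network parameters are listed in Table 1. The first seven layers use the rectified linear unit (ReLU) as the activation function, and the eighth layer uses the identity function.

For semantic features, the semantic vector is fed into a network composed of two fully connected layers, which use ReLU and the identity function, respectively, as activation functions. The network parameters are listed in Table 2.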
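A minimal forward-pass sketch of the semantic branch, using the layer widths given in Table 2 (8 192 hidden units, then the code length $c$). The multi-hot label encoding, the random initialisation, and the names `init_semantic_net` and `semantic_forward` are assumptions for illustration; the paper's implementation is in MatConvNet and is not reproduced here.

```python
import numpy as np


def init_semantic_net(n_labels: int, code_length: int, hidden: int = 8192, seed: int = 0):
    """Randomly initialise the two fully connected layers (full1, full2)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": 0.01 * rng.standard_normal((n_labels, hidden)),
        "b1": np.zeros(hidden),
        "w2": 0.01 * rng.standard_normal((hidden, code_length)),
        "b2": np.zeros(code_length),
    }


def semantic_forward(params, labels: np.ndarray) -> np.ndarray:
    """Map multi-hot label vectors (n, n_labels) to real-valued features (n, c)."""
    hidden = np.maximum(labels @ params["w1"] + params["b1"], 0.0)  # full1 + ReLU
    return hidden @ params["w2"] + params["b2"]                      # full2 + identity


params = init_semantic_net(n_labels=13, code_length=64)
labels = np.zeros((4, 13))
labels[[0, 1, 2, 3], [0, 3, 7, 12]] = 1.0     # hypothetical one-tag annotations
features = semantic_forward(params, labels)    # real-valued network output
hash_codes = np.sign(features)                 # binarised with the sign function
```

The identity activation on the last layer keeps the output real-valued and signed, so that taking the sign yields a code of length $c$.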

Table 1 Configuration of the CNN for image

Layer	Configuration
conv1	filter 64×11×11, stride 4×4, pad 0, LRN, pool 2×2
conv2	filter 256×5×5, stride 1×1, pad 2, LRN, pool 2×2
conv3	filter 256×3×3, stride 1×1, pad 1
conv4	filter 256×3×3, stride 1×1, pad 1
conv5	filter 256×3×3, stride 1×1, pad 1, pool 2×2
full6	4 096
full7	4 096
full8	Hash code length $c$

Table 2 Configuration of the CNN for semantics

Layer	Configuration
full1	8 192
full2	Hash code length $c$

2.2 Hashing learning module

Suppose there are $n$ remote sensing images $\boldsymbol{X}=\{\boldsymbol{x}_i\}_{i=1}^{n}$ with corresponding semantic annotations $\boldsymbol{Y}=\{\boldsymbol{y}_j\}_{j=1}^{n}$. The similarity $M_{ij}$ between $\boldsymbol{x}_i$ and $\boldsymbol{y}_j$ is defined as

$ M_{ij} = \left\{ \begin{array}{ll} 1, & \boldsymbol{x}_i \text{ and } \boldsymbol{y}_j \text{ are similar}\\ 0, & \boldsymbol{x}_i \text{ and } \boldsymbol{y}_j \text{ are not similar} \end{array} \right. $

(1)
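One way to build $M$ from multi-label annotations is sketched below, assuming the rule stated in Section 3.1 that a pair counts as similar when it shares at least one annotation. The multi-hot encoding and the example labels are hypothetical.

```python
import numpy as np


def similarity_matrix(image_labels: np.ndarray, semantic_labels: np.ndarray) -> np.ndarray:
    """M[i, j] = 1 if image i and annotation j share at least one label, else 0."""
    shared = image_labels @ semantic_labels.T    # number of labels in common
    return (shared > 0).astype(np.int8)


# Hypothetical multi-hot labels for 3 samples over 4 classes.
labels = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
])
M = similarity_matrix(labels, labels)
```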

To learn similarity-preserving Hash codes from the extracted features, the objective function must make the Hash codes of points that are similar in the original space as close as possible. Let $f(\boldsymbol{x}_i;\boldsymbol{\theta}_x)$ denote the learned image feature of remote sensing image $\boldsymbol{x}_i$, i.e., the output of the image branch of the deep network, and let $g(\boldsymbol{y}_j;\boldsymbol{\theta}_y)$ denote the learned feature of semantic annotation $\boldsymbol{y}_j$, i.e., the output of the deep network for semantic vectors. $\boldsymbol{\theta}_x$ and $\boldsymbol{\theta}_y$ are the network parameters of the image branch and the semantic branch, respectively. For the binary codes $\boldsymbol{B}=\{\boldsymbol{b}_i\}_{i=1}^{n}$, the likelihood function of the similarity is defined as

$ p\left(M_{ij}|\boldsymbol{F}_{*i},\boldsymbol{G}_{*j}\right) = \left\{ \begin{array}{ll} \sigma\left(\varTheta_{ij}\right), & M_{ij}=1\\ 1-\sigma\left(\varTheta_{ij}\right), & M_{ij}=0 \end{array} \right. $

(2)

where $\boldsymbol{F}_{*i}=f(\boldsymbol{x}_i;\boldsymbol{\theta}_x)$, $\boldsymbol{G}_{*j}=g(\boldsymbol{y}_j;\boldsymbol{\theta}_y)$, $\varTheta_{ij}=\frac{1}{2}\boldsymbol{F}_{*i}^{\rm T}\boldsymbol{G}_{*j}$, and $\sigma\left(\varTheta_{ij}\right)=\frac{1}{1+{\rm e}^{-\varTheta_{ij}}}$.

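Because Hash code generation is a discrete learning problem whose output is not a continuously differentiable value, the squared-error and cross-entropy loss functions are not applicable here. This paper therefore adopts the log-likelihood as the loss for parameter optimization, which meets this requirement and also simplifies the gradient computation in back-propagation. From the negative log-likelihood of the similarity, the following optimization problem is defined

$ \min J_1 = -\sum\limits_{i,j=1}^{n}\left(M_{ij}\varTheta_{ij}-\ln\left(1+{\rm e}^{\varTheta_{ij}}\right)\right) $

(3)

Maximizing the likelihood is equivalent to minimizing this negative log-likelihood: when $M_{ij}=1$ the similarity (inner product) between $\boldsymbol{F}_{*i}$ and $\boldsymbol{G}_{*j}$ is made as large as possible, and when $M_{ij}=0$ it is made as small as possible, so that the similarity between images and semantics is preserved.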
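The relation between the likelihood in Eq. (2) and the objective in Eq. (3) can be checked numerically. The sketch below evaluates $J_1$ in a numerically stable form and compares it with the negative sum of log-likelihoods; the random features, sizes, and function names are illustrative assumptions.

```python
import numpy as np


def pairwise_nll(F: np.ndarray, G: np.ndarray, M: np.ndarray) -> float:
    """J1 = -sum_ij (M_ij * Theta_ij - ln(1 + exp(Theta_ij))), Theta = F^T G / 2.

    F and G are (c, n) matrices whose columns are F_{*i} and G_{*j}.
    """
    theta = 0.5 * F.T @ G
    # logaddexp(0, t) = ln(1 + e^t), evaluated without overflow for large t.
    return float(-np.sum(M * theta - np.logaddexp(0.0, theta)))


def likelihood(F: np.ndarray, G: np.ndarray, M: np.ndarray) -> np.ndarray:
    """p(M_ij | F_{*i}, G_{*j}) from Eq. (2)."""
    sigma = 1.0 / (1.0 + np.exp(-0.5 * F.T @ G))
    return np.where(M == 1, sigma, 1.0 - sigma)


rng = np.random.default_rng(0)
c, n = 16, 5
F = rng.standard_normal((c, n))
G = rng.standard_normal((c, n))
M = (rng.random((n, n)) < 0.5).astype(float)

j1 = pairwise_nll(F, G, M)
neg_log_lik = float(-np.sum(np.log(likelihood(F, G, M))))
```

The two values agree because $\ln\sigma(\varTheta)=\varTheta-\ln(1+{\rm e}^{\varTheta})$ and $\ln(1-\sigma(\varTheta))=-\ln(1+{\rm e}^{\varTheta})$.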

To further optimize the objective function, constraint terms are added to Eq. (3), giving

$ \begin{array}{l} \min J_2 = -\sum\limits_{i,j=1}^{n}\left(M_{ij}\varTheta_{ij}-\ln\left(1+{\rm e}^{\varTheta_{ij}}\right)\right)+\\ \;\;\;\;\;\;\gamma\left(\|\boldsymbol{B}^{(x)}-\boldsymbol{F}\|_{\rm F}^2+\|\boldsymbol{B}^{(y)}-\boldsymbol{G}\|_{\rm F}^2\right) \end{array} $

(4)

where $\boldsymbol{B}^{(x)}={\rm sgn}(\boldsymbol{F})$, $\boldsymbol{B}^{(y)}={\rm sgn}(\boldsymbol{G})$, and ${\rm sgn}(\cdot)$ denotes the sign function^[18]. Because $\boldsymbol{B}^{(x)}$ and $\boldsymbol{B}^{(y)}$ are kept close to $\boldsymbol{F}$ and $\boldsymbol{G}$, which preserve the similarity in $M$, the binary codes can be expected to preserve the similarity between images and semantics as well. $\|\cdot\|_{\rm F}$ denotes the Frobenius norm of a matrix, and $\gamma$ is a hyperparameter.

This study also aims to balance each bit of the training-set Hash codes so that each bit carries as much information as possible. The term $\eta\left(\|\boldsymbol{F}\boldsymbol{1}\|_{\rm F}^2+\|\boldsymbol{G}\boldsymbol{1}\|_{\rm F}^2\right)$, where $\boldsymbol{1}$ is the all-ones vector, is therefore added so that the numbers of $+1$ and $-1$ values in each bit over the training set are roughly equal, which gives the objective function

$ \begin{array}{l} \min J_3 = -\sum\limits_{i,j=1}^{n}\left(M_{ij}\varTheta_{ij}-\ln\left(1+{\rm e}^{\varTheta_{ij}}\right)\right)+\\ \;\;\;\;\;\;\gamma\left(\|\boldsymbol{B}^{(x)}-\boldsymbol{F}\|_{\rm F}^2+\|\boldsymbol{B}^{(y)}-\boldsymbol{G}\|_{\rm F}^2\right)+\\ \;\;\;\;\;\;\eta\left(\|\boldsymbol{F}\boldsymbol{1}\|_{\rm F}^2+\|\boldsymbol{G}\boldsymbol{1}\|_{\rm F}^2\right) \end{array} $

(5)

where $\eta$ is a hyperparameter.
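The three terms of Eq. (5) can be evaluated as in the sketch below, which assumes that $\boldsymbol{F}$ and $\boldsymbol{G}$ are stored as $c\times n$ matrices and uses $\gamma=\eta=1$ as defaults (the values reported in Section 3.2). The toy matrices are illustrative only.

```python
import numpy as np


def objective(F, G, M, gamma=1.0, eta=1.0):
    """Evaluate J3 from Eq. (5) for (c, n) feature matrices F and G."""
    theta = 0.5 * F.T @ G
    nll = -np.sum(M * theta - np.logaddexp(0.0, theta))             # likelihood term
    B_x, B_y = np.sign(F), np.sign(G)
    quantization = np.sum((B_x - F) ** 2) + np.sum((B_y - G) ** 2)  # distance to codes
    ones = np.ones(F.shape[1])
    balance = np.sum((F @ ones) ** 2) + np.sum((G @ ones) ** 2)     # bit balance
    return float(nll + gamma * quantization + eta * balance)


# Features that are already binary and balanced: both penalty terms vanish.
F_bal = np.array([[1.0, -1.0, 1.0, -1.0],
                  [1.0, 1.0, -1.0, -1.0]])
M = np.eye(4)
j_plain = objective(F_bal, F_bal, M, gamma=0.0, eta=0.0)
j_full = objective(F_bal, F_bal, M)

# Shifting every entry by 0.5 breaks both properties and is penalised.
F_off = F_bal + 0.5
penalty = objective(F_off, F_off, M) - objective(F_off, F_off, M, gamma=0.0, eta=0.0)
```

The quantization term is zero only when the features already equal their signs, and the balance term is zero only when each row of $\boldsymbol{F}$ and $\boldsymbol{G}$ sums to zero over the training set.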

Experiments show that better results are obtained when the image branch and the semantic branch use the same binary codes during training, i.e., $\boldsymbol{B}^{(x)}=\boldsymbol{B}^{(y)}=\boldsymbol{B}$. The final optimization function is then defined as

$ \begin{array}{l} \min J_4 = -\sum\limits_{i,j=1}^{n}\left(M_{ij}\varTheta_{ij}-\ln\left(1+{\rm e}^{\varTheta_{ij}}\right)\right)+\\ \;\;\;\;\;\;\gamma\left(\|\boldsymbol{B}-\boldsymbol{F}\|_{\rm F}^2+\|\boldsymbol{B}-\boldsymbol{G}\|_{\rm F}^2\right)+\\ \;\;\;\;\;\;\eta\left(\|\boldsymbol{F}\boldsymbol{1}\|_{\rm F}^2+\|\boldsymbol{G}\boldsymbol{1}\|_{\rm F}^2\right) \end{array} $

(6)

Note that $\boldsymbol{B}^{(x)}=\boldsymbol{B}^{(y)}=\boldsymbol{B}$ is imposed only during training. When generating Hash codes, the Hash functions $h^{(x)}$ and $h^{(y)}$ are used to obtain the codes, i.e.,

$ \boldsymbol{b}_i^{(x)} = h^{(x)}(\boldsymbol{x}_i) $

(7)

$ \boldsymbol{b}_j^{(y)} = h^{(y)}(\boldsymbol{y}_j) $

(8)

Finally, as in most deep learning methods, the back-propagation (BP) algorithm is used to learn the parameters of the deep networks. Specifically, one group of parameters is learned at a time while all the others are fixed. Taking $\boldsymbol{\theta}_x$ as an example, with $\boldsymbol{\theta}_y$ and $\boldsymbol{B}$ fixed, the gradient of the loss with respect to $\boldsymbol{F}_{*i}$ is computed; the optimal parameters are reached where this gradient vanishes.
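The gradient used in this alternating step is not written out in the text. Differentiating Eq. (6) with $\boldsymbol{G}$ and $\boldsymbol{B}$ held fixed gives $\frac{\partial J_4}{\partial \boldsymbol{F}_{*i}}=\frac{1}{2}\sum_j\left(\sigma(\varTheta_{ij})-M_{ij}\right)\boldsymbol{G}_{*j}+2\gamma(\boldsymbol{F}_{*i}-\boldsymbol{B}_{*i})+2\eta\boldsymbol{F}\boldsymbol{1}$. The sketch below checks this derivation against finite differences. The choice $\boldsymbol{B}={\rm sgn}(\boldsymbol{F}+\boldsymbol{G})$, the step size, and the plain gradient steps on $\boldsymbol{F}$ (standing in for the BP update of $\boldsymbol{\theta}_x$) are assumptions made for illustration, not the paper's procedure.

```python
import numpy as np


def loss(F, G, B, M, gamma=1.0, eta=1.0):
    """J4 from Eq. (6) with B treated as a fixed matrix of codes."""
    theta = 0.5 * F.T @ G
    nll = -np.sum(M * theta - np.logaddexp(0.0, theta))
    quant = np.sum((B - F) ** 2) + np.sum((B - G) ** 2)
    balance = np.sum(F.sum(axis=1) ** 2) + np.sum(G.sum(axis=1) ** 2)
    return float(nll + gamma * quant + eta * balance)


def grad_wrt_F(F, G, B, M, gamma=1.0, eta=1.0):
    """Analytic gradient of J4 with respect to F, holding G and B fixed."""
    theta = 0.5 * F.T @ G
    sigma = 1.0 / (1.0 + np.exp(-theta))
    grad = 0.5 * G @ (sigma - M).T                      # likelihood term
    grad += 2.0 * gamma * (F - B)                       # quantization term
    grad += 2.0 * eta * F.sum(axis=1, keepdims=True)    # balance term, same for every column
    return grad


def numeric_grad(F, G, B, M, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(F)
    for idx in np.ndindex(*F.shape):
        step = np.zeros_like(F)
        step[idx] = eps
        grad[idx] = (loss(F + step, G, B, M) - loss(F - step, G, B, M)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
c, n = 4, 6
F = rng.standard_normal((c, n))
G = rng.standard_normal((c, n))
B = np.sign(F + G)                      # illustrative fixed codes
M = (rng.random((n, n)) < 0.5).astype(float)

max_gap = float(np.max(np.abs(grad_wrt_F(F, G, B, M) - numeric_grad(F, G, B, M))))

# A few plain gradient steps on F with G and B fixed.
F_new = F.copy()
for _ in range(20):
    F_new -= 0.01 * grad_wrt_F(F_new, G, B, M)
```

In the full model this gradient would be passed back through the image network by the chain rule, and the roles of $\boldsymbol{F}$ and $\boldsymbol{G}$ are swapped when $\boldsymbol{\theta}_y$ is updated.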

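3 Experiments

3.1 Datasets

The experiments use three different datasets for training and testing. The first consists of 2 000 images acquired from the GF-2 satellite and Google Earth; each image is 224×224 pixels and is associated with one or more textual tags. Fig. 3 shows some example original remote sensing images and their corresponding similar images. In the experiments, if an image has several tags with similar meanings, they are merged into one semantic annotation. Fig. 4 shows examples of semantic annotations in the dataset. Under this scheme the dataset is divided into 13 classes, with an average of 3.9 labels per image.

Fig. 3 Example of original remote sensing images and their similar images ((a) original images; (b) similar images)

Fig. 4 Example of semantic annotation in the image archive

The second is the CIFAR-10 dataset, which contains 60 000 single-label color images of 32×32 pixels in 10 classes, with 6 000 images per class.

The third is the FLICKR-25K dataset, which contains 25 000 images, each carrying several tags. Each image is 224×224 pixels. Following a procedure similar to that used for the first dataset, the tags are grouped into 24 classes, with an average of 3.8 labels per image.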

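For remote sensing images, existing labeled data are often limited and obtaining labels is costly. The method is therefore trained with relatively few samples to reflect practical conditions, and no strict requirement is placed on the ratio of test set to training set. 1) For the GF-2 and Google Earth dataset, 100 images are randomly sampled as the test set and the remaining images form the database, from which 500 images are sampled as the training set. Because this is a multi-label dataset, the ground truth is defined as image-semantic pairs that share at least one common annotation. For the traditional non-deep methods, images are represented by 512-dimensional GIST feature vectors. 2) For the CIFAR-10 dataset, 1 000 images are randomly sampled as the test set and the remaining images form the database, from which 5 000 images are sampled as the training set. Likewise, two images that share a common label are regarded as a ground-truth match. For the traditional non-deep methods, images are again represented by 512-dimensional GIST feature vectors. 3) For the FLICKR-25K dataset, 1 000 images are randomly sampled as the test set and the remaining images form the database, from which 5 000 images are sampled as the training set. The ground truth is again defined as image-semantic pairs that share at least one common annotation. For the traditional non-deep methods, images are represented by 1 386-dimensional GIST feature vectors.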
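The sampling protocol above can be sketched as follows, using the sizes quoted for the GF-2 and Google Earth dataset. The function name `split_indices` and the fixed seed are hypothetical; the paper does not specify how the random sampling was implemented.

```python
import random


def split_indices(n_images: int, n_test: int, n_train: int, seed: int = 0):
    """Sample a test set, keep the rest as the database, draw the training set from it."""
    rng = random.Random(seed)
    indices = list(range(n_images))
    rng.shuffle(indices)
    test = indices[:n_test]
    database = indices[n_test:]
    train = rng.sample(database, n_train)   # training images are a subset of the database
    return test, database, train


# Sizes quoted in the text for the GF-2 and Google Earth dataset.
test, database, train = split_indices(n_images=2000, n_test=100, n_train=500)
```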

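3.2 Training settings

The experiments are trained and tested with MatConvNet^[19], using network models such as vgg_16 pretrained on ImageNet to extract image features. The mini-batch size is set to 128, the learning rate to 0.001, the decay factor to 0.002, and the number of iterations to 500; the hyperparameters $\gamma$ and $\eta$ are both set to 1.

3.3 Comparison of retrieval accuracy

The proposed method is compared with several Hashing methods, including LSH, SH, SKLSH, ITQ, DPSH, and MIHASH^[20]. Results are evaluated with mAP, which accounts not only for recall but also for the rank order of the ground-truth images in the retrieval results, and therefore reflects retrieval quality well.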
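For reference, mAP under Hamming ranking can be computed as in the sketch below. It assumes the whole database is ranked for every query (the paper does not state whether a top-k cutoff is used) and that codes are stored as integers; the toy codes and class labels are hypothetical.

```python
def average_precision(relevance: list) -> float:
    """AP for one query given 0/1 relevance of database items in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank    # precision at each relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, db_codes, is_relevant) -> float:
    """mAP under Hamming ranking; codes are integers, is_relevant(q, d) -> bool."""
    ap_values = []
    for q, q_code in enumerate(query_codes):
        order = sorted(range(len(db_codes)),
                       key=lambda d: bin(q_code ^ db_codes[d]).count("1"))
        ap_values.append(average_precision([int(is_relevant(q, d)) for d in order]))
    return sum(ap_values) / len(ap_values)


# Toy example with 4-bit codes and single class labels.
query_codes, query_labels = [0b0000, 0b1111], ["A", "B"]
db_codes, db_labels = [0b0001, 0b1110, 0b0011], ["B", "B", "A"]
ap_example = average_precision([1, 0, 1, 0])
map_value = mean_average_precision(
    query_codes, db_codes, lambda q, d: query_labels[q] == db_labels[d])
```

Because precision is accumulated at each relevant position, a relevant image ranked lower contributes less, which is how the metric takes the order of the results into account.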

Figs. 5 to 7 show the mAP comparison on the GF-2 and Google Earth dataset, the CIFAR-10 dataset, and the FLICKR-25K dataset, respectively, for Hash codes of 16, 32, and 64 bits.

Fig. 5 Accuracy in terms of mAP on GF-2 satellite and Google Earth

Fig. 6 Accuracy in terms of mAP on CIFAR-10

Fig. 7 Accuracy in terms of mAP on FLICKR-25K

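The results show that retrieval performance improves as the number of Hash bits increases. Overall, the three deep-learning-based methods outperform the non-deep methods in retrieval accuracy. As Figs. 5 to 7 show, LSH and SKLSH both map the original data with locality-sensitive Hash functions, and because locality-sensitive Hashing relies on random projections, the mapping is not controllable and the results are unstable in practice; these two methods therefore have low retrieval accuracy. Comparing Fig. 5 with Fig. 6 shows that the accuracy of the traditional non-deep methods drops sharply as the dataset grows. This is particularly true of SH, which directly applies principal component analysis to reduce the dimensionality of high-dimensional features: when the data dimensionality is high its performance degenerates to that of principal component analysis, and it assumes uniformly distributed data, a condition that remote sensing image data rarely satisfy. In this setting, deep-learning-based Hashing algorithms, with their strong feature learning capacity, show a clearer advantage.

Most Hashing methods use spectral relaxation to turn the discrete Hash learning problem into a continuous one: the sign function in the objective is dropped and an orthogonality constraint is introduced so that the bits of the Hash code are balanced and uncorrelated. In practice, however, after eigenvalue decomposition most of the variance concentrates in the first few projection dimensions, which causes substantial quantization loss in those dimensions, so retrieval accuracy degrades when the Hash codes are learned. The proposed method avoids this problem by learning discrete Hash codes directly. Across the experiments on the three datasets, the proposed method is overall more accurate than DPSH and MIHASH. On the GF-2 and Google Earth dataset with 32-bit codes, the proposed method is roughly on par with DPSH, with only a slight advantage. On CIFAR-10, the mAP of the proposed method is consistently higher than that of the other methods, indicating that the proposed algorithm can effectively perform semantic retrieval of remote sensing images and that the gain is especially pronounced with a large training set.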

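3.4 Effect of hyperparameters

Fig. 8 and Fig. 9 show the effect of the hyperparameters $\gamma$ and $\eta$ on the results for the two datasets with 16-bit and 32-bit Hash codes. While one hyperparameter is varied, the other is fixed at 1. For the GF-2 and Google Earth dataset, the method is stable with respect to the hyperparameters in the range $0.001<\gamma<2$, $0.001<\eta<2$. For CIFAR-10, retrieval accuracy drops noticeably only when $\eta$ becomes too small, i.e., below 0.1; within $0.001<\gamma<2$, $0.1<\eta<2$ the retrieval performance remains stable.

Fig. 8 The influence of hyper-parameters on GF-2 satellite and Google Earth ((a) trend of mAP values with hyper-parameter $\gamma$; (b) trend of mAP values with hyper-parameter $\eta$)

Fig. 9 The influence of hyper-parameters on CIFAR-10 ((a) trend of mAP values with hyper-parameter $\gamma$; (b) trend of mAP values with hyper-parameter $\eta$)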

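4 Conclusion

The core of image retrieval is an accurate understanding of images, and describing images at the semantic level is an active topic in image processing research. This paper proposes a deep-learning-based remote sensing image retrieval method that uses deep convolutional networks to extract image features and semantic features separately, so as to describe remote sensing images accurately. Experimental results show that the method significantly improves image retrieval performance, particularly for large-scale data. Moreover, as an end-to-end framework that takes the images themselves as the input of the deep convolutional network, the model is well suited to practical applications. However, the method targets images with one or more labels, whereas most remote sensing images in practice are unlabeled; this remains a difficulty to be addressed in future work.

References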

[1] Yang X L, Yao J L, Wang X H, et al. Image copy detection method based on contextual descriptor[J]. Journal of Image and Graphics, 2017, 22(8): 1098–1105. [杨醒龙, 姚金良, 王小华, 等. 构建近邻上下文的拷贝图像检索[J]. 中国图象图形学报, 2017, 22(8): 1098–1105. ] [DOI:10.11834/jig.160562]

[2] Yu J Q, Wu Z B, Wu F, et al. Multimedia technology 2016:advances and trends in image retrieval[J]. Journal of Image and Graphics, 2017, 22(11): 1467–1485. [于俊清, 吴泽斌, 吴飞, 等. 多媒体工程:2016-图像检索研究进展与发展趋势[J]. 中国图象图形学报, 2017, 22(11): 1467–1485. ] [DOI:10.11834/jig.170503]

[3] Chen F, Lyu S H, Li J, et al. Multi-label image retrieval by Hashing with object proposal[J]. Journal of Image and Graphics, 2017, 22(2): 232–240. [陈飞, 吕绍和, 李军, 等. 目标提取与哈希机制的多标签图像检索[J]. 中国图象图形学报, 2017, 22(2): 232–240. ] [DOI:10.11834/jig.20170211]

[4] Cao Y, Long M S, Wang J M, et al. Deep visual-semantic quantization for efficient image retrieval[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017.[DOI: 10.1109/CVPR.2017.104]

[5] Bahmanyar R, de Oca A M M, Datcu M, et al. The semantic gap:an exploration of user and computer perspectives in earth observation images[J]. IEEE Geoscience and Remote Sensing Letters, 2015, 12(10): 2046–2050. [DOI:10.1109/LGRS.2015.2444666]

[6] Gong Y C, Lazebnik S. Iterative quantization: a procrustean approach to learning binary codes[C]//Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA: IEEE, 2011: 817-824.[DOI: 10.1109/CVPR.2011.5995432]

[7] Andoni A, Indyk P. Near-optimal Hashing algorithms for approximate nearest neighbor in high dimensions[C]//47th Annual IEEE Symposium on Foundations of Computer Science. Berkeley, CA, USA: IEEE, 2006: 117-129.[DOI: 10.1109/FOCS.2006.49]

[8] Weiss Y, Fergus R, Torralba A. Multidimensional spectral Hashing[C]//Proceedings of the 12th European Conference on Computer Vision-ECCV 2012. Florence, Italy: Springer, 2012: 340-353.[DOI: 10.1007/978-3-642-33715-4_25]

[9] Raginsky M, Lazebnik S. Locality-sensitive binary codes from shift-invariant kernels[C]//Advances in Neural Information Processing Systems 22-Proceedings of the 2009 Conference. Vancouver, BC, Canada: Neural Information Processing Systems, 2009: 1509-1517.

[10] Gong Y C, Lazebnik S, Gordo A, et al. Iterative quantization:a procrustean approach to learning binary codes for large-scale image retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2916–2929. [DOI:10.1109/TPAMI.2012.193]

[11] Yang H F, Lin K, Chen C S. Supervised learning of semantics-preserving Hash via deep convolutional neural networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(2): 437–451. [DOI:10.1109/TPAMI.2017.2666812]

[12] Xia R K, Pan Y, Lai H J, et al. Supervised Hashing for image retrieval via image representation learning[C]//Proceedings of the 28th AAAI Conference on Artificial Intelligence. Quebec City, Canada: AAAI, 2014.

[13] Lai H J, Pan Y, Liu Y, et al. Simultaneous feature learning and Hash coding with deep neural networks[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA: IEEE, 2015: 3270-3278.[DOI: 10.1109/CVPR.2015.7298947]

[14] Li W J, Wang S, Kang W C. Feature learning based deep supervised Hashing with pairwise labels[C]//Proceedings of the 25th International Joint Conference on Artificial Intelligence. New York: AAAI, 2016: 1711-1717.

[15] Zhao F, Huang Y Z, Wang L, et al. Deep semantic ranking based Hashing for multi-label image retrieval[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 1556-1564.[DOI: 10.1109/CVPR.2015.7298763]

[16] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2018-05-23]. https://arxiv.org/pdf/1409.1556.pdf.

[17] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 2012: 1097-1105.

[18] Nguyen V A, Do M N. Deep learning based supervised Hashing for efficient image retrieval[C]//Proceedings of 2016 IEEE International Conference on Multimedia and Expo. Seattle, WA, USA: IEEE, 2016.[DOI: 10.1109/ICME.2016.7552927]

[19] Vedaldi A, Lenc K. MatConvNet: convolutional neural networks for MATLAB[C]//Proceedings of the 23rd ACM International Conference on Multimedia. Brisbane, Australia: ACM, 2015: 689-692.[DOI: 10.1145/2733373.2807412]

[20] Cakir F, He K, Bargal S A, et al. MIHash: online Hashing with mutual information[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 437-445.[DOI: 10.1109/ICCV.2017.55]