
Published: 2023-02-16
DOI: 10.11834/jig.210842
2023 | Volume 28 | Number 2




Image retrieval combining Transformer with an asymmetric learning strategy
He Chao, Wei Hongxi
1. School of Computer Science, Inner Mongolia University, Hohhot 010010, China;
2. Provincial Key Laboratory of Mongolian Information Processing Technology, Hohhot 010010, China;
3. National and Local Joint Engineering Research Center of Mongolian Information Processing Technology, Hohhot 010010, China

Abstract

Objective Image retrieval is a fundamental task in computer vision. Most existing methods rely on convolutional neural networks and a symmetric learning strategy, which requires a large amount of training data, prolongs model training, and under-uses the supervised information. To address these problems, this paper proposes an image retrieval method that combines a Transformer with an asymmetric learning strategy. Method For query images, a Transformer generates the hash representation, and a hash loss is used to learn the hash function so that the representation is closer to the true hash code. For the images to be retrieved, an asymmetric learning strategy yields their hash representations directly, and the hash loss is combined with a classification loss to make full use of the supervised information and speed up training. Similar images are then retrieved quickly by computing Hamming distances in the hash space. Result On the CIFAR-10 and NUS-WIDE datasets, the proposed method is compared with five mainstream symmetric methods and the two best-performing asymmetric methods; its mAP (mean average precision) exceeds the best existing method by 5.06% and 4.17%, respectively. Conclusion The proposed method extracts image features with a Transformer and combines the hash loss with a classification loss, reducing model training time without increasing the amount of training data. It outperforms comparable methods and completes the image retrieval task effectively.

Keywords

image retrieval; Transformer; hash function; asymmetric learning; hash loss; classification loss

Image retrieval based on transformer and asymmetric learning strategy
He Chao, Wei Hongxi
1. School of Computer Science, Inner Mongolia University, Hohhot 010010, China;
2. Provincial Key Laboratory of Mongolian Information Processing Technology, Hohhot 010010, China;
3. National and Local Joint Engineering Research Center of Mongolian Information Processing Technology, Hohhot 010010, China
Supported by: Natural Science Foundation of Inner Mongolia Autonomous Region, China (2019ZD14); Project for Science and Technology of Inner Mongolia Autonomous Region, China (2019GG281); Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (NJYT-20-A05)

Abstract

Objective Image retrieval is one of the fundamental tasks in computer vision. Most deep-learning-based retrieval methods follow a symmetric learning strategy: training images are organized into pairs and fed into a convolutional neural network (CNN) for feature extraction, and a similarity loss is used to learn the hash codes, which yields reasonable performance. To push performance further, recent work widens or deepens the CNN, but such models are complicated and time-consuming to train on large-scale image datasets. Recently, the Transformer has been introduced into computer vision and has markedly improved image classification; it also scales well to very large datasets such as ImageNet-21k and JFT-300M, which motivates applying it to large-scale image retrieval. In symmetric methods, the whole database has to take part in the training phase and the query images must be paired for training, which is time-consuming. Moreover, the hash function between training and query images is learned from similarity computation alone, so the supervised information is exploited only through the similarity matrix and is therefore used insufficiently. An asymmetric learning scheme trains on only a subset of the images, learns the hash function from the hash loss, and derives the representations of the remaining images directly. In addition, a classification constraint can be imposed on the sampled query images, and the corresponding classification loss can be optimized by alternating learning. To alleviate the training cost and the insufficient use of supervised information, we propose a deep supervised hashing method for image retrieval that integrates a Transformer with an asymmetric learning strategy. Method For the training (query) images, a Transformer generates their hash representations, and a hash loss drives these representations toward the true hash codes. The original Transformer accepts one-dimensional input, so each image is first divided into patches, each patch is mapped to a one-dimensional vector, and the vectors of all patches of an image are concatenated to form the input sequence. The designed Transformer block consists of two normalization layers, a multi-head attention module, a fully connected module, and a hash layer. The input sequence first passes through a normalization layer, and its output is fed into a multi-head attention layer with 16 heads so that multiple local features of the image can be captured. A residual connection merges the input with the output of the attention layer, which helps preserve the global features of the image. The representation vector of each image is then obtained through the fully connected module and the hash layer, and this block is stacked 24 times.
For the remaining (database) images, the classification loss serves as a constraint, and their hash representations are learned in an asymmetric way, which improves training efficiency and makes effective use of the supervised information; these images do not need to take part in network training. The model is trained by alternating optimization. Specifically, the hash representations of the query images and the classifier weights are first initialized randomly, and the parameters of the model are optimized by stochastic gradient descent. After each training epoch, the classifier weights are re-estimated from the trained model, and the hash representations of the remaining images improve gradually. In this way, the hash codes of the database images are obtained directly from the trained model, which raises training efficiency. Finally, similar images are retrieved quickly by computing Hamming distances in the hash space. Result The proposed method is compared with five symmetric methods and two asymmetric methods on two large-scale image retrieval datasets, CIFAR-10 and NUS-WIDE, using mean average precision (mAP) as the evaluation metric. Its mAP exceeds the best existing method by 5.06% and 4.17% on the two datasets, respectively. Ablation experiments validate that the classification loss pushes the learned codes closer to the true hash representations, and experiments on the hyper-parameters of the classification loss identify appropriate settings. Conclusion The Transformer is effective for feature extraction in large-scale image retrieval, and combining the hash loss with the classification loss further benefits model training, so the proposed method completes the image retrieval task effectively.

Key words

image retrieval; Transformer; hash function; asymmetric learning; hash loss; classification loss

0 Introduction

In the era of big data, the number of images on the Internet and other media grows exponentially, and how to quickly find the desired images among massive collections is a problem worth studying. Approximate nearest neighbor (ANN) search (Andoni and Indyk, 2008; Zhang et al., 2010) is a common way to address it. As a widely used ANN technique, hashing (Wu et al., 2019; Guo et al., 2017; Wan et al., 2021) maps data points from the original space to discrete binary hash codes in Hamming space while preserving the similarities of the original space. Hash codes occupy little storage and are fast to compare, so they are widely used for large-scale image retrieval.
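To make the speed advantage concrete, the following minimal NumPy sketch (illustrative only, not code from the paper) computes Hamming distances between ±1 codes through a single matrix product, using the identity that for ±1 vectors of length $c$ the Hamming distance equals $(c-\langle\boldsymbol{q},\boldsymbol{r}\rangle)/2$; all function and variable names are hypothetical.

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distance between one query code and all database codes.

    query_code: (c,) array with entries in {-1, +1}
    db_codes:   (n, c) array with entries in {-1, +1}
    """
    c = query_code.shape[0]
    # For ±1 codes, Hamming distance = (c - inner product) / 2.
    return (c - db_codes @ query_code) / 2

# toy example: 8-bit codes for 4 database images
db = np.sign(np.random.randn(4, 8))
q = db[2].copy()                                  # the query equals database item 2
print(hamming_distances(q, db))                   # item 2 has distance 0
print(np.argsort(hamming_distances(q, db)))       # ranking used for retrieval
```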

Hashing methods fall into data-independent and data-dependent methods (Liu et al., 2020). The former project images randomly in the feature space and binarize the projections to generate hash codes (Gionis et al., 1999). The latter learn a hash function with machine learning and map image features to binary codes (Lai et al., 2015). With data-dependent methods, hash codes with fewer bits can represent image features well and achieve better results, so they are widely used in practice. Data-dependent hashing is further divided into supervised and unsupervised methods. The most representative unsupervised method is iterative quantization (ITQ) (Gong et al., 2013), which optimizes the projection matrix by iterative projection and thresholding on the given training samples. To better exploit semantic labels for representation learning, supervised methods have been proposed, which can be divided into non-deep and deep supervised hashing. Traditional non-deep supervised hashing learns the hash function from image color and texture; representative methods include supervised hashing with kernels (KSH) (Liu et al., 2012), latent factor hashing (LFH) (Zhang et al., 2014) and fast supervised hashing (FastH) (Lin et al., 2014). With the development of deep learning, hash function learning has been embedded into deep learning frameworks.

Following the excellent performance of convolutional neural networks (CNN) in image classification, object detection and text recognition, many CNN-based deep supervised hashing methods have been proposed. Convolutional neural networks based hashing (CNNH) (Xia et al., 2014) is one of the early works that learn hash codes with deep neural networks. Building on CNNH, network in network hashing (NINH) (Lai et al., 2015) proposed a deep architecture that preserves relative similarity with a triplet ranking loss. Later, several deep hashing methods based on pairwise labels were applied to related tasks: deep pairwise supervised hashing (DPSH) (Li et al., 2016) learns feature representations and the hash function simultaneously from pairwise labels, and deep hashing network (DHN) (Zhu et al., 2016) jointly optimizes a semantic similarity loss and a quantization loss for more compact codes. To make better use of label information, deep supervised discrete hashing (DSDH) (Li et al., 2017) combines the semantic similarity loss with a classification loss. Symmetric deep supervised hashing currently achieves relatively good results, and larger CNNs can improve performance further, but large models take long to train, and involving the database images in network training increases the training time as well. To address these problems, Jiang and Li (2018) proposed asymmetric deep supervised hashing (ADSH), the first CNN-based algorithm that learns the hash codes of training images and database images asymmetrically: the codes of the database images are computed directly from those of the training images, which greatly improves training efficiency. Joint learning based deep supervised hashing (JLDSH) (Gu et al., 2020) builds on ADSH and combines the classification loss with the hash loss to exploit the supervised information more fully.

Researchers have also tried to apply the Transformer (Vaswani et al., 2017) to computer vision. The Transformer performs remarkably well in natural language processing and is widely used in machine translation and speech recognition; the mainstream practice is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset. Its excellent computational efficiency and scalability make it possible to train very large models, and these properties have made Transformer-based vision research popular. The vision transformer (ViT) (Dosovitskiy et al., 2021) excels at image classification: it splits an image into patches, adds positional information and a classification token, and classifies the image with a Transformer encoder; with massive training data it reaches very high accuracy and surpasses existing classification models. The image processing transformer (IPT) (Chen et al., 2021a) can be applied to image processing tasks such as denoising and super-resolution, and when trained on large-scale datasets it achieves state-of-the-art performance on these tasks. The detection transformer (DETR) (Carion et al., 2020), an end-to-end Transformer-based object detection network, makes predictions with bipartite matching and a Transformer encoder-decoder, which greatly simplifies the detection pipeline.

Inspired by ViT and ADSH, this paper proposes an asymmetric deep supervised hashing method based on the Transformer (ADSHT). The main contributions are as follows: 1) A Transformer is used to generate the hash representations of images and, combined with an asymmetric learning strategy, is applied to large-scale image retrieval, improving both training efficiency and retrieval performance. 2) To exploit the supervised information more fully, the hash loss is combined with a classification loss, so the model learns the hash function better and the hash representations become closer to the true codes. 3) Experiments on two public datasets, the single-label CIFAR-10 (Krizhevsky, 2009) and the multi-label NUS-WIDE (Chua et al., 2009), compare the proposed method with mainstream symmetric methods and the best asymmetric methods and verify its effectiveness.

1 Problem definition

The $m$ query images are denoted as $\boldsymbol{X}=\left\{\boldsymbol{x}_i\right\}_{i=1}^m$ and the $n$ database images as $\boldsymbol{Y}=\left\{\boldsymbol{y}_j\right\}_{j=1}^n$. The pairwise supervised similarity information is $\boldsymbol{S} \in \{-1, +1\}^{m \times n}$: $S_{i j}=1$ means that $\boldsymbol{x}_{i}$ and $\boldsymbol{y}_{j}$ are similar, whereas $S_{ij}=-1$ means they are dissimilar; two images are similar if they share at least one label. Deep supervised hashing uses this supervised information to learn a hash function from the query and database images. $\boldsymbol{R}=\left\{\boldsymbol{r}_j\right\}_{j=1}^n \in\{-1, +1\}^{n \times c}$ denotes the hash codes of the database images and $\boldsymbol{Q}=\left\{\boldsymbol{q}_i\right\}_{i=1}^m \in\{-1, +1\}^{m \times c}$ those of the query images, where $c$ is the code length. The proposed method learns the hash function by sampling the query images from the whole database, so $\boldsymbol{X} \subseteq \boldsymbol{Y}$.
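The similarity matrix $\boldsymbol{S}$ follows directly from the image labels. The sketch below is an illustrative helper (not from the paper) that builds $\boldsymbol{S}$ from one-hot or multi-hot label matrices, counting two images as similar when they share at least one label.

```python
import numpy as np

def build_similarity(query_labels, db_labels):
    """S[i, j] = +1 if query i and database image j share at least one label, else -1.

    query_labels: (m, h) one-hot or multi-hot label matrix
    db_labels:    (n, h) one-hot or multi-hot label matrix
    """
    shared = (query_labels @ db_labels.T) > 0        # (m, n) boolean
    return np.where(shared, 1, -1).astype(np.int8)

# single-label toy example with 3 classes
q = np.eye(3)[[0, 2]]        # two query images, classes 0 and 2
d = np.eye(3)[[0, 1, 2, 0]]  # four database images
print(build_similarity(q, d))
# [[ 1 -1 -1  1]
#  [-1 -1  1 -1]]
```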

2 Proposed method

2.1 Model architecture

The proposed ADSHT model consists of a patch embedding part, a feature extraction part (Dosovitskiy et al., 2021) and a loss function part (Gu et al., 2020); the architecture is shown in Fig. 1. The feature extraction module extracts image features and converts them into hash-code representations. The loss function part drives the image features toward the true hash codes and preserves the similarity between query images and database images. The original Transformer can only process one-dimensional data, so to handle two-dimensional images, an image $\boldsymbol{x} \in \mathbf{R}^{H \times W \times C}$ is reshaped into a sequence of flattened two-dimensional patches $\boldsymbol{x}_p \in \mathbf{R}^{N \times\left(P^2 \times C\right)}$, where $H$, $W$ and $C$ are the height, width and number of channels of the original image, $(P, P)$ is the height and width of each patch, the subscript $p$ indexes the patches, and $N=H W / P^2$ is the length of the Transformer input sequence.

Fig. 1 Architecture of ADSHT

A linear projection $\boldsymbol{E} \in \mathbf{R}^{\left(P^2 \times C\right) \times D}$ maps each patch vector into a $D$-dimensional space. The projected patch vectors are then concatenated into a complete image sequence, and a classification token $\boldsymbol{x}_{\text {class }}$ is prepended at the first position. Finally, the positional embedding $\boldsymbol{E}_{\mathrm{pos}} \in \mathbf{R}^{(N+1) \times D}$ of the patches is added to the patch vectors $\boldsymbol{x}_p$, yielding the embedded image sequence $\boldsymbol{s}_0 \in \mathbf{R}^{(N+1) \times D}$. This process, called patch embedding, is illustrated in Fig. 2 (a code sketch follows the figure).

Fig. 2 Patch embedding
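One possible PyTorch rendering of the patch embedding described above is sketched below. It follows the common ViT practice of implementing the linear projection $\boldsymbol{E}$ as a $P \times P$ convolution with stride $P$; the hyper-parameter values (224 × 224 input, 16 × 16 patches, $D$ = 768) and the module name are illustrative assumptions, not settings reported in the paper.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, project each to D dimensions,
    prepend a class token and add learned position embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2           # N = HW / P^2
        # A P x P convolution with stride P is equivalent to the linear projection E.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # (B, N+1, D)
        return x + self.pos_embed               # embedded sequence s_0
```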

The feature extraction part of the proposed method uses the Transformer encoder, followed by a hash module. The encoder is pre-trained on the ImageNet-1k dataset and consists of alternating layers of multi-head self-attention (MSA) (Vaswani et al., 2017) and fully connected (MLP) modules, namely

$ \begin{gathered} \boldsymbol{s}_q^*={MSA}\left(L N\left(\boldsymbol{s}_{(q-1)}\right)\right)+\boldsymbol{s}_{(q-1)} \\ q=1, 2, \cdots, B \end{gathered} $ (1)

$ \begin{gathered} \boldsymbol{s}_q={MLP}\left(L N\left(\boldsymbol{s}_q^*\right)\right)+\boldsymbol{s}_q^* \\ q=1, 2, \cdots, B \end{gathered} $ (2)

where $B$ is the number of encoder layers. A layer normalization (LN) (Wang et al., 2019) is applied before each module and a residual connection after each module. Finally, the hash module $HASH$ outputs the hash representation $\boldsymbol{h} \in \mathbf{R}^c$ of the image, namely

$ \boldsymbol{h}=H A S H\left(L N\left(\boldsymbol{s}_B^0\right)\right) $ (3)

The classification token $\boldsymbol{s}_B^0$ of the vector output by the last fully connected module is taken as the extracted image feature and fed into the hash module to obtain the hash representation. The fully connected module contains two nonlinear layers with GELU (Gaussian error linear units) activation, and the hash module contains one nonlinear layer with tanh (hyperbolic tangent) activation.
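The sketch below is one way Eqs. (1)-(3) could be written in recent PyTorch: a pre-norm encoder block with 16-head attention and a GELU MLP, plus a hash head that applies layer normalization and a tanh layer to the classification token $\boldsymbol{s}_B^0$. The class names, dimensions and MLP layout (the standard ViT two-layer MLP) are assumptions for illustration; in the paper the block is stacked $B$ = 24 times.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One layer of Eqs. (1)-(2): pre-norm multi-head attention and MLP with residuals."""

    def __init__(self, dim=768, heads=16, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, s):                                    # s: (B, N+1, D)
        h = self.ln1(s)
        s = self.attn(h, h, h, need_weights=False)[0] + s    # Eq. (1)
        s = self.mlp(self.ln2(s)) + s                        # Eq. (2)
        return s

class HashHead(nn.Module):
    """Eq. (3): map the class-token feature to a c-bit relaxed hash code in (-1, 1)."""

    def __init__(self, dim=768, code_len=48):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, code_len)

    def forward(self, s):                                    # s: output of the last block
        return torch.tanh(self.fc(self.ln(s[:, 0])))         # use the class token s_B^0
```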

Traditional symmetric learning only uses the supervised information to build the similarity matrix and pays little attention to its role in classifying images. We add a classification constraint on the database images so that the supervised information is fully used and the database images obtain more realistic hash representations. To minimize the loss function, the parameters of the classification loss are computed by differentiating the loss while keeping both the database codes and the query codes fixed. When training ends, the model outputs the hash codes of the query images and the parameters of the classification loss; with these two quantities fixed, the hash codes of the database images are computed.

2.2 Loss function

To learn hash codes that preserve the similarity between query and database images, the loss between the supervised similarity matrix and the inner products of the query-database code pairs can be optimized. The loss function (Jiang and Li, 2018) is defined as

$ \begin{gathered} \min\limits_{\boldsymbol{Q}, \boldsymbol{R}} J(\boldsymbol{Q}, \boldsymbol{R})=\sum\limits_{i=1}^m \sum\limits_{j=1}^n\left(\boldsymbol{q}_i^{\mathrm{T}} \boldsymbol{r}_j-c S_{i j}\right)^2 \\ \text { s.t. } \boldsymbol{Q} \in\{-1, +1\}^{m \times c}, \boldsymbol{R} \in\{-1, +1\}^{n \times c} \end{gathered} $ (4)

where $\boldsymbol{q}_i=h\left(\boldsymbol{x}_i\right)$ is hard to learn because its distribution is discrete, so we set $h\left(\boldsymbol{x}_i\right)=\operatorname{sign}\left(F\left(\boldsymbol{x}_i; \boldsymbol{\varTheta}\right)\right)$ with $F\left(\boldsymbol{x}_i; \boldsymbol{\varTheta}\right) \in \mathbf{R}^c$, where $F\left(\boldsymbol{x}_i; \boldsymbol{\varTheta}\right)$ is the output of the Transformer encoder network and $\boldsymbol{\varTheta}$ denotes its parameters. Since $h\left(\boldsymbol{x}_i\right)$ is still discrete and hard to optimize by back-propagation, we replace $h\left(\boldsymbol{x}_i\right)$ with $\tanh \left(F\left(\boldsymbol{x}_i; \boldsymbol{\varTheta}\right)\right)$ to solve this problem. Let $\boldsymbol{\xi}=\{1, 2, \cdots, n\}$ denote the indices of all database images and $\boldsymbol{\varOmega}=\{1, 2, \cdots, m\} \subseteq \boldsymbol{\xi}$ the indices of the query images. The loss function (Jiang and Li, 2018) becomes

$ \begin{aligned} & \min\limits_{\boldsymbol{\varTheta}, \boldsymbol{R}} J(\boldsymbol{\varTheta}, \boldsymbol{R})=\sum\limits_{i \in \boldsymbol{\varOmega}} \sum\limits_{j \in \boldsymbol{\xi}}\left(\tanh \left(F\left(\boldsymbol{x}_i ; \boldsymbol{\varTheta}\right)\right)^{\mathrm{T}} \boldsymbol{r}_j-c S_{i j}\right)^2 \\ & \text { s. t. } \boldsymbol{Q} \in\{-1, +1\}^{m \times c}, \boldsymbol{R} \in\{-1, +1\}^{n \times c} \end{aligned} $ (5)

Since $\boldsymbol{X} \subseteq \boldsymbol{Y}$, each $\boldsymbol{x}_i$ has two representations: its binary code $\boldsymbol{r}_i$ as a database image and the relaxed code $\tanh \left(F\left(\boldsymbol{x}_i; \boldsymbol{\varTheta}\right)\right)$ as a query image. Another constraint is added to reduce the gap between them. The new loss function (Jiang and Li, 2018) is defined as

$ \begin{gathered} \min\limits_{\boldsymbol{\varTheta}, \boldsymbol{R}} J(\boldsymbol{\varTheta}, \boldsymbol{R})= \\ \sum\limits_{i \in \boldsymbol{\varOmega}} \sum\limits_{j \in \boldsymbol{\xi}}\left[\tanh \left(F\left(\boldsymbol{x}_i ; \boldsymbol{\varTheta}\right)\right)^{\mathrm{T}} \boldsymbol{r}_j-c S_{i j}\right]^2+ \\ \gamma \sum\limits_{i \in \boldsymbol{\varOmega}}\left[\boldsymbol{r}_i-\tanh \left(F\left(\boldsymbol{y}_i ; \boldsymbol{\varTheta}\right)\right)\right]^2 \\ \text { s. t. } \boldsymbol{Q} \in\{-1, +1\}^{m \times c}, \boldsymbol{R} \in\{-1, +1\}^{n \times c} \end{gathered} $ (6)

where $\gamma$ is a hyper-parameter. To make better use of the supervised information, a simple linear classifier models the relation between the hash codes of the database images and their labels, so the label information constrains the learning of the database codes and pushes them closer to the true hash codes. Let $\boldsymbol{L}=\left\{\boldsymbol{i}_1, \boldsymbol{i}_2, \boldsymbol{i}_3, \cdots, \boldsymbol{i}_n\right\}$ denote the one-hot encodings of the ground-truth labels, where $\boldsymbol{i}_i \in\{0, 1\}^h, i=1, \cdots, n$ is the label vector of each image and $h$ is the number of classes; an element of 1 means the image carries that label and an element of 0 means it does not. With $\boldsymbol{W}$ denoting the classifier weights, the classification loss and the hash loss are combined. The final loss function (Gu et al., 2020) is

$ \begin{gathered} \min\limits_{\boldsymbol{\varTheta}, \boldsymbol{R}, \boldsymbol{W}} J(\boldsymbol{\varTheta}, \boldsymbol{R}, \boldsymbol{W})= \\ \sum\limits_{i \in \boldsymbol{\varOmega}} \sum\limits_{j \in \boldsymbol{\xi}}\left(\tanh \left(F\left(\boldsymbol{x}_i ; \boldsymbol{\varTheta}\right)\right)^{\mathrm{T}} \boldsymbol{r}_j-c S_{i j}\right)^2+ \\ \gamma \sum\limits_{i \in \boldsymbol{\varOmega}}\left[\boldsymbol{r}_i-\tanh \left(F\left(\boldsymbol{x}_i ; \boldsymbol{\varTheta}\right)\right)\right]^2+ \\ \mu \sum\limits_{i \in \boldsymbol{\varOmega}}\left\|\boldsymbol{i}_i-\boldsymbol{r}_i \boldsymbol{W}\right\|_{\mathrm{F}}^2+\varphi\|\boldsymbol{W}\|_{\mathrm{F}}^2 \\ \text { s.t. } \boldsymbol{Q} \in\{-1, +1\}^{m \times c}, \boldsymbol{R} \in\{-1, +1\}^{n \times c} \end{gathered} $ (7)

式中,$μ$$φ$为超参数。

2.3 Optimization

2.3.1 Optimizing the parameters $\boldsymbol{\varTheta}$

$\boldsymbol{R}$$\boldsymbol{W}$不改变时,令$\boldsymbol{z}_i=F\left(\boldsymbol{x}_i; \boldsymbol{\varTheta}\right), \boldsymbol{u}_i=\tanh \left(F\left(\boldsymbol{x}_i; \boldsymbol{\varTheta}\right)\right), \boldsymbol{z}_i$的梯度计算为

$ \begin{gathered} \frac{\partial J}{\partial \boldsymbol{z}_i}=\left[2 \sum\limits_{j \in \boldsymbol{\xi}}\left(\boldsymbol{u}_i^{\mathrm{T}} \boldsymbol{r}_j-c S_{i j}\right) \boldsymbol{r}_j+\right. \\ \left.2 \gamma\left(\boldsymbol{u}_i-\boldsymbol{r}_i\right)\right] \odot\left(\mathbf{1}-\boldsymbol{u}_i^2\right) \end{gathered} $ (8)

where ⊙ denotes the Hadamard product (Jiang and Li, 2018). Then $\frac{\partial J}{\partial \boldsymbol{\varTheta}}$ is computed from $\frac{\partial J}{\partial \boldsymbol{z}_i}$ by the chain rule, and the network parameters $\boldsymbol{\varTheta}$ are updated by back-propagation (BP).
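For reference, the gradient of Eq. (8) can be written out explicitly as in the NumPy sketch below (illustrative names, not the authors' code); in practice an automatic-differentiation framework produces the same gradient when backward propagation is run on the first two terms of Eq. (7).

```python
import numpy as np

def grad_z(u, r_i, R_all, S_i, c, gamma):
    """Gradient of Eq. (8) for one query image i, with u = tanh(z_i).

    u     (c,)   network output after tanh for image i
    r_i   (c,)   current ±1 database code of image i
    R_all (n, c) ±1 codes of all database images
    S_i   (n,)   similarity of image i to every database image
    """
    inner = R_all @ u                                          # u^T r_j for every j
    g = 2 * ((inner - c * S_i)[:, None] * R_all).sum(axis=0)   # similarity term
    g += 2 * gamma * (u - r_i)                                 # asymmetric constraint term
    return g * (1 - u ** 2)                                    # Hadamard product with (1 - u_i^2)
```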

2.3.2 Optimizing the parameters $\boldsymbol{W}$

$\boldsymbol{\varTheta}$$\boldsymbol{R}$不改变时,式(7)变为

$ \min\limits_{\boldsymbol{W}} J(\boldsymbol{W})=\mu \sum\limits_{i \in \boldsymbol{\varOmega}}\left\|\boldsymbol{i}_i-\boldsymbol{r}_i \boldsymbol{W}\right\|_{\mathrm{F}}^2+\boldsymbol{\varphi}\|\boldsymbol{W}\|_{\mathrm{F}}^2 $ (9)

This is a minimization problem with a closed-form solution, which gives $\boldsymbol{W}$ as

$ \boldsymbol{W}=\left(\boldsymbol{R}^{\mathrm{T}} \boldsymbol{R}+\frac{\varphi}{\mu} \boldsymbol{I}\right)^{-1} \boldsymbol{R}^{\mathrm{T}} \boldsymbol{L} $ (10)

where $\boldsymbol{L}$ is the matrix of one-hot encodings of the ground-truth image labels.
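Eq. (10) is a ridge-regression-style closed form. The sketch below (an illustrative helper; whether the rows cover only the sampled query images, as in Eq. (9), or the whole database is an assumption left to the caller) solves it with a linear solve instead of an explicit matrix inverse.

```python
import numpy as np

def solve_W(R_mat, L_mat, mu, phi):
    """Closed-form solution of Eq. (10) for the classifier weights W.

    R_mat (k, c) ±1 hash codes stacked row-wise
    L_mat (k, h) the corresponding one-hot label vectors
    """
    c = R_mat.shape[1]
    A = R_mat.T @ R_mat + (phi / mu) * np.eye(c)   # c x c system matrix
    return np.linalg.solve(A, R_mat.T @ L_mat)     # W of shape (c, h)
```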

2.3.3 Optimizing the parameters $\boldsymbol{R}$

$\boldsymbol{\varTheta}$$\boldsymbol{W}$不改变时,式(7)变为

$ \begin{gathered} \min\limits_{\boldsymbol{R}} J(\boldsymbol{R})=\left\|\boldsymbol{U} \boldsymbol{R}^{\mathrm{T}}-c \boldsymbol{S}\right\|_{\mathrm{F}}^2+\gamma\left\|\boldsymbol{R}^{\boldsymbol{\varOmega}}-\boldsymbol{U}\right\|_{\mathrm{F}}^2+ \\ \mu\|\boldsymbol{L}-\boldsymbol{R} \boldsymbol{W}\|_{\mathrm{F}}^2 \end{gathered} $ (11)

where $\boldsymbol{U}=\left\{\boldsymbol{u}_i\right\}_{i=1}^m \in[-1, +1]^{m \times c}$ stacks the network outputs of the query images and $\boldsymbol{R}^{\boldsymbol{\varOmega}}$ denotes the rows of $\boldsymbol{R}$ indexed by $\boldsymbol{\varOmega}$. $\boldsymbol{R}$ is then updated bit by bit, that is, when one column of $\boldsymbol{R}$ is updated, the other columns stay fixed. Let $\boldsymbol{U}_{* j}$ denote the $j$-th column of $\boldsymbol{U}$ and $\hat{\boldsymbol{U}}_j$ the matrix of $\boldsymbol{U}$ excluding $\boldsymbol{U}_{*j}$; let $\boldsymbol{R}_{*j}$ denote the $j$-th column of $\boldsymbol{R}$ and $\hat{\boldsymbol{R}}_j$ the matrix of $\boldsymbol{R}$ excluding $\boldsymbol{R}_{*j}$; let $\boldsymbol{O}_{*j}$ denote the $j$-th column of $\boldsymbol{O}$ and $\hat{\boldsymbol{O}}_j$ the matrix of $\boldsymbol{O}$ excluding $\boldsymbol{O}_{*j}$; and let $\boldsymbol{W}_{j*}$ denote the $j$-th row of $\boldsymbol{W}$ and $\hat{\boldsymbol{W}}_j$ the matrix of $\boldsymbol{W}$ excluding $\boldsymbol{W}_{j*}$. Here $\boldsymbol{O}=-2 c \boldsymbol{S}^{\mathrm{T}} \boldsymbol{U}-2 \mu \boldsymbol{L} \boldsymbol{W}^{\mathrm{T}}-2 \gamma \overline{\boldsymbol{U}}$, where $\overline{\boldsymbol{U}}=\left\{\overline{\boldsymbol{u}}_j\right\}_{j=1}^n \in \mathbf{R}^{n \times c}$ with $\overline{\boldsymbol{u}}_j=\boldsymbol{u}_j$ when $j \in \boldsymbol{\varOmega}$ and $\overline{\boldsymbol{u}}_j=\boldsymbol{0}$ otherwise. It can then be derived that

$ \boldsymbol{R}_{* j}=-\operatorname{sign}\left(2 \hat{\boldsymbol{R}}_j \hat{\boldsymbol{U}}_j^{\mathrm{T}} \boldsymbol{U}_{* j}+2 \mu \hat{\boldsymbol{R}}_j \hat{\boldsymbol{W}}_j \boldsymbol{W}_{j *}^{\mathrm{T}}+\boldsymbol{O}_{* j}\right) $ (12)

Eq. (12) is used to update $\boldsymbol{R}$ column by column, and finally the hash codes of the whole database are obtained.
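The bit-by-bit update of Eq. (12) might be implemented as below; this is an illustrative sketch rather than the authors' released code. $\boldsymbol{W}$ is indexed by rows (one row per bit), U holds the tanh outputs of the sampled query images, and U_bar is U zero-padded to database size, as defined above.

```python
import numpy as np

def update_R(R, U, U_bar, S, L, W, c, gamma, mu):
    """Bit-by-bit update of the database codes R according to Eq. (12).

    R     (n, c)  current ±1 database codes, updated column by column
    U     (m, c)  tanh outputs of the network for the sampled query images
    U_bar (n, c)  U zero-padded to database size (rows of non-sampled images are 0)
    S     (m, n)  ±1 similarity between the sampled queries and the database
    L     (n, h)  one-hot labels of the database images
    W     (c, h)  classifier weights; row k is associated with bit k
    """
    O = -2 * c * S.T @ U - 2 * mu * L @ W.T - 2 * gamma * U_bar    # (n, c)
    for k in range(c):
        rest = [j for j in range(c) if j != k]                     # all bits except k
        term = (2 * R[:, rest] @ (U[:, rest].T @ U[:, k])
                + 2 * mu * R[:, rest] @ (W[rest] @ W[k])
                + O[:, k])
        R[:, k] = -np.sign(term)                                   # Eq. (12) for bit k
    return R
```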

2.4 Learning algorithm

The steps of the ADSHT learning algorithm are as follows (a compact code sketch is given after the listing):

Input: $n$ data points $\boldsymbol{Y}=\left\{\boldsymbol{y}_j\right\}_{j=1}^n$, the supervised similarity matrix $\boldsymbol{S} \in\{-1, +1\}^{n \times n}$, the classification-loss parameters $\boldsymbol{W} \in \mathbf{R}^{c \times h}$, and the hash code length $c$.

Output: the hash codes $\boldsymbol{R}$ of the database images and the network parameters $\varTheta$.

1) Initialize $\varTheta$, $\boldsymbol{R}$, $\boldsymbol{W}$, the batch size $M$, and the numbers of iterations $T_{\rm{out}}$ and $T_{\rm{in}}$.

2) FOR $i→T_{\rm{out}}$ DO;

(1) Randomly sample the query images $\boldsymbol{X}=\boldsymbol{Y}^{\varOmega}$ from the database images and obtain their similarity matrix $\boldsymbol{S}^{\varOmega}$.

(2) FOR $j→T_{\rm{in}}$ DO;

FOR $s=1, 2, …, m/M$ DO;

① Randomly select $M$ images from $\boldsymbol{X}$ to build a mini-batch.

② Compute $\boldsymbol{z}_{i}$ and $\boldsymbol{u}_{i}$ of each image in the mini-batch by forward propagation.

③ Compute the hash loss of the mini-batch and the gradient according to Eq. (8).

④ Update the network parameters $\varTheta$ by back-propagation, and update $\boldsymbol{W}$ according to Eq. (10).

3) FOR $k=1, 2, …, c$ DO;

Update $\boldsymbol{R}_{*k}$ according to Eq. (12).
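Putting the pieces together, the following self-contained toy run illustrates the alternating optimization above. It is only a sketch under stated assumptions: a small MLP stands in for the Transformer encoder, random vectors stand in for images, all hyper-parameter values are illustrative, the build_similarity, solve_W and update_R helpers are the sketches given earlier, and the database codes are refreshed in every outer iteration (as in ADSH).

```python
import numpy as np
import torch
import torch.nn as nn

n, m, c, h, dim = 200, 40, 16, 5, 32
gamma, mu, phi, M, T_out, T_in = 200.0, 1.0, 1.0, 20, 3, 3

Y = torch.randn(n, dim)                                    # stand-in "database images"
labels = np.eye(h)[np.random.randint(0, h, n)]             # one-hot labels, (n, h)
model = nn.Sequential(nn.Linear(dim, 64), nn.GELU(), nn.Linear(64, c), nn.Tanh())
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

R = np.sign(np.random.randn(n, c))                         # step 1: initialize R and W
W = np.zeros((c, h))

for _ in range(T_out):                                     # step 2: outer iterations
    omega = np.random.choice(n, m, replace=False)          # (1) sample the query images
    S = build_similarity(labels[omega], labels)            # (m, n) similarity to the database
    for _ in range(T_in):                                  # (2) inner iterations
        for s in range(0, m, M):                           # mini-batches of size M
            idx = omega[s:s + M]
            u = model(Y[idx])                              # tanh outputs u_i, (M, c)
            r_all = torch.as_tensor(R, dtype=torch.float32)
            r_own = torch.as_tensor(R[idx], dtype=torch.float32)
            S_blk = torch.as_tensor(S[s:s + M], dtype=torch.float32)
            loss = (((u @ r_all.T - c * S_blk) ** 2).sum()
                    + gamma * ((r_own - u) ** 2).sum())    # first two terms of Eq. (7)
            opt.zero_grad(); loss.backward(); opt.step()   # gradient of Eq. (8) via autograd
        W = solve_W(R[omega], labels[omega], mu, phi)      # Eq. (10), restricted to Omega here
    U = model(Y[omega]).detach().numpy()                   # tanh outputs of all sampled queries
    U_bar = np.zeros((n, c)); U_bar[omega] = U
    R = update_R(R, U, U_bar, S, labels, W, c, gamma, mu)  # step 3: Eq. (12), bit by bit
```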

3 Experiments

3.1 Datasets

CIFAR-10 is a single-label dataset of 60,000 images in 10 classes (6,000 images per class), of which 50,000 are training images and 10,000 are test images; each image is 32 × 32 pixels. NUS-WIDE is a multi-label dataset commonly used for large-scale image retrieval, containing 269,648 images in 81 classes.

3.2 Evaluation metric

Mean average precision (mAP) is used as the evaluation metric. mAP is the most important measure of retrieval performance in image retrieval; it is the mean of the average precision (AP) over a set of queries, where AP is defined as

$ A P=\frac{1}{R} \sum\limits_{c=1}^N \frac{R_c}{c} \times rel_c $ (13)

where $N$ is the size of the dataset, $R$ is the number of relevant images in the dataset, $c$ is the number of returned images considered after retrieval, $R_{c}$ is the number of relevant images among the top $c$ returned, and $rel_{c}$ is 1 if the image at rank $c$ is relevant and 0 otherwise (Zheng et al., 2020).
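Eq. (13) can be checked with a few lines of NumPy; the helper below is illustrative (not the evaluation code used in the paper) and computes AP from the relevance of the ranked returns, with mAP as the mean over queries.

```python
import numpy as np

def average_precision(ranked_relevance, R_total):
    """AP of one query, Eq. (13): ranked_relevance[k] is 1 if the image returned
    at rank k+1 is relevant, else 0; R_total is the number of relevant images."""
    ranked_relevance = np.asarray(ranked_relevance, dtype=float)
    cum_rel = np.cumsum(ranked_relevance)               # R_c: relevant images in the top c
    ranks = np.arange(1, len(ranked_relevance) + 1)     # c = 1, 2, ..., N
    return (cum_rel / ranks * ranked_relevance).sum() / R_total

def mean_average_precision(aps):
    return float(np.mean(aps))                          # mAP: mean of per-query AP values

# toy example: relevant images returned at ranks 1, 3 and 4 (3 relevant in total)
print(average_precision([1, 0, 1, 1, 0], R_total=3))    # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806
```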

3.3 Implementation details

Most deep supervised hashing algorithms are symmetric and only a few are asymmetric; the existing asymmetric methods are all implemented with CNNs. The experiments verify the performance of the proposed method by comparing it with the current best symmetric and asymmetric deep supervised hashing algorithms.

On CIFAR-10, the proposed method is compared with Transhash (Chen et al., 2021b), ADSH (Jiang and Li, 2018), DBDH (deep balanced discrete hashing) (Zheng et al., 2020), JLDSH (Gu et al., 2020), DFH (deep Fisher hashing) (Li et al., 2019), DSDH (Li et al., 2017) and DSHSD (deep supervised hashing based on stable distribution) (Wu et al., 2019). Following Jiang and Li (2018), 1,000 images (100 per class) are randomly selected as the test set, the remaining 59,000 images form the retrieval set, and 5,000 images (500 per class) randomly drawn from the retrieval set form the training set.

On NUS-WIDE, the comparison methods are Transhash (Chen et al., 2021b), ADSH (Jiang and Li, 2018), DBDH (Zheng et al., 2020), DFH (Li et al., 2019), DSDH (Li et al., 2017) and DSHSD (Wu et al., 2019). Similar to Li et al. (2017), 195,969 images belonging to the 21 most common classes, each containing at least 5,000 images, are selected. 2,100 images (100 per class) are randomly chosen as the test set, the remaining images form the retrieval set, and 10,500 images (500 per class) randomly drawn from the retrieval set form the training set. Note that the mAP results on NUS-WIDE are computed over the top 5,000 returned samples.

The experimental parameters follow Jiang and Li (2018). To avoid over-fitting, the weight decay is set to $10^{-5}$. The batch size is 128, the learning rate is tuned within $[10^{-6}, 10^{-2}]$, the hyper-parameter $\gamma$ is 200, $T_{\rm{out}}$ is 60, and $T_{\rm{in}}$ is 3. The weight of the −1 elements in $\boldsymbol{S}$ is set to the ratio of the number of +1 elements to the number of −1 elements.

3.4 Experimental results

The feature extraction modules of ADSHT-R and ADSHT are a ResNet50 network and a Transformer encoder network, respectively. Experiments are run on a Tesla P40 GPU; both models are pre-trained on ImageNet-1k, trained for 60 epochs with a batch size of 32 using stochastic gradient descent, with the hyper-parameter $\gamma$ set to 200. Table 1 compares the training efficiency of the two models with 32-bit hash codes. Under the same number of training epochs, ADSHT has roughly 13 times as many parameters as ADSHT-R but needs less training time, indicating that ADSHT trains more efficiently.

Table 1 Efficiency comparison of models under different networks

Model   | Parameters/M | Training time/h (CIFAR-10) | Training time/h (NUS-WIDE)
ADSHT-R | 23.6         | 11.1                       | 23.8
ADSHT   | 305.5        | 2.6                        | 7.5

The proposed method combines the classification loss and the hash loss; the classification loss corresponds to the last two terms of Eq. (7), with hyper-parameters $\mu$ and $\varphi$. If $\mu$ is 0, minimizing over the classification weights reduces to

$ \min\limits_{\boldsymbol{W}} J(\boldsymbol{W})=\boldsymbol{\varphi}\|\boldsymbol{W}\|_{\mathrm{F}}^2 $ (14)

which drives the classification weights to 0 (or leaves them arbitrary). If $\varphi$ is 0, the minimization becomes

$ \min\limits_{\boldsymbol{W}} J(\boldsymbol{W})=\mu \sum\limits_{i \in \boldsymbol{\varOmega}}\left\|\boldsymbol{i}_i-\boldsymbol{r}_i \boldsymbol{W}\right\|_{\mathrm{F}}^2 $ (15)

in which case the hyper-parameter $\mu$ does not affect the classification weights because it cancels out in the solution. The two terms therefore have to coexist. Let $p=\varphi / \mu$; experiments with different values of $p$ (0, 0.05, 0.1, 0.2, 1, 5, 10, 20) give the results shown in Fig. 3, and $\varphi$ is fixed to 1.

Fig. 3 Hyper-parameter values on the two datasets
((a) CIFAR-10 dataset; (b) NUS-WIDE dataset)

As Fig. 3 shows, the model performs best on CIFAR-10 at $p$ = 0.2 (i.e., $\mu=5$), and on NUS-WIDE at $p$ = 5 (i.e., $\mu=0.2$).

Table 2 reports the retrieval performance of the different methods. ADSHT is the model that uses only the hash loss, and ADSHT-F is the model that also incorporates the classification loss. Comparing three code lengths (24, 32 and 48 bits) with the other methods, ADSHT-F achieves higher mAP than all of them on both CIFAR-10 and NUS-WIDE. In particular, on CIFAR-10, ADSHT-F improves on JLDSH, the strongest baseline in the table, by as much as 5.06% at 24 bits; on NUS-WIDE, it improves on ADSH by 4.17% at 24 bits. ADSHT-F also improves over ADSHT, which shows that the supervised information is used more fully and the classification loss makes a real contribution. Table 2 further shows that performance rises gradually as the number of hash bits increases, because longer codes represent image features better. The first five rows, DSDH (Li et al., 2017), DFH (Li et al., 2019), DSHSD (Wu et al., 2019), DBDH (Zheng et al., 2020) and Transhash (Chen et al., 2021b), are symmetric methods, while rows 6 to 10, JLDSH (Gu et al., 2020), ADSH (Jiang and Li, 2018), ADSHT-R, ADSHT and ADSHT-F, are asymmetric methods. Except for ADSHT-R, whose performance is low because it has not fully converged, the asymmetric methods clearly outperform the symmetric ones. Comparing ADSHT with ADSHT-R, ADSHT performs far better; the experiments show that with 60 training epochs ADSHT-R has not fully converged, hence its low performance, which also indicates that ADSHT converges faster than ADSHT-R.

Table 2 Comparison of mAP of different methods

Method                         | CIFAR-10 24 bit | CIFAR-10 32 bit | CIFAR-10 48 bit | NUS-WIDE 24 bit | NUS-WIDE 32 bit | NUS-WIDE 48 bit
DSDH (Li et al., 2017)         | 0.7860 | 0.8010 | 0.8200 | 0.8080 | 0.8200 | 0.8290
DFH (Li et al., 2019)          | 0.8250 | 0.8310 | 0.8440 | 0.8230 | 0.8330 | 0.8420
DSHSD (Wu et al., 2019)        | 0.8130 | 0.8210 | 0.8270 | 0.8260 | 0.8310 | 0.8350
DBDH (Zheng et al., 2020)      | 0.7356 | 0.7427 | 0.7483 | -      | -      | -
Transhash (Chen et al., 2021b) | -      | 0.9108 | 0.9141 | -      | 0.7393 | 0.7532
JLDSH (Gu et al., 2020)        | 0.9350 | 0.9375 | 0.9444 | -      | -      | -
ADSH (Jiang and Li, 2018)      | 0.9280 | 0.9310 | 0.9390 | 0.8784 | 0.8951 | 0.9055
ADSHT-R (ours)                 | 0.7451 | 0.6855 | 0.7111 | 0.8307 | 0.8416 | 0.8314
ADSHT (ours)                   | 0.9837 | 0.9852 | 0.9866 | 0.9147 | 0.9218 | 0.9261
ADSHT-F (ours)                 | 0.9856 | 0.9864 | 0.9886 | 0.9201 | 0.9263 | 0.9341
Note: bold font in the original table marks the best result in each column; "-" means the result is not available.

4 Conclusion

This paper proposes an image retrieval method that combines a Transformer with an asymmetric learning strategy. On the one hand, the hash loss is used to learn the hash function and a Transformer generates the hash representations of the query images, making these representations closer to the true codes. On the other hand, with the asymmetric learning strategy, the hash representations of the database images are computed directly from the query representations and the classification loss. Combining the hash loss with the classification loss in this process makes full use of the supervised information and improves training efficiency. Comparative experiments with the current best symmetric and asymmetric deep supervised hashing algorithms show that the proposed method achieves the best performance on both CIFAR-10 and NUS-WIDE, which verifies its effectiveness. However, the Transformer model used here does not attend to local image information and has a large number of parameters; addressing these issues is the focus of future work.

References

  • Andoni A, Indyk P. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1): 117-122 [DOI:10.1145/1327452.1327494]
  • Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A and Zagoruyko S. 2020. End-to-end object detection with transformers//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 213-229 [DOI: 10.1007/978-3-030-58452-8_13]
  • Chen H T, Wang Y H, Guo T Y, Xu C, Deng Y P, Liu Z H, Ma S W, Xu C J, Xu C and Gao W. 2021a. Pre-trained image processing transformer//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 12294-12305 [DOI: 10.1109/CVPR46437.2021.01212]
  • Chen Y B, Zhang S, Liu F X, Chang Z G, Ye M and Qi Z W. 2021b. Transhash: transformer-based hamming hashing for efficient image retrieval [EB/OL]. [2021-05-05]. https://arxiv.org/pdf/2105.01823.pdf
  • Chua T S, Tang J H, Hong R C, Li H J, Luo Z P and Zheng Y T. 2009. NUS-WIDE: a real-world web image database from national university of Singapore//Proceedings of 2009 ACM International Conference on Image and Video Retrieval. Santorini, Fira Greece: Association for Computing Machinery: #48 [DOI: 10.1145/1646396.1646452]
  • Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. [2021-06-03]. https://arxiv.org/pdf/2010.11929.pdf
  • Gionis A, Indyk P and Motwani R. 1999. Similarity search in high dimensions via hashing//Proceedings of the 25th International Conference on Very Large Data Bases. San Francisco, USA: Morgan Kaufmann Publishers Inc. : 518-529
  • Gong Y C, Lazebnik S, Gordo A, Perronnin F. 2013. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12): 2916-2929 [DOI:10.1109/TPAMI.2012.193]
  • Gu G H, Liu J T, Li Z Y, Huo W H, Zhao Y. 2020. Joint learning based deep supervised hashing for large-scale image retrieval. Neurocomputing, 385: 348-357 [DOI:10.1016/j.neucom.2019.12.096]
  • Guo Y C, Ding G G, Liu L, Han J G, Shao L. 2017. Learning to hash with optimized anchor embedding for scalable retrieval. IEEE Transactions on Image Processing, 26(3): 1344-1354 [DOI:10.1109/TIP.2017.2652730]
  • Jiang Q Y and Li W J. 2018. Asymmetric deep supervised hashing//Proceedings of the 32nd AAAI Conference on Artificial Intelligence and the 13th Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence. New Orleans, USA: AAAI: 3342-3349
  • Krizhevsky A. 2009. Learning Multiple Layers of Features from Tiny Images. Toronto, Canada: University of Toronto
  • Lai H J, Pan Y, Liu Y and Yan S C. 2015. Simultaneous feature learning and hash coding with deep neural networks//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3270-3278 [DOI: 10.1109/CVPR.2015.7298947]
  • Li Q, Sun Z, He R and Tan T N. 2017. Deep supervised discrete hashing//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc. : 2479-2488
  • Li W J, Wang S and Kang W C. 2016. Feature learning based deep supervised hashing with pairwise labels//Proceedings of the 25th International Joint Conference on Artificial Intelligence. New York, USA: AAAI Press: 1711-1717
  • Li Y Q, Pei W J, Zha Y F and van Gemert J. 2019. Push for quantization: deep fisher hashing [EB/OL]. [2021-08-31]. https://arxiv.org/pdf/1909.00206.pdf
  • Lin G S, Shen C H, Shi Q F, Van den Hengel A and Suter D. 2014. Fast supervised hashing with decision trees for high-dimensional data//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1971-1978 [DOI: 10.1109/CVPR.2014.253]
  • Liu W, Wang J, Ji R R, Jiang Y G and Chang S F. 2012. Supervised hashing with kernels//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 2074-2081 [DOI: 10.1109/CVPR.2012.6247912]
  • Liu Y, Cheng M, Wang F P, Li D X, Liu W, Fan J L. 2020. Deep hashing image retrieval methods. Journal of Image and Graphics, 25(7): 1296-1317 [DOI:10.11834/jig.190518]
  • Wan F, Qiang H P, Lei G B. 2021. Self-supervised deep discrete hashing for image retrieval. Journal of Image and Graphics, 26(11): 2659-2669 [DOI:10.11834/jig.200212]
  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc. : 5000-6010
  • Wang Q, Li B, Xiao T, Zhu J B, Li C L, Wong D F and Chao L S. 2019. Learning deep transformer models for machine translation//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics: 1810-1822 [DOI: 10.18653/v1/P19-1176]
  • Wu L, Ling H F, Li P, Chen J Z, Fang Y, Zhou F H. 2019. Deep supervised hashing based on stable distribution. IEEE Access, 7: 36489-36499 [DOI:10.1109/ACCESS.2019.2900489]
  • Xia R K, Pan Y, Lai H J, Liu C and Yan S C. 2014. Supervised hashing for image retrieval via image representation learning//Proceedings of the 28th AAAI Conference on Artificial Intelligence. Québec City, Canada: AAAI Press: 2156-2162
  • Zhang D, Wang J, Cai D and Lu J S. 2010. Self-taught hashing for fast similarity search//Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. Geneva, Switzerland: Association for Computing Machinery: 18-25 [DOI: 10.1145/1835449.1835455]
  • Zhang P C, Wei Z, Li W J and Guo M Y. 2014. Supervised hashing with latent factor models//Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval. Gold Coast, Australia: Association for Computing Machinery: 173-182 [DOI: 10.1145/2600428.2609600]
  • Zheng X T, Zhang Y C, Lu X Q. 2020. Deep balanced discrete hashing for image retrieval. Neurocomputing, 403: 224-236 [DOI:10.1016/j.neucom.2020.04.037]
  • Zhu H, Long M S, Wang J M and Cao Y. 2016. Deep hashing network for efficient similarity retrieval//Proceedings of the 13th AAAI Conference on Artificial Intelligence. Phoenix, USA: AAAI Press: 2415-2421