Image retrieval combining Transformer and an asymmetric learning strategy

He Chao1,2,3, Wei Hongxi1,2,3 (1. School of Computer Science, Inner Mongolia University, Hohhot 010010, China; 2. Provincial Key Laboratory of Mongolian Information Processing Technology, Hohhot 010010, China; 3. National and Local Joint Engineering Research Center of Mongolian Information Processing Technology, Hohhot 010010, China)

Abstract
Objective Image retrieval is a fundamental task in computer vision. Most existing methods rely on convolutional neural networks and a symmetric learning strategy, which leads to a large demand for training data, long model training time, and insufficient use of the supervised information. To address these problems, this paper proposes an image retrieval method that combines a Transformer with an asymmetric learning strategy. Method For query images, a Transformer generates the hash representations, and a hash loss is used to learn the hash function so that the hash representations of images become closer to true hash values. For the images to be retrieved, an asymmetric learning strategy obtains their hash representations directly, and the hash loss is combined with a classification loss to make full use of the supervised information and speed up training. Similar images are then retrieved quickly by computing Hamming distances in the hash space. Result On the CIFAR-10 and NUS-WIDE datasets, the proposed method is compared with five mainstream symmetric methods and the two best-performing asymmetric methods; its mAP (mean average precision) exceeds that of the current best method by 5.06% and 4.17%, respectively. Conclusion The proposed method uses a Transformer to extract image features and combines the hash loss with a classification loss, reducing model training time without increasing the amount of training data. It outperforms comparable current methods and can complete the image retrieval task effectively.
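The retrieval step described in the Method above ranks database images by the Hamming distance between binary hash codes. Below is a minimal NumPy sketch of that ranking step; the code length, the ±1 encoding, and all function and variable names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Rank database items by Hamming distance to a query hash code.

    query_code: shape (K,), entries in {-1, +1}
    db_codes:   shape (N, K), same encoding
    Returns database indices, nearest first.
    """
    # For {-1, +1} codes, Hamming distance = (K - <q, b>) / 2,
    # so ranking by inner product is equivalent and fast.
    distances = 0.5 * (query_code.shape[0] - db_codes @ query_code)
    return np.argsort(distances)

# Illustrative usage with random 48-bit codes for 1000 database images.
rng = np.random.default_rng(0)
db = rng.choice([-1, 1], size=(1000, 48))
q = rng.choice([-1, 1], size=48)
top10 = hamming_rank(q, db)[:10]
```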
Keywords
Image retrieval based on Transformer and asymmetric learning strategy

He Chao1,2,3, Wei Hongxi1,2,3 (1. School of Computer Science, Inner Mongolia University, Hohhot 010010, China; 2. Provincial Key Laboratory of Mongolian Information Processing Technology, Hohhot 010010, China; 3. National and Local Joint Engineering Research Center of Mongolian Information Processing Technology, Hohhot 010010, China)

Abstract
Objective Image retrieval is one of the fundamental tasks in computer vision. Most deep learning-based image retrieval methods adopt a symmetric learning strategy: training images are organized into pairs and fed into a convolutional neural network (CNN) for feature extraction, and a similarity loss is used to learn hash codes for related images. This symmetric scheme achieves reasonable performance, and in recent years many improved CNNs have been widened or deepened to raise it further. However, CNN-based structures remain complicated and time-consuming on large-scale image datasets. Recently, the Transformer has been introduced into computer vision and has markedly improved image classification. Because the Transformer performs well on large-scale datasets such as ImageNet-21k and JFT-300M, we bring it into large-scale image retrieval. In symmetric methods, the entire image dataset has to take part in the training phase, and query images must be organized into pairs for training, which makes training time-consuming. Moreover, the hash function relating training and query images is learned only through similarity computation, so the supervised information is used solely in the form of a similarity matrix and is exploited insufficiently. An asymmetric learning scheme, by contrast, trains on only a selected subset of images, learning the hash function from a hash loss, while the hash representations of the remaining images can still be obtained. In addition, a classification constraint can be imposed on the query images, and the corresponding classification loss can be optimized by alternating learning. To address the problems of long training time and insufficient use of supervised information, we develop a deep supervised hashing method for image retrieval that integrates a Transformer with an asymmetric learning strategy.
Method For the training images, a purpose-designed Transformer generates their hash representations, and a hash loss drives these representations toward true binary hash values. Because the original Transformer takes one-dimensional input, each image is first divided into multiple patches, each patch is mapped to a one-dimensional vector, and the vectors of all patches of an image are concatenated to form the one-dimensional input. The designed Transformer consists of 1) two normalization layers, 2) a multi-head attention module, 3) a fully connected module, and 4) a hash layer. The input vector first passes through a normalization layer, whose output is fed into the multi-head attention layer with 16 heads so that multiple local features of the image can be captured. A residual connection then combines the initial one-dimensional vector with the output of the multi-head attention layer, which preserves the global features of the image. Finally, the representation vector of each image is obtained through the fully connected module and the hash layer. This process for generating representation vectors is repeated 24 times.
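The Method paragraph above can be read as a ViT-style encoder. The following PyTorch sketch shows one such block (normalization, 16-head attention with a residual connection, a fully connected module) stacked 24 times, followed by a tanh hash layer; the embedding dimension, patch size, positional embedding, mean pooling, and the placement of the hash layer after the stack are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer block: norm -> 16-head attention -> residual, norm -> MLP -> residual."""
    def __init__(self, dim=1024, heads=16, mlp_dim=4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                      # residual keeps global information
        x = x + self.mlp(self.norm2(x))
        return x

class TransformerHashNet(nn.Module):
    """Patch embedding -> 24 encoder blocks -> hash layer producing K near-binary bits."""
    def __init__(self, image_size=224, patch_size=16, dim=1024, depth=24, hash_bits=48):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Map each patch to a one-dimensional vector, then treat all patches as a sequence.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList([EncoderBlock(dim) for _ in range(depth)])
        self.hash_layer = nn.Sequential(nn.Linear(dim, hash_bits), nn.Tanh())

    def forward(self, images):                                       # images: (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)      # (B, N, dim)
        x = x + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        return self.hash_layer(x.mean(dim=1))                        # (B, hash_bits) in (-1, 1)
```

With 24 blocks and 16 heads this matches a ViT-Large-scale configuration, which is consistent with the large-scale pre-training datasets mentioned above.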
For the rest of the images, the classification loss is taken as a constraint and is used to learn their hash representations in an asymmetric way, so the supervised information is exploited effectively and training efficiency improves; not all images need to pass through the network during training. The model is trained by alternating learning (a schematic sketch follows the Conclusion below). Specifically, the hash codes that are learned directly and the classification weights are first initialized randomly, and the network parameters are then optimized by stochastic gradient descent. After each training epoch, the classification weights are updated with the trained model, and the hash representations of the remaining images improve gradually. In this manner, the hash codes of the remaining images are obtained directly from the well-trained model, which raises training efficiency. Finally, the method retrieves similar images quickly by computing Hamming distances in the hash space.
Result The proposed method is compared with five symmetric methods and two asymmetric methods on two large-scale image retrieval datasets, CIFAR-10 and NUS-WIDE. Measured by mean average precision (mAP), it improves on the best competing method by 5.06% and 4.17% on the two datasets, respectively. Ablation experiments confirm that the classification loss pushes the image codes closer to the true hash representations, and tests on the hyper-parameters of the classification loss identify appropriate settings.
Conclusion The proposed Transformer-based method is well suited to extracting image features for large-scale image retrieval, and combining the hash loss with the classification loss further aids model training, so the method completes the image retrieval task effectively.
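As referenced in the Method above, the alternating (asymmetric) training can be sketched as follows: the directly-learned hash codes and the classification weights are initialized randomly, the network is optimized by SGD within each epoch, and both are refreshed at the end of every epoch. The concrete loss expressions, the ridge-regression refresh of the classification weights, and the sign-based code update below are illustrative stand-ins, since the abstract does not specify the exact formulations.

```python
import torch
import torch.nn.functional as F

def train_asymmetric(model, loader, db_codes, cls_weight, epochs=50, lr=1e-3, lam=0.1):
    """Schematic alternating optimization for asymmetric deep hashing.

    model:      maps images to near-binary codes of length K (e.g. the Transformer sketched above)
    loader:     yields (images, labels, sim, idx); sim is (B, N) with +1/-1 entries giving the
                similarity between the batch and the N database items, idx are the batch items'
                positions in the database
    db_codes:   (N, K) directly-learned codes, randomly initialized to +/-1 by the caller
    cls_weight: (K, C) classification weights, randomly initialized by the caller
    """
    K, C = cls_weight.shape
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        epoch_codes, epoch_labels = [], []
        for images, labels, sim, idx in loader:
            codes = model(images)                                     # (B, K), values in (-1, 1)
            hash_loss = ((codes @ db_codes.T - K * sim) ** 2).mean()  # asymmetric similarity term
            cls_loss = F.cross_entropy(codes @ cls_weight, labels)    # classification constraint
            loss = hash_loss + lam * cls_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_codes.append(codes.detach())
            epoch_labels.append(labels)
            db_codes[idx] = torch.sign(codes.detach())                # refresh codes seen this epoch
        with torch.no_grad():
            # Epoch-end alternation: refresh the classification weights from the current
            # model outputs with a ridge-regression fit (illustrative placeholder).
            B = torch.cat(epoch_codes)
            Y = F.one_hot(torch.cat(epoch_labels), num_classes=C).float()
            cls_weight = torch.linalg.solve(B.T @ B + torch.eye(K), B.T @ Y)
    return model, db_codes, cls_weight
```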
Keywords
