Image retrieval based on transformer and asymmetric learning strategy
Vol. 28, Issue 2, Pages: 535-544 (2023)
Published: 16 February 2023
Accepted: 21 February 2022
DOI: 10.11834/jig.210842
Chao He, Hongxi Wei. Image retrieval based on transformer and asymmetric learning strategy [J]. Journal of Image and Graphics, 28(2): 535-544 (2023)
Objective
Image retrieval is a fundamental task in computer vision. Most existing methods rely on convolutional neural networks and a symmetric learning strategy, which demands a large amount of training data, lengthens model training, and makes insufficient use of the supervised information. To address these problems, this paper proposes an image retrieval method that combines a Transformer with an asymmetric learning strategy.
Method
For query images, a Transformer generates the hash representation, and a hash loss is used to learn the hash function so that the representation approaches the true hash values. For the images to be retrieved, an asymmetric learning strategy obtains their hash representations directly; the hash loss is combined with a classification loss to exploit the supervised information fully and to speed up training. Similar images are then retrieved rapidly by computing Hamming distances in the hash space.
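The retrieval step above ranks images by Hamming distance between binary codes. As a minimal numpy sketch of that ranking (illustrative only, not the paper's code; codes are taken as {-1, +1} vectors):

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database codes by Hamming distance to the query code.

    Codes are vectors in {-1, +1}; the Hamming distance is the number of
    differing entries, computed here by direct comparison.
    """
    dists = (db_codes != query_code).sum(axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists[order]

q = np.array([1, -1, 1, 1])
db = np.array([[1, -1, 1, 1],    # distance 0
               [1, 1, 1, -1],    # distance 2
               [-1, -1, 1, 1]])  # distance 1
order, dists = hamming_rank(q, db)
print(order.tolist(), dists.tolist())  # [0, 2, 1] [0, 1, 2]
```

In practice the codes would be packed into machine words and compared with XOR plus popcount for speed; the direct comparison above just keeps the idea visible.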
Result
On the CIFAR-10 and NUS-WIDE datasets, the proposed method is compared with five mainstream symmetric methods and the two best-performing asymmetric methods. Its mAP (mean average precision) exceeds that of the current best method by 5.06% and 4.17%, respectively.
Conclusion
The proposed method extracts image features with a Transformer and combines the hash loss with a classification loss, reducing model training time without enlarging the training set. It outperforms comparable methods and completes the image retrieval task effectively.
Objective
Image retrieval is one of the fundamental tasks in computer vision. Most deep learning-based retrieval methods adopt a symmetric learning strategy: training images are grouped into pairs and fed into a convolutional neural network (CNN) for feature extraction, and a similarity loss is used to learn hash codes, which yields acceptable performance. In recent years, CNNs have been widened or deepened to improve this performance further, but the resulting structures are complicated and time-consuming on large-scale image datasets. More recently, the Transformer has been introduced into computer vision, where it has substantially advanced image classification. Because the Transformer scales well to large datasets such as ImageNet-21k and JFT-300M, we bring it into large-scale image retrieval. Symmetric methods must involve the whole dataset in the training phase, and query images have to be paired for training, which makes training time-consuming. Moreover, the hash function is learned from similarity computations between training and query images, so the supervised information is used only through the similarity matrix and is exploited insufficiently. An asymmetric learning scheme instead trains on only a subset of the images, learning the hash function from a hash loss, while the hash codes of the remaining images are obtained directly. In addition, classification constraints can be imposed, and the corresponding classification loss can be optimized by alternating learning. To overcome the long training time and the insufficient use of supervised information, we develop a deep supervised hashing method for image retrieval that integrates a Transformer with an asymmetric learning strategy.
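The asymmetric, alternating scheme sketched above can be illustrated with a deliberately tiny toy: a linear map stands in for the network, the database codes are free binary variables, and the two are updated in turn. All sizes, the learning rate, and the exact loss form are invented for this sketch and are not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: W plays the role of the network (here just a linear map),
# B holds the free binary codes of the to-be-retrieved (database) images.
n_train, n_db, dim, bits = 8, 32, 16, 12
X = rng.standard_normal((n_train, dim))           # training image features
S = rng.integers(0, 2, (n_train, n_db)) * 2 - 1   # +1 similar / -1 dissimilar
W = 0.1 * rng.standard_normal((dim, bits))
B = np.sign(rng.standard_normal((n_db, bits)))    # database codes in {-1, +1}

for _ in range(20):
    # Step 1: with B fixed, take a gradient step on the network parameters
    # so that tanh(XW) B^T approaches bits * S (a squared hash loss).
    U = np.tanh(X @ W)
    G = ((U @ B.T - bits * S) @ B) * (1.0 - U ** 2)   # chain rule through tanh
    W -= 1e-3 * (X.T @ G) / n_train
    # Step 2: with W fixed, update the database codes directly.
    U = np.tanh(X @ W)
    B = np.sign(S.T @ U)
    B[B == 0] = 1.0
```

Only the small training subset ever passes through the "network"; the database codes are optimized as variables of their own, which is the source of the efficiency gain the asymmetric scheme claims.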
Method
For the training images, the designed Transformer generates hash representations, and a hash loss drives these representations toward the true hash values. The original Transformer takes one-dimensional data as input, so each image is first divided into multiple blocks, each block is mapped to a one-dimensional vector, and the vectors of all blocks of an image are concatenated to form the input sequence. The designed Transformer consists of 1) two normalization layers, 2) a multi-head attention module, 3) a fully connected module, and 4) a hash layer. The input vector first passes through a normalization layer, whose output is fed into the multi-head attention layer with 16 heads, so that multiple local features of the image can be captured. A residual connection then merges the initial vector with the output of the multi-head attention layer, which better preserves the global features of the image. Finally, the representation vector of each image is obtained through the fully connected module and the hash layer. In this study, the block that generates these representation vectors is stacked 24 times. For the remaining (to-be-retrieved) images, the classification loss serves as a constraint under which their hash representations are learned asymmetrically; the supervised information is thus used effectively, and training efficiency improves because these images need not pass through the network during training. The model is trained by alternating optimization: the hash codes of the to-be-retrieved images and the classification weights are initialized randomly, the network parameters are optimized by stochastic gradient descent, and after each epoch the classification weights are updated with the trained model while the hash representations of the remaining images improve gradually. In this way, the hash codes of the remaining images are obtained directly from the well-trained model, which improves training efficiency. Finally, our method retrieves similar images rapidly by computing Hamming distances in the hash space.
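The image-to-sequence step described above (split into blocks, flatten each block, concatenate) can be sketched as follows. The 16×16 patch size and 224×224 input are assumptions borrowed from standard ViT settings, not necessarily the paper's configuration.

```python
import numpy as np

def image_to_patch_vectors(img, patch=16):
    """Split an (H, W, C) image into non-overlapping patch x patch blocks
    and flatten each block into a one-dimensional vector (ViT-style input)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    x = img.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)          # group by patch position first
    return x.reshape(rows * cols, patch * patch * c)

seq = image_to_patch_vectors(np.zeros((224, 224, 3)))
print(seq.shape)  # (196, 768): 14 x 14 patches, each holding 16*16*3 values
```

Each of the 196 vectors is then linearly projected and fed to the Transformer as one token of the input sequence.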
Result
In our experiments, the proposed method is compared with five symmetric methods and two asymmetric methods on two image retrieval datasets, CIFAR-10 and NUS-WIDE. Measured by mean average precision (mAP), it outperforms the best existing method by 5.06% and 4.17% on the two datasets, respectively. Ablation experiments validate that the classification loss pushes the images closer to their true hash representations. The hyper-parameters of the classification loss are also examined, and appropriate values are identified.
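mAP, the metric used above, averages per-query average precision over all queries. A compact reference implementation (the toy relevance flags below are invented for illustration):

```python
def average_precision(relevant):
    """AP for one query: `relevant` lists 1/0 flags of the ranked results."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at each relevant hit
    return total / hits if hits else 0.0

def mean_average_precision(per_query_flags):
    return sum(average_precision(f) for f in per_query_flags) / len(per_query_flags)

# Two toy queries: mAP = (AP([1,0,1]) + AP([0,1])) / 2 = (5/6 + 1/2) / 2
print(mean_average_precision([[1, 0, 1], [0, 1]]))  # ~ 0.667
```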
Conclusion
Our Transformer-based method extracts effective image features for large-scale retrieval, and combining the hash loss with the classification loss further benefits model training. The proposed method completes the image retrieval task effectively.
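The combination of hash loss and classification loss amounts to a weighted sum. The sketch below uses a squared-error hash loss, a cross-entropy classification loss, and a weight `mu`, all of which are illustrative stand-ins rather than the paper's exact definitions.

```python
import numpy as np

def combined_loss(u, b, logits, labels, mu=0.5):
    """Weighted sum of a hash loss and a classification loss (sketch only).

    u      : relaxed network outputs, shape (n, bits)
    b      : target binary codes in {-1, +1}, shape (n, bits)
    logits : classifier outputs, shape (n, n_classes)
    labels : integer class labels, shape (n,)
    """
    hash_loss = np.mean((u - b) ** 2)                 # pull outputs to codes
    z = logits - logits.max(axis=1, keepdims=True)    # numerically stable softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -np.mean(log_p[np.arange(len(labels)), labels])
    return hash_loss + mu * ce
```

When the network outputs already match the codes and the classifier is confident and correct, both terms vanish, so the combined objective rewards exactly the behavior the method wants.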
Keywords: image retrieval; Transformer; hash function; asymmetric learning; hash loss; classification loss