Structure-preserving hashing with coupled projections for cross-modal retrieval
2021, Vol. 26, No. 7, Pages 1558-1567
Received: 2020-08-28; Revised: 2021-02-02; Accepted: 2021-02-09; Published in print: 2021-07-16
DOI: 10.11834/jig.200526

Objective
Hashing-based methods have attracted wide attention in cross-modal retrieval because of their fast retrieval speed and low storage consumption. However, since most such algorithms map data from different modalities directly into a common Hamming space, they can hardly overcome the large differences in feature representation and feature dimensionality between modalities, and they also struggle to preserve the structural information of the original data in the Hamming space. To address these problems, this paper proposes a structure-preserving hashing algorithm with coupled projections for cross-modal retrieval.
Method
To deal with the heterogeneity of cross-modal data, the algorithm first projects data of different modalities into their respective subspaces to narrow the modal "gap", and a graph model is introduced in subspace learning to preserve the structural consistency between data. To build semantic correlation between modalities, the subspace features are then mapped into the Hamming space to obtain consistent hash codes. Finally, a label constraint is introduced to improve the discriminability of the hash codes.
Result
The method is compared with mainstream methods on three datasets. On the Wikipedia dataset, compared with the second-best algorithm, the mean average precision (mAP) improves by about 6% and 3% on the image-to-text (I to T) and text-to-image (T to I) retrieval tasks, respectively; on the MIRFlickr dataset, the margins are about 2% and 5%; on the Pascal Sentence dataset, they are about 10% and 7%.
Conclusion
The proposed method is applicable to mutual retrieval between two modalities. Owing to the coupled projections and the graph model, it effectively improves the accuracy of cross-modal retrieval.
Objective
With the rapid development of multimedia technology, the scale of multimedia data has been growing rapidly. For example, people are used to describing what they want to show with multimedia data such as texts, images, and videos. Retrieving relevant results in one modality with a query from another modality is therefore a natural goal. In this sense, how to effectively perform semantic correlation analysis and measure the similarity between such data has gradually become a hot research topic. Because the representations of different modalities are heterogeneous, cross-modal retrieval is a challenging task. Hashing-based methods have received great attention in cross-modal retrieval because of their fast retrieval speed and low storage consumption. To handle the heterogeneity between modalities, most current supervised hashing algorithms directly map the data of different modalities into a common Hamming space. However, these methods have the following limitations: 1) the data from each modality have different feature representations, and the dimensions of their feature spaces vary greatly, so it is difficult to obtain consistent hash codes by directly mapping the data from different modalities into the same Hamming space; 2) although label information is exploited by these hashing methods, the structural information of the original data is ignored, which can result in hash codes that poorly encode the original structure of each modality. To solve these issues, a novel hashing algorithm called structure-preserving hashing with coupled projections (SPHCP) is proposed in this paper for cross-modal retrieval.
Method
Considering the heterogeneity of cross-modal data, the algorithm first projects the data from different modalities into their respective subspaces to reduce the modal difference. A local graph model is also designed in the subspace learning to maintain the structural consistency between samples. Then, to build a semantic relationship between the modalities, the algorithm maps the subspace features into the Hamming space to obtain consistent hash codes. At the same time, a label constraint is exploited to improve the discriminative power of the obtained hash codes. Finally, the algorithm measures the similarity of data from different modalities in terms of the Hamming distance.
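To make the pipeline concrete, the following is a minimal NumPy sketch of one plausible objective built from these four ingredients (coupled projections, graph regularization, a shared mapping to binary codes, and a label constraint). The paper's exact formulation and symbols are not reproduced here, so every name below (P_img, P_txt, W, B, the trade-off weights, and the label-based Laplacian) is an illustrative assumption.

```python
import numpy as np

def label_laplacian(y):
    """Hypothetical structure-preserving graph: connect samples that
    share a class label, then form the graph Laplacian L = D - A."""
    A = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)                   # no self-loops
    D = np.diag(A.sum(axis=1))                 # degree matrix
    return D - A

def sphcp_like_objective(X_img, X_txt, y, P_img, P_txt, W, B,
                         alpha=1.0, beta=1.0, gamma=1.0):
    """One plausible loss of this kind (a sketch, not the paper's model).

    X_img: (n, d1) image features; X_txt: (n, d2) text features
    y: (n,) integer class labels
    P_img: (d1, k), P_txt: (d2, k) coupled projections to a k-dim subspace
    W: (k, r) mapping from the subspace to r-bit codes
    B: (n, r) binary codes in {-1, +1}
    """
    Z_img, Z_txt = X_img @ P_img, X_txt @ P_txt
    L = label_laplacian(y)
    coupling = np.linalg.norm(Z_img - Z_txt) ** 2          # shrink the modal gap
    structure = (np.trace(Z_img.T @ L @ Z_img)
                 + np.trace(Z_txt.T @ L @ Z_txt))          # preserve structure
    Bf = B.astype(float)
    quantize = (np.linalg.norm(Bf - Z_img @ W) ** 2
                + np.linalg.norm(Bf - Z_txt @ W) ** 2)     # consistent codes
    Y = np.eye(int(y.max()) + 1)[y]                        # one-hot labels
    W_cls, *_ = np.linalg.lstsq(Bf, Y, rcond=None)         # label constraint
    label = np.linalg.norm(Y - Bf @ W_cls) ** 2
    return coupling + alpha * structure + beta * quantize + gamma * label
```

In the full method, the projections, the mapping, and the binary codes would be optimized alternately; this sketch only evaluates the loss for fixed variables.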
Result
We compared our model with several state-of-the-art methods on three public datasets, namely Wikipedia, MIRFlickr, and Pascal Sentence. The mean average precision (mAP) is used as the quantitative evaluation metric.
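For reference, mAP for hash-based retrieval is conventionally computed as in the sketch below: each query ranks the database by Hamming distance, precision is accumulated at every relevant position, and the mean is taken over queries. This is a standard rendering of the metric, not code from the paper; relevance is assumed to mean sharing the query's label.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP for retrieval with binary codes in {-1, +1}."""
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        hamming = np.count_nonzero(db_codes != q_code, axis=1)  # per-item distance
        ranked = np.argsort(hamming, kind="stable")             # closest first
        relevant = db_labels[ranked] == q_label                 # shared label => relevant
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(ranked) + 1)
        precision_at_hits = (hits / ranks)[relevant]            # precision at each hit
        aps.append(precision_at_hits.mean() if relevant.any() else 0.0)
    return float(np.mean(aps))
```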
We first test our method on the two benchmark datasets Wikipedia and MIRFlickr. To evaluate the impact of hash-code length on performance, the experiments set the code length to 16, 32, 64, and 128 bits. The results show that, for both the text-retrieving-image task and the image-retrieving-text task, our proposed method outperforms the existing methods at every length setting. To further measure the performance of the proposed method on a dataset with deep features, we test the algorithm on the Pascal Sentence dataset; the results show that SPHCP can also achieve higher mAP there. In general, cross-modal retrieval methods based on deep networks can handle nonlinear features well, so their retrieval accuracy is expected to be higher than that of traditional methods, but they need much more computational power. As a "shallow" method, the proposed SPHCP algorithm is competitive with deep methods in terms of mAP. An interesting direction is therefore to use our framework in conjunction with deep learning in the future, i.e., extracting image and text features with deep networks offline and using the SPHCP algorithm for fast retrieval. Furthermore, we analyze the parameter sensitivity of the proposed algorithm. As the algorithm has seven parameters, a controlled-variable (one-at-a-time) evaluation is used, as sketched below. The results show that the algorithm is not sensitive to its parameters, which means the training process does not require much tuning time, making it suitable for practical applications.
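The controlled-variable evaluation can be pictured as the following sketch: one hyper-parameter is varied over a grid while the other six stay at their defaults, and a flat mAP curve indicates insensitivity. The seven parameter names, their defaults, the grid, and the train_and_eval callable are all hypothetical placeholders, since the paper's symbols are not listed on this page.

```python
# Hypothetical one-at-a-time ("controlled variable") sensitivity study.
DEFAULTS = {"alpha": 1.0, "beta": 1.0, "gamma": 1.0,
            "lam1": 0.1, "lam2": 0.1, "mu": 0.01, "eta": 0.01}
GRID = [1e-3, 1e-2, 1e-1, 1.0, 10.0]

def sensitivity_curves(train_and_eval):
    """`train_and_eval` stands in for training SPHCP with the given
    parameters and returning validation mAP."""
    curves = {}
    for name in DEFAULTS:
        scores = []
        for value in GRID:
            params = dict(DEFAULTS, **{name: value})  # vary one, fix the rest
            scores.append(train_and_eval(**params))
        curves[name] = scores                         # flat curve => insensitive
    return curves
```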
Conclusion
In this study, a novel method called SPHCP is proposed to solve the problems mentioned above. First, aiming at the "modal gap" between cross-modal data, a scheme of coupled projections is applied to gradually reduce the modal difference of multimedia data; in this way, more consistent hash codes can be obtained. Second, considering the structural information and semantic discrimination of the original data, the algorithm introduces a graph model in subspace learning, which maintains the intra-class and inter-class relationships of the samples. Finally, a label constraint is introduced to improve the discriminability of the hash codes. Experiments on the benchmark datasets verify the effectiveness of the proposed algorithm. Specifically, compared with the second-best method, SPHCP achieves improvements of about 6% and 3% on Wikipedia for the two retrieval tasks, about 2% and 5% on MIRFlickr, and approximately 10% and 7% on Pascal Sentence. However, the proposed method requires a large amount of computing power when dealing with large-scale data, because the graph model it introduces to preserve structural information entails computing pairwise relations between samples, which leads to a high computational complexity. In future research, we will introduce nonlinear feature mapping into the SPHCP framework to improve its scalability on nonlinear feature data, and we plan to extend SPHCP from a cross-modal retrieval algorithm to a multi-modal version.