Structure-preserving hashing with coupled projections for cross-modal retrieval
2021, Vol. 26, No. 7, Pages 1558-1567
Received: 2020-08-28; Revised: 2021-02-02; Accepted: 2021-02-09; Published in print: 2021-07-16
DOI: 10.11834/jig.200526

Objective
Hashing-based methods have attracted wide attention in cross-modal retrieval because of their fast retrieval speed and low storage consumption. However, since most such algorithms map data from different modalities directly into a common Hamming space, they can hardly overcome the large differences in feature representation and feature dimensionality between modalities, and they also struggle to preserve the structural information of the original data in the Hamming space. To address these problems, this paper proposes a structure-preserving hashing algorithm with coupled projections for cross-modal retrieval.
Method
To deal with the heterogeneity of cross-modal data, the algorithm first projects data of different modalities into their respective subspaces to narrow the modal "gap", and a graph model is introduced in subspace learning to preserve the structural consistency between data. To build semantic correlation between modalities, the subspace features are then mapped into the Hamming space to obtain consistent hash codes. Finally, a label constraint is introduced to improve the discriminability of the hash codes.
Result
The method is compared with mainstream methods on three datasets. On the Wikipedia dataset, compared with the second-best algorithm, the mean average precision (mAP) improves by about 6% and 3% on the image-to-text (I to T) and text-to-image (T to I) retrieval tasks, respectively; on the MIRFlickr dataset, the margins are about 2% and 5%; on the Pascal Sentence dataset, they are about 10% and 7%.
Conclusion
The proposed method is applicable to mutual retrieval between two modalities. Owing to the coupled projections and the graph model, it effectively improves the accuracy of cross-modal retrieval.
Objective
With the rapid development of multimedia technology, the scale of multimedia data has been growing rapidly. For example, people are used to describing what they want to show with multimedia data such as texts, images, and videos. Retrieving relevant results in one modality with a query from another modality is therefore a natural goal. In this sense, how to effectively perform semantic correlation analysis and measure the similarity between such data has gradually become a hot research topic. Because the representations of different modalities are heterogeneous, cross-modal retrieval is a challenging task. Hashing-based methods have received great attention in cross-modal retrieval because of their fast retrieval speed and low storage consumption. To handle the heterogeneity between modalities, most current supervised hashing algorithms directly map the data of different modalities into a common Hamming space. However, these methods have the following limitations: 1) the data from each modality have different feature representations, and the dimensions of their feature spaces vary greatly, so it is difficult to obtain consistent hash codes by directly mapping the data from different modalities into the same Hamming space; 2) although label information is exploited by these hashing methods, the structural information of the original data is ignored, which can result in hash codes that poorly encode the original structure of each modality. To solve these issues, a novel hashing algorithm called structure-preserving hashing with coupled projections (SPHCP) is proposed in this paper for cross-modal retrieval.
Method
Considering the heterogeneity of cross-modal data, the algorithm first projects the data from different modalities into their respective subspaces to reduce the modal difference. A local graph model is also designed in the subspace learning to maintain the structural consistency between samples. Then, to build a semantic relationship between the modalities, the algorithm maps the subspace features into the Hamming space to obtain consistent hash codes. At the same time, a label constraint is exploited to improve the discriminative power of the obtained hash codes. Finally, the algorithm measures the similarity of data from different modalities in terms of the Hamming distance.
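To make the pipeline concrete, the following is a minimal NumPy sketch of one plausible objective built from these four ingredients (coupled projections, graph regularization, a shared mapping to binary codes, and a label constraint). The paper's exact formulation and symbols are not reproduced here, so every name below (P_img, P_txt, W, B, the trade-off weights, and the label-based Laplacian) is an illustrative assumption.

```python
import numpy as np

def label_laplacian(y):
    """Hypothetical structure-preserving graph: connect samples that
    share a class label, then form the graph Laplacian L = D - A."""
    A = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)                   # no self-loops
    D = np.diag(A.sum(axis=1))                 # degree matrix
    return D - A

def sphcp_like_objective(X_img, X_txt, y, P_img, P_txt, W, B,
                         alpha=1.0, beta=1.0, gamma=1.0):
    """One plausible loss of this kind (a sketch, not the paper's model).

    X_img: (n, d1) image features; X_txt: (n, d2) text features
    y: (n,) integer class labels
    P_img: (d1, k), P_txt: (d2, k) coupled projections to a k-dim subspace
    W: (k, r) mapping from the subspace to r-bit codes
    B: (n, r) binary codes in {-1, +1}
    """
    Z_img, Z_txt = X_img @ P_img, X_txt @ P_txt
    L = label_laplacian(y)
    coupling = np.linalg.norm(Z_img - Z_txt) ** 2          # shrink the modal gap
    structure = (np.trace(Z_img.T @ L @ Z_img)
                 + np.trace(Z_txt.T @ L @ Z_txt))          # preserve structure
    Bf = B.astype(float)
    quantize = (np.linalg.norm(Bf - Z_img @ W) ** 2
                + np.linalg.norm(Bf - Z_txt @ W) ** 2)     # consistent codes
    Y = np.eye(int(y.max()) + 1)[y]                        # one-hot labels
    W_cls, *_ = np.linalg.lstsq(Bf, Y, rcond=None)         # label constraint
    label = np.linalg.norm(Y - Bf @ W_cls) ** 2
    return coupling + alpha * structure + beta * quantize + gamma * label
```

In the full method, the projections, the mapping, and the binary codes would be optimized alternately; this sketch only evaluates the loss for fixed variables.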
Result
We compared our model with several state-of-the-art methods on three public datasets, namely Wikipedia, MIRFlickr, and Pascal Sentence. The mean average precision (mAP) is used as the quantitative evaluation metric.
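For reference, mAP for hash-based retrieval is conventionally computed as in the sketch below: each query ranks the database by Hamming distance, precision is accumulated at every relevant position, and the mean is taken over queries. This is a standard rendering of the metric, not code from the paper; relevance is assumed to mean sharing the query's label.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP for retrieval with binary codes in {-1, +1}."""
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        hamming = np.count_nonzero(db_codes != q_code, axis=1)  # per-item distance
        ranked = np.argsort(hamming, kind="stable")             # closest first
        relevant = db_labels[ranked] == q_label                 # shared label => relevant
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(ranked) + 1)
        precision_at_hits = (hits / ranks)[relevant]            # precision at each hit
        aps.append(precision_at_hits.mean() if relevant.any() else 0.0)
    return float(np.mean(aps))
```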
We first test our method on the two benchmark datasets Wikipedia and MIRFlickr. To evaluate the impact of hash-code length on performance, the experiments set the code length to 16, 32, 64, and 128 bits. The results show that, for both the text-retrieving-image task and the image-retrieving-text task, our proposed method outperforms the existing methods at every length setting. To further measure the performance of the proposed method on a dataset with deep features, we test the algorithm on the Pascal Sentence dataset; the results show that SPHCP can also achieve higher mAP there. In general, cross-modal retrieval methods based on deep networks can handle nonlinear features well, so their retrieval accuracy is expected to be higher than that of traditional methods, but they need much more computational power. As a "shallow" method, the proposed SPHCP algorithm is competitive with deep methods in terms of mAP. An interesting direction is therefore to use our framework in conjunction with deep learning in the future, i.e., extracting image and text features with deep networks offline and using the SPHCP algorithm for fast retrieval. Furthermore, we analyze the parameter sensitivity of the proposed algorithm. As the algorithm has seven parameters, a controlled-variable (one-at-a-time) evaluation is used, as sketched below. The results show that the algorithm is not sensitive to its parameters, which means the training process does not require much tuning time, making it suitable for practical applications.
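The controlled-variable evaluation can be pictured as the following sketch: one hyper-parameter is varied over a grid while the other six stay at their defaults, and a flat mAP curve indicates insensitivity. The seven parameter names, their defaults, the grid, and the train_and_eval callable are all hypothetical placeholders, since the paper's symbols are not listed on this page.

```python
# Hypothetical one-at-a-time ("controlled variable") sensitivity study.
DEFAULTS = {"alpha": 1.0, "beta": 1.0, "gamma": 1.0,
            "lam1": 0.1, "lam2": 0.1, "mu": 0.01, "eta": 0.01}
GRID = [1e-3, 1e-2, 1e-1, 1.0, 10.0]

def sensitivity_curves(train_and_eval):
    """`train_and_eval` stands in for training SPHCP with the given
    parameters and returning validation mAP."""
    curves = {}
    for name in DEFAULTS:
        scores = []
        for value in GRID:
            params = dict(DEFAULTS, **{name: value})  # vary one, fix the rest
            scores.append(train_and_eval(**params))
        curves[name] = scores                         # flat curve => insensitive
    return curves
```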
Conclusion
In this study, a novel method called SPHCP is proposed to solve the problems mentioned above. First, aiming at the "modal gap" between cross-modal data, a scheme of coupled projections is applied to gradually reduce the modal difference of multimedia data; in this way, more consistent hash codes can be obtained. Second, considering the structural information and semantic discrimination of the original data, the algorithm introduces a graph model in subspace learning, which maintains the intra-class and inter-class relationships of the samples. Finally, a label constraint is introduced to improve the discriminability of the hash codes. Experiments on the benchmark datasets verify the effectiveness of the proposed algorithm. Specifically, compared with the second-best method, SPHCP achieves improvements of about 6% and 3% on Wikipedia for the two retrieval tasks, about 2% and 5% on MIRFlickr, and approximately 10% and 7% on Pascal Sentence. However, the proposed method requires a large amount of computing power when dealing with large-scale data, because the graph model it introduces to preserve structural information entails computing pairwise relations between samples, which leads to a high computational complexity. In future research, we will introduce nonlinear feature mapping into the SPHCP framework to improve its scalability on nonlinear feature data, and we plan to extend SPHCP from a cross-modal retrieval algorithm to a multi-modal version.