Cross-media retrieval with hierarchical recurrent attention network
2018, Vol. 23, No. 11, Pages: 1751-1758
Received: 2018-04-16; Revised: 2018-07-01; Published in print: 2018-11-16
DOI: 10.11834/jig.180259
Objective
Cross-media retrieval aims to take data of any media type as a query and retrieve relevant data of other media types, enabling semantic interoperability and cross retrieval between images, text, and other media. However, the "heterogeneity gap" makes the feature representations of different media types inconsistent, which hinders semantic correlation and poses a major challenge for cross-media retrieval. At the same time, data of different media types that describe the same semantics are semantically consistent, and the data themselves contain rich fine-grained information, which provides important clues for cross-media correlation learning. Existing methods only consider pairwise correlations between data of different media types while ignoring the context information among the fine-grained patches within the data, and thus cannot fully exploit the cross-media correlation. To address this problem, a cross-media retrieval method based on a hierarchical recurrent attention network is proposed.
Method
First, an intra-media/inter-media two-level recurrent neural network is proposed, in which the bottom-level networks separately model the fine-grained context information within each media type, and the top-level network exploits the contextual correlation between different media types through parameter sharing. Then, an attention-based cross-media joint loss function is proposed, which learns inter-media co-attention to capture a more precise fine-grained cross-media correlation and uses semantic category information to enhance semantic discriminability during correlation learning, thereby improving cross-media retrieval accuracy.
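A minimal sketch of the two-level recurrent structure described above, written in PyTorch for illustration. The module names, feature dimensions, and the way patch/word features are fed in are assumptions for this example rather than the authors' released implementation; only the intra-media bottom level and the weight-shared top level follow the description in the text.

```python
import torch
import torch.nn as nn

class HierarchicalRecurrentNet(nn.Module):
    """Two-level LSTM: the bottom level models intra-media context over
    fine-grained patch sequences; the top level is shared across media
    types to capture inter-media context (dimensions are hypothetical)."""
    def __init__(self, img_dim=4096, txt_dim=300, hidden=512):
        super().__init__()
        # Bottom level: one LSTM per media type over its patch/word sequence.
        self.img_bottom = nn.LSTM(img_dim, hidden, batch_first=True)
        self.txt_bottom = nn.LSTM(txt_dim, hidden, batch_first=True)
        # Top level: a single LSTM whose weights are shared by both media
        # types, corresponding to the weight-sharing (parameter-sharing) constraint.
        self.shared_top = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, img_patches, txt_words):
        # img_patches: (B, P, img_dim); txt_words: (B, T, txt_dim)
        img_ctx, _ = self.img_bottom(img_patches)   # intra-image context
        txt_ctx, _ = self.txt_bottom(txt_words)     # intra-text context
        img_top, _ = self.shared_top(img_ctx)       # inter-media context (shared weights)
        txt_top, _ = self.shared_top(txt_ctx)
        return img_top, txt_top                     # per-patch states, (B, L, hidden)
```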
Result
Experiments are conducted on two widely used cross-media datasets in comparison with 10 existing methods, with mean average precision (MAP) as the evaluation metric. The experimental results show that the proposed method achieves MAP scores of 0.469 and 0.575 on the two datasets, outperforming all compared methods.
Conclusion
By exploiting the fine-grained information of images and text, the proposed hierarchical recurrent attention network model can fully learn the precise cross-media correlation between images and text, effectively improving cross-media retrieval accuracy.
Objective
Cross-media retrieval aims to retrieve data of different media types in response to a query of any media type, which provides a flexible and useful retrieval experience for the numerous user demands of today. However, the "heterogeneity gap" leads to inconsistent representations of different media types, which makes it challenging to build correlations between them and realize cross-media retrieval. Nevertheless, data of different media types naturally have semantic consistency, and their patches contain abundant fine-grained information, which provides key clues for cross-media correlation learning. Existing methods mostly consider the pairwise correlation of various media types with the same semantics, but they ignore the context information among the fine-grained patches and thus cannot fully capture the cross-media correlation. To address this problem, a cross-media hierarchical recurrent attention network (CHRAN) is proposed to fully consider the intra- and inter-media fine-grained context information.
Method
First, we propose to construct a hierarchical recurrent network to fully exploit the cross-media fine-grained context information. Specifically, the hierarchical recurrent network consists of two levels, both implemented with long short-term memory networks. We extract features from the fine-grained patches of different media types and organize them into sequences, which serve as the inputs of the hierarchical network. The bottom level models the intra-media fine-grained context information, whereas the top level adopts a weight-sharing constraint to fully exploit the inter-media context correlation, which aims to share the knowledge learned from different media types. Thus, the hierarchical recurrent network provides intra- and inter-media fine-grained hints for boosting cross-media correlation learning. Second, we propose an attention-based cross-media joint embedding loss to learn the cross-media correlation. We utilize an attention mechanism to allow the model to focus on the necessary fine-grained patches within various media types, thereby allowing the inter-media co-attention to be explored. Furthermore, we jointly consider the matched and mismatched cross-media pairs to preserve the relative similarity ranking information, and we adopt a semantic constraint to preserve the semantically discriminative capability during the correlation learning process. Therefore, a precise fine-grained cross-media correlation can be captured to improve retrieval accuracy.
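To make the loss design concrete, the sketch below combines a simplified per-media attention pooling (standing in for the full inter-media co-attention), a ranking term over matched and mismatched image-text pairs, and a semantic classification constraint. The shapes, margin, and weighting factor are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveJointLoss(nn.Module):
    """Attention pooling over per-patch states plus a joint loss:
    a ranking term over matched vs. mismatched image-text pairs and
    a semantic (category) classification constraint.
    Margin, weighting, and dimensions are illustrative assumptions."""
    def __init__(self, hidden=512, num_classes=10, margin=0.2, alpha=1.0):
        super().__init__()
        self.attn = nn.Linear(hidden, 1)            # scores each patch state
        self.classifier = nn.Linear(hidden, num_classes)
        self.margin, self.alpha = margin, alpha

    def pool(self, states):                         # states: (B, L, hidden)
        weights = F.softmax(self.attn(states), dim=1)
        return (weights * states).sum(dim=1)        # attended embedding, (B, hidden)

    def forward(self, img_states, txt_states, labels):
        img = F.normalize(self.pool(img_states), dim=1)
        txt = F.normalize(self.pool(txt_states), dim=1)
        sim = img @ txt.t()                         # cosine similarity matrix, (B, B)
        pos = sim.diag()                            # matched pairs lie on the diagonal
        mask = torch.eye(sim.size(0), device=sim.device).bool()
        # Rank each matched pair above all mismatched pairs in the batch.
        cost_i2t = F.relu(self.margin + sim - pos.unsqueeze(1)).masked_fill(mask, 0)
        cost_t2i = F.relu(self.margin + sim - pos.unsqueeze(0)).masked_fill(mask, 0)
        ranking = cost_i2t.mean() + cost_t2i.mean()
        # Semantic constraint: both embeddings should predict the shared category.
        semantic = F.cross_entropy(self.classifier(img), labels) + \
                   F.cross_entropy(self.classifier(txt), labels)
        return ranking + self.alpha * semantic
```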
Result
We conduct experiments on two widely used cross-media datasets, the Wikipedia and Pascal Sentence datasets, and compare against 10 state-of-the-art methods to verify the effectiveness of the proposed CHRAN approach. We perform cross-media retrieval with two types of retrieval tasks, that is, retrieving text by image and retrieving image by text, and adopt the mean average precision (MAP) score as the evaluation metric. We also conduct baseline experiments to verify the contributions of the weight-sharing constraint and the cross-media attention modeling. The experimental results show that our approach achieves MAP scores of 0.469 and 0.575 on the two datasets and outperforms the state-of-the-art methods.
Conclusion
The proposed approach can effectively learn a precise fine-grained cross-media correlation. Compared with existing methods that mainly model the pairwise correlation and ignore the fine-grained context information, our hierarchical recurrent network fully captures the intra- and inter-media fine-grained context information with a cross-media co-attention mechanism, which further promotes the accuracy of cross-media retrieval.