基于深度学习的跨模态检索综述
Survey on deep learning based cross-modal retrieval
2021年26卷第6期 页码: 1368-1388
纸质出版日期: 2021-06-16
录用日期: 2021-02-10
DOI: 10.11834/jig.200862
尹奇跃, 黄岩, 张俊格, 吴书, 王亮. 基于深度学习的跨模态检索综述[J]. 中国图象图形学报, 2021,26(6):1368-1388.
Qiyue Yin, Yan Huang, Junge Zhang, Shu Wu, Liang Wang. Survey on deep learning based cross-modal retrieval[J]. Journal of Image and Graphics, 2021,26(6):1368-1388.
由于多模态数据的快速增长,跨模态检索受到了研究者的广泛关注,其将一种模态的数据作为查询条件检索其他模态的数据,如用户可以用文本检索图像或/和视频。由于查询及其检索结果模态表征的差异,如何度量不同模态之间的相似性是跨模态检索的主要挑战。随着深度学习技术的推广及其在计算机视觉、自然语言处理等领域的显著成果,研究者提出了一系列以深度学习为基础的跨模态检索方法,极大缓解了不同模态间相似性度量的挑战,本文称之为深度跨模态检索。本文从以下角度综述有代表性的深度跨模态检索论文,基于所提供的跨模态信息将这些方法分为3类:基于跨模态数据间一一对应的、基于跨模态数据间相似度的以及基于跨模态数据语义标注的深度跨模态检索。一般来说,上述3类方法提供的跨模态信息呈现递增趋势,且提供学习的信息越多,跨模态检索性能越优。在上述不同类别下,涵盖了7类主流技术,即典型相关分析、一一对应关系保持、度量学习、似然分析、学习排序、语义预测以及对抗学习。不同类别下包含部分关键技术,本文将具体阐述其中有代表性的方法。同时对比提供不同跨模态数据信息下不同技术的区别,以阐述在提供了不同层次的跨模态数据信息下相关技术的关注点与使用异同。为评估不同的跨模态检索方法,总结了部分代表性的跨模态检索数据库。最后讨论了当前深度跨模态检索待解决的问题以及未来的研究方向。
Over the last decade, different types of media data, such as texts, images, and videos, have grown rapidly on the Internet. These different types of data are often used to describe the same events or topics; for example, a web page usually contains not only a textual description but also images or videos illustrating the common content. Such data are referred to as multi-modal data, which inspire many applications, e.g., multi-modal retrieval, hot topic detection, and personalized recommendation. Nowadays, mobile devices and emerging social websites (e.g., Flickr, YouTube, and Twitter) are used by nearly everyone, and the demand for cross-modal retrieval is growing rapidly. Accordingly, cross-modal retrieval has attracted considerable attention: data of one modality are used as the query to retrieve relevant data of another modality; for example, a user can use a text to retrieve relevant pictures and/or videos. Because the query and its retrieved results can have different modalities, measuring the content similarity between different modalities of data, i.e., reducing the heterogeneity gap, remains a challenge. With the rapid development of deep learning techniques, various deep cross-modal retrieval approaches have been proposed to alleviate this problem, and promising performance has been obtained.
We aim to review and organize representative methods for deep learning based cross-modal retrieval. We first classify these approaches into three main groups according to the cross-modal information provided for learning: 1) co-occurrence information, 2) pairwise information, and 3) semantic information. Co-occurrence information based methods use only co-occurrence information to learn common representations across multi-modal data, where co-occurrence means that if different modalities of data co-exist in the same multi-modal document, they share the same semantics. Pairwise information based methods use similar and dissimilar pairs to learn the common representations; a similarity matrix over the modalities is usually provided, indicating whether two samples from different modalities belong to the same category. Semantic information based methods use class label information to learn common representations, where a multi-modal example can carry one or more manually annotated labels. Usually, co-occurrence information is also available in pairwise information and semantic information based approaches, and pairwise information can be derived when semantic information is provided, although these relationships do not necessarily hold.
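To make these three levels of supervision concrete, the toy snippet below is a purely illustrative sketch (the variable names and values are hypothetical, not taken from any surveyed dataset) of how each form of cross-modal information is typically represented for a small image-text collection.

```python
import numpy as np

# 1) Co-occurrence information: the i-th image and i-th text co-exist in the same
#    multi-modal document, so only aligned index pairs are known.
co_occurring_pairs = [(0, 0), (1, 1), (2, 2)]

# 2) Pairwise information: a binary similarity matrix S, where S[i, j] = 1 means the
#    i-th image and the j-th text belong to the same category.
S = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

# 3) Semantic information: explicit (possibly multi-) label annotations per example,
#    from which a similarity matrix like S above can be derived.
image_labels = np.array([[1, 0], [0, 1], [1, 0]])   # 3 images, 2 classes
text_labels  = np.array([[1, 0], [0, 1], [1, 0]])   # 3 texts, 2 classes
S_derived = (image_labels @ text_labels.T > 0).astype(int)   # shares at least one label
```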
In each category, various techniques can be utilized and combined to fully exploit the provided cross-modal information. We roughly categorize these techniques into seven main classes: 1) canonical correlation analysis, 2) correspondence preserving, 3) metric learning, 4) likelihood analysis, 5) learning to rank, 6) semantic prediction, and 7) adversarial learning. Canonical correlation analysis methods seek linear combinations of two vectors of random variables with the objective of maximizing their correlation; when combined with deep learning, the linear projections are replaced with deep neural networks, together with extra considerations.
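As a point of reference, the classical two-view CCA objective can be written as follows (generic notation, not tied to any particular surveyed paper); deep CCA keeps the same correlation objective but replaces the linear projections with neural networks.

```latex
% Classical CCA: maximize the correlation between two linearly projected views
\max_{w_x, w_y}\ \rho(w_x, w_y)
  = \frac{w_x^{\top}\Sigma_{xy} w_y}
         {\sqrt{w_x^{\top}\Sigma_{xx} w_x}\;\sqrt{w_y^{\top}\Sigma_{yy} w_y}}
% Deep CCA: maximize corr(f(x;\theta_x), g(y;\theta_y)) over the network parameters
```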
Correspondence preserving methods aim to preserve the co-existing relationship between different modalities, with the objective of minimizing their distances in the learned embedding space; the multi-modal correspondence is usually formulated as regularizers or loss functions that enforce a pairwise constraint when learning the common representations. Metric learning approaches seek to establish a distance function for measuring multi-modal similarities, with the objective of pulling similar pairs of modalities closer and pushing dissimilar pairs apart. Compared with correspondence preserving and canonical correlation analysis methods, both similar and dissimilar pairs are provided as constraints when learning the common representations.
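The following sketch (assuming PyTorch; the function names and the use of in-batch negatives are illustrative assumptions) contrasts a correspondence preserving loss, which only pulls co-occurring pairs together, with a margin-based metric learning loss, which additionally pushes dissimilar pairs apart.

```python
import torch
import torch.nn.functional as F

def correspondence_loss(img_emb, txt_emb):
    # Correspondence preserving: minimize the distance between each co-occurring
    # image-text pair in the learned common embedding space.
    return F.mse_loss(img_emb, txt_emb)

def contrastive_metric_loss(img_emb, txt_emb, margin=0.2):
    # Metric learning: similar pairs should be closer than dissimilar pairs by a margin.
    # Here the non-matching pairs inside the mini-batch serve as dissimilar pairs
    # (an illustrative assumption, not the only possible choice).
    dist = torch.cdist(img_emb, txt_emb)                      # (N, N) pairwise distances
    pos = dist.diagonal()                                     # distances of matching pairs
    neg_mask = ~torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
    neg = dist[neg_mask].view(dist.size(0), -1)               # distances of non-matching pairs
    return torch.clamp(margin + pos.unsqueeze(1) - neg, min=0).mean()
```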
Likelihood analysis methods, based on Bayesian analysis, are generative approaches whose objective is to maximize the likelihood of an observed multi-modal relationship, e.g., similarity. Conventionally, a maximum likelihood estimation objective is derived to maximize the posterior probability of the multi-modal observations.
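One common instantiation, used for example in deep cross-modal hashing, models an observed binary similarity between the i-th image representation and the j-th text representation as a Bernoulli variable; the formulation below is a representative sketch rather than the exact objective of every surveyed method.

```latex
% Bernoulli likelihood of the observed cross-modal similarity S_{ij}
p(S_{ij} \mid f_i, g_j) =
\begin{cases}
  \sigma(\Theta_{ij}),     & S_{ij} = 1 \\
  1 - \sigma(\Theta_{ij}), & S_{ij} = 0
\end{cases}
\qquad \Theta_{ij} = \tfrac{1}{2} f_i^{\top} g_j
% Maximizing the likelihood is equivalent to minimizing the negative log-likelihood
\mathcal{L} = \sum_{i,j} \left( \log\!\left(1 + e^{\Theta_{ij}}\right) - S_{ij}\,\Theta_{ij} \right)
```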
Learning to rank approaches construct a ranking model over the common representations, with the objective of maintaining the order of multi-modal similarities. Compared with metric learning methods, explicit ranking-loss based objectives are usually developed to optimize the similarity ranking.
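A widely used explicit ranking objective is the bidirectional triplet (hinge) ranking loss over cosine similarities, sketched below under the same PyTorch and in-batch-negative assumptions as above.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    # Learning to rank: for each query, the similarity of its matching item should
    # exceed the similarity of every non-matching item by at least a margin,
    # in both the image-to-text and text-to-image directions.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()                      # (N, N) cosine similarities
    pos = sim.diagonal().unsqueeze(1)                # similarities of matching pairs, (N, 1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    i2t = torch.clamp(margin + sim - pos, min=0).masked_fill(mask, 0)      # image as query
    t2i = torch.clamp(margin + sim - pos.t(), min=0).masked_fill(mask, 0)  # text as query
    return i2t.mean() + t2i.mean()
```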
Semantic prediction methods resemble traditional classification models, with the objective of accurately predicting the semantic labels of multi-modal data or of their relationships; by exploiting such high-level semantics, the intra-modal semantic structure can be effectively reflected in the learned multi-modal common representations.
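In its simplest form, semantic prediction attaches a classifier shared by both modalities to the common representations and trains it with the annotated labels; a minimal sketch (again assuming PyTorch, with hypothetical layer sizes) is given below.

```python
import torch
import torch.nn as nn

class SharedSemanticClassifier(nn.Module):
    # A single classifier applied to both modalities' common representations, so that
    # representations of either modality are pushed toward their shared semantic labels.
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, img_emb, txt_emb, labels):
        loss_fn = nn.CrossEntropyLoss()
        return loss_fn(self.fc(img_emb), labels) + loss_fn(self.fc(txt_emb), labels)
```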
Adversarial learning approaches employ generative adversarial networks, with the objective that the modality source of the learned common representations cannot be inferred. Usually, the generative and discriminative models are carefully designed to form a min-max game that yields statistically indistinguishable common representations.
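The min-max game can be sketched as a small modality discriminator that tries to infer whether a common representation comes from the image branch or the text branch, while the embedding networks are trained to fool it; the code below is an illustrative outline under the same PyTorch assumptions, not a complete training procedure.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    # Predicts the modality source (image vs. text) of a common representation.
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, emb):
        return self.net(emb)

def adversarial_losses(disc, img_emb, txt_emb):
    ce = nn.CrossEntropyLoss()
    img_src = torch.zeros(img_emb.size(0), dtype=torch.long, device=img_emb.device)  # label 0: image
    txt_src = torch.ones(txt_emb.size(0), dtype=torch.long, device=txt_emb.device)   # label 1: text
    # Discriminator step: correctly infer the modality source of each representation.
    d_loss = ce(disc(img_emb.detach()), img_src) + ce(disc(txt_emb.detach()), txt_src)
    # Embedding-network step: make the source indistinguishable by fooling the
    # discriminator, here via flipped modality labels.
    g_loss = ce(disc(img_emb), txt_src) + ce(disc(txt_emb), img_src)
    return d_loss, g_loss
```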
We also introduce several widely used multi-modal datasets in the community, i.e., the Wiki image-text dataset, the INRIA-Websearch dataset, the Flickr30K dataset, the Microsoft Common Objects in Context (MS COCO) dataset, the NUS-WIDE dataset (a real-world web image database from the National University of Singapore), the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) dataset, and the XMedia dataset.
Finally, we discuss open problems and future directions. 1) Some researchers have put forward transferable/extendable/zero-shot cross-modal retrieval, in which the multi-modal data in the source domain and the target domain can have different semantic annotation categories. 2) Effective cross-modal benchmark datasets that contain multiple modalities and are large enough to verify complex algorithms and to drive retrieval performance on huge data remain limited. 3) Accurately annotating all cross-modal data and every sample is impractical; thus, exploiting limited and noisy multi-modal data for cross-modal retrieval will be an important research direction. 4) Researchers have designed relatively complex algorithms to improve performance, but the requirements of retrieval efficiency are then difficult to satisfy; therefore, designing efficient and high-performance cross-modal retrieval algorithms is a crucial direction. 5) Embedding different modalities into a common representation space is difficult, so extracting fragment-level representations for different modalities and developing more sophisticated fragment-level relationship modeling will be among the future research directions.
跨模态检索；跨模态哈希；深度学习；共同表示学习；对抗学习；似然分析；学习排序
cross-modal retrieval; cross-modal hashing; deep learning; common representation learning; adversarial learning; likelihood analysis; learning to rank