Survey on deep learning based cross-modal retrieval

Yin Qiyue (尹奇跃), Huang Yan (黄岩), Zhang Junge (张俊格), Wu Shu (吴书), Wang Liang (王亮) (Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)

Abstract

Over the last decade, media data of different types, such as texts, images, and videos, have grown rapidly on the internet. Different types of data are often used to describe the same events or topics. For example, a web page usually contains not only a textual description but also images or videos that illustrate the same content. Such data are referred to as multi-modal data, and they inspire many applications, e.g., multi-modal retrieval, hot topic detection, and personalized recommendation. With mobile devices and social websites (e.g., Facebook, Flickr, YouTube, and Twitter) now in widespread use, the demand for cross-modal data retrieval is growing. Accordingly, cross-modal retrieval has attracted considerable attention. In cross-modal retrieval, data of one type serve as the query to retrieve relevant data of another type. For example, a user can use a text to retrieve relevant pictures and/or videos. Because the query and the retrieved results have different modalities, measuring the content similarity between them, i.e., reducing the heterogeneity gap, remains a challenge. With the rapid development of deep learning techniques, various deep cross-modal retrieval approaches have been proposed to alleviate this problem, and promising performance has been obtained. We aim to review and organize representative methods for deep learning based cross-modal retrieval. We first classify these approaches into three main groups based on the cross-modal information provided: 1) co-occurrence information, 2) pairwise information, and 3) semantic information. Co-occurrence information based methods use only co-occurrence information to learn common representations across multi-modal data, where co-occurrence means that data of different modalities that co-exist in one multi-modal document share the same semantics.
Pairwise information based methods use similar pairs and dissimilar pairs to learn the common representations. A similarity matrix over all modalities is usually provided, indicating whether two points from the modalities belong to the same category. Semantic information based methods use class label information to learn common representations, where a multi-modal example can have one or more labels, typically obtained through extensive manual annotation. Usually, co-occurrence information is also available in pairwise information and semantic information based approaches, and pairwise information can be derived when semantic information is provided. However, these relationships do not necessarily hold. In each category, various techniques can be utilized and combined to make full use of the provided cross-modal information. We roughly categorize these techniques into seven main classes: 1) canonical correlation analysis, 2) correspondence preserving, 3) metric learning, 4) likelihood analysis, 5) learning to rank, 6) semantic prediction, and 7) adversarial learning. Canonical correlation analysis methods find linear combinations of two vectors of random variables that maximize their correlation. When combined with deep learning, the linear projections are replaced with deep neural networks, with extra considerations. Correspondence preserving methods aim to preserve the co-existing relationship of different modalities by minimizing their distances in the learned embedding space. Usually, the multi-modal correspondence is formulated as a regularizer or loss function that enforces a pairwise constraint when learning multi-modal common representations. Metric learning approaches seek to establish a distance function for measuring multi-modal similarities, with the objective of pulling similar pairs of modalities closer and pushing dissimilar pairs apart.
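The derivation of pairwise information from semantic labels and the metric learning objective described above can be sketched as follows. This is a minimal illustration rather than any specific surveyed method: the function names (`pairwise_similarity`, `contrastive_loss`), the contrastive form of the loss, the margin value, and the toy labels and embeddings are all assumptions made for this example. In a deep model, the embeddings would be produced by modality-specific networks instead of being fixed by hand.

```python
import math


def pairwise_similarity(image_labels, text_labels):
    """Derive pairwise information from semantic labels (hypothetical helper).

    S[i][j] is 1 if image i and text j share at least one label, else 0.
    """
    return [[1 if set(a) & set(b) else 0 for b in text_labels]
            for a in image_labels]


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(image_emb, text_emb, sim, margin=1.0):
    """Mean contrastive loss over all image-text pairs.

    Similar pairs are penalized by their squared distance (pulled closer);
    dissimilar pairs are penalized only while they are closer than `margin`
    (pushed apart).
    """
    total, count = 0.0, 0
    for i, u in enumerate(image_emb):
        for j, v in enumerate(text_emb):
            d = euclidean(u, v)
            if sim[i][j]:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            count += 1
    return total / count


# Toy data (assumed for illustration): two images and two texts with labels.
image_labels = [["cat"], ["dog", "park"]]
text_labels = [["cat", "sofa"], ["park"]]
S = pairwise_similarity(image_labels, text_labels)

# Hand-picked 2-D embeddings in the common space, standing in for network outputs.
image_emb = [[0.0, 0.0], [1.0, 0.0]]
text_emb = [[0.0, 0.0], [1.0, 0.0]]

loss_small_margin = contrastive_loss(image_emb, text_emb, S, margin=1.0)
loss_large_margin = contrastive_loss(image_emb, text_emb, S, margin=2.0)
```

With these toy embeddings, matching pairs coincide and mismatched pairs sit at distance 1, so the loss vanishes for a margin of 1 and becomes positive once the margin demands a larger separation.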
Compared with correspondence preserving and canonical correlation analysis methods, metric learning approaches take similar pairs and dissimilar pairs as constraints when learning common representations. Likelihood analysis methods, based on Bayesian analysis, are generative approaches that maximize the likelihood of the observed multi-modal relationship, e.g., similarity. Conventionally, the maximum likelihood estimation objective is derived to maximize the posterior probability of the multi-modal observations. Learning to rank approaches aim to construct a ranking model on the common representations with the objective of maintaining the order of multi-modal similarities. Compared with metric learning methods, explicit ranking loss based objectives are usually developed to optimize ranking similarity. Semantic prediction methods resemble traditional classification models, with the objective of accurately predicting the semantic labels of multi-modal data or their relationships. With such high-level semantics utilized, intra-modal structure can be effectively reflected when learning multi-modal common representations. Adversarial learning approaches use generative adversarial networks to learn common representations from which the source modality cannot be inferred. Usually, the generative and discriminative models are carefully designed to form a min-max game for learning statistically inseparable common representations. We introduce several multi-modal datasets used in the community, i.e., the Wiki image-text dataset, the INRIA-Websearch dataset, the Flickr30K dataset, the Microsoft common objects in context (MS COCO) dataset, the real-world web image dataset from the National University of Singapore (NUS-WIDE), the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) dataset, and the XMedia dataset. Finally, we discuss open problems and future directions.
1) Some researchers have proposed transferred/extendable/zero-shot cross-modal retrieval, in which multi-modal data in the source domain and the target domain can have different semantic annotation categories. 2) Effective cross-modal benchmark datasets that contain multiple modalities and are large enough to verify complex algorithms, and thus to promote cross-modal retrieval performance on huge data, are limited. 3) Labeling all cross-modal data and annotating each sample accurately is impractical; thus, using limited and noisy multi-modal data for cross-modal retrieval will be an important research direction. 4) Researchers have designed relatively complex algorithms to improve performance, but these make the requirements of retrieval efficiency difficult to satisfy. Therefore, designing efficient and high-performance cross-modal retrieval algorithms is a crucial direction. 5) Embedding different modalities into a common representation space is difficult, and extracting fragment-level representations for different modalities and developing more complex fragment-level relationship modeling will be among the future research directions.
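As an illustration of the learning-to-rank objective summarized in the abstract, the sketch below computes a bidirectional hinge ranking loss over matched image-text pairs and then ranks texts for an image query. It is a hedged example under stated assumptions, not the formulation of any particular surveyed method: the names `ranking_loss` and `retrieve`, the dot-product similarity, the margin of 0.2, and the convention that image i matches text i are all choices made here for illustration.

```python
def dot(u, v):
    """Dot-product similarity between two embeddings in the common space."""
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(image_emb, text_emb, margin=0.2):
    """Bidirectional hinge ranking loss (assumes image i matches text i).

    For every matched pair, each mismatched candidate must score at least
    `margin` below the match, both for image-to-text and text-to-image.
    """
    n = len(image_emb)
    scores = [[dot(u, v) for v in text_emb] for u in image_emb]
    loss = 0.0
    for i in range(n):
        positive = scores[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i as query: the wrong text j should rank below text i.
            loss += max(0.0, margin - positive + scores[i][j])
            # Text i as query: the wrong image j should rank below image i.
            loss += max(0.0, margin - positive + scores[j][i])
    return loss


def retrieve(query_emb, candidate_embs):
    """Return candidate indices sorted by decreasing similarity to the query."""
    order = range(len(candidate_embs))
    return sorted(order, key=lambda k: dot(query_emb, candidate_embs[k]),
                  reverse=True)


# Toy embeddings (assumed): a well-aligned space and a misaligned one.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = ranking_loss(images, texts_aligned)
loss_swapped = ranking_loss(images, texts_swapped)
ranked_for_first_image = retrieve(images[0], texts_swapped)
```

In the aligned case every matched pair already outranks its mismatches by more than the margin, so the loss is zero; in the swapped case each of the four ranking constraints is violated, which is the situation a ranking-based objective is meant to correct.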