Abstract
Cross-Modal Representation Learning and Generation

Liu Huafeng, Chen Jinjin, Li Liang, Bao Bingkun, Li Zechao, Liu Jiaying, Nie Liqiang (Fudan University; Institute of Computing Technology, Chinese Academy of Sciences; Nanjing University of Posts and Telecommunications; Nanjing University of Science and Technology; Peking University; Harbin Institute of Technology, Shenzhen)

With the rapid growth of multimedia data, its multi-source and multi-modal nature has become a challenging problem in multimedia research, and cross-modal learning has consequently attracted attention from both academia and industry. Cross-modal representation and cross-modal generation are two core topics in this field. Cross-modal representation studies feature learning and information integration across different modalities. It aims to exploit the complementarity between modalities and eliminate the redundancy between them, so as to obtain a more effective feature representation. Cross-modal generation studies the mechanism of knowledge transfer across modalities: it exploits the semantic consistency between modalities to convert data from one modality into another, which helps improve transferability between modalities. This article systematically reviews recent advances in cross-modal representation and generation, both worldwide and domestically, covering four main directions: traditional cross-modal representation learning, foundation models for cross-modal representation learning, image-to-text cross-modal conversion, and cross-modal generation, the last of which includes joint representation across modalities and cross-modal image generation based on generative models. The traditional cross-modal representation research discussed in this article falls into two categories: joint representation and coordinated representation. Joint representation maps the information of multiple single modalities into a shared representation space, whereas coordinated representation processes each modality separately and learns the cross-modal representations collaboratively under similarity constraints. Research on traditional cross-modal representation laid the foundation for subsequent work.
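The distinction between joint and coordinated representation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the surveyed methods: the feature dimensions, the random linear projections, the `tanh` fusion, and the use of `1 - cosine similarity` as the similarity constraint are all hypothetical choices.

```python
import math
import random

random.seed(0)

def random_matrix(rows, cols):
    """Random weight matrix stored as a list of rows (one row per output unit)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def linear(vec, weights):
    """Apply a linear projection: one dot product per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def joint_representation(img, txt, w_joint):
    """Joint: concatenate both modalities and map them to ONE shared vector."""
    return [math.tanh(h) for h in linear(img + txt, w_joint)]

def coordinated_representation(img, txt, w_img, w_txt):
    """Coordinated: project each modality separately, keeping TWO vectors."""
    return linear(img, w_img), linear(txt, w_txt)

def similarity_constraint(z_img, z_txt):
    """Illustrative loss term: small when the paired embeddings are aligned."""
    return 1.0 - cosine(z_img, z_txt)

# Toy single-modal features for one image-text pair (dimensions are arbitrary).
img = [random.gauss(0.0, 1.0) for _ in range(8)]
txt = [random.gauss(0.0, 1.0) for _ in range(6)]

w_joint = random_matrix(5, 8 + 6)  # acts on the concatenated features
w_img = random_matrix(5, 8)        # image-only projection
w_txt = random_matrix(5, 6)        # text-only projection

joint = joint_representation(img, txt, w_joint)
z_img, z_txt = coordinated_representation(img, txt, w_img, w_txt)
loss = similarity_constraint(z_img, z_txt)
```

In the joint case a single vector carries both modalities, while in the coordinated case the two embeddings stay separate and are tied together only through the loss. A real system would train the projections to minimize such a loss rather than leave them random.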
As pre-training techniques have unlocked the self-supervised learning ability of deep neural networks on large-scale unlabeled data, multi-modal pre-trained foundation models have gradually attracted widespread attention from academia and industry, especially the Transformer-based methods discussed in this paper. Unlike the previously dominant supervised learning paradigm, large pre-trained models can make full use of large-scale unlabeled data for training and then be fine-tuned with a small amount of labeled data from downstream tasks. Compared with models trained directly for specific tasks, pre-trained models have better generality and transferability, and models fine-tuned from them have achieved significant performance improvements on various downstream tasks. The development of cross-modal synthesis methods (also known as image captioning or video captioning) is summarized, including end-to-end, semantic-based, and stylization-based methods. In addition, the development of cross-modal conversion between image and text is discussed, including image captioning, video captioning, and visual question answering. Finally, cross-modal generation methods are reviewed, including the joint representation of cross-modal information, image generation, text-image cross-modal generation, and cross-modal generation based on pre-trained models. In recent years, generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs) have shown great potential in cross-modal generation tasks. Thanks to the strong adaptability and generative ability of DDPMs, the images they produce have fine textures and are not restricted to specific domains, which has boosted cross-modal generation research. This paper summarizes progress on both GAN-based and DDPM-based methods.
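For readers unfamiliar with DDPMs, the forward (noising) half of the model can be sketched in a few lines. This is only an illustration under stated assumptions: the linear schedule, the range 1e-4 to 0.02, the 1000 steps, and the toy four-value "image" are commonly used defaults assumed here, not settings taken from the surveyed work. The learned reverse (denoising) network that actually generates images is omitted.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one step."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [1.0, -1.0, 0.5, 0.0]                    # toy "image" with four pixel values
x_early = q_sample(x0, 0, alpha_bars, rng)    # almost identical to x0
x_late = q_sample(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

Because `alpha_bars` shrinks toward zero, late timesteps retain almost none of the original signal. A DDPM is trained to undo this corruption step by step, and in cross-modal generation that denoiser is additionally conditioned on another modality such as text.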
The challenges, evolution, and state-of-the-art methods in cross-modal representation and generation are also comprehensively reviewed. Finally, the last section offers an outlook on future work in cross-modal learning.