Cross-Modal Representation and Generation Technology
Liu Huafeng, Chen Jingjing, Li Liang, Bao Bingkun, Li Zechao, Liu Jiaying, Nie Liqiang (Nanjing University of Science and Technology; Fudan University; Institute of Computing Technology, Chinese Academy of Sciences; Nanjing University of Posts and Telecommunications; Peking University; Harbin Institute of Technology, Shenzhen) Abstract
Multimedia data are growing explosively and exhibit multi-source, heterogeneous characteristics, so research on cross-modal learning has gradually attracted attention from both academia and industry. Cross-modal representation and generation are the two core foundational problems of cross-modal learning. Cross-modal representation aims to exploit the complementarity among multiple modalities and eliminate inter-modal redundancy so as to obtain more effective feature representations; cross-modal generation, building on the semantic consistency between modalities, realizes the conversion of data between different modal forms and helps improve transferability across modalities. This article systematically analyzes recent international and domestic research progress in cross-modal representation and generation, covering traditional cross-modal representation learning, representation learning with multi-modal large models, image-to-text cross-modal conversion, and cross-modal image generation. Traditional cross-modal representation learning is discussed in terms of unified and coordinated cross-modal representations; multi-modal large-model representation learning focuses on Transformer-based models; image-to-text conversion covers developments in image and video captioning, semantic analysis of video captions, and visual question answering; and cross-modal image generation is presented from the perspectives of joint representation of different modal information, cross-modal image generation techniques, and pre-training-based domain-specific image generation. The article reviews the challenges of each sub-field in detail, compares domestic and international progress, and traces the development trajectory and frontier of academic research. Finally, based on the above analysis, it discusses future trends and potential breakthroughs in cross-modal representation and generation.
Cross-Modal Representation Learning and Generation
Liu Huafeng, Chen Jingjing, Li Liang, Bao Bingkun, Li Zechao, Liu Jiaying, Nie Liqiang (Fudan University; Institute of Computing Technology, Chinese Academy of Sciences; Nanjing University of Posts and Telecommunications; Nanjing University of Science and Technology; Peking University; Harbin Institute of Technology, Shenzhen) Abstract
Nowadays, with the explosive growth of multimedia data, their multi-source and multi-modal character has become a challenging problem in multimedia research, and cross-modal learning has therefore attracted attention from both academia and industry. Cross-modal representation and generation are the two core topics in cross-modal learning research. Cross-modal representation studies feature learning and information integration across different modalities: it aims to exploit the complementarity between modalities and eliminate inter-modal redundancy, so as to obtain more effective feature representations. Cross-modal generation studies the knowledge transfer mechanism across modalities: building on the semantic consistency between modalities, it realizes the conversion of data between different modal forms and helps improve transferability between modalities. This article systematically reviews recent international and domestic advances along four main directions: traditional cross-modal representation learning, foundation models for cross-modal representation learning, image-to-text cross-modal conversion, and cross-modal image generation.
Traditional cross-modal representation research falls into two categories: joint representation and coordinated representation. Joint representation maps multiple single-modal inputs into a shared representation space, whereas coordinated representation processes each modality separately and learns cross-modal representations collaboratively through similarity constraints. The study of traditional cross-modal representation laid the foundation for subsequent research.
As pre-training has successfully unlocked the self-supervised learning ability of deep neural networks on large-scale unlabeled data, multi-modal pre-trained foundation models, especially the Transformer-based methods discussed in this article, have gradually attracted widespread attention from academia and industry. Unlike the previously dominant supervised learning paradigm, pre-trained large models can make full use of large-scale unlabeled data for training and require only a small amount of labeled data from downstream tasks for fine-tuning. Compared with models trained directly for specific tasks, pre-trained models offer better generality and transferability, and models fine-tuned on them have achieved significant performance improvements on various downstream tasks.
The development of cross-modal synthesis methods (i.e., image and video captioning) is summarized, covering end-to-end, semantic-based, and style-based approaches. In addition, the development of cross-modal conversion between image and text is discussed, including image captioning, video captioning, and visual question answering.
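As a concrete illustration of the joint versus coordinated representation paradigms described above, the following minimal PyTorch-style sketch contrasts the two. All module names, feature dimensions, and the specific similarity loss are illustrative assumptions, not the implementation of any surveyed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRepresentation(nn.Module):
    """Joint representation: fuse single-modal features into one shared space."""
    def __init__(self, img_dim=2048, txt_dim=768, out_dim=512):
        super().__init__()
        self.fuse = nn.Linear(img_dim + txt_dim, out_dim)

    def forward(self, img_feat, txt_feat):
        # Concatenate the single-modal features, then project into the joint space.
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

class CoordinatedRepresentation(nn.Module):
    """Coordinated representation: separate per-modality encoders tied together
    only by a similarity constraint on their outputs."""
    def __init__(self, img_dim=2048, txt_dim=768, out_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.txt_proj = nn.Linear(txt_dim, out_dim)

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def similarity_loss(z_img, z_txt):
    # Pull matched image-text pairs together (cosine similarity toward 1).
    return (1.0 - F.cosine_similarity(z_img, z_txt)).mean()
```

Contrastive objectives such as InfoNCE, used by CLIP-style pre-trained models, extend this similarity constraint by also pushing mismatched pairs apart.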
Finally, cross-modal generation methods are reviewed, including the joint representation of cross-modal information, image generation, text-to-image cross-modal generation, and cross-modal generation based on pre-trained models. In recent years, generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs) have shown great potential in cross-modal generation tasks. Thanks to their strong adaptability and generative capacity, DDPMs produce images with fine textures without being confined to specific domains, which has boosted cross-modal generation research. This article summarizes progress in both GAN-based and DDPM-based methods.
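For orientation, the DDPM mechanism referenced above can be summarized by its standard formulation (Ho et al., 2020); this is given here as background rather than as the formulation of any particular surveyed method. The forward process gradually adds Gaussian noise, and a network $\epsilon_\theta$ is trained to predict that noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big), \qquad L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[ \big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\; t\big) \big\rVert^2 \Big]$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. In text-to-image generation, the denoising network is additionally conditioned on a text embedding $c$, i.e., $\epsilon_\theta(x_t, t, c)$.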
The challenges, evolution, and state-of-the-art methods in cross-modal representation and generation are also comprehensively reviewed, and the last section provides an outlook on future work in cross-modal learning.
Keywords
multimedia technology, cross-modal learning, foundation model, cross-modal representation, cross-modal generation, deep learning