Review of multimodal data processing techniques with limited data

Wang Peijin; Yan Zhiyuan; Rong Xuee; Li Junxi; Lu Xiaonan; Hu Huiyang; Yan Qiwei; Sun Xian

Published: 2022-10-20
DOI: 10.11834/jig.220049
2022 | Volume 27 | Number 10

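Wang Peijin^1,2,3, Yan Zhiyuan^1,2,3, Rong Xuee^1,2,3, Li Junxi^1,2,3, Lu Xiaonan^1,2,3, Hu Huiyang^1,2,3, Yan Qiwei^1,2,3, Sun Xian^1,2,3
(1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190;
2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049;
3. Key Laboratory of Network Information System Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190)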

Abstract

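With the development of multimedia technology, the variety and volume of available media data have grown substantially. Inspired by the way humans perceive the world, the fusion of multiple kinds of media data has advanced artificial intelligence research in computer vision, with wide applications in remote sensing image interpretation, biomedicine, and depth estimation. Although multimodal data have clear advantages in describing the characteristics of objects, they still face considerable challenges: 1) limited by different imaging devices and sensors, it is difficult to collect large-scale, high-quality multimodal datasets; 2) multimodal data must be matched in pairs for study, so the absence of any one modality reduces the amount of usable data; 3) processing and annotating image and video data takes considerable time and labor. Because of these problems, the techniques in this field still need further work. Focusing on multimodal learning methods under limited-data conditions, this paper divides limited-data multimodal methods in computer vision into five directions according to dimensions such as sample quantity, annotation information, and sample quality: few-shot learning, lack of strongly supervised annotation, active learning, data denoising, and data augmentation. It describes in detail the sample characteristics of each class of methods and the latest progress in models and methods. It also introduces the datasets used by multimodal learning methods under limited data and their applications (including human pose estimation and person re-identification), and compares the strengths and weaknesses of existing algorithms and discusses future directions, which is of positive significance for the development of the field.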
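The second challenge above (paired modalities) can be made concrete with a small sketch. It assumes a hypothetical dataset in which each sample is a dictionary mapping modality names to file paths, with `None` marking a missing modality. The modality names `rgb` and `depth`, the file names, and the helper `complete_pairs` are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical sample record: modality name -> file path, or None if missing.
Sample = Dict[str, Optional[str]]


def complete_pairs(samples: List[Sample], required: Sequence[str]) -> List[Sample]:
    """Keep only the samples in which every required modality is present."""
    return [s for s in samples if all(s.get(m) is not None for m in required)]


# Illustrative data: two of the four samples lack one modality.
samples: List[Sample] = [
    {"id": "a", "rgb": "a_rgb.png", "depth": "a_depth.png"},
    {"id": "b", "rgb": "b_rgb.png", "depth": None},
    {"id": "c", "rgb": None, "depth": "c_depth.png"},
    {"id": "d", "rgb": "d_rgb.png", "depth": "d_depth.png"},
]

usable = complete_pairs(samples, ("rgb", "depth"))
print(len(usable), "of", len(samples), "samples are usable for paired training")
```

In this toy case each modality is present in three of four samples, yet only two samples survive the pairing requirement, which is the shrinkage the abstract describes.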

Keywords

multimodal data; limited data; deep learning; fusion algorithms; computer vision

Review of multimodal data processing techniques with limited data

Wang Peijin^1,2,3, Yan Zhiyuan^1,2,3, Rong Xuee^1,2,3, Li Junxi^1,2,3, Lu Xiaonan^1,2,3, Hu Huiyang^1,2,3, Yan Qiwei^1,2,3, Sun Xian^1,2,3
(1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
3. Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China)

Abstract

The growth of multimedia technology has made far more media data available, in both variety and volume. Inspired by human perception, the fusion of multiple kinds of media data has promoted research on artificial intelligence (AI) in computer vision, with a wide range of applications such as remote sensing image interpretation, biomedicine, and depth estimation. A modality is a form of representation of things, and multimodality refers to describing things from multiple perspectives. Early AI techniques focused on a single modality of data. Research inspired by human perception has shown that each modality gives a relatively independent description of things, and that using the complementary representations of multimodal data yields a fuller, more three-dimensional description. The processing and application of multimodal data has recently been studied intensively in areas such as sentiment analysis, machine translation, natural language processing, and biomedicine. This review focuses on multimodal learning in computer vision, which mainly analyzes multimodal data related to images and videos, learns from the different modalities and the complementary information between them, and serves tasks such as image detection and recognition, semantic segmentation, and video action prediction. Multimodal data has clear advantages in describing objects, but it still faces challenges. First, it is difficult to collect large-scale, high-quality multimodal datasets because of the limitations of different imaging devices and sensors. Second, processing and labeling image and video data is time-consuming and labor-intensive. Focusing on multimodal learning methods under limited data, we divide limited-data multimodal methods in computer vision into five directions: few-shot learning, lack of strong supervision information, active learning, data denoising, and data augmentation.
The sample characteristics of each direction and the latest progress in models are reviewed as follows. 1) When multimodal data is insufficient, few-shot learning methods aim to make correct judgments after learning from only a small number of samples, so that target features can still be learned effectively when data is lacking. 2) Because data labeling is costly, it is difficult to obtain ground-truth labels for every modality for strongly supervised training. Methods without full supervision commonly include weakly supervised, unsupervised, semi-supervised, and self-supervised learning; they reduce the reliance on modal labeling information and cut the cost of manual labeling. 3) Active learning methods combine prior knowledge with learning rules to design models that can learn autonomously, aiming to get the most out of a small number of samples. By selecting which samples to label, they effectively reduce labeling costs. 4) Multimodal data denoising refers to reducing noise in the data, restoring the original data, and then extracting the information of interest. 5) To make full use of limited multimodal data, data augmentation methods expand the data with realistic samples by applying a series of transformations to the original dataset. In addition, we introduce the datasets used by limited-data multimodal learning methods and their applications, such as human pose estimation and person re-identification, and compare and analyze the performance of existing algorithms. Their strengths and weaknesses are discussed, and future directions are outlined as follows. 1) Lightweight multimodal data processing methods: deploying limited-data multimodal learning models on mobile devices remains a challenge.
When existing methods fuse information from multiple modalities, they generally need two or more networks to extract features, which are then fused. The resulting large number of parameters and complex model structure limit their application on mobile devices, so lightweight models are a promising direction for future work. 2) General-purpose multimodal intelligent processing models: most existing multimodal data processing methods are algorithms developed for specific tasks and must be trained on those tasks. This task-specific training greatly increases the cost of developing models and makes it difficult to meet the needs of more application scenarios. Therefore, for data of different modalities, a general perception model is needed that learns general representations of multimodal data, so that its parameters and features can be shared across scenarios. 3) Models driven jointly by multi-source knowledge and data: it is possible to introduce feature data and knowledge beyond the multimodal data itself, build an integrated knowledge- and data-driven model, and thereby improve the model's performance and interpretability.
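The data augmentation idea described above (expanding a small dataset by transforming the original samples) can be illustrated for paired modalities. This is a minimal sketch under stated assumptions, not a method from the paper: it uses a single transformation (horizontal flip), represents each modality as a nested list of numbers, and introduces the hypothetical helpers `hflip`, `augment_pair`, and `expand_dataset`. The one point it is meant to show is that the same geometric transformation is applied to both modalities so that they stay spatially aligned.

```python
import random
from typing import List, Tuple

Grid = List[List[float]]      # one modality as a 2D grid (assumption for illustration)
Pair = Tuple[Grid, Grid]      # e.g. (image, depth map) of the same scene


def hflip(grid: Grid) -> Grid:
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def augment_pair(first: Grid, second: Grid, rng: random.Random) -> Pair:
    """Flip both modalities together (or neither) so they remain aligned."""
    if rng.random() < 0.5:
        return hflip(first), hflip(second)
    return first, second


def expand_dataset(pairs: List[Pair], copies: int, seed: int = 0) -> List[Pair]:
    """Return the original pairs plus `copies` augmented versions of each pair."""
    rng = random.Random(seed)  # seeded for reproducibility
    expanded = list(pairs)
    for first, second in pairs:
        for _ in range(copies):
            expanded.append(augment_pair(first, second, rng))
    return expanded


# Toy paired sample: a 2x3 "image" and its 2x3 "depth map".
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
depth = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
dataset = expand_dataset([(image, depth)], copies=3)
print(len(dataset), "pairs after augmentation")
```

Flipping only one modality would break the pixel-level correspondence that paired multimodal training relies on, which is why the random decision is made once per pair rather than once per modality.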

Keywords

multimodal data; limited data; deep learning; fusion algorithms; computer vision