王佩瑾1,2,3, 闫志远1,2,3, 容雪娥1,2,3, 李俊希1,2,3, 路晓男1,2,3, 胡会扬1,2,3, 严启炜1,2,3, 孙显1,2,3 (1. 中国科学院空天信息创新研究院, 北京 100190;
2. 中国科学院大学电子电气与通信工程学院, 北京 100049;
3. 中国科学院空天信息创新研究院网络信息体系技术科技创新重点实验室, 北京 100190)
Review of multimodal data processing techniques with limited data
Wang Peijin1,2,3, Yan Zhiyuan1,2,3, Rong Xuee1,2,3, Li Junxi1,2,3, Lu Xiaonan1,2,3, Hu Huiyang1,2,3, Yan Qiwei1,2,3, Sun Xian1,2,3 (1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
3. Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China)
The growth of multimedia technology has made increasingly diverse media data available. Fusing multiple kinds of media data, in the way human perception does, has advanced research and development (R&D) of artificial intelligence (AI) for computer vision, with applications such as remote sensing image interpretation, biomedicine, and depth estimation. Multimodality can be viewed as a form of representation of things (RoT): it describes things from multiple perspectives. Early AI technology focused on a single modality of data. Research on human perception has shown that each modality provides a relatively independent description of things (IDoT), and that using the complementary representations of multimodal data makes the description of things fuller and more multidimensional. The processing and application of multimodal data have recently been developed intensively in areas such as sentiment analysis, machine translation, natural language processing, and biomedicine. This critical review focuses on the development of multimodality. In computer vision, multimodal learning mainly analyzes multimodal data related to images and videos, learns across modalities and exploits their complementary information, and supports tasks such as image detection and recognition, semantic segmentation, and video action prediction. Multimodal data has advantages for describing objects, but it also poses challenges. First, large-scale, high-quality multimodal datasets are difficult to collect because of equipment constraints, such as the need for multiple imaging devices and sensors. Second, processing and labeling image and video data is time-consuming and labor-intensive. Accordingly, multimodal learning methods for limited data in computer vision can be divided into five categories: few-shot learning, learning without strong supervision, active learning, data denoising, and data augmentation.
The characteristics of the samples and the evolution of the models are critically reviewed as follows. 1) When multimodal data is insufficient, few-shot learning methods aim to make correct judgments after learning from only a small number of samples, so that target features can be learned effectively despite the lack of data. 2) Because data labeling is costly, it is difficult to obtain ground-truth labels for every modality in order to train a model with strong supervision. Methods with incomplete supervision commonly include weakly supervised, unsupervised, semi-supervised, and self-supervised learning. These methods make better use of the label information available for each modality and reduce the cost of manual labeling. 3) Active learning methods design models with autonomous learning ability that combine prior knowledge with learning rules, aiming to get the most out of a few samples. By optimizing which samples are selected for labeling, they can effectively reduce labeling costs. 4) Multimodal data denoising refers to reducing noise in the data, restoring the original data, and then extracting the information of interest. 5) To make full use of limited multimodal data, data augmentation methods for the few-sample setting generate additional realistic data by applying a series of transformation operations to the original dataset. In addition, the datasets used by multimodal learning methods for limited data are summarized, applications such as human pose estimation and person re-identification are introduced, and the performance of existing algorithms is compared and analyzed. The advantages and disadvantages of existing methods are discussed, and future development directions are outlined as follows. 1) Lightweight multimodal data processing methods: we argue that multimodal learning with limited data still faces the challenge of deploying models on mobile devices.
When existing methods fuse information from multiple modalities, they generally need two or more networks for feature extraction before the features are fused. The resulting large number of parameters and complex model structure limit their application on mobile devices, so lightweight models are a promising direction for future work. 2) General-purpose multimodal intelligent processing models: most existing multimodal data processing methods are algorithms developed for individual tasks and must be trained on each specific task. This task-specific training greatly increases the cost of developing models and makes it difficult to meet the needs of more application scenarios. Therefore, for data of different modalities, a general perception model should be developed that learns a general representation of multimodal data, so that its parameters and features can be shared across scenarios. 3) Models driven by multi-source knowledge and data: knowledge beyond the multimodal data itself can be introduced together with data features to establish an integrated knowledge-and-data-driven model, improving both the performance and the interpretability of the model.
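As a minimal illustration of the data augmentation category described above (item 5 of the method categories), the following Python sketch applies the same randomly chosen geometric transformation to two modalities of one sample. It assumes the two modalities are pixel-aligned 2-D grids; the names `optical`, `depth`, `TRANSFORMS`, and `augment_pair` are hypothetical and chosen only for this example. It is not taken from any method covered by the review.

```python
import random


def hflip(grid):
    """Mirror a 2-D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def rot90(grid):
    """Rotate a 2-D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def identity(grid):
    """Return a copy of the grid unchanged."""
    return [list(row) for row in grid]


# Hypothetical transform pool, kept deliberately small for illustration.
TRANSFORMS = {"identity": identity, "hflip": hflip, "rot90": rot90}


def augment_pair(optical, depth, n_new, seed=0):
    """Return n_new augmented (optical, depth, transform_name) triples.

    Assumption: both modalities are pixel-aligned, so the same transform
    is applied to each of them to keep that alignment intact.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        name = rng.choice(sorted(TRANSFORMS))
        transform = TRANSFORMS[name]
        augmented.append((transform(optical), transform(depth), name))
    return augmented


if __name__ == "__main__":
    optical = [[1, 2], [3, 4]]
    depth = [[10, 20], [30, 40]]
    for opt, dep, name in augment_pair(optical, depth, n_new=3):
        print(name, opt, dep)
```

Sharing one transform per sample is the design choice that matters here: if each modality were transformed independently, the cross-modal correspondence that fusion relies on would be lost.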