Visual knowledge: a new pivot for cross-media intelligence evolution

Yang Yi; Zhuang Yueting; Pan Yunhe

Published: 2022-09-17
DOI: 10.11834/jig.211264
2022 | Volume 27 | Number 9

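Yang Yi¹, Zhuang Yueting¹, Pan Yunhe¹,² (1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027; 2. Zhejiang Laboratory, Hangzhou 310027)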

Abstract

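This paper reviews the development of cross-media intelligence, analyzes its new trends and practical bottlenecks, and looks ahead to its future prospects. Cross-media intelligence aims to fuse multi-source, multi-modal data and to exploit the relationships among data from different media for high-level semantic understanding and logical reasoning. Existing cross-media algorithms mostly follow a paradigm that moves from single-media representation to multimedia fusion, in which feature learning and logical reasoning are relatively separate processes. They cannot integrate multi-source, multi-level semantic information into unified features, which prevents reasoning and learning from reinforcing and correcting each other. This paradigm lacks explicit knowledge accumulation and multi-level structural understanding, and it also limits model credibility and robustness. Against this background, this paper turns to a new form of intelligent representation: visual knowledge. Cross-media intelligence driven by visual knowledge is characterized by multi-level modeling and knowledge reasoning, and it lends itself to visual manipulation and reconstruction. This paper introduces the three basic elements of visual knowledge, namely visual concepts, visual relations, and visual reasoning, and discusses and analyzes each in detail. Visual knowledge helps realize a unified framework driven by both data and knowledge, learn structured representations that are attributable and traceable, and advance cross-media knowledge association and intelligent reasoning. With its strong capacity for abstract knowledge representation and for complementing multiple kinds of knowledge, visual knowledge provides a new and powerful pivot for the evolution of cross-media intelligence.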

Keywords

cross-media intelligence; visual knowledge; visual concepts; visual relations; visual reasoning

The review of visual knowledge: a new pivot for cross-media intelligence evolution

Yang Yi¹, Zhuang Yueting¹, Pan Yunhe¹,² (1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China; 2. Zhejiang Laboratory, Hangzhou 310027, China)

Abstract

We review the recent development of cross-media intelligence, analyze its new trends and challenges, and discuss its future prospects. Cross-media intelligence focuses on integrating multi-source and multi-modal data, and attempts to use the relationships between data from different media for high-level semantic understanding and logical reasoning. Existing cross-media algorithms mainly follow a paradigm that moves from "single-media representation" to "multimedia integration", in which feature learning and logical reasoning are relatively disconnected. Such algorithms cannot synthesize multi-source and multi-level semantic information into unified features, which keeps reasoning and learning from benefiting each other. This paradigm lacks explicit knowledge accumulation and multi-level structure understanding, and it also restricts the interpretability and robustness of the model. We therefore turn to a new representation method, i.e., visual knowledge. Cross-media intelligence driven by visual knowledge features multi-level modeling and knowledge reasoning. Its built-in mechanisms support visual operations and reconstruction, and learn knowledge alignment and association. To establish a unified way of knowledge representation learning, we present the theory of visual knowledge as follows.

1) We introduce the three key elements of visual knowledge, i.e., visual concepts, visual relationships, and visual reasoning. Visual knowledge is capable of abstracting knowledge representations and of complementing multiple kinds of knowledge. Visual relationships represent the relations between visual concepts and provide an effective basis for more complex cross-media visual reasoning. We demonstrate vision-based spatio-temporal and causal relationships, but visual relationships are not limited to these categories. We recommend that pairwise visual relationships be extended to multi-object cascade relationships and that spatio-temporal and causal representations be integrated effectively. Visual knowledge builds on visual concepts and visual relationships, enabling more interpretable and generalizable high-level cross-media visual reasoning. It provides a structured knowledge representation and a multi-level basis for visual reasoning, and offers an effective way to explain neural network decisions. Broadly, visual reasoning here includes a variety of visual operations, such as prediction, reconstruction, association, and decomposition.

2) We discuss the applications of visual knowledge and analyze their future challenges in detail. We select three applications: structured representation of visual knowledge, operation and reasoning of visual knowledge, and cross-media reconstruction and generation. Visual knowledge is expected to resolve ambiguity in relational descriptions and to suppress data bias effectively. It is worth noting that these three applications are only some examples of visual knowledge in cross-media intelligence. Although hand-crafted features are less capable of abstracting multimedia data than deep learning features, these descriptors tend to be more interpretable. The effective integration of hand-crafted features and deep learning features for cross-media representation modeling is a typical application of visual knowledge representation in the context of cross-media intelligence. The structured representation of visual knowledge contributes to the improvement of model interpretability.

3) We analyze the advantages of visual knowledge. It helps to achieve a unified framework driven by both data and knowledge, learn explainable structured representations, and promote cross-media knowledge association and intelligent reasoning. As cross-media intelligence based on visual knowledge develops, more cross-media intelligence applications will emerge. Decision-making assistance becomes more credible through the structured, multi-granularity representation of visual knowledge and the integrated optimization of multi-source and cross-domain data. The reasoning process can be reviewed and clarified, and model generalization ability can be improved systematically. These factors provide a new and powerful pivot for the evolution of cross-media intelligence. Visual knowledge can greatly improve generative models and enhance the application of simulation technology. In the future, visual knowledge can be used as a prior to improve scene rendering and to realize interactive visual editing tools and controllable semantic understanding of scene objects. A graphics system driven by data and visual knowledge will focus on integrating the strengths of data and rules, extracting semantic features from visual data, optimizing model complexity, improving simulation, and producing realistic and sustainable content from new perspectives and in new scenarios.
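
To make the three elements concrete, the following is a minimal Python sketch of one possible way to hold visual concepts and typed visual relationships in a structure and to run a simple reasoning step over them. It is an illustration under stated assumptions, not the representation proposed in the paper: the class names `VisualConcept`, `VisualRelation`, and `VisualKnowledge`, the `cascade` method, and the toy scene are all hypothetical. The `cascade` step only chains pairwise relationships of one kind across several objects, which is one simple reading of "multi-object cascade relationships".

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualConcept:
    """Hypothetical node type: a named concept with optional part names."""
    name: str
    parts: tuple = ()


@dataclass(frozen=True)
class VisualRelation:
    """Hypothetical typed edge between two concepts."""
    subject: str
    kind: str        # assumed categories: "spatial", "temporal", "causal"
    predicate: str
    obj: str


class VisualKnowledge:
    """Toy container for concepts and pairwise relations (illustrative only)."""

    def __init__(self):
        self.concepts = {}
        self.relations = []

    def add_concept(self, concept):
        self.concepts[concept.name] = concept

    def add_relation(self, relation):
        # Reject relations that mention concepts we have not registered.
        for name in (relation.subject, relation.obj):
            if name not in self.concepts:
                raise KeyError(f"unknown concept: {name}")
        self.relations.append(relation)

    def cascade(self, start, kind):
        """Chain pairwise relations of one kind, breadth-first, from `start`.

        Returns the names of the concepts reached, in visiting order.
        """
        seen = {start}
        reached = []
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for rel in self.relations:
                if rel.subject == current and rel.kind == kind and rel.obj not in seen:
                    seen.add(rel.obj)
                    reached.append(rel.obj)
                    queue.append(rel.obj)
        return reached


# Toy scene (invented for illustration): a hand pushes a cup that sits on a table.
kb = VisualKnowledge()
for concept in (
    VisualConcept("hand"),
    VisualConcept("cup", parts=("handle", "body")),
    VisualConcept("water"),
    VisualConcept("table"),
):
    kb.add_concept(concept)

kb.add_relation(VisualRelation("cup", "spatial", "on", "table"))
kb.add_relation(VisualRelation("hand", "causal", "pushes", "cup"))
kb.add_relation(VisualRelation("cup", "causal", "spills", "water"))

print(kb.cascade("hand", "causal"))   # ['cup', 'water']
print(kb.cascade("cup", "spatial"))   # ['table']
```

Because every step of the chain is an explicit relation, the path from "hand" to "water" can be traced back edge by edge, which is the kind of reviewable reasoning the abstract attributes to structured representations.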

Keywords

cross-media intelligence; visual knowledge; visual concepts; visual relationships; visual reasoning
