Frontier research and latest trends in 3D visual-language reasoning techniques

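Lei Yinjie1, Xu Kai2, Guo Yulan2, Yang Xin3, Wu Yuwei4, Hu Wei5, Yang Jiaqi6, Wang Hanyun7 (1. Sichuan University; 2. National University of Defense Technology; 3. Dalian University of Technology; 4. Beijing Institute of Technology; 5. Peking University; 6. Northwestern Polytechnical University; 7. Information Engineering University)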

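Abstract
The core idea of 3D visual reasoning is to understand the relationships between visual entities in point cloud scenes. However, non-expert users find it difficult to convey their intentions to a computer, which limits the popularization and adoption of this technology. To address this, researchers use natural language as the semantic context and query condition that reflects user intent, and let it interact with point cloud information to complete the corresponding tasks. This paradigm, known as 3D visual-language reasoning, has been widely applied in recent years in fields such as autonomous driving, robot navigation, and human-computer interaction, and has become a prominent research direction in computer vision. Over the past three years, 3D visual-language reasoning has developed rapidly and diversified, yet a comprehensive summary of the latest progress is still lacking. This article focuses on the two most representative lines of work, anchor box prediction and content generation, and systematically summarizes the latest advances in the field. First, it summarizes the problem definition and existing challenges of 3D visual-language reasoning and outlines some common backbone networks. Second, it further subdivides the two classes of techniques according to the downstream scenarios that the methods target, and discusses the strengths and weaknesses of each method in depth. Next, it compares and analyzes the performance of the methods on different benchmark datasets. Finally, it looks ahead to the future development of 3D visual-language reasoning, with the aim of promoting deeper research and wider application in this field.
Keywords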
A comprehensive survey on 3D visual-language understanding techniques

(National University of Defense Technology)

Abstract
The core of 3D visual reasoning is to understand the relationships between visual entities in point cloud scenes. Traditional 3D visual reasoning typically requires users to have professional expertise, and non-professional users have difficulty conveying their intentions to computers, which hinders the popularization and advancement of this technology. Users now expect a more convenient way to express their intentions, exchange information with the computer, and obtain personalized results. To address this issue, researchers use natural language as a semantic background or query criterion that reflects user intentions, and accomplish various tasks by letting the language interact with 3D point clouds. Through multi-modal interaction, often built on techniques such as the Transformer or graph neural networks, current approaches can not only locate the entities mentioned by users (e.g., visual grounding and open-vocabulary recognition) but also generate user-required content (e.g., dense captioning, visual question answering, and scene generation). Specifically, 3D visual grounding aims to locate the desired objects or regions in a 3D point cloud scene based on an object-related linguistic query. Open-vocabulary 3D recognition aims to identify and localize 3D objects of novel classes defined by an unbounded (open) vocabulary at inference time, generalizing beyond the limited number of base classes labeled during training. 3D dense captioning aims to identify all possible instances within a 3D point cloud scene and generate a natural language description for each instance. 3D visual question answering aims to comprehend an entire 3D scene and provide an appropriate answer to a question about it. Text-guided scene generation aims to synthesize a realistic 3D scene, composed of a complex background and multiple objects, from natural language descriptions.
The aforementioned paradigm, known as 3D visual-language understanding, has gained significant traction in recent years in fields such as autonomous driving, robot navigation, and human-computer interaction. Consequently, it has become a prominent research direction within the computer vision domain. Over the past three years, 3D visual-language understanding has developed rapidly and diversified. Nonetheless, a comprehensive summary of the latest research progress is still lacking. It is therefore necessary to systematically summarize recent studies, comprehensively evaluate the performance of different approaches, and point out future research directions. This survey aims to fill that gap. To this end, the article focuses on the two most representative lines of work in 3D visual-language understanding, anchor box prediction and content generation, and systematically summarizes their latest advances. First, the article provides an overview of the problem definition and existing challenges in 3D visual-language understanding, and outlines some common backbones used in this area. The challenges in 3D visual-language understanding include 3D-language alignment and complex scene understanding, while common backbones involve a priori rules, multilayer perceptrons, graph neural networks, and Transformer architectures. Subsequently, the article turns to downstream scenarios, covering the two types of 3D visual-language understanding techniques, namely bounding box prediction and content generation, and thoroughly explores the advantages and disadvantages of each method. Furthermore, the article compares and analyzes the performance of various methods on different benchmark datasets.
Finally, the article looks ahead to the future prospects of 3D visual-language understanding, with the aim of promoting deeper research and wider application in this field. The major contributions of this paper can be summarized as follows. (1) Systematic survey of 3D visual-language understanding. To the best of our knowledge, this is the first survey to thoroughly discuss the recent advances in 3D visual-language understanding. To give readers a clear view of the field, we categorize algorithms into a taxonomy based on their downstream scenarios. (2) Comprehensive performance evaluation and analysis. We compare the existing 3D visual-language understanding approaches on several publicly available datasets. Our in-depth analysis can help researchers select a baseline suitable for their specific applications, and also offers insights into how existing methods might be modified. (3) Insightful discussion of future prospects. Based on the systematic survey and comprehensive performance comparison, some promising future research directions are discussed, including large-scale 3D foundation models, the computational efficiency of 3D modeling, and the incorporation of additional modalities. The paper is organized as follows. Section 1 summarizes the problem definition and primary challenges in 3D visual-language understanding. Sections 3 and 4 provide in-depth explorations of typical approaches used for different downstream scenarios in bounding box prediction and content generation, respectively. Section 5 introduces the benchmark datasets and evaluation metrics, together with a comparative analysis of different techniques. Finally, Section 6 concludes the paper and discusses promising avenues for future research.
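To make the 3D visual grounding task described in the abstract more concrete, the following toy sketch selects, from a set of candidate 3D boxes, the one whose feature vector best matches a language query. It is only an illustration under strong assumptions and not a method from the surveyed literature: the names `cosine`, `ground`, `proposals`, and `query_vec` are hypothetical, the feature vectors are hand-written placeholders rather than outputs of a real point cloud or text encoder, and cosine similarity stands in for the learned multi-modal interaction (e.g., a Transformer or graph neural network) mentioned above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground(query_vec, proposals):
    """Return the candidate proposal whose feature best matches the query."""
    return max(proposals, key=lambda p: cosine(query_vec, p["feature"]))


# Hypothetical object proposals: an axis-aligned box
# (xmin, ymin, zmin, xmax, ymax, zmax) plus a placeholder feature vector.
proposals = [
    {"label": "chair", "box": (0.0, 0.0, 0.0, 0.5, 0.5, 1.0), "feature": [0.9, 0.1, 0.0]},
    {"label": "table", "box": (1.0, 0.0, 0.0, 2.0, 1.0, 0.8), "feature": [0.1, 0.9, 0.1]},
    {"label": "lamp", "box": (2.5, 0.0, 0.0, 2.7, 0.2, 1.5), "feature": [0.0, 0.2, 0.9]},
]

# Placeholder embedding of a query such as "the table next to the chair".
query_vec = [0.2, 0.8, 0.0]

best = ground(query_vec, proposals)
print(best["label"], best["box"])
```

In a real system, the proposals would typically come from a 3D detector or segmentation backbone and the query embedding from a language encoder; the argmax over a matching score is the only part this sketch is meant to convey.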
Keywords
