Lei Yinjie1, Xu Kai2, Guo Yulan2, Yang Xin3, Wu Yuwei4, Hu Wei5, Yang Jiaqi6, Wang Hanyun7 (1. Sichuan University; 2. National University of Defense Technology; 3. Dalian University of Technology; 4. Beijing Institute of Technology; 5. Peking University; 6. Northwestern Polytechnical University; 7. Information Engineering University)

Abstract
A comprehensive survey on 3D visual-language understanding techniques

The core of 3D visual reasoning is to understand the relationships between different visual entities in point cloud scenes. Traditional 3D visual reasoning typically requires users to possess professional expertise. Non-professional users, however, have difficulty conveying their intentions to computers, which hinders the popularization and advancement of this technology. Users therefore expect a more convenient way to convey their intentions to the computer, exchange information, and obtain personalized results. To address this issue, researchers utilize natural language as a semantic background or query criterion to reflect user intentions, and further accomplish various tasks by interacting such natural language with 3D point clouds. Through multi-modal interaction, often built upon Transformers or graph neural networks, current approaches can not only locate the entities mentioned by users (e.g., visual grounding and open-vocabulary recognition) but also generate user-required content (e.g., dense captioning, visual question answering, and scene generation). Specifically, 3D visual grounding aims to locate desired objects or regions in a 3D point cloud scene based on an object-related linguistic query. Open-vocabulary 3D recognition aims to identify and localize 3D objects of novel classes defined by an unbounded (open) vocabulary at inference, generalizing beyond the limited number of base classes labeled during the training phase. 3D dense captioning aims to identify all possible instances within a 3D point cloud scene and generate a corresponding natural language description for each instance. The goal of 3D visual question answering is to comprehend an entire 3D scene and provide an appropriate answer. Text-guided scene generation aims to synthesize a realistic 3D scene composed of a complex background and multiple objects from natural language descriptions.
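The multi-modal interaction mentioned above is often realized by letting language tokens attend over point-cloud features. The following is a minimal, illustrative sketch in plain Python of single-head cross-attention, not any specific method from the surveyed literature; real systems use learned query/key/value projections over features from backbones such as PointNet++ and a pretrained language encoder, and the toy vectors here are purely hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(text_queries, point_features):
    """Each text-token query attends over all point features and
    returns one fused vector per token (single head, no learned
    projections -- an illustrative simplification)."""
    d = len(point_features[0])
    fused = []
    for q in text_queries:
        # Scaled dot-product attention weights over the points.
        weights = softmax([dot(q, p) / math.sqrt(d) for p in point_features])
        # Weighted sum of point features -> language-conditioned feature.
        fused.append([sum(w * p[i] for w, p in zip(weights, point_features))
                      for i in range(d)])
    return fused
```

In a grounding pipeline, such fused features would be scored against candidate object proposals to pick the one matching the query.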
The aforementioned paradigm, known as 3D visual-language understanding, has gained significant traction in recent years in fields such as autonomous driving, robot navigation, and human-computer interaction. Consequently, it has become a highly anticipated research direction within the computer vision domain. Over the past three years, 3D visual-language understanding technology has developed rapidly, showing a blossoming trend. Nonetheless, there remains a lack of comprehensive summaries of the latest research progress. It is therefore necessary to systematically summarize recent studies, comprehensively evaluate the performance of different approaches, and prospectively point out future research directions, which motivates this survey. To this end, this article focuses on the two most representative lines of research in 3D visual-language understanding, bounding box prediction and content generation, and systematically summarizes their latest advancements. Initially, the article provides an overview of the problem definition and existing challenges in 3D visual-language understanding, and outlines common backbones used in this area. The challenges include 3D-language alignment and complex scene understanding, while common backbones involve a priori rules, multilayer perceptrons, graph neural networks, and Transformer architectures. Subsequently, the article delves into downstream scenarios, emphasizing the two types of 3D visual-language understanding techniques, bounding box prediction and content generation, and thoroughly explores the advantages and disadvantages of each method. Furthermore, the article compares and analyzes the performance of various methods on different benchmark datasets.
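One of the challenges named above, 3D-language alignment, is commonly addressed with CLIP-style contrastive training that pulls matching scene-caption embedding pairs together. The sketch below is a minimal plain-Python illustration of the symmetric InfoNCE objective under that assumption; the similarity matrix `sim` would in practice come from dot products between learned 3D and text encoders, which are not part of this snippet.

```python
import math

def info_nce(sim, tau=0.07):
    """Symmetric contrastive loss over sim[i][j], the similarity
    between the i-th 3D scene embedding and the j-th caption
    embedding; matching pairs lie on the diagonal."""
    n = len(sim)

    def ce_rows(m):
        # Cross-entropy per row with the diagonal as the true pair.
        loss = 0.0
        for i in range(n):
            logits = [m[i][j] / tau for j in range(n)]
            mx = max(logits)
            lse = mx + math.log(sum(math.exp(l - mx) for l in logits))
            loss += lse - logits[i]  # -log softmax at the true pair
        return loss / n

    # Average the scene-to-text and text-to-scene directions.
    sim_t = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (ce_rows(sim) + ce_rows(sim_t))
```

A similarity matrix with a strong diagonal (good alignment) yields a lower loss than a uniform one, which is what gradient descent on this objective enforces.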
Ultimately, the article concludes by looking ahead to the future prospects of 3D visual-language understanding technology, which can promote profound research and widespread application in this field. The major contributions of this paper can be summarized as follows. (1) Systematic survey of 3D visual-language understanding. To the best of our knowledge, this is the first survey to thoroughly discuss the recent advances in 3D visual-language understanding. To provide readers with a clear comprehension of our article, we categorize algorithms into different taxonomies from the perspective of downstream scenarios. (2) Comprehensive performance evaluation and analysis. We compare the existing 3D visual-language understanding approaches on several publicly available datasets. Our in-depth analysis can help researchers select a baseline suitable for their specific applications while also offering valuable insights for the modification of existing methods. (3) Insightful discussion of future prospects. Based on the systematic survey and comprehensive performance comparison, some promising future research directions are discussed, including large-scale 3D foundation models, the computational efficiency of 3D modeling, and the incorporation of additional modalities. The structure of this paper is organized as follows. Section 1 summarizes the problem definition and primary challenges in 3D visual-language understanding. Sections 3 and 4 provide in-depth explorations of typical approaches used for different downstream scenarios in bounding box prediction and content generation, respectively. Section 5 introduces the benchmark datasets and evaluation metrics, as well as a comparative analysis of different techniques. Finally, Section 6 concludes this paper and discusses promising avenues for future research.