A Survey of 3D Object Detection Algorithms for Autonomous Driving + Parallel Decision Intelligence for Unmanned Systems

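Li Changcai1, Chen Gang1, Hou Zuoxun2, Huang Kai1, Zhang Wei3 (1. School of Computer Science and Engineering, Sun Yat-sen University; 2. Beijing Institute of Space Mechanics & Electricity; 3. Department of Networked Intelligence, Pengcheng Laboratory)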

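Abstract
Three-dimensional (3D) object detection, a technology that has emerged in recent years, plays a key role in autonomous driving: by supplying environment perception and obstacle detection information, it provides the basis for the decision-making and control of autonomous driving systems. Many scholars have thoroughly examined and studied the outstanding methods and results in this field. However, because the technology is updated continuously and advances rapidly, tracking the latest progress and staying at the frontier of knowledge is both a vital task for the academic community and a foundation for meeting emerging challenges. With this in mind, this paper reviews the results that emerged within the past year and systematically describes the frontier theories in this direction. First, we briefly introduce the background of 3D object detection and review related surveys. Then, we summarize 17 popular datasets, including KITTI, in terms of data scale, diversity, and other aspects, and introduce the evaluation principles of the associated benchmarks. Next, according to sensor type and number, we divide dozens of recent detection methods into five categories: monocular-based, stereo-based, multi-view-based, LiDAR-based, and multi-modal-based, and we further subdivide each category by model architecture or data preprocessing method. For each category, we first briefly review its representative algorithms, then focus on surveying its most advanced methods, and analyze in depth the category's potential development prospects and the severe challenges it currently faces. Finally, we look ahead to future research directions in 3D object detection.
Keywords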
A Survey of 3D Object Detection Algorithms for Autonomous Driving

Li Changcai1, Chen Gang1, Hou Zuoxun2, Huang Kai1, Zhang Wei3 (1. School of Computer Science and Engineering, Sun Yat-sen University; 2. Beijing Institute of Space Mechanics & Electricity; 3. Department of Networked Intelligence, Pengcheng Laboratory)

Abstract
Background: Conventional two-dimensional (2D) object detection classifies each target and defines its bounding box in image coordinates, but it cannot provide accurate information about the target's position in real three-dimensional (3D) space. This limitation restricts its applicability in autonomous driving systems (ADs), particularly for tasks such as obstacle avoidance and path planning in real 3D environments. The emerging field of 3D object detection represents a substantial technological advance. It relies primarily on neural networks to extract features from input data, commonly camera images or LiDAR point clouds. From these features, it predicts the target's category together with its spatial coordinates, dimensions, and yaw angle in a real-world coordinate system. This output supplies essential preliminary information for subsequent operations such as object tracking, trajectory forecasting, and path planning. Consequently, the technology has become a cornerstone of perception in ADs.
Objective: Numerous strong methods for 3D object detection have emerged and achieved significant results, and several scholars have reviewed these works and their outcomes in depth. However, because computer vision technology evolves swiftly, prior reviews may have omitted the latest developments. Continuously tracking the most recent advances and staying at the forefront of the field is therefore both an important task for the academic community and a basis for responding to the challenges that rapid technological progress raises.
Based on these considerations, this paper systematically reviews the latest developments and cutting-edge theories in 3D object detection. Compared with earlier reviews, our work covers more recent methods and a broader range of topics. For example, while most prior reviews concentrated on individual sensors, our work covers multiple sensor types. It also covers a wide range of training strategies, from semi-supervised and weakly supervised methods to active learning and knowledge distillation.
Content: Specifically, we begin with a concise account of the field's progress and briefly examine related review research. We then give the fundamental definition of 3D object detection, summarize 17 widely used datasets in terms of data scale, diversity, and other aspects, and introduce the evaluation criteria of the relevant benchmarks. Among these datasets, we highlight three widely recognized ones: KITTI, nuScenes, and Waymo Open. Next, we categorize the detection methods proposed in the previous year into five groups, primarily according to the type and number of sensors involved: monocular-based, stereo-based, multi-view-based, LiDAR-based, and multi-modal-based. We further subcategorize each group according to the data preprocessing methods or model architectures used. Within each category, we begin by reviewing the pioneering representative algorithms.
We then describe in detail the latest and most advanced methods in that category, and analyze in depth its prospective development paths and the formidable challenges it currently faces. Among the five categories, the monocular method relies solely on a single camera to classify and localize objects in the environment. This approach is cost-effective and straightforward to implement. However, regressing depth from a monocular image is an ill-posed problem, which frequently reduces accuracy. The stereo-based method leverages stereo images to enforce geometric constraints, leading to more precise depth estimation and comparatively higher detection accuracy. Nonetheless, the need for stereo camera calibration drives up deployment costs, and the method remains susceptible to environmental factors. The multi-view-based method seeks to establish a unified feature space from multiple surround-view cameras. Compared with the first two approaches, its panoramic perspective improves safety and practicality. However, the absence of direct constraints between cameras leaves the problem ill-posed. LiDAR-based methods provide accurate depth information directly, which eliminates the need for additional depth estimation and yields higher detection efficiency and accuracy than image-based methods. Despite these advantages, the substantial hardware cost of LiDAR places a significant financial burden on real-world deployments. Multi-modal-based approaches leverage the advantages of both image and point cloud data, at the cost of the additional computation time required to process both modalities.
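The contrast between monocular and stereo depth described above can be illustrated with a minimal sketch under an ideal pinhole camera model. The function names, the intrinsics (`F`, `CX`, `CY`), and the baseline `B` are hypothetical values chosen for illustration and are not taken from the paper or from any dataset. The sketch shows that two points on the same viewing ray project to the same pixel in a single camera, which is why monocular depth is ambiguous, whereas a rectified stereo pair can recover depth from disparity as Z = f * B / d.

```python
def project(point, focal_px, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    X, Y, Z = point
    return (focal_px * X / Z + cx, focal_px * Y / Z + cy)


def depth_from_disparity(u_left, u_right, focal_px, baseline_m):
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity


# Illustrative intrinsics (pixels) and baseline (metres); assumed values only.
F, CX, CY, B = 700.0, 640.0, 360.0, 0.5

near = (1.0, 0.5, 10.0)
far = (3.0, 1.5, 30.0)  # same viewing ray as `near`, three times farther away

# Monocular: both points land on the same pixel, so depth is ambiguous.
print(project(near, F, CX, CY), project(far, F, CX, CY))

# Stereo: the right camera sits B metres to the right, so X shifts by -B there.
for X, Y, Z in (near, far):
    u_l, _ = project((X, Y, Z), F, CX, CY)
    u_r, _ = project((X - B, Y, Z), F, CX, CY)
    print(depth_from_disparity(u_l, u_r, F, B))
```

In practice the disparity comes from a learned or classical matching step rather than from known 3D points, and calibration errors feed directly into the recovered depth, which is consistent with the calibration cost noted above.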
More broadly, each of the five categories has its own strengths and limitations, so real-world engineering deployment requires a judicious selection based on budget and application requirements. After reviewing all methods, we compile statistics on their detection performance and inference time on datasets such as KITTI, nuScenes, and Waymo Open.
Conclusion: This paper reviews 3D object detection algorithms for autonomous driving, covering detection algorithms based on various mainstream sensors and the latest advances in the field. It also statistically analyzes and compares the performance and latency of the reviewed methods on widely recognized datasets. Finally, we summarize the current research status and offer prospects for future research directions.
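To make the detection output described in the Background concrete (category, spatial coordinates, dimensions, and yaw angle), the following is a minimal sketch of a seven-parameter 3D box and its eight corners. The class name `Box3D`, the field names, and the z-up axis convention are assumptions made for illustration; datasets differ in their axis conventions, and the paper does not prescribe this layout.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical 7-parameter box: center, size, and yaw about the up axis."""
    label: str
    x: float       # center coordinates, metres
    y: float
    z: float
    length: float  # extent along the heading direction
    width: float
    height: float
    yaw: float     # rotation about the z (up) axis, radians; assumed convention

    def corners(self):
        """Return the 8 corner points as (x, y, z) tuples."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        pts = []
        for dx in (-0.5, 0.5):
            for dy in (-0.5, 0.5):
                for dz in (-0.5, 0.5):
                    lx, ly, lz = dx * self.length, dy * self.width, dz * self.height
                    # Rotate the local offset by yaw, then translate to the center.
                    pts.append((self.x + c * lx - s * ly,
                                self.y + s * lx + c * ly,
                                self.z + lz))
        return pts


box = Box3D("car", x=10.0, y=2.0, z=0.8, length=4.0, width=2.0, height=1.6,
            yaw=math.pi / 2)
for corner in box.corners():
    print(tuple(round(v, 2) for v in corner))
```

With a yaw of 90 degrees the box's length runs along the y axis, so the corners span 4 m in y and 2 m in x. Downstream tasks such as tracking and path planning typically consume boxes in a form like this.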
Keywords
