A survey of image-based 3D object detection for autonomous driving: benchmarks, constraints, and error analysis

Li Xiying; Ye Zhihui; Wei Shikui; Chen Ze; Chen Xiaotong; Tian Yonghong; Dang Jianwu; Fu Shujun; Zhao Yao

Published: 2023-06-20
Abstract views: 1370
Full-text downloads: 1157
DOI: 10.11834/jig.230036
2023 | Volume 28 | Number 6

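Li Xiying^1,2,3, Ye Zhihui^1,2,3, Wei Shikui^4, Chen Ze^1,2,3, Chen Xiaotong^4, Tian Yonghong^5, Dang Jianwu^6, Fu Shujun^7, Zhao Yao^4 (1. School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 518107; 2. Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107; 3. Guangdong Province Key Laboratory of Intelligent Transportation System (ITS), Shenzhen 518107; 4. Institute of Information Science, Beijing Jiaotong University, Beijing 100044; 5. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; 6. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070; 7. School of Mathematics, Shandong University, Jinan 250100)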

Abstract

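Obtaining the precise 3D position and size of surrounding objects from high-resolution images is the basis for control and behavioral decision-making in autonomous driving, so image-based 3D object detection is a research hotspot in the field. Existing surveys review the methodology and results of this area in considerable detail, but they do not analyze in depth or systematically the constraints that keep the detection accuracy of current methods unsatisfactory. Considering that autonomous driving places high demands on engineering applications and that current methods are mainly data-driven, this paper gives a fairly systematic account of academic and industrial research results and industry applications in 3D object detection, from the perspectives of common datasets and evaluation benchmarks, the impact of data, and methodological constraints and errors. First, an overview is given of academic progress and of applications in the autonomous driving industry. Then, four general-purpose datasets, including KITTI, are analyzed and summarized in detail in terms of data acquisition equipment, data accuracy, and annotation information, and the main evaluation metrics proposed with these datasets are compared. Next, the main factors that constrain algorithm performance, and the errors they cause, are analyzed from the data and methodology sides. On the data side, the main constraints are data accuracy, sample differences, the amount of annotated data, and annotation standards; on the methodology side, they mainly include prior geometric relationships, depth prediction error, and data modality. Finally, the state of research in China and abroad is summarized, and future research directions that deserve attention are proposed for datasets, evaluation metrics, and object depth prediction.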

Keywords

3D object detection; benchmark; constraint; error analysis; autonomous driving; image processing; computer vision

3D object detection for autonomous driving from image: a survey——benchmarks, constraints and error analysis

Li Xiying^1,2,3, Ye Zhihui^1,2,3, Wei Shikui^4, Chen Ze^1,2,3, Chen Xiaotong^4, Tian Yonghong^5, Dang Jianwu^6, Fu Shujun^7, Zhao Yao^4 (1. School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 518107, China; 2. Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China; 3. Guangdong Province Key Laboratory of Intelligent Transportation System (ITS), Shenzhen 518107, China; 4. Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China; 5. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; 6. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China; 7. School of Mathematics, Shandong University, Jinan 250100, China)

Abstract

Accurate perception and measurement of the three-dimensional (3D) position and size of surrounding objects is the basis for control and decision-making in autonomous driving. Autonomous vehicles are equipped with high-resolution cameras, light detection and ranging (LiDAR), radar, global positioning system (GPS)/inertial measurement unit (IMU) and other sensors. 3D object detection algorithms based on LiDAR or multi-modal data are hard to deploy because of the shortcomings of LiDAR sensors, such as high price, limited sensing range, and sparse point cloud data. In contrast, high-resolution cameras are widely used and cheaper, and they capture high-resolution spatial information with richer shape and appearance details. Image-based 3D object detection has therefore attracted growing attention. However, the factors that constrain the detection accuracy of existing methods have not yet been analyzed thoroughly and systematically. We summarize research results and industrial applications from three perspectives: 1) commonly used datasets and evaluation criteria, 2) the impact of data, and 3) methodological constraints and prediction errors.

First, we give a brief introduction from the perspectives of academic research and the autonomous driving industry. We briefly review recent progress at Baidu Apollo, Google Waymo, Tesla and other autonomous driving companies, and the development of 3D object detection methods for autonomous driving. Second, we analyze and summarize four popular datasets (KITTI, nuScenes, Waymo Open Dataset, and DAIR-V2X) in terms of data acquisition sensors, data accuracy, and label information. We then compare the key evaluation metrics proposed with these datasets, along with their pros, cons, and applicability.

Third, we analyze the main constraints on image-based 3D object detection algorithms, and the errors they cause, from two sides: data and methodology. The main data constraints are data accuracy, sample difference, data volume, and data annotation. Data accuracy is mainly limited by equipment performance. Sample difference arises mainly from differences in object distance and viewing angle, occlusion, and truncation. Data volume is limited by the variety of 3D data types and the high difficulty of labeling, so 3D object detection datasets are much smaller than 2D object detection datasets. Data annotation concerns 3D bounding box labeling, labeling details, and dataset quality, especially for the image annotations used in image-based 3D object detection. For non-rigid objects such as pedestrians, the annotation error is larger, and there is room to improve the labeling method.

The general framework of image-based 3D object detection can be classified into one-stage and two-stage methods, and its limitations consist of 1) prior geometric relationships, 2) depth prediction accuracy, and 3) data modality. Prior geometric relationships cover the 2D-3D geometric constraints between 3D objects and their projections onto 2D images, and the positional relationships between objects. Here, image-based methods face two problems: reliance on prior 2D-3D geometric constraints, and occluded or truncated objects. Predicting depth from 2D images is an ill-posed problem, because projecting the 3D scene onto the image plane collapses one dimension and depth information is lost, which leads to depth prediction errors. On the one hand, depth prediction is often inaccurate because of the projection relationship. On the other hand, continuous depth prediction often performs poorly where depth changes abruptly in the image (such as at object edges). When the predicted depth is discretized, the depth bins are relatively coarse, and the bin resolution cannot be refined arbitrarily. The limitation of the single-image data modality is mainly reflected in large depth prediction errors. Detection performance can be improved by 1) simulating stereo signals and LiDAR point clouds, 2) using stereo images as auxiliary input, or 3) using point cloud data with accurate 3D information as the supervision signal. In addition, video data can improve detection accuracy to a certain extent.

Fourth, we summarize and compare the current state of research in academia and industry. Finally, we suggest future research directions in terms of datasets, evaluation metrics, and depth prediction.
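
The two depth-related constraints named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper. The pinhole intrinsics (`focal`, `cx`, `cy`), the depth range, and the number of bins are hypothetical values chosen only for illustration, and uniform binning is just one possible discretization scheme.

```python
def project(point, focal=700.0, cx=600.0, cy=180.0):
    """Pinhole projection of a camera-frame point (x, y, z) to a pixel (u, v).

    The intrinsics are hypothetical, not taken from any dataset.
    """
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def depth_to_bin(depth, d_min=2.0, d_max=62.0, num_bins=30):
    """Uniform discretization: index of the bin that contains `depth`."""
    width = (d_max - d_min) / num_bins
    idx = int((depth - d_min) / width)
    return min(max(idx, 0), num_bins - 1)  # clamp to the valid bin range


def bin_to_depth(idx, d_min=2.0, d_max=62.0, num_bins=30):
    """Depth recovered from a bin index, taken as the bin centre."""
    width = (d_max - d_min) / num_bins
    return d_min + (idx + 0.5) * width


# 1) Ill-posed depth: two points on the same viewing ray, at 10 m and 30 m,
#    project to the same pixel, so the pixel alone cannot reveal the depth.
near = (1.0, 0.5, 10.0)
far = tuple(3.0 * c for c in near)
pixel_near = project(near)
pixel_far = project(far)

# 2) Coarse bins: with 2 m wide bins, the recovered depth can be off by up
#    to half a bin (1 m) even when the bin itself is classified correctly.
true_depth = 23.9
recovered = bin_to_depth(depth_to_bin(true_depth))
quantization_error = abs(recovered - true_depth)

print(pixel_near, pixel_far)
print(recovered, quantization_error)
```

Under these assumed values both points land on the same pixel, and the discretized depth differs from the true depth by less than half a bin width. Narrower bins reduce this error but leave each bin with fewer training samples, which is one way to read the statement that the bin resolution cannot be refined arbitrarily.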

Keywords

3D object detection; benchmark; constraint; error analysis; autonomous driving; image processing; computer vision
