3D Object Detection for Autonomous Driving from Images: A Survey of Benchmarks, Constraints, and Error Analysis
Wei Shikui, Li Xiying, Ye Zhihui, Chen Ze, Chen Xiaotong, Tian Yonghong, Dang Jianwu, Fu Shujun, Zhao Yao
The development of autonomous driving will greatly change how people travel and live. Accurately perceiving and measuring the three-dimensional (3D) position and scale of surrounding objects is the basis for autonomous driving control and behavior decision-making, which makes 3D object detection indispensable. With advances in sensing technology, autonomous vehicles are generally equipped with high-precision cameras, lidar, radar, GPS/IMU positioning, and other sensors. 3D object detection algorithms based on lidar or multi-modal data currently achieve excellent performance, but the drawbacks of high-precision lidar, such as high price, limited sensing range, and sparse point cloud data, limit its wide deployment. In contrast, high-precision cameras are much cheaper, capture high-resolution spatial information with abundant shape and appearance detail, and are easier to deploy widely. Image-based 3D object detection has therefore become a research hot spot in autonomous driving. Some scholars have already reviewed the methodology and achievements of this field in relative detail, but the constraining factors behind the unsatisfactory detection accuracy of existing methods have not been thoroughly and systematically analyzed. Considering the high performance required of engineering applications in autonomous driving, and the fact that existing 3D object detection methods are mainly data-driven, this paper systematically summarizes research results and industrial applications from the perspectives of commonly used datasets and evaluation criteria, data impact, methodological constraints, and prediction errors. First, a brief introduction is given from the perspectives of academic research and industrial application in autonomous driving.
The latest achievements of industry-leading autonomous driving companies such as Baidu Apollo, Google Waymo, and Tesla are summarized, and 3D object detection methods for autonomous driving are classified. Then, four widely used datasets (KITTI, nuScenes, the Waymo Open Dataset, and DAIR-V2X) are analyzed in detail in terms of data acquisition sensors, data accuracy, and label information; the main evaluation standards proposed with these datasets are compared, and their advantages, disadvantages, and applicability are summarized. Next, the main factors that restrict the performance of image-based 3D object detection algorithms, and the errors that result from them, are analyzed from two sides: data and methodology. In terms of data, the main constraints are data accuracy, sample variation, data volume, and data annotation. Data accuracy is mainly limited by equipment performance. Sample variation arises mainly during imaging, from differences in object distance and imaging angle and from occlusion and truncation. Data volume is limited by the variety of 3D data types and the difficulty of labeling, so 3D object detection datasets are much smaller than 2D object detection datasets. The annotations used in image-based 3D object detection are mainly 3D bounding boxes, and the labeling detail and quality of a dataset directly affect algorithm performance. Annotation error is larger for non-rigid objects such as pedestrians, and suggestions for improving the labeling method are discussed.
In terms of methodology, the general frameworks for image-based 3D object detection can be classified into one-stage and two-stage methods, and their limitations mainly concern prior geometric relationships, depth prediction accuracy, and the data modality used by the algorithm. Prior geometric relationships include the 2D-3D geometric constraints that arise when 3D objects are projected onto 2D images, as well as the positional relationships between objects. Image-based 3D object detection methods must decide how to exploit these prior 2D-3D geometric constraints and how to handle occluded and truncated objects. Predicting depth from 2D images is an ill-posed problem: projection collapses one spatial dimension, so depth information is lost in the image, which leads to depth prediction errors. On the one hand, depth predictions are often inaccurate because of the projection relationship. On the other hand, continuous depth prediction often performs poorly at depth discontinuities in the image (such as object edges). When the predicted depth is discretized, the depth classes are relatively coarse, and the bins cannot be made arbitrarily fine. Some scholars have proposed other ideas for addressing the depth prediction problem. The limitation of data modality is mainly reflected in the large depth prediction error obtained from a single image. Detection performance can be improved by simulating stereo signals or lidar point clouds, by directly using stereo images as auxiliary input, or by using point cloud data with accurate 3D information as a supervision signal. Using video data can also improve detection accuracy to a certain extent. Section 3 summarizes and compares the state of research in China and abroad from academic and industrial perspectives.
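The two depth-related limitations described above can be illustrated with a minimal sketch. It is not taken from any surveyed method: the pinhole intrinsics (`focal`, `cx`, `cy`), the 0 to 80 m depth range, and the helper names `project`, `uniform_bin_centers`, and `quantize` are all assumptions chosen for illustration. The first part shows that two 3D points on the same viewing ray map to the same pixel, which is why monocular depth is ill-posed; the second shows the rounding error introduced when depth is treated as a classification over uniform bins.

```python
import numpy as np

def project(point_xyz, focal=1000.0, cx=640.0, cy=360.0):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates.

    The intrinsics are illustrative assumptions, not values from any dataset.
    """
    x, y, z = point_xyz
    return np.array([focal * x / z + cx, focal * y / z + cy])

def uniform_bin_centers(d_min, d_max, n_bins):
    """Centers of n_bins equal-width depth bins covering [d_min, d_max]."""
    edges = np.linspace(d_min, d_max, n_bins + 1)
    return 0.5 * (edges[:-1] + edges[1:])

def quantize(depth, centers):
    """Snap a continuous depth to the nearest bin center."""
    return centers[np.argmin(np.abs(centers - depth))]

# 1) Depth ambiguity: scaling a point along its viewing ray does not
#    change where it lands in the image.
near = np.array([1.0, 0.5, 10.0])   # 10 m from the camera
far = 3.0 * near                    # 30 m away, on the same ray
same_pixel = np.allclose(project(near), project(far))

# 2) Discretization error: with 40 bins over 0-80 m each bin is 2 m wide,
#    so the recovered depth can be off by up to half a bin (1 m).
centers = uniform_bin_centers(0.0, 80.0, 40)
true_depth = 23.7
err = abs(quantize(true_depth, centers) - true_depth)

print("same pixel:", same_pixel)
print("quantization error (m):", round(err, 3))
```

In this sketch, halving the bin width halves the worst-case error but doubles the number of classes the network must separate, which is one way to read the statement that the depth classes cannot be made arbitrarily fine.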
Finally, the paper proposes research directions that deserve attention in the future, covering datasets, evaluation metrics, depth prediction, and related topics.