Zhou Hao¹, Qi Honggang¹, Deng Yongqiang², Li Juanjuan², Liang Hao², Miao Jun³ (1. University of Chinese Academy of Sciences; 2. Beijing Wanji Technology Co., Ltd.; 3. Beijing Information Science & Technology University)
Objective Point-cloud-based 3D object detection is a key technology in autonomous driving. Because point clouds are unstructured, they are usually voxelized, and the 3D object detection task is then performed on the voxel features. In voxel-based 3D object detection algorithms, voxelizing the point cloud loses part of the data and structural information of the points, which degrades detection performance. To address this problem, this paper proposes a method that fuses point cloud depth information, effectively improving 3D detection accuracy. Method The point cloud is first converted into a depth image by spherical projection, and the depth image is then fused with the feature map extracted by the 3D detection algorithm to complement the lost information. Since the fused features are represented as a 2D pseudo-image, the backbone network of YOLOv7 is used to extract them. Finally, regression and classification networks are designed, and the extracted fusion features are fed into them to predict the position, size and category of objects. Result The method is evaluated on the KITTI and DAIR-V2X datasets. Using AP as the evaluation metric, on KITTI the improved algorithm PP-Depth outperforms PointPillars by 0.84%, 2.3% and 1.77% on the car, pedestrian and cyclist categories, respectively. Taking the easy difficulty of the cyclist category as an example, the improved algorithm PP-YOLO-Depth outperforms PointPillars, PP-YOLO and PP-Depth by 5.15%, 1.1% and 2.75%, respectively. On DAIR-V2X, PP-Depth outperforms PointPillars by 17.46%, 20.72% and 12.7% on the car, pedestrian and cyclist categories. Taking the easy difficulty of the car category as an example, PP-YOLO-Depth outperforms PointPillars, PP-YOLO and PP-Depth by 13.53%, 5.59% and 1.08%, respectively. Conclusion The experiments show that the proposed method performs well on both the KITTI and DAIR-V2X datasets: it reduces the information lost during voxelization, strengthens the network's ability to extract fusion features, improves multi-scale object detection, and yields more accurate detection results.
3D object detection and classification combined with point cloud depth information
(1. University of Chinese Academy of Sciences; 2. Beijing Wanji Technology Co., Ltd.; 3. Beijing Information Science & Technology University)
Objective Perception systems are integral components of modern autonomous driving systems; they estimate the state of the surrounding environment and provide reliable observations for prediction and planning. 3D object detection, which predicts the location, size and category of key 3D objects near the autonomous vehicle, is an important part of the perception system. Common data types in 3D object detection include images and point clouds. Unlike an image, a point cloud is a set of points in three-dimensional space: the position of each point is given by its coordinates in a 3D coordinate system, and each point usually carries additional attributes such as reflection intensity. In computer vision, point clouds are often used to represent the shape and structure of 3D objects, so point-cloud-based 3D object detection has access to true spatial information and often holds an advantage in detection accuracy and speed. However, because the point cloud is unstructured, it is usually converted into a 3D voxel grid; each voxel is treated as a 3D feature vector, a 3D convolutional network extracts voxel features, and the 3D object detection task is completed on those features. In voxel-based 3D object detection, voxelization loses part of the data and structural information of the point cloud, which degrades detection performance. To address this problem, we propose a method that incorporates point cloud depth information, using it as fusion information to complement what is lost during voxelization.
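To make the voxelization loss concrete, the following minimal sketch groups points into voxels and caps the number of points per voxel, as typical voxel-based pipelines do. The voxel size, point-cloud range, and per-voxel cap here are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

def voxelize(points, voxel_size, pc_range, max_points_per_voxel=32):
    """Bucket points (N, 4: x, y, z, intensity) into voxels.

    Points beyond max_points_per_voxel in a voxel are discarded,
    illustrating one source of the information loss described above.
    """
    mins = np.array(pc_range[:3], dtype=np.float64)
    maxs = np.array(pc_range[3:], dtype=np.float64)
    vs = np.array(voxel_size, dtype=np.float64)

    # Keep only points inside the detection range.
    mask = np.all((points[:, :3] >= mins) & (points[:, :3] < maxs), axis=1)
    pts = points[mask]

    # Integer voxel index for every surviving point.
    idx = np.floor((pts[:, :3] - mins) / vs).astype(np.int64)

    voxels, dropped = {}, 0
    for p, i in zip(pts, map(tuple, idx)):
        bucket = voxels.setdefault(i, [])
        if len(bucket) < max_points_per_voxel:
            bucket.append(p)
        else:
            dropped += 1  # truncated points: data the network never sees
    return voxels, dropped
```

A dense voxel (e.g. on a nearby car surface) silently sheds points once the cap is hit, which is exactly the structural information the proposed depth-image fusion aims to restore.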
At the same time, it uses the efficient YOLOv7 backbone (YOLOv7-Net) to extract the fused features, improving feature extraction and multi-scale detection performance and effectively raising the accuracy of 3D object detection. Method To reduce the information the point cloud loses during voxelization, the point cloud is first converted into a depth image through spherical projection. A depth image is a grayscale image generated from the point cloud: each pixel's gray value encodes the distance from the corresponding point to the origin of the coordinate system in three-dimensional space. The depth image therefore provides a rich feature representation of the point cloud, and its depth information can serve as fusion information to complement what is lost during voxelization. The depth image is then fused with the feature map extracted by the 3D object detection algorithm. Because the fused features take the form of a 2D pseudo-image, an efficient backbone network is needed to extract them. The backbone of YOLOv7 uses an adaptive convolution module that adjusts the convolution kernel size and receptive field according to object scale, improving the network's detection performance on multi-scale objects; its feature fusion module and feature pyramid pooling module further strengthen feature extraction and detection performance. We therefore choose YOLOv7-Net to extract the fused features.
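The spherical projection described above can be sketched as follows: each point's range, azimuth, and elevation determine a pixel in an (H, W) depth image whose value is the point's distance to the sensor origin. The image resolution and vertical field of view below are assumptions (values typical of a 64-beam LiDAR), not parameters from the paper:

```python
import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Project a point cloud (N, >=3: x, y, z, ...) onto an (H, W) depth image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)               # range to origin
    yaw = np.arctan2(y, x)                                  # azimuth, [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up_r - fov_down_r

    # Map azimuth to columns and elevation to rows.
    u = np.clip((0.5 * (1.0 - yaw / np.pi) * W).astype(np.int64), 0, W - 1)
    v = np.clip(((fov_up_r - pitch) / fov * H).astype(np.int64), 0, H - 1)

    depth = np.zeros((H, W), dtype=np.float32)
    order = np.argsort(-r)            # write far points first ...
    depth[v[order], u[order]] = r[order]  # ... so the nearest point wins per pixel
    return depth
```

The resulting grayscale image is dense and regular, which is what allows it to be fused with the pseudo-image feature map produced by the voxel branch.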
Finally, a classification and regression network is designed, and the extracted fusion features are fed into it to predict the category, position and size of each object. Result Our method is tested on the KITTI 3D object detection dataset and the DAIR-V2X object detection dataset. Using AP as the evaluation metric, on KITTI, PP-Depth improves over PointPillars by 0.84%, 2.3% and 1.77% on the car, pedestrian and cyclist categories, respectively. Taking the easy difficulty of the cyclist category as an example, PP-YOLO-Depth improves over PointPillars, PP-YOLO and PP-Depth by 5.15%, 1.1% and 2.75%, respectively. On DAIR-V2X, PP-Depth improves over PointPillars by 17.46%, 20.72% and 12.7% on the car, pedestrian and cyclist categories. Taking the easy difficulty of the car category as an example, PP-YOLO-Depth improves over PointPillars, PP-YOLO and PP-Depth by 13.53%, 5.59% and 1.08%, respectively. Conclusion The experimental results show that our method performs well on both the KITTI and DAIR-V2X datasets: it reduces the information lost during voxelization, improves the network's ability to extract fusion features, strengthens multi-scale object detection, and makes the detection results more accurate.
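The fusion step that feeds the backbone and the classification/regression head can be sketched as a channel-wise concatenation of the depth image, resized to the pseudo-image's spatial resolution, with the detector's feature map. The nearest-neighbour resize and plain concatenation are illustrative assumptions, not the paper's exact fusion operator:

```python
import numpy as np

def fuse_depth_features(feat, depth):
    """Fuse a (C, H, W) pseudo-image feature map with an (h, w) depth image.

    Returns a (C + 1, H, W) tensor with the depth map as one extra channel.
    """
    C, H, W = feat.shape
    h, w = depth.shape
    rows = np.arange(H) * h // H              # nearest-neighbour row indices
    cols = np.arange(W) * w // W              # nearest-neighbour column indices
    depth_resized = depth[rows][:, cols]      # (H, W) resize without interpolation
    return np.concatenate([feat, depth_resized[None]], axis=0)
```

In the full pipeline, a tensor of this fused shape would be passed to the backbone for feature extraction before the heads predict category, position, and size.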