Objective To address the drop in small-object detection accuracy caused by the sparsity of LiDAR point clouds, this paper proposes a pseudo-LiDAR point cloud augmentation technique that fuses images with point clouds to supplement the sparse geometric information of small objects and improve 3D object detection performance in road scenes. Method First, a depth estimation network predicts the depth map of a stereo image pair, and the LiDAR point cloud is used to correct the depth map, reducing depth estimation error. Second, semantic segmentation extracts the foreground region of the image, and only the depth pixels of the foreground region are mapped into 3D space to generate the pseudo-LiDAR point cloud, raising the proportion of foreground points. Finally, the pseudo-LiDAR point cloud is down-sampled to different numbers of lines according to observation distance and fused with the original LiDAR point cloud as the final input point cloud. Result Experiments on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset show that the method improves the small-object detection accuracy of several state-of-the-art network frameworks; taking the representative networks SECOND (sparsely embedded convolutional detection), MVX-Net (multimodal voxelnet for 3D object detection), and Voxel-RCNN as examples, 3D object detection accuracy under the hard difficulty level improves substantially, by 8.65%, 7.32%, and 6.29%, respectively. Conclusion The method is applicable to any object detection network that takes point clouds as input and significantly improves the small-object detection performance of multiple detection networks in road scenes, demonstrating both effectiveness and generality.
3D object detection in road scenes by pseudo-LiDAR point cloud augmentation
Objective Light detection and ranging (LiDAR) is one of the most commonly used sensors in autonomous driving: it has strong structure-sensing ability, and its point cloud provides accurate object distance information. However, LiDAR has a limited number of laser lines. As the distance from an object to the LiDAR increases, the point cloud returned from the object becomes sparse and the effective information drops sharply, reducing the detection accuracy of distant small objects. At the same time, because the road environment is complex and changeable, vehicles cannot rely on a single sensor, necessitating multi-source data fusion to improve their perception capabilities. This paper proposes a pseudo-LiDAR point cloud augmentation technology that fuses an image and a point cloud to improve 3D object detection performance in road scenes. Method First, a stereo image pair is used as the input of the depth estimation network to predict the depth image. The LiDAR point cloud is projected onto the image plane to obtain a sparse point cloud depth map, which is sent to the depth correction module together with the estimated depth map of the image. The depth correction module then builds a directed k-nearest-neighbor graph among the pseudo-LiDAR points, finds the part of the pseudo-LiDAR point cloud that is closest to the real LiDAR point cloud, uses the precise depths of the LiDAR points to correct the depths of that part of the pseudo-LiDAR cloud while retaining the shape and structure of the original pseudo-LiDAR cloud, and generates a corrected depth image. Second, a semantic segmentation network is applied to the image to obtain the vehicle foreground area. The semantic segmentation map and the corrected depth map are processed together by the foreground segmentation module, and only the depth pixels of the foreground area are mapped into 3D space to generate the pseudo-LiDAR point cloud, so that only vehicle foreground points are retained.
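The depth-correction idea — snapping the pseudo-LiDAR points nearest to real LiDAR returns onto the accurate LiDAR depth and propagating the same offset to nearby points so local shape is preserved — might be sketched as follows. This is a simplified brute-force version: the function name, the use of z as the depth axis, and the offset-propagation rule are assumptions for illustration; the paper's module builds a directed k-nearest-neighbor graph rather than this naive loop.

```python
import numpy as np

def correct_pseudo_depth(pseudo_pts, lidar_pts, k=4):
    """Depth-correct pseudo-LiDAR points using sparse real LiDAR points.

    For every real LiDAR point, the nearest pseudo point is shifted onto
    the LiDAR depth (z), and the same offset is applied to that point's
    k nearest pseudo neighbours so the local shape of the pseudo cloud
    is kept. Brute-force sketch; a k-d tree would be used at real scale.
    """
    corrected = pseudo_pts.astype(float).copy()
    anchors, offsets = [], []
    for lp in lidar_pts:
        # pseudo point matched to this LiDAR return
        i = np.linalg.norm(pseudo_pts - lp, axis=1).argmin()
        anchors.append(i)
        offsets.append(lp[2] - pseudo_pts[i, 2])   # depth residual
    for i, dz in zip(anchors, offsets):
        d = np.linalg.norm(pseudo_pts - pseudo_pts[i], axis=1)
        nbrs = np.argsort(d)[:k + 1]               # the anchor plus k neighbours
        corrected[nbrs, 2] += dz
    return corrected
```

A pseudo point that is far from every LiDAR return keeps its estimated depth, which mirrors the module's goal of correcting only the matched part of the cloud.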
This point cloud is called the foreground pseudo-LiDAR point cloud. Finally, the foreground pseudo-LiDAR point cloud is down-sampled to 16, 32, and 64 lines over the intervals 0–20 m, 20–40 m, and 40–80 m, respectively, and fused with the original point cloud to form the fusion point cloud. The fused point cloud has more foreground points than the original one; for distant objects, the number of points increases greatly, alleviating the sparsity of small-object point clouds. Result The depth estimation network adopts a pyramid stereo matching network model pre-trained on the SceneFlow dataset (a large dataset for training convolutional networks for disparity, optical flow, and scene flow estimation), and the semantic segmentation network adopts a high-resolution representations for labeling pixels and regions (HRNet) model pre-trained on the Cityscapes dataset. Five recent object detection networks, including sparsely embedded convolutional detection (SECOND), are used as benchmark models for training and testing on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset. Experimental results show that the proposed method improves the small-object detection accuracy of several network frameworks, with the largest per-metric improvements obtained by the SECOND, multi-modal voxelnet for 3D object detection (MVX-Net), and Voxel-RCNN algorithms. Average precision under an intersection over union of 0.7 is used to evaluate and compare the results. Under the hard difficulty condition, the 3D detection accuracies of SECOND, MVX-Net, and Voxel-RCNN improve by 8.65%, 7.32%, and 6.29%, respectively, and the maximum improvement in bird's-eye-view detection accuracy is 7.05%.
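Generating the foreground pseudo-LiDAR cloud amounts to back-projecting the masked depth pixels through the pinhole camera model. A minimal sketch follows; the intrinsics fx, fy, cx, cy and the camera-frame axis convention are the usual pinhole conventions, not values taken from the paper.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth map (H x W, metres) into camera-frame 3D points.

    mask: optional boolean H x W foreground mask (e.g. from semantic
    segmentation); when given, only foreground pixels become points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    if mask is not None:
        pts = pts[mask.reshape(-1)]                  # keep foreground only
    return pts
```

Passing the segmentation mask here is what raises the proportion of foreground points: background depth pixels never become pseudo-LiDAR points at all.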
Meanwhile, most of the other object detection networks achieve better 3D detection accuracy than the original method under the easy and moderate difficulty conditions, and all networks achieve better bird's-eye-view detection accuracy than the original method under the easy, moderate, and hard conditions. Ablation experiments are also conducted with SECOND as the benchmark model, and the results show that the depth correction module, foreground segmentation module, and sampling module designed in this paper all contribute to the improvement, with the sampling module contributing the most: introducing it improves 3D detection accuracy under the easy, moderate, and hard difficulty conditions by about 2.70%, 3.69%, and 10.63%, respectively. Conclusion This paper proposes a pseudo-LiDAR point cloud augmentation technique that uses the accurate depth information of the point cloud to correct the image depth map and uses dense pixel information to compensate for the sparsity of the point cloud, effectively addressing the poor detection accuracy of small objects caused by point cloud sparsity. The semantic segmentation module greatly increases the proportion of foreground points, and the sampling module down-samples the pseudo point cloud to different line numbers according to observation distance, greatly reducing the number of pseudo points. The method is applicable to any object detection network that takes a point cloud as input and significantly improves the 3D and bird's-eye-view detection performance of multiple object detection networks in road scenes. The method is shown to be effective and general, presenting a new idea for multi-modal fusion 3D object detection.
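The sampling module's range-dependent beam reduction could be sketched as follows. The beam counts and range bins (16, 32, and 64 lines over 0–20 m, 20–40 m, and 40–80 m) come from the abstract; extracting virtual beams by binning the elevation angle is an assumption about the implementation, as is the function name.

```python
import numpy as np

def range_aware_downsample(points, full_lines=64):
    """Down-sample a pseudo-LiDAR cloud to fewer virtual beams for near
    points than for far ones: 16 lines within 20 m, 32 within 40 m, and
    all 64 beyond, so distant sparse objects keep their added points.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    elev = np.arctan2(z, np.hypot(x, y))                       # elevation angle
    lo, hi = elev.min(), elev.max() + 1e-6
    beam = ((elev - lo) / (hi - lo) * full_lines).astype(int)  # virtual beam id
    lines = np.select([r < 20, r < 40], [16, 32], default=64)  # per-point budget
    return points[beam % (full_lines // lines) == 0]           # keep every k-th beam
```

Fusing the result with the original LiDAR scan (e.g. `np.concatenate`) then yields the final input cloud, dense where the raw scan is sparse but without flooding the near field with redundant pseudo points.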