3D object detection in road scenes by pseudo-LiDAR point cloud augmentation
2023, Vol. 28, No. 11, Pages 3520-3535
Print publication date: 2023-11-16
DOI: 10.11834/jig.220986
Jin Shuai, Li Xuanpeng, Yang Feng, Zhang Weigong. 2023. 3D object detection in road scenes by pseudo-LiDAR point cloud augmentation. Journal of Image and Graphics, 28(11):3520-3535
Objective
To address the degradation of small-object detection accuracy caused by the sparsity of LiDAR point clouds, this paper proposes a pseudo-LiDAR point cloud augmentation technique that fuses images with point clouds to supplement the sparse geometric information of small objects and improve 3D object detection performance in road scenes.
Method
First, a depth estimation network predicts a depth map from the stereo image pair, and the LiDAR point cloud is used to correct the depth map, reducing depth estimation error. Second, semantic segmentation extracts the foreground regions of the image, and only the depth pixels of the foreground regions are projected into 3D space to generate the pseudo-LiDAR point cloud, raising the proportion of foreground points. Finally, the pseudo-LiDAR point cloud is down-sampled to different numbers of laser lines according to observation distance and fused with the original LiDAR point cloud as the final input point cloud.
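As an illustration of the projection step, the following minimal Python sketch (not the authors' code; the intrinsics fx, fy, cx, cy and the foreground mask are assumed inputs) back-projects the foreground pixels of a depth map into pseudo-LiDAR points with a pin-hole camera model:

    import numpy as np

    def depth_to_pseudo_lidar(depth, fg_mask, fx, fy, cx, cy):
        # depth: H x W depth map in meters; fg_mask: H x W boolean foreground mask
        v, u = np.nonzero(fg_mask)           # pixel coordinates of foreground pixels
        z = depth[v, u]                      # estimated depth per foreground pixel
        x = (u - cx) * z / fx                # camera-frame X (right)
        y = (v - cy) * z / fy                # camera-frame Y (down)
        return np.stack([x, y, z], axis=1)   # N x 3 points; a camera-to-LiDAR
                                             # extrinsic transform would follow here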
Result
Experiments on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset show that the proposed method improves the small-object detection accuracy of several state-of-the-art network frameworks. Taking the representative networks SECOND (sparsely embedded convolutional detection), MVX-Net (multimodal VoxelNet for 3D object detection), and Voxel-RCNN as examples, 3D detection accuracy under the hard difficulty level improves substantially, by 8.65%, 7.32%, and 6.29%, respectively.
Conclusion
The proposed method is applicable to any object detection network that takes a point cloud as input and significantly improves the small-object detection performance of multiple detection networks in road scenes, demonstrating both effectiveness and generality.
Objective
Light detection and ranging (LiDAR) is one of the most commonly used sensors in autonomous driving. It has strong structure-sensing ability, and its point cloud provides accurate object distance information. However, LiDAR has a limited number of laser lines: as the distance from an object to the LiDAR increases, the point cloud returned from the object becomes sparse and the effective information is greatly reduced, lowering the detection accuracy of distant small objects. At the same time, because the road environment is complex and changeable, vehicles cannot rely on a single sensor, necessitating multi-source data fusion to improve their perception capabilities. This paper proposes a pseudo-LiDAR point cloud augmentation technique that fuses an image with a point cloud to improve 3D object detection performance in road scenes.
Method
First, a stereo image pair is used as the input of the depth estimation network to predict a depth map. The LiDAR point cloud is projected onto the image plane to obtain a point cloud depth map, which is sent to the depth correction module together with the estimated depth map. The depth correction module builds a directed k-nearest-neighbor graph over the pseudo-LiDAR points, finds the pseudo-LiDAR points closest to the LiDAR points, uses the precise LiDAR depths to correct the depths of those pseudo-LiDAR points, and preserves the shape and structure of the original pseudo-LiDAR point cloud, producing a corrected depth map. Second, a semantic segmentation network is applied to the image to obtain the vehicle foreground area. The segmentation map and the corrected depth map are processed together by the foreground segmentation module, and only the depth pixels of the foreground area are mapped into 3D space, so that only vehicle foreground points are retained; the resulting point cloud is called the foreground pseudo-LiDAR point cloud. Finally, the foreground pseudo-LiDAR point cloud is down-sampled to 16, 32, and 64 lines in the intervals 0-20 m, 20-40 m, and 40-80 m, respectively, and fused with the original point cloud. The fused point cloud contains more foreground points than the original; for distant objects, the number of points is greatly increased, alleviating the sparsity of small-object point clouds.
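The two point-cloud-side steps can be sketched as follows. This is an illustrative approximation, not the paper's implementation: the correction simply shifts the k pseudo points nearest each LiDAR return by the measured depth offset, and the line down-sampling keeps every fourth or second elevation bin to mimic 16- and 32-line scanners. Apart from the 16/32/64-line counts and the 0-20/20-40/40-80 m bands stated above, k, the bin construction, and the coordinate convention are assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    def correct_depth(pseudo, lidar, k=4):
        # Directed KNN graph: pull the k pseudo points nearest each LiDAR
        # return toward its measured depth, keeping their relative arrangement.
        tree = cKDTree(pseudo[:, :3])
        _, idx = tree.query(lidar[:, :3], k=k)
        out = pseudo.copy()
        for pt, nbrs in zip(lidar, idx):
            offset = pt[2] - out[nbrs, 2].mean()   # LiDAR depth minus mean pseudo depth
            out[nbrs, 2] += offset                 # shift along depth; local shape kept
        return out

    def downsample_by_distance(pseudo, n_lines=64):
        # Approximate laser "lines" by binning elevation angles, then keep
        # 16/32/64 lines in the 0-20 m, 20-40 m, and 40-80 m bands, respectively.
        r = np.linalg.norm(pseudo[:, :3], axis=1)
        elev = np.arcsin(pseudo[:, 1] / np.maximum(r, 1e-6))   # assumes y is vertical
        line = np.digitize(elev, np.linspace(elev.min(), elev.max(), n_lines))
        keep = ((r < 20) & (line % 4 == 0)) | \
               ((r >= 20) & (r < 40) & (line % 2 == 0)) | \
               ((r >= 40) & (r < 80))
        return pseudo[keep]

    # Usage sketch:
    # fused = np.concatenate([lidar, downsample_by_distance(correct_depth(pseudo, lidar))])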
Result
In this paper, the depth estimation network adopts a pyramid stereo matching network model pre-trained on the SceneFlow dataset (a large dataset for training convolutional networks for disparity, optical flow, and scene flow estimation), and the semantic segmentation network adopts an HRNet (high-resolution representations for labeling pixels and regions) model pre-trained on the Cityscapes dataset. Five recent object detection networks, including sparsely embedded convolutional detection (SECOND), are used as benchmark models for training and testing on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset. Experimental results show that the proposed method improves the small-object detection accuracy of several network frameworks, with the largest gains on each metric for SECOND, multimodal VoxelNet for 3D object detection (MVX-Net), and Voxel-RCNN. Average precision at an intersection-over-union threshold of 0.7 is used to evaluate and compare the results. Under the hard difficulty condition, the 3D detection accuracies of SECOND, MVX-Net, and Voxel-RCNN improve by 8.65%, 7.32%, and 6.29%, respectively, and the maximum improvement in bird's-eye-view detection accuracy is 7.05%. Most of the other detection networks also obtain better 3D detection accuracy than their original versions under the easy and moderate difficulty conditions, and all networks obtain better bird's-eye-view detection accuracy under the easy, moderate, and hard difficulty conditions. Ablation experiments with SECOND as the benchmark model show that the depth correction, foreground segmentation, and sampling modules all contribute to the improvement, with the sampling module contributing the most: introducing it improves 3D detection accuracy under the easy, moderate, and hard difficulty conditions by about 2.70%, 3.69%, and 10.63%, respectively.
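For reference, the reported numbers are average precision at an IoU threshold of 0.7. A minimal sketch of the KITTI-style 40-recall-point interpolated AP follows (whether the paper uses the R40 metric or the older 11-point variant is not stated, so this is an assumption); it takes a precision-recall curve computed from detections already matched at IoU >= 0.7:

    import numpy as np

    def ap_r40(recall, precision):
        # Mean of the interpolated precision sampled at 40 equally spaced
        # recall positions (1/40, 2/40, ..., 1.0), as in the KITTI R40 metric.
        recall, precision = np.asarray(recall), np.asarray(precision)
        total = 0.0
        for r in np.linspace(1.0 / 40, 1.0, 40):
            mask = recall >= r
            total += precision[mask].max() if mask.any() else 0.0
        return total / 40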
Conclusion
This paper proposes a pseudo-LiDAR point cloud augmentation technique that uses the accurate depth information of the point cloud to correct the image depth map and uses dense pixel information to compensate for the sparsity of the point cloud, effectively addressing the poor detection accuracy of small objects caused by point cloud sparsity. The semantic segmentation module greatly increases the proportion of foreground points, and the sampling module supplements pseudo points at different line counts according to observation distance, greatly reducing the total number of pseudo points. The method is applicable to any object detection network that takes a point cloud as input and significantly improves the 3D and bird's-eye-view detection performance of multiple detection networks in road scenes. Its effectiveness and generality suggest a new direction for multi-modal fusion 3D object detection.
Keywords: pseudo-LiDAR (point cloud); depth estimation; semantic segmentation; fusion algorithm; 3D object detection
Cao J L, Li Y L, Sun H Q, Xie J, Huang K Q and Pang Y W. 2022. A survey on deep learning based visual object detection. Journal of Image and Graphics, 27(6): 1697-1722 [DOI: 10.11834/jig.220069]
Chang J R and Chen Y S. 2018. Pyramid stereo matching network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5410-5418 [DOI: 10.1109/CVPR.2018.00567]
Charles R Q, Su H, Mo K C and Guibas L J. 2017. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Chen C, Chen Z, Zhang J and Tao D C. 2022. SASA: semantics-augmented set abstraction for point-based 3D object detection//Proceedings of the 36th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI: 221-229 [DOI: 10.1609/aaai.v36i1.19897]
Chen X Z, Kundu K, Zhang Z Y, Ma H M, Fidler S and Urtasun R. 2016. Monocular 3D object detection for autonomous driving//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2147-2156 [DOI: 10.1109/CVPR.2016.236]
Chen X Z, Ma H M, Wan J, Li B and Xia T. 2017. Multi-view 3D object detection network for autonomous driving//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6526-6534 [DOI: 10.1109/CVPR.2017.691]
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S and Schiele B. 2016. The cityscapes dataset for semantic urban scene understanding//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 3213-3223 [DOI: 10.1109/CVPR.2016.350]
Dai K, Xu L B, Huang S Y and Li Y L. 2022. Single stage object detection algorithm based on fusing strategy optimization selection and dual attention mechanism. Journal of Image and Graphics, 27(8): 2430-2443 [DOI: 10.11834/jig.210204]
Deng J J, Shi S S, Li P W, Zhou W G, Zhang Y Y and Li H Q. 2021. Voxel R-CNN: towards high performance voxel-based 3D object detection//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 1201-1209 [DOI: 10.1609/aaai.v35i2.16207]
Fu H, Gong M M, Wang C H, Batmanghelich K and Tao D C. 2018. Deep ordinal regression network for monocular depth estimation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2002-2011 [DOI: 10.1109/CVPR.2018.00214]
Geiger A, Lenz P and Urtasun R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 3354-3361 [DOI: 10.1109/CVPR.2012.6248074]
Guo Y L, Wang H Y, Hu Q Y, Liu H, Liu L and Bennamoun M. 2021. Deep learning for 3D point clouds: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12): 4338-4364 [DOI: 10.1109/TPAMI.2020.3005434]
He C H, Zeng H, Huang J Q, Hua X S and Zhang L. 2020. Structure aware single-stage 3D object detection from point cloud//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11870-11879 [DOI: 10.1109/CVPR42600.2020.01189]
Huang T T, Liu Z, Chen X W and Bai X. 2020. EPNet: enhancing point features with image semantics for 3D object detection//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 35-52 [DOI: 10.1007/978-3-030-58555-6_3]
Ku J, Mozifian M, Lee J, Harakeh A and Waslander S L. 2018. Joint 3D proposal generation and object detection from view aggregation//Proceedings of 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems. Madrid, Spain: IEEE: #8594049 [DOI: 10.1109/IROS.2018.8594049]
Lang A H, Vora S, Caesar H, Zhou L B, Yang J and Beijbom O. 2019. PointPillars: fast encoders for object detection from point clouds//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12689-12697 [DOI: 10.1109/CVPR.2019.01298]
Li P L, Chen X Z and Shen S J. 2019. Stereo R-CNN based 3D object detection for autonomous driving//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7636-7644 [DOI: 10.1109/CVPR.2019.00783]
Liu Z C, Wu Z Z and Tóth R. 2020. SMOKE: single-stage monocular 3D object detection via keypoint estimation//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Seattle, USA: IEEE: 4289-4298 [DOI: 10.1109/CVPRW50498.2020.00506]
Mayer N, Ilg E, Häusser P, Fischer P, Cremers D, Dosovitskiy A and Brox T. 2016. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4040-4048 [DOI: 10.1109/CVPR.2016.438]
Mousavian A, Anguelov D, Flynn J and Košecká J. 2017. 3D bounding box estimation using deep learning and geometry//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5632-5640 [DOI: 10.1109/CVPR.2017.597]
Qi C R, Liu W, Wu C X, Su H and Guibas L J. 2018. Frustum PointNets for 3D object detection from RGB-D data//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 918-927 [DOI: 10.1109/CVPR.2018.00102]
Qi C R, Yi L, Su H and Guibas L J. 2017. PointNet++: deep hierarchical feature learning on point sets in a metric space//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 5105-5114
Qian R, Lai X and Li X R. 2022. 3D object detection for autonomous driving: a survey. Pattern Recognition, 130: #108796 [DOI: 10.1016/j.patcog.2022.108796]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Shi S S, Guo C X, Jiang L, Wang Z, Shi J P, Wang X G and Li H S. 2020. PV-RCNN: point-voxel feature set abstraction for 3D object detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10526-10535 [DOI: 10.1109/CVPR42600.2020.01054]
Shi S S, Wang X G and Li H S. 2019. PointRCNN: 3D object proposal generation and detection from point cloud//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 770-779 [DOI: 10.1109/CVPR.2019.00086]
Shi S S, Wang Z, Shi J P, Wang X G and Li H S. 2021. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8): 2647-2664 [DOI: 10.1109/TPAMI.2020.2977026]
Sindagi V A, Zhou Y and Tuzel O. 2019. MVX-Net: multimodal voxelnet for 3D object detection//Proceedings of 2019 International Conference on Robotics and Automation. Montreal, Canada: IEEE: 7276-7282 [DOI: 10.1109/ICRA.2019.8794195]
Vora S, Lang A H, Helou B and Beijbom O. 2020. PointPainting: sequential fusion for 3D object detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 4603-4611 [DOI: 10.1109/CVPR42600.2020.00466]
Wang J D, Sun K, Cheng T H, Jiang B R, Deng C R, Zhao Y, Liu D, Mu Y D, Tan M K, Wang X G, Liu W Y and Xiao B. 2021. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10): 3349-3364 [DOI: 10.1109/TPAMI.2020.2983686]
Wang Y, Chao W L, Garg D, Hariharan B, Campbell M and Weinberger K Q. 2019. Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8437-8445 [DOI: 10.1109/CVPR.2019.00864]
Wang Z X and Jia K. 2019. Frustum ConvNet: sliding frustums to aggregate local point-wise features for amodal 3D object detection//Proceedings of 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems. Macau, China: IEEE: 1742-1749 [DOI: 10.1109/IROS40897.2019.8968513]
Wu Z R, Song S R, Khosla A, Yu F, Zhang L G, Tang X O and Xiao J X. 2015. 3D ShapeNets: a deep representation for volumetric shapes//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1912-1920 [DOI: 10.1109/CVPR.2015.7298801]
Yan J, Fang Z J and Gao Y B. 2020. 3D object detection based on domain attention and dilated convolution. Journal of Image and Graphics, 25(6): 1221-1234 [DOI: 10.11834/jig.190378]
Yan Y, Mao Y X and Li B. 2018. SECOND: sparsely embedded convolutional detection. Sensors, 18(10): #3337 [DOI: 10.3390/s18103337]
Yang Z T, Sun Y N, Liu S and Jia J Y. 2020. 3DSSD: point-based 3D single stage object detector//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11040-11048 [DOI: 10.1109/CVPR42600.2020.01105]
Yoo J H, Kim Y, Kim J and Choi J W. 2020. 3D-CVF: generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 720-736 [DOI: 10.1007/978-3-030-58583-9_43]
You Y R, Wang Y, Chao W L, Garg D, Pleiss G, Hariharan B, Campbell M E and Weinberger K Q. 2020. Pseudo-LiDAR++: accurate depth for 3D object detection in autonomous driving//Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: ICLR: 1-22
Yu X M, Rao Y M, Wang Z Y, Liu Z Y, Lu J W and Zhou J. 2021. PoinTr: diverse point cloud completion with geometry-aware Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 12478-12487 [DOI: 10.1109/ICCV48922.2021.01227]
Zhang Y F, Hu Q R, Xu G Q, Ma Y X, Wan J W and Guo Y L. 2022. Not all points are equal: learning highly efficient point-based detectors for 3D LiDAR point clouds//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 18931-18940 [DOI: 10.1109/CVPR52688.2022.01838]
Zheng W, Tang W L, Chen S J, Jiang L and Fu C W. 2021. CIA-SSD: confident IoU-aware single-stage object detector from point cloud//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 3555-3562 [DOI: 10.1609/aaai.v35i4.16470]
Zhou Y and Tuzel O. 2018. VoxelNet: end-to-end learning for point cloud based 3D object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4490-4499 [DOI: 10.1109/CVPR.2018.00472]