Multi-stage guidance network for constructing dense depth map based on LiDAR and RGB data
Vol. 27, Issue 2, Pages: 435-446 (2022)
Published: 16 February 2022
Accepted: 09 September 2021
DOI: 10.11834/jig.210465
Di Jia, Zitao Wang, Yuyang Li, Zhiyang Jin, Zeyang Liu, Si Wu. Multi-stage guidance network for constructing dense depth map based on LiDAR and RGB data[J]. Journal of Image and Graphics, 27(2): 435-446 (2022)
Objective
Using a single RGB image to guide a sparse light detection and ranging (LiDAR) point cloud in constructing a dense depth map has gradually become a research hotspot. However, when existing methods construct scene depth information, the depth at object edges remains blurred, which degrades the accuracy of 3D reconstruction and photogrammetry. To address this problem, this paper proposes a dense depth map construction method based on a multi-stage guidance network.
Method
The multi-stage guidance network consists of a guidance-information-guided path and an RGB-information-guided path. On the guidance-information-guided path, an efficient residual factorized (ERF) network fuses the sparse LiDAR point cloud and RGB data to extract the initial guidance information; the guidance information processing module fuses the sparse depth with the initial guidance information, and the fused information is used to construct surface normals by bilinear interpolation. The mid-term guidance information extracted by the multi-modal information fusion guidance module and the surface normal information are then fed into the ERF network to extract the late guidance information that guides sparse depth densification, from which the dense depth map on this path is constructed. On the RGB-information-guided path, the initial guidance information guides the fusion of the sparse depth and RGB information, the multi-modal information fusion guidance module produces the dense depth map on this path, and a refinement module reduces the errors in that dense depth map. The results of the two paths are fused to obtain the final dense depth map.
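Since the guidance information processing module builds surface normals from the fused depth information, a minimal generic sketch of deriving normals from a dense depth map is shown below; the finite-difference construction and the function name normals_from_depth are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """Derive per-pixel surface normals from a dense depth map of shape (N, 1, H, W).

    This is a generic finite-difference construction for illustration only; it is
    not necessarily the exact construction used inside the guidance information
    processing module described in the paper."""
    dz_dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]   # horizontal depth gradient
    dz_dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]   # vertical depth gradient
    dz_dx = F.pad(dz_dx, (0, 1, 0, 0))                 # pad width back to W
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))                 # pad height back to H
    ones = torch.ones_like(depth)
    normal = torch.cat([-dz_dx, -dz_dy, ones], dim=1)  # (N, 3, H, W)
    return F.normalize(normal, dim=1)                  # unit-length normal vectors
```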
Result
The multi-stage guidance network was trained on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) depth estimation dataset, and the test results were submitted to the official KITTI evaluation server. Among the evaluation metrics, the root mean square error and the root mean square error of the inverse depth are 768.35 and 2.40, respectively, both lower than those of the compared methods, and the proposed method reconstructs object edges and details with higher accuracy.
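For context, the two metrics quoted above are conventionally defined on the KITTI depth completion benchmark as follows; this is a standard formulation stated for reference, and the units (millimetres for RMSE, 1/km for the inverse-depth RMSE) are an assumption based on the benchmark's convention rather than something spelled out in this abstract.

$$\mathrm{RMSE}=\sqrt{\frac{1}{|V|}\sum_{v\in V}\bigl(\hat d_v-d_v\bigr)^2},\qquad \mathrm{iRMSE}=\sqrt{\frac{1}{|V|}\sum_{v\in V}\Bigl(\frac{1}{\hat d_v}-\frac{1}{d_v}\Bigr)^2}$$

where $\hat d_v$ and $d_v$ are the predicted and ground-truth depths at pixel $v$, and $V$ is the set of pixels with valid ground truth.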
Conclusion
The multi-stage guidance network presented in this paper improves the accuracy of dense depth map construction and compensates for the sparsity of LiDAR point clouds; the experimental results verify the effectiveness of the proposed method.
Objective
Depth information plays an important role in autonomous driving and robot navigation, but the sparse depth collected by light detection and ranging (LiDAR) suffers from sparsity and noise. To address these problems, several recently proposed methods that use a single image to guide the sparse depth in constructing a dense depth map have shown good performance. However, many methods cannot adequately learn the depth information at the edges and details of objects. This paper proposes a multi-stage guidance network model to cope with this challenge. Deformable convolution and the efficient residual factorized (ERF) network are introduced into the network model, and the quality of the dense depth map is improved from the angle of the geometric constraint imposed by surface normal information. The depth and guidance information extracted within the network is dominant, and the information extracted from the RGB image is used as guidance to steer sparse depth densification and correct errors in the depth information.
Method
The multi-stage guidance network is composed of a guidance-information-guided path and an RGB-information-guided path. On the guidance-information-guided path, the sparse depth information and RGB images are first merged through the ERF network to obtain the initial guidance information, and the sparse depth information and the initial guidance information are input into the guidance information processing module to construct the surface normal. Second, the surface normal and the mid-term guidance information obtained from the multi-modal information fusion guidance module are input into the ERF network, and the late guidance information containing rich depth information is extracted under the action of the surface normal. The late guidance information is used to guide sparse depth densification. At the same time, the sparse depth is introduced again to make up for the depth information ignored in the early stage, and the dense depth map constructed on this path is then obtained. On the RGB-information-guided path, the initial guidance information guides the fusion of the sparse depth and the information extracted from the RGB image, which reduces the influence of the noise and sparsity of the sparse depth. The mid-term guidance information and an initial dense depth map with rich depth information are extracted by the multi-modal information fusion guidance module. However, the initial dense depth map still contains errors, so the refinement module corrects it to obtain an accurate dense depth map. The network combines the sparse depth and the guidance information through an addition operation, which effectively guides sparse depth densification, while the concatenation operation retains the respective features of the different sources of information so that the network or module can extract more features. Overall, the initial guidance information is extracted from the input information; it promotes the construction of the surface normal and guides the fusion of the sparse depth and RGB information. The mid-term guidance information, obtained from the multi-modal information fusion guidance module, is the key information connecting the two paths. The late guidance information, obtained by fusing the mid-term guidance information and the surface normal, is used to guide sparse depth densification. Of the two paths, the guidance-information-guided path constructs a dense depth map by using the initial, mid-term, and late guidance information to guide the sparse depth, whereas on the RGB-information-guided path the multi-modal information fusion guidance module guides the sparse depth through the RGB information.
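To make the two fusion operations concrete, here is a minimal PyTorch sketch of how the addition and concatenation steps described above could be wired up; the module name GuidanceFusion, its channel arguments, and the tensor names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidanceFusion(nn.Module):
    """Illustrative fusion block: addition injects guidance into the sparse-depth
    branch, while concatenation keeps the features of both modalities available
    to later layers (both operations are described in the abstract above)."""

    def __init__(self, depth_channels: int, rgb_channels: int, out_channels: int):
        super().__init__()
        # concatenated depth + RGB features are compressed back to out_channels
        self.fuse = nn.Conv2d(depth_channels + rgb_channels, out_channels,
                              kernel_size=3, padding=1)

    def forward(self, depth_feat, guidance, rgb_feat):
        # addition: guidance information directly steers the sparse-depth features
        guided_depth = depth_feat + guidance
        # concatenation: preserve the separate depth and RGB features
        stacked = torch.cat([guided_depth, rgb_feat], dim=1)
        return F.relu(self.fuse(stacked))

# example usage with illustrative channel counts and feature map sizes
fusion = GuidanceFusion(depth_channels=32, rgb_channels=32, out_channels=64)
out = fusion(torch.randn(1, 32, 64, 128),
             torch.randn(1, 32, 64, 128),
             torch.randn(1, 32, 64, 128))
```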
Result
The proposed network is implemented in PyTorch with the Adam optimizer, whose parameters are set to β1 = 0.9 and β2 = 0.999. The images input to the network are cropped to 256 × 512 pixels, the graphics card is an NVIDIA 3090, the batch size is set to 6, and 30 rounds of training are performed. The initial learning rate is 0.000 125, and the learning rate is halved every 5 rounds. The Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) depth estimation dataset contains more than 93 000 pairs of ground truth data, aligned sparse LiDAR depth data, and RGB images. A total of 85 898 pairs of data are used for training, and the officially distributed 1 000 pairs of validation data with ground truth and 1 000 pairs of test data without ground truth are used for testing. The validation results can be evaluated directly because ground truth is available; the test set has no ground truth, so the results must be submitted to the official KITTI evaluation server to obtain public evaluation results, which provide an important basis for a fair assessment of model performance. The validation and test sets do not participate in the training of the network model. The root mean square error and the root mean square error of the inverse depth, 768.35 and 2.40 respectively, are lower than those of the compared methods, and the improved accuracy of the depth information at the edges and details of objects is evident.
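The training setup described above maps directly onto standard PyTorch components; the following is a minimal sketch under that assumption, with a trivial stand-in model and dummy loss in place of the actual multi-stage guidance network and KITTI data pipeline.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Stand-in for the multi-stage guidance network (4 input channels: RGB + sparse depth)
model = nn.Conv2d(4, 1, kernel_size=3, padding=1)

# Hyperparameters reported above: Adam with beta1 = 0.9, beta2 = 0.999,
# initial learning rate 0.000125, halved every 5 rounds, 30 rounds in total
optimizer = Adam(model.parameters(), lr=0.000125, betas=(0.9, 0.999))
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(30):
    # the real training pass iterates over 256 x 512 crops with batch size 6;
    # a dummy forward/backward stands in for it here
    pred = model(torch.zeros(6, 4, 256, 512))
    loss = pred.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```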
Conclusion
A multi-stage guidance network model for dense depth map construction from LiDAR and RGB information is presented in this paper. The guidance information processing module promotes the fusion of the guidance information and the sparse depth. The multi-modal information fusion guidance module learns a large amount of depth information from the sparse depth and RGB images. The refinement module corrects the output of the multi-modal information fusion guidance module. In summary, the dense depth map constructed by the multi-stage guidance network comes from the guidance-information-guided path and the RGB-information-guided path; the two strategies for building the dense depth map complement each other effectively, using more information to obtain more accurate dense depth maps. Experiments on the KITTI depth estimation dataset show that the multi-stage guidance network effectively handles the depth at the edges and details of objects and improves the construction quality of dense depth maps.
depth estimation; deep learning; LiDAR; multi-modal data fusion; image processing
Bai L, Zhao Y M, Elhousni M and Huang X M. 2020. DepthNet: real-time LiDAR point cloud depth completion for autonomous vehicles. IEEE Access, 8: 227825-227833[DOI: 10.1109/ACCESS.2020.3045681]
Dai J F, Qi H Z, Xiong Y W, Li Y, Zhang G D, Hu H and Wei Y C. 2017. Deformable convolutional networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 764-773[DOI: 10.1109/ICCV.2017.89]
Eldesokey A, Felsberg M and Khan F S. 2020. Confidence propagation through CNNs for guided sparse depth regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2423-2436[DOI: 10.1109/TPAMI.2019.2929170]
Huang J, Wang C, Liu Y and Bi T T. 2019. The progress of monocular depth estimation technology. Journal of Image and Graphics, 24(12): 2081-2097[DOI: 10.11834/jig.190455]
Imran S, Long Y F, Liu X M and Morris D. 2019. Depth coefficients for depth completion//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 12438-12447[DOI: 10.1109/CVPR.2019.01273]
Ku J, Harakeh A and Waslander S L. 2018. In defense of classical image processing: fast depth completion on the CPU//Proceedings of the 15th Conference on Computer and Robot Vision (CRV). Toronto, Canada: IEEE: 16-22[DOI: 10.1109/CRV.2018.00013]
Lee S, Lee J, Kim D and Kim J. 2020. Deep architecture with cross guidance between single image and sparse LiDAR data for depth completion. IEEE Access, 8: 79801-79810[DOI: 10.1109/ACCESS.2020.2990212]
Lu K Y, Barnes N, Anwar S and Zheng L. 2020. From depth what can you see? Depth completion via auxiliary image reconstruction//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11303-11312[DOI: 10.1109/CVPR42600.2020.01132]
Ma F C, Cavalheiro G V and Karaman S. 2019. Self-supervised sparse-to-dense: self-supervised depth completion from LiDAR and monocular camera//Proceedings of 2019 International Conference on Robotics and Automation (ICRA). Montreal, Canada: IEEE: 3288-3295[DOI: 10.1109/ICRA.2019.8793637]
Park J, Joo K, Hu Z, Liu C K and Kweon I S. 2020. Non-local spatial propagation network for depth completion//Proceedings of the 16th European Conference on Computer Vision (ECCV). Glasgow, UK: Springer: 120-136[DOI: 10.1007/978-3-030-58601-0_8]
Romera E, Álvarez J M, Bergasa L M and Arroyo R. 2018. ERFNet: efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1): 263-272[DOI: 10.1109/TITS.2017.2750080]
Schuster R, Wasenmüller O, Unger C and Stricker D. 2021. SSGP: sparse spatial guided propagation for robust and generic interpolation//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). Waikoloa, USA: IEEE: 197-206[DOI: 10.1109/WACV48630.2021.00024]
Shivakumar S S, Nguyen T, Miller I D, Chen S W, Kumar V and Taylor C J. 2019. DFuseNet: deep fusion of RGB and sparse depth information for image guided dense depth completion//Proceedings of 2019 IEEE Intelligent Transportation Systems Conference (ITSC). Auckland, New Zealand: IEEE: 13-20[DOI: 10.1109/ITSC.2019.8917294]
Silberman N, Hoiem D, Kohli P and Fergus R. 2012. Indoor segmentation and support inference from RGBD images//Proceedings of the 12th European Conference on Computer Vision (ECCV). Florence, Italy: Springer: 746-760[DOI: 10.1007/978-3-642-33715-4_54]
Tang J, Tian F P, Feng W, Li J and Tan P. 2020. Learning guided convolutional network for depth completion. IEEE Transactions on Image Processing, 30: 1116-1129[DOI: 10.1109/TIP.2020.3040528]
Uhrig J, Schneider N, Schneider L, Franke U, Brox T and Geiger A. 2017. Sparsity invariant CNNs//Proceedings of 2017 International Conference on 3D Vision (3DV). Qingdao, China: IEEE: 11-20[DOI: 10.1109/3DV.2017.00012]
Wang B Z, Feng Y L and Liu H Z. 2018. Multi-scale features fusion from sparse LiDAR data and single image for depth completion. Electronics Letters, 54(24): 1375-1377[DOI: 10.1049/el.2018.6149]
Xu Y, Zhu X G, Shi J P, Zhang G F, Bao H J and Li H S. 2019. Depth completion from sparse LiDAR data with depth-normal constraints//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 2811-2820[DOI: 10.1109/ICCV.2019.00290]
Yan L, Liu K and Belyaev E. 2020. Revisiting sparsity invariant convolution: a network for image guided depth completion. IEEE Access, 8: 126323-126332[DOI: 10.1109/ACCESS.2020.3008404]
Zhang Y L, Nguyen T, Miller I D, Shivakumar S S, Chen S, Taylor C and Kumar V. 2019. DFineNet: ego-motion estimation and depth refinement from sparse, noisy depth input with RGB guidance[EB/OL]. [2021-05-23]. https://arxiv.org/pdf/1903.06397.pdf
Zhao S S, Gong M M, Fu H and Tao D C. 2021. Adaptive context-aware multi-modal network for depth completion. IEEE Transactions on Image Processing, 30: 5264-5276[DOI: 10.1109/TIP.2021.3079821]
Zhou D K, Tian J and Yang X. 2021. Unsupervised monocular image depth estimation based on the prediction of local plane parameters. Journal of Image and Graphics, 26(1): 165-175[DOI: 10.11834/jig.200364]