Published: 2022-02-16
DOI: 10.11834/jig.210465
2022 | Volume 27 | Number 2

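Depth estimation and 3D reconstruction

Multi-stage guidance network for constructing dense depth map based on LiDAR and RGB data

Jia Di¹,², Wang Zitao¹, Li Yuyang¹, Jin Zhiyang¹, Liu Zeyang¹, Wu Si¹

1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China;

2. Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao 125105, China

Received: 2021-06-23; Revised: 2021-09-02; Preprint: 2021-09-09

Supported by: National Natural Science Foundation of China (61601213); Projects of the Education Department of Liaoning Province (LJ2020FWL004, 2019-ZD-0038)

About the authors: Jia Di, born in 1982, male, professor. His research interests include stereo matching and 3D reconstruction, photogrammetry, visual spatial localization, and vision-guided robotic arm operation. E-mail: lntu_jiadi@163.com
Wang Zitao, corresponding author, male, master's student. His research interests include depth estimation and 3D reconstruction. E-mail: lntu_wzt@163.com
Li Yuyang, male, master's student. His research interests include human pose estimation and gesture recognition. E-mail: lntu_lyy@163.com
Jin Zhiyang, male, master's student. His research interests include pose estimation and vision-guided robotic arm operation. E-mail: jzy980125@163.com
Liu Zeyang, male, master's student. His research interest is pose estimation. E-mail: lntu_lzy@163.com
Wu Si, female, master's student. Her research interest is image matching. E-mail: lntu_ws@163.com
*Corresponding author: Wang Zitao lntu_wzt@163.com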

CLC number: TP391

Document code: A

Article ID: 1006-8961(2022)02-0435-12

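Abstract

Objective Using a single RGB image to guide a sparse light detection and ranging (LiDAR) point cloud in constructing a dense depth map has become a research hotspot. However, existing methods still produce blurred depth at object edges when reconstructing scene depth, which degrades the accuracy of 3D reconstruction and photogrammetry. To address this, a dense depth map construction method based on a multi-stage guidance network is proposed. Method The multi-stage guidance network consists of a guidance information guidance path and an RGB information guidance path. On the guidance information guidance path, an efficient residual factorized (ERF) network fuses the sparse LiDAR point cloud and the RGB data to extract the initial guidance information. A guidance information processing module fuses the sparse depth with the initial guidance information, and the fused information is used to construct the surface normal via bilinear interpolation. The midterm guidance information extracted by the multi-modal information fusion guidance module and the surface normal are then fed into an ERF network to extract the later guidance information, which guides sparse depth densification and yields the dense depth map of this path. On the RGB information guidance path, the initial guidance information guides the fusion of the sparse depth and the RGB information, the multi-modal information fusion guidance module produces the dense depth map of this path, and a refined module reduces the errors in that map. The results of the two paths are fused to obtain the final dense depth map. Result The multi-stage guidance network is trained on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) depth estimation dataset, and the results on the test data are submitted to the official KITTI evaluation server. The root mean square error (RMSE) and the root mean square error of the inverse depth (iRMSE) are 768.35 mm and 2.40 1/km, respectively, both lower than those of the compared methods, and the proposed method is more accurate at object edges and details. Conclusion The proposed multi-stage guidance network improves the accuracy of dense depth map construction and compensates for the sparsity of LiDAR point clouds. The experimental results verify the effectiveness of the method.

Keywords

depth estimation; deep learning; LiDAR; multi-modal data fusion; image processing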

Multi-stage guidance network for constructing dense depth map based on LiDAR and RGB data

Jia Di¹,², Wang Zitao¹, Li Yuyang¹, Jin Zhiyang¹, Liu Zeyang¹, Wu Si¹

1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China;

2. Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao 125105, China

Supported by: National Natural Science Foundation of China (61601213)

Abstract

Objective Depth information plays an important role in autonomous driving and robot navigation, but the depth collected by light detection and ranging (LiDAR) is sparse and noisy. Several recently proposed methods that use a single image to guide sparse depth toward a dense depth map perform well, yet many of them cannot accurately learn the depth at object edges and details. This paper proposes a multi-stage guidance network to address this problem. Deformable convolution and the efficient residual factorized (ERF) network are introduced into the model, and surface normal information improves the quality of the dense depth map through geometric constraints. The depth and guidance information extracted within the network play the leading role, while the information extracted from the RGB image serves as secondary guidance that guides sparse depth densification and corrects errors in the depth information.

Method The multi-stage guidance network is composed of a guidance information guidance path and an RGB information guidance path. On the guidance information guidance path, the sparse depth and the RGB image are first merged through an ERF network to obtain the initial guidance information, and the sparse depth and the initial guidance information are fed into the guidance information processing module to construct the surface normal. Second, the surface normal and the midterm guidance information obtained by the multi-modal information fusion guidance module are fed into an ERF network, which extracts the later guidance information, rich in depth information, under the constraint of the surface normal. The later guidance information guides the sparse depth densification. At the same time, the sparse depth is introduced again to recover the depth information ignored in the early stage, giving the dense depth map of this path. On the RGB information guidance path, the initial guidance information guides the fusion of the sparse depth with the information extracted from the RGB image and reduces the influence of noise and sparsity in the sparse depth. The multi-modal information fusion guidance module extracts the midterm guidance information and an initial dense depth map, both rich in depth information. The initial dense depth map still contains errors, so it is corrected by the refined module to obtain an accurate dense depth map. The network fuses sparse depth and guidance information by addition, which effectively guides sparse depth densification, and fuses other information by concatenation, which retains the distinct features of each source so that the network or module can extract more features. Overall, the initial guidance information is extracted from the input, promotes the construction of the surface normal, and guides the fusion of sparse depth and RGB information. The midterm guidance information is obtained by the multi-modal information fusion guidance module and is the key link between the two paths. The later guidance information is obtained by fusing the midterm guidance information with the surface normal and guides the sparse depth densification. In terms of the two paths, on the guidance information guidance path a dense depth map is constructed by using the initial, midterm, and later guidance information to guide the sparse depth; on the RGB information guidance path, the multi-modal information fusion guidance module guides the sparse depth through the RGB information.

Result The proposed network is implemented using PyTorch and the Adam optimizer, with β1=0.9 and β2=0.999. The images input to the network are cropped to 256×512 pixels, the graphics card is an NVIDIA 3090, the batch size is 6, and training runs for 30 epochs. The initial learning rate is 0.000 125 and is halved every 5 epochs. The Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) depth estimation dataset contains more than 93 000 ground-truth depth maps with aligned sparse LiDAR depth data and RGB images. A total of 85 898 samples are used for training, and the officially distributed 1 000 validation samples with ground truth and 1 000 test samples without ground truth are used for testing. Results on the validation set can be evaluated directly because it has ground truth. Results on the test set, which has no ground truth, must be submitted to the official KITTI evaluation server to obtain a public evaluation, which is an important basis for a fair assessment of model performance. Neither the validation set nor the test set participates in training. The root mean square error (RMSE) and the root mean square error of the inverse depth (iRMSE) are lower than those of the other methods, and the depth at object edges and details is more accurate.

Conclusion A multi-stage guidance network for constructing dense depth maps from LiDAR and RGB information is presented in this paper. The guidance information processing module promotes the fusion of guidance information and sparse depth. The multi-modal information fusion guidance module learns a large amount of depth information from the sparse depth and the RGB image. The refined module corrects the output of the multi-modal information fusion guidance module. The final dense depth map is built from the guidance information guidance path and the RGB information guidance path; the two strategies complement each other and use more information to obtain a more accurate dense depth map. Experiments on the KITTI depth estimation dataset show that the multi-stage guidance network handles the depth at object edges and details effectively and improves the construction quality of dense depth maps.

Key words

depth estimation; deep learning; LiDAR; multi-modal data fusion; image processing

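0 Introduction

In fields such as autonomous driving, augmented reality, and robot navigation, acquiring accurate depth information is particularly important. Depth can be acquired by passive or active range sensing. Passive range sensing computes a dense disparity map for an image pair with a stereo matching algorithm and derives depth by triangulation; however, it is strongly affected by camera resolution and the stereo baseline, so the disparity accuracy is limited. Active range sensing obtains depth by emitting and collecting energy with the sensor itself, mainly through time of flight (TOF), structured light, and light detection and ranging (LiDAR) scanning. Owing to its wide measuring range and high accuracy, LiDAR is widely used in artificial intelligence systems for 3D spatial perception. However, the depth information that LiDAR captures in a scene is usually sparse and is strongly affected by the motion of the sensor and of moving objects in the scene, so the collected depth is noisy.

To solve these problems, Ku et al. (2018) took sparse depth as input and inferred the missing depth values to obtain a dense depth map. However, the depth that LiDAR captures at distant objects and object edges is ambiguous, so it is difficult to infer the missing depth at these locations. Studies have shown that RGB information can be used to construct dense depth maps effectively (Huang et al., 2019; Zhou et al., 2021). Some researchers have proposed guiding sparse depth densification with an RGB image, using the rich information in the image to improve the quality of the dense depth map. Wang et al. (2018) built a multi-scale fusion module that fuses the RGB image and the sparse depth at different scales and learns the correlation between them to extract depth information. Ma et al. (2019) also extracted depth information through multi-scale learning; unlike Wang et al. (2018), their method first concatenates the RGB image and the sparse depth into a 4D tensor for early fusion and then extracts depth information. The counterpart is late fusion: Shivakumar et al. (2019) extracted features from the RGB image and the sparse depth separately and then fused them to extract depth information. Compared with early fusion, late fusion can extract more contextual information from the RGB image and the sparse depth and therefore retains more detail. Zhao et al. (2021) captured spatial information by graph propagation to obtain more context in the scene. Information extracted from the RGB image can also guide the densification of sparse depth; Imran et al. (2019) extracted rich semantic cues from the RGB image to guide dense depth map construction. Many other methods also fuse multi-modal information and extract depth from it. Tang et al. (2020) learned adaptive convolution kernel sizes and numbers of propagation iterations to dynamically allocate the required context and computation to each pixel. Yan et al. (2020) processed and fused sparse features with a mask-aware operation to learn more depth information. To address the insufficient representation ability of individual modalities, Lee et al. (2020) used cross guidance in multi-modal feature fusion. Park et al. (2020) learned affinity combinations in multi-modal information, which also improves dense depth map construction. Xu et al. (2019) showed that introducing surface normal information when constructing dense depth maps effectively reduces the influence of noise in sparse LiDAR point clouds.

Inspired by the above methods, this paper constructs dense depth maps by using a single RGB image to guide sparse depth. It introduces the deformable convolution proposed by Dai et al. (2017) and the efficient residual factorized (ERF) network proposed by Romera et al. (2018), and uses surface normal information to improve the quality of the dense depth map through geometric constraints. In the strategy of the multi-stage guidance network (MsG), the depth and guidance information extracted within the network play the leading role, and the information extracted from the RGB image serves as secondary guidance that guides sparse depth densification and corrects errors in the depth information. Overall, the construction of the dense depth map is divided into a guidance information guidance path and an RGB information guidance path, and the information from the two paths is complemented and integrated to obtain the final dense depth map of the multi-stage guidance network. The main contributions of this paper are as follows: 1) a multi-stage guidance network is constructed that effectively handles depth at object edges and details and improves the accuracy of dense depth map construction; 2) a multi-modal information fusion guidance module is built that extracts depth information while fusing multi-modal information; 3) a refined module is built to correct the output of the multi-modal information fusion guidance module.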

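1 Method

Fig. 1 shows the structure of the multi-stage guidance network, which consists mainly of a guidance information guidance path and an RGB information guidance path. On the guidance information guidance path, an ERF network first fuses the sparse depth and the RGB image to obtain the initial guidance information, which is fed together with the sparse depth into the guidance information processing module to construct the surface normal. Next, the midterm guidance information obtained by the multi-modal information fusion guidance module and the surface normal are fed into an ERF network, which extracts the later guidance information, rich in depth information, under the constraint of the surface normal. The later guidance information then guides sparse depth densification, and the sparse depth is introduced once more to recover the depth information ignored in earlier stages, yielding the dense depth map of this path. On the RGB information guidance path, the initial guidance information guides the fusion of the sparse depth with the information extracted from the RGB image and reduces the influence of noise and sparsity in the sparse depth. The multi-modal information fusion guidance module then extracts the midterm guidance information and an initial dense depth map, both rich in depth information. Because the initial dense depth map still contains errors, it is corrected by the refined module to obtain an accurate dense depth map on this path.

Fig. 1 Multi-stage guidance network structure overview

The network fuses sparse depth and guidance information by addition, which effectively guides sparse depth densification. Fusing information by concatenation preserves the distinct features of each source, so that the network or module can extract more features.

Overall, the initial guidance information is extracted from the input; it promotes the construction of the surface normal and guides the fusion of sparse depth and RGB information. The multi-modal information fusion guidance module extracts the midterm guidance information, which is the key link between the two paths. The midterm guidance information is fused with the surface normal to build the later guidance information, which guides the sparse depth toward a dense depth map. In terms of the two paths, on the guidance information guidance path the information-rich initial, midterm, and later guidance information guide the sparse depth toward a dense depth map; on the RGB information guidance path, the multi-modal information fusion guidance module guides the densification of the sparse depth through RGB information. Integrating the results of the two paths gives better results at object details and edges.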

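1.1 Guidance information processing module

The guidance information processing module fuses guidance information with sparse depth to obtain depth information. It is used both to construct the depth features on the guidance information guidance path and to construct the surface normal. When constructing the depth features on the guidance information guidance path, the obtained depth information is used directly. When constructing the surface normal, the depth information is mapped to the ground-truth surface normal so that the surface normal is built from depth, which strengthens the correlation between depth and surface normal information. To strengthen the guiding role of the guidance information and promote its fusion with the sparse depth, the network shown in Fig. 2 performs the fusion, where the label "1" denotes features at the same size as the input, and "1/2" and "1/4" denote features at 1/2 and 1/4 of the input size, respectively.

Fig. 2 The main structure of the guidance information processing module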

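1.2 Multi-modal information fusion guidance module

To better extract depth information, in the multi-modal information fusion guidance module the initial guidance information guides the downsampling of the sparse depth, and the fused sparse depth and guidance information, referred to as the fusion information, undergo feature extraction together. Information extracted from the RGB image guides the densification of the fusion information and removes depth errors from it. The module is shown in Fig. 3, where the label "1" denotes features at the same size as the input, and "1/2", "1/4", "1/8", and "1/16" denote features at 1/2, 1/4, 1/8, and 1/16 of the input size, respectively.

Fig. 3 Multi-modal information fusion guidance module

To extract richer features, residual blocks (Fig. 4, where BN denotes batch normalization) are used for multi-scale downsampling. The shallow features are first extracted as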

$ \boldsymbol{D}_{1}=R_{3 \times 3}^{1}\left(R_{3 \times 3}^{1}\left(\boldsymbol{D}_{0}\right)\right) $

(1)

$ \boldsymbol{F}_{1}=R_{3 \times 3}^{1}\left(R_{3 \times 3}^{1}\left(\boldsymbol{F}_{0}\right)\right) $

(2)

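Fig. 4 Residual block

where $\boldsymbol{D}_{1}$ denotes the extracted shallow RGB features, $\boldsymbol{F}_{1}$ denotes the extracted shallow fused features, $R^{1}_{3 \times 3}$ is a residual block with a 3×3 kernel and a stride of 1, $\boldsymbol{D}_{0}$ is the input RGB image, and $\boldsymbol{F}_{0}$ is the input fusion information.

The multi-scale features are extracted as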

$ \boldsymbol{D}_{1 / 2 n}=R_{3 \times 3}^{2}\left(\boldsymbol{D}_{1 / n}\right) $

(3)

$ \boldsymbol{F}_{1 / 2 n}=R_{3 \times 3}^{2}\left(\boldsymbol{F}_{1 / n}\right) $

(4)

where $n=1, 2, 4, 8$; $\boldsymbol{D}_{1/2}$, $\boldsymbol{D}_{1/4}$, $\boldsymbol{D}_{1/8}$, and $\boldsymbol{D}_{1/16}$ denote the RGB features at 1/2, 1/4, 1/8, and 1/16 of the input image size; $\boldsymbol{F}_{1/2}$, $\boldsymbol{F}_{1/4}$, $\boldsymbol{F}_{1/8}$, and $\boldsymbol{F}_{1/16}$ denote the fused features at 1/2, 1/4, 1/8, and 1/16 of the input image size; and $R^{2}_{3 \times 3}$ is a residual block with a 3×3 kernel and a stride of 2.
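As a rough check on Eqs. (1)-(4), the following sketch traces the spatial size of the feature maps through the stride-1 and stride-2 residual blocks for the 256×512 training crop described in Section 2.1.3. The padding of 1 and the helper names `conv_out` and `pyramid_shapes` are assumptions made for illustration; the paper does not state the padding or give an implementation.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one axis (standard formula).

    padding=1 is an assumption; it keeps the size under stride 1.
    """
    return (size + 2 * padding - kernel) // stride + 1


def pyramid_shapes(height, width, levels=4):
    """Spatial sizes of D_1 (or F_1) and its four stride-2 descendants."""
    # Eqs. (1)-(2): two stride-1 residual blocks leave the size unchanged.
    h = conv_out(conv_out(height))
    w = conv_out(conv_out(width))
    shapes = {"1": (h, w)}
    scale = 1
    for _ in range(levels):
        # Eqs. (3)-(4): one stride-2 residual block halves each axis.
        h, w = conv_out(h, stride=2), conv_out(w, stride=2)
        scale *= 2
        shapes[f"1/{scale}"] = (h, w)
    return shapes


shapes = pyramid_shapes(256, 512)
for name, (h, w) in shapes.items():
    print(f"scale {name}: {h} x {w}")
```

Under these assumptions the coarsest level, at 1/16 scale, is 16×32, which is the input to the first upsampling step.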

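During feature fusion, features at different scales are assigned different fusion ratios; in the concatenations before the first and the last upsampling steps, the fused features can be given a larger ratio coefficient. In the upsampling and fusion operation, the RGB features and the fused features at each scale are added to obtain the dense information increment of each feature at that scale. Specifically,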

$ \boldsymbol{A}_{1 / m}=\boldsymbol{D}_{1 / m}+\boldsymbol{F}_{1 / m} $

(5)

where $m=16, 8, 4, 2$, and $\boldsymbol{A}_{1/16}$, $\boldsymbol{A}_{1/8}$, $\boldsymbol{A}_{1/4}$, and $\boldsymbol{A}_{1/2}$ denote the dense information at 1/16, 1/8, 1/4, and 1/2 of the input image size.

The first upsampling step is

$ \boldsymbol{U}_{1 / 8}=T\left(C\left(\boldsymbol{A}_{1 / 16}, \boldsymbol{D}_{1 / 16}, \boldsymbol{F}_{1 / 16}\right)\right) $

(6)

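where $C$ denotes concatenation, $T$ denotes transposed convolution (the upsampling operation), and $\boldsymbol{U}_{1/8}$ is the upsampled result at 1/8 of the input image size.

For the second to fourth upsampling steps, let $k=4, 2, 1$; the upsampled result is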

$ \boldsymbol{U}_{1 / k}=T\left(C\left(\boldsymbol{A}_{1 / 2 k}, \boldsymbol{D}_{1 / 2 k}, \boldsymbol{U}_{1 / 2 k}\right)\right) $

(7)

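where $\boldsymbol{U}_{1/4}$, $\boldsymbol{U}_{1/2}$, and $\boldsymbol{U}_{1}$ denote the upsampled results at 1/4, 1/2, and the full size of the input image, respectively.

The multi-stage fusion map $\boldsymbol{U}_{0}$ is extracted by combining the shallow fused features with the feature-rich $\boldsymbol{U}_{1}$: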

$ \boldsymbol{U}_{0}=R_{3 \times 3}^{1}\left(C\left(\boldsymbol{F}_{1}, \boldsymbol{U}_{1}\right)\right) $

(8)

The midterm guidance information $\boldsymbol{M}_{g}$ and the initial dense depth map $\boldsymbol{M}_{d}$ are then extracted from the multi-stage fusion map:

$ \boldsymbol{M}_{d}=C_{3 \times 3}^{1}\left(\boldsymbol{U}_{0}\right) $

(9)

$ \boldsymbol{M}_{g}=S\left(C_{3 \times 3}^{1}\left(\boldsymbol{U}_{0}\right)\right) $

(10)

where $C^{1}_{3 \times 3}$ denotes a 2D convolution with a 3×3 kernel and a stride of 1, and $S$ denotes the sigmoid activation function.
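The order of operations in Eqs. (6) and (7) can be made concrete with a small schedule that lists, for each upsampling step, which tensors are concatenated and what spatial size the transposed convolution produces. This is only an illustrative sketch: the transposed-convolution hyperparameters (kernel 3, stride 2, padding 1, output padding 1) and the function names are assumptions, not values taken from the paper.

```python
def tconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed convolution along one axis.

    The hyperparameters are assumed; they were chosen so that each
    step exactly doubles the spatial size.
    """
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def decoder_schedule(h16, w16):
    """List (concatenated inputs, output name, output size) per step."""
    steps = []
    h, w, scale = h16, w16, 16
    inputs = ("A_1/16", "D_1/16", "F_1/16")  # operands of Eq. (6)
    while scale > 1:
        h, w = tconv_out(h), tconv_out(w)
        scale //= 2
        out = f"U_1/{scale}" if scale > 1 else "U_1"
        steps.append((inputs, out, (h, w)))
        # Operands of Eq. (7): the previous output replaces F.
        inputs = (f"A_1/{scale}", f"D_1/{scale}", out)
    return steps


steps = decoder_schedule(16, 32)
for inputs, out, size in steps:
    print(f"{out} = T(C{inputs}) -> {size[0]} x {size[1]}")
```

The schedule shows that the fused features $\boldsymbol{F}$ enter the concatenation only at the coarsest scale in Eq. (6) and again at full resolution in Eq. (8); the intermediate steps reuse the previous upsampled result.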

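1.3 Refined module

A regular convolution samples the input feature map on a regular grid $\boldsymbol{R}$ and computes a weighted sum of the samples with the kernel $\omega$. $\boldsymbol{R}$ defines the size and dilation of the receptive field: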

$ \boldsymbol{R}=\{(-1,-1), \cdots,(0,1),(1,1)\} $

(11)

For a 3×3 kernel with a dilation of 1, the output $y(p_{0})$ at each position $p_{0}$ on the feature map is

$ y\left(p_{0}\right)=\sum\limits_{p_{n} \in \boldsymbol{R}} \omega\left(p_{n}\right) \cdot x\left(p_{0}+p_{n}\right) $

(12)

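where $p_{n}$ enumerates the positions in $\boldsymbol{R}$.

In deformable convolution, the regular grid $\boldsymbol{R}$ is augmented with offsets $\{\Delta p_{n} \mid n=1, 2, \cdots, N\}$, where $N=|\boldsymbol{R}|$. In addition, a weight $\Delta m_{n}$ is predicted for each sampling point, so the output $y(p_{0})$ becomes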

$ y\left(p_{0}\right)=\sum\limits_{p_{n} \in \boldsymbol{R}} \omega\left(p_{n}\right) \cdot x\left(p_{0}+p_{n}+\Delta p_{n}\right) \cdot \Delta m_{n} $

(13)

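Fig. 5 shows the structure of the refined module. To reduce the errors in the initial dense depth map, the feature-rich multi-stage fusion map built in the multi-modal information fusion guidance module is used to extract the offsets $\Delta p_{n}$ (in $x$ and $y$) for the deformable convolution. The initial dense depth map and the offsets are fed into the deformable convolution, which refines the initial dense depth map and reduces its errors, giving the depth features on the RGB information guidance path.

Fig. 5 Structure of refined module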
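To make Eqs. (12) and (13) concrete, the sketch below evaluates a 3×3 convolution at a single position of a tiny single-channel map, first on the regular grid and then with per-sample offsets and modulation weights. It is a plain-Python illustration, not the authors' implementation: bilinear interpolation for fractional sampling positions follows Dai et al. (2017), while zero padding outside the map and all function names are assumptions.

```python
import math

# Regular grid R of Eq. (11), stored as (row, column) displacements.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(x, py, px):
    """Sample the 2D list x at a fractional position; zero outside."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(py - yy)) * (1 - abs(px - xx))
                value += weight * x[yy][xx]
    return value


def deform_conv_at(x, kernel, p0, offsets=None, masks=None):
    """Eq. (13) at position p0; reduces to Eq. (12) with the defaults."""
    offsets = offsets or [(0.0, 0.0)] * len(GRID)  # delta p_n
    masks = masks or [1.0] * len(GRID)             # delta m_n
    y0, x0 = p0
    total = 0.0
    for n, (dy, dx) in enumerate(GRID):
        oy, ox = offsets[n]
        sample = bilinear(x, y0 + dy + oy, x0 + dx + ox)
        total += kernel[n] * sample * masks[n]
    return total


feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]
ones = [1.0] * 9

regular = deform_conv_at(feature, ones, (1, 1))
shifted = deform_conv_at(feature, ones, (1, 1), offsets=[(0.0, 0.5)] * 9)
print(regular, shifted)
```

In the refined module the offsets would be predicted from the multi-stage fusion map rather than fixed by hand as in this toy example; shifting every sample half a pixel to the right changes the response because the right-most samples partly fall outside the map.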

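1.4 Result output module

The dense depth maps on the two paths are computed from the input depth features, and the depth features are also used to compute the combination weight of each path, as shown in Fig. 6. The final dense depth map is then computed as shown in Fig. 7, where $ \otimes $ denotes multiplication. The calculation is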

$ \hat{\boldsymbol{d}}_{0}=\omega_{m} \cdot \hat{\boldsymbol{d}}_{m}+\omega_{d} \cdot \hat{\boldsymbol{d}}_{d} $

(14)

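Fig. 6 Combination weight calculation process

Fig. 7 The main structure of the output result module

where $\hat{\boldsymbol{d}}_{m}$ and $\hat{\boldsymbol{d}}_{d}$ denote the dense depth maps obtained on the RGB information guidance path and the guidance information guidance path, respectively, $\omega_{m}$ and $\omega_{d}$ denote the corresponding combination weights, and $\hat{\boldsymbol{d}}_{0}$ is the final dense depth map.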
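A minimal per-pixel sketch of the combination in Eq. (14) is given here. The paper computes the combination weights from the depth features (Fig. 6) without giving a formula in the text, so this sketch assumes, purely for illustration, that each path outputs a confidence map and that the two confidences are normalized with a softmax; the names `fuse_depths`, `c_m`, and `c_d` are hypothetical.

```python
import math


def fuse_depths(d_m, d_d, c_m, c_d):
    """Eq. (14) per pixel: d_0 = w_m * d_m + w_d * d_d.

    d_m, d_d: depth maps of the two paths (lists of rows).
    c_m, c_d: assumed per-pixel confidence maps; the softmax over
    them is an illustrative choice, not the authors' stated method.
    """
    fused = []
    for row_m, row_d, row_cm, row_cd in zip(d_m, d_d, c_m, c_d):
        out_row = []
        for dm, dd, cm, cd in zip(row_m, row_d, row_cm, row_cd):
            top = max(cm, cd)  # subtract the max for numerical stability
            e_m, e_d = math.exp(cm - top), math.exp(cd - top)
            w_m = e_m / (e_m + e_d)
            w_d = e_d / (e_m + e_d)
            out_row.append(w_m * dm + w_d * dd)
        fused.append(out_row)
    return fused


depth_rgb_path = [[10.0, 30.0]]
depth_guidance_path = [[20.0, 40.0]]
conf_rgb_path = [[0.0, 50.0]]
conf_guidance_path = [[0.0, 0.0]]

fused = fuse_depths(depth_rgb_path, depth_guidance_path,
                    conf_rgb_path, conf_guidance_path)
print(fused)
```

With equal confidences the result is the plain average of the two paths, and when one confidence dominates the fused depth follows that path, which is the complementary behaviour the two-path design aims for.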

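2 Experiments

2.1 Experimental details

2.1.1 Dataset

The KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) depth estimation dataset built by Uhrig et al. (2017) contains more than 93 000 ground-truth depth maps with aligned sparse LiDAR depth maps and RGB images, each 1 242×375 pixels. Of these, 85 898 are used for training, and the officially provided sets (1 000 validation samples with ground truth and 1 000 test samples without ground truth) are used for testing. Because the validation set has ground truth, results on it can be evaluated directly. The test set has no ground truth, so the results must be submitted to the official KITTI evaluation server to obtain a public evaluation, which is an important basis for a fair assessment of model performance. Neither the validation set nor the test set takes part in training. In addition, the ground-truth surface normals are computed from the ground-truth depth in the KITTI depth estimation dataset (Silberman et al., 2012).

2.1.2 Evaluation metrics

The constructed dense depth maps are evaluated with the same metrics as the official KITTI evaluation server: root mean square error (RMSE), mean absolute error (MAE), root mean square error of the inverse depth (iRMSE), and mean absolute error of the inverse depth (iMAE). MAE measures the average error of the constructed depth map. RMSE reflects the construction error in distant regions and at object details and edges; it is more sensitive to outliers and is the metric that most strongly determines the ranking on the official KITTI evaluation server (Lu et al., 2020). iMAE and iRMSE are computed on the reciprocal of depth (inverse depth) and measure the construction error in nearby regions (Bai et al., 2020). The official KITTI evaluation server is at http://www.cvlibs.net/datasets/kitti/. The metrics are defined as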

$ \mathrm{RMSE}=\sqrt{\frac{1}{n} \sum\limits_{i=1}^{n}\left(\hat{\boldsymbol{y}}_{i}-\boldsymbol{y}_{i}\right)^{2}} $

(15)

$ \mathrm{MAE}=\frac{1}{n} \sum\limits_{i=1}^{n}\left|\hat{\boldsymbol{y}}_{i}-\boldsymbol{y}_{i}\right| $

(16)

$ \mathrm{iRMSE}=\sqrt{\frac{1}{n} \sum\limits_{i=1}^{n}\left(\frac{1}{\hat{\boldsymbol{y}}_{i}}-\frac{1}{\boldsymbol{y}_{i}}\right)^{2}} $

(17)

$ \mathrm{iMAE}=\frac{1}{n} \sum\limits_{i=1}^{n}\left|\frac{1}{\hat{\boldsymbol{y}}_{i}}-\frac{1}{\boldsymbol{y}_{i}}\right| $

(18)

where $\hat{\boldsymbol{y}}_{i}$ denotes the constructed dense depth map, $\boldsymbol{y}_{i}$ denotes the ground-truth depth, and $n$ is the number of pixels evaluated.
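The four metrics in Eqs. (15)-(18) can be sketched in a few lines. The units follow the tables in this paper (depth in mm, inverse depth in 1/km). Restricting the evaluation to pixels whose ground-truth depth is positive is an assumption made here because the ground truth is not fully dense, and the function name `depth_metrics` is hypothetical; the official server's code may differ in detail.

```python
import math


def depth_metrics(pred_mm, gt_mm):
    """RMSE and MAE in mm, iRMSE and iMAE in 1/km, over valid pixels.

    pred_mm, gt_mm: flat lists of depths in millimetres. Pixels with
    gt <= 0 are treated as missing and skipped (an assumption), and
    predictions are assumed to be strictly positive.
    """
    pairs = [(p, g) for p, g in zip(pred_mm, gt_mm) if g > 0]
    n = len(pairs)
    err = [p - g for p, g in pairs]
    # 1 km = 1e6 mm, so inverse depth in 1/km is 1e6 / depth_in_mm.
    inv_err = [1e6 / p - 1e6 / g for p, g in pairs]
    return {
        "RMSE": math.sqrt(sum(e * e for e in err) / n),
        "MAE": sum(abs(e) for e in err) / n,
        "iRMSE": math.sqrt(sum(e * e for e in inv_err) / n),
        "iMAE": sum(abs(e) for e in inv_err) / n,
    }


# Third pixel has no ground truth and is ignored.
metrics = depth_metrics([1000.0, 2000.0, 5000.0], [1000.0, 4000.0, 0.0])
print(metrics)
```

In this toy example a 2 m error on a point 4 m away contributes as much to MAE as it would on a point 40 m away, whereas the inverse-depth metrics weight it far more heavily for the near point, which is why iRMSE and iMAE emphasise nearby regions.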

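2.1.3 Training

Training is implemented with PyTorch and the Adam optimizer, with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. The images input to the network are cropped to 256×512 pixels, the graphics card is an NVIDIA 3090, the batch size is 6, and training runs for 30 epochs. The initial learning rate is 0.000 125 and is halved every 5 epochs. The loss function of the network is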

$ \begin{gathered} { Loss }=\omega_{1} {loss}_{d}\left(\hat{\boldsymbol{y}}_{F}, \boldsymbol{y}\right)+\omega_{2} {loss}_{d}\left(\hat{\boldsymbol{y}}_{M}, \boldsymbol{y}\right)+ \\ \omega_{3} {loss}_{d}\left(\hat{\boldsymbol{y}}_{D}, \boldsymbol{y}\right)+\omega_{4} {loss}_{n}\left(\hat{\boldsymbol{y}}_{S}, \boldsymbol{y}_{n}\right) \end{gathered} $

(19)

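where $\boldsymbol{y}$ denotes the ground-truth depth, $\boldsymbol{y}_{n}$ denotes the ground-truth surface normal, $\hat{\boldsymbol{y}}_{F}$, $\hat{\boldsymbol{y}}_{M}$, and $\hat{\boldsymbol{y}}_{D}$ denote the final dense depth map and the dense depth maps constructed on the two paths, and $\hat{\boldsymbol{y}}_{S}$ denotes the constructed surface normal. Following the parameter choices in related literature, the weights are set to $\omega_{1}=0.6$, $\omega_{2}=0.3$, $\omega_{3}=0.3$, and $\omega_{4}=0.2$. $loss_{d}$ denotes the L2 loss and $loss_{n}$ denotes the cosine loss.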

$ D_{\mathrm{L} 2}=\sum\limits_{i=1}^{n}\left(y_{i}-f\left(x_{i}\right)\right)^{2} $

(20)

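where $y_{i}$ denotes the ground-truth value and $f(x_{i})$ denotes the estimated value; this loss measures the error of the constructed dense depth map.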

$ \cos (\theta)=\frac{\sum\limits_{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum\limits_{i=1}^{n}\left(A_{i}\right)^{2}} \times \sqrt{\sum\limits_{i=1}^{n}\left(B_{i}\right)^{2}}} $

(21)

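where $A_{i}$ and $B_{i}$ denote the estimated and ground-truth values, respectively; this term measures the error of the constructed surface normal.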
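The training settings of this section can be sketched as follows: the step learning-rate schedule and the weighted loss of Eq. (19) built from Eqs. (20) and (21). Several details are assumptions made for illustration and are labelled in the comments: epochs are counted from 0, the cosine loss is taken as the mean of 1 − cos(θ) over pixels (the paper gives only the cosine itself), and all function names are hypothetical.

```python
import math

LOSS_WEIGHTS = (0.6, 0.3, 0.3, 0.2)  # omega_1 .. omega_4 from the text


def step_lr(epoch, base_lr=0.000125, step=5, gamma=0.5):
    """Learning rate for a 0-indexed epoch (indexing is an assumption)."""
    return base_lr * gamma ** (epoch // step)


def l2_loss(pred, target):
    """Eq. (20): sum of squared differences."""
    return sum((t - p) ** 2 for p, t in zip(pred, target))


def cosine(a, b):
    """Eq. (21): cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def normal_loss(pred_normals, true_normals):
    """Assumed form of the cosine loss: mean of (1 - cos) over pixels."""
    terms = [1.0 - cosine(a, b) for a, b in zip(pred_normals, true_normals)]
    return sum(terms) / len(terms)


def total_loss(y_f, y_m, y_d, y, n_s, n_gt, weights=LOSS_WEIGHTS):
    """Eq. (19): weighted sum of three depth losses and one normal loss."""
    w1, w2, w3, w4 = weights
    return (w1 * l2_loss(y_f, y) + w2 * l2_loss(y_m, y)
            + w3 * l2_loss(y_d, y) + w4 * normal_loss(n_s, n_gt))


depth_gt = [1.0, 2.0]
normals_gt = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
perfect = total_loss(depth_gt, depth_gt, depth_gt, depth_gt,
                     normals_gt, normals_gt)
off_by_one = total_loss([2.0, 2.0], depth_gt, depth_gt, depth_gt,
                        normals_gt, normals_gt)
rates = [step_lr(epoch) for epoch in (0, 4, 5, 29)]
print(perfect, off_by_one, rates)
```

Under this schedule the learning rate is halved five times over the 30 epochs, ending at 1/32 of its initial value.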

The network is trained under the above settings and tested on the KITTI validation set; the results are shown in Fig. 8 and Table 1. The model is also run on the test set and the results are submitted to the official KITTI evaluation server; the results are shown in Fig. 9 and Table 2.

Fig. 8 The dense depth map construction result on the KITTI validation set

((a) LiDAR; (b) RGB information guidance path dense depth map construction result; (c) guidance information guidance path dense depth map construction result; (d) RGB; (e) final dense depth map construction result)

Table 1 The dense depth map construction performance of different paths on KITTI validation dataset

Path	RMSE/mm	MAE/mm	iRMSE/(1/km)	iMAE/(1/km)
Dense depth map from the RGB information guidance path	828.36	256.86	2.92	1.20
Dense depth map from the guidance information guidance path	807.71	238.67	2.50	1.18
Dense depth map from the multi-stage guidance network	803.46	235.33	2.45	1.16
Note: bold indicates the best result in each column.

Fig. 9 The dense depth map construction results of different methods on KITTI test set

((a) RGB; (b) NConv-CNN-L2 (Eldesokey et al., 2020); (c) Sparse-to-Dense (Ma et al., 2019); (d) Cross Guidance (Lee et al., 2020); (e) Revisiting (Yan et al., 2020); (f) depth-normal constraints (Xu et al., 2019); (g) MsG (ours))

Table 2 The dense depth map construction performance of different methods on KITTI test set

Method	RMSE/mm	MAE/mm	iRMSE/(1/km)	iMAE/(1/km)	Time/s
DFineNet (Zhang et al., 2019)	943.89	304.17	3.21	1.39	0.02
IR_L2 (Lu et al., 2020)	901.43	292.36	4.92	1.35	0.05
SSGP (Schuster et al., 2021)	838.22	247.70	2.51	1.09	0.14
NConv-CNN-L2 (Eldesokey et al., 2020)	829.98	233.26	2.60	1.03	0.02
Sparse-to-Dense (Ma et al., 2019)	814.73	249.95	2.80	1.21	0.08
Cross Guidance (Lee et al., 2020)	807.42	253.98	2.73	1.33	0.20
Revisiting (Yan et al., 2020)	792.80	225.81	2.42	0.99	0.05
depth-normal constraints (Xu et al., 2019)	777.05	235.17	2.42	1.13	0.10
MsG (ours)	768.35	235.55	2.40	1.17	0.24
Note: bold indicates the best result in each column. IR_L2 stands for image reconstruction_L2; SSGP stands for sparse spatial guided propagation.

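2.2 Experimental results

The sparse depth in the KITTI depth estimation dataset contains some interleaved information. As shown in Fig. 8(a), the depth of the road poles and that of the scene behind them are mixed at the edges, which clearly differs from the RGB image in Fig. 8(d). The experimental results show that the dense depth maps constructed on the RGB information guidance path (Fig. 8(b)) and on the guidance information guidance path (Fig. 8(c)) both correct this error well, and the final dense depth map (Fig. 8(e)) likewise separates foreground from background in fine detail. In addition, as shown by the red box in Fig. 8(a), there is almost no depth information between the two road poles, whereas the final dense depth map (Fig. 8(e)) completes the missing depth well. This compensates for the sparsity of the LiDAR point cloud and verifies the effectiveness of the proposed method on the KITTI validation set.

The proposed multi-stage guidance network constructs the dense depth map by integrating the results of the guidance information guidance path and the RGB information guidance path. Extracting depth information from the RGB image also introduces errors, so the refined module corrects the output of the multi-modal information fusion guidance module, and the surface normal is additionally introduced to correct the midterm guidance information, which helps keep the information in the network accurate. Fig. 9 compares the proposed method with several other methods. In the left column, the other methods produce only blurred depth at the edges of the nearby car (red box), whereas the proposed method constructs clear edge depth, and the depth it constructs for the distant trees (blue box) is also clear. In the right column, the dense depth map obtained by the multi-stage guidance network is finer and more accurate than those of the other methods on small road signs both far away (red box) and nearby (blue box). The proposed method therefore makes good use of RGB and LiDAR information and handles the depth at object edges and details better, which improves dense depth map construction.

After training the multi-stage guidance network, the dense depth maps output by each path and by the whole network are evaluated separately; the results are listed in Table 1. The result on the guidance information guidance path is better than that on the RGB information guidance path, which suggests that the RGB image used on the RGB information guidance path supplies additional guidance information. In addition, the guidance information on the guidance information guidance path provides effective guidance and leads to a good dense depth map. The multi-stage guidance network achieves the best values on the important RMSE and iRMSE metrics (see Table 2). Compared with the method of Yan et al. (2020), which also achieves the best value on two metrics (MAE and iMAE in Table 2), the proposed method has a clear advantage in depth at object edges and details (see Fig. 9). Overall, the result of the multi-stage guidance network is better than the depth extracted by either path alone, which verifies that using two paths, one guided by RGB information and one by guidance information, to densify sparse depth is effective. The two strategies complement each other and use more information to obtain a more accurate dense depth map.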

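2.3 Ablation experiments

Experiments under different settings verify the effectiveness of each module and path, including the guidance information processing module, the refined module, the guidance information guidance path, and the RGB information guidance path. To reduce training time, each configuration of the multi-stage guidance network is trained for 10 epochs with the initial learning rate changed to 0.001; all other parameters are the same as in Section 2.1.3. The results obtained under these settings are listed in Table 3. The full multi-stage guidance network achieves the best performance, which verifies that all of its modules and paths are effective.

Table 3 Results of ablation experiment for the dense depth map construction performance of different paths and modules on KITTI validation set

Experimental setting	RMSE/mm	MAE/mm	iRMSE/(1/km)	iMAE/(1/km)
RGB information guidance path	852.30	285.36	3.34	1.51
Guidance information guidance path	1 032.71	308.96	3.88	1.62
Multi-stage guidance network (all modules and paths)	831.00	256.59	2.81	1.31
Without the guidance information processing module	903.06	333.16	3.64	1.78
Without the refined module	841.37	262.54	2.94	1.35
Note: bold indicates the best result in each column.

In Table 3, when only the RGB information guidance path or only the guidance information guidance path is trained, the latter produces a dense depth map with higher error. When the whole network is trained, however, the dense depth map constructed on the guidance information guidance path is more accurate than that constructed on the RGB information guidance path (see Table 1). The midterm guidance information provided by the multi-modal information fusion guidance module is the key link between the two paths. The relatively complex structure of this module increases the overall time cost of the network (see the time column in Table 2), but it makes better use of LiDAR and RGB information to improve the quality of the dense depth map. The surface normal plays an important role in the network: under its constraint, the midterm guidance information builds better later guidance information. Removing the guidance information processing module from the multi-stage guidance network reduces the effectiveness of fusing guidance information with sparse depth. The dense depth map constructed in the multi-modal information fusion guidance module has its errors effectively reduced after passing through the refined module. In summary, with all the proposed modules and paths working together, the multi-stage guidance network constructs dense depth maps better.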

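3 Conclusion

This paper presents a multi-stage guidance network that combines LiDAR and RGB data to construct dense depth maps. The guidance information processing module promotes the fusion of guidance information and sparse depth, the multi-modal information fusion guidance module learns a large amount of depth information from the sparse depth and the RGB image, and the refined module corrects the output of the multi-modal information fusion guidance module. The network is realized through the joint action of two paths, one guided by RGB information and one by guidance information. Experiments on the KITTI depth estimation dataset show that, compared with other methods, the multi-stage guidance network handles the depth at object edges and details better, improves the construction quality of dense depth maps, and reduces the errors in the sparse depth. Ablation experiments verify the effectiveness of each module and path.

The proposed multi-stage guidance network improves the accuracy of dense depth map construction, but the constructed dense depth maps fall short on some metrics. For example, when the scene contains many objects that occlude one another, the constructed dense depth map contains some errors. There is still considerable room for improvement in this respect, which will be addressed in future work.

References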

Bai L, Zhao Y M, Elhousni M, Huang X M. 2020. DepthNet: real-time LiDAR point cloud depth completion for autonomous vehicles. IEEE Access, 8: 227825-227833 [DOI:10.1109/ACCESS.2020.3045681]

Dai J F, Qi H Z, Xiong Y W, Li Y, Zhang G D, Hu H and Wei Y C. 2017. Deformable convolutional networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 764-773[DOI: 10.1109/ICCV.2017.89]

Eldesokey A, Felsberg M, Khan F S. 2020. Confidence propagation through CNNs for guided sparse depth regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2423-2436 [DOI:10.1109/TPAMI.2019.2929170]

Huang J, Wang C, Liu Y, Bi T T. 2019. The progress of monocular depth estimation technology. Journal of Image and Graphics, 24(12): 2081-2097 (黄军, 王聪, 刘越, 毕天腾. 2019. 单目深度估计技术进展综述. 中国图象图形学报, 24(12): 2081-2097) [DOI:10.11834/jig.190455]

Imran S, Long Y F, Liu X M and Morris D. 2019. Depth coefficients for depth completion//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 12438-12447[DOI: 10.1109/CVPR.2019.01273]

Ku J, Harakeh A and Waslander S L. 2018. In defense of classical image processing: fast depth completion on the CPU//Proceedings of the 15th Conference on Computer and Robot Vision (CRV). Toronto, Canada: IEEE: 16-22[DOI: 10.1109/CRV.2018.00013]

Lee S, Lee J, Kim D, Kim J. 2020. Deep architecture with cross guidance between single image and sparse LiDAR data for depth completion. IEEE Access, 8: 79801-79810 [DOI:10.1109/ACCESS.2020.2990212]

Lu K Y, Barnes N, Anwar S and Zheng L. 2020. From depth what can you see? Depth completion via auxiliary image reconstruction//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11303-11312[DOI: 10.1109/CVPR42600.2020.01132]

Ma F C, Cavalheiro G V and Karaman S. 2019. Self-supervised sparse-to-dense: self-supervised depth completion from LiDAR and monocular camera//Proceedings of 2019 International Conference on Robotics and Automation (ICRA). Montreal, Canada: IEEE: 3288-3295[DOI: 10.1109/ICRA.2019.8793637]

Park J, Joo K, Hu Z, Liu C K and Kweon I S. 2020. Non-local spatial propagation network for depth completion//Proceedings of the 16th European Conference on Computer Vision (ECCV). Glasgow, UK: Springer: 120-136[DOI: 10.1007/978-3-030-58601-0_8]

Romera E, Álvarez J M, Bergasa L M, Arroyo R. 2018. ERFNet: efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1): 263-272 [DOI:10.1109/TITS.2017.2750080]

Schuster R, Wasenmüller O, Unger C and Stricker D. 2021. SSGP: sparse spatial guided propagation for robust and generic interpolation//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). Waikoloa, USA: 197-206[DOI: 10.1109/WACV48630.2021.00024]

Shivakumar S S, Nguyen T, Miller I D, Chen S W, Kumar V and Taylor C J. 2019. DFuseNet: deep fusion of RGB and sparse depth information for image guided dense depth completion//Proceedings of 2019 IEEE Intelligent Transportation Systems Conference (ITSC). Auckland, New Zealand: IEEE: 13-20[DOI: 10.1109/ITSC.2019.8917294]

Silberman N, Hoiem D, Kohli P and Fergus R. 2012. Indoor segmentation and support inference from RGBD images//Proceedings of the 12th European Conference on Computer Vision (ECCV). Florence, Italy: Springer: 746-760[DOI: 10.1007/978-3-642-33715-4_54]

Tang J, Tian F P, Feng W, Li J, Tan P. 2020. Learning guided convolutional network for depth completion. IEEE Transactions on Image Processing, 30: 1116-1129 [DOI:10.1109/TIP.2020.3040528]

Uhrig J, Schneider N, Schneider L, Franke U, Brox T and Geiger A. 2017. Sparsity invariant CNNs//Proceedings of 2017 International Conference on 3D Vision (3DV). Qingdao, China: IEEE: 11-20[DOI: 10.1109/3DV.2017.00012]

Wang B Z, Feng Y L, Liu H Z. 2018. Multi-scale features fusion from sparse LiDAR data and single image for depth completion. Electronics Letters, 54(24): 1375-1377 [DOI:10.1049/el.2018.6149]

Xu Y, Zhu X G, Shi J P, Zhang G F, Bao H J and Li H S. 2019. Depth completion from sparse LiDAR data with depth-normal constraints//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 2811-2820[DOI: 10.1109/ICCV.2019.00290]

Yan L, Liu K, Belyaev E. 2020. Revisiting sparsity invariant convolution: a network for image guided depth completion. IEEE Access, 8: 126323-126332 [DOI:10.1109/ACCESS.2020.3008404]

Zhang Y L, Nguyen T, Miller I D, Shivakumar S S, Chen S, Taylor C and Kumar V. 2019. DFineNet: ego-motion estimation and depth refinement from sparse, noisy depth input with RGB guidance[EB/OL]. [2021-05-23]. https://arxiv.org/pdf/1903.06397.pdf

Zhao S S, Gong M M, Fu H, Tao D C. 2021. Adaptive context-aware multi-modal network for depth completion. IEEE Transactions on Image Processing, 30: 5264-5276 [DOI:10.1109/TIP.2021.3079821]

Zhou D K, Tian J, Yang X. 2021. Unsupervised monocular image depth estimation based on the prediction of local plane parameters. Journal of Image and Graphics, 26(1): 165-175 (周大可, 田径, 杨欣. 2021. 结合局部平面参数预测的无监督单目图像深度估计. 中国图象图形学报, 26(1): 165-175) [DOI:10.11834/jig.200364]