Real-scale recovery of monocular camera trajectories

Liu Sibo1, Fang Lijin2 (1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China; 2. Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110169, China)

Abstract
Objective Monocular camera trajectory recovery lacks scale information because its only input is a monocular video sequence, so the generated trajectory drifts severely and cannot support high-precision applications. To exploit the high popularity and low cost of the monocular camera, a scene-geometry-based method is proposed to recover the real scale in the field of autonomous driving. Method First, a depth estimation network estimates the relative depth of consecutive images, and the estimated depth values are used to project pixels from the 2D image plane into 3D space. Then, forward-backward consistency is computed on the optical flow estimated by an optical flow network to obtain effective matching points, the pose is solved with a traditional method, and the scales of the relative depth and the pose are unified. Next, the relative depth values are used to compute a surface normal map and extract the ground point group; the camera height at the same scale is computed from the geometric relationship, and the prior camera height is introduced to obtain the initial scale. Finally, to reduce the scale deviation caused by image noise, a compensation scale computed by an additional vehicle detection module is weighted with the initial scale to obtain the final scale. Result Experiments are conducted on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) autonomous driving dataset, and both the camera trajectory and the image depth are improved in accuracy. The absolute error of the relative depth rescaled with the ground-truth depth is 0.114, and the absolute error of the absolute depth recovered with the proposed scale is 0.116. The recovered camera trajectories are compared on different complex paths: the error between the scale-recovered distance and the true distance is 2.67%, and the recovered trajectory is closer to the ground-truth trajectory than that of the traditional ORB-SLAM2 (oriented FAST and rotated BRIEF-simultaneous localization and mapping). Conclusion This paper takes only monocular camera images as input, adopts a self-supervised learning method on an autonomous driving dataset without ground-truth depth labels for training, and recovers the real scale with geometric constraints in the scene; both the recovered absolute depth and the trajectory are improved in accuracy. Compared with traditional methods, after the real scale is added, the offset error is lower, the computation is fast, and the robustness is high.
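To make the projection and matching steps above concrete, the following minimal NumPy sketch (an illustration under assumed interfaces and names, not the authors' implementation) back-projects pixels with the estimated relative depth and keeps only matches that pass a forward-backward optical flow consistency check:

```python
# Minimal sketch (not the paper's code): back-project pixels to 3D with the
# estimated relative depth, and keep matches whose forward and backward optical
# flows are consistent. Assumes a pinhole intrinsic matrix K and numpy arrays.
import numpy as np

def backproject(depth, K):
    """Lift every pixel (u, v) to a 3D point X = depth * K^-1 [u, v, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    rays = np.linalg.inv(K) @ pix                                      # 3 x N
    points = rays * depth.reshape(1, -1)                               # scale by depth
    return points.T.reshape(h, w, 3)

def flow_consistency_mask(flow_fwd, flow_bwd, thresh=1.0):
    """Mark pixels whose forward flow, followed by the backward flow sampled at
    the warped location, returns close to the start point (round-trip error)."""
    h, w, _ = flow_fwd.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    u2 = np.clip(np.round(u + flow_fwd[..., 0]).astype(int), 0, w - 1)
    v2 = np.clip(np.round(v + flow_fwd[..., 1]).astype(int), 0, h - 1)
    bwd = flow_bwd[v2, u2]               # backward flow at the warped pixel
    round_trip = flow_fwd + bwd          # ~0 for geometrically consistent matches
    err = np.linalg.norm(round_trip, axis=-1)
    return err < thresh
```

The surviving matches could then be fed to a classical relative-pose solver (for example, OpenCV's cv2.findEssentialMat followed by cv2.recoverPose) so that the estimated pose and the relative depth share one consistent, though still non-metric, scale; the specific solver is an assumption here, not stated in the abstract.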
Keywords
Monocular camera trajectory recovery with real scale

Liu Sibo1, Fang Lijin2(1.College of Information Science and Engineering, Northeastern University, Shenyang 110819, China;2.Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110169, China)

Abstract
Objective Camera-based trajectory recovery uses one or more cameras to collect image data, but the path computed from a monocular camera always drifts severely because scale information is missing: the input of monocular depth estimation is a single monocular sequence, the depth of each object in the image has innumerable possibilities, and only relative distance relationships between neighboring objects can be inferred from boundaries and brightness cues. Thus, the monocular camera is rarely used for high-precision applications. To take advantage of the high popularity and low cost of the monocular camera, many researchers have presented learning-based methods that estimate camera pose and depth simultaneously, which is also the goal of a simultaneous localization and mapping (SLAM) system. Although such methods are fast and effective, they do not work well in several specific situations, such as images with excessively long spans, few features, or complex textures. Moreover, depth accuracy is essential for the details of the estimated path. Most researchers use light detection and ranging (LiDAR) to acquire depth values; it is clearly more accurate than any other sensor, and almost all large datasets use LiDAR to produce ground-truth labels, but it is not widely used because of its high price. Others compute depth from stereo RGB images, but the algorithm is complex, slower than other methods, and must be recalibrated before use because the baseline changes whenever the images are not collected with one's own stereo camera. With the rise of artificial intelligence, a convolutional neural network can be trained to realize the required function, so the monocular camera can now handle tasks that were previously impossible: the scene geometry is modeled as mathematical expressions, and the network is optimized to satisfy them. This approach has proven effective, and most scholars train their networks with a large number of labels provided by a dataset; however, this is not truly practical, because ground-truth labels cannot be obtained in most complex environments, and such methods then fail. Therefore, a traditional method is used to solve the relative poses, and a real-scale recovery method that leverages scene geometry in the field of autonomous driving is presented.
Method First, a depth network is used to estimate the relative depth map of consecutive frames. The pixels are then projected from the image plane into 3D space with the estimated depth values, forward-backward consistency of the optical flow estimated by an optical flow network is computed to select effective matching points, the relative poses are solved from these points with a traditional method, and the scale of the relative depth is made consistent with that of the pose. Next, the relative depth is used to compute a surface normal map, from which a ground point group is obtained through geometric relationships; the camera height at the same scale is computed from the ground points, and the prior camera height is introduced to obtain the initial scale. A vehicle detection module is further introduced to compute a compensation scale that is weighted with the initial scale to obtain the final scale and to suppress the deviation caused by image noise. Finally, an absolute depth map and a complete motion trajectory are recovered with the computed scale.
Result The experiment is carried out on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) autonomous driving dataset, and both the recovered absolute depth and the estimated camera trajectory are improved in accuracy. The absolute error of the relative depth rescaled with the ground truth is 0.114, while that of our method using the computed scale is 0.116. The camera trajectory is tested on different complex paths: the error between the distance recovered with the computed scale and the ground-truth distance is only 2.67%, and the recovered trajectory is closer to the ground-truth trajectory than that of the traditional oriented FAST and rotated BRIEF-simultaneous localization and mapping (ORB-SLAM2).
Conclusion Only monocular camera data are used as input, applied to the autonomous driving field. Self-supervised learning is adopted, and the real scale is computed by leveraging the geometric constraints in the scene without any ground-truth labels. In most existing methods, the estimated depth values are relative and therefore of little use in practice: without scale information, even a highly accurate relative depth cannot approximate reality. Nevertheless, many researchers estimate depth with a trained network, unify the scale with the ratio between the mean relative depth and the mean ground-truth depth taken from the labels, and thus report a much lower error than other methods while offering no practical benefit. Compared with other traditional methods, after the real scale is added, the proposed method yields a lower offset error, fast computation, and high robustness, and it does not need ground-truth labels.
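As a rough illustration of the scale-recovery step described in the Method section, the sketch below estimates the camera height from ground points selected by surface normals, converts it to a metric scale with the known prior camera height, and blends it with a compensation scale from a vehicle detection module. It is an assumed implementation, not the paper's released code; the finite-difference normal estimation, the up direction, the 0.95 cosine threshold, and the 0.7 fusion weight are all illustrative choices.

```python
# Illustrative sketch (assumptions noted above): ground-plane camera height and
# weighted scale fusion, operating on a back-projected point map (H x W x 3).
import numpy as np

def surface_normals(points):
    """Approximate per-pixel normals from a point map by finite differences
    along the image axes, normalized to unit length."""
    dx = np.gradient(points, axis=1)
    dy = np.gradient(points, axis=0)
    n = np.cross(dx, dy)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)

def camera_height(points, normals, up=np.array([0.0, -1.0, 0.0]), cos_thresh=0.95):
    """Select ground pixels whose normal is close to the 'up' direction (y points
    down in camera coordinates), then take the median distance from the camera
    center (origin) to the plane through each ground point: |n . p| for unit n."""
    ground = (normals @ up) > cos_thresh
    if not np.any(ground):
        return None
    d = np.abs(np.einsum('ij,ij->i', points[ground], normals[ground]))
    return np.median(d)

def fuse_scale(prior_height, est_height, comp_scale=None, w=0.7):
    """Initial scale = prior camera height / estimated (relative-scale) height;
    optionally blend with a compensation scale, e.g., from detected vehicles."""
    init_scale = prior_height / est_height
    if comp_scale is None:
        return init_scale
    return w * init_scale + (1.0 - w) * comp_scale
```

Once the final scale is available, it can multiply both the relative depth map and the translation components of the estimated poses, so that the recovered depth and trajectory are expressed in metric units, consistent with the results reported on KITTI.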
Keywords
