Monocular camera trajectory recovery with real scale
2022, Vol. 27, No. 2: 486-499
Print publication date: 2022-02-16
Accepted: 2021-03-28
DOI: 10.11834/jig.200622
Sibo Liu, Lijin Fang. Monocular camera trajectory recovery with real scale[J]. Journal of Image and Graphics, 2022,27(2):486-499.
Objective
Monocular camera trajectory recovery lacks scale information because its input is only a monocular video sequence, so the generated trajectory drifts severely and cannot support high-precision applications. To exploit the high popularity and low cost of monocular cameras, a scene-geometry-based method is proposed to recover the real scale in the field of autonomous driving.
Method
First, a depth estimation network estimates relative depth on consecutive images, and the estimated depth values are used to project pixels from the 2D plane into 3D space. Then, forward-backward consistency is computed on the optical flow estimated by a flow network to obtain valid matching points, the pose is solved from them with a traditional method, and the relative depth and the pose are unified to the same scale. Next, the relative depth values are used to compute a surface normal map, from which the ground point group is solved; the camera height at this common scale is computed from geometric relations, and the prior camera height is introduced to obtain the initial scale. Finally, to reduce the scale deviation caused by image noise, the compensation scale computed by an additional vehicle detection module is weighted with the initial scale to obtain the final scale; the governing relations are summarized below.
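The scale computation above can be stated compactly as follows; the weighting coefficient α is an assumption for illustration, since the paper only states that the two scales are weighted:

```latex
% h_prior: known prior camera height; \hat{h}: camera height estimated
% from the ground points at the relative (unified) scale;
% s_veh: compensation scale from the vehicle detection module.
\begin{align}
  s_{\mathrm{init}}  &= \frac{h_{\mathrm{prior}}}{\hat{h}} \\
  % \alpha is an assumed weighting coefficient, not specified in the abstract
  s_{\mathrm{final}} &= \alpha\, s_{\mathrm{init}} + (1 - \alpha)\, s_{\mathrm{veh}}
\end{align}
```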
Result
Experiments are conducted on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) autonomous driving dataset, and both the camera motion trajectory and the image depth improve in accuracy. The absolute relative error of the relative depth rescaled with the ground-truth depth is 0.114, while that of the absolute depth recovered by our scale method is 0.116. The recovered camera trajectories are compared on paths of different complexity: the error between the scale-recovered distance and the true distance is 2.67%, and the recovered trajectory is closer to the ground truth than that of the traditional ORB-SLAM2 (oriented FAST and rotated BRIEF simultaneous localization and mapping).
Conclusion
Taking only monocular camera images as input, this paper applies a self-supervised learning method on an autonomous driving dataset, requiring no ground-truth depth labels for training, and recovers the real scale from geometric constraints in the scene; both the recovered absolute depth and the trajectory improve in accuracy. Compared with traditional methods, the offset error after adding the real scale is lower, computation is fast, and robustness is high.
Objective
Trajectory recovery based on cameras uses one or more cameras to collect image data, but it always causes serious drift in the computed path due to the lack of scale information. Because the input of monocular depth estimation is only a single monocular sequence, the depth of the objects in the image has innumerable possibilities, and only the relative distance between objects can be inferred from the image by distinguishing borders and brightness. Thus, the monocular camera is rarely used for high-precision applications. To take advantage of the high popularity and low cost of the monocular camera, many researchers have presented learning-based methods that estimate the pose and depth of the camera simultaneously, which is also the problem solved by simultaneous localization and mapping (SLAM) systems. Although this approach is fast and effective, it does not work well in several specific cases, such as images with excessively long spans, few features, or complex textures. Moreover, the accuracy of the depth is essential for the details of the estimated path. Most researchers use light detection and ranging (LiDAR) to acquire depth values; it is clearly more accurate than any other sensor, and almost all large datasets use LiDAR to produce ground-truth labels, but it is unpopular because of its high price. Others use stereo RGB images to compute depth, but the algorithm is complex, slower than other methods, and must be recalibrated before use, because the baseline changes whenever the images were not collected with one's own stereo camera. With the rise of artificial intelligence, a convolutional neural network can be trained to realize the required function, so the monocular camera can now perform tasks that previously could not be accomplished: geometric space is modeled as a mathematical expression, and the network is trained to meet the requirement. This strategy has proven effective, and most scholars use a large number of labels provided by a dataset to train the network; however, this is not truly practical, because ground-truth labels cannot be obtained in most complex fields, where such methods then fail. Therefore, a traditional method is used to solve the relative poses, and a real-scale recovery method that leverages scene geometry in the field of autonomous driving is presented.
Method
First, the depth network is used to estimate the relative depth map of continuous sequences. Then, the pixels are projected from the pixel plane into 3D space by leveraging the estimated depth values, as sketched below.
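A minimal sketch of this back-projection under the standard pinhole model; the function name and the NumPy formulation are ours, not the paper's:

```python
import numpy as np

def backproject(depth, K):
    """Back-project a relative depth map into a 3D point cloud.

    depth: H x W relative depth map from the network.
    K:     3 x 3 camera intrinsic matrix.
    Returns an (H*W) x 3 array of camera-frame points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)     # homogeneous pixels
    rays = pix.reshape(-1, 3) @ np.linalg.inv(K).T       # normalized viewing rays
    return rays * depth.reshape(-1, 1)                   # scale each ray by its depth
```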
Next, the consistency between the forward and backward optical flows estimated by the flow network is calculated to keep only the effective matching points, the relative poses are solved from these points by a traditional method, and the relative depth and the pose are thereby brought to a consistent scale; both steps are sketched below.
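A sketch of the consistency filter and the traditional pose solve, assuming OpenCV's essential-matrix routines stand in for the paper's solver; the pixel threshold is an assumed value:

```python
import cv2
import numpy as np

def consistent_matches(flow_fwd, flow_bwd, thresh=1.0):
    """Keep pixels whose forward flow is undone by the backward flow."""
    h, w = flow_fwd.shape[:2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # where each pixel lands under the forward flow
    u2 = np.clip(np.round(u + flow_fwd[..., 0]).astype(int), 0, w - 1)
    v2 = np.clip(np.round(v + flow_fwd[..., 1]).astype(int), 0, h - 1)
    # round-trip error: forward flow plus backward flow at the landing point
    err = np.linalg.norm(flow_fwd + flow_bwd[v2, u2], axis=-1)
    mask = err < thresh                      # tolerance in pixels (assumed)
    pts1 = np.stack([u[mask], v[mask]], axis=-1).astype(np.float64)
    pts2 = pts1 + flow_fwd[mask]
    return pts1, pts2

def relative_pose(pts1, pts2, K):
    """Classical pose solve: essential matrix with RANSAC, then decomposition."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t                              # t has unit norm: scale is unknown
```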
After that, the relative depth is used to calculate a surface normal map, a ground point group is obtained from it through the geometric relationship, and the camera height at the consistent scale is calculated from the ground points, as in the following sketch.
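A sketch of ground extraction and height estimation, assuming the KITTI camera convention in which the y axis points down; the normal threshold and the use of the median are our illustrative choices:

```python
import numpy as np

def camera_height(points, normal_thresh=0.95):
    """Estimate the camera height above the road from a point cloud.

    points: H x W x 3 cloud (e.g. backproject(...).reshape(h, w, 3)).
    """
    # surface normals from finite differences of neighboring 3D points
    dx = points[:, 1:, :] - points[:, :-1, :]
    dy = points[1:, :, :] - points[:-1, :, :]
    n = np.cross(dx[:-1, :, :], dy[:, :-1, :])
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    # ground pixels: normal (anti)parallel to the vertical axis
    up = np.array([0.0, -1.0, 0.0])          # y points down in the camera frame
    ground = np.abs(n @ up) > normal_thresh
    # distance from the camera center to each local ground plane
    pts = points[:-1, :-1, :][ground]
    heights = np.abs(np.sum(n[ground] * pts, axis=-1))
    return np.median(heights)
```

The initial scale is then simply h_prior / camera_height(points), where h_prior is the known mounting height of the camera.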
The initial scale is then obtained by introducing the prior camera height, and the vehicle detection module performs a compensation on this scale to eliminate the deviation that image noise causes in the computed scale, yielding the final scale. Finally, an absolute depth map and an integrated motion trajectory are recovered with the computed scale; a sketch of these last steps follows.
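A sketch of the scale weighting and trajectory recovery; the weighting coefficient and the pose-chaining convention (recoverPose-style relative poses) are assumptions, as the abstract does not fix them:

```python
import numpy as np

def final_scale(s_init, s_vehicle, alpha=0.7):
    """Weight the height-based initial scale with the vehicle-detection
    compensation scale; alpha is an assumed weight."""
    return alpha * s_init + (1.0 - alpha) * s_vehicle

def recover_trajectory(rel_poses, scale):
    """Chain scaled relative poses into a real-scale trajectory.

    rel_poses: list of (R, t) with unit-norm t, mapping frame k to k+1.
    """
    T = np.eye(4)                             # camera-to-world pose of frame 0
    traj = [T[:3, 3].copy()]
    for R, t in rel_poses:
        step = np.eye(4)
        step[:3, :3] = R
        step[:3, 3] = scale * t.ravel()       # metric translation
        T = T @ np.linalg.inv(step)           # accumulate camera motion
        traj.append(T[:3, 3].copy())
    return np.asarray(traj)
```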
Result
The experiment is carried out on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) autonomous driving dataset, and both the recovered absolute depth and the estimated camera motion trajectory improve in accuracy. The absolute relative error of the relative depth rescaled with the ground truth is 0.114, while that of our method using the computed scale is 0.116. The camera trajectory is tested on different complex paths: the error between the distance recovered using our scale and the ground-truth distance is only 2.67%, and the restored trajectory is closer to the ground-truth trajectory than that of the traditional oriented FAST and rotated BRIEF simultaneous localization and mapping system (ORB-SLAM2).
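For reference, the depth numbers above match the form of the standard absolute relative error on KITTI; this sketch of the metric, including the median rescaling used for the ground-truth-scaled setting, is our reading of the protocol rather than the paper's stated code:

```python
import numpy as np

def abs_rel(pred, gt, scale=None):
    """Absolute relative depth error.

    With scale=None the prediction is rescaled by the ground-truth median
    (the 0.114 setting); passing the computed scale reproduces the
    scale-recovery setting (0.116).
    """
    valid = gt > 0                            # LiDAR ground truth is sparse
    pred, gt = pred[valid], gt[valid]
    s = np.median(gt) / np.median(pred) if scale is None else scale
    return np.mean(np.abs(s * pred - gt) / gt)
```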
Conclusion
Only monocular camera data are used as input, applied to the autonomous driving field in this paper. Self-supervised learning is adopted, and the true scale is calculated by leveraging the geometric constraints in the scene without any ground-truth labels. Moreover, the depth values produced by most existing methods are relative, that is, nearly useless in practical applications: without scale information, no matter how accurate the relative depth is, it cannot approximate reality. Nevertheless, most researchers work this way; they use the trained network to estimate depth, rescale it with the ratio between the average relative depth and the ground-truth depth from the labels, and thus report an extremely low error with no practical effect. Compared with traditional methods, after adding the real scale, the offset error is lower, the calculation speed is fast, the robustness is high, and no ground-truth labels are needed.
Keywords: self-supervised learning; autonomous driving; monocular depth estimation; relative pose estimation; scale recovery
Bian J W, Li Z C, Wang N Y, Zhan H Y, Shen C H, Cheng M M and Reid I. 2019. Unsupervised scale-consistent depth and ego-motion learning from monocular video//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates, Inc.: 35-45
Casser V, Pirk S, Mahjourian R and Angelova A. 2019. Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos//Proceedings of 2019 AAAI Conference on Artificial Intelligence. Honolulu, USA: [s. n.]: 8001-8008[DOI: 10.1609/aaai.v33i01.33018001]
Dharmasiri T, Spek A and Drummond T. 2018. ENG: end-to-end neural geometry for robust depth and pose estimation using CNNs[EB/OL]. [2020-10-05]. https://arxiv.org/pdf/1807.05705.pdf
Eigen D, Puhrsch C and Fergus R. 2014. Depth map prediction from a single image using a multi-scale deep network//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2366-2374
Fischler M A and Bolles R C. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6): 381-395[DOI: 10.1145/358669.358692]
Garg R, B.G. V K, Carneiro G and Reid I. 2016. Unsupervised CNN for single view depth estimation: geometry to the rescue//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 740-756[DOI: 10.1007/978-3-319-46484-8_45]
Geiger A, Lenz P, Stiller C and Urtasun R. 2013. Vision meets robotics: the KITTI dataset. The International Journal of Robotics Research, 32(11): 1231-1237[DOI: 10.1177/0278364913491297]
Geiger A, Lenz P and Urtasun R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 3354-3361[DOI: 10.1109/CVPR.2012.6248074]
Godard C, Aodha O M, Firman M and Brostow G. 2019. Digging into self-supervised monocular depth estimation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 3827-3837[DOI: 10.1109/ICCV.2019.00393]
Hartley R I. 1997. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6): 580-593[DOI: 10.1109/34.601246]
Hui T W, Tang X O and Loy C C. 2018. LiteFlowNet: a lightweight convolutional neural network for optical flow estimation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 8981-8989[DOI: 10.1109/CVPR.2018.00936]
Kalman D. 1996. A singularly valuable decomposition: the SVD of a matrix. The College Mathematics Journal, 27(1): 2-23[DOI: 10.2307/2687269]
Klingner M, Termöhlen J A, Mikolajczyk J and Fingscheidt T. 2020. Self-supervised monocular depth estimation: solving the dynamic object problem by semantic guidance//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 582-600[DOI: 10.1007/978-3-030-58565-5_35]
Luo C X, Yang Z H, Wang P, Wang Y, Xu W, Nevatia R and Yuille A. 2020. Every pixel counts++: joint learning of geometry and motion with 3D holistic understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2624-2641[DOI: 10.1109/TPAMI.2019.2930258]
Mur-Artal R, Montiel J M M and Tardós J D. 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5): 1147-1163[DOI: 10.1109/TRO.2015.2463671]
Ranjan A, Jampani V, Balles L, Kim K, Sun D Q, Wulff J and Black M J. 2019. Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12232-12241[DOI: 10.1109/CVPR.2019.01252]
Wang C Y, Buenaposada J M, Zhu R and Lucey S. 2018. Learning depth from monocular videos using direct methods//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2022-2030[DOI: 10.1109/CVPR.2018.00216]
Wang Z, Bovik A C, Sheikh H R and Simoncelli E P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600-612[DOI: 10.1109/TIP.2003.819861]
Xue F, Zhuo G R, Huang Z Y, Fu W F, Wu Z Y and Ang M H. 2020. Toward hierarchical self-supervised monocular absolute depth estimation for autonomous driving applications//Proceedings of 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Las Vegas, USA: IEEE: 2330-2337[DOI: 10.1109/IROS45743.2020.9340802]
Yang Z H, Wang P, Wang Y, Xu W and Nevatia R. 2018. LEGO: learning edge with geometry all at once by watching videos//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 225-234[DOI: 10.1109/CVPR.2018.00031]
Yang Z H, Wang P, Xu W, Zhao L and Nevatia R. 2017. Unsupervised learning of geometry with edge-aware depth-normal consistency[EB/OL]. [2020-10-05]. https://arxiv.org/pdf/1711.03665.pdf
Yin Z C and Shi J P. 2018. GeoNet: unsupervised learning of dense depth, optical flow and camera pose//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1983-1992[DOI: 10.1109/CVPR.2018.00212]
Zhan H Y, Weerasekera C S, Bian J W and Reid I. 2020. Visual odometry revisited: What should be learnt?//Proceedings of 2020 IEEE International Conference on Robotics and Automation (ICRA). Paris, France: IEEE: 4203-4210[DOI: 10.1109/ICRA40945.2020.9197374]
Zhao H, Gallo O, Frosio I and Kautz J. 2015. Loss functions for neural networks for image processing[EB/OL]. [2020-10-05]. https://arxiv.org/pdf/1511.08861.pdf
Zhao W, Liu S H, Shu Y Z and Liu Y J. 2020. Towards better generalization: joint depth-pose learning without PoseNet//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 9148-9158[DOI: 10.1109/CVPR42600.2020.00917]
Zhou H Z, Ummenhofer B and Brox T. 2018. DeepTAM: deep tracking and mapping//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 851-868[DOI: 10.1007/978-3-030-01270-0_50]
Zhou T H, Brown M, Snavely N and Lowe D G. 2017. Unsupervised learning of depth and ego-motion from video//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6612-6619[DOI: 10.1109/CVPR.2017.700]