结合局部平面参数预测的无监督单目图像深度估计
Unsupervised monocular image depth estimation based on the prediction of local plane parameters
2021, Vol. 26, No. 1, pp. 165-175
Received: 2020-07-10; Revised: 2020-10-15; Accepted: 2020-10-22; Published in print: 2021-01-16
DOI: 10.11834/jig.200364
Objective
Unsupervised monocular image depth estimation is an important direction in 3D reconstruction, with broad application value in visual navigation, obstacle detection, and related fields. To address the local-differentiability problem in current mainstream methods, a method based on local plane parameter prediction is proposed.
Method
The depth estimation problem is converted into a local plane parameter estimation problem: a local plane parameter prediction module replaces the upsampling and depth-map generation steps of multi-scale estimation. The depth prediction at each scale is restored to the standard scale according to the local plane parameters, and the standard-scale depth map is then obtained from the pinhole camera model. This avoids the local differentiability introduced by bilinear interpolation and thus effectively prevents training from falling into local minima. Combined with a serial attention mechanism introduced into the skip connections, it strengthens the network's feature extraction ability.
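The abstract does not spell out the plane parameterization, but the restore-to-standard-scale step can be sketched under one common convention: each pixel predicts plane parameters n = (a, b, c) with n · P = 1 for 3D points P on the local plane, so the pinhole model P = D · K⁻¹ · p yields depth in closed, fully differentiable form, D(p) = 1 / (n · K⁻¹p). All names and the (a, b, c) convention below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def depth_from_plane_params(n_map, K, H, W):
    """Recover an (H, W) depth map from per-pixel plane parameters.

    Assumes each pixel predicts n = (a, b, c) describing the local plane
    n . P = 1 in camera coordinates; with the pinhole model P = D * K^-1 * p,
    depth follows in closed (fully differentiable) form:
        D(p) = 1 / (n . (K^-1 p)).
    """
    K_inv = np.linalg.inv(K)
    # Homogeneous pixel grid, flattened row-major to shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
    rays = K_inv @ p                                      # back-projected rays
    denom = (n_map.reshape(-1, 3) * rays.T).sum(axis=1)   # n . (K^-1 p)
    depth = 1.0 / np.clip(denom, 1e-6, None)              # avoid div-by-zero
    return depth.reshape(H, W)

# A fronto-parallel plane at depth 5 m: n = (0, 0, 1/5) gives D = 5 everywhere.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 96.0], [0.0, 0.0, 1.0]])
n = np.zeros((192, 640, 3))
n[..., 2] = 0.2
d = depth_from_plane_params(n, K, 192, 640)
```

Because the mapping from plane parameters to depth is a smooth closed-form expression, its gradients are global in the parameters, unlike bilinear upsampling of a low-resolution depth map.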
Result
Comparative and ablation experiments were conducted on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) autonomous driving dataset. Compared with existing unsupervised methods and some supervised methods, relative to the best published results the error metrics are reduced by 10%~20% and the accuracy metrics are improved by about 2%. At the same time, the resulting dense depth maps have clear edge contours and better robustness to reflective regions.
Conclusion
The proposed depth estimation method based on local plane parameter prediction makes full use of convolutional feature information, avoids falling into local minima during training, and adds geometric constraints to the network, yielding better test metrics and visual quality.
Objective
Scene depth information plays a vital role in many current research topics, such as 3D reconstruction, obstacle detection, and visual navigation. Obtaining dense and accurate depth information often requires expensive equipment, resulting in high costs; depth estimation from color images requires no such equipment and has a wider range of applications. Stereo matching is a traditional method for estimating depth from RGB images, but it relies heavily on feature matching and therefore produces large estimation errors in weakly textured regions. With the wide application of convolutional neural networks in image processing, depth estimation from monocular images has been widely investigated. However, monocular depth estimation is essentially an ill-posed problem because a single image lacks the depth cues provided by motion and stereo. Many methods are currently used to estimate the depth of a monocular image. Without using real depth data, unsupervised learning from binocular images uses image reconstruction as the supervisory signal to train a depth estimation model. Although large breakthroughs have been achieved on this task, depth estimation still depends on geometric features. How to effectively use the information in shallow image features, and how to add geometric constraints to the prediction output while ensuring good convergence, are widely studied questions for improving the accuracy of depth estimation. In the commonly used multi-scale estimation, the bilinear interpolation sampling scheme is only locally differentiable, which easily makes the network fall into a local minimum and harms training. A method based on local plane parameter prediction is proposed to address these problems: it applies a completely differentiable, geometrically constrained formulation to multi-scale prediction, thereby effectively constraining the multi-scale depth map predictions to converge in the same direction.
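To see why bilinear sampling is only locally differentiable: the interpolated value depends on just the four pixels surrounding the sampling point, so gradients with respect to the sampling coordinates carry only local image differences, which is what can trap multi-scale training in poor minima. A minimal sketch (plain numpy, illustrative function name):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly sample img at the continuous location (x, y).

    Only the four surrounding pixels contribute; the derivative with
    respect to (x, y) sees only their local differences, so the loss
    surface is piecewise and gradients do not "see" distant pixels.
    """
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
            + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])

img = np.arange(16.0).reshape(4, 4)
v = bilinear_sample(img, 1.5, 1.5)   # average of the 2x2 block img[1:3, 1:3]
```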
Method
This study presents an unsupervised monocular depth estimation network based on local plane parameter prediction. The main structure is an encoder-decoder network composed of three parts: a ResNet50-based encoder, a decoder that introduces a serial dual attention mechanism into the skip connections, and multi-scale prediction using the local plane parameter estimation module. During training, the network estimates the depth of one image of a stereo pair, reconstructs the other view, and uses the real image of the other view as supervision. Our training set includes 22 600 images from the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset. The model is built on the PyTorch framework, the input images are 640×192 pixels, and training runs for 20 epochs on an NVIDIA RTX 2080 GPU. In the multi-scale prediction module, we convert the depth estimation problem into a local plane parameter estimation problem: the local plane parameter prediction module replaces upsampling and depth-map generation in multi-scale estimation. The depth prediction at each scale is restored to the standard scale according to the local plane parameters, and the standard-scale depth map is obtained through a pinhole camera model, avoiding the local differentiability caused by bilinear interpolation and thereby effectively preventing the network from falling into a local minimum. A serial attention mechanism introduced into the skip connections yields clear edge contour information.
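The exact serial dual attention block is not specified in this abstract; a common serial pattern, channel attention followed by spatial attention applied to a skip-connection feature map, can be sketched in PyTorch as follows. All layer sizes and names are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SerialAttention(nn.Module):
    """Channel attention followed by spatial attention, applied in series.

    An illustrative serial dual-attention sketch for skip-connection
    features; the paper's actual block may differ in detail.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel branch: squeeze to 1x1, bottleneck MLP, sigmoid gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: gate from per-pixel mean/max channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)               # reweight channels
        avg = x.mean(dim=1, keepdim=True)          # (N, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)         # (N, 1, H, W)
        x = x * self.spatial_gate(torch.cat([avg, mx], dim=1))
        return x

feat = torch.randn(1, 64, 24, 80)       # e.g. a skip-connection feature map
out = SerialAttention(64)(feat)         # same shape as the input
```

Both gates are multiplicative, so the block preserves the feature map's shape and can be dropped into any skip connection without changing the decoder interface.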
Result
We compared our model with multiple unsupervised and supervised methods on the KITTI test set. The quantitative evaluation indicators include absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean square error (RMSE), logarithmic root mean square error (RMSE log), and the threshold accuracy $\delta$. The dense depth maps produced by each method are also compared. The experimental results show that the proposed method performs well on all error and accuracy indicators. In the comparative tests, the error indicators are reduced by 10% to 20% and the accuracy indicators are increased by 1% to 2%. The generated depth maps have relatively clear contours and can separate the important depth values of pedestrians and vehicles from complex backgrounds; they are also robust to reflective areas, improving the quality of depth estimation. A series of ablation experiments on the test set further demonstrates the effectiveness of the proposed algorithm.
Conclusion
In this study, we proposed a depth estimation method based on local plane parameter prediction. The proposed method makes full use of convolutional feature information, avoids local minima during training, and adds geometric constraints to the network, yielding excellent test indicators and visual effects.