The progress of monocular depth estimation technology
2019, Vol. 24, No. 12, pp. 2081-2097
Received: 2019-08-28; Revised: 2019-08-29; Accepted: 2019-09-06; Published in print: 2019-12-16
DOI: 10.11834/jig.190455
Depth estimation from a single image, a classical problem in computer vision, is important for 3D scene reconstruction and for occlusion and illumination handling in augmented reality. This paper reviews the recent literature on single-image depth estimation and introduces the commonly used datasets and methods. According to scene type, the datasets can be divided into indoor, outdoor, and virtual-scene datasets. According to the underlying mathematical model, monocular depth estimation methods can be divided into traditional machine-learning-based methods and deep-learning-based methods. Traditional machine-learning-based methods use a Markov random field (MRF) or conditional random field (CRF) to model the depth relationships of the pixels in an image; in the maximum a posteriori (MAP) framework, the depth is obtained by minimizing an energy function. Depending on whether the model contains parameters, these methods can be further divided into parametric and non-parametric learning methods. The former assumes that the model contains unknown parameters, which the training process solves for; the latter uses existing datasets for similarity retrieval to infer depth, so no parameters need to be learned. In recent years, deep learning has advanced many fields of computer vision. This paper analyzes the current state of deep-learning-based monocular depth estimation research in China and abroad, together with the advantages and disadvantages of the methods, and classifies them hierarchically in a bottom-up fashion according to different criteria. At the first level, methods are divided into single-task methods that predict only depth and multi-task methods that simultaneously predict depth and semantics; the depth and semantics of an image are closely related, and several works study their joint prediction. The second level distinguishes absolute depth prediction from relative depth prediction. Absolute depth is the actual distance between an object in the scene and the camera, whereas relative depth concerns the relative ordering of objects in the image; given an arbitrary image, human vision is better at judging the relative distances of objects in a scene. The third level consists of supervised regression methods, supervised classification methods, and unsupervised methods. For single-image depth estimation, most works focus on predicting absolute depth, and most early methods use a supervised regression model: the training data carry depth labels, and the model regresses continuous depth values. Exploiting the far-to-near layout of scenes, several studies instead solve depth estimation as a classification problem. Supervised learning requires every RGB image to have a corresponding depth label, whose acquisition usually needs a depth camera or LiDAR; the former has limited range, and the latter is expensive. Moreover, the raw depth labels collected are usually sparse points that do not match the original image precisely. Unsupervised methods that need no depth labels have therefore become a research trend; their basic idea is to use left and right views and to combine epipolar geometry with an autoencoder to recover depth.
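The MAP energy minimization used by the traditional MRF-based methods can be illustrated with a minimal sketch. The setting here is an assumption for illustration only (a 1-D chain of pixels with discrete depth labels, a unary data cost, and an absolute-difference smoothness term solved exactly by Viterbi dynamic programming); real methods operate on 2-D grids with learned potentials, and the function names are hypothetical.

```python
import numpy as np

def map_depth_1d(unary, smooth_weight):
    """Exact MAP inference for a 1-D chain MRF via Viterbi dynamic
    programming. unary[i, k] is the cost of assigning depth label k to
    pixel i; adjacent pixels pay smooth_weight * |k - k'| for differing."""
    n, L = unary.shape
    labels = np.arange(L)
    # pairwise[k, k'] = smoothness penalty between neighbouring labels
    pairwise = smooth_weight * np.abs(labels[:, None] - labels[None, :])

    cost = unary[0].copy()              # best cost ending at each label
    back = np.zeros((n, L), dtype=int)  # backpointers for the best path
    for i in range(1, n):
        total = cost[:, None] + pairwise       # (prev label, cur label)
        back[i] = np.argmin(total, axis=0)
        cost = total[back[i], labels] + unary[i]

    # backtrack the energy-minimising label sequence
    path = np.empty(n, dtype=int)
    path[-1] = int(np.argmin(cost))
    for i in range(n - 1, 0, -1):
        path[i - 1] = back[i, path[i]]
    return path
```

With `smooth_weight = 0` the result is the per-pixel minimum of the unary cost; increasing it trades data fidelity for a smoother depth profile, which is exactly the balance the energy function encodes.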
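The non-parametric branch infers depth by similarity retrieval rather than by fitting parameters. A minimal sketch, assuming precomputed global feature vectors and nearest-neighbour averaging (real systems such as depth transfer additionally warp and fuse the retrieved depth maps; all names here are illustrative):

```python
import numpy as np

def retrieve_depth(query_feat, gallery_feats, gallery_depths, k=3):
    """Non-parametric depth inference: find the k training images whose
    features are closest to the query and average their depth maps."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]       # indices of the k best matches
    return gallery_depths[nearest].mean(axis=0)
```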
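The classification view of depth estimation rests on discretizing the continuous depth range into bins. The sketch below assumes log-spaced bins (finer near the camera, coarser far away, reflecting the far-to-near scene structure mentioned above) with geometric bin centers for decoding; the bin layout and function names are illustrative assumptions, not the scheme of any particular paper.

```python
import numpy as np

def make_log_bins(d_min, d_max, n_bins):
    """Bin edges spaced uniformly in log-depth, so nearby depths get
    finer bins than distant ones."""
    edges = np.logspace(np.log10(d_min), np.log10(d_max), n_bins + 1)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    return edges, centers

def depth_to_class(depth, edges):
    """Quantize continuous depth values into discrete class indices."""
    return np.clip(np.digitize(depth, edges) - 1, 0, len(edges) - 2)

def class_to_depth(labels, centers):
    """Decode predicted class indices back to approximate depths."""
    return centers[labels]
```

Training then uses a per-pixel cross-entropy loss over the class indices, and the decoded bin centers give the final depth map.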
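The unsupervised idea sketched at the end of the abstract (left and right views plus epipolar geometry instead of depth labels) reduces, for rectified stereo, to a photometric reconstruction loss: predict a disparity map, warp the right image with it to reconstruct the left image, and penalize the difference. A minimal NumPy sketch with nearest-neighbour sampling on grayscale images (real methods use differentiable bilinear sampling and extra smoothness and consistency terms; the names are illustrative):

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d(x):
    for rectified stereo, the left pixel at column x matches the right
    pixel at column x - d."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disparity           # matching columns
    xs = np.clip(np.round(xs).astype(int), 0, w - 1)
    return np.take_along_axis(right, xs, axis=1)

def photometric_loss(left, right, disparity):
    """Mean L1 difference between the left image and its reconstruction;
    minimising this over the predicted disparity provides a training
    signal without any ground-truth depth."""
    return np.abs(left - warp_right_to_left(right, disparity)).mean()
```

Given the camera focal length f and stereo baseline B, the recovered disparity d converts to absolute depth via depth = f * B / d, which is how these methods produce a metric depth map despite never seeing a depth label.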