
Song Wei, Zhu Mengfei, Zhang Minghua, Zhao Danfeng, He Qi (School of Information, Shanghai Ocean University, Shanghai 201306)

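Abstract
Scene depth estimation is one of the classic problems in computer vision and an important component of applications such as 3D reconstruction and image synthesis. Monocular depth estimation based on deep learning has developed rapidly, and a variety of network structures have been proposed in succession. This paper surveys the latest progress in deep-learning-based monocular depth estimation and reviews the development of supervised and unsupervised learning methods. Focusing on the optimization ideas behind monocular depth estimation and how they are reflected in deep network structures, it divides supervised methods into five categories: multi-scale feature fusion methods, methods combined with conditional random fields (CRFs), methods based on ordinal relations, methods combining multiple kinds of image information, and other methods. It divides unsupervised methods into five categories: methods based on stereo vision, methods based on structure from motion (SfM), methods combined with adversarial networks, methods based on ordinal relations, and methods incorporating uncertainty. In addition, it introduces the datasets and evaluation metrics commonly used in monocular depth estimation, and discusses the current status of and challenges facing deep-learning-based monocular depth estimation in terms of accuracy, generalization, application scenarios, and the study of uncertainty in unsupervised networks, providing a fairly comprehensive reference for researchers in related fields.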
A review of monocular depth estimation techniques based on deep learning

Song Wei, Zhu Mengfei, Zhang Minghua, Zhao Danfeng, He Qi(School of Information, Shanghai Ocean University, Shanghai 201306, China)

Scene depth estimation is one of the key problems in computer vision and an important component of applications such as 3D reconstruction and image synthesis. Monocular depth estimation techniques based on deep learning have developed rapidly in recent years, and a variety of network structures have been proposed. This paper reviews the current development of deep-learning-based monocular depth estimation and categorizes supervised and unsupervised learning methods according to the characteristics of their network structures. The supervised learning methods are divided into the following categories: 1) Multi-scale feature fusion: Images at different scales contain different kinds of information, so fusing multi-scale features extracted from the images can effectively improve depth estimation results. 2) Conditional random fields (CRFs): CRFs, a class of probabilistic graphical models, perform well in semantic segmentation. Since depth information has a data distribution similar to that of semantic information, continuous CRFs can be effective for predicting continuous depth values. A CRF can serve as the loss function at the end of the network, or as a feature fusion module in the intermediate layers, because it fuses features effectively. 3) Ordinal relations: One line of work is relative depth estimation, which uses ordinal relations directly to estimate the relative position of two pixels in the image. The other formulates depth estimation as an ordinal regression problem, which discretizes the continuous depth values into discrete depth labels and performs multi-class classification over the global depth range.
4) Multiple image information: Image information of different kinds (temporal, spatial, semantic, etc.) is implicitly related to the depth of the scene, so combining it in depth estimation helps improve accuracy. Four types of information are often adopted: semantic information, neighborhood information, temporal information and object boundary information. 5) Other methods: Some supervised learning methods cannot easily be classified into the above categories. They pursue various optimization strategies, including efficiency optimization, domain adaptation using synthetic data obtained via image style transfer, and hardware-oriented optimization for underwater scene depth estimation. The unsupervised learning methods are classified as follows: 1) Stereo vision: Stereo vision aims to infer the depth of each pixel from two or more images. Conventional binocular stereo algorithms are based on stereo disparity and reconstruct the three-dimensional geometry of the surrounding scene from images captured by two camera sensors using the principle of triangulation. Researchers recast depth estimation as an image reconstruction problem, so that unsupervised depth estimation is realized from binocular (or multi-view) images and predicted disparity maps. 2) Structure from motion (SfM): SfM is a technique that automatically recovers camera parameters and the 3D structure of a scene from multiple images or video sequences. Unsupervised methods based on SfM resemble those based on stereo vision in that they also recast depth estimation as image reconstruction, but they differ in many details.
First, SfM-based unsupervised methods generally reconstruct images across successive frames; that is, the current frame is used to reconstruct the previous or the next frame. Such methods therefore generally use image sequences, i.e. video, as training data. Second, unsupervised methods based on SfM need to introduce a camera pose estimation module into the training process. 3) Adversarial strategies: Generative adversarial networks (GANs) perform strongly in many imaging tasks; a discriminator judges the results produced by the generator and thereby pushes the generator to produce results that match the labels. Adding discriminators to unsupervised networks can improve depth estimation by optimizing the image reconstruction results. 4) Ordinal relations: Similar to the ordinal regression approach in supervised learning, discrete disparity estimation is also feasible in unsupervised networks. Since discrete depth values yield more robust and sharper depth estimates than conventional regression, discretization is equally effective in unsupervised networks. 5) Uncertainty: Since unsupervised learning does not use ground-truth depth values, the reliability of the predicted depth is open to question. It has therefore been proposed to use the uncertainty of the predictions of unsupervised methods as a criterion for judging whether they are credible, which can also be used to optimize the results in monocular depth estimation. This review also introduces the NYU dataset, the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset, the Make3D dataset and the Cityscapes dataset, which are the main datasets used in monocular depth estimation, as well as six commonly used evaluation metrics.
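The abstract does not name the six evaluation metrics. As a minimal sketch, the code below assumes the set most often reported in the monocular depth literature: absolute relative error, squared relative error, RMSE, RMSE in log space, mean log10 error, and threshold accuracy. The function name `depth_metrics` and the dictionary keys are hypothetical, and the inputs are assumed to be flat lists of strictly positive depths with invalid pixels already removed.

```python
import math


def depth_metrics(pred, gt):
    """Compare predicted and ground-truth depths (flat lists of positive floats).

    Assumed metric set, not taken from the abstract itself.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(gt)
    pairs = list(zip(pred, gt))

    # Error metrics: lower is better.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n

    # Threshold accuracy: fraction of pixels whose ratio max(p/g, g/p)
    # is below 1.25, 1.25^2 and 1.25^3. Higher is better.
    ratios = [max(p / g, g / p) for p, g in pairs]
    delta = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}

    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "log10": log10,
        "delta1": delta[1],
        "delta2": delta[2],
        "delta3": delta[3],
    }
```

A call such as `depth_metrics([2.0, 4.0], [1.0, 4.0])` returns one dictionary per image or per dataset split. In practice the per-pixel lists would first be masked to valid ground-truth pixels and clipped to a dataset-specific depth range.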
Based on these datasets and evaluation metrics, the reviewed methods are compared. Finally, the review discusses the current status of and challenges facing deep-learning-based monocular depth estimation in terms of accuracy, generalizability, application scenarios and uncertainty studies in unsupervised networks.
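The ordinal regression formulation mentioned above, in which continuous depth is discretized into ordered labels, can be illustrated with a small sketch. It assumes bin edges spaced uniformly in log-depth and a rank-style target encoding; both are common choices but are not specified in the abstract. All function names, the depth range and the bin count are hypothetical.

```python
import math


def log_spaced_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 bin edges spaced uniformly in log-depth (assumed scheme)."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / num_bins
    return [math.exp(log_min + step * i) for i in range(num_bins + 1)]


def depth_to_label(depth, edges):
    """Map a depth to the index of its bin, clamping out-of-range values."""
    num_bins = len(edges) - 1
    for k in range(num_bins):
        if depth < edges[k + 1]:
            return k
    return num_bins - 1


def ordinal_targets(label, num_bins):
    """Encode label k as num_bins - 1 binary answers to 'is the label > i?'."""
    return [1 if label > i else 0 for i in range(num_bins - 1)]


def label_to_depth(label, edges):
    """Decode a label to a representative depth (geometric mean of its bin edges)."""
    return math.sqrt(edges[label] * edges[label + 1])
```

With `edges = log_spaced_edges(1.0, 16.0, 4)`, a depth of 3.0 falls in bin 1, whose rank-style target is `ordinal_targets(1, 4)`. Unlike a one-hot class target, this encoding keeps the ordering of the bins, so predicting a neighbouring bin disagrees with the target in fewer positions than predicting a distant one.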