Fusing an attention mechanism and a multilayer U-Net for multi-view stereo reconstruction

Liu Huijie, Bai Zhengyao, Cheng Wei, Li Junjie, Xu Zhu (School of Information Science and Engineering, Yunnan University, Kunming 650500, China)

Abstract
Objective To address the unsatisfactory overall quality of multi-view stereo (MVS) reconstruction, this paper studies the feature extraction module and the cost-volume regularization module in MVS 3D reconstruction and proposes an end-to-end deep learning architecture based on an attention mechanism. Method First, deep features are extracted from the input source images and the reference image, with an attention layer added at every level of the feature extraction module to capture the long-range dependencies of the depth inference task. Then, the feature volumes of the reference frustum are built through differentiable homography warping, and the cost volume is constructed. Finally, a multilayer U-Net architecture regularizes the cost volume, and the final refined depth map is generated through regression combined with the edge information of the reference image. Result Tests on the DTU (Technical University of Denmark) dataset show that, compared with the Colmap, Gipuma, and Tola methods, the proposed method improves the overall metric by 8.5%, 13.1%, and 31.9%, and the completeness metric by 20.7%, 41.6%, and 73.3%, respectively; compared with the Camp, Furu, and SurfaceNet methods, it improves the overall metric by 24.8%, 33%, and 29.8%, the accuracy metric by 39.8%, 17.6%, and 1.3%, and the completeness metric by 9.7%, 48.4%, and 58.3%, respectively; compared with the PruMvsnet method, the overall metric improves by 1.7% and the accuracy metric by 5.8%; compared with the Mvsnet method, the overall metric improves by 1.5% and the completeness metric by 7%. Conclusion The test results on the DTU dataset show that the proposed network architecture achieves the current best overall metric, substantially improves the completeness and accuracy metrics, and yields better 3D reconstruction quality.
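The attention layer inserted at each level of the feature extraction module can be sketched as a simple non-local operation over flattened feature positions, where every position aggregates information from all others. The function name and shapes below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def attention_refine(feat):
    """Minimal non-local attention over a feature map.

    feat: (N, C) array -- N flattened spatial positions, C channels.
    Each output position is a weighted sum over *all* positions, which
    is how an attention layer captures long-range dependencies that a
    local convolution cannot.
    """
    scores = feat @ feat.T / np.sqrt(feat.shape[1])  # (N, N) pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)      # stabilize the softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)                # row-wise softmax weights
    return w @ feat                                  # aggregate global context
```

In a full network the queries, keys, and values would each get their own learned projection; the dot-product-and-softmax aggregation shown here is the core of the operation.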
Fusion attention mechanism and multilayer U-Net for multiview stereo

Liu Huijie, Bai Zhengyao, Cheng Wei, Li Junjie, Xu Zhu (School of Information Science and Engineering, Yunnan University, Kunming 650500, China)

Abstract
Objective With the rapid development of deep learning, learning-based multi-view stereo (MVS) research has also made great progress. The goal of MVS is to reconstruct a highly detailed 3D geometric model of a scene or object, given a series of images and the corresponding camera poses and camera parameters (the intrinsic and extrinsic parameters of each camera). As a branch of computer vision, MVS has developed tremendously in recent decades and is widely used in many areas, such as autonomous driving, robot navigation, and remote sensing. Learning-based methods can incorporate global semantic information, such as specular and reflective priors, to achieve more reliable matching; if the receptive field of the convolutional neural network (CNN) is large enough, poorly textured areas can also be reconstructed better. Existing learning-based MVS reconstruction methods fall into three main categories: voxel-based, point cloud-based, and depth map-based. Voxel-based methods divide 3D space into a regular grid and estimate whether each voxel is attached to the surface. Point cloud-based methods operate directly on point clouds, usually relying on a propagation strategy to make the reconstruction gradually denser. Depth map-based methods use estimated depth maps as an intermediate layer, decomposing the complex MVS problem into relatively small per-view depth estimation problems: each step focuses on only one reference image and several source images, and the estimated depth maps are then fused into the final 3D point cloud model. Although the previously proposed reconstruction methods leave room for improvement, the latest MVS benchmarks (such as the Technical University of Denmark (DTU) benchmark) have shown that using depth maps as an intermediate layer can achieve more accurate 3D model reconstruction.
Several end-to-end neural networks have been proposed to predict scene depth directly from a series of input images (for example, MVSNet and R-MVSNet). Although the accuracy of these methods has been verified on the DTU dataset, most of them still use only 3D CNNs to predict depth maps or voxel occupancy, which not only leads to excessive memory consumption but also limits the resolution, so the reconstruction results are not ideal. In response to these problems, this paper proposes an end-to-end deep learning architecture for 3D reconstruction based on an attention mechanism. It is a deep learning framework that takes a reference image and multiple source images as input and outputs the depth map of the reference image. Depth map estimation proceeds in the following steps: depth feature extraction, matching cost construction, cost regularization, depth map estimation, and depth map optimization. Method First, deep features are extracted from the input source images and the reference image. An attention layer is added at each level of the feature extraction module so that the network focuses on learning information important for depth inference and captures the long-range dependencies of the depth inference task. Second, differentiable homography warping is used to construct the feature volumes of the reference frustum, from which the matching cost volume is built. The central idea of cost volume construction is to compute, under each sampled depth hypothesis, the matching cost between each pixel of the reference camera and the corresponding pixels of its neighboring cameras. Finally, a multilayer U-Net architecture is used to regularize the cost volume, that is, to downsample the cost volume, extract context and neighboring-pixel information at different scales, and filter the cost volume. The final refined estimated depth map is then generated through regression.
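The differentiable homography warping step can be illustrated with the standard plane-sweep formulation: for each hypothesized depth, a 3x3 homography maps reference pixels onto each source view via a fronto-parallel plane at that depth. The sketch below assumes pinhole intrinsics and the pose convention X_src = R X_ref + t; it illustrates the geometry rather than the paper's exact implementation:

```python
import numpy as np

def plane_sweep_homography(K_ref, K_src, R, t, depth):
    """Homography from reference pixels to a source view, induced by
    the fronto-parallel plane z = depth in the reference frame.

    K_ref, K_src: 3x3 camera intrinsics.
    R, t: source-camera pose relative to the reference camera,
          so that X_src = R @ X_ref + t.
    """
    n = np.array([[0.0, 0.0, 1.0]])  # normal of the sweep plane (1, 3)
    H = K_src @ (R + t.reshape(3, 1) @ n / depth) @ np.linalg.inv(K_ref)
    return H

def warp_point(H, u, v):
    """Apply the homography to one reference pixel (u, v)."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]  # back to inhomogeneous pixel coordinates
```

Warping every source feature map with the homography of each depth hypothesis (with bilinear sampling, which keeps the operation differentiable) yields one feature volume per view, ready for cost aggregation.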
In addition, the difference-based cost metric used in this paper not only handles an arbitrary number of input views but also aggregates the multiple feature volumes into a single cost volume. In summary, this work makes two contributions: 1) an attention mechanism applied to the feature extraction module, which focuses on learning information important for depth inference so as to capture the long-range dependencies of the depth inference task; and 2) a multilayer U-Net network for cost regularization, which downsamples the cost volume, extracts context and neighboring-pixel information at different scales to filter the cost volume, and then generates the final refined estimated depth map through regression. Result Our method is tested on the DTU dataset and compared with several existing methods. Compared with Colmap, the overall index is increased by 8.5% and the completeness index by 20.7%. Compared with the Gipuma method, the overall index is increased by 13.1% and the completeness index by 41.6%. Compared with the Tola method, the overall index is increased by 31.9% and the completeness index by 73.3%. Compared with the Camp method, the overall index is increased by 24.8%, the accuracy index by 39.8%, and the completeness index by 9.7%. Compared with the Furu method, the overall index is increased by 33%, the accuracy index by 17.6%, and the completeness index by 48.4%. Compared with the SurfaceNet method, the overall index is increased by 29.8%, the accuracy index by 1.3%, and the completeness index by 58.3%. Compared with the PruMvsnet method, the overall index is increased by 1.7% and the accuracy index by 5.8%. Compared with Mvsnet, the overall index is increased by 1.5% and the completeness index by 7%.
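A difference-based metric that works for any number of views, together with the final depth regression, can be sketched as follows. The per-view variance reduction is one common instance of such a metric (as in MVSNet), and the soft-argmin expectation is a standard way to regress depth from a regularized cost volume; shapes and names here are illustrative:

```python
import numpy as np

def aggregate_cost(feature_volumes):
    """Aggregate V warped feature volumes into one cost volume.

    feature_volumes: (V, D, H, W) stack -- one volume per input view.
    The variance across views is a difference-based metric that accepts
    any number of views; perfectly matching features give zero cost.
    """
    mean = feature_volumes.mean(axis=0)
    return ((feature_volumes - mean) ** 2).mean(axis=0)  # (D, H, W)

def regress_depth(cost, depth_values):
    """Soft-argmin depth regression from a regularized cost volume.

    cost: (D, H, W); depth_values: (D,) sampled depth hypotheses.
    Returns the per-pixel expected depth, which keeps the whole
    pipeline differentiable (unlike a hard argmin).
    """
    logits = -cost                                 # low cost -> high probability
    logits -= logits.max(axis=0, keepdims=True)    # stabilize the softmax
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)        # softmax along the depth axis
    return (prob * depth_values[:, None, None]).sum(axis=0)  # (H, W)
```

The expectation over depth hypotheses, rather than a hard winner-take-all choice, is what allows the loss on the final depth map to backpropagate through the cost volume.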
Conclusion The test results on the DTU dataset show that the network architecture proposed in this paper obtains the current best results on the overall index, substantially improves the completeness and accuracy indices, and yields better 3D reconstruction quality, which demonstrates the effectiveness of the proposed method.
