Fusion attention mechanism and multilayer U-Net for multiview stereo
Vol. 27, Issue 2, Pages 475-485 (2022)
Published: 16 February 2022
Accepted: 21 September 2021
DOI: 10.11834/jig.210516
Huijie Liu, Zhengyao Bai, Wei Cheng, Junjie Li, Zhu Xu. Fusion attention mechanism and multilayer U-Net for multiview stereo[J]. Journal of Image and Graphics, 27(2): 475-485 (2022)
Objective
With the rapid development of deep learning, learning-based multi-view stereo (MVS) research has also made great progress. The goal of MVS is to reconstruct a highly detailed 3D geometric model of a scene or object, given a series of images and the corresponding camera poses and intrinsic parameters (the internal and external parameters of the cameras). As a branch of computer vision, MVS has developed tremendously over recent decades and is widely used in many fields, such as autonomous driving, robot navigation, and remote sensing. Learning-based methods can incorporate global semantic information, such as specularity and reflection priors, to achieve more reliable matching, and if the receptive field of the convolutional neural network (CNN) is large enough, poorly textured areas can also be reconstructed well. Existing learning-based MVS reconstruction methods fall into three main categories: voxel-based, point-cloud-based, and depth-map-based. Voxel-based methods divide 3D space into a regular grid and estimate whether each voxel is attached to the surface. Point-cloud-based methods operate directly on the point cloud, usually relying on a propagation strategy to make the reconstruction progressively denser. Depth-map-based methods use estimated depth maps as an intermediate representation, which decomposes the complex MVS problem into relatively small per-view depth estimation problems: each estimation focuses on only one reference image and several source images at a time, and the per-view depth maps are then fused into the final 3D point cloud model. Although the previously proposed reconstruction methods leave room for improvement, the latest MVS benchmarks (such as the Technical University of Denmark (DTU) dataset) have shown that using depth maps as an intermediate representation leads to more accurate 3D model reconstruction. Several end-to-end neural networks have been proposed to predict scene depth directly from a series of input images (for example, MVSNet and R-MVSNet). Although the accuracy of these methods has been verified on the DTU dataset, most of them still use only 3D CNNs to predict depth-map or voxel occupancy, which not only leads to excessive memory consumption but also limits the resolution, so the reconstruction results are not ideal. In response to these problems, this paper proposes an end-to-end deep learning architecture based on the attention mechanism for 3D reconstruction. The framework takes one reference image and multiple source images as input and outputs the depth map of the reference image. Depth map estimation proceeds in five steps: depth feature extraction, matching cost construction, cost regularization, depth map estimation, and depth map refinement; a rough skeleton of this pipeline is sketched below.
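As a rough illustration only (this is not the authors' released code; the use of PyTorch, the module interfaces, and the tensor shapes are our assumptions), the following skeleton shows how the five steps connect, with sketches of the individual components following in the Method section:

import torch
import torch.nn as nn

class MVSPipelineSketch(nn.Module):
    # Hypothetical skeleton of the depth-estimation pipeline. The feature
    # network (step 1), cost-volume builder (step 2), and cost regularizer
    # (step 3) are passed in as modules; each is sketched later.
    def __init__(self, feature_net, cost_builder, regularizer):
        super().__init__()
        self.feature_net = feature_net    # 2D CNN with attention layers
        self.cost_builder = cost_builder  # homography warping + cost metric
        self.regularizer = regularizer    # multilayer 3D U-Net

    def forward(self, images, proj_mats, depth_hyps):
        feats = [self.feature_net(img) for img in images]       # step 1
        cost = self.cost_builder(feats, proj_mats, depth_hyps)  # step 2
        scores = self.regularizer(cost)          # step 3: (B, D, H, W)
        # step 4: softmax over depth hypotheses, then regress depth as
        # the expectation over the hypothesised depth values
        prob = torch.softmax(scores, dim=1)
        depth = torch.sum(prob * depth_hyps.view(1, -1, 1, 1), dim=1)
        return depth  # step 5 (edge-aware refinement) is omitted here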
Method
First, deep features are extracted from the input source images and the reference image. At each level of the feature extraction module, an attention layer is added so that the network focuses on learning the information that is important for depth inference and captures the long-range dependencies of the depth inference task.
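The paper's exact attention design is not reproduced here, so the following is a minimal sketch of one plausible form of such a layer, a SAGAN-style self-attention block over 2D feature maps (the class name, the channel-reduction factor of 8, and the residual gating are our choices):

import torch
import torch.nn as nn

class SelfAttention2D(nn.Module):
    # Minimal self-attention over a (B, C, H, W) feature map; every spatial
    # position attends to every other position, capturing long-range
    # dependencies. Assumes C is divisible by 8.
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C//8)
        k = self.key(x).flatten(2)                    # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)           # (B, HW, HW)
        v = self.value(x).flatten(2)                  # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x  # attended response added to the input

Because gamma starts at zero, the layer initially passes features through unchanged and learns during training how much global context to mix in.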
Second, differentiable homography warping is used to build the feature volumes in the reference camera frustum, from which the matching cost volume is constructed. The central idea of cost volume construction is to compute, under each sampled depth hypothesis, the matching cost between every pixel of the reference camera and the corresponding pixels of its neighboring cameras.
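A common way to implement this step (a sketch in the style of public MVSNet implementations, not the authors' code; the 4 x 4 world-to-pixel projection convention and all names are our assumptions) is to back-project the reference pixel grid to each hypothesised depth, project it into the source view, and sample the source features there:

import torch
import torch.nn.functional as F

def homo_warp(src_feat, proj_src, proj_ref, depth_hyps):
    # src_feat: (B, C, H, W) source-view features; proj_src, proj_ref:
    # (B, 4, 4) world-to-pixel projections; depth_hyps: (D,) sampled depths.
    # Returns the warped feature volume of shape (B, C, D, H, W).
    b, c, h, w = src_feat.shape
    d = depth_hyps.numel()
    trans = proj_src @ torch.inverse(proj_ref)  # reference pixels -> source
    y, x = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0)  # homogeneous grid
    pix = pix.view(1, 3, -1).expand(b, 3, h * w)
    # back-project each pixel to every depth hypothesis: (u*d, v*d, d, 1)
    pts = pix.unsqueeze(1) * depth_hyps.view(1, d, 1, 1)   # (B, D, 3, HW)
    pts = torch.cat([pts, torch.ones_like(pts[:, :, :1])], dim=2)
    src = trans.view(b, 1, 4, 4) @ pts                     # (B, D, 4, HW)
    xy = src[:, :, :2] / src[:, :, 2:3].clamp(min=1e-6)    # perspective divide
    gx = xy[:, :, 0] / ((w - 1) / 2) - 1   # normalise to [-1, 1]
    gy = xy[:, :, 1] / ((h - 1) / 2) - 1
    grid = torch.stack([gx, gy], dim=-1).view(b, d * h, w, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(b, c, d, h, w)

Because grid_sample interpolates bilinearly, the warping stays differentiable with respect to the features, which is what allows the whole network to be trained end to end.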
Finally, the multilayer U-Net architecture is used to regularize the cost volume, that is, to downsample the cost volume, extract context and neighboring-pixel information at different scales, and filter the cost volume; the final refined depth map is then generated through regression, combined with edge information from the reference image.
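A minimal sketch of such a regularizer follows (the channel widths, the two downsampling levels, and the BatchNorm/ReLU choices are our assumptions, not the paper's exact configuration; D, H, and W are assumed divisible by 4):

import torch
import torch.nn as nn

def conv3d_block(cin, cout, stride=1):
    # 3D convolution + BatchNorm + ReLU, the usual regularization building block
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True))

class CostRegNet(nn.Module):
    # A small 3D U-Net over the cost volume: the encoder downsamples to
    # gather context at coarser scales, the decoder upsamples back, and
    # skip connections reinject fine neighboring-pixel information.
    def __init__(self, cin=32):
        super().__init__()
        self.enc0 = conv3d_block(cin, 8)
        self.enc1 = conv3d_block(8, 16, stride=2)   # first downsampling
        self.enc2 = conv3d_block(16, 32, stride=2)  # second downsampling
        self.dec1 = nn.ConvTranspose3d(32, 16, 3, stride=2,
                                       padding=1, output_padding=1)
        self.dec0 = nn.ConvTranspose3d(16, 8, 3, stride=2,
                                       padding=1, output_padding=1)
        self.out = nn.Conv3d(8, 1, 3, padding=1)    # one score per voxel

    def forward(self, cost):                        # cost: (B, C, D, H, W)
        e0 = self.enc0(cost)
        e1 = self.enc1(e0)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2) + e1                     # skip connection
        d0 = self.dec0(d1) + e0                     # skip connection
        return self.out(d0).squeeze(1)              # (B, D, H, W)

Taking a softmax over the depth dimension of this output and regressing the expectation over the depth hypotheses, as in the pipeline skeleton above, yields the estimated depth map.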
In addition, the difference-based cost metric used in this paper not only handles an arbitrary number of input views but also aggregates the multiple per-view feature volumes into a single cost volume.
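One standard metric with both properties is the element-wise variance across views introduced by MVSNet; whether this paper uses exactly that form is our assumption, but the sketch below shows how any number N of per-view feature volumes collapses into a single cost volume:

import torch

def variance_cost(volumes):
    # volumes: a list of N feature volumes, each of shape (B, C, D, H, W),
    # one per input view. The variance across views is small where the
    # warped features agree, i.e. near the correct depth hypothesis.
    stacked = torch.stack(volumes, dim=0)      # (N, B, C, D, H, W)
    return stacked.var(dim=0, unbiased=False)  # (B, C, D, H, W), any N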
In summary, this work makes two contributions: 1) an attention mechanism applied to the feature extraction module, which focuses on learning the information important for depth inference and captures the long-range dependencies of the depth inference task; and 2) a multilayer U-Net network for cost regularization, which downsamples the cost volume and extracts context and neighboring-pixel information at different scales to filter it, after which the final refined depth map is generated through regression.
Result
Our method is tested on the DTU dataset and compared with several existing methods. Compared with Colmap, Gipuma, and Tola, the overall index improves by 8.5%, 13.1%, and 31.9%, respectively, and the completeness index improves by 20.7%, 41.6%, and 73.3%. Compared with Camp, Furu, and SurfaceNet, the overall index improves by 24.8%, 33%, and 29.8%, the accuracy index by 39.8%, 17.6%, and 1.3%, and the completeness index by 9.7%, 48.4%, and 58.3%. Compared with PruMvsnet, the overall index improves by 1.7% and the accuracy index by 5.8%. Compared with Mvsnet, the overall index improves by 1.5% and the completeness index by 7%.
Conclusion
The test results on the DTU dataset show that the network architecture proposed in this paper achieves the current best result on the overall index, improves the completeness and accuracy indices substantially, and produces better-quality 3D reconstructions, which demonstrates the effectiveness of the proposed method.
Keywords: attention mechanism; multi-layer U-Net; differentiable homography transformation; cost volume regularization; multi-view stereo (MVS)
References
Campbell N D F, Vogiatzis G, Hernández C and Cipolla R. 2008. Using multiple hypotheses to improve depth-maps for multi-view stereo//Proceedings of the 10th European Conference on Computer Vision. Marseille, France: Springer: 766-779 [DOI: 10.1007/978-3-540-88682-2_58]
Cernea D. 2015. OpenMVS: open multiple view stereo vision [CP/OL]. [2021-06-10]. https://github.com/cdcseacave/openMVS/
Chen R, Han S F, Xu J and Su H. 2019. Point-based multi-view stereo network//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 1538-1547 [DOI: 10.1109/ICCV.2019.00162]
Choi S, Kim S, Park K and Sohn K. 2018. Learning descriptor, confidence, and depth estimation in multi-view stereo//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, USA: IEEE: 389-396 [DOI: 10.1109/CVPRW.2018.00065]
Choy C B, Xu D F, Gwak J, Chen K and Savarese S. 2016. 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 628-644 [DOI: 10.1007/978-3-319-46484-8_38]
Eigen D, Puhrsch C and Fergus R. 2014. Depth map prediction from a single image using a multi-scale deep network [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1406.2283.pdf
Furukawa Y and Ponce J. 2006. Carved visual hulls for image-based modeling//Proceedings of the 9th European Conference on Computer Vision. Graz, Austria: Springer: 564-577 [DOI: 10.1007/11744023_44]
Furukawa Y and Ponce J. 2010. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8): 1362-1376 [DOI: 10.1109/TPAMI.2009.161]
Galliani S, Lasinger K and Schindler K. 2015. Massively parallel multiview stereopsis by surface normal diffusion//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 873-881 [DOI: 10.1109/ICCV.2015.106]
Galliani S, Lasinger K and Schindler K. 2016. Gipuma: massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e. V., 25: 361-369
Gu X D, Fan Z W, Zhu S Y, Dai Z Z, Tan F T and Tan P. 2020. Cascade cost volume for high-resolution multi-view stereo and stereo matching//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 2492-2501 [DOI: 10.1109/CVPR42600.2020.00257]
Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735]
Huang P H, Matzen K, Kopf J, Ahuja N and Huang J B. 2018. DeepMVS: learning multi-view stereopsis//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2821-2830 [DOI: 10.1109/CVPR.2018.00298]
Im S, Jeon H G, Lin S and Kweon I S. 2019. DPSNet: end-to-end deep plane sweep stereo [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1905.00538.pdf
Jensen R, Dahl A, Vogiatzis G, Tola E and Aanaes H. 2014. Large scale multi-view stereopsis evaluation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 406-413 [DOI: 10.1109/CVPR.2014.59]
Ji M Q, Gall J, Zheng H T, Liu Y B and Fang L. 2017. SurfaceNet: an end-to-end 3D neural network for multiview stereopsis//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2326-2334 [DOI: 10.1109/ICCV.2017.253]
Kar A, Häne C and Malik J. 2017. Learning a multi-view stereo machine [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1708.05375.pdf
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1412.6980.pdf
Li Z X, Wang K Q, Zuo W M, Meng D Y and Zhang L. 2016. Detail-preserving and content-aware variational multi-view stereo reconstruction. IEEE Transactions on Image Processing, 25(2): 864-877 [DOI: 10.1109/TIP.2015.2507400]
Liu J G. 2016. The multi-view 3D reconstruction of movable cultural relics. Archaeology, (12): 97-103
Luo K Y, Guan T, Ju L L, Huang H P and Luo Y W. 2019. P-MVSNet: learning patch-wise matching confidence aggregation for multi-view stereo//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 10451-10460 [DOI: 10.1109/ICCV.2019.01055]
Moulon P, Monasse P, Perrot R and Marlet R. 2017. OpenMVG: open multiple view geometry//Proceedings of the 1st International Workshop on Reproducible Research in Pattern Recognition. Cancún, Mexico: Springer: 60-74 [DOI: 10.1007/978-3-319-56414-2_5]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1505.04597.pdf
Schönberger J L and Frahm J M. 2016. Structure-from-motion revisited//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4104-4113 [DOI: 10.1109/CVPR.2016.445]
Schönberger J L, Zheng E L, Frahm J M and Pollefeys M. 2016. Pixelwise view selection for unstructured multi-view stereo//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 501-518 [DOI: 10.1007/978-3-319-46487-9_31]
Tola E, Strecha C and Fua P. 2012. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications, 23(5): 903-920 [DOI: 10.1007/s00138-011-0346-8]
Weilharter R and Fraundorfer F. 2021. HighRes-MVSNet: a fast multi-view stereo network for dense 3D reconstruction from high-resolution images. IEEE Access, 9: 11306-11315 [DOI: 10.1109/ACCESS.2021.3050556]
Xiang X, Wang Z Y, Lao S S and Zhang B C. 2020. Pruning multi-view stereo net for efficient 3D reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 168: 17-27 [DOI: 10.1016/j.isprsjprs.2020.06.018]
Xue Y Z, Chen J S, Wan W T, Huang Y Q, Yu C, Li T P and Bao J Y. 2019. MVSCRF: learning multi-view stereo with conditional random fields//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4311-4320 [DOI: 10.1109/ICCV.2019.00441]
Yang J Y, Mao W, Alvarez J M and Liu M M. 2019. Cost volume pyramid based depth inference for multi-view stereo [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1912.08329.pdf
Yang J Y, Mao W, Alvarez J M and Liu M M. 2020. Cost volume pyramid based depth inference for multi-view stereo//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 4876-4885 [DOI: 10.1109/CVPR42600.2020.00493]
Yao Y, Luo Z X, Li S W, Fang T and Quan L. 2018. MVSNet: depth inference for unstructured multi-view stereo//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 785-801 [DOI: 10.1007/978-3-030-01237-3_47]
Yao Y, Luo Z X, Li S W, Shen T W, Fang T and Quan L. 2019. Recurrent MVSNet for high-resolution multi-view stereo depth inference//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5520-5529 [DOI: 10.1109/CVPR.2019.00567]
Yu Z H and Gao S H. 2020. Fast-MVSNet: sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1946-1955 [DOI: 10.1109/CVPR42600.2020.00202]
Zaharescu A, Boyer E and Horaud R. 2007. TransforMesh: a topology-adaptive mesh-based approach to surface evolution//Proceedings of the 8th Asian Conference on Computer Vision. Tokyo, Japan: Springer: 166-175 [DOI: 10.1007/978-3-540-76390-1_17]
Zhang L. 2018. The exploration on the photography of the multi-view three-dimensional reconstruction of movable cultural heritages. Huaxia Archaeology, (1): 123-128