Fusion attention mechanism and multilayer U-Net for multiview stereo
Vol. 27, Issue 2, Pages 475-485 (2022)
Published: 16 February 2022
Accepted: 21 September 2021
DOI: 10.11834/jig.210516
Huijie Liu, Zhengyao Bai, Wei Cheng, Junjie Li, Zhu Xu. Fusion attention mechanism and multilayer U-Net for multiview stereo[J]. Journal of Image and Graphics, 27(2): 475-485 (2022)
Objective
With the rapid development of deep learning, learning-based multi-view stereo (MVS) research has also made great progress. The goal of MVS is to reconstruct a highly detailed 3D geometric model of a scene or object, given a series of images and the corresponding camera poses and intrinsic parameters (the internal and external parameters of the cameras). As a branch of computer vision, MVS has developed tremendously over recent decades and is widely used in many fields, such as autonomous driving, robot navigation, and remote sensing. Learning-based methods can incorporate global semantic information, such as specularity and reflection priors, to achieve more reliable matching, and if the receptive field of the convolutional neural network (CNN) is large enough, poorly textured areas can also be reconstructed well. Existing learning-based MVS reconstruction methods fall into three main categories: voxel-based, point-cloud-based, and depth-map-based. Voxel-based methods divide 3D space into a regular grid and estimate whether each voxel is attached to the surface. Point-cloud-based methods operate directly on the point cloud, usually relying on a propagation strategy to make the reconstruction progressively denser. Depth-map-based methods use estimated depth maps as an intermediate representation, which decomposes the complex MVS problem into relatively small per-view depth estimation problems: each estimation focuses on only one reference image and several source images at a time, and the per-view depth maps are then fused into the final 3D point cloud model. Although the previously proposed reconstruction methods leave room for improvement, the latest MVS benchmarks (such as the Technical University of Denmark (DTU) dataset) have shown that using depth maps as an intermediate representation leads to more accurate 3D model reconstruction. Several end-to-end neural networks have been proposed to predict scene depth directly from a series of input images (for example, MVSNet and R-MVSNet). Although the accuracy of these methods has been verified on the DTU dataset, most of them still use only 3D CNNs to predict depth-map or voxel occupancy, which not only leads to excessive memory consumption but also limits the resolution, so the reconstruction results are not ideal. In response to these problems, this paper proposes an end-to-end deep learning architecture based on the attention mechanism for 3D reconstruction. The framework takes one reference image and multiple source images as input and outputs the depth map of the reference image. Depth map estimation proceeds in five steps: depth feature extraction, matching cost construction, cost regularization, depth map estimation, and depth map refinement; a rough skeleton of this pipeline is sketched below.
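As a rough illustration only (this is not the authors' released code; the use of PyTorch, the module interfaces, and the tensor shapes are our assumptions), the following skeleton shows how the five steps connect, with sketches of the individual components following in the Method section:

import torch
import torch.nn as nn

class MVSPipelineSketch(nn.Module):
    # Hypothetical skeleton of the depth-estimation pipeline. The feature
    # network (step 1), cost-volume builder (step 2), and cost regularizer
    # (step 3) are passed in as modules; each is sketched later.
    def __init__(self, feature_net, cost_builder, regularizer):
        super().__init__()
        self.feature_net = feature_net    # 2D CNN with attention layers
        self.cost_builder = cost_builder  # homography warping + cost metric
        self.regularizer = regularizer    # multilayer 3D U-Net

    def forward(self, images, proj_mats, depth_hyps):
        feats = [self.feature_net(img) for img in images]       # step 1
        cost = self.cost_builder(feats, proj_mats, depth_hyps)  # step 2
        scores = self.regularizer(cost)          # step 3: (B, D, H, W)
        # step 4: softmax over depth hypotheses, then regress depth as
        # the expectation over the hypothesised depth values
        prob = torch.softmax(scores, dim=1)
        depth = torch.sum(prob * depth_hyps.view(1, -1, 1, 1), dim=1)
        return depth  # step 5 (edge-aware refinement) is omitted here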
Method
First, deep features are extracted from the input source images and the reference image. At each level of the feature extraction module, an attention layer is added so that the network focuses on learning the information that is important for depth inference and captures the long-range dependencies of the depth inference task.
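The paper's exact attention design is not reproduced here, so the following is a minimal sketch of one plausible form of such a layer, a SAGAN-style self-attention block over 2D feature maps (the class name, the channel-reduction factor of 8, and the residual gating are our choices):

import torch
import torch.nn as nn

class SelfAttention2D(nn.Module):
    # Minimal self-attention over a (B, C, H, W) feature map; every spatial
    # position attends to every other position, capturing long-range
    # dependencies. Assumes C is divisible by 8.
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C//8)
        k = self.key(x).flatten(2)                    # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)           # (B, HW, HW)
        v = self.value(x).flatten(2)                  # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x  # attended response added to the input

Because gamma starts at zero, the layer initially passes features through unchanged and learns during training how much global context to mix in.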
Second, differentiable homography warping is used to build the feature volumes in the reference camera frustum, from which the matching cost volume is constructed. The central idea of cost volume construction is to compute, under each sampled depth hypothesis, the matching cost between every pixel of the reference camera and the corresponding pixels of its neighboring cameras.
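A common way to implement this step (a sketch in the style of public MVSNet implementations, not the authors' code; the 4 x 4 world-to-pixel projection convention and all names are our assumptions) is to back-project the reference pixel grid to each hypothesised depth, project it into the source view, and sample the source features there:

import torch
import torch.nn.functional as F

def homo_warp(src_feat, proj_src, proj_ref, depth_hyps):
    # src_feat: (B, C, H, W) source-view features; proj_src, proj_ref:
    # (B, 4, 4) world-to-pixel projections; depth_hyps: (D,) sampled depths.
    # Returns the warped feature volume of shape (B, C, D, H, W).
    b, c, h, w = src_feat.shape
    d = depth_hyps.numel()
    trans = proj_src @ torch.inverse(proj_ref)  # reference pixels -> source
    y, x = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0)  # homogeneous grid
    pix = pix.view(1, 3, -1).expand(b, 3, h * w)
    # back-project each pixel to every depth hypothesis: (u*d, v*d, d, 1)
    pts = pix.unsqueeze(1) * depth_hyps.view(1, d, 1, 1)   # (B, D, 3, HW)
    pts = torch.cat([pts, torch.ones_like(pts[:, :, :1])], dim=2)
    src = trans.view(b, 1, 4, 4) @ pts                     # (B, D, 4, HW)
    xy = src[:, :, :2] / src[:, :, 2:3].clamp(min=1e-6)    # perspective divide
    gx = xy[:, :, 0] / ((w - 1) / 2) - 1   # normalise to [-1, 1]
    gy = xy[:, :, 1] / ((h - 1) / 2) - 1
    grid = torch.stack([gx, gy], dim=-1).view(b, d * h, w, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(b, c, d, h, w)

Because grid_sample interpolates bilinearly, the warping stays differentiable with respect to the features, which is what allows the whole network to be trained end to end.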
Finally, the multilayer U-Net architecture is used to regularize the cost volume, that is, to downsample the cost volume, extract context and neighboring-pixel information at different scales, and filter the cost volume; the final refined depth map is then generated through regression, combined with edge information from the reference image.
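A minimal sketch of such a regularizer follows (the channel widths, the two downsampling levels, and the BatchNorm/ReLU choices are our assumptions, not the paper's exact configuration; D, H, and W are assumed divisible by 4):

import torch
import torch.nn as nn

def conv3d_block(cin, cout, stride=1):
    # 3D convolution + BatchNorm + ReLU, the usual regularization building block
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True))

class CostRegNet(nn.Module):
    # A small 3D U-Net over the cost volume: the encoder downsamples to
    # gather context at coarser scales, the decoder upsamples back, and
    # skip connections reinject fine neighboring-pixel information.
    def __init__(self, cin=32):
        super().__init__()
        self.enc0 = conv3d_block(cin, 8)
        self.enc1 = conv3d_block(8, 16, stride=2)   # first downsampling
        self.enc2 = conv3d_block(16, 32, stride=2)  # second downsampling
        self.dec1 = nn.ConvTranspose3d(32, 16, 3, stride=2,
                                       padding=1, output_padding=1)
        self.dec0 = nn.ConvTranspose3d(16, 8, 3, stride=2,
                                       padding=1, output_padding=1)
        self.out = nn.Conv3d(8, 1, 3, padding=1)    # one score per voxel

    def forward(self, cost):                        # cost: (B, C, D, H, W)
        e0 = self.enc0(cost)
        e1 = self.enc1(e0)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2) + e1                     # skip connection
        d0 = self.dec0(d1) + e0                     # skip connection
        return self.out(d0).squeeze(1)              # (B, D, H, W)

Taking a softmax over the depth dimension of this output and regressing the expectation over the depth hypotheses, as in the pipeline skeleton above, yields the estimated depth map.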
In addition, the difference-based cost metric used in this paper not only handles an arbitrary number of input views but also aggregates the multiple per-view feature volumes into a single cost volume.
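One standard metric with both properties is the element-wise variance across views introduced by MVSNet; whether this paper uses exactly that form is our assumption, but the sketch below shows how any number N of per-view feature volumes collapses into a single cost volume:

import torch

def variance_cost(volumes):
    # volumes: a list of N feature volumes, each of shape (B, C, D, H, W),
    # one per input view. The variance across views is small where the
    # warped features agree, i.e. near the correct depth hypothesis.
    stacked = torch.stack(volumes, dim=0)      # (N, B, C, D, H, W)
    return stacked.var(dim=0, unbiased=False)  # (B, C, D, H, W), any N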
In summary, this work makes two contributions: 1) an attention mechanism applied to the feature extraction module, which focuses on learning the information important for depth inference and captures the long-range dependencies of the depth inference task; and 2) a multilayer U-Net network for cost regularization, which downsamples the cost volume and extracts context and neighboring-pixel information at different scales to filter it, after which the final refined depth map is generated through regression.
Result
Our method is tested on the DTU dataset and compared with several existing methods. Compared with Colmap, Gipuma, and Tola, the overall index improves by 8.5%, 13.1%, and 31.9%, respectively, and the completeness index improves by 20.7%, 41.6%, and 73.3%. Compared with Camp, Furu, and SurfaceNet, the overall index improves by 24.8%, 33%, and 29.8%, the accuracy index by 39.8%, 17.6%, and 1.3%, and the completeness index by 9.7%, 48.4%, and 58.3%. Compared with PruMvsnet, the overall index improves by 1.7% and the accuracy index by 5.8%. Compared with Mvsnet, the overall index improves by 1.5% and the completeness index by 7%.
Conclusion
The test results on the DTU dataset show that the network architecture proposed in this paper achieves the current best result on the overall index, improves the completeness and accuracy indices substantially, and produces better-quality 3D reconstructions, which demonstrates the effectiveness of the proposed method.
Keywords: attention mechanism; multi-layer U-Net; differentiable homography transformation; cost volume regularization; multi-view stereo (MVS)
References
Campbell N D F, Vogiatzis G, Hernández C and Cipolla R. 2008. Using multiple hypotheses to improve depth-maps for multi-view stereo//Proceedings of the 10th European Conference on Computer Vision. Marseille, France: Springer: 766-779 [DOI: 10.1007/978-3-540-88682-2_58]
Cernea D. 2015. OpenMVS: open multiple view stereo vision [CP/OL]. [2021-06-10]. https://github.com/cdcseacave/openMVS/
Chen R, Han S F, Xu J and Su H. 2019. Point-based multi-view stereo network//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 1538-1547 [DOI: 10.1109/ICCV.2019.00162]
Choi S, Kim S, Park K and Sohn K. 2018. Learning descriptor, confidence, and depth estimation in multi-view stereo//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, USA: IEEE: 389-396 [DOI: 10.1109/CVPRW.2018.00065]
Choy C B, Xu D F, Gwak J, Chen K and Savarese S. 2016. 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 628-644 [DOI: 10.1007/978-3-319-46484-8_38]
Eigen D, Puhrsch C and Fergus R. 2014. Depth map prediction from a single image using a multi-scale deep network [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1406.2283.pdf
Furukawa Y and Ponce J. 2006. Carved visual hulls for image-based modeling//Proceedings of the 9th European Conference on Computer Vision. Graz, Austria: Springer: 564-577 [DOI: 10.1007/11744023_44]
Furukawa Y and Ponce J. 2010. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8): 1362-1376 [DOI: 10.1109/TPAMI.2009.161]
Galliani S, Lasinger K and Schindler K. 2015. Massively parallel multiview stereopsis by surface normal diffusion//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 873-881 [DOI: 10.1109/ICCV.2015.106]
Galliani S, Lasinger K and Schindler K. 2016. Gipuma: massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e. V., 25: 361-369
Gu X D, Fan Z W, Zhu S Y, Dai Z Z, Tan F T and Tan P. 2020. Cascade cost volume for high-resolution multi-view stereo and stereo matching//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 2492-2501 [DOI: 10.1109/CVPR42600.2020.00257]
Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735]
Huang P H, Matzen K, Kopf J, Ahuja N and Huang J B. 2018. DeepMVS: learning multi-view stereopsis//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2821-2830 [DOI: 10.1109/CVPR.2018.00298]
Im S, Jeon H G, Lin S and Kweon I S. 2019. DPSNet: end-to-end deep plane sweep stereo [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1905.00538.pdf
Jensen R, Dahl A, Vogiatzis G, Tola E and Aanaes H. 2014. Large scale multi-view stereopsis evaluation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 406-413 [DOI: 10.1109/CVPR.2014.59]
Ji M Q, Gall J, Zheng H T, Liu Y B and Fang L. 2017. SurfaceNet: an end-to-end 3D neural network for multiview stereopsis//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2326-2334 [DOI: 10.1109/ICCV.2017.253]
Kar A, Häne C and Malik J. 2017. Learning a multi-view stereo machine [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1708.05375.pdf
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1412.6980.pdf
Li Z X, Wang K Q, Zuo W M, Meng D Y and Zhang L. 2016. Detail-preserving and content-aware variational multi-view stereo reconstruction. IEEE Transactions on Image Processing, 25(2): 864-877 [DOI: 10.1109/TIP.2015.2507400]
Liu J G. 2016. The multi-view 3D reconstruction of movable cultural relics. Archaeology, (12): 97-103
Luo K Y, Guan T, Ju L L, Huang H P and Luo Y W. 2019. P-MVSNet: learning patch-wise matching confidence aggregation for multi-view stereo//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 10451-10460 [DOI: 10.1109/ICCV.2019.01055]
Moulon P, Monasse P, Perrot R and Marlet R. 2017. OpenMVG: open multiple view geometry//Proceedings of the 1st International Workshop on Reproducible Research in Pattern Recognition. Cancún, Mexico: Springer: 60-74 [DOI: 10.1007/978-3-319-56414-2_5]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1505.04597.pdf
Schönberger J L and Frahm J M. 2016. Structure-from-motion revisited//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4104-4113 [DOI: 10.1109/CVPR.2016.445]
Schönberger J L, Zheng E L, Frahm J M and Pollefeys M. 2016. Pixelwise view selection for unstructured multi-view stereo//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 501-518 [DOI: 10.1007/978-3-319-46487-9_31]
Tola E, Strecha C and Fua P. 2012. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications, 23(5): 903-920 [DOI: 10.1007/s00138-011-0346-8]
Weilharter R and Fraundorfer F. 2021. HighRes-MVSNet: a fast multi-view stereo network for dense 3D reconstruction from high-resolution images. IEEE Access, 9: 11306-11315 [DOI: 10.1109/ACCESS.2021.3050556]
Xiang X, Wang Z Y, Lao S S and Zhang B C. 2020. Pruning multi-view stereo net for efficient 3D reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 168: 17-27 [DOI: 10.1016/j.isprsjprs.2020.06.018]
Xue Y Z, Chen J S, Wan W T, Huang Y Q, Yu C, Li T P and Bao J Y. 2019. MVSCRF: learning multi-view stereo with conditional random fields//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4311-4320 [DOI: 10.1109/ICCV.2019.00441]
Yang J Y, Mao W, Alvarez J M and Liu M M. 2019. Cost volume pyramid based depth inference for multi-view stereo [EB/OL]. [2021-06-10]. https://arxiv.org/pdf/1912.08329.pdf
Yang J Y, Mao W, Alvarez J M and Liu M M. 2020. Cost volume pyramid based depth inference for multi-view stereo//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 4876-4885 [DOI: 10.1109/CVPR42600.2020.00493]
Yao Y, Luo Z X, Li S W, Fang T and Quan L. 2018. MVSNet: depth inference for unstructured multi-view stereo//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 785-801 [DOI: 10.1007/978-3-030-01237-3_47]
Yao Y, Luo Z X, Li S W, Shen T W, Fang T and Quan L. 2019. Recurrent MVSNet for high-resolution multi-view stereo depth inference//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5520-5529 [DOI: 10.1109/CVPR.2019.00567]
Yu Z H and Gao S H. 2020. Fast-MVSNet: sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1946-1955 [DOI: 10.1109/CVPR42600.2020.00202]
Zaharescu A, Boyer E and Horaud R. 2007. TransforMesh: a topology-adaptive mesh-based approach to surface evolution//Proceedings of the 8th Asian Conference on Computer Vision. Tokyo, Japan: Springer: 166-175 [DOI: 10.1007/978-3-540-76390-1_17]
Zhang L. 2018. The exploration on the photography of the multi-view three-dimensional reconstruction of movable cultural heritages. Huaxia Archaeology, (1): 123-128