融合语义先验和渐进式深度优化的宽基线3维场景重建
Wide-baseline 3D reconstruction with semantic prior fusion and progressive depth optimization
2019, Vol. 24, No. 4, pp. 603-614
Received: 2018-08-02; Revised: 2018-08-27; Published in print: 2019-04-24
DOI: 10.11834/jig.180477
Objective
Vision-based 3D scene reconstruction has been widely applied in fields such as robot navigation, aerial map building, and augmented reality. However, when the camera undergoes large motion, traditional reconstruction methods built on narrow-baseline constraints fail to work properly.
Method
For wide-baseline environments, a 3D scene reconstruction algorithm that fuses high-level semantic priors is proposed. On the basis of a Markov random field (MRF) model, the method combines multiple superpixel features, including appearance, co-linearity, co-planarity, and depth, to jointly infer the 3D position and orientation of each superpixel across different views, yielding an initial 3D reconstruction under wide-baseline conditions. Meanwhile, high-level semantic priors are applied recursively to merge superpixels with similar depths, progressively refining the scene depth and the 3D model.
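The superpixel-level MRF inference described above can be written schematically as an energy over per-superpixel plane parameters; the notation below is a sketch, since the abstract does not give the exact form of the potentials:

```latex
% \pi_i: plane parameters (3D position and orientation) of superpixel i
% \mathcal{N}: set of adjacent superpixel pairs; \lambda: balancing weight
E(\{\pi_i\}) = \sum_{i} \phi_{\mathrm{depth}}(\pi_i)
  + \lambda \sum_{(i,j) \in \mathcal{N}}
    \bigl[ \psi_{\mathrm{colin}}(\pi_i, \pi_j)
         + \psi_{\mathrm{conn}}(\pi_i, \pi_j)
         + \psi_{\mathrm{coplan}}(\pi_i, \pi_j) \bigr]
```

Minimizing such an energy over all plane parameters jointly recovers the 3D position and orientation of every superpixel across views.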
Result
Experiments show that in a variety of wide-baseline environments, and especially under severe camera motion, the proposed method still achieves more stable and accurate depth estimation and 3D scene reconstruction than traditional methods.
Conclusion
This paper shows how, under wide-baseline conditions, multiple image features can be combined with triangulation-based geometric features to build an accurate 3D scene model. The method uses an MRF model to jointly infer the 3D position and orientation of superpixels in different views, with high-level semantic priors guiding the reconstruction process. In addition, a recursive framework progressively optimizes the scene depth. Experiments show that in various wide-baseline environments the method produces 3D scene models closer to the true scene than traditional methods.
Objective
As a research hotspot in computer vision, 3D scene reconstruction has been widely used in many fields, such as unmanned driving, digital entertainment, aeronautics, and astronautics. Traditional scene reconstruction methods iteratively estimate the camera pose and 3D scene model, sparsely or densely, on the basis of image sequences from multiple views by structure from motion. However, the large motion between cameras usually leads to occlusion and geometric deformation, which often appear in actual applications and significantly increase the difficulty of image matching. Most previous works, including sparse and dense reconstructions, are effective only in narrow-baseline environments, and wide-baseline 3D reconstruction is a considerably more difficult problem. This problem arises in many applications, such as robot navigation, aerial map building, and augmented reality, and is valuable for research. In recent years, several semantic-fusion-based solutions have been proposed and have become a developing trend because these methods are more consistent with human cognition of the scene.
Method
A novel wide-baseline dense 3D scene reconstruction algorithm, which integrates the attributes of outdoor structural scenes and high-level semantic priors, is proposed. Our algorithm has the following characteristics. 1) The superpixel, which covers a larger area than the pixel, is used as the geometric primitive for image representation, with the following advantages. First, it increases the robustness of region correlation in weak-texture environments. Second, it describes the actual boundaries of objects in the scene and the discontinuities of depth. Third, it reduces the number of graph nodes in the Markov random field (MRF) model, resulting in a remarkable reduction of computational complexity when solving the energy minimization problem. 2) An MRF model is utilized to estimate the 3D position and orientation of each superpixel in different view images on the basis of multiple low-level features. In our MRF energy function, the unary potential models the planar parameters of each superpixel and penalizes the relative error between estimated and ground-truth depths. The pairwise potential models three geometric relations, namely, co-linearity, connectivity, and co-planarity, between adjacent superpixels. In addition, a new potential is added to model the relative error between the triangulated and estimated depths. 3) The depth and 3D model of the scene are progressively optimized by merging superpixels with similar depths according to high-level semantic priors in our iterative framework. When adjacent superpixels have similar depths, they are merged into a larger superpixel, further reducing the possibility of depth discontinuity. The segmentation image after superpixel merging is used in the next iteration for MRF-based depth estimation. The MAP inference of our MRF model can be efficiently solved by classic linear programming.
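The recursive merging step above can be sketched in Python as a union-find pass over adjacent superpixels; the relative-difference threshold, the per-superpixel mean depths, and all names here are illustrative, since the abstract does not specify the exact merging criterion or the semantic-prior check:

```python
import numpy as np

def merge_similar_superpixels(labels, depths, adjacency, tau=0.1):
    """One merging round of the progressive optimization: adjacent
    superpixels whose mean depths differ by less than a relative
    threshold tau are fused into a larger superpixel.

    labels    : H x W integer segmentation map
    depths    : dict mapping superpixel label -> mean depth (assumed)
    adjacency : iterable of (i, j) pairs of adjacent labels
    """
    parent = {label: label for label in depths}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in adjacency:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        di, dj = depths[ri], depths[rj]
        # Relative depth difference as the (assumed) similarity test.
        if abs(di - dj) / max(di, dj) < tau:
            parent[rj] = ri
            depths[ri] = 0.5 * (di + dj)  # depth of the merged region

    merged = np.vectorize(find)(labels)  # relabel every pixel
    merged_depths = {find(l): depths[find(l)] for l in depths}
    return merged, merged_depths
```

The merged segmentation map would then feed the next MRF-based depth estimation round, as the iterative framework describes.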
Result
We use several classic wide-baseline image sequences, such as "Stanford Ⅰ, Ⅱ, Ⅲ, and Ⅳ", "Merton College Ⅲ", "University Library", and "Wadham College", to evaluate the performance of our wide-baseline 3D scene reconstruction algorithm. Experimental results demonstrate that our algorithm can estimate large camera motion more accurately than the classic method and can recover more robust and accurate depth estimates and 3D scene models. Our algorithm works effectively in both narrow- and wide-baseline environments and is especially suitable for large-scale scene reconstruction.
Conclusion
This study shows how to recover an accurate 3D scene model from multiple image features and triangulated geometric features in wide-baseline environments. We use an MRF model to estimate the planar parameters of superpixels in different views, and a high-level semantic prior is integrated to guide the merging of superpixels with similar depths. Furthermore, an iterative framework is proposed to progressively optimize the scene depth and the 3D scene model. Experimental results show that the proposed algorithm achieves more accurate 3D scene models than the classic algorithm on different wide-baseline image datasets.