Published: 2019-04-24
Computer Graphics
Received: 2018-08-02; revised: 2018-08-27
Funding: National Natural Science Foundation of China Youth Science Fund (61502256); Zhejiang Provincial Key Research and Development Program (2018C01086); Ningbo Natural Science Foundation (2018A610160)
First author:
Yao Tuozhong, born in 1983, male, lecturer; research interests: computer vision and machine learning. E-mail: thomasyao@zju.edu.cn
An Peng, male, professor; research interests: robotics and embedded systems. E-mail: anp04@126.com. Song Jiatao, male, professor; research interests: pattern recognition and image processing. E-mail: sjt6612@163.com
CLC number: TP391.4
Document code: A
Article ID: 1006-8961(2019)04-0603-12
Abstract
Objective As a research hotspot in computer vision, 3D scene reconstruction has been widely used in many fields, such as unmanned driving, digital entertainment, aeronautics, and astronautics. Traditional scene reconstruction methods iteratively estimate the camera pose and the 3D scene model, sparsely or densely, from image sequences of multiple views by structure from motion. However, large motion between cameras usually leads to occlusion and geometric deformation, which often appear in actual applications and significantly increase the difficulty of image matching. Most previous works, including sparse and dense reconstructions, are only effective in narrow-baseline environments, and wide-baseline 3D reconstruction is a considerably more difficult problem. This problem arises in many applications, such as robot navigation, aerial map building, and augmented reality, and is valuable for research. In recent years, several semantic-fusion-based solutions have been proposed and have become a developing trend because these methods are more consistent with human cognition of the scene. Method A novel wide-baseline dense 3D scene reconstruction algorithm, which integrates the attributes of outdoor structured scenes and high-level semantic priors, is proposed. Our algorithm has the following characteristics. 1) A superpixel, which covers a larger area than a pixel, is used as the geometric primitive for image representation, with the following advantages. First, it increases the robustness of region correlation in weak-texture environments. Second, it describes the actual boundaries of the objects in the scene and the discontinuities of the depth. Third, it reduces the number of graph nodes in the Markov random field (MRF) model, thereby remarkably reducing the computational complexity of solving the energy minimization problem.
2) An MRF model is utilized to estimate the 3D position and orientation of each superpixel in different view images on the basis of multiple low-level features. In our MRF energy function, the unary potential models the planar parameter of each superpixel and penalizes the relative error between the estimated and ground-truth depths. The pairwise potential models three geometric relations, namely, co-linearity, connectivity, and co-planarity between adjacent superpixels. In addition, a new potential is added to model the relative error between the triangulated and estimated depths. 3) The depth and 3D model of the scene are progressively optimized through merging of superpixels with similar depths according to high-level semantic priors in our iterative framework. When adjacent superpixels have similar depths, they are merged into a larger superpixel, thereby further reducing the possibility of depth discontinuity. The segmentation image after superpixel merging is used in the next iteration for MRF-based depth estimation. The MAP inference of our MRF model can be efficiently solved by classic linear programming. Result We use several classic wide-baseline image sequences, such as "Stanford Ⅰ, Ⅱ, Ⅲ, and Ⅳ", "Merton College Ⅲ", "University Library", and "Wadham College", to evaluate the performance of our wide-baseline 3D scene reconstruction algorithm. Experimental results demonstrate that our algorithm can estimate large camera motion more accurately than the classic method and can recover more robust and accurate depth estimates and 3D scene models. Our algorithm works effectively in narrow- and wide-baseline environments and is especially suitable for large-scale scene reconstruction. Conclusion This study shows how to recover an accurate 3D scene model from multiple image features and triangulated geometric features in wide-baseline environments.
We use an MRF model to estimate the planar parameters of the superpixels in different views, and a high-level semantic prior is integrated to guide the merging of superpixels with similar depths. Furthermore, an iterative framework is proposed to progressively optimize the scene depth and the 3D scene model. Experimental results show that the proposed algorithm achieves a more accurate 3D scene model than the classic algorithm on different wide-baseline image datasets.
Key words
wide-baseline matching; dense 3D scene reconstruction; high-level semantic prior; superpixel merging; progressive optimization
0 Introduction
As a major research focus in computer vision, 3D scene reconstruction has been extensively studied and applied in fields such as aerospace, autonomous driving, and digital entertainment. Traditional 3D scene reconstruction starts from image sequences captured from multiple viewpoints and uses structure from motion (SFM) to iteratively estimate the camera poses and to represent the scene in 3D, either as a sparse point cloud or as a dense model. One of the key problems is how to accurately establish correspondences between images taken from different viewpoints. Because the position and pose of the camera at capture time are usually arbitrary, there is often substantial motion between cameras (i.e., a long baseline between the camera optical centers), which causes significant occlusion and geometric deformation between views and greatly increases the difficulty of image matching; this is the classic wide-baseline matching problem [1]. The problem arises in many applications, such as robot visual navigation, aerial map building, and augmented reality, and is therefore of substantial research interest.
1 Related work
The wide-baseline image matching problem was first posed in 1998 by Pritchett et al. [1] of the Oxford robotics group, and much subsequent research has focused on designing more robust features for estimating the essential matrix. Tuytelaars et al. [2] and Xiao et al. [3] used affine-invariant features, while many other works used the SIFT (scale-invariant feature transform) descriptor [4], the speed-oriented DAISY descriptor [5], or scale-invariant descriptors [6]. In addition, Bay et al. [7] and Micusik et al. [8] used line segments and rectangles formed from line segments as features, respectively; region features such as MSER (maximally stable extremal regions) [9] and texture descriptors [10] have also been applied in wide-baseline settings, and some descriptors were designed with occlusion handling in mind [11]. In dense scene reconstruction, point and region features are widely used; for example, SIFT-flow [12], PatchMatch [13], spatial pyramid matching [14], and deformable models [15] all benefit wide-baseline scene reconstruction. Overall, region-based matching is one of the dominant trends under wide-baseline conditions, since regions reflect similarity and difference more robustly and accurately than point or line features.
Notably, triangulation-based geometric estimation in SFM requires small camera motion between adjacent views, which usually cannot be satisfied under wide-baseline conditions. A number of works have used artificial intelligence techniques to achieve, from a single image, depth estimation [16-17], 3D structure inference [18-19], and semantic labeling [20-21]. Some studies have begun to exploit the semantic information inferred from a single image to improve traditional multi-view geometric depth estimation [22-23], the sparse 3D point cloud estimation of SLAM (simultaneous localization and mapping) visual navigation systems [24-25], and the accuracy of dense 3D model reconstruction [26-27]. However, almost all of these works to date, whether sparse [24-25] or dense [26-27] 3D reconstruction, target narrow-baseline settings. Fusing traditional geometry-based 3D reconstruction with semantics is becoming a trend, is more consistent with how humans perceive scenes, and can also play a role in wide-baseline 3D reconstruction.
2 Algorithm description
To better cope with wide-baseline conditions in structured scenes, this paper builds on the original geometric features and reconstructs the 3D scene by combining general properties of outdoor structured scenes with semantic priors shared across views. The method has the following characteristics. 1) Superpixels are used as the geometric primitives for image representation. The benefits are threefold: first, superpixels, covering larger areas than pixels, reduce the ambiguity of region association in weakly textured environments; second, they reflect the true object boundaries and depth discontinuities in the scene; finally, when solving the energy minimization, a superpixel-based graph has far fewer nodes than a pixel-based one, so the computational complexity is lower. 2) Rich low-level features computed on a single image are combined with high-level semantic priors to improve the reconstruction. 3) The scene depth is optimized iteratively: the depth obtained from model estimation, together with the semantic priors, guides unsupervised image segmentation, and the updated segmentation map is used for the next round of depth estimation.
This paper captures structured scenes with calibrated cameras and uses graph-based unsupervised segmentation [28] to pre-partition each input image into a set of locally homogeneous, irregularly shaped superpixels, as shown in Fig. 1.
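The graph-based segmentation of [28] greedily merges pixels whose connecting edge weight is small relative to the internal variation of the components being joined. A minimal single-channel sketch of this idea follows (illustrative Python; the function name `felzenszwalb_like` and the scalar `k` are our own, and the actual preprocessing uses the full color implementation of [28]):

```python
import numpy as np

class UnionFind:
    """Disjoint sets tracking component size and largest internal edge weight."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.internal[ra] = max(self.internal[ra], self.internal[rb], w)

def felzenszwalb_like(img, k=10.0):
    """Greedy graph-based segmentation in the spirit of [28]: an edge is
    merged when its weight does not exceed the minimum internal difference
    of the two components plus a size-dependent tolerance k / |component|."""
    h, w = img.shape
    idx = lambda y, x: y * w + x
    edges = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                edges.append((abs(float(img[y, x]) - float(img[y, x + 1])),
                              idx(y, x), idx(y, x + 1)))
            if y + 1 < h:
                edges.append((abs(float(img[y, x]) - float(img[y + 1, x])),
                              idx(y, x), idx(y + 1, x)))
    uf = UnionFind(h * w)
    for wgt, a, b in sorted(edges):          # process edges by increasing weight
        ra, rb = uf.find(a), uf.find(b)
        if ra == rb:
            continue
        tau = lambda r: uf.internal[r] + k / uf.size[r]
        if wgt <= min(tau(ra), tau(rb)):
            uf.union(a, b, wgt)
    return np.array([uf.find(i) for i in range(h * w)]).reshape(h, w)
```

Larger `k` favors larger superpixels; the irregular, locally homogeneous regions it produces are exactly what the pipeline above needs as geometric primitives.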
This paper assumes that a 3D image patch must lie in the overlap between the projection cone passing through the boundary of its 2D superpixel projection and the 3D plane on which it lies. The 3D position and orientation of the 3D patch projected onto a superpixel are parameterized by the plane parameter
Any point on this superpixel plane
To determine whether two superpixels in a pair of images to be matched belong to the same region in 3D space, this paper uses an MRF model [29] to infer the position and orientation of every superpixel plane in the different view images. The
$$D(\boldsymbol{\alpha}\,|\,\boldsymbol{X},\boldsymbol{Y},\boldsymbol{d}_{\mathrm T};\boldsymbol{\theta}) \propto \prod_n D_1(\boldsymbol{\alpha}^n\,|\,\boldsymbol{X}^n,\boldsymbol{Y}^n,\boldsymbol{Q}^n;\boldsymbol{\theta}^n)\times\prod_n D_2(\boldsymbol{\alpha}^n\,|\,\boldsymbol{X}^n,\boldsymbol{Y}^n,\boldsymbol{Q}^n)\times\prod_{n,m\in\boldsymbol{\psi}} D_3(\boldsymbol{\alpha}^n,\boldsymbol{\alpha}^m\,|\,\boldsymbol{Q}^n,\boldsymbol{Q}^m,\boldsymbol{Y}^{nm})\times\prod_n D_4(\boldsymbol{\alpha}^n\,|\,\boldsymbol{Q}^n,\boldsymbol{d}_{\mathrm T}^n,\boldsymbol{Y}_{\mathrm T}^n)\tag{1}$$
where the four potentials $D_1$–$D_4$ are defined in Section 3.
The wide-baseline dense 3D scene reconstruction algorithm of this paper adopts the iterative framework shown in Fig. 3: the depths inferred by the MRF model, together with the high-level semantic priors, are used to merge similar superpixels in the segmentation map. The merged segmentation map is fed back into the MRF model for depth estimation, and the final 3D scene model is obtained by a subsequent multi-view depth fusion step.
3 Definition of the energy function
3.1 Unary term
The first term of the energy function relates the plane parameter
$$\hat d_{i,s_i}/d_{i,s_i} - 1 = \boldsymbol{R}_{i,s_i}^{\mathrm T}\boldsymbol{\alpha}_i\left(\boldsymbol{x}_{i,s_i}^{\mathrm T}\boldsymbol{\theta}_r\right) - 1\tag{2}$$
For the multiple image features
$$E_i(n)=\sum_{(x,y)\in S_i}\left|I(x,y)*F_n(x,y)\right|^k,\quad n=1,2,\cdots,17\tag{3}$$
where this paper sets
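Eq. (3) accumulates, over a superpixel, the k-th power of the absolute filter responses. The sketch below illustrates this with a small bank of Laws masks (the paper's actual 17-filter bank and its exact k values are not reproduced here; all names are ours):

```python
import numpy as np

# Nine 3x3 Laws masks as an illustrative filter bank; the paper uses
# 17 filters in eq. (3).
L3 = np.array([1.0, 2.0, 1.0])    # level
E3 = np.array([-1.0, 0.0, 1.0])   # edge
S3 = np.array([-1.0, 2.0, -1.0])  # spot
BANK = [np.outer(a, b) for a in (L3, E3, S3) for b in (L3, E3, S3)]

def conv2d_valid(img, f):
    """Plain 'valid' 2D correlation with numpy only (the Laws masks are
    symmetric or antisymmetric, so this matches convolution up to sign)."""
    fh, fw = f.shape
    out = np.zeros((img.shape[0] - fh + 1, img.shape[1] - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + fh, x:x + fw] * f)
    return out

def texture_energy(img, mask, k=2):
    """E(n) = sum over the superpixel of |I * F_n|^k, cf. eq. (3).
    `mask` selects the (valid-convolution) pixels of one superpixel."""
    return [np.sum(np.abs(conv2d_valid(img, f))[mask] ** k) for f in BANK]
```

Filters whose 1D components sum to zero (any mask containing E3 or S3) respond only to intensity variation, so a textureless superpixel yields zero energy in those channels.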
The photometric consistency measure is computed as follows:
1) Photometrically normalize the superpixels obtained by projecting into the different views. Compute the
2) Estimate the cost of the superpixel projection. Using a Parzen window with a linear kernel, compute, for the normalized pixels of each channel of the RGB color space,
$$\chi_k^2\left(\boldsymbol{h}_k,\boldsymbol{h}_{\mathrm{ref}}\right)=\frac{1}{2}\sum_{j=1}^{N_{\mathrm{bins}}}\frac{\left(\boldsymbol{h}_k(j)-\boldsymbol{h}_{\mathrm{ref}}(j)\right)^2}{\boldsymbol{h}_k(j)+\boldsymbol{h}_{\mathrm{ref}}(j)}\tag{4}$$
3) Find the solutions for which the projection of the entire superpixel lies inside the image and which satisfy
$$C(i)=\frac{1}{|K|}\sum_{k\in K}C_k(i)=\frac{1}{|K|}\sum_{k\in K}\left(\chi_k^2+\alpha\left\|\boldsymbol{c}_k-\boldsymbol{c}_{\mathrm{ref}}\right\|^2\right)\tag{5}$$
where
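Steps 1)–3) and eqs. (4)–(5) can be sketched in a few lines of numpy, assuming a triangular (linear-kernel) Parzen smoothing, equal treatment of the RGB channels, and placeholder values for the bin count and the weight α:

```python
import numpy as np

def parzen_hist(values, n_bins=16):
    """Histogram smoothed with a linear (triangular) Parzen kernel: each
    sample votes for its two nearest bin centres with weights proportional
    to proximity; the result is normalised to sum to 1."""
    v = np.clip(values, 0.0, 1.0) * (n_bins - 1)
    lo = np.floor(v).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    w_hi = v - lo
    h = np.zeros(n_bins)
    np.add.at(h, lo, 1.0 - w_hi)
    np.add.at(h, hi, w_hi)
    return h / max(h.sum(), 1e-12)

def chi2(h_k, h_ref):
    """Chi-squared histogram distance of eq. (4)."""
    denom = h_k + h_ref
    denom[denom == 0] = 1.0  # empty bins contribute zero
    return 0.5 * np.sum((h_k - h_ref) ** 2 / denom)

def projection_cost(hists_k, hists_ref, means_k, means_ref, alpha=1.0):
    """Average matching cost over the views K, cf. eq. (5): chi-squared
    distance plus a penalty on the difference of mean colours."""
    costs = [chi2(hk, hr) + alpha * np.sum((ck - cr) ** 2)
             for hk, hr, ck, cr in zip(hists_k, hists_ref, means_k, means_ref)]
    return float(np.mean(costs))
```

A superpixel whose projections look identical in all views gets cost 0; larger photometric disagreement increases the cost monotonically.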
To minimize the accumulated relative error of all image points within a superpixel, the relation between the image features and the superpixel plane parameters is modeled as
$$D_1\left(\boldsymbol{\alpha}^n\,|\,\boldsymbol{X}^n,\boldsymbol{Y}^n,\boldsymbol{Q}^n;\boldsymbol{\theta}^n\right)=\exp\left(-\sum_{i\in S_i} w_{i,s_i}\left|\boldsymbol{R}_{i,s_i}^{\mathrm T}\boldsymbol{\alpha}_i\left(\boldsymbol{x}_{i,s_i}^{\mathrm T}\boldsymbol{\theta}_r\right)-1\right|\right)\tag{6}$$
where
3.2 Pairwise terms
The second term of the energy function is obtained by analyzing, for two superpixels
$$D_2\left(\boldsymbol{\alpha}^n\,|\,\boldsymbol{X}^n,\boldsymbol{Y}^n,\boldsymbol{Q}^n\right)=\prod_{\{s_i,s_j\}\in\mathbf N} H_{s_i,s_j}\left(\boldsymbol{\alpha}^n\,|\,\boldsymbol{X}^n,\boldsymbol{Y}^n,\boldsymbol{Q}^n\right)=\prod_{\{s_i,s_j\}\in\mathbf N} H_{s_i}\left(\boldsymbol{\alpha}^n\,|\,\boldsymbol{X}^n,\boldsymbol{Y}^n,\boldsymbol{Q}^n\right)\cdot H_{s_j}\left(\boldsymbol{\alpha}^n\,|\,\boldsymbol{X}^n,\boldsymbol{Y}^n,\boldsymbol{Q}^n\right)\tag{7}$$
3.2.1 Collinearity constraint between superpixels
This paper constrains the collinearity between superpixels by selecting image points along long straight line segments, which also helps capture relations between regions that are not adjacent to each other, as shown in Fig. 4. Selecting two superpixels located at different positions on some straight line segment
$$H_{s_j}\left(\boldsymbol{\alpha}_i,\boldsymbol{\alpha}_j,y_{ij},\boldsymbol{R}_{j,s_j}\right)=\exp\left(-v_{ij}\left|\left(\boldsymbol{R}_{j,s_j}^{\mathrm T}\boldsymbol{\alpha}_i-\boldsymbol{R}_{j,s_j}^{\mathrm T}\boldsymbol{\alpha}_j\right)\hat d\right|\right)\tag{8}$$
where
3.2.2 Connectivity constraint between superpixels
At the superpixels
$$H_{s_i,s_j}\left(\boldsymbol{\alpha}_i,\boldsymbol{\alpha}_j,y_{ij},\boldsymbol{R}_i,\boldsymbol{R}_j\right)=\exp\left(-y_{ij}\left|\left(\boldsymbol{R}_{j,s_i}^{\mathrm T}\boldsymbol{\alpha}_i-\boldsymbol{R}_{i,s_j}^{\mathrm T}\boldsymbol{\alpha}_j\right)\hat d\right|\right)\tag{9}$$
where, when the two superpixels are not connected, the binary variable
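Eqs. (8)–(10) share one functional form: an exponential penalty on the fractional depth disagreement of two superpixel planes evaluated at a chosen point. A small sketch of that shared form (the helper name is ours; the ray r, estimated depth d̂, and switch variable y are as in the text):

```python
import numpy as np

def pairwise_potential(alpha_i, alpha_j, r, d_hat, y=1.0):
    """Shared form of the pairwise terms in eqs. (8)-(10): the potential
    decays exponentially with the fractional depth disagreement of two
    superpixel planes at a shared point, where r is the viewing ray
    through that point, d_hat its estimated depth, and y an
    occlusion/confidence variable (y = 0 switches the constraint off,
    giving a potential of exactly 1, i.e. no penalty)."""
    return float(np.exp(-y * abs((r @ alpha_i - r @ alpha_j) * d_hat)))
```

When the two plane parameters agree at the shared point the potential is 1 (no cost); the collinearity, connectivity, and coplanarity terms differ only in which points and rays they evaluate.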
3.2.3 Coplanarity constraint between superpixels
Analogously to the definition of connectivity between superpixels, this paper selects a third pair of points at the center of each superpixel
$$H_{s''_j}\left(\boldsymbol{\alpha}_i,\boldsymbol{\alpha}_j,y_{ij},\boldsymbol{R}_i,\boldsymbol{R}_{j,s''_j}\right)=\exp\left(-y_{ij}\left|\left(\boldsymbol{R}_{j,s''_j}^{\mathrm T}\boldsymbol{\alpha}_i-\boldsymbol{R}_{j,s''_j}^{\mathrm T}\boldsymbol{\alpha}_j\right)\hat d_{s''_j}\right|\right)\tag{10}$$
where
3.2.4 Correspondence constraint between images
A 3D point in the scene usually appears in several images taken from different viewpoints; if two pixels in two images
$$\boldsymbol{p}'_n-\boldsymbol{p}_n=\boldsymbol{Q}^{mn}\left[\boldsymbol{p}_m;1\right]-\boldsymbol{p}_n=\boldsymbol{Q}^{mn}\left[\boldsymbol{R}^m/\left(\left(\boldsymbol{R}^m\right)^{\mathrm T}\boldsymbol{\alpha}^m\right);1\right]-\boldsymbol{R}^n/\left(\left(\boldsymbol{R}^n\right)^{\mathrm T}\boldsymbol{\alpha}^n\right)\tag{11}$$
which yields the following energy term
$$D_3\left(\boldsymbol{\alpha}^n,\boldsymbol{\alpha}^m\,|\,\boldsymbol{Q}^n,\boldsymbol{Q}^m,\boldsymbol{Y}^{mn}\right)\propto\prod_{k=1}^{J^{mn}}\exp\left(-y_k^{mn}\left|\left(\boldsymbol{Q}^{mn}\left[\left(\boldsymbol{R}{}_{i(k)}^{n\,\mathrm T}\boldsymbol{\alpha}_{i(k)}^n\right)\boldsymbol{R}_{j(k)}^m;\ \left(\boldsymbol{R}{}_{i(k)}^{n\,\mathrm T}\boldsymbol{\alpha}_{i(k)}^n\right)\left(\boldsymbol{R}{}_{j(k)}^{m\,\mathrm T}\boldsymbol{\alpha}_{j(k)}^m\right)\right]-\left(\boldsymbol{R}{}_{i(k)}^{m\,\mathrm T}\boldsymbol{\alpha}_{j(k)}^m\right)\boldsymbol{R}_{i(k)}^n\right)\hat d\right|\right)\tag{12}$$
where it is assumed that in image
3.3 Depth term
In image
$$D_4\left(\boldsymbol{\alpha}\,|\,\boldsymbol{Q},\boldsymbol{d}_{\mathrm T},\boldsymbol{Y}_{\mathrm T}\right)\propto\prod_{i=1}^{K^n}\exp\left(-y_{\mathrm T_i}\left|d_{\mathrm T_i}\boldsymbol{R}_i^{\mathrm T}\boldsymbol{\alpha}_i-1\right|\right)\tag{13}$$
In the triangulation-based depth computation, the depths inferred from a single image are used to remove the scale ambiguity of the scene, and bundle adjustment is then used to optimize the resulting pixel associations. Specifically, this paper proceeds as follows: first, SURF features [32] are computed, and pixel associations are established using Euclidean distances. Next, bundle adjustment is used to compute the camera poses
Then the single-image features are used to compute the depths of the image points
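The text uses monocular depth estimates to remove the global scale ambiguity of the triangulated depths. One simple, robust way to do this (an illustrative assumption; the paper does not prescribe the exact estimator) is to rescale the triangulated depths by the median ratio against the monocular predictions:

```python
import numpy as np

def align_scale(d_triangulated, d_monocular):
    """Remove the global scale ambiguity of triangulated depths by
    aligning them to monocular depth predictions. The median ratio is
    one robust choice of estimator (our assumption, not prescribed by
    the paper); it is insensitive to a minority of bad matches."""
    d_tri = np.asarray(d_triangulated, float)
    ratio = np.median(np.asarray(d_monocular, float) / d_tri)
    return ratio * d_tri, float(ratio)
```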
3.4 Model inference
In order to infer the plane parameters of the superpixels
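As stated in the abstract, the MAP inference of the MRF can be solved by classic linear programming: each absolute residual |·| in the (negative log) energy is replaced by a slack variable bounded by two linear inequalities. The sketch below poses the unary term for a single superpixel this way with SciPy (function name ours; the full joint model over all superpixels and pairwise terms is assembled in the same fashion):

```python
import numpy as np
from scipy.optimize import linprog

def fit_plane_l1(rays, d_hat, weights):
    """Recover the plane parameter alpha of one superpixel by minimising
    the weighted L1 cost  sum_i w_i |d_hat_i * r_i^T alpha - 1|  as a
    linear program: introduce slacks t_i >= |d_hat_i r_i^T alpha - 1|.
    Single-superpixel sketch of the LP formulation, not the full MRF."""
    rays = np.asarray(rays, float)    # (m, 3) viewing rays
    d = np.asarray(d_hat, float)      # (m,) estimated depths
    w = np.asarray(weights, float)    # (m,) confidences
    m = len(d)
    # decision variables: [alpha (3 entries), t (m slacks)]
    c = np.concatenate([np.zeros(3), w])
    A = d[:, None] * rays             # row i is d_i * r_i^T
    #  d_i r_i^T alpha - 1 <= t_i   and   1 - d_i r_i^T alpha <= t_i
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([np.ones(m), -np.ones(m)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3 + [(0, None)] * m,
                  method="highs")
    return res.x[:3]
```

With consistent depths the optimum drives every slack to zero and recovers the plane exactly; noisy depths are handled gracefully because the L1 objective is robust to outliers.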
3.5 Region merging with high-level semantic priors
Through MRF inference, the initial depth of each superpixel and the relative poses between the different view images can be obtained. Although the initial depths are not very accurate, especially for distant regions, they nevertheless support a first, relatively reliable constraint C1 on the orientation relation between adjacent superpixels, namely: if two adjacent superpixels have the same
In addition, this paper incorporates the high-level semantic prior as a new constraint between superpixels, C2: adjacent superpixels belonging to the same semantic class should, with high probability, belong to the same plane. The following weight function is then defined
$$W(i,j)=\begin{cases} a_1\cdot\theta_{ij}+a_2\cdot ODJ_{ij} & i\text{ and }j\text{ adjacent}\\ \infty & i\text{ and }j\text{ not adjacent}\end{cases}\tag{14}$$
where
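Eq. (14) gives adjacent superpixels a finite merge weight built from their plane-orientation difference and a depth-consistency term, and non-adjacent pairs an infinite weight; constraint C2 favors merging superpixels of the same semantic class. A sketch of this decision rule (the coefficients `A1`, `A2` and the multiplicative form of the semantic factor are our assumptions for illustration):

```python
A1, A2 = 1.0, 1.0  # weighting coefficients a_1, a_2 of eq. (14) (values assumed)

def merge_weight(adjacent, angle_diff, depth_term, same_class, prior=0.5):
    """Sketch of the merge weight of eq. (14): adjacent superpixels get a
    finite weight from their plane-orientation difference and a
    depth-consistency term, scaled down when the semantic prior (C2)
    says they share a class; non-adjacent pairs get infinite weight.
    The exact form of the semantic factor is an assumption."""
    if not adjacent:
        return float("inf")
    w = A1 * angle_diff + A2 * depth_term
    return w * prior if same_class else w

def should_merge(adjacent, angle_diff, depth_term, same_class, thresh=0.8):
    """Merge two superpixels when their weight falls below a threshold."""
    return merge_weight(adjacent, angle_diff, depth_term, same_class) < thresh
```

Merging lowers the node count of the next MRF round, which is how the iterative framework progressively simplifies and regularizes the depth map.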
3.6 Multi-view depth fusion
In general, the raw depth maps contain a certain amount of error, so a 3D point can easily have different depth values in different view images. Erroneous and redundant depth information therefore needs to be discarded to obtain a more accurate depth estimate. This paper selects the middle view among all views as the reference view (if there are only two views, either one is chosen) and projects the depth maps of the remaining views into the reference depth map to analyze the positional relation between the different depth values and the 3D points. For depth fusion, a stability-based strategy [35] is adopted, in which the stability of each depth value is defined as the number of depth maps that occlude the 3D point in the reference view minus the number of depth maps whose free-space constraint it violates. Fig. 8 shows the three different types of visual relations between the 3D point of the reference view and the 3D points of the remaining views: 1) when view
During depth fusion, we evaluate the stability of each depth value and predict, for each pixel in the reference camera image, how far its corresponding 3D landmark is; the fused stable depth value must have non-negative stability and be the closest to the reference camera among such values. The resulting stable depth map is then post-processed with bilateral-filter-based depth smoothing and hole filling, leading to a more accurate scene reconstruction.
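The stability rule above, per reference pixel, can be sketched as follows (illustrative numpy; the relative tolerance `eps` for deciding occlusion and free-space violation is our own simplification of the projective test in [35]):

```python
import numpy as np

def fuse_depths(candidates, eps=0.01):
    """Stability-based fusion in the spirit of [35] for one reference
    pixel: for each candidate depth d,
      stability = (#candidates in front of d, which occlude it)
                - (#candidates behind d, whose free space d violates).
    Among candidates with non-negative stability, keep the closest one."""
    c = np.sort(np.asarray(candidates, float))
    for d in c:                                  # sorted: closest first
        occluders = np.sum(c < d * (1 - eps))    # surfaces in front of d
        violations = np.sum(c > d * (1 + eps))   # surfaces d punches through
        if occluders - violations >= 0:
            return d                             # first stable = closest stable
    return None
```

The first stable candidate sits roughly where occlusions start to balance free-space violations, so isolated outlier depths (too close or too far) are rejected automatically.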
4 Experimental results and analysis
To evaluate the proposed algorithm, we use not only several wide-baseline image sets captured on the Stanford University campus (Stanford Ⅰ, Ⅱ, Ⅲ, Ⅳ) but also the Merton College Ⅲ, University Library, and Wadham College multi-view datasets, which also satisfy the wide-baseline condition.
Since ground-truth 3D models of the scenes are hard to obtain, we qualitatively compare our algorithm with the classic multi-view 3D reconstruction algorithm [29], which does not incorporate high-level image semantics, on the following eight different wide-baseline image sets. SIFT matching refined with RANSAC (random sample consensus) is used to assess the degree of camera motion between the different views, and a seed-based region-growing method is used to remove sky regions irrelevant to the 3D scene model.
4.1 Experiments on the Stanford Ⅰ dataset
The first dataset, Stanford Ⅰ, consists of only two images. The dominant camera motion is a small rotation about the optical center, and SIFT matching yields 38 corresponding feature pairs, so strictly speaking the wide-baseline condition is not met. Figs. 9(a) and (b) show the scene models observed from different angles obtained by the method of [29] and by our method, respectively. The top-right image of Fig. 9(a) clearly shows that the two differently oriented facades of the distant building are inferred as two regions with different depths. In Fig. 9(b), our method, which fuses the high-level semantic prior, produces a continuous depth variation over the building and describes the actual scene more accurately.
4.2 Experiments on the Stanford Ⅱ dataset
The second dataset, Stanford Ⅱ, consists of three images. The camera again rotates about its optical center, but with a larger amplitude than in Stanford Ⅰ and with a notable translation, so SIFT matching yields only 8 and 0 corresponding feature pairs for the two image pairs, respectively. Figs. 10(a) and (b) show the scene models obtained by the algorithm of [29] and by ours, respectively. The comparison shows that our method estimates the relative poses between views more accurately and thus produces a 3D model that better matches the actual scene: seen from above, the buildings lie essentially on the same line, as the comparison of the top-right images of Figs. 10(a) and (b) shows. Moreover, our model removes many regions with cluttered depth, although it can over-smooth depth; for example, the interior of the circular doorway at the center of the building is smoothed to a depth similar to the walls on both sides, which does not match reality.
4.3 Experiments on the Stanford Ⅲ dataset
The third dataset, Stanford Ⅲ, consists of two wide-baseline images with simultaneous large rotation and translation, and SIFT matching again finds no corresponding feature pairs. Figs. 11(a) and (b) show the scene models obtained by the algorithm of [29] and by ours, respectively. Our method yields a more accurate camera pose estimate, thanks to the depth optimization brought by the semantic-prior-based region merging, and its 3D model more accurately describes the geometric relation between the differently oriented faces of the building.
4.4 Experiments on the Stanford Ⅳ dataset
The fourth dataset, Stanford Ⅳ, consists of four wide-baseline images; the camera again undergoes a series of large rotations and translations. SIFT matching finds only 13 corresponding feature pairs between the first two images and none in the other image pairs. Figs. 12(a) and (b) show the 3D scene models obtained by the algorithm of [29] and by ours, respectively. In Fig. 12(a), the depth estimates of the green belt and the ground in front of the near-left building with the arched gate are discontinuous, and the depths of the distant buildings and trees behind are largely wrong, as is clearly visible in the second image of the second row. In contrast, our method achieves a marked improvement: in Fig. 12(b), it removes these problems of [29], estimates the camera motion across the multiple views fairly accurately, and produces a global depth map that reflects the real scene well.
4.5 Experiments on other wide-baseline datasets
We further test our method on the Merton College Ⅲ, University Library, and Wadham College wide-baseline datasets. As Fig. 13 shows, for outdoor structured scenes under various wide-baseline conditions, our method still produces 3D models consistent with the real scenes.
5 Conclusion
This paper reconstructs 3D models of large outdoor scenes from only a small number of images. When there is large camera motion between views, most traditional 3D scene reconstruction methods based on the narrow-baseline assumption fail, which makes this work challenging. We have shown how, under wide-baseline conditions, multiple image features can be combined with triangulation-based geometric features to build an accurate 3D scene model. Our method uses an MRF model to jointly infer the 3D positions and orientations of the superpixels in different view images and exploits high-level semantic priors to guide superpixel merging. In addition, an iterative framework progressively optimizes the scene depth. Experiments show that, compared with traditional methods, our approach achieves more stable and accurate 3D reconstruction in various wide-baseline environments. In future work, we will build on current deep convolutional network models to recover even more accurate 3D scene models under wide-baseline conditions.
References
[1] Pritchett P, Zisserman A. Wide baseline stereo matching[C]//Proceedings of the 6th International Conference on Computer Vision. Bombay, India: IEEE, 1998. [DOI: 10.1109/ICCV.1998.710802]
[2] Tuytelaars T, van Gool L. Wide baseline stereo matching based on local, affinely invariant regions[C]//Proceedings of the 11th British Machine Vision Conference. Bristol, UK: University of Bristol, 2000: 412-425.
[3] Xiao J J, Shah M. Two-frame wide baseline matching[C]//Proceedings of the 9th IEEE International Conference on Computer Vision. Nice, France: IEEE, 2003: 603-609. [DOI: 10.1109/ICCV.2003.1238403]
[4] Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110. [DOI: 10.1023/B:VISI.0000029664.99615.94]
[5] Tola E, Lepetit V, Fua P. DAISY: an efficient dense descriptor applied to wide-baseline stereo[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(5): 815-830. [DOI: 10.1109/TPAMI.2009.77]
[6] Hassner T, Mayzels V, Zelnik-Manor L. On SIFTs and their scales[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012: 1522-1528. [DOI: 10.1109/CVPR.2012.6247842]
[7] Bay H, Ferrari V, van Gool L. Wide-baseline stereo matching with line segments[C]//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, CA, USA: IEEE, 2005: 329-336. [DOI: 10.1109/CVPR.2005.375]
[8] Micusik B, Wildenauer H, Kosecka J. Detection and matching of rectilinear structures[C]//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE, 2008: 1-7. [DOI: 10.1109/CVPR.2008.4587488]
[9] Matas J, Chum O, Urban M, et al. Robust wide-baseline stereo from maximally stable extremal regions[J]. Image and Vision Computing, 2004, 22(10): 761-767. [DOI: 10.1016/j.imavis.2004.02.006]
[10] Schaffalitzky F, Zisserman A. Viewpoint invariant texture matching and wide baseline stereo[C]//Proceedings of the 8th IEEE International Conference on Computer Vision. Vancouver, BC, Canada: IEEE, 2001: 636-643. [DOI: 10.1109/ICCV.2001.937686]
[11] Trulls E, Kokkinos I, Sanfeliu A, et al. Dense segmentation-aware descriptors[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 2890-2897. [DOI: 10.1109/CVPR.2013.372]
[12] Liu C, Yuen J, Torralba A. SIFT flow: dense correspondence across scenes and its applications[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(5): 978-994. [DOI: 10.1109/TPAMI.2010.147]
[13] Barnes C, Shechtman E, Finkelstein A, et al. PatchMatch: a randomized correspondence algorithm for structural image editing[J]. ACM Transactions on Graphics, 2009, 28(3). [DOI: 10.1145/1531326.1531330]
[14] Kim J, Liu C, Sha F, et al. Deformable spatial pyramid matching for fast dense correspondences[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 2307-2314. [DOI: 10.1109/CVPR.2013.299]
[15] Duchenne O, Bach F, Kweon I S, et al. A tensor-based algorithm for high-order graph matching[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(12): 2383-2395. [DOI: 10.1109/TPAMI.2011.110]
[16] Ranftl R, Vineet V, Chen Q F, et al. Dense monocular depth estimation in complex dynamic scenes[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016. [DOI: 10.1109/CVPR.2016.440]
[17] Roy A, Todorovic S. Monocular depth estimation using neural regression forest[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 5506-5514. [DOI: 10.1109/CVPR.2016.594]
[18] Dasgupta S, Fang K, Chen K, et al. DeLay: robust spatial layout estimation for cluttered indoor scenes[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 616-624. [DOI: 10.1109/CVPR.2016.73]
[19] Zou C H, Colburn A, Shan Q, et al. LayoutNet: reconstructing the 3D room layout from a single RGB image[C]//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 2051-2059.
[20] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. [DOI: 10.1109/TPAMI.2016.2577031]
[21] He K M, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 2980-2988. [DOI: 10.1109/ICCV.2017.322]
[22] Hadfield S, Bowden R. Exploiting high level scene cues in stereo reconstruction[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 783-791. [DOI: 10.1109/ICCV.2015.96]
[23] Tateno K, Tombari F, Laina I, et al. CNN-SLAM: real-time dense monocular SLAM with learned depth prediction[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6565-6574. [DOI: 10.1109/CVPR.2017.695]
[24] Savinov N, Ladický L, Häne C, et al. Discrete optimization of ray potentials for semantic 3D reconstruction[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 5511-5518. [DOI: 10.1109/CVPR.2015.7299190]
[25] Savinov N, Häne C, Ladický L, et al. Semantic 3D reconstruction with continuous regularization and ray potentials using a visibility consistency constraint[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 5460-5469. [DOI: 10.1109/CVPR.2016.589]
[26] Häne C, Zach C, Cohen A, et al. Joint 3D scene reconstruction and class segmentation[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 97-104. [DOI: 10.1109/CVPR.2013.20]
[27] Mustafa A, Hilton A. Semantically coherent co-segmentation and reconstruction of dynamic scenes[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 5583-5592. [DOI: 10.1109/CVPR.2017.592]
[28] Felzenszwalb P F, Huttenlocher D P. Efficient graph-based image segmentation[J]. International Journal of Computer Vision, 2004, 59(2): 167-181. [DOI: 10.1023/B:VISI.0000022288.19776.77]
[29] Saxena A, Sun M, Ng A Y. 3D reconstruction from sparse views using monocular vision[C]//Proceedings of the 11th IEEE International Conference on Computer Vision. Rio de Janeiro, Brazil: IEEE, 2007. [DOI: 10.1109/ICCV.2007.4409219]
[30] Michels J, Saxena A, Ng A Y. High speed obstacle avoidance using monocular vision and reinforcement learning[C]//Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany: ACM, 2005: 593-600. [DOI: 10.1145/1102351.1102426]
[31] Lourakis M, Argyros A. A generic sparse bundle adjustment C/C++ package based on the Levenberg-Marquardt algorithm[R]. Foundation for Research and Technology-Hellas, Tech. Rep., 2006.
[32] Bay H, Ess A, Tuytelaars T, et al. Speeded-up robust features (SURF)[J]. Computer Vision and Image Understanding, 2008, 110(3): 346-359. [DOI: 10.1016/j.cviu.2007.09.014]
[33] Saxena A, Sun M, Ng A Y. Make3D: learning 3D scene structure from a single still image[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(5): 824-840. [DOI: 10.1109/TPAMI.2008.132]
[34] Hoiem D, Efros A A, Hebert M. Geometric context from a single image[C]//Proceedings of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE, 2005: 654-661. [DOI: 10.1109/ICCV.2005.107]
[35] Pollefeys M, Nistér D, Frahm J M, et al. Detailed real-time urban 3D reconstruction from video[J]. International Journal of Computer Vision, 2008, 78(2-3): 143-167. [DOI: 10.1007/s11263-007-0086-4]