Published: 2022-02-16
DOI: 10.11834/jig.210551
2022 | Volume 27 | Number 2

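Depth Estimation and 3D Reconstruction

Reliable binocular disparity estimation based on multi-scale similarity recursive search

Yan Min, Wang Junzheng, Li Jing

School of Automation, Beijing Institute of Technology, Beijing 100081, China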

Received: 2021-07-05; Revised: 2021-11-05; Preprint: 2021-11-12

Supported by: National Natural Science Foundation of China (61103157); National Key R&D Program of China (2019YFC1511401)

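About the authors: Yan Min, born in 1990, female, Ph.D. candidate. Her research interests include computer vision and deep learning. E-mail: minyanbit@foxmail.com
Wang Junzheng, male, professor. His research interests include motion drive and control, and image detection and tracking. E-mail: wangjz@bit.edu.cn
Li Jing, corresponding author, female, associate professor. Her research interest is computer vision. E-mail: bitljing@bit.edu.cn
*Corresponding author: Li Jing bitljing@bit.edu.cn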

CLC number: TP183

Document code: A

Article ID: 1006-8961(2022)02-0447-14

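Abstract

Objective Binocular disparity estimation yields dense depth estimates and is therefore of significant research value. Disparity estimation and optical flow estimation are similar tasks, so each can borrow from the other and inspire new algorithms. Inspired by the efficient optical flow algorithm RAFT (recurrent all-pairs field transforms), this paper proposes unilateral and bilateral multi-scale similarity iterative lookup for high-precision binocular disparity estimation. To address the inconsistency of estimation accuracy and confidence across regions, a left-right disparity consistency check is proposed to extract reliable estimation regions. Method A feature network with a pyramid pooling module, skip connections and residual structures extracts representation vectors with strong representational power. The inner product of these vectors measures the similarity between pixels, and multi-scale similarity is obtained by average pooling. The 0th iteration integrates three kinds of information: the initial disparity, the large-field-of-view similarity obtained by a one-directional (leftward) lookup in the multi-scale similarity according to the initial disparity, and context. The other iterations integrate the updated disparity estimate, the large-field-of-view similarity obtained by a bidirectional lookup according to the estimated disparity, and context. The integrated information passes through the convolutional recurrent neural network of the 0th update, or through the convolutional recurrent neural network shared by the other updates, which iteratively outputs the disparity update; the final disparity estimate is obtained after several iterations. The right-image disparity is then estimated by swapping the order of the input left and right images and flipping them horizontally. The confidence of each estimate is judged by comparing the absolute difference between the disparities of matched points in the left and right images with a given threshold, which yields the reliable regions. Result On the Sceneflow dataset the method reaches an accuracy comparable to state-of-the-art methods, with an average error of only 0.84 pixels; its inference time is relatively favorable and can be traded off flexibly against accuracy by controlling the number of iterations. After reliable region extraction, the error on Sceneflow is further reduced to a historical best of 0.21 pixels, and on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) stereo test sets the metrics evaluated on estimated regions are the best. Conclusion The proposed method performs well for binocular disparity estimation, and the reliable region extraction method efficiently extracts high-precision regions and greatly improves the reliability of the estimated regions.

Keywords

binocular disparity estimation; occlusion; convolutional recurrent neural network; deep learning; supervised learning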

Reliable binocular disparity estimation based on multi-scale similarity recursive search

Yan Min, Wang Junzheng, Li Jing

School of Automation, Beijing Institute of Technology, Beijing 100081, China

Supported by: National Natural Science Foundation of China(61103157); National Key R&D Program of China(2019YFC1511401)

Abstract

Objective Depth information is key sensing information for autonomous platforms. As a common depth sensor, the binocular camera can compensate for the sparsity of light detection and ranging (LiDAR) data and for the unsuitability of depth cameras in outdoor scenes. To compete with LiDAR and depth cameras, it is important to improve the accuracy and speed of binocular disparity estimation algorithms. Disparity estimation algorithms based on deep learning have clear advantages over traditional ones. Disparity estimation and optical flow estimation are similar tasks, so methods for one can inform and inspire new algorithms for the other. Inspired by the efficient optical flow estimation algorithm recurrent all-pairs field transforms (RAFT), a unilateral and bilateral multi-scale similarity recursive search method is proposed to achieve high-precision binocular disparity estimation. To address the inconsistency of estimation accuracy and confidence across regions, a left-right disparity consistency check is proposed to extract reliable estimation regions. Method A pyramid pooling module (PPM), skip connections and residual structures are used in the feature network to extract representation vectors with strong representation capability. The inner product of representation vectors measures the similarity between pixels. The multi-scale similarity is obtained by average pooling. Three kinds of information are integrated: the initial or updated disparity; the large-field-of-view similarity looked up within a certain range of the multi-scale similarity according to that disparity (the 0th updating iteration searches in one direction, to the left, and the other updating iterations search in both directions); and context information.
The integrated information is passed to the convolutional recurrent neural network (ConvRNN) of the 0th updating process or to the ConvRNN shared by the other updating processes to obtain the disparity update, and the final disparity value is obtained after multiple updating iterations. The disparity of the right image is estimated by swapping the order of the input left and right images and flipping them horizontally, and the confidence of the disparity is determined by comparing the absolute disparity difference between matched points of the left and right images with a given threshold. The outputs of successive updating iterations are given increasing loss weights so that the error decreases gradually over the iterations, and the network is trained in a supervised manner. During training, the learning rate is reduced piecewise and the root mean square propagation (RMSProp) optimization algorithm is used. To improve inference efficiency, the feature map resolution is reduced to 1/8 of the input, so a learned up-sampling method is adopted to generate a disparity map at the resolution of the original image. The disparity of the 8×8 region that a pixel covers in the original-resolution image is calculated by weighting the disparities of the 3×3 neighborhood of that pixel in the reduced-resolution image. The weights are obtained by convolving the hidden state of the ConvRNN. To avoid the high cost of collecting real-scene disparity or depth data, the Sceneflow dataset generated by the 3D creation suite Blender is used to train and test the network, and the real-scene KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) data is used to verify the generalization capability of the proposed method. First, on the Flyingthings3D subset of the Sceneflow dataset, 21 818 pairs of training images of 540×960 pixels are randomly cropped to 256×512 pixels.
The cropped images are fed to the network for 440 000 training iterations with a batch size of 4. The trained network is tested on 4 248 pairs of test images. To verify the rationality of adding the unilateral search process, ablation experiments on the Sceneflow dataset compare the performance of networks with and without it. Next, the network trained on Sceneflow data is tested directly on KITTI training data to verify the generalization ability of the algorithm from simulated data to real-scene data. Then, the network trained on the Sceneflow dataset is fine-tuned on the KITTI2012 and KITTI2015 training sets separately (5 500 training iterations each) and cross-tested on the KITTI2015 and KITTI2012 training sets for qualitative analysis. Finally, the network trained on Sceneflow data is fine-tuned on the KITTI2012 and KITTI2015 training sets together (11 000 training iterations) and tested on the KITTI2012 and KITTI2015 test sets to further verify its performance. The code is implemented in the TensorFlow framework. Result Before the reliable region extraction step, the accuracy of the method is comparable to that of state-of-the-art methods on the Sceneflow dataset. The average error is only 0.84 pixels; the error decreases as the number of updating iterations increases, while the inference time becomes longer, so a flexible trade-off between speed and accuracy can be obtained by adjusting the number of updating iterations. After reliable region extraction, the error on the Sceneflow dataset is further reduced to a historical best of 0.21 pixels. On the KITTI benchmark, the method ranks first when only estimated regions are evaluated. The colorized disparity images and point cloud images show that almost all occluded regions and a large number of areas with large errors are removed by reliable region extraction.
Conclusion The proposed method performs well for binocular disparity estimation. The reliable region extraction method efficiently extracts high-precision estimation regions, which greatly improves the disparity reliability of the estimated regions.

Key words

binocular disparity estimation; occlusion; convolutional recurrent neural network (CRNN); deep learning; supervised learning

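0 Introduction

Depth information is core sensing information for tasks such as obstacle avoidance on autonomous platforms. LiDAR, binocular cameras and depth cameras are the commonly used depth sensors, each with its own strengths and weaknesses. For a binocular rig with known intrinsic and extrinsic parameters, depth estimation is equivalent to disparity estimation, i.e., binocular stereo matching. Because a binocular camera compensates for the sparsity of LiDAR data and for the unsuitability of depth cameras in outdoor scenes, binocular disparity estimation is well worth studying. To compete with LiDAR and depth cameras, improving the accuracy and inference speed of binocular disparity estimation is particularly important.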

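A classical pipeline has been developed for binocular stereo matching, consisting of four steps: matching cost computation, cost aggregation, optimization and refinement (Lee and Shin, 2019). Methods of this kind include BM (block matching) and SGM (semi-global matching) (Hirschmuller, 2008). They are often slow and cannot cope with repeated textures, low texture or large illumination differences. With the development of deep learning and improvements in hardware, more and more vision problems have seen breakthroughs, such as object detection (Zhao et al., 2020) and semantic segmentation (Qing et al., 2020). Binocular stereo matching has likewise advanced greatly with deep learning, as in LEAStereo (learning effective architecture stereo) (Cheng et al., 2020) and GANet-deep (guided aggregation network-deep) (Zhang et al., 2019), which outperform traditional algorithms in both speed and accuracy on the major datasets.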

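Typical deep learning methods fall into two groups. Methods such as RTSNet (real-time stereo matching network) (Lee and Shin, 2019) and GC-Net (geometry and context network) (Kendall et al., 2017) build a 3D cost volume, aggregate information with 3D convolutions and obtain sub-pixel disparity by weighting the classification outputs. Methods such as DispNetCorr (disparity network correlation) (Mayer et al., 2016) and iResNet (iterative residual prediction network) (Liang et al., 2017) compute correlations and regress the disparity with 2D convolutions. Whether they classify or regress, these methods use one complex network to integrate as much useful information as possible in a single inference pass, so their structures tend to be cumbersome and inflexible in application.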

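The performance of many deep-network methods depends heavily on the distribution of the training data, and their accuracy varies considerably between regions. To adapt to new scenes, it is important to design networks that generalize well; for practical use, it is equally critical to identify the more reliable parts of the estimate and discard regions the network cannot yet handle. For the former, one idea is to build operations that transfer across datasets into the network; DispNetCorr, for example, uses a correlation computation with a clear physical meaning to express the similarity between pixels. For the latter, binocular disparity estimation relies mainly on similarity matching between the left and right images, so accuracy differs markedly between regions that can be matched and regions that cannot. Removing unmatched regions, such as occluded regions, improves the reliability of the result, yet no published work has focused on this problem. One possible direction is to detect the occluded regions of the binocular pair and remove them as unreliable. Supervised methods rarely discuss the handling of occlusion. Unsupervised methods, which rely on appearance consistency between matched regions, must exclude occluded regions for the network to learn and converge well, so many of them treat occlusion explicitly: Gordon et al. (2019) compare the relative estimated distances of matched points to remove the influence of likely occluded regions, and Peng et al. (2020) check whether several points project to the same pixel of the right image and discard the farther ones to mitigate occlusion. Inspired by Gordon et al. (2019) and Peng et al. (2020), we observe that if the network infers that two pixels match, their disparities should be close; the confidence of a disparity can therefore be judged by comparing the disparities of matched points in the left and right disparity maps, and highly unreliable regions can be removed.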

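Network structures for deep learning differ widely between tasks, and even a single task admits countless structures, so finding one that performs well is not easy. Because disparity estimation and optical flow estimation are similar tasks, each can borrow from the other and inspire new algorithms. RAFT (recurrent all-pairs field transforms) (Teed and Deng, 2020) is an efficient optical flow algorithm. Its correlation operation provides a degree of generalization; its lookup of multi-scale similarity covering a large range, combined with context and other information to regress an update to the estimate, gives the network a degree of interpretability and makes it easier to incorporate human prior knowledge and experience in design and improvement. The method also iterates several times to improve accuracy, which allows a flexible trade-off between accuracy and speed.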

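This paper proposes unilateral and bilateral multi-scale similarity iterative lookup for high-precision binocular disparity estimation. Unlike RAFT, the feature network is strengthened, without loss of generality. Optical flow involves four directions (up, down, left and right), whereas disparity involves only one; because each iteration updates a residual, two directions are involved in practice, so the most direct change is to replace RAFT's four-direction similarity lookup with a bilateral lookup. Because disparity has a fixed sign (a left-image pixel can only match a right-image pixel to its left), adding a unilateral lookup on top of the bilateral lookup improves disparity estimation performance. To address the inconsistency of estimation accuracy and confidence across regions, a left-right disparity consistency check is proposed to extract reliable estimation regions.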

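1 Proposed algorithm

1.1 Network structure

The proposed method consists of a feature network, a context network, multi-scale similarity computation and lookup, and an update network, as shown in Fig. 1. The feature network F-Net, shown in Fig. 2, uses skip connections, residual structures (He et al., 2016) and a context-aggregating pyramid pooling module (PPM) (Zhao et al., 2017) to strengthen its modeling capacity. It outputs a feature map ${\boldsymbol{F}}$ with 32 channels at 1/8 of the input resolution, which is used to compute similarity. The context network consists of three stride-2 convolutions, eight residual modules and one stride-1 convolution. This design 1) keeps the output at the same scale as the features and 2) increases modeling capacity by stacking several residual modules. Part of the context output initializes the hidden state of the convolutional recurrent neural network, and part serves as auxiliary information, providing extra cues such as contours.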

Fig. 1 Overall computation block diagram

Fig. 2 The feature network F-Net

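1.2 Multi-scale similarity computation and lookup

The similarity computation and lookup are shown in Fig. 3. After the feature network produces the left and right feature maps ${\boldsymbol{F}}_{\text{L}}$ and ${\boldsymbol{F}}_{\text{R}}$, the same row is taken from each and the two rows are multiplied as matrices to give the similarity matrix ${\boldsymbol{C}}^{S=1}$ at scale $S=1$ (to normalize, the product is divided by the square root of the feature dimension). This gives the similarity of every left-image pixel to every right-image pixel on the same row. As in Fig. 3, the element $C_{p, q}$ in row $p$ and column $q$ of ${\boldsymbol{C}}^{S=1}$ is the inner product of the $p$-th element ${\boldsymbol{f}}_{ip}^{\text{L}}$ of a row of the left feature map and the $q$-th element ${\boldsymbol{f}}_{iq}^\text{R}$ of the same row of the right feature map, and it expresses the similarity between the $p$-th pixel of that row in the left image and the $q$-th pixel of the same row in the right image. Multi-scale similarity is obtained by average pooling this similarity along the width, as shown by the yellow and blue blocks in Fig. 3. The pooling kernel is 1×2 with stride 2, so the similarity matrix ${\boldsymbol{C}}^{S=1/2}$ at scale $S=1/2$ keeps its height while its width is halved; the subsequent scales are obtained in the same way.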
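The computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch for a single image row, not the authors' TensorFlow implementation; the function names `row_similarity`, `avg_pool_width` and `similarity_pyramid` are hypothetical, and four scales are assumed from the description in this section.

```python
import math


def row_similarity(feat_left_row, feat_right_row):
    """C[p][q] = <f_L[p], f_R[q]> / sqrt(D) for one image row.

    Each argument is a list of W feature vectors of dimension D.
    """
    dim = len(feat_left_row[0])
    scale = 1.0 / math.sqrt(dim)
    return [
        [scale * sum(a * b for a, b in zip(f_l, f_r)) for f_r in feat_right_row]
        for f_l in feat_left_row
    ]


def avg_pool_width(sim):
    """1x2 average pooling with stride 2 along the right-image (column) axis."""
    return [
        [0.5 * (row[2 * k] + row[2 * k + 1]) for k in range(len(row) // 2)]
        for row in sim
    ]


def similarity_pyramid(feat_left_row, feat_right_row, num_scales=4):
    """Similarity matrices at scales 1, 1/2, 1/4, ... (height fixed, width halved)."""
    pyramid = [row_similarity(feat_left_row, feat_right_row)]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool_width(pyramid[-1]))
    return pyramid
```

For a row of width 8, `similarity_pyramid` returns matrices of widths 8, 4, 2 and 1, all with 8 rows, which mirrors the statement that only the width shrinks between scales.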

Fig. 3 Multi-scale similarity calculation and search

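The similarity lookup is divided into the 0th lookup and the other lookups. In the other lookups, once the multi-scale similarity has been computed, each left-image pixel uses the disparity from the previous update to find the corresponding right-image position (the left-image coordinate minus the disparity) and reads, at every scale, the region spanning 4 pixels to the left and 4 pixels to the right of that position.

As in Fig. 3, at the $t$-th updating iteration and scale $S=1$, for the left-image pixel in row $i$ and column $j$, the $j$-th element $ d ^{t-1}_{ij}$ of the disparity ${\boldsymbol{d}}^{t-1}_{i}$ from iteration $t-1$ gives the matched right-image position as row $i$, column $j- d ^{t-1}_{ij}$. The similarity of the center pixel is therefore $Q_{i j}^{S=1}=C_{j, j-d_{i j}^{t-1}}^{S=1}$. At scale $S=1/2$, the similarity of the center pixel is obtained by rescaling the column coordinate accordingly, $Q_{i j}^{S=1 / 2}=C_{j, \left(j-d_{i j}^{t-1}\right) / 2}^{S=1 / 2}$, and so on for the subsequent scales.

At every scale, the selected positions are the right-image position computed from the disparity plus 4 positions on each side, 9 positions in total, shown as green blocks in Fig. 3. Because the disparity is not necessarily an integer, the similarity values at the 9 positions are obtained by linear interpolation. The 9 values of each scale are concatenated into a 9×1 vector, and the vectors of the 4 scales are further concatenated into a 9×1×4 vector, which is ${\boldsymbol{R}}_{t-1}$ in Fig. 1. A single lookup for one pixel can thus cover 2×2×2×4×8 = 256 pixels of the original image on each side (the three factors of 2 correspond to the three reduced scales, 4 is the size of the lookup neighborhood, and 8 reflects that the feature map is 1/8 of the original image), so a single update can already produce a fairly good result. The 0th lookup differs from the others in that it reads only the part to the left of the corresponding right-image position, because the match of a left-image pixel can only lie to the left in the right image; the other lookups correct the result of the 0th update, and the correction may be positive or negative. Compared with a structure that has only bidirectional lookups, the added unilateral lookup guarantees that the direction of the initial lookup is correct, which follows from the fixed sign of the disparity. This improves the disparity accuracy of the first iteration and, in turn, of the subsequent iterations.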
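The lookup can be illustrated with the following plain-Python sketch. It is a hedged illustration rather than the paper's implementation: `sample_linear` and `lookup` are hypothetical names, samples outside the row are read as 0 (an assumption, the paper does not state the border handling), and the unilateral mode is assumed to read the center plus `radius` positions to its left, since the exact number of samples in the 0th lookup is not given.

```python
import math


def sample_linear(row, x):
    """Linearly interpolate `row` at fractional column x (0 outside the row)."""
    x0 = math.floor(x)
    frac = x - x0

    def at(k):
        return row[k] if 0 <= k < len(row) else 0.0

    return (1.0 - frac) * at(x0) + frac * at(x0 + 1)


def lookup(pyramid, j, disparity, radius=4, unilateral=False):
    """Gather similarity samples for left-image column j at every scale.

    pyramid[level][j] is the similarity of left pixel j to all (pooled)
    right-image columns at scale 1 / 2**level.
    """
    if unilateral:
        offsets = range(-radius, 1)           # assumption: centre + left side only
    else:
        offsets = range(-radius, radius + 1)  # 2 * radius + 1 positions
    samples = []
    for level, sim in enumerate(pyramid):
        centre = (j - disparity) / (2 ** level)  # rescaled column coordinate
        samples.extend(sample_linear(sim[j], centre + o) for o in offsets)
    return samples


# Reach of one lookup on each side, in original-image pixels:
# three halvings x radius 4 x feature stride 8.
COVERAGE_PER_SIDE = 2 * 2 * 2 * 4 * 8  # 256
```

With four scales and `radius=4`, the bilateral call returns 36 values per pixel, matching the 9×1×4 vector described above.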

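1.3 Update network

The update network is divided into the network for the 0th update and the network for the other iterations. The two have similar structures, and their core is an iterated convolutional recurrent neural network, the recurrent module in Fig. 4. Teed and Deng (2020) showed by comparison that this structure performs better than three convolutional layers of equal computational cost for updating the estimate, so it is reused here for efficient disparity estimation. To output the disparity update at each iteration, the information supplied consists of the current disparity estimate, the similarity looked up around the current disparity, and supplementary context. The combined input ${\boldsymbol{x}}_{t}$ is fed to the recurrent module, which updates the hidden state ${\boldsymbol{h}}_{t}$; two further convolutions on ${\boldsymbol{h}}_{t}$ output the disparity update $Δ{\boldsymbol{d}}_{t}$. To reduce computation while keeping the receptive field, the core module does not use a single 5×5 convolutional recurrent neural network but cascades two convolutional recurrent neural networks with 1×5 and 5×1 kernels. The iteration (Cho et al., 2014) is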

$ \boldsymbol{z}_{t}=\boldsymbol{\sigma}\left(\operatorname{Conv}_{1 \times 5}\left(\left[\boldsymbol{h}_{t-1} \mid \boldsymbol{x}_{t}\right], \boldsymbol{W}_{z}\right)\right) $

(1)

$ \boldsymbol{r}_{t}=\boldsymbol{\sigma}\left(\operatorname{Conv}_{1 \times 5}\left(\left[\boldsymbol{h}_{t-1} \mid \boldsymbol{x}_{t}\right], \boldsymbol{W}_{r}\right)\right) $

(2)

$ \tilde{\boldsymbol{h}}_{t}^{\prime}=\tanh \left(\operatorname{Conv}_{1 \times 5}\left(\left[\boldsymbol{r}_{t} \odot \boldsymbol{h}_{t-1} \mid \boldsymbol{x}_{t}\right], \boldsymbol{W}_{h^{\prime}}\right)\right) $

(3)

$ \boldsymbol{h}_{t}^{\prime}=\left(1-\boldsymbol{z}_{t}\right) \odot \boldsymbol{h}_{t-1}+\boldsymbol{z}_{t} \odot \widetilde{\boldsymbol{h}}_{t}^{\prime} $

(4)

$ \boldsymbol{z}_{t}^{\prime}=\sigma\left(\operatorname{Conv}_{5 \times 1}\left(\left[\boldsymbol{h}_{t}^{\prime} \mid \boldsymbol{x}_{t}\right], \boldsymbol{W}_{z^{\prime}}\right)\right) $

(5)

$ \boldsymbol{r}_{t}^{\prime}=\sigma\left(\operatorname{Conv}_{5 \times 1}\left(\left[\boldsymbol{h}_{t}^{\prime} \mid \boldsymbol{x}_{t}\right], \boldsymbol{W}_{r^{\prime}}\right)\right) $

(6)

$ \tilde{\boldsymbol{h}}_{t}=\tanh \left(\operatorname{Conv}_{5 \times 1}\left(\left[\boldsymbol{r}_{t}^{\prime} \odot \boldsymbol{h}_{t}^{\prime} \mid \boldsymbol{x}_{t}\right], \boldsymbol{W}_{h}\right)\right) $

(7)

$ \boldsymbol{h}_{t}=\left(1-\boldsymbol{z}_{t}^{\prime}\right) \odot \boldsymbol{h}_{t}^{\prime}+\boldsymbol{z}_{t}^{\prime} \odot \widetilde{\boldsymbol{h}}_{t} $

(8)

Fig. 4 The updating network

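Eqs. (1)-(4) correspond to the first convolutional recurrent neural network and Eqs. (5)-(8) to the second. "|" denotes vector concatenation, $⊙$ denotes element-wise multiplication, $σ$ denotes the sigmoid activation function, Conv denotes convolution with the kernel size given by its subscript, ${\boldsymbol{W}}$ denotes the parameters of the corresponding kernel, and the subscript $t$ denotes the $t$-th updating iteration. ${\boldsymbol{z}}$ is the update gate, which sets the proportion of the hidden state that is updated: the larger ${\boldsymbol{z}}$, the more the hidden state changes. ${\boldsymbol{r}}$ is the reset gate, which sets how much of the past hidden state is used: the smaller ${\boldsymbol{r}}$, the less history is retained. The number of output channels of every convolution in these equations equals that of the hidden state. The context network outputs the context for the 0th update and for the other updates, together with the initial hidden state ${\boldsymbol{h}}_{0}$ of the recurrent module of the 0th update. The other iterations share one network, whose hidden state is initialized with the hidden state output by the 0th update network. The disparity is initialized as ${\boldsymbol{d}}_{0}=0$, so the disparity estimated at each iteration is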

$ \left\{\begin{array}{l} \boldsymbol{d}_{t}=\boldsymbol{d}_{t-1}+\Delta \boldsymbol{d}_{t}, \quad t>0 \\ \boldsymbol{d}_{0}=0 \end{array}\right. $

(9)
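To make the gating in Eqs. (1)-(4) concrete, the sketch below implements one step of the first (1×5) recurrent unit for a single row with a single hidden channel and a single input channel. It is a deliberately reduced illustration under stated assumptions: the real network uses multi-channel 2D feature maps and a second 5×1 unit (Eqs. (5)-(8)) of the same form, zero padding at the borders is assumed, and the names `conv1d_same` and `gru_step` are hypothetical.

```python
import math


def conv1d_same(channels, kernels):
    """Sum of zero-padded 'same' 1D correlations, one kernel per input channel."""
    width = len(channels[0])
    half = len(kernels[0]) // 2
    out = [0.0] * width
    for signal, kernel in zip(channels, kernels):
        for i in range(width):
            for k, w in enumerate(kernel):
                j = i + k - half
                if 0 <= j < width:
                    out[i] += w * signal[j]
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(h_prev, x, w_z, w_r, w_h):
    """One horizontal GRU step, Eqs. (1)-(4).

    Each weight argument is a pair (kernel for the hidden channel,
    kernel for the input channel), both of width 5.
    """
    z = [sigmoid(v) for v in conv1d_same([h_prev, x], w_z)]          # Eq. (1)
    r = [sigmoid(v) for v in conv1d_same([h_prev, x], w_r)]          # Eq. (2)
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(v) for v in conv1d_same([gated, x], w_h)]   # Eq. (3)
    return [(1.0 - zi) * hi + zi * ti                                # Eq. (4)
            for zi, hi, ti in zip(z, h_prev, h_tilde)]
```

With all weights set to zero, both gates equal 0.5 and the candidate state is 0, so the step simply halves the previous hidden state; this is a convenient sanity check of the gate arithmetic.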

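To obtain a disparity map at the original resolution that is more accurate than bilinear up-sampling, the learned up-sampling of RAFT is used. The hidden state ${\boldsymbol{h}}_{t}$ output by the recurrent module in Fig. 4 is passed through convolutions with 3×3 and 1×3 kernels to produce the up-sampling weights. For every pixel of the estimated 1/8-resolution disparity map, the disparities in its 3×3 neighborhood are weighted 64 times to give the disparities of the corresponding 8×8 region at the original resolution, which realizes 8× up-sampling.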
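A minimal sketch of this weighted up-sampling is given below. Several details are assumptions borrowed from the RAFT reference design rather than stated in this paper: the 9 raw weights of each fine pixel are normalized with a softmax, neighbors outside the map are read as 0, and the disparity values are multiplied by the up-sampling factor so that they are expressed in full-resolution pixels. The name `learned_upsample` and the layout of `logits` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def learned_upsample(disp, logits, factor=8):
    """Upsample a coarse H x W disparity map by `factor`.

    logits[i][j][a][b] holds the 9 raw weights used for fine pixel (a, b)
    inside coarse cell (i, j); in the paper they come from convolutions
    on the hidden state.
    """
    height, width = len(disp), len(disp[0])
    out = [[0.0] * (width * factor) for _ in range(height * factor)]
    for i in range(height):
        for j in range(width):
            # 3x3 coarse neighbourhood, zero outside the map (assumption).
            neigh = [
                disp[i + di][j + dj]
                if 0 <= i + di < height and 0 <= j + dj < width else 0.0
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
            ]
            for a in range(factor):
                for b in range(factor):
                    weights = softmax(logits[i][j][a][b])
                    value = sum(w * n for w, n in zip(weights, neigh))
                    # Rescale to full-resolution pixel units (assumption).
                    out[i * factor + a][j * factor + b] = factor * value
    return out
```

With `factor=8`, each coarse pixel produces 8×8 = 64 weighted combinations, which corresponds to the 64 weightings mentioned above.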

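1.4 Supervised training

The network is trained in a supervised manner. So that the error of the disparity output decreases as the iterations proceed, the losses of later iterations are given larger weights, implemented as exponential weights with a base smaller than 1. The loss function is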

$ L=\sum\limits_{t=1}^{N} \gamma^{N-t}\left\|\boldsymbol{d}_{\mathrm{gt}}-\boldsymbol{d}_{t}\right\|_{1} $

(10)

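where ${\boldsymbol{d}}_{\text{gt}}$ is the ground-truth disparity, $|| \ || _{1}$ is the 1-norm, the weight base is $γ=0.8$, and $N=5$ denotes 5 updates (including the 0th update). The 0th update obtains the initial disparity through the unilateral lookup, and the following 4 updates use the bilateral lookup and share one update network. Because all later iterations use the bilateral lookup, sharing the structure strengthens its learning during training and reduces the number of model parameters; in addition, the 4 later updates allow progressive refinement over the 4 scales. At inference time the second, shared network can be repeated as many times as needed and does not have to match the number used in training.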
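Eq. (10) can be written out directly. The sketch below treats each disparity map as a flat list of values and takes the 1-norm literally as a sum of absolute differences; whether the original implementation averages over pixels is not stated, so that choice is an assumption, and `sequence_loss` is a hypothetical name.

```python
def sequence_loss(predictions, target, gamma=0.8):
    """Weighted sum of L1 errors over iterations, Eq. (10).

    predictions: list of N disparity maps (flat lists), one per update,
                 ordered from the first update to the last.
    target:      ground-truth disparity (flat list).
    """
    n = len(predictions)
    total = 0.0
    for t, pred in enumerate(predictions, start=1):
        l1 = sum(abs(g - p) for g, p in zip(target, pred))
        total += gamma ** (n - t) * l1  # later iterations get larger weights
    return total
```

For $N=5$ and $γ=0.8$ the weights are 0.4096, 0.512, 0.64, 0.8 and 1, so the final output contributes most to the loss.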

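1.5 Deriving reliable disparity regions by combining left and right estimates

The geometry of binocular imaging is shown in Fig. 5. Consider a plane behind the object, drawn in 2D as a line. The region visible to both cameras is the green line, i.e., the part of the left image that is not occluded in the right image. The red line is the part of the left image that lies outside the field of view of the right image, the solid yellow line is the part of the left image that is occluded in the right image, and the dashed yellow line is the part that is invisible in the left image. Here the regions of the left image that fall outside the right view or are occluded in the right image are together called occluded regions, and the commonly visible region is called the non-occluded region. Because the proposed method finds the match of a left-image pixel by similarity lookup, in theory it can estimate disparity only in non-occluded regions, while occluded regions must be filled in from context and other information; the accuracy and confidence of the two should therefore differ markedly. For practical applications, being able to extract high-confidence regions is of great value, so a robust method for extracting reliable regions is proposed. The method needs the disparities of both the left and the right image. The original network, however, retrieves similarity in the right image for each left-image pixel; because the actual match can only lie to the relative left in the right image, a unilateral lookup was added, so the right-image disparity cannot be estimated simply by swapping the order of the two images. Combined with training on a fixed left-right order, the network can estimate only the left-image disparity. To estimate the right-image disparity, the left and right images are swapped and each is flipped horizontally.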

Fig. 5 Binocular imaging

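Suppose a point $P_{\text{L}}$ in the left image has abscissa $x_{\text{L}}$ and its match $P_{\text{R}}$ in the right image has abscissa $x_{\text{R}}$, so that the disparity is $d_{\text{L}}=x_{\text{L}}-x_{\text{R}}$. After the two images are swapped and flipped horizontally, the abscissa of the original right-image match becomes $x_{\text{R}→\text{L}}=w-x_{\text{R}}$, where $w$ is the image width, and the abscissa of the original left-image point becomes $x_{\text{L}→\text{R}}=w-x_{\text{L}}$. The new disparity is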

$ \begin{gathered} d_{\mathrm{R}}=x_{\mathrm{R} \rightarrow \mathrm{L}}-x_{\mathrm{L} \rightarrow \mathrm{R}}=\left(w-x_{\mathrm{R}}\right)- \\ \left(w-x_{\mathrm{L}}\right)=x_{\mathrm{L}}-x_{\mathrm{R}}=d_{\mathrm{L}} \end{gathered} $

(11)

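That is, swapping the two images and flipping them horizontally yields a disparity with the same sign and magnitude, so a network with a unilateral lookup that was trained on a fixed left-right order can also estimate the disparity of the right image.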
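The swap-and-flip trick can be checked on a toy example. In the sketch below a brute-force one-row matcher stands in for the network; it is not the paper's model, only a hypothetical estimator that, like the network, searches exclusively to the left. `estimate_left_disparity` and `estimate_right_disparity` are illustrative names, and pixel indices are flipped with Python list reversal.

```python
def estimate_left_disparity(left, right, max_disp):
    """Toy stand-in for the network on one image row.

    For each left pixel x, search only to the LEFT in the right row
    (candidates x, x-1, ..., x-max_disp) and keep the best intensity match.
    """
    disp = []
    for x, value in enumerate(left):
        candidates = range(0, min(max_disp, x) + 1)
        disp.append(min(candidates, key=lambda d: abs(value - right[x - d])))
    return disp


def estimate_right_disparity(left, right, max_disp):
    """Right-image disparity from the same left-only estimator.

    The two rows are swapped and flipped horizontally, the estimator is
    run, and the result is flipped back to right-image coordinates.
    """
    flipped = estimate_left_disparity(right[::-1], left[::-1], max_disp)
    return flipped[::-1]
```

For example, with `right = [1, 2, 3, 4, 5, 6, 7, 8]` and `left = [101, 102, 1, 2, 3, 4, 5, 6]` (the scene shifted by 2 pixels), the left disparity is 2 wherever a match exists, and the right disparity obtained through the flip is also 2 for the right-image pixels that are visible in the left image.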

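Given the left and right disparity maps, a point $P_{\text{L}}$ in the left image is mapped with its estimated disparity $ \hat d _{\text{L}}$ to the right-image point $ \hat P _{\text{R}}$ with abscissa $ \hat x _{\text{R}}=x_{\text{L}}- \hat d _{\text{L}}$. The two neighboring points $ \hat P _{x_{1}}$ and $ \hat P _{x_{2}}$ of the mapped point in the right disparity map, with abscissas $x_{1}=\lfloor \hat x _{\text{R}} \rfloor$ and $x_{2}=\lfloor \hat x _{\text{R}} \rfloor+1$, have estimated disparities $ \hat d _{x_{1}}$ and $ \hat d _{x_{2}}$, where $\lfloor \ \rfloor$ denotes rounding down. The estimated disparity of $ \hat P _{\text{R}}$ is then obtained by the linear interpolation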

$ \hat{d}_{\mathrm{R}}=\hat{d}_{x_{1}} \times\left(x_{2}-\hat{x}_{\mathrm{R}}\right)+\hat{d}_{x_{2}} \times\left(\hat{x}_{\mathrm{R}}-x_{1}\right) $

(12)

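Reliable regions are obtained by comparing the absolute difference between the left and right disparity estimates, $Δ \hat d = |\hat d _{\text{L}}- \hat d _{\text{R}}|$, with a threshold $h$ (set to 0.5 pixels in the experiments below). If $Δ \hat d ≤h$, the estimate is considered reliable; otherwise the estimation error may be large, or the pixel is occluded and the estimate cannot be trusted.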
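The consistency check of Eq. (12) and the threshold test can be sketched for a single row as follows. The function name `reliable_mask` is hypothetical, and marking pixels whose mapped position falls outside the right image as unreliable is an assumption, since the paper does not describe this case explicitly.

```python
import math


def reliable_mask(disp_left, disp_right, threshold=0.5):
    """Left-right consistency check for one row of disparities.

    Returns a list of booleans, True where |d_L - d_R| <= threshold.
    """
    width = len(disp_left)
    mask = []
    for x_left, d_left in enumerate(disp_left):
        x_hat = x_left - d_left            # mapped position in the right image
        x1 = math.floor(x_hat)
        if x1 < 0 or x1 > width - 1:
            mask.append(False)             # assumption: out of view -> unreliable
            continue
        x2 = min(x1 + 1, width - 1)
        frac = x_hat - x1                  # equals (x_hat - x1) in Eq. (12)
        # Eq. (12): linear interpolation of the right disparity at x_hat.
        d_right = disp_right[x1] * (1.0 - frac) + disp_right[x2] * frac
        mask.append(abs(d_left - d_right) <= threshold)
    return mask
```

Applied to a left disparity row `[0, 0, 2, 2, 2, 2, 2, 2]` and a right disparity row `[2, 2, 2, 2, 2, 2, 1, 0]`, the mask rejects the first two left pixels, which have no counterpart in the right image, and keeps the remaining six.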

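2 Experimental analysis

2.1 Experimental data

To avoid the high cost of collecting real-scene disparity or depth data, the synthetic dataset Sceneflow (Mayer et al., 2016), generated by a simulator, is used to train and test the network, and the real-scene KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) data (Geiger et al., 2012) is used to verify the generalization ability of the method.

2.2 Evaluation metrics

Four error-related metrics are used. 1) The end point error (EPE) is the mean disparity error over all estimated pixels in the evaluated region, i.e.,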

$ \begin{gathered} f_{\mathrm{EPE}}=\left(1 / \sum\limits_{m=1}^{M} \sum\limits_{j=1}^{H} \sum\limits_{i=1}^{W} \chi_{I_{m i j} \in \boldsymbol{G}}\right) \cdot \\ \sum\limits_{m=1}^{M} \sum\limits_{j=1}^{H} \sum\limits_{i=1}^{W}\left\|d_{m i j}-d_{m i j}^{*}\right\|_{1} \cdot \chi_{I_{m i j} \in \boldsymbol{G}} \end{gathered} $

(13)

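where $M$ is the number of evaluated images, $H$ is the image height, $W$ is the image width, $I_{mij}$ is the pixel in row $j$ and column $i$ of the $m$-th image, $d_{mij}$ is the estimated disparity of pixel $I_{mij}$, $ d ^{*}_{mij}$ is the ground-truth disparity of pixel $I_{mij}$, $χ$ is the indicator function, equal to 1 when its condition holds and 0 otherwise, and ${\boldsymbol{G}}$ is the set of evaluated pixels (all test image regions when no region is specified).

2) The percentage of pixels over threshold (PPT) is the percentage of evaluated pixels whose error exceeds 3 pixels and whose error relative to the disparity exceeds 5%, i.e.,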

$ \begin{gathered} f_{\mathrm{PPT}}=\left(1 / \sum\limits_{m=1}^{M} \sum\limits_{j=1}^{H} \sum\limits_{i=1}^{W} \chi_{I_{m i j} \in \boldsymbol{G}}\right) \cdot \\ \sum\limits_{m=1}^{M} \sum\limits_{j=1}^{H} \sum\limits_{i=1}^{W} \chi_{\left\|d_{m i j}-d_{m i j}^{*}\right\|_{1}>3} \cdot \chi_{\left\|d_{m i j}-d_{m i j}^{*}\right\|_{1} /\left|d_{m i j}^{*}\right|>5 \%} \cdot \chi_{I_{m i j} \in \boldsymbol{G}} \end{gathered} $

(14)

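3) The reliable estimated region percentage, abbreviated EST (estimated), is the area of the extracted reliable regions as a percentage of the whole image area, i.e.,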

$ f_{\mathrm{EST}}=(1 / M \times H \times W) \sum\limits_{m=1}^{M} \sum\limits_{j=1}^{H} \sum\limits_{i=1}^{W} \chi_{\Delta \hat{d}_{m i j} \leqslant h} $

(15)

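where $Δ \hat d _{mij}$ is the absolute difference between the left and right disparity estimates of a matched pair, introduced in Section 1.5, and $h$ is the given disparity inconsistency threshold.

4) The percentage of pixels over threshold used for KITTI2012 (PPT2) is the percentage of evaluated pixels whose error exceeds a given number of pixels, i.e.,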

$ \begin{gathered} f_{\mathrm{PPT} 2}=\left(1 / \sum\limits_{m=1}^{M} \sum\limits_{j=1}^{H} \sum\limits_{i=1}^{W} \chi_{I_{m i j} \in \boldsymbol{G}}\right) \cdot \\ \sum\limits_{m=1}^{M} \sum\limits_{j=1}^{H} \sum\limits_{i=1}^{W} \chi_{\left\|d_{m i j}-d_{m i j}^{*}\right\|_{1}>e_{h}} \cdot \chi_{I_{m i j} \in \boldsymbol{G}} \end{gathered} $

(16)

where $e_{h}$ is the given disparity error threshold.
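The four metrics of Eqs. (13)-(16) can be sketched as follows, with disparities flattened into lists and the evaluated set ${\boldsymbol{G}}$ given as a boolean mask. The function names are hypothetical, the percentages are returned on a 0-100 scale, and treating a zero ground-truth disparity as an unbounded relative error is an assumption.

```python
def epe(pred, gt, valid):
    """End point error, Eq. (13): mean absolute error over the evaluated set."""
    errors = [abs(p - g) for p, g, v in zip(pred, gt, valid) if v]
    return sum(errors) / len(errors)


def ppt(pred, gt, valid, abs_thresh=3.0, rel_thresh=0.05):
    """Eq. (14): % of pixels with error > 3 px AND relative error > 5%."""
    count = total = 0
    for p, g, v in zip(pred, gt, valid):
        if not v:
            continue
        total += 1
        err = abs(p - g)
        rel = err / abs(g) if g != 0 else float("inf")  # assumption for g == 0
        if err > abs_thresh and rel > rel_thresh:
            count += 1
    return 100.0 * count / total


def ppt2(pred, gt, valid, err_thresh):
    """Eq. (16): % of evaluated pixels whose error exceeds err_thresh."""
    errors = [abs(p - g) for p, g, v in zip(pred, gt, valid) if v]
    return 100.0 * sum(e > err_thresh for e in errors) / len(errors)


def est(consistency_gap, h=0.5):
    """Eq. (15): % of all pixels whose left-right disparity gap is <= h."""
    return 100.0 * sum(gap <= h for gap in consistency_gap) / len(consistency_gap)
```

For instance, with predictions `[1, 5, 10, 104]` and ground truth `[1, 1, 10, 100]`, both wrong pixels have an error of 4 pixels, but only the second one counts toward PPT because the relative error of the last pixel is 4%.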

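2.3 Analysis of results on Sceneflow

2.3.1 Ablation study

Disparity estimation can be regarded as a special case of optical flow estimation, so the ablation studies of RAFT (Teed and Deng, 2020) also apply here. The main difference between the proposed method and RAFT is the unilateral lookup added for the 0th iteration to exploit the properties of disparity, so the ablation here targets this structure and compares network performance with and without the unilateral lookup.

The 21 818 training image pairs of 540×960 pixels from the Flyingthings3D subset of Sceneflow are randomly cropped to 256×512 pixels and fed to the network for 440 000 training iterations with a batch size of 4. The learning rate is reduced piecewise during training, and the RMSProp optimization algorithm (Tieleman and Hinton, 2012) is used. The code is implemented in the TensorFlow framework (Abadi et al., 2016). Table 1 lists the EPE and PPT of each iteration and the average inference time for different numbers of iterations on an NVIDIA GTX 1080Ti GPU (averaged over 100 inferences on images of 576×960 pixels); the evaluated set ${\boldsymbol{G}}$ contains all test data. Table 1 shows that as the number of iterations increases, the inference time grows while the average error and the percentage of pixels over threshold decrease. With the unilateral lookup, the average error of the 0th update is 1.50 pixels, and the 4th update reaches a sub-pixel accuracy of 0.84 pixels with the percentage of pixels over threshold reduced to 3.19%. In Table 1, the network with the unilateral lookup improves on the network without it in every metric, most visibly at the 0th iteration (inference time depends on the state of the machine, so the two can be regarded as very close, with no obvious difference). Adding the unilateral lookup is therefore reasonable and beneficial, which indicates that improving RAFT on the basis of human prior knowledge and experience is effective. All subsequent experiments use the network with the unilateral lookup.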

Table 1 Average error, percentage of pixels over the given threshold and inference time of each iteration on the Sceneflow test set (with and without unilateral search)

Configuration	Metric	Iter. 0	Iter. 1	Iter. 2	Iter. 3	Iter. 4
With unilateral search	EPE/pixel	1.50	1.04	0.92	0.87	0.84
With unilateral search	PPT/%	6.28	4.19	3.57	3.31	3.19
With unilateral search	Time/ms	39	72	98	127	158
Without unilateral search	EPE/pixel	1.61	1.09	0.96	0.90	0.88
Without unilateral search	PPT/%	7.01	4.41	3.73	3.43	3.28
Without unilateral search	Time/ms	39	73	100	132	159

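2.3.2 Comparison with other methods

Table 2 compares the proposed method SRS (similarity recursive search) with GC-Net (Kendall et al., 2017), PSMNet (pyramid stereo matching network) (Chang and Chen, 2018), Edgestereo (Song et al., 2020) and AANet^* (adaptive aggregation network) (Xu and Zhang, 2020); the evaluated set ${\boldsymbol{G}}$ contains all test data. Because inference time depends strongly on hardware and image size, both are listed, and for ease of comparison the inference time of the proposed method is given for two image sizes. The proposed method reaches an accuracy close to that of state-of-the-art methods. Together with Table 1, it also allows a flexible trade-off between accuracy and inference time through the choice of the number of iterations, which demonstrates its overall strength.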

Table 2 Evaluation results of the 4th iteration of our method and of other methods on the Sceneflow test set

Metric	GC-Net	PSMNet	Edgestereo	AANet^*	Ours
EPE/pixel	2.51	1.09	0.74	0.83	0.84
Time/s	0.95	0.41	0.32	0.160	0.158/0.153
GPU	Titan-X	Titan-Xp	GTX 1080Ti	NVIDIA V100	GTX 1080Ti
Image size/pixel	960×540	376×1 240	393×1 313	576×960	576×960/384×1 280

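2.3.3 Evaluation of disparity estimation in reliably extracted regions

Table 3 lists the metrics of the proposed method in non-occluded regions, occluded regions and reliably extracted regions; here the evaluated set ${\boldsymbol{G}}$ corresponds to each of these three regions in turn. The metrics differ markedly between occluded and non-occluded regions. At the 4th iteration, the average error in non-occluded regions is 0.55 pixels with only 1.61% of pixels over threshold, whereas in occluded regions the average error is 2.87 pixels and the percentage of pixels over threshold reaches 13.95%. After reliable region extraction, the average error falls to 0.21 pixels and the percentage of pixels over threshold to 0.29%, a large improvement, while the extracted region still covers 75.85% of the image. Fig. 6 shows the results on one image pair of the Sceneflow test set. The 0th iteration already recovers fairly complete structure, and as the iterations proceed the error decreases and the edges become sharper. Comparing Fig. 6(g)-(i) shows that reliable region extraction removes essentially all occluded regions as well as some non-occluded regions with large errors (such as the upper-right corner), keeping 64.72% of the area. The distribution of pixels over error in Fig. 6(j) shows that after extraction the pixels with large errors are essentially removed and the percentage of pixels with small errors increases markedly. Together, these results demonstrate the effectiveness of the reliable region extraction method.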

Table 3 Evaluation results of our method in three kinds of regions on the Sceneflow test set

Region	Iter. 0 PPT/%	Iter. 0 EPE/pixel	Iter. 0 EST/%	Iter. 1 PPT/%	Iter. 1 EPE/pixel	Iter. 1 EST/%	Iter. 4 PPT/%	Iter. 4 EPE/pixel	Iter. 4 EST/%
Non-occluded	3.40	0.98	-	2.03	0.65	-	1.61	0.55	-
Occluded	25.93	5.06	-	18.92	3.73	-	13.95	2.87	-
Reliably extracted	0.41	0.31	60.71	0.31	0.23	72.41	0.29	0.21	75.85
Note: "-" indicates that no corresponding value exists.

Fig. 6 Illustration of test results on Sceneflow test set

((a) left image; (b) the 0th iteration output; (c) the 1st iteration output; (d) the 2nd iteration output; (e) the 3rd iteration output; (f) ground truth; (g) the 4th iteration output; (h) ground truth in non-occluded areas; (i) the 4th iteration reliable region extraction; (j) pixel distribution with EPE before and after reliable region extraction of the 4th iteration)

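2.4 Analysis of results on KITTI

To further verify the method on real-scene data, the model trained on Sceneflow is fine-tuned separately on the training sets of KITTI2015 and KITTI2012 (5 500 training iterations each) and then cross-tested on the training sets of KITTI2012 and KITTI2015. In addition, for a fair comparison with other methods, the model trained on Sceneflow is trained for 11 000 iterations on the union of the KITTI2012 and KITTI2015 training data and then tested on the KITTI2012 and KITTI2015 test sets.

2.4.1 Cross-testing on the KITTI training sets

Fig. 7 shows the cross-test results for one image pair from each of KITTI2012 and KITTI2015. In the figure, generalization means that the model trained on the synthetic Sceneflow dataset is tested directly on the real-scene KITTI dataset. Fig. 7(c) shows that generalization alone already gives good results, and Fig. 7(d)-(i) show that fine-tuning and reliable region extraction further improve the metrics on real scenes. Fig. 7(l) shows that, relative to generalization, fine-tuning clearly raises the percentage of pixels with small errors and lowers the percentage with large errors, and that reliable region extraction not only clearly raises the percentage of pixels with small errors but also removes almost all pixels with large errors. In the point clouds of Fig. 7, before extraction there are many filled-in points for occluded regions; these points lie between foreground and background and are mostly noise. After extraction the noise is largely removed, so foreground and background separate well, but the point cloud becomes sparser and shows some obvious holes. This could be addressed in future work by improving depth estimation in occluded regions or by combining pose estimation to accumulate depth estimates from multiple frames over time.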

Fig. 7 KITTI training set test results

((a) left images; (b) ground truths; (c) generalization results of the 4th iteration; (d) the 0th iteration output after fine tuning; (e) the 1st iteration output after fine tuning; (f) the 2nd iteration output after fine tuning; (g) the 3rd iteration output after fine tuning; (h) the 4th iteration output after fine tuning; (i) the 4th iteration reliable region extraction after fine tuning; (j) point cloud within 50 meters of the 4th iteration after fine tuning; (k) point cloud within 50 meters of the 4th iteration reliable region extraction result after fine tuning; (l) pixel distribution with EPE before and after reliable region extraction of the 4th iteration after fine tuning)

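2.4.2 Testing on the KITTI test sets

Tables 4 and 5 list the results on the two KITTI test sets of the model trained on both KITTI training sets, together with the results of the compared methods (the data for the other methods come from the KITTI website http://www.cvlibs.net/datasets/kitti/index.php). LEAStereo (Cheng et al., 2020) and GANet-deep (Zhang et al., 2019) are the published methods with the best metrics on the KITTI leaderboard that were tested on both KITTI2012 and KITTI2015. After reliable region extraction, the proposed method estimates 79.42% of the pixels on KITTI2012 and 77.80% of the pixels on KITTI2015, whereas the other methods estimate 100% of the region. Tables 4 and 5 show only the metrics of the last iteration of the proposed method. Noc denotes non-occluded regions and All denotes all regions; each corresponds to its own evaluated set ${\boldsymbol{G}}$. Table 4 gives PPT2, the percentage of pixels whose error exceeds a given number of pixels ($e_{h}$ in Eq. (16) equal to 2, 3, 4 and 5 pixels). Table 5 gives PPT, the percentage of pixels whose error exceeds 3 pixels and whose error relative to the true disparity exceeds 5%. Because the parameters were not tuned extensively on KITTI, the metrics of the proposed method before reliable region extraction still lag behind the compared methods. If only the pixels for which the algorithm provides a valid disparity are considered, however, the metrics of the proposed method are clearly the best on both datasets. The KITTI website provides one leaderboard evaluated on all pixels with ground truth and one evaluated on the intersection of the pixels with ground truth and the pixels for which the submission provides valid values; as of December 1, 2021, the method ranked first under the second evaluation. This demonstrates the effectiveness of the proposed reliable region extraction.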

Table 4 Evaluation results (PPT2 and EPE) of the 4th iteration of our method and of other methods on the KITTI2012 test set

Method	>2 px Noc/%	>2 px All/%	>3 px Noc/%	>3 px All/%	>4 px Noc/%	>4 px All/%	>5 px Noc/%	>5 px All/%	EPE Noc/pixel	EPE All/pixel
GC-Net	2.71	3.46	1.77	2.30	1.36	1.77	1.12	1.46	0.6	0.7
LEAStereo	1.90	2.39	1.13	1.45	0.83	1.08	0.67	0.88	0.5	0.5
GANet-deep	1.89	2.50	1.19	1.60	0.91	1.23	0.76	1.02	0.4	0.5
Ours (before extraction)	3.89	4.54	2.31	2.76	1.66	2.00	1.29	1.58	0.6	0.7
Ours (after extraction)	1.08	1.08	0.47	0.48	0.30	0.30	0.22	0.22	0.4	0.4
Note: Noc denotes non-occluded regions and All denotes all regions.

Table 5 Evaluation results (PPT) of the 4th iteration of our method and of other methods on the KITTI2015 test set

Method	Noc D1-bg/%	Noc D1-fg/%	Noc D1-all/%	All D1-bg/%	All D1-fg/%	All D1-all/%	Time/s
GC-Net	2.02	5.58	2.61	2.21	6.16	2.87	0.9
LEAStereo	1.29	2.65	1.51	1.40	2.91	1.65	0.30
GANet-deep	1.34	3.11	1.63	1.48	3.46	1.81	1.80
Ours (before extraction)	2.67	4.02	2.90	2.83	4.28	3.07	0.15
Ours (after extraction)	0.64	1.14	0.71	0.64	1.14	0.72	0.15
Note: D1 denotes the left image, bg the background, fg the foreground, and all both background and foreground.

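3 Conclusion

Exploiting the similarity between disparity estimation and optical flow estimation, this paper transfers the strong optical flow method RAFT to disparity estimation. Iterative lookup of multi-scale similarity achieves high-precision disparity estimation and allows a flexible trade-off between accuracy and inference time through the choice of the number of iterations. Because the match of a left-image pixel can only lie to the relative left in the right image, a unilateral similarity lookup is added, which further improves accuracy; on the Sceneflow dataset the accuracy is on par with state-of-the-art methods. To deal with the large errors in occluded regions, the left and right images are swapped and flipped so that both disparities can be estimated, and reliable regions are obtained by comparing the absolute difference between the disparity estimates of matched points with a given threshold. This ensures high-precision disparity in the extracted regions, removes most occluded regions and other regions with large errors, and improves the metrics markedly. Generalization experiments verify that the model transfers from synthetic data to real-scene data, and fine-tuning further improves performance on real scenes. When only the estimated part is considered, the proposed method (with reliable region extraction) ranks first on the KITTI stereo test sets.

As for future work, the method estimates disparity by similarity lookup on feature maps at 1/8 resolution, which greatly limits accuracy; higher-resolution feature maps could be used, although inference speed would then need to be improved. The method can essentially guarantee high accuracy only in non-occluded regions, so occluded regions require dedicated improvements. Finally, the network is trained with supervised learning, which requires very costly annotated data; unsupervised training could be investigated.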

References

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z F, Citro C, Corrado G S, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y Q, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y and Zheng X Q. 2016. Tensorflow: large-scale machine learning on heterogeneous distributed systems[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/1603.04467.pdf

Chang J R and Chen Y S. 2018. Pyramid stereo matching network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5410-5418[DOI: 10.1109/CVPR.2018.00567]

Cheng X L, Zhong Y R, Harandi M, Dai Y C, Chang X J, Drummond T, Li H D and Ge Z Y. 2020. Hierarchical neural architecture search for deep stereo matching[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/2010.13501.pdf

Cho K, Van Merriënboer B, Bahdanau D and Bengio Y. 2014. On the properties of neural machine translation: encoder-decoder approaches//Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics: 103-111[DOI: 10.3115/v1/W14-4012]

Geiger A, Lenz P and Urtasun R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 3354-3361[DOI: 10.1109/CVPR.2012.6248074]

Gordon A, Li H H, Jonschkowski R and Angelova A. 2019. Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 8976-8985[DOI: 10.1109/ICCV.2019.00907]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Identity mappings in deep residual networks//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 630-645[DOI: 10.1007/978-3-319-46493-0_38]

Hirschmuller H. 2008. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2): 328-341 [DOI:10.1109/TPAMI.2007.1166]

Kendall A, Martirosyan H, Dasgupta S, Henry P, Kennedy R, Bachrach A and Bry A. 2017. End-to-end learning of geometry and context for deep stereo regression//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 66-75[DOI: 10.1109/ICCV.2017.17]

Lee H and Shin Y. 2019. Real-time stereo matching network with high accuracy//Proceedings of 2019 IEEE International Conference on Image Processing. Taipei, China: IEEE: 4280-4284[DOI: 10.1109/ICIP.2019.8803514]

Liang Z F, Feng Y L, Guo Y L, Liu H Z, Qiao L B, Chen W, Zhou L and Zhang J F. 2017. Learning deep correspondence through prior and posterior feature constancy[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/1712.01039v1.pdf

Mayer N, Ilg E, Häusser P, Fischer P, Cremers D, Dosovitskiy A and Brox T. 2016. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4040-4048[DOI: 10.1109/CVPR.2016.438]

Peng L, Deng D and Cai D. 2020. Geometry-based occlusion-aware unsupervised stereo matching for autonomous driving[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/2010.10700.pdf

Qing C, Yu J, Xiao C B and Duan J. 2020. Deep convolutional neural network for semantic image segmentation. Journal of Image and Graphics, 25(6): 1069-1090 (青晨, 禹晶, 肖创柏, 段娟. 2020. 深度卷积神经网络图像语义分割研究进展. 中国图象图形学报, 25(6): 1069-1090) [DOI:10.11834/jig.190355]

Song X, Zhao X, Fang L J, Hu H W and Yu Y Z. 2020. EdgeStereo: an effective multi-task learning network for stereo matching and edge detection. International Journal of Computer Vision, 128(4): 910-930 [DOI:10.1007/s11263-019-01287-w]

Teed Z and Deng J. 2020. RAFT: recurrent all-pairs field transforms for optical flow//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 402-419[DOI: 10.1007/978-3-030-58536-5_24]

Tieleman T and Hinton G. 2012. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2): 26-31

Xu H F and Zhang J Y. 2020. AANet: adaptive aggregation network for efficient stereo matching//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1956-1965[DOI: 10.1109/CVPR42600.2020.00203]

Zhang F H, Prisacariu V, Yang R G and Torr P H S. 2019. GA-Net: guided aggregation net for end-to-end stereo matching//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 185-194[DOI: 10.1109/CVPR.2019.00027]

Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6230-6239[DOI: 10.1109/CVPR.2017.660]

Zhao Y Q, Rao Y, Dong S P and Zhang J Y. 2020. Survey on deep learning object detection. Journal of Image and Graphics, 25(4): 629-654 (赵永强, 饶元, 董世鹏, 张君毅. 2020. 深度学习目标检测方法综述. 中国图象图形学报, 25(4): 629-654) [DOI:10.11834/jig.190307]