Reliable binocular disparity estimation based on multi-scale similarity recursive search

Yan Min, Wang Junzheng, Li Jing (School of Automation, Beijing Institute of Technology, Beijing 100081, China)

Abstract
Objective Binocular disparity estimation yields dense depth estimates and is therefore of great research value. Disparity estimation and optical flow estimation are closely related tasks, so the two can borrow from each other and inspire new algorithms. Inspired by RAFT (recurrent all-pairs field transforms), an efficient optical flow estimation algorithm, this paper proposes a method based on unilateral and bilateral iterative multi-scale similarity lookup to achieve high-precision binocular disparity estimation. To address the inconsistent estimation accuracy and confidence across different regions, a left-right disparity consistency check is proposed to extract reliably estimated regions. Method A feature network with a pyramid pooling module, skip connections and a residual structure extracts representation vectors with strong representational power; the inner product of these vectors expresses the similarity between pixels, and average pooling yields multi-scale similarity volumes. The 0th iteration integrates three kinds of information: the initial disparity, a large-field-of-view similarity volume obtained by searching the multi-scale similarities leftward (in one direction) from the initial disparity, and context. Each subsequent iteration integrates the updated disparity estimate, a large-field-of-view similarity volume obtained by searching the multi-scale similarities in both directions around the estimated disparity, and context. The integrated information is passed through a convolutional recurrent neural network (one network for the 0th update and a shared network for all later updates) that iteratively outputs disparity updates, and the final disparity estimate is obtained after multiple iterations. The disparity of the right image is then estimated by reversing the order of the input left and right images and flipping them horizontally, and the confidence of each estimate is judged by comparing the absolute disparity difference between matched left and right points with a given threshold, thereby extracting reliable regions. Result The proposed method achieves accuracy comparable to state-of-the-art methods on the Sceneflow dataset, with an average error of only 0.84 pixels, and its inference time is comparatively short; speed and accuracy can be balanced flexibly by controlling the number of iterations. After reliable-region extraction, the error on the Sceneflow dataset is further reduced to a record 0.21 pixels, and on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) stereo test sets the method is best on the metrics evaluated over the estimated regions. Conclusion The proposed method performs very well for binocular disparity estimation, and the reliable-region extraction method efficiently extracts high-precision regions, greatly improving the reliability of the estimated regions.
Reliable binocular disparity estimation based on multi-scale similarity recursive search

Yan Min, Wang Junzheng, Li Jing(School of Automation, Beijing Institute of Technology, Beijing 100081, China)

Abstract
Objective Depth information is key sensing information for autonomous platforms. Among common depth sensors, binocular cameras can compensate for the sparsity of light detection and ranging (LiDAR) point clouds and for the unsuitability of depth cameras in outdoor scenes, so improving the accuracy and speed of binocular disparity estimation algorithms is very important. Disparity estimation algorithms based on deep learning have their own advantages. Disparity estimation and optical flow estimation methods can learn from each other and facilitate the generation of new algorithms. Inspired by the efficient optical flow estimation algorithm recurrent all-pairs field transforms (RAFT), a unilateral and bilateral multi-scale similarity recursive search method is proposed to achieve high-precision binocular disparity estimation. A disparity estimation consistency detection method for the left and right images is proposed to extract reliable estimation regions and resolve the inconsistent estimation accuracy and confidence in different regions. Method The pyramid pooling module (PPM), skip-layer connections and a residual structure are adopted in the feature network to extract representation vectors with strong representation capability. The inner product of representation vectors expresses the similarity between pixels. Multi-scale similarity is obtained by average pooling. Three kinds of information are integrated: the updated or initial disparity, a range of similarities with a large field of view looked up in the multi-scale similarity according to the disparity (the 0th updating iteration searches in one direction to the left, and the other updating iterations search in both directions), and context information.
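The all-pairs similarity construction and the unilateral/bilateral lookup described above can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function names, feature shapes, pyramid depth and search radius are all illustrative.

```python
import numpy as np

def correlation_pyramid(feat_left, feat_right, num_levels=4):
    """Build a multi-scale similarity volume from left/right feature maps.

    feat_left, feat_right: (H, W, C) representation vectors from the
    feature network. Similarity between left pixel (y, x) and right
    pixel (y, x') is the inner product of their representation vectors.
    """
    H, W, C = feat_left.shape
    # All-pairs similarity along each epipolar line: (H, W, W)
    corr = np.einsum('ywc,yvc->ywv', feat_left, feat_right)
    pyramid = [corr]
    for _ in range(num_levels - 1):
        # Average-pool the matching dimension to obtain coarser scales.
        c = pyramid[-1]
        W2 = c.shape[-1] // 2
        pyramid.append(c[..., :W2 * 2].reshape(H, W, W2, 2).mean(-1))
    return pyramid

def lookup(pyramid, disparity, radius=4, bilateral=True):
    """Sample similarities in a window around the current disparity.

    The 0th iteration searches only to the left of the initial disparity
    (unilateral); later iterations search in both directions (bilateral).
    """
    H, W = disparity.shape
    x = np.arange(W, dtype=np.float32)[None, :].repeat(H, 0)
    samples = []
    for lvl, corr in enumerate(pyramid):
        offsets = (np.arange(-radius, radius + 1) if bilateral
                   else np.arange(-2 * radius, 1))  # leftward-only window
        for d in offsets:
            # Matching position in the right image at this pyramid level.
            pos = (x - disparity) / (2 ** lvl) + d
            pos = np.clip(np.rint(pos).astype(int), 0, corr.shape[-1] - 1)
            samples.append(np.take_along_axis(corr, pos[..., None], -1)[..., 0])
    return np.stack(samples, -1)  # (H, W, num_levels * window_size)
```

The sampled similarity channels, the current disparity and the context features would then be concatenated and fed to the update network.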
The integrated information is transmitted to the convolutional recurrent neural network (ConvRNN) of the 0th updating process or to the ConvRNN shared by the other updating processes to obtain the disparity update, and the final disparity value is obtained via multiple updating iterations. The disparity of the right image is estimated by reversing the order of the input left and right images and flipping them horizontally, and the confidence of the disparity is determined by comparing the absolute value of the disparity difference between matched points of the left and right images with a given threshold. The output of each updating iteration is supervised with a loss whose weight increases with the iteration index, so the error is reduced gradually. During training, the learning rate is reduced in stages, and the root mean square propagation (RMSProp) optimization algorithm is used. To improve inference efficiency, the feature network reduces the resolution by a factor of 8, so a learned up-sampling method is adopted to generate a disparity map with the same resolution as the original image: the disparity of the 8×8 neighborhood of a pixel in the full-resolution image is computed by weighting the disparities of the 3×3 neighborhood of the corresponding pixel in the reduced-resolution map, with weights obtained by convolving the hidden state of the ConvRNN. To avoid the high cost of collecting real-scene disparity or depth data, the Sceneflow dataset generated with the 3D creation suite Blender is used to train and test the network, and real-scene KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) data are used to verify the generalization capability of the proposed method. First, on the Flyingthings3D subset of the Sceneflow dataset, 21 818 pairs of 540×960 pixel training images are randomly cropped to 256×512 pixels.
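The learned 8× up-sampling step can be sketched as a convex combination over the 3×3 low-resolution neighborhood. In this hedged NumPy sketch the weights are assumed to be given as an input (the paper predicts them by convolving the ConvRNN hidden state), and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def convex_upsample(disp_low, weights, factor=8):
    """Upsample a low-resolution disparity map by a learned convex combination.

    disp_low: (H, W) disparity at 1/8 resolution.
    weights:  (H, W, factor*factor, 9) convex weights; each 9-vector is
              assumed to sum to 1 (e.g. after a softmax).
    Each pixel of the full-resolution factor x factor output block is a
    weighted sum of the 3x3 low-resolution neighborhood.
    """
    H, W = disp_low.shape
    # Disparities are pixel distances, so scale by the upsampling factor.
    padded = np.pad(disp_low * factor, 1, mode='edge')
    # Gather the 3x3 neighborhood of every low-res pixel: (H, W, 9)
    neigh = np.stack([padded[i:i + H, j:j + W]
                      for i in range(3) for j in range(3)], axis=-1)
    # Convex combination per output sub-pixel: (H, W, factor*factor)
    up = np.einsum('hwkn,hwn->hwk', weights, neigh)
    # Rearrange the factor*factor samples into an (H*factor, W*factor) map.
    return up.reshape(H, W, factor, factor).transpose(0, 2, 1, 3) \
             .reshape(H * factor, W * factor)
```

With uniform weights this reduces to plain 3×3 averaging; the learned weights let the network sharpen disparity edges instead of blurring them.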
The cropped images are fed to the network, which is trained for 440 000 iterations with a batch size of 4. The trained network is tested on 4 248 pairs of test images. To verify the rationality of adding the unilateral search process, ablation experiments on the Sceneflow dataset compare the performance of networks with and without it. Next, the network trained on the Sceneflow data is tested directly on the KITTI training data to verify the generalization ability of the algorithm from simulated to real-scene data. Then, the network trained on the Sceneflow dataset is fine-tuned on the KITTI2012 and KITTI2015 training sets separately (5 500 training iterations each) and cross-tested on the KITTI2015 and KITTI2012 training sets for qualitative analysis. Finally, the network trained on the Sceneflow data is fine-tuned on the KITTI2012 and KITTI2015 training sets together (11 000 training iterations) and tested on the KITTI2012 and KITTI2015 test sets to verify the performance of the network further. The code is implemented with the TensorFlow framework. Result Before the reliable-region extraction step, the accuracy of this method on the Sceneflow dataset is comparable to that of state-of-the-art methods. The average error is only 0.84 pixels; the error decreases as the number of updating iterations increases, while the inference time becomes longer, so a trade-off between speed and accuracy can be obtained by adjusting the number of updating iterations. After reliable-region extraction, the error on the Sceneflow dataset is further reduced to the historical best value of 0.21 pixels. On the KITTI benchmark, the method ranks first when only the estimated regions are evaluated. The colorized disparity maps and point clouds show clearly that almost all occluded regions and most areas with large errors are removed by reliable-region extraction.
Conclusion The proposed method performs very well for binocular disparity estimation. The reliable-region extraction method can extract high-precision estimation regions efficiently, which greatly improves the disparity reliability of the estimated regions.
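The left-right consistency check used for reliable-region extraction can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code; the function name, nearest-neighbor rounding and default threshold are assumptions.

```python
import numpy as np

def reliable_mask(disp_left, disp_right, threshold=1.0):
    """Left-right consistency check for reliable-region extraction.

    disp_left:  (H, W) disparity estimated for the left image.
    disp_right: (H, W) disparity estimated for the right image (obtained,
                as in the paper, by swapping and horizontally flipping the
                input pair, then flipping the result back).
    A left pixel (y, x) matches right pixel (y, x - d); the estimate is
    kept only if the two disparities agree within `threshold` pixels.
    """
    H, W = disp_left.shape
    x = np.arange(W)[None, :].repeat(H, 0)
    # Nearest-neighbor location of the matched pixel in the right map.
    xr = np.clip(np.rint(x - disp_left).astype(int), 0, W - 1)
    d_right = np.take_along_axis(disp_right, xr, axis=1)
    return np.abs(disp_left - d_right) <= threshold
```

Pixels failing the check, which are typically occluded or badly matched, are simply excluded from the output, which is what shrinks the Sceneflow error from 0.84 to 0.21 pixels over the retained regions.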
