Reliable binocular disparity estimation based on multi-scale similarity recursive search
2022, Vol. 27, No. 2, Pages 447-460
Print publication date: 2022-02-16
Accepted: 2021-11-12
DOI: 10.11834/jig.210551
Min Yan, Junzheng Wang, Jing Li. Reliable binocular disparity estimation based on multi-scale similarity recursive search[J]. Journal of Image and Graphics, 2022, 27(2): 447-460.
Objective
Depth information is key sensing information for autonomous platforms. As a common depth sensor, the binocular camera can compensate for the sparsity of LiDAR (light detection and ranging) point clouds and for the unsuitability of depth cameras in outdoor scenes, so improving the accuracy and speed of binocular disparity estimation algorithms is important, and algorithms based on deep learning have particular advantages here. Moreover, disparity estimation and optical flow estimation are similar tasks, so methods for one can inform the other and inspire new algorithms. Inspired by the efficient optical flow estimation algorithm RAFT (recurrent all-pairs field transforms), a unilateral and bilateral multi-scale similarity recursive search method is proposed to achieve high-precision binocular disparity estimation. In addition, because estimation accuracy and confidence differ across image regions, a left-right disparity consistency detection method is proposed to extract reliably estimated regions.
Method
A feature network with a pyramid pooling module (PPM), skip connections, and residual structures extracts representation vectors with strong representational capability. The inner product of representation vectors expresses the similarity between pixels, and multi-scale similarity volumes are obtained from the full-resolution volume by average pooling.
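A minimal NumPy sketch of these two steps may help; it is an interpretation of the description above, not the paper's code, and the shapes, function names, and number of scales are assumptions.

```python
import numpy as np

def similarity_volume(feat_l, feat_r):
    """All-pairs similarity along each scanline: entry [y, x, v] is the
    inner product of left feature (y, x) and right feature (y, v)."""
    # feat_l, feat_r: (H, W, C) maps from the shared feature network
    return np.einsum('ywc,yvc->ywv', feat_l, feat_r)

def multi_scale(volume, num_scales=4):
    """Average-pool the matching-position axis to build coarser scales,
    enlarging the field of view of a fixed-size lookup window."""
    scales = [volume]
    for _ in range(num_scales - 1):
        v = scales[-1]
        w = v.shape[-1] // 2 * 2               # drop a trailing odd column
        scales.append(v[..., :w].reshape(*v.shape[:-1], -1, 2).mean(-1))
    return scales
```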
Each updating iteration integrates three kinds of information: the current disparity (the initial disparity at the 0th iteration), a large-field-of-view slice of similarities looked up from the multi-scale volumes around that disparity (the 0th iteration searches in one direction, to the left, while later iterations search in both directions), and context information. The integrated information is passed to a convolutional recurrent neural network (ConvRNN), with a dedicated ConvRNN for the 0th update and a ConvRNN shared by all later updates, which outputs a disparity increment; the final disparity is obtained after multiple updating iterations.
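The windowed lookup can be sketched in the same style, with the unilateral (0th iteration) versus bilateral (later iterations) search expressed as the choice of offsets; the window radius and the rounding scheme are assumptions:

```python
import numpy as np

def lookup(scales, disparity, radius=4, bilateral=True):
    """Gather similarities in a window around the current disparity at every
    scale. bilateral=False gives the 0th iteration's one-sided (leftward)
    search; bilateral=True gives the two-sided search used afterwards."""
    H, W = disparity.shape
    ys, xs = np.mgrid[0:H, 0:W]
    offsets = range(-radius, radius + 1) if bilateral else range(-radius, 1)
    slices = []
    for k, vol in enumerate(scales):
        for d in offsets:
            # candidate matching column in the right image, at scale 1/2^k
            xr = np.round((xs - disparity + d) / 2 ** k).astype(int)
            xr = np.clip(xr, 0, vol.shape[-1] - 1)
            slices.append(vol[ys, xs, xr])
    return np.stack(slices, -1)                # (H, W, scales * window size)
```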
The disparity of the right image is estimated by swapping and left-right flipping the input left and right images, and the confidence of each disparity is determined by comparing the absolute disparity difference between matched points in the left and right images against a given threshold, which realizes reliable region extraction. The output of every updating iteration is supervised, with loss weights that increase across iterations so that the error is reduced gradually, and the network is trained in this supervised manner.
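The left-right consistency test might look like the sketch below; obtaining disp_r by running the network on the swapped, flipped pair (and flipping the result back) is outside the snippet, and the threshold value is illustrative:

```python
import numpy as np

def reliable_mask(disp_l, disp_r, threshold=1.0):
    """Keep a left-image pixel only when the disparity of its matched
    right-image pixel agrees to within `threshold` pixels."""
    H, W = disp_l.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xr = np.clip(np.round(xs - disp_l).astype(int), 0, W - 1)  # matched column
    return np.abs(disp_l - disp_r[ys, xr]) < threshold
```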
During training, the learning rate is reduced in stages, and the RMSProp (root mean square propagation) optimization algorithm is used for learning.
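Since the code is implemented in TensorFlow, the staged schedule with RMSProp could be configured roughly as follows; the boundaries and rates are placeholders, not the paper's values:

```python
import tensorflow as tf

# Piecewise-constant ("reduced by segments") learning-rate schedule with
# RMSProp; the boundaries and values below are illustrative assumptions.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[200_000, 350_000],
    values=[1e-4, 5e-5, 1e-5])
optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule)
```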
To improve inference efficiency, the feature network reduces the resolution by a factor of 8, so a learned upsampling method is adopted to generate a disparity map at the original image resolution: the disparities of the 8×8 region around a pixel at the original resolution are computed by weighting the disparities of the 3×3 neighborhood of the corresponding pixel at the reduced resolution, with the weights obtained by convolving the hidden state of the ConvRNN.
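A NumPy sketch of this learned upsampling, assuming RAFT-style convex combination weights, is given below; the exact weight parameterization is an assumption:

```python
import numpy as np

def convex_upsample(disp, weights, factor=8):
    """disp: (H, W) coarse disparity. weights: (H, W, factor, factor, 9),
    softmax-normalized over the last axis and predicted (in the paper) by
    convolving the ConvRNN hidden state; here they are simply an input."""
    H, W = disp.shape
    padded = np.pad(disp, 1, mode='edge')
    # 3x3 neighborhood of every coarse pixel, flattened to (H, W, 9)
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)], -1)
    up = np.einsum('hwijk,hwk->hwij', weights, neigh) * factor  # rescale disparity
    return up.transpose(0, 2, 1, 3).reshape(H * factor, W * factor)
```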
To avoid the high cost of collecting real-scene disparity or depth data, the Sceneflow dataset, generated with the 3D creation suite Blender, is used to train and test the network, and the real-scene KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) data are used to verify the generalization capability of the proposed method. First, on the FlyingThings3D subset of Sceneflow, 21,818 pairs of 540×960-pixel training images are randomly cropped to 256×512 pixels and fed to the network for 440,000 training iterations with a batch size of 4; the trained network is then tested on 4,248 pairs of test images.
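The shared random crop applied to both images and the ground-truth disparity can be sketched as follows (an assumption-level illustration of the 540×960 to 256×512 cropping):

```python
import numpy as np

def random_crop(img_l, img_r, disp, ch=256, cw=512):
    """Crop the stereo pair and ground-truth disparity at one shared location."""
    H, W = disp.shape
    y0 = np.random.randint(0, H - ch + 1)
    x0 = np.random.randint(0, W - cw + 1)
    window = (slice(y0, y0 + ch), slice(x0, x0 + cw))
    return img_l[window], img_r[window], disp[window]
```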
To verify the rationality of adding the unilateral search process, ablation experiments on the Sceneflow dataset compare the performance of networks with and without it. Next, the network trained on Sceneflow is tested directly on the KITTI training data to verify the generalization ability of the algorithm from simulated to real-scene data. Then, the Sceneflow-trained network is fine-tuned separately on the KITTI2012 and KITTI2015 training sets (5,500 training iterations each) and cross-tested on the KITTI2015 and KITTI2012 training sets for qualitative analysis. Finally, the Sceneflow-trained network is fine-tuned jointly on the KITTI2012 and KITTI2015 training sets (11,000 training iterations) and tested on the KITTI2012 and KITTI2015 test sets to further verify its performance. The code is implemented with the TensorFlow framework.
Result
Before the reliable region extraction step, the accuracy of the proposed method on the Sceneflow dataset is comparable to that of state-of-the-art methods, with an average error of only 0.84 pixels. The error decreases as the number of updating iterations increases while the inference time grows, so speed and accuracy can be balanced flexibly by adjusting the number of updating iterations; the inference time also compares favorably with competing methods. After reliable region extraction, the error on the Sceneflow dataset is further reduced to 0.21 pixels, the best value reported to date. On the KITTI benchmark, the method ranks first when only the estimated regions are evaluated. The colorized disparity maps and point clouds show clearly that reliable region extraction removes almost all occluded regions and most areas with large errors.
Conclusion
The proposed method performs strongly on binocular disparity estimation, and the reliable region extraction method can efficiently extract high-precision estimation regions, greatly improving the reliability of the estimated regions.
Keywords: binocular disparity estimation; occlusion; convolutional recurrent neural network (CRNN); deep learning; supervised learning
References
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z F, Citro C, Corrado G S, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y Q, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y and Zheng X Q. 2016. TensorFlow: large-scale machine learning on heterogeneous distributed systems[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/1603.04467.pdf
Chang J R and Chen Y S. 2018. Pyramid stereo matching network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5410-5418[DOI: 10.1109/CVPR.2018.00567]
Cheng X L, Zhong Y R, Harandi M, Dai Y C, Chang X J, Drummond T, Li H D and Ge Z Y. 2020. Hierarchical neural architecture search for deep stereo matching[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/2010.13501.pdf
Cho K, van Merriënboer B, Bahdanau D and Bengio Y. 2014. On the properties of neural machine translation: encoder-decoder approaches//Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics: 103-111[DOI: 10.3115/v1/W14-4012]
Geiger A, Lenz P and Urtasun R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 3354-3361[DOI: 10.1109/CVPR.2012.6248074]
Gordon A, Li H H, Jonschkowski R and Angelova A. 2019. Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 8976-8985[DOI: 10.1109/ICCV.2019.00907]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Identity mappings in deep residual networks//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 630-645[DOI: 10.1007/978-3-319-46493-0_38]
Hirschmuller H. 2008. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2): 328-341[DOI: 10.1109/TPAMI.2007.1166]
Kendall A, Martirosyan H, Dasgupta S, Henry P, Kennedy R, Bachrach A and Bry A. 2017. End-to-end learning of geometry and context for deep stereo regression//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 66-75[DOI: 10.1109/ICCV.2017.17]
Lee H and Shin Y. 2019. Real-time stereo matching network with high accuracy//Proceedings of 2019 IEEE International Conference on Image Processing. Taipei, China: IEEE: 4280-4284[DOI: 10.1109/ICIP.2019.8803514]
Liang Z F, Feng Y L, Guo Y L, Liu H Z, Qiao L B, Chen W, Zhou L and Zhang J F. 2017. Learning deep correspondence through prior and posterior feature constancy[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/1712.01039v1.pdf
Mayer N, Ilg E, Häusser P, Fischer P, Cremers D, Dosovitskiy A and Brox T. 2016. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4040-4048[DOI: 10.1109/CVPR.2016.438]
Peng L, Deng D and Cai D. 2020. Geometry-based occlusion-aware unsupervised stereo matching for autonomous driving[EB/OL]. [2021-12-01]. https://arxiv.org/pdf/2010.10700.pdf
Qing C, Yu J, Xiao C B and Duan J. 2020. Deep convolutional neural network for semantic image segmentation. Journal of Image and Graphics, 25(6): 1069-1090[DOI: 10.11834/jig.190355]
Song X, Zhao X, Fang L J, Hu H W and Yu Y Z. 2020. EdgeStereo: an effective multi-task learning network for stereo matching and edge detection. International Journal of Computer Vision, 128(4): 910-930[DOI: 10.1007/s11263-019-01287-w]
Teed Z and Deng J. 2020. RAFT: recurrent all-pairs field transforms for optical flow//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 402-419[DOI: 10.1007/978-3-030-58536-5_24]
Tieleman T and Hinton G. 2012. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2): 26-31
Xu H F and Zhang J Y. 2020. AANet: adaptive aggregation network for efficient stereo matching//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1956-1965[DOI: 10.1109/CVPR42600.2020.00203]
Zhang F H, Prisacariu V, Yang R G and Torr P H S. 2019. GA-Net: guided aggregation net for end-to-end stereo matching//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 185-194[DOI: 10.1109/CVPR.2019.00027]
Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6230-6239[DOI: 10.1109/CVPR.2017.660]
Zhao Y Q, Rao Y, Dong S P and Zhang J Y. 2020. Survey on deep learning object detection. Journal of Image and Graphics, 25(4): 629-654[DOI: 10.11834/jig.190307]