Object distance estimation based on stereo regional disparity regression
2021, Vol. 26, No. 7, pp. 1604-1613
Received: 2020-08-24; Revised: 2021-01-13; Accepted: 2021-01-20; Published in print: 2021-07-16
DOI: 10.11834/jig.200511
Objective
Stereo vision is a good solution to the object distance estimation problem. Existing stereo-based object distance estimation methods suffer either from low estimation accuracy or from cumbersome data preparation, so an algorithm is needed that balances accuracy with convenience of data preparation.
Method
We propose an R-CNN (region convolutional neural network)-based network that performs object detection and object distance estimation simultaneously. After the stereo image pair is fed into the network, features are extracted by a backbone network, and a stereo region proposal network produces bounding boxes of the same object in the left and right images at the same time; the paired local features inside the object boxes are then fed into an object disparity estimation branch to estimate the object's distance. To obtain the bounding boxes of the same object in both images in one step, the original region proposal network is replaced with a stereo region proposal network, and a stereo bounding-box branch is proposed to regress the left and right bounding boxes simultaneously. To improve disparity estimation accuracy, drawing on the structure of stereo disparity-map estimation networks, a disparity estimation branch based on group-wise correlation and 3D convolution is proposed.
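The conversion from an estimated object disparity to a metric distance follows the standard pinhole stereo model, Z = f·B/d. A minimal sketch of this relation (function name and parameter values are illustrative, not from the paper):

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Convert a horizontal disparity (pixels) to metric depth via Z = f*B/d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example with assumed calibration values: f = 720 px, B = 0.54 m.
# A 72 px disparity then corresponds to a depth of 5.4 m.
z = depth_from_disparity(72.0, 720.0, 0.54)
```

Because depth is the reciprocal of disparity, a fixed disparity error produces a depth error that grows roughly quadratically with distance (dZ/dd = -Z²/(f·B)), which is why accurate object-level disparity estimation matters.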
Result
Validation experiments were conducted on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset. Compared with similar algorithms, the mean relative error of our method is about 3.2%, far below that of the disparity-map-estimation-based algorithm (11.3%) and close to that of the 3D-object-detection-based algorithm (about 3.9%). In addition, the proposed improvement to the disparity estimation branch clearly boosts accuracy, reducing the mean relative error from 5.1% to 3.2%. Similar experiments on a pedestrian surveillance dataset that we additionally collected and annotated yield a mean relative error of about 4.6%, showing that the method can be applied effectively to surveillance scenes.
Conclusion
The proposed stereo object distance estimation network combines the advantages of object detection and stereo disparity estimation and achieves high accuracy. It can be applied effectively to vehicle-mounted cameras and surveillance scenes, and is promising for other settings equipped with stereo cameras.
Objective
Object distance estimation is a fundamental problem in 3D vision. However, most successful object distance estimators need extra 3D information from active depth cameras or laser scanners, which increases the cost. Stereo vision is a convenient and cheap solution to this problem. Modern object distance estimation solutions are mainly based on deep neural networks, which provide better accuracy than traditional methods. Deep learning-based solutions fall into two main types. The first combines a 2D object detector with a stereo disparity estimator: the disparity estimator outputs depth information for the image, and the object detector detects object boxes or masks. The detected boxes or masks are then applied to the depth image; the pixel depths inside each box are sorted, and the closest is selected to represent the distance of the object. However, experiments show that such systems are not accurate enough for this problem. The second type uses a monocular 3D object detector. Such detectors output 3D bounding boxes of objects, which indicate their distance. 3D object detectors are more accurate, but need 3D bounding-box coordinate annotations for training, which require special devices to collect data and entail high labelling costs. Therefore, we need a solution that achieves good accuracy while keeping model training simple.
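The first solution type described above (crop the depth map to the detected box, sort the pixel depths, take the closest) can be sketched as follows; `object_distance_from_depth` and its arguments are hypothetical names for illustration, not the paper's API:

```python
import numpy as np

def object_distance_from_depth(depth: np.ndarray, box: tuple) -> float:
    """Crop a dense depth map to a detected 2D box and take the closest
    valid (positive) pixel depth as the object distance estimate."""
    x1, y1, x2, y2 = box
    crop = depth[y1:y2, x1:x2]
    valid = crop[crop > 0]            # discard invalid / zero-depth pixels
    if valid.size == 0:
        raise ValueError("no valid depth inside the box")
    return float(np.sort(valid.ravel())[0])  # nearest pixel in the box
```

Taking the single closest pixel makes this pipeline sensitive to disparity-map noise and to occluding foreground pixels inside the box, which is consistent with the accuracy limitations noted above.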
Method
We propose a region convolutional neural network (R-CNN)-based network that performs object detection and distance estimation from stereo images simultaneously. The network can be trained using only object distance labels, which makes it easy to apply in fields such as surveillance and robot motion. A stereo region proposal network extracts proposals of corresponding object bounding boxes from the left-view and right-view images in one step. A stereo bounding-box regression module then regresses the corresponding bounding-box coordinates simultaneously. The disparity could be calculated from the corresponding bounding-box x-coordinates, but the distance obtained this way may be inaccurate because of the reciprocal relation between depth and disparity. We therefore propose a disparity estimation branch that estimates object-wise disparity from local object features taken from corresponding areas in the left-view and right-view images. Because this process can be treated as regression, a network structure similar to the stereo bounding-box regression module could be used; however, the disparity estimated by such a branch is still inaccurate. Inspired by disparity-image estimation methods, we adopt a similar structure in this module, constructing the disparity estimation branch with group-wise correlation and a 3D convolutional stacked-hourglass structure.
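Group-wise correlation, in general, splits the feature channels into groups and correlates left and right features at each candidate disparity, producing a 4D cost volume that the 3D convolutions then aggregate. A NumPy sketch of the general technique (shapes, names, and the per-group mean are assumptions; the paper's implementation details may differ):

```python
import numpy as np

def groupwise_correlation(feat_l: np.ndarray, feat_r: np.ndarray,
                          num_groups: int, max_disp: int) -> np.ndarray:
    """Build a group-wise correlation cost volume.
    feat_l, feat_r: [C, H, W] feature maps; C must be divisible by num_groups.
    Returns a volume of shape [num_groups, max_disp, H, W]."""
    C, H, W = feat_l.shape
    assert C % num_groups == 0
    cpg = C // num_groups                       # channels per group
    gl = feat_l.reshape(num_groups, cpg, H, W)
    gr = feat_r.reshape(num_groups, cpg, H, W)
    volume = np.zeros((num_groups, max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (gl * gr).mean(axis=1)
        else:
            # correlate each left pixel with the right pixel d columns to its left
            volume[:, d, :, d:] = (gl[:, :, :, d:] * gr[:, :, :, :-d]).mean(axis=1)
    return volume
```

Compared with full correlation (one similarity score per disparity) or concatenation volumes, the group-wise form keeps several correlation maps per disparity, giving the 3D convolutions richer matching evidence at moderate cost.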
Result
We trained and validated our method on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset to show that our network is accurate for this task. We compare our method with other types of methods, including disparity-image-estimation-based methods and 3D-object-detection-based methods, and provide qualitative results by visualizing distance-estimation errors on the left-view image. Our method outperforms disparity-image-estimation-based methods by a large margin and is comparable with or superior to 3D-object-detection-based methods, which require 3D box annotations. We also compare the different disparity estimation solutions proposed in this paper, showing that the proposed disparity estimation branch yields much more robust object distances and that the 3D convolutional stacked-hourglass structure further improves object-distance estimation accuracy. To prove that our method can be applied to surveillance stereo object distance estimation, we collected and labeled a new dataset of surveillance pedestrian scenes. The dataset contains 3 265 images shot by a stereo camera; all pedestrians in the left-view images are labeled with their bounding box as well as the pixel positions of their head and feet, which helps recover pedestrian distance from the disparity image. Similar experiments on this dataset prove that our method applies to surveillance scenes effectively and accurately. As this dataset contains no 3D bounding-box annotations, 3D-object-detection-based methods cannot be applied in this scenario.
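The mean relative error used in these comparisons can be computed as a minimal sketch below (the exact evaluation protocol, e.g. matching of detections to ground truth, is not specified in the abstract):

```python
import numpy as np

def mean_relative_error(pred, gt):
    """Mean of |pred - gt| / gt over matched object distances."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt) / gt))

# Illustrative values only: predictions 10% off in either direction
# give a mean relative error of 0.1 (i.e. 10%).
err = mean_relative_error([11.0, 9.0], [10.0, 10.0])
```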
Conclusion
In this study, we propose an R-CNN-based network that performs object detection and distance estimation simultaneously from stereo images. The experimental results show that our model is accurate and is easy to train and to apply to other fields.