Object distance estimation based on stereo regional disparity regression
2021, Vol. 26, No. 7, pp. 1604-1613
Received: 2020-08-24; Revised: 2021-01-13; Accepted: 2021-01-20; Published in print: 2021-07-16
DOI: 10.11834/jig.200511
Objective
Stereo vision is a good solution to the object distance estimation problem. Existing stereo-based object distance estimation methods suffer either from low estimation accuracy or from cumbersome data preparation, so an algorithm is needed that balances accuracy with convenience of data preparation.
Method
We propose an R-CNN (region convolutional neural network)-based network that performs object detection and object distance estimation simultaneously. After the stereo image pair is fed into the network, features are extracted by a backbone network, and a stereo region proposal network produces bounding boxes of the same object in the left and right images at the same time; the paired local features inside the object boxes are then fed into an object disparity estimation branch to estimate the object's distance. To obtain the bounding boxes of the same object in both images in one step, the original region proposal network is replaced with a stereo region proposal network, and a stereo bounding-box branch is proposed to regress the left and right bounding boxes simultaneously. To improve disparity estimation accuracy, drawing on the structure of stereo disparity-map estimation networks, a disparity estimation branch based on group-wise correlation and 3D convolution is proposed.
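The conversion from an estimated object disparity to a metric distance follows the standard pinhole stereo model, Z = f·B/d. A minimal sketch of this relation (function name and parameter values are illustrative, not from the paper):

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Convert a horizontal disparity (pixels) to metric depth via Z = f*B/d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example with assumed calibration values: f = 720 px, B = 0.54 m.
# A 72 px disparity then corresponds to a depth of 5.4 m.
z = depth_from_disparity(72.0, 720.0, 0.54)
```

Because depth is the reciprocal of disparity, a fixed disparity error produces a depth error that grows roughly quadratically with distance (dZ/dd = -Z²/(f·B)), which is why accurate object-level disparity estimation matters.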
Result
Validation experiments were conducted on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset. Compared with similar algorithms, the mean relative error of our method is about 3.2%, far below that of the disparity-map-estimation-based algorithm (11.3%) and close to that of the 3D-object-detection-based algorithm (about 3.9%). In addition, the proposed improvement to the disparity estimation branch clearly boosts accuracy, reducing the mean relative error from 5.1% to 3.2%. Similar experiments on a pedestrian surveillance dataset that we additionally collected and annotated yield a mean relative error of about 4.6%, showing that the method can be applied effectively to surveillance scenes.
Conclusion
The proposed stereo object distance estimation network combines the advantages of object detection and stereo disparity estimation and achieves high accuracy. It can be applied effectively to vehicle-mounted cameras and surveillance scenes, and is promising for other settings equipped with stereo cameras.
Objective
Object distance estimation is a fundamental problem in 3D vision. However, most successful object distance estimators need extra 3D information from active depth cameras or laser scanners, which increases the cost. Stereo vision is a convenient and cheap solution to this problem. Modern object distance estimation solutions are mainly based on deep neural networks, which provide better accuracy than traditional methods. Deep learning-based solutions fall into two main types. The first combines a 2D object detector with a stereo disparity estimator: the disparity estimator outputs depth information for the image, and the object detector detects object boxes or masks. The detected boxes or masks are then applied to the depth image; the pixel depths inside each box are sorted, and the closest is selected to represent the distance of the object. However, experiments show that such systems are not accurate enough for this problem. The second type uses a monocular 3D object detector. Such detectors output 3D bounding boxes of objects, which indicate their distance. 3D object detectors are more accurate, but need 3D bounding-box coordinate annotations for training, which require special devices to collect data and entail high labelling costs. Therefore, we need a solution that achieves good accuracy while keeping model training simple.
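The first solution type described above (crop the depth map to the detected box, sort the pixel depths, take the closest) can be sketched as follows; `object_distance_from_depth` and its arguments are hypothetical names for illustration, not the paper's API:

```python
import numpy as np

def object_distance_from_depth(depth: np.ndarray, box: tuple) -> float:
    """Crop a dense depth map to a detected 2D box and take the closest
    valid (positive) pixel depth as the object distance estimate."""
    x1, y1, x2, y2 = box
    crop = depth[y1:y2, x1:x2]
    valid = crop[crop > 0]            # discard invalid / zero-depth pixels
    if valid.size == 0:
        raise ValueError("no valid depth inside the box")
    return float(np.sort(valid.ravel())[0])  # nearest pixel in the box
```

Taking the single closest pixel makes this pipeline sensitive to disparity-map noise and to occluding foreground pixels inside the box, which is consistent with the accuracy limitations noted above.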
Method
We propose a region convolutional neural network (R-CNN)-based network that performs object detection and distance estimation from stereo images simultaneously. The network can be trained using only object distance labels, which makes it easy to apply in fields such as surveillance and robot motion. A stereo region proposal network extracts proposals of corresponding object bounding boxes from the left-view and right-view images in one step. A stereo bounding-box regression module then regresses the corresponding bounding-box coordinates simultaneously. The disparity could be calculated from the corresponding bounding-box x-coordinates, but the distance obtained this way may be inaccurate because of the reciprocal relation between depth and disparity. We therefore propose a disparity estimation branch that estimates object-wise disparity from local object features taken from corresponding areas in the left-view and right-view images. Because this process can be treated as regression, a network structure similar to the stereo bounding-box regression module could be used; however, the disparity estimated by such a branch is still inaccurate. Inspired by disparity-image estimation methods, we adopt a similar structure in this module, constructing the disparity estimation branch with group-wise correlation and a 3D convolutional stacked-hourglass structure.
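Group-wise correlation, in general, splits the feature channels into groups and correlates left and right features at each candidate disparity, producing a 4D cost volume that the 3D convolutions then aggregate. A NumPy sketch of the general technique (shapes, names, and the per-group mean are assumptions; the paper's implementation details may differ):

```python
import numpy as np

def groupwise_correlation(feat_l: np.ndarray, feat_r: np.ndarray,
                          num_groups: int, max_disp: int) -> np.ndarray:
    """Build a group-wise correlation cost volume.
    feat_l, feat_r: [C, H, W] feature maps; C must be divisible by num_groups.
    Returns a volume of shape [num_groups, max_disp, H, W]."""
    C, H, W = feat_l.shape
    assert C % num_groups == 0
    cpg = C // num_groups                       # channels per group
    gl = feat_l.reshape(num_groups, cpg, H, W)
    gr = feat_r.reshape(num_groups, cpg, H, W)
    volume = np.zeros((num_groups, max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (gl * gr).mean(axis=1)
        else:
            # correlate each left pixel with the right pixel d columns to its left
            volume[:, d, :, d:] = (gl[:, :, :, d:] * gr[:, :, :, :-d]).mean(axis=1)
    return volume
```

Compared with full correlation (one similarity score per disparity) or concatenation volumes, the group-wise form keeps several correlation maps per disparity, giving the 3D convolutions richer matching evidence at moderate cost.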
Result
We trained and validated our method on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset to show that our network is accurate for this task. We compare our method with other types of methods, including disparity-image-estimation-based methods and 3D-object-detection-based methods, and provide qualitative results by visualizing distance-estimation errors on the left-view image. Our method outperforms disparity-image-estimation-based methods by a large margin and is comparable with or superior to 3D-object-detection-based methods, which require 3D box annotations. We also compare the different disparity estimation solutions proposed in this paper, showing that the proposed disparity estimation branch yields much more robust object distances and that the 3D convolutional stacked-hourglass structure further improves object-distance estimation accuracy. To prove that our method can be applied to surveillance stereo object distance estimation, we collected and labeled a new dataset of surveillance pedestrian scenes. The dataset contains 3 265 images shot by a stereo camera; all pedestrians in the left-view images are labeled with their bounding box as well as the pixel positions of their head and feet, which helps recover pedestrian distance from the disparity image. Similar experiments on this dataset prove that our method applies to surveillance scenes effectively and accurately. As this dataset contains no 3D bounding-box annotations, 3D-object-detection-based methods cannot be applied in this scenario.
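The mean relative error used in these comparisons can be computed as a minimal sketch below (the exact evaluation protocol, e.g. matching of detections to ground truth, is not specified in the abstract):

```python
import numpy as np

def mean_relative_error(pred, gt):
    """Mean of |pred - gt| / gt over matched object distances."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt) / gt))

# Illustrative values only: predictions 10% off in either direction
# give a mean relative error of 0.1 (i.e. 10%).
err = mean_relative_error([11.0, 9.0], [10.0, 10.0])
```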
Conclusion
In this study, we propose an R-CNN-based network that performs object detection and distance estimation simultaneously from stereo images. The experimental results show that our model is accurate and is easy to train and to apply to other fields.