Object distance estimation based on stereo regional disparity regression

Zhang Yufeng1, Li Yuxi1, Zhao Mingbi2, Yu Xiaoyuan2, Zhan Yunlong3, Lin Weiyao1 (1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 201100, China; 2. Huawei Cloud, Huawei Technologies Co., Ltd., Hangzhou 310051, China; 3. Shenzhen HiSilicon Semiconductor Co., Ltd., Shenzhen 518116, China)

Abstract
Objective Stereo vision is a good solution to the object distance estimation problem. Existing stereo object distance estimation methods suffer from either low estimation accuracy or cumbersome data preparation, so an algorithm is needed that balances accuracy with convenient data preparation. Method We propose a network based on the R-CNN (region convolutional neural network) architecture that performs object detection and object distance estimation simultaneously. After the stereo image pair is fed into the network, a backbone extracts features, a stereo region proposal network produces bounding boxes of the same object in the left and right images at the same time, and the local features inside each pair of object boxes are fed into an object disparity estimation branch to estimate the object's distance. To obtain the bounding boxes of the same object in both views at once, the stereo region proposal network replaces the original region proposal network, and a stereo bounding-box branch is proposed to regress the left and right bounding boxes jointly. To improve disparity estimation accuracy, we draw on the architecture of stereo disparity-map estimation networks and propose a disparity estimation branch based on group-wise correlation and 3D convolution. Result In validation experiments on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset, compared with similar algorithms, our method achieves a mean relative error of about 3.2%, far lower than that of a disparity-map estimation-based method (11.3%) and close to that of a 3D object detection-based method (about 3.9%). In addition, the proposed improvement to the disparity estimation branch clearly boosts accuracy, reducing the mean relative error from 5.1% to 3.2%. Similar experiments on a separately collected and annotated pedestrian surveillance dataset yield a mean relative error of about 4.6%, showing that our method can be effectively applied to surveillance scenes. Conclusion The proposed stereo object distance estimation network combines the advantages of object detection and stereo disparity estimation and achieves high accuracy. It can be effectively applied to vehicle-mounted cameras and surveillance scenes, and is promising for other settings equipped with stereo cameras.
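A short sketch may help make the depth-disparity relation behind the method concrete: with focal length f (in pixels) and stereo baseline B (in meters), depth is Z = f·B/d for disparity d (in pixels). The focal length and baseline values below are illustrative, roughly KITTI-like, and not taken from the paper.

```python
# Depth from stereo disparity: Z = f * B / d.
# f: focal length in pixels, B: stereo baseline in meters,
# d: disparity in pixels. The default values are illustrative only.

def depth_from_disparity(d, f=721.0, B=0.54):
    """Recover depth in meters from a disparity in pixels."""
    return f * B / d

# Because depth is inversely proportional to disparity, the same
# 1-pixel disparity error costs far more depth accuracy far away:
print(depth_from_disparity(49.0) - depth_from_disparity(50.0))  # ~0.16 m
print(depth_from_disparity(4.0) - depth_from_disparity(5.0))    # ~19.5 m
```

This error amplification at small disparities is why regressing object disparity with a dedicated branch, rather than relying on raw box coordinates, matters for distant objects.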
Keywords
Object distance estimation based on stereo regional disparity regression

Zhang Yufeng1, Li Yuxi1, Zhao Mingbi2, Yu Xiaoyuan2, Zhan Yunlong3, Lin Weiyao1 (1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 201100, China; 2. Huawei Cloud, Huawei Technologies Co., Ltd., Hangzhou 310051, China; 3. HiSilicon Technologies Co., Ltd., Shenzhen 518116, China)

Abstract
Objective Object distance estimation is a fundamental problem in 3D vision. However, most successful object distance estimators need extra 3D information from active depth cameras or laser scanners, which increases cost. Stereo vision is a convenient and inexpensive solution to this problem. Modern object distance estimation solutions are mainly based on deep neural networks, which provide better accuracy than traditional methods. Deep learning-based solutions fall into two main types. The first combines a 2D object detector with a stereo disparity estimator: the disparity estimator outputs a depth map of the image, and the detector outputs object boxes or masks; the pixel depths inside each detected box or mask are then extracted from the depth map and sorted, and the closest depth is selected to represent the distance of the object. However, according to our experiments, such systems are not accurate enough. The second solution uses a monocular 3D object detector. Such detectors output 3D bounding boxes of objects, which indicate their distances. 3D object detectors are more accurate, but they need 3D bounding-box annotations for training, which require special devices for data collection and entail high labeling costs. Therefore, a solution is needed that offers good accuracy while keeping model training simple. Method We propose a region convolutional neural network (R-CNN)-based network that performs object detection and distance estimation from stereo images simultaneously. The network can be trained using only object distance labels, which makes it easy to apply in fields such as surveillance and robot motion. We use a stereo region proposal network to extract corresponding proposals for the same object from the left-view and right-view images in one step.
Then, a stereo bounding-box regression module regresses the corresponding bounding-box coordinates simultaneously. The object disparity could be computed directly from the x-axis difference between corresponding bounding-box coordinates, but the distance recovered this way may be inaccurate because of the reciprocal relation between depth and disparity, which amplifies small disparity errors for distant objects. We therefore propose a disparity estimation branch that estimates object-wise disparity from the local features of corresponding regions in the left-view and right-view images. Since this can be treated as a regression task, a structure similar to the stereo bounding-box regression module could be used, but disparity estimated in this way is still not accurate enough. Inspired by disparity-image estimation networks, we instead build this branch on group-wise correlation and a 3D convolutional stacked-hourglass structure. Result We train and validate our method on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset to show that it is accurate for this task. We compare our method with both disparity-image estimation-based methods and 3D object detection-based methods, and provide qualitative results by visualizing distance-estimation errors on the left-view image. Our method outperforms disparity-image estimation-based methods by a large margin and is comparable with or superior to 3D object detection-based methods, which require 3D box annotations.
In addition, we compare the different disparity estimation solutions proposed in this paper, showing that the proposed disparity estimation branch yields much more robust object distances and that the 3D convolutional stacked-hourglass structure further improves distance estimation accuracy. To prove that our method can be applied to stereo object distance estimation in surveillance, we collected and labeled a new dataset of surveillance pedestrian scenes. The dataset contains 3 265 images captured by a stereo camera; every pedestrian in the left-view images is labeled with a bounding box as well as the pixel positions of the head and feet, which helps recover the pedestrian distance from the disparity image. Similar experiments on this dataset show that our method applies to surveillance scenes effectively and accurately. As this dataset contains no 3D bounding-box annotations, 3D object detection-based methods cannot be used in this scenario. Conclusion In this study, we propose an R-CNN-based network that performs object detection and distance estimation simultaneously from stereo images. The experimental results show that our model is accurate, easy to train, and readily applicable to other fields.
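The group-wise correlation used in the disparity estimation branch can be illustrated with a much-simplified sketch: 1D features in pure Python, whereas the actual branch operates on 2D convolutional feature maps and passes the resulting cost volume to a 3D stacked-hourglass network. All names, shapes, and values here are assumptions for illustration only.

```python
# Much-simplified group-wise correlation over 1D features.
# left, right: C channels of width W (plain nested lists).
# The C channels are split into num_groups groups; for each group g,
# candidate disparity d, and position x, the cost is the mean product
# of left features at x and right features at x - d.

def groupwise_correlation(left, right, num_groups, max_disp):
    C, W = len(left), len(left[0])
    assert C % num_groups == 0
    per_group = C // num_groups
    # cost volume indexed as cost[group][disparity][position]
    cost = [[[0.0] * W for _ in range(max_disp)] for _ in range(num_groups)]
    for g in range(num_groups):
        chans = range(g * per_group, (g + 1) * per_group)
        for d in range(max_disp):
            for x in range(d, W):  # x - d must stay inside the image
                s = sum(left[c][x] * right[c][x - d] for c in chans)
                cost[g][d][x] = s / per_group
    return cost

# Toy input: 4 channels, width 6; the right view is the left view
# shifted 2 pixels (true disparity = 2), zero-padded on the right.
left = [[float(x + c) for x in range(6)] for c in range(4)]
right = [[left[c][x + 2] if x + 2 < 6 else 0.0 for x in range(6)]
         for c in range(4)]

cost = groupwise_correlation(left, right, num_groups=2, max_disp=4)
print(len(cost), len(cost[0]), len(cost[0][0]))  # 2 4 6
```

In the full model, a cost volume of this kind (groups × disparities × spatial positions) is what the 3D convolutional stacked-hourglass network refines before the final disparity regression.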
Keywords
