
Published: 2021-12-16
DOI: 10.11834/jig.200109
2021 | Volume 26 | Number 12




Image Understanding and Computer Vision





Roadside pedestrian detection and location based on binocular machine vision and RetinaNet
Lian Lirong1, Luo Wenting1, Qin Yong2, Li Lin1
1. School of Traffic and Civil Engineering, Fujian Agriculture and Forestry University, Fuzhou 350000, China;
2. State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100084, China


Supported by: National Key R & D Program of China (2018YFB1201601)

Abstract

Objective Deep learning is widely used in computer vision, and convolutional neural network (CNN) based feature extraction underpins target recognition for driverless vehicles. However, the road traffic environment is complex and changeable, making obstacle detection difficult under real traffic conditions, and the variability of pedestrians makes pedestrian detection especially prominent among road obstacle detection problems. Most current pedestrian recognition models are trained and tested on simple backgrounds, and little research addresses the recognition of pedestrian targets in complex real road scenes. With the development of binocular stereo vision, image parallax has been applied to target ranging: image pairs are captured by binocular stereo cameras, disparity values between the left and right images are computed by stereo matching algorithms, depth maps are derived from the disparity maps, and road obstacles are finally detected, easing the difficulty of extracting, matching, and tracking image-sequence feature points and of reconstructing projected scenes. A further class of algorithms extracts obstacle coordinate information from U-V histograms by counting disparity values in the U and V directions: computing the U-V parallax images converts the two-dimensional plane information of the original image into line-segment information in the U and V directions, and line-extraction methods such as least squares and the Hough transform then extract road- and obstacle-related segments. Such methods are computationally simple and suited to real-time use but are strongly affected by noise in complex environments. We therefore propose a method combining deep learning with a modified U-V parallax algorithm to detect road pedestrians (both recognition and localization) and thereby improve the driving safety of vehicles on the road. Method A binocular intelligent road perception system collects road pedestrian foreground images, and a training dataset is built from data gathered on four types of roadway. The RetinaNet model performs pedestrian recognition: a deep residual network (ResNet) serves as the feature extraction network, and a feature pyramid network (FPN) forms multi-scale features to strengthen the feature network containing multi-scale target information; on these features, two fully convolutional network (FCN) subnetworks with the same structure but different parameters carry out target-box classification and bounding-box position regression. In the training phase, the pedestrian data library feeds the RetinaNet network for training and testing. Based on trials, the batch size is set to 24 and the learning rate to 0.000 1, and training runs for 100 epochs. In each round, 400 samples randomly chosen from the training set serve as validation data to test model performance; the iteration loss is recorded per epoch, and the model corresponding to the minimum loss is selected as the pedestrian recognition model.
Horizontal gradient filtering is applied to the left image, and the Birchfield and Tomasi (BT) costs of the left and right images are then calculated and fused; traversing pixel by pixel, the current cost is replaced by the sum of the costs in the area around the pixel. The cost values are optimized with the semi-global matching (SGM) cost-aggregation algorithm, and the disparity corresponding to the lowest matching error is selected by winner-takes-all (WTA) to compute the image disparity. False parallax values are eliminated by confidence detection, parallax holes are filled by sub-pixel interpolation, and a left-right consistency check removes parallax errors caused by occlusion. Because interference from the complex road environment leaves the disparity map noisy, median filtering first performs preliminary denoising to obtain a better disparity map, and the parallax statistical range is narrowed to the inside of the bounding box to remove as much irrelevant parallax interference as possible. Next, all parallax values within the target pedestrian's rectangular bounding box are traversed to find the maximum, which then replaces all other parallax values in the box; the numbers of disparities in the U and V directions are re-counted on the improved disparity map, and the coordinate positions of pedestrians are finally obtained. The improved U-V parallax algorithm fills the parallax holes inside the bounding box and replaces noisy parallax with the maximum parallax value, improving the accuracy of pedestrian positioning. Result Compared with manual counts on a 2 500 m continuous test section, the recall of the self-trained RetinaNet model is 96.27%. Compared with the you only look once v3 (YOLOv3) and Tiny-YOLOv3 methods under four traffic conditions, the average F-value reaches 96.42%, 0.9% higher than YOLOv3 and 3.03% higher than Tiny-YOLOv3. To verify the distance measurement algorithm, a calibration block was photographed in 20 pairs of binocular images at distances of 3 m, 4 m, and 5 m in the laboratory; the calculated standard deviations are all below 0.01. Conclusion In this study, a RetinaNet model combined with an improved U-V parallax algorithm is proposed to identify and position pedestrians. It detects pedestrians effectively in traffic environments and is significant for the safety of driverless vehicles.

Key words

pedestrian detection; deep learning; RetinaNet; semi-global block matching (SGBM) algorithm; U-V parallax algorithm

0 Introduction

Road traffic environments are complex and changeable, and detecting obstacles such as other vehicles, road traffic facilities, and pedestrians while driving is a key difficulty for autonomous driving. Because pedestrians in traffic are highly variable, pedestrian detection has become a prominent problem in road obstacle detection.

In computer vision, obstacle recognition belongs to moving-image analysis. With the spread of binocular stereo cameras, the standard pipeline of current obstacle detection algorithms has become: capture image pairs with a binocular stereo camera, compute disparity values with a stereo matching algorithm (Wang et al., 2020b), derive a depth map from the disparity map, and finally detect obstacles on the depth map (Wang et al., 2007). However, extracting, matching, and tracking feature points across image sequences and reconstructing projected scenes remain relatively difficult. In recent years, studies have proposed counting disparity values in the U and V directions and extracting obstacle information from U-V histograms, which to some extent overcomes the limitations of detecting obstacles directly on disparity images (Shrivastava, 2019). This approach computes U-V parallax images to convert the two-dimensional plane information of the original image into line-segment information in the U and V directions, where segment length is inversely proportional to target distance; line-extraction methods such as least squares (Liu et al., 2020) and the Hough transform (Hou et al., 2021) then extract road- and obstacle-related lines to detect obstacles. The computation is simple and suitable for real-time use, but noise has a large influence in complex environments.

Pedestrians are obstacles with unique characteristics in vehicle driving. Pedestrian detection has broad application prospects in human behavior analysis, intelligent video surveillance, advanced driving assistance systems (ADAS), and intelligent transportation (Wang et al., 2020a), and has become a frontier topic in computer vision. With the wide application of deep learning in computer vision, researchers keep seeking better-performing detectors and proposing new models in pursuit of more accurate and faster detection. Deep-learning pedestrian detection methods fall mainly into two-stage, region-proposal-based methods (Wang et al., 2020c), such as the RCNN (region convolutional neural networks) series (Howal et al., 2019) and FPN (feature pyramid networks) (He et al., 2020), and one-stage, single-shot detectors, such as the YOLO (you only look once) series (Ko et al., 2020), SSD (single shot multibox detector) (Xu et al., 2020), and RetinaNet (Lin et al., 2017); these methods have respective advantages in detection accuracy and speed. To address insufficient pedestrian feature extraction, Xie et al. (2019) sampled the original image at two different ratios to form an image-pyramid sequence and concatenated the two pyramid levels, keeping the original image size, to extract more pedestrian features. Zeng et al. (2019) used a deep residual network to upsample feature maps and fused high-level and low-level features as input to the region proposal network for feature extraction.

Combining a semi-global matching (SGM) algorithm based on fast image segmentation with a Fast-YOLO network, Yang et al. (2018) detected and localized pedestrians simultaneously through projective transformation, largely addressing the low matching quality of traditional SGM and Fast-YOLO's weak detection of small and adjacent targets; however, image segmentation still demands heavy computation, and the method targets only pedestrians against simple backgrounds. Stereo matching based on binocular vision can measure pedestrian distance, but it is strongly affected by noise and struggles to obtain pedestrian coordinates in the 2D image; methods based on deep learning networks have great advantages in extracting pedestrian features from images but cannot obtain pedestrians' 3D coordinate information and thus cannot measure their distance.

To address these problems, this paper proposes a road pedestrian detection algorithm based on the RetinaNet model and U-V disparity statistics. Using real traffic-scene image data, the RetinaNet network and the semi-global block matching (SGBM) algorithm are combined, and U-V disparity values are counted, to detect pedestrians in road traffic environments. First, the RetinaNet architecture is used to train a pedestrian recognition model; precision, recall, and the comprehensive F-value serve as evaluation criteria, recognition performance is assessed in four different road traffic environments, and the best model is selected. Second, the SGBM algorithm computes the disparity values of the image pairs. Then, combined with the recognition results, only the pedestrian disparities inside the detection bounding box (Bbox) are retained and irrelevant interfering disparities are removed, improving the disparity map. Finally, U-V disparities are counted on the improved disparity map to obtain the target pedestrian's position coordinates and complete the detection of road pedestrians.

1 Data collection

1.1 Data collection equipment based on binocular machine vision

Compared with LiDAR ranging, binocular photogrammetry has lower equipment cost and simpler, more accurate data matching. To collect high-definition road foreground images and precisely locate and recognize targets, we developed a binocular intelligent road perception system. The device consists of a binocular camera, a GPS (global positioning system) receiver, and an inertial measurement unit, as shown in Fig. 1(a). It mounts on the vehicle roof with strong suction cups; the inertial measurement unit automatically measures the attitude angles of the binocular camera about its three axes, the GPS provides sub-meter positioning, and image capture frequency is controlled by distance traveled. The acquisition interface is shown in Fig. 1(b); binocular vision enables real-time display of depth images of the road foreground.

Fig. 1 Portable intelligent inspection equipment ((a) equipment appearance; (b) system acquisition interface)

1.2 Construction of the pedestrian recognition model training library

To build the pedestrian recognition model training library, data were collected on urban roads and suburban highways over a total of 20 km. After preliminary screening to remove invalid images (blurred images or images without pedestrians), 14 500 valid images remained, each with a resolution of 2 208 × 1 242 pixels. The data cover four road-section types, detailed in Table 1. Samples were manually annotated to generate pedestrian labels, as shown in Fig. 2.

Table 1 The information of collected data

Road section type | Quantity | Images | GPS coordinates (start/end)
Urban mixed motor/non-motorized road | 4 km | 2 000 | (119.202, 26.078)/(119.156, 26.055)
Urban separated motor/non-motorized road | 4 km | 2 000 | (119.196, 26.055)/(119.199, 26.078)
Suburban highway | 10 km | 10 000 | (119.191, 26.057)/(119.180, 26.068)
Urban intersection | 5 sites | 500 | -
Fig. 2 Manually labeled data ((a) mixed motor/non-motorized road; (b) separated motor/non-motorized road; (c) suburban highway; (d) urban intersection)

2 Automatic pedestrian recognition algorithm

2.1 The RetinaNet model

Applying pedestrian recognition and localization to autonomous driving requires a deep learning model with both high accuracy and high detection efficiency. RetinaNet, YOLOv3, and Tiny-YOLOv3 are all one-stage detectors; the large proportion of background examples lowers the accuracy of the latter two, whereas RetinaNet's focal loss greatly down-weights the background class and increases the weight of the target classes, effectively solving the class-imbalance problem and achieving a good balance between recognition accuracy and speed. RetinaNet is therefore adopted as the training network for pedestrian recognition. Fig. 3 shows the RetinaNet network structure: a deep residual network (ResNet) serves as the feature extraction network, a feature pyramid network (FPN) forms multi-scale features to strengthen the feature network containing multi-scale target information, and two fully convolutional network (FCN) subnetworks with the same structure but different parameters operate on the feature network to perform target-box classification and Bbox position regression, with the focal loss adopted to improve the model's recognition accuracy (Lin et al., 2017). Here $\oplus$ denotes feature-information fusion, and $W$ and $H$ denote the width and height of the feature map, respectively.

Fig. 3 RetinaNet network structure
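
The class-imbalance mechanism described above can be made concrete with a minimal NumPy sketch of the binary focal loss; the $\alpha$ and $\gamma$ values are the defaults reported by Lin et al. (2017), and the probabilities in the usage example are hypothetical.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy (well-classified) examples
    so that abundant background samples do not dominate training."""
    p_t = np.where(y == 1, p, 1.0 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -np.mean(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12))

# An easy background sample (p = 0.01) contributes almost nothing, while a
# hard one (p = 0.9) dominates the loss:
print(focal_loss(np.array([0.01, 0.9]), np.array([0, 0])))
```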

2.2 Model training

A pedestrian sample database was built and divided into a training set and a validation set, which were fed to the neural network for recognition-model training and validation. Based on trials, the RetinaNet parameters were set as follows: 100 epochs, a batch size of 24 per round, and a learning rate of 0.000 1. The iteration loss of each epoch was recorded; it reached its minimum at epoch 79, so the weights from epoch 79 were chosen to build the pedestrian recognition model. YOLOv3 and Tiny-YOLOv3 were configured identically and reached their best loss at epochs 85 and 91, respectively.
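
As a hypothetical sketch of this selection procedure (the settings are those quoted above; the logged loss values and weight-file naming are invented for illustration):

```python
# Training settings reported above; passed to the trainer (not shown).
config = {"epochs": 100, "batch_size": 24, "learning_rate": 1e-4}

# Hypothetical per-epoch validation losses recorded during training.
loss_per_epoch = {78: 0.231, 79: 0.214, 80: 0.227}

# Keep the weights from the epoch with the minimum loss (epoch 79 here).
best_epoch = min(loss_per_epoch, key=loss_per_epoch.get)
print(f"selected weights: retinanet_epoch_{best_epoch}.h5")
```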

3 Automatic pedestrian localization algorithm

3.1 Binocular stereo matching based on the SGBM algorithm

A binocular stereo camera captures foreground images along the road, acquiring a left and a right image simultaneously. Because of the baseline between the left and right lenses, the same object appears at slightly different pixel coordinates in the two images, from which the pixel disparity between them can be computed. The left and right images are first stereo-matched: this paper uses the SGBM algorithm to match the binocular image pairs and compute the disparity values, as shown in Fig. 4. The steps are as follows:

Fig. 4 U-V parallax statistical process with stereo matching and recognition model

1) Apply horizontal gradient filtering to the left image, then compute the BT (Birchfield and Tomasi) cost between the left and right images (Birchfield and Tomasi, 1999):

$ \gamma(\boldsymbol{M})=\boldsymbol{N}_{\mathrm{OCC}} k_{\mathrm{occ}}-\boldsymbol{N}_{m} k_{r}+\sum\limits_{i=1}^{N_{m}} d\left(x_{i}, y_{i}\right) $ (1)

where $\gamma(\boldsymbol{M})$ denotes the cost of the match sequence $\boldsymbol{M}$, $k_{\mathrm{occ}}$ is the constant occlusion penalty for unmatched points, $k_r$ is the reward for matched points, $\boldsymbol{N}_{\mathrm{OCC}}$ and $\boldsymbol{N}_m$ are the numbers of unmatched and matched points, respectively, and $d(x_i, y_i)$ is the disparity between pixels.

The costs of the left and right images are then fused by block operations: traversing pixel by pixel, the current cost is replaced by the sum of the costs in the region around the pixel.

2) Optimize the cost values with the semi-global matching (SGM) cost-aggregation algorithm (Malekabadi et al., 2019):

$ \begin{aligned} &E(\boldsymbol{D})=\sum\limits_{P}\left(C\left(p, \boldsymbol{D}_{p}\right)+\right. \\ &\sum\limits_{q \in \boldsymbol{N}_{p}} P_{1} T\left[\left|\boldsymbol{D}_{p}-\boldsymbol{D}_{q}\right|=1\right]+ \\ &\sum\limits_{q \in \boldsymbol{N}_{p}} \left.P_{2} T\left[\left|\boldsymbol{D}_{p}-\boldsymbol{D}_{q}\right|>1\right]\right) \end{aligned} $ (2)

where $\boldsymbol{D}$ is the disparity map; $E(\boldsymbol{D})$ is the energy function of that disparity map; $\boldsymbol{N}_p$ is the set of pixels neighboring pixel $p$ (usually 8-connected); $C(p, \boldsymbol{D}_p)$ is the pixel's cost; $P_1$ is the penalty for neighboring pixels whose disparity differs by exactly 1; $P_2$ is the penalty for neighboring pixels whose disparity differs by more than 1; and $T[\cdot]$ returns 1 if its argument is true and 0 otherwise.

The disparity corresponding to the lowest matching error is then selected by winner-takes-all (WTA) (Shi et al., 2016) to compute the image disparity.

3) Remove erroneous disparities by confidence detection, fill disparity holes by sub-pixel interpolation, and finally apply a left-right consistency check to eliminate disparity errors caused by occlusions.

4) Because the complex road environment introduces noise into the disparity map, apply median filtering (Xing et al., 2019) for preliminary denoising to obtain a cleaner disparity map:

$ D_{i j}=\underset{\boldsymbol{A}_{5 \times 5}}{{Med}}\left\{d_{i j}\right\} $ (3)

where $D_{ij}$ is the filtered disparity value, $\boldsymbol{A}$ is the $5 \times 5$ filter window, $Med\{\cdot\}$ is the median filter function, and $d_{ij}$ is the disparity value at row $i$, column $j$.
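
Steps 1)-4) correspond closely to the semi-global block matching pipeline implemented by OpenCV's StereoSGBM, which bundles the BT cost, block-based aggregation, SGM optimization, WTA selection, the uniqueness (confidence) check, and left-right consistency. The sketch below is a minimal illustration under assumed file names and parameter values, not the exact configuration used in this paper.

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # assumed file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

block = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,        # search range; must be a multiple of 16
    blockSize=block,
    P1=8 * block * block,      # penalty when |D_p - D_q| = 1 (Eq. (2))
    P2=32 * block * block,     # penalty when |D_p - D_q| > 1
    disp12MaxDiff=1,           # left-right consistency tolerance
    uniquenessRatio=10,        # confidence check on the WTA winner
    speckleWindowSize=100,
    speckleRange=2,
)

# compute() returns fixed-point disparities scaled by 16.
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0

# Step 4): 5 x 5 median filtering for preliminary denoising (Eq. (3)).
disparity = cv2.medianBlur(disparity, 5)
```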

3.2 Disparity statistics of binocular images based on the U-V algorithm

Binocular vision measures target distance by computing the disparity of the left-right image pair. This paper adopts the U-V parallax algorithm (Benacer et al., 2015) to count the disparity values of the images in the U and V directions, i.e., to count occurrences of each disparity value by column and by row of the disparity map. By the properties of U-V disparity maps, a pedestrian standing vertically on the road projects approximately as a vertical line in the V-disparity map and as a horizontal line in the U-disparity map, as shown in Fig. 4. Counting disparities in the U and V directions yields the U-disparity map $\boldsymbol{U}_{D_{\max}, v}$ and the V-disparity map $\boldsymbol{V}_{u, D_{\max}}$:

$ \boldsymbol{D}_{\max }=\max \left(\boldsymbol{D}_{u, v}\left(d_{i j}\right)\right), 0<i<u, 0<j<v $ (4)

$ \boldsymbol{U}_{D_{\max }, v}=\left(\begin{array}{ccc} \cdots & \cdots & \cdots \\ \vdots & u_{j, d} & \vdots \\ \ldots & \ldots & \ldots \end{array}\right), 0<j<v $ (5)

$ \boldsymbol{V}_{u, D_{\max }}=\left(\begin{array}{ccc} \cdots & \cdots & \cdots \\ \vdots & v_{i, d} & \vdots \\ \ldots & \ldots & \ldots \end{array}\right), 0<i<u $ (6)

where $\boldsymbol{D}_{u,v}(d_{ij})$ is the disparity map with $u$ rows and $v$ columns, $d_{ij}$ is a disparity value, $u_{j,d}$ is the number of pixels in column $j$ with disparity $d$, and $v_{i,d}$ is the number of pixels in row $i$ with disparity $d$.
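
A NumPy sketch of Eqs. (4)-(6), counting disparity occurrences by column and by row; the function name and the clamping of invalid disparities are illustrative choices.

```python
import numpy as np

def uv_disparity(disp, d_max=None):
    """U-/V-disparity maps of an integer-valued disparity map.
    U[d, j]: pixels in column j with disparity d (Eq. (5));
    V[i, d]: pixels in row i with disparity d (Eq. (6))."""
    d = np.round(disp).astype(int)
    d[d < 0] = 0                       # clamp invalid (negative) disparities
    levels = (d.max() if d_max is None else d_max) + 1
    rows, cols = d.shape
    U = np.zeros((levels, cols), dtype=int)
    V = np.zeros((rows, levels), dtype=int)
    for j in range(cols):              # column-wise histogram -> U-disparity
        U[:, j] = np.bincount(d[:, j], minlength=levels)[:levels]
    for i in range(rows):              # row-wise histogram -> V-disparity
        V[i, :] = np.bincount(d[i, :], minlength=levels)[:levels]
    return U, V
```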

3.3 Improvement of the U-V parallax algorithm

When RetinaNet identifies pedestrians, the positions of the detection bounding boxes are not always precise, which reduces the number of valid disparities obtained and directly affects the accuracy of target pedestrian localization, as shown in Fig. 5(a). Meanwhile, the SGBM stereo matching algorithm is still not sufficiently robust in complex environments; although preliminary denoising is applied, the complex and changeable traffic environment leaves residual noise that interferes with the subsequent U-V disparity statistics. As shown in Fig. 5(b), narrowing the disparity-statistics range to the target pedestrian's detection bounding box according to the recognition results removes most irrelevant disparity interference, but disparity holes and noisy disparities at non-pedestrian positions remain inside the box. An improved U-V parallax algorithm operating within the pedestrian detection bounding box is therefore proposed. Over 1 000 test images, the disparity coverage rate inside the rectangular bounding boxes is 53.7% before the improvement and 100% after, showing that the improvement fills the disparity holes in the boxes.

Fig. 5 The results of the original and improved U-V algorithm ((a) detection result; (b) before improvement; (c) after improvement)

The coverage rate is computed as

$ d_{p}=\frac{S_{\text {parallax }}}{S_{\text {bbox }}} $ (7)

where $d_p$ is the disparity coverage rate, $S_{\text{parallax}}$ is the area covered by valid disparities within the rectangular bounding box, and $S_{\text{bbox}}$ is the area of the rectangular bounding box. The algorithm proceeds as follows:

1) Traverse all disparity values within the target pedestrian's rectangular bounding box and find the maximum disparity value $d_{\max}$:

$ \begin{aligned} &d_{\max }=\max \left(\Delta_{n, m}\left(\delta_{i j}\right)\right) \\ &i \in\left(y_{1}, y_{2}\right), j \in\left(x_{1}, x_{2}\right) \end{aligned} $ (8)

where $d_{ij}$, $n$, $m$, and $\Delta_{n,m}$ are the quantities used to compute $d_{\max}$, defined as

$ \delta_{i j}=d_{i j}, i \in\left(y_{1}, y_{2}\right), j \in\left(x_{1}, x_{2}\right) $ (9)

$ n =y_{2}-y_{1}+1 $ (10)

$ m =x_{2}-x_{1}+1 $ (11)

$ \Delta_{n m} =\left(\begin{array}{ccc} \delta_{y_{1} x_{1}} & \cdots & \delta_{y_{1} x_{2}} \\ \vdots & \ddots & \vdots \\ \delta_{y_{2} x_{1}} & \cdots & \delta_{y_{2} x_{2}} \end{array}\right) $ (12)

where $(x_1, y_1)$ and $(x_2, y_2)$ are the image coordinates of the top-left and bottom-right corners of the target pedestrian's rectangular bounding box, $d_{ij}$ is the disparity value at row $i$, column $j$ of the original disparity map, and $\Delta_{n,m}$ is the matrix of disparity values with the same size as the bounding box.

2) Replace all other disparity values inside the bounding box with the maximum disparity value $d_{\max}$ to generate the improved disparity map, then re-count the disparities in the U and V directions. The resulting U-V histograms are shown in Fig. 5(c); compared with the pre-improvement result in Fig. 5(b), the improved U-V parallax algorithm fills the disparity holes within the bounding box and replaces noisy disparities with the maximum disparity value, improving the accuracy of pedestrian localization.
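
A minimal sketch of steps 1) and 2) (Eqs. (8)-(12)), assuming bounding boxes given as (x1, y1, x2, y2) pixel corners:

```python
import numpy as np

def improve_disparity(disp, bboxes):
    """Inside every detected pedestrian bounding box, find the maximum
    disparity d_max and overwrite the whole box with it, filling holes
    and suppressing noisy disparities before re-counting U-V histograms."""
    out = disp.copy()
    for x1, y1, x2, y2 in bboxes:
        patch = disp[y1:y2 + 1, x1:x2 + 1]        # the n x m matrix Delta
        out[y1:y2 + 1, x1:x2 + 1] = patch.max()   # d_max replaces the box
    return out

# Usage: improved = improve_disparity(disparity, bboxes), then recompute
# the U-V histograms on `improved`, e.g. with uv_disparity(improved).
```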

3.4 Obtaining the world coordinates of the target pedestrian

Once the pedestrian's coordinates in the image are known, the pedestrian's position in world coordinates can be computed from the disparity. The steps are as follows:

1) Obtain the image coordinates of the pedestrian detection bounding box and define the pedestrian's position in the image (see Fig. 5(a)):

$ x_{\mathrm{L}}=\operatorname{int}\left(\frac{x_{1}+x_{2}}{2}\right) $ (13)

$ y_{\mathrm{L}}=y_{2} $ (14)

where $(x_{\mathrm{L}}, y_{\mathrm{L}})$ are the pedestrian's localization coordinates in the image.

2) From the V-disparity map, obtain the pedestrian's disparity $d_y$ in the $y$ direction at row $y_{\mathrm{L}}$; likewise, from the U-disparity map, obtain the disparity $d_x$ in the $x$ direction at column $x_{\mathrm{L}}$. Average $d_y$ and $d_x$ to obtain the target pedestrian's localization disparity $d_{\mathrm{p}}$:

$ d_{\mathrm{p}}=\operatorname{int}\left(\frac{d_{x}+d_{y}}{2}\right) $ (15)

3) Compute the target pedestrian's world coordinates from the distance $Z_{\mathrm{W}}$ between the target pedestrian and the camera imaging plane (Marin et al., 2019):

$ Z_{\mathrm{W}} =\frac{b \times f}{d_{\mathrm{p}}} $ (16)

$ X_{\mathrm{W}} =Z_{\mathrm{W}} \frac{x_{\mathrm{L}}}{f} $ (17)

$ Y_{\mathrm{W}} =Z_{\mathrm{W}} \frac{y_{\mathrm{L}}}{f} $ (18)

where $b$ is the baseline of the binocular camera, $f$ is the focal length, $Z_{\mathrm{W}}$ is the distance from the target pedestrian to the camera imaging plane, and $(X_{\mathrm{W}}, Y_{\mathrm{W}}, Z_{\mathrm{W}})$ are the world coordinates of the target pedestrian.
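
The localization chain of Eqs. (13)-(18) can be sketched as follows. How $d_x$ and $d_y$ are read from the U- and V-disparity maps is not fully specified above, so taking the dominant (most frequent) disparity of column $x_{\mathrm{L}}$ and row $y_{\mathrm{L}}$ is an assumption here, as are the calibration inputs.

```python
import numpy as np

def pedestrian_world_coords(bbox, U, V, baseline, focal):
    """bbox = (x1, y1, x2, y2) in pixels; U, V are U-/V-disparity maps;
    baseline is b in metres and focal is f in pixels (from calibration)."""
    x1, y1, x2, y2 = bbox
    x_l = int((x1 + x2) / 2)         # Eq. (13): horizontal centre of the box
    y_l = y2                         # Eq. (14): bottom edge of the box
    d_x = int(np.argmax(U[:, x_l]))  # assumed: dominant disparity of column x_L
    d_y = int(np.argmax(V[y_l, :]))  # assumed: dominant disparity of row y_L
    d_p = int((d_x + d_y) / 2)       # Eq. (15): localization disparity
    if d_p <= 0:
        raise ValueError("no valid disparity for this pedestrian")
    z_w = baseline * focal / d_p     # Eq. (16): depth from disparity
    x_w = z_w * x_l / focal          # Eq. (17)
    y_w = z_w * y_l / focal          # Eq. (18)
    return x_w, y_w, z_w
```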

4 Experimental results and discussion

4.1 Test section

The test section comprises the connecting road from Guobin Avenue to Qishan Avenue in the university town of Fuzhou and the Jianping Road intersection, 2.5 km in total, covering three traffic conditions: mixed motor/non-motorized traffic, separated motor/non-motorized traffic, and intersections. Images were captured every 2.5 m, for 1 000 images in total. The test data are used to evaluate the effectiveness of the recognition model and the localization algorithm; the test route is shown in Fig. 6.

Fig. 6 Experimental section

4.2 Automatic pedestrian recognition results

4.2.1 Statistics of pedestrian recognition results based on the RetinaNet model

Fig. 7 shows the statistics of the RetinaNet model's pedestrian recognition in the test traffic scenes. Because the test images were captured continuously, some contain no target pedestrians, i.e., the recognized pedestrian count is 0. The statistics show that pedestrians are numerous along the test section, peaking at the 825-875 m mark, which is an intersection segment. Compared with manual counts, the recall is 96.27%, indicating good recognition performance under real traffic conditions.

Fig. 7 Sample and statistics of pedestrian detection

4.2.2 Validation of pedestrian recognition results

To verify the RetinaNet model's recognition accuracy, precision, recall, and the comprehensive F-value are used to evaluate recognition performance, with the YOLOv3 and Tiny-YOLOv3 detectors as comparison baselines. The recognition results of RetinaNet, YOLOv3, and Tiny-YOLOv3 on test sets of different road scenes are summarized as precision-recall (PR) curves in Fig. 8. The precision and recall of RetinaNet cluster in the upper-right corner, indicating good detection; by contrast, the PR points of YOLOv3 and Tiny-YOLOv3 are relatively scattered, showing insufficient generalization to complex road traffic environments.

Fig. 8 Comparison of precision-recall (PR) curves among RetinaNet, YOLOv3 and Tiny-YOLOv3 ((a) RetinaNet; (b) YOLOv3; (c) Tiny-YOLOv3)

To compare generalization across road conditions, the pedestrian recognition results of the three models on the test section were tallied, with an additional 1 km of separately captured suburban highway test images, and F-values were computed, as shown in Fig. 9. RetinaNet achieves the highest average F-value of 96.42% across the four road types, with good robustness in different environments and a good balance of detection accuracy and speed. YOLOv3 averages 95.52%, performing well on suburban highways and at intersections but reaching a recall of only 90.81% in mixed motor/non-motorized traffic. Tiny-YOLOv3 averages 93.39%: its F-value stays below 90% on mixed and separated motor/non-motorized roads yet reaches 98% on suburban highways with simple backgrounds and sparse pedestrians, so its performance fluctuates markedly with road conditions, with relatively low accuracy and poor generalization. In detection speed, Tiny-YOLOv3 is fastest at 167 ms per image, YOLOv3 takes 194 ms, and RetinaNet 204 ms.

Fig. 9 Comparison of F values of the three models under different road conditions

4.3 Automatic pedestrian localization results

4.3.1 Localization statistics based on the improved U-V parallax algorithm

Based on the pedestrian recognition results on the test section, the ranging algorithm localizes each detected pedestrian and yields the distance of the target pedestrian from the camera imaging plane; the statistics are shown in Fig. 10. The statistical range extends to 50 m, and all distances of 50 m or more are recorded as 50 m.

Fig. 10 Target pedestrian ranging results

4.3.2 Validation of the target localization algorithm

Because moving pedestrians cannot be range-verified while the survey vehicle is driving, indoor experiments were used to verify the accuracy of the target localization algorithm. First, a chessboard calibration block was fixed on a wall, and the binocular camera was moved to capture 20 binocular image pairs at each of 3 m, 4 m, and 5 m, as shown in Fig. 11. Then, the proposed ranging algorithm computed the distance of the calibration block in the images, which was compared with the true distance to obtain accuracy statistics.

Fig. 11 Distance calibration ((a) 3 m; (b) 4 m; (c) 5 m)

To assess the precision of the proposed ranging algorithm, the standard deviation of the 20 ranging results at each of the three distances was computed using Bessel's formula (Lebed, 2017):

$ \sigma=\sqrt{\frac{\sum\limits_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}}{n-1}} $ (19)

where $X_1, X_2, \ldots, X_n$ are the independent measurements, $\bar{X}$ is the mean of the $n$ measurements, and $n-1$ is the number of degrees of freedom.
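
Eq. (19) is the sample standard deviation with Bessel's correction, which NumPy exposes as ddof=1; the repeated range measurements below are hypothetical.

```python
import numpy as np

# Hypothetical repeated range measurements of the 3 m calibration target.
measurements = np.array([2.996, 3.004, 3.001, 2.998])
print(np.std(measurements, ddof=1))   # ~0.0035, below the 0.01 threshold
```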

The standard deviations are shown in Fig. 12. The standard deviation characterizes the dispersion of the measurement results: the smaller it is, the less scattered the ranging results. The standard deviations at all test distances are below 0.01, indicating that the proposed ranging algorithm is reliable.

Fig. 12 Comparison of accuracy at different distances

5 Conclusion

This paper used a self-developed binocular intelligent road perception system to collect road foreground images, applied the RetinaNet model for automatic pedestrian recognition, and combined it with an improved U-V parallax algorithm to localize target pedestrians. The main conclusions are as follows.

1) Using the self-developed binocular intelligent road perception system, road foreground image data were collected in four traffic environments: urban mixed motor/non-motorized roads, urban separated motor/non-motorized roads, intersections, and suburban highways. The mass of data was screened and annotated to build a pedestrian recognition training library based on the traffic characteristics of China, providing a data foundation for research on autonomous driving safety in China.

2) The RetinaNet model was used for automatic pedestrian recognition and compared with other detectors (YOLOv3 and Tiny-YOLOv3) in precision, recall, and F-value. Considering detection speed and accuracy together, RetinaNet shows the strongest overall performance and generalization and recognizes pedestrians accurately in all four traffic environments.

3) Target pedestrians were localized from the binocular images with the improved U-V parallax algorithm. Indoor validation shows that the proposed binocular pedestrian localization algorithm achieves standard deviations below 0.01 from the true values, demonstrating high accuracy.

4) Combining deep-learning object detection with binocular stereo vision achieves automatic recognition and localization of surrounding pedestrians in the driving environment, providing technical support for the safety of autonomous driving and showing practical application value.

References

  • Benacer I, Hamissi A and Khouas A. 2015. A novel stereovision algorithm for obstacles detection based on U-V-disparity approach//2015 IEEE International Symposium on Circuits and Systems (ISCAS). Lisbon, Portugal: IEEE: 369-372[DOI: 10.1109/ISCAS.2015.7168647]
  • Birchfield S, Tomasi C. 1999. Depth discontinuities by pixel-to-pixel stereo. International Journal of Computer Vision, 35(3): 269-293 [DOI:10.1023/A:1008160311296]
  • He J J, Zhang Y P and Yao T Z. 2020. Robust pedestrian detection based on parallel channel cascade network//Proceedings of 2019 Science and Information Conference. Las Vegas, USA: Springer: 205-221[DOI: 10.1007/978-3-030-17795-9_15]
  • Hou H, Guo P, Zheng B and Wang J. 2021. An effective method for lane detection in complex situations//2021 9th International Symposium on Next Generation Electronics (ISNE). Changsha, China: IEEE: 1-4[DOI: 10.1109/ISNE48910.2021.9493597]
  • Howal S, Jadhav A, Arthshi C, Nalavade S and Shinde S. 2019. Object detection for autonomous vehicle using tensorflow//Proceedings of International Conference on Intelligent Computing, Information and Control Systems. Secunderabad, India: Springer: 86-93[DOI: 10.1007/978-3-030-30465-2_11]
  • Ko S, Kim B and Kim J D. 2020. Deep learning-based algorithm for object identification in multimedia//Proceedings of International Conference on Ubiquitous Information Technologies and Applications International Conference on Computer Science and Its Applications. Algiers, Algeria: Springer: 505-511[DOI: 10.1007/978-981-13-9341-9_87]
  • Lebed E V. 2017. The accuracy of statistical computing of the standard deviation of a random variable. IOP Conference Series: Earth and Environmental Science, 90(1): #012150 [DOI:10.1088/1755-1315/90/1/012150]
  • Lin T Y, Goyal P, Girshick R, He K M and Dollár P. 2017. Focal loss for dense object detection//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2999-3007[DOI: 10.1109/iccv.2017.324]
  • Liu Q H, Wei M S, Chen C P. 2020. A note on the matrix-scaled total least squares problems with multiple solutions. Applied Mathematics Letters, 103: 106181 [DOI:10.1016/j.aml.2019.106181]
  • Malekabadi A J, Khojastehpour M, Emadi B. 2019. Comparison of block-based stereo and semi-global algorithm and effects of pre-processing and imaging parameters on tree disparity map. Scientia Horticulturae, 247: 264-274 [DOI:10.1016/j.scienta.2018.12.033]
  • Marin G, Agresti G, Minto L, Zanuttigh P. 2019. A multi-camera dataset for depth estimation in an indoor scenario. Data in Brief, 27: #104619 [DOI:10.1016/j.dib.2019.104619]
  • Shi H, Zhu H, Wang J, Yu S Y, Fu Z F. 2016. Segment-based adaptive window and multi-feature fusion for stereo matching. Journal of Algorithms & Computational Technology, 10(1): 3-11 [DOI:10.1177/1748301815618299]
  • Shrivastava S. 2019. Stereo vision based object detection using v-disparity and 3D density-based clustering//Proceedings of 2019 Science and Information Conference. Las Vegas, USA: Springer: 408-419[DOI: 10.1007/978-3-030-17798-0_33]
  • Wang L, Fan X Y, Chen J H, Cheng J, Tan J, Ma X L. 2020a. 3D object detection based on sparse convolution neural network and feature fusion for autonomous driving in smart cities. Sustainable Cities and Society, 54: #102002 [DOI:10.1016/j.scs.2019.102002]
  • Wang Q B, Liang Y Q, Wang Z T, Li W Y, Jiang Z G and Zhao Y J. 2020b. Deep learning and binocular stereovision to achieve fast detection and location of target//Proceedings of 2019 Chinese Intelligent Systems Conference, CISC. Haikou, China: Springer: 306-313[DOI: 10.1007/978-981-32-9686-2_36]
  • Wang R B, Li L H, Jin L S, Guo L, Zhao Y B. 2007. Study on binocular vision based obstacle detection technology for intelligent vehicle. Journal of Image and Graphics, 12(12): 2158-2163 (王荣本, 李琳辉, 金立生, 郭烈, 赵一兵. 2007. 基于双目视觉的智能车辆障碍物探测技术研究. 中国图象图形学报, 12(12): 2158-2163) [DOI:10.3969/j.issn.1006-8961.2007.12.020]
  • Wang R Q, Jiang Y L, Lou J G. 2020c. TDCF: two-stage deep recommendation model based on mSDA and DNN. Expert Systems with Applications, 145
  • Xie C, Li P and Sun Y R. 2019. Pedestrian detection and location algorithm based on deep learning//Proceedings of 2019 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS). Changsha, China: IEEE: 582-585[DOI: 10.1109/icitbs.2019.00145]
  • Xing J S, Tan W L and Bai J. 2019. Design of object edge detection system based on FPGA//Proceedings of 2019 Chinese Intelligent Systems Conference. Haikou, China: Springer: 194-202[DOI: 10.1007/978-981-32-9698-5_22]
  • Xu J, Wang W, Wang H Y, Guo J H. 2020. Multi-model ensemble with rich spatial information for object detection. Pattern Recognition, 99 [DOI:10.1016/j.patcog.2019.107098]
  • Yang R J, Wang F, Qin H. 2018. Research of pedestrian detection and location system based on stereo images. Application Research of Computers, 35(5): 1591-1595, 1600 (杨荣坚, 王芳, 秦浩. 2018. 基于双目图像的行人检测与定位系统研究. 计算机应用研究, 35(5): 1591-1595, 1600) [DOI:10.3969/j.issn.1001-3695.2018.05.068]
  • Zeng J X, Fang Q, Fu X, Leng L. 2019. Multi-scale pedestrian detection algorithm with multi-layer features. Journal of Image and Graphics, 24(10): 1683-1691 (曾接贤, 方琦, 符祥, 冷璐. 2019. 融合多层特征的多尺度行人检测. 中国图象图形学报, 24(10): 1683-1691)