Published: 2017-05-16
DOI: 10.11834/jig.160600
2017 | Volume 22 | Number 5

Special Column of the 18th National Conference on Image and Graphics

Vehicle detection method based on Fast R-CNN

Cao Shiyu¹, Liu Yuehu², Li Xinzhao²

1. School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China;

2. Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China

Received: 2016-11-19; Revised: 2017-02-08

Supported by: National Natural Science Foundation of China (91120009)

First author: Cao Shiyu (b. 1989), female, master's student in software engineering at Xi'an Jiaotong University. Her research interests include deep learning and object detection in images. E-mail: caokajia@outlook.com

CLC number: TP391

Document code: A

Article ID: 1006-8961(2017)05-0671-07

Abstract

Objective Traditional vehicle detection is typically divided into two steps. The first step generates hypotheses, that is, it locates regions of the image that may contain a vehicle, which reduces the area that must be processed. The second step verifies the hypotheses, that is, it tests whether a vehicle is actually present. In the first step, different features must be designed for different scenes; features commonly used for vehicle detection include symmetry, color, shadows, corners, edges, textures, and lights. In the second step, verification is typically template-based or appearance-based. Besides these basic features, handcrafted features such as HOG, Harris, and SIFT are also widely used. The detection result is finally obtained by a classifier, such as a support vector machine, applied to the feature matrix. This process generalizes poorly, because suitable features must be selected anew for each kind of sample. This paper proposes a vehicle detection method based on Fast R-CNN that finds vehicle objects in scene images without handcrafted features. Method The method is based on deep convolutional neural networks. First, a visual task is defined using images of the vehicles to be detected. Candidate regions of each sample image are obtained with the selective search algorithm, and the candidate region coordinates are fed to the network together with the sample images of the visual task. Each sample image passes through the convolutional and pooling layers of a deep convolutional neural network, which yields deep convolutional features. Because the size of the input image is not fixed, the size of the resulting convolutional features varies. The features are then normalized to a fixed size by the region of interest (RoI) pooling layer of the Fast R-CNN architecture. Finally, the features are fed into separate fully connected branches that compute the classification and regress the bounding-box coordinates in parallel. After many training iterations, a detection model with trained weights that is strongly related to the specified visual task is obtained. Vehicles of the given types can then be detected in new scene images. Result The visual task covers two categories, bus and car, against an urban road background. The detection model was tested on a test set strongly related to the visual task. The experiments show that the model detects vehicles well when the test scenes are strongly correlated with the visual task and the deformation of the vehicles in the samples is small. Conclusion The proposed method uses a convolutional neural network to extract convolutional features in place of traditional handcrafted feature extraction, and trains Fast R-CNN on a visual task defined by sample images to obtain a vehicle detection model. The model performs well on new scene images that are strongly related to the visual task. Replacing handcrafted features with convolutional features avoids the feature selection problem of traditional detection, and deep convolutional features have better representational power. This paper provides a more general and concise approach to vehicle detection.

Key words

fast region convolutional neural network (Fast R-CNN); deep learning; vehicle; vision tasks; object detection

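0 Introduction

Vehicle object detection is the problem of locating vehicles in scene images. It is an important component of applied artificial intelligence and, in recent years, has been widely used in road surveillance systems, driverless vehicle systems, and intelligent parking payment systems. Improving vehicle object detection is therefore of considerable significance.

Much recent work addresses vehicle object detection in static scenes, approaching the problem in different ways. Typically, handcrafted features^[1] are extracted, combined, and classified; alternatively, the vehicle is split into parts from which features are extracted to obtain more precise detections. However, handcrafted feature design depends on the researcher's experience and generalizes poorly, storing the features requires additional space, and the choice of classifier strongly affects the final result. When the detection target changes or the problem extends to complex scenes, traditional approaches face even greater challenges, and the overall system framework becomes more complex in pursuit of better detection.

Any problem of finding objects in a scene can be regarded as a visual task related to that object. This paper proposes a method for detecting vehicles in static scenes that does not rely on handcrafted features.

A visual task is defined as a description of the task of detecting one or several classes of objects in a given scene. It consists of sample scene images that are strongly related to the detection target. To detect a certain type of object in a certain scene, one collects a set of image samples of that object in that scene to form the visual task. The visual task in this paper is to detect front-facing vehicles against an urban road background. As shown in Fig. 1, learning the visual task yields a stable, task-specific object detection model; feeding a new sample image into the model gives the detection result. This paper uses deep learning to train on the visual task: a convolutional neural network computes convolutional features from the samples, and the Fast R-CNN network handles object detection and localization. This provides a more general and concise approach to vehicle object detection.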

Fig. 1 Concept of the object detection method based on vision task learning

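1 Related work

In traditional models, detectors based on handcrafted features^[2] are widely used for vehicle object detection. Tsai et al.^[3] used a novel color transform model to compute color features, quickly find important vehicle colors, and locate possible candidates; a multichannel classifier built from the wavelet transform coefficients of edge maps then verified the candidates. Jazayeri et al.^[4], detecting vehicles in video streams, combined the temporal information of low-level geometric features with vehicle motion information and used a hidden Markov model to separate target vehicles from the background.

Clearly, extracting low-level image features is a crucial step in traditional detection models.

Traditional models have the following drawbacks:

1) Features must be designed for each different detection problem;

2) Features must be combined and a classifier selected and trained, which makes the process complicated;

3) Storage space is needed to hold the features.

In recent years, deep convolutional neural networks have been applied successfully to object recognition and object detection^[5-6]. In 2012, Krizhevsky et al. demonstrated in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) that CNNs can markedly improve image classification accuracy. In 2014, Girshick et al.^[7] combined region proposals with CNNs and trained SVMs on the convolutional features at the end of the network, producing R-CNN, a CNN-based method for object detection. In 2015, Girshick combined the idea of SPPnet^[8] with R-CNN to propose the convolutional network model Fast R-CNN^[9], which replaces the SVM classifier with softmax^[10] regression to reduce space and time costs. Training no longer needs to proceed in separate stages, and detection is both more efficient and more accurate.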

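2 Fast R-CNN process for vehicle detection

The method has two stages, training and testing.

In the training stage, a convolutional neural network whose initial parameters come from pretraining is trained on the visual task we define, yielding the object detection model.

In the testing stage, new samples are fed into the object detection model to obtain detection results.

As shown in Fig. 2, the overall workflow consists of the following five steps: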

Fig. 2 The process of vehicle object detection based on Fast R-CNN

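1) Define the vehicle visual task;

2) Extract object proposals from the sample images, for example with selective search^[11] or edgebox^[12]; this paper uses selective search;

3) Select a CNN architecture and pretrain it on the ImageNet dataset to obtain pretrained parameters;

4) Feed the training set, its annotations, and the object proposals to Fast R-CNN to fine-tune the pretrained model, yielding the final vehicle object detection model;

5) Test the vehicle object detection model on new samples to obtain their detection results.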

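2.1 Construction of the vehicle visual task

The visual task in this paper is to detect front-facing vehicles in urban road scenes. The vehicle types are car and bus. Once the visual task is fixed, sample images are collected to build the training set and test images to build the test set. Because the scenes and objects to be detected are strongly related to the visual task, the sample and test images should be collected in the same environment and in the same way. The collection point was set on an overpass, which makes it easy to capture vehicles from the front. Images were taken with an ordinary camera at a constant zoom. Enough image samples should be collected to train the parameters of the convolutional neural network adequately. Example scenes are shown in Fig. 3.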

Fig. 3 Sample data ((a) bus; (b) car)

Finally, the dataset samples are annotated. For each vehicle, the top-left and bottom-right coordinates are recorded, forming the bounding box $\boldsymbol{v}=({v}_{{x}_{1}}, {v}_{{y}_{1}}, {v}_{{x}_{2}}, {v}_{{y}_{2}})$.

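2.2 Core process of Fast R-CNN

The input to Fast R-CNN has three parts: the sample image, its annotations (ground truth), and the object proposals. Computing the overlap between the annotated boxes and the object proposals of each sample image gives a set of regions of interest (RoIs) for that image. The $R$ RoIs of each image are composed as listed in Table 1. As shown in Fig. 4, the convolutional neural network obtains the convolutional features of a sample through several convolutional and max pooling layers. The RoI pooling layer then extracts, for each region of interest, a fixed-size feature vector from the convolutional features. All feature vectors are fed into fully connected layers, whose shared output splits into two branches that enter two different layers. One layer uses softmax regression to estimate probabilities over the $K$ object classes plus a "background" class; the other outputs four values that encode the bounding-box coordinates for each of the $K$ object classes.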

Table 1 RoIs sources

Class	Proportion	Selection
1	25%	Regions with IoU in [0.5, 1], used as foreground
2	75%	Regions with larger IoU values in [0.1, 0.5), used as background
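The selection rule in Table 1 can be illustrated with a minimal sketch. It assumes boxes are given as (x1, y1, x2, y2) tuples and follows the Fast R-CNN convention that proposals with IoU of at least 0.5 against a ground-truth box are foreground, while those in [0.1, 0.5) are background. The function names, the uniform random sampling, and the default of 64 RoIs per image (the paper's 128 RoIs over 2 images) are illustrative assumptions, not the authors' code.

```python
import random


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_rois(proposals, gt_boxes, rois_per_image=64, fg_fraction=0.25, seed=0):
    """Split proposals into foreground/background by IoU and sample them.

    Hypothetical helper: up to 25% of the RoIs are foreground (IoU >= 0.5),
    the rest are background (0.1 <= IoU < 0.5). Proposals below 0.1 are dropped.
    """
    rng = random.Random(seed)
    fg, bg = [], []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        if best >= 0.5:
            fg.append(p)
        elif best >= 0.1:
            bg.append(p)
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)
```

With 64 RoIs per image and a foreground fraction of 25%, at most 16 foreground RoIs are drawn and the remainder are background.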

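2.3 RoI pooling layer

The RoI pooling layer uses max pooling to convert the feature matrix of each RoI into a smaller feature matrix with a fixed spatial extent of $H$×$W$.

An RoI is represented by a four-tuple ($r$, $c$, $h$, $w$) giving its top-left corner ($r$, $c$) and its height and width ($h$, $w$). The RoI pooling layer divides the $h$×$w$ RoI window into an $H$×$W$ grid of sub-windows, each of approximate size $h$/$H$×$w$/$W$, and max-pools the values in each sub-window into the corresponding output grid cell. With this layer, the input images no longer need to have a uniform size during training, because RoI feature matrices of different sizes are pooled to a uniform size.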
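As a hedged illustration of RoI max pooling, the sketch below pools one RoI of a single-channel feature map, stored as a list of rows, into a fixed grid. The function name `roi_max_pool` and the floor/ceil rule for sub-window boundaries are assumptions made for the example; real implementations operate on multi-channel tensors.

```python
import math


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the RoI (r, c, h, w) of a 2D feature map into out_h x out_w."""
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        # Rows covered by sub-window row i (about h / out_h rows each).
        r0 = r + math.floor(i * h / out_h)
        r1 = r + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            # Columns covered by sub-window column j (about w / out_w each).
            c0 = c + math.floor(j * w / out_w)
            c1 = c + math.ceil((j + 1) * w / out_w)
            row.append(max(feature[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the RoI size, the output always has `out_h` rows and `out_w` columns, which is what lets the fully connected layers accept RoIs of arbitrary size.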

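2.4 Regression in the output layers

As shown in Fig. 4, the network ends in two output layers that compute the classification result and the bounding-box coordinates, respectively. The first output layer computes, for each RoI, the probability distribution $\boldsymbol{P}$ over the $K$ classes plus background using softmax regression. The second output layer computes the bounding-box coordinates for each of the $K$ classes. A multi-task loss function is then used to jointly train classification and bounding-box regression on each labeled RoI: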

$ \begin{align} & L(p,u,{{t}^{u}},v)={{L}_{\text{cls}}}\left( p,u \right)+ \\ & \ \ \ \lambda \left[ u\ge 1 \right]{{L}_{\text{loc}}}({{t}^{u}},v) \\ \end{align} $

(1)

$ {{L}_{\text{cls}}}\left( p,u \right)=-\text{log}\ {{p}_{u}} $

(2)

Fig. 4 Fast R-CNN architecture

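where [$u$≥1] equals 1 when $u$≥1 and 0 otherwise, so the localization loss is ignored for background RoIs ($u$=0). ${L}_{\text{cls}}$ is the log loss for the class probability; ${L}_{\text{loc}}$ is the bounding-box loss, computed from the true bounding-box coordinates for class $u$, $\boldsymbol{v}=({{v}_{x}}, \text{ }{{v}_{y}}, \text{ }{{v}_{w}}, \text{ }{{v}_{h}})$, and the predicted coordinates for class $u$, ${{\boldsymbol{t}}^{u}}=\left( t_{x}^{u}, \text{ }t_{y}^{u}, \text{ }t_{w}^{u}, \text{ }t_{h}^{u} \right)$. ${L}_{\text{loc}}$ is defined as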

$ {{L}_{\text{loc}}}({{t}^{u}},v)=\sum\limits_{i\in \left\{ x,y,w,h \right\}}{{{S}_{L1}}(t_{i}^{u}-{{v}_{i}})} $

(3)

$ {{S}_{L1}}\left( x \right)=\left\{ \begin{align} & 0.5{{x}^{2}}\ \ \ \ \ ~\ \ \left| x \right|<1 \\ & \left| x \right|-0.5\ \ \ \ \left| x \right|\ge 1 \\ \end{align} \right. $

(4)
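Equations (1) to (4) can be checked numerically with a small sketch. The function names are illustrative, class index 0 is assumed to be background, and the default weight λ = 1 is an assumption, since the paper does not state the value it used.

```python
import math


def smooth_l1(x):
    """Eq. (4): quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def multitask_loss(p, u, t_u, v, lam=1.0):
    """Eq. (1): classification loss plus gated localization loss.

    p   -- class probabilities, index 0 assumed to be background
    u   -- ground-truth class index of the RoI
    t_u -- predicted box (x, y, w, h) for class u
    v   -- ground-truth box (x, y, w, h)
    lam -- balancing weight lambda (assumed 1.0 here)
    """
    l_cls = -math.log(p[u])                                  # Eq. (2)
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v))    # Eq. (3)
    # The indicator [u >= 1] switches off localization for background RoIs.
    return l_cls + (lam * l_loc if u >= 1 else 0.0)
```

For a background RoI ($u$=0) only the classification term contributes, regardless of the predicted box.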

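During training, each stochastic gradient descent (SGD) mini-batch consists of $R$ RoIs sampled from $N$ images. This paper sets $R$=128 and $N$=2, so each batch contains 2 images with 64 RoIs per image.

During backpropagation, gradients pass through the RoI pooling layer as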

$ \frac{\partial L}{\partial {{x}_{i}}}=\sum\limits_{r}{\sum\limits_{j}{\left[ i={{i}^{*}}\left( r,j \right) \right]}}\frac{\partial L}{\partial {{y}_{rj}}} $

(5)

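where [·] indicates that, when a gradient is backpropagated through the RoI pooling layer, the layer checks whether input node $i$ was the one selected as the maximum for output ${y}_{rj}$ of RoI $r$. If so, the bracket equals 1 and the gradient is accumulated; otherwise it equals 0.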
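The accumulation in Eq. (5) can be sketched as follows, assuming the forward pass has recorded for every output $y_{rj}$ the position of the input that supplied the maximum. The function name and the data layout (nested lists and (row, col) tuples) are illustrative assumptions.

```python
def roi_pool_backward(feature_shape, argmax, grad_out):
    """Route output gradients back to the inputs that were the maxima.

    feature_shape -- (rows, cols) of the input feature map x
    argmax[r][j]  -- (row, col) of the input chosen as max for output j of RoI r
    grad_out[r][j] -- dL/dy_rj
    Returns dL/dx as a list of rows.
    """
    rows, cols = feature_shape
    grad_in = [[0.0] * cols for _ in range(rows)]
    for r, roi_argmax in enumerate(argmax):
        for j, (y, x) in enumerate(roi_argmax):
            # Overlapping RoIs may select the same input, so gradients add up.
            grad_in[y][x] += grad_out[r][j]
    return grad_in
```

Inputs that were never selected as a maximum receive zero gradient, while an input shared by several overlapping RoIs receives the sum of their gradients.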

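3 Experimental results and analysis

The collected data form training set $\boldsymbol{A}$ and test set $\boldsymbol{A}$^*. Training set $\boldsymbol{A}$ is enlarged by horizontal flipping to improve training. In addition, 1 300 car samples with small deformation, good lighting, and high quality are selected from the 8 500 car samples of test set $\boldsymbol{A}$^* to form test set $\boldsymbol{B}$^*, as listed in Table 2.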

Table 2 Sample quantity

Dataset	Quantity	After horizontal flipping	Resolution/pixel
Training set $\boldsymbol{A}$	car 9 500, bus 1 500	car 19 000, bus 3 000	480×640
Test set $\boldsymbol{A}$^*	car 8 500, bus 1 500	-	480×640
Test set $\boldsymbol{B}$^*	car 1 300	-	480×640
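The horizontal-flip augmentation that doubles training set $\boldsymbol{A}$ also has to mirror the annotated boxes. A minimal sketch, assuming an image stored as a list of pixel rows and boxes as (x1, y1, x2, y2) with 0-based inclusive pixel coordinates (the function name and this coordinate convention are assumptions):

```python
def flip_horizontal(image, boxes):
    """Mirror an image left-right together with its bounding boxes."""
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Column x maps to width - 1 - x; the left and right edges swap roles.
    flipped_boxes = [(width - 1 - x2, y1, width - 1 - x1, y2)
                     for x1, y1, x2, y2 in boxes]
    return flipped_image, flipped_boxes
```

Applying the function twice returns the original image and boxes, which is a convenient sanity check for the coordinate mapping.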

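This paper uses the Caffe framework with GPU acceleration on an Nvidia K5200. The pretrained model architecture is VGG16, with weights from training VGG16 on ImageNet. Training on our visual task training set took 14 h with GPU acceleration and produced the object detection model.

First, selected sample images were tested individually, with the detection boxes drawn on the images. The threshold on the class probability $\boldsymbol{p}$ of a detection box was set to 0.7. The results are shown in Fig. 5: Fig. 5(e) contains a missed detection, while the class probabilities $\boldsymbol{p}$ of the other detected vehicles all exceed 0.9. For vehicles with unusual deformation or unusual colors, performance is mediocre and missed detections are more frequent. Detection quality also depends on the object proposals computed by selective search in the proposal extraction step: when the proposals are poor, good RoIs cannot be obtained from them and the annotations, which degrades the subsequent computation.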
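The thresholding step can be pictured with a tiny sketch. The tuple layout (label, probability, box) and the function name are hypothetical, and treating the threshold as inclusive is an assumption, since the paper only states the value 0.7.

```python
def filter_detections(detections, threshold=0.7):
    """Keep detections (label, p, box) whose class probability reaches the threshold."""
    return [d for d in detections if d[1] >= threshold]
```

Only the boxes that survive this filter would be drawn on the image.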

Fig. 5 Detection results ((a)$p$=0.996;(b)$p$=0.999;(c)$p$=0.917;(d)$p$=0.999;(e)$p$=0.991)

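Next, the model was evaluated on the full test sets $\boldsymbol{A}$^* and $\boldsymbol{B}$^*, with the results listed in Table 3. AP is the average precision. Because precision and recall are single-point values, AP is used as an indicator of overall performance; it equals the area under the precision-recall (PR) curve. mAP is the mean of the AP values over the classes. AUC in the table is the area under the ROC curve, which plots the true positive rate (TPR) on the vertical axis against the false positive rate (FPR) on the horizontal axis; a larger AUC indicates better classification. Fig. 6 shows the PR curves for test sets $\boldsymbol{A}$^* and $\boldsymbol{B}$^*. The bus class is detected well (AP = 0.889), better than the car class, whose detection is only moderate (AP = 0.768). Analysis suggests that, compared with cars, the fronts of buses vary less in shape and come in fewer colors, whereas cars come in many body types, such as SUVs, sedans, and hatchbacks. It is also difficult to keep vehicle pose consistent during collection, so the number of samples for each pose is too small and the parameters are fitted poorly during training. On the specially selected test set $\boldsymbol{B}$^*, detection improves substantially (AP = 0.992). The results show that the constructed visual task is broadly adequate for finding vehicles in new samples. The correlation between test and training samples affects the results: in similar scenes with small object deformation, detection is very good.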
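To make the AP and mAP definitions concrete, the sketch below computes AP as the area under a stepwise precision-recall curve from scored detections, and mAP as the mean over classes. The non-interpolated summation and the function names are assumptions; the paper does not say which AP variant it used.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the PR curve for one class (non-interpolated sum)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for k in order:  # walk detections from most to least confident
        if is_true_positive[k]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

Calling `mean_average_precision({"bus": 0.889, "car": 0.768})` with the per-class AP values reported for test set $\boldsymbol{A}$^* gives about 0.8285, consistent with the reported mAP of 0.83.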

Table 3 Test result

Test set	Class	Quantity	Resolution/pixel	AP	AUC	mAP
Test set $\boldsymbol{A}$^*	Bus	1 500	480×640	0.889	0.905	0.83
Test set $\boldsymbol{A}$^*	Car	8 500	480×640	0.768	0.791	0.83
Test set $\boldsymbol{B}$^*	Car	1 300	480×640	0.992	0.996	0.99

Fig. 6 PR curve ((a) test dataset $\boldsymbol{A}$^*, class: bus, AP=0.889; (b) test dataset $\boldsymbol{A}$^*, class: car, AP=0.768; (c) test dataset $\boldsymbol{B}$^*, class: car, AP=0.992)

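4 Conclusion

By constructing a visual task, extracting convolutional features from its sample images with a deep convolutional neural network, and normalizing the features and performing parallel regression with Fast R-CNN, a vehicle object detection model specific to the visual task is obtained. Using deep convolutional features, the method avoids the dependence of traditional vehicle detection models on handcrafted features as well as the storage of large handcrafted feature matrices. Experiments show that the model achieves good detection results. However, building the visual task requires a large number of valid samples, extracting object proposals beforehand is time-consuming, and the proposal step cannot be combined with Fast R-CNN into an end-to-end detection process. Solving these problems while preserving detection quality is the focus of future work.

References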

[1] Wang X Y, Yang M, Zhu S H, et al. Regionlets for generic object detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(10): 2071–2084. [DOI:10.1109/TPAMI.2015.2389830]

[2] Wang C C R, Lien J J J. Automatic vehicle detection using local features-a statistical approach[J]. IEEE Transactions on Intelligent Transportation Systems, 2008, 9(1): 83–96. [DOI:10.1109/TITS.2007.908572]

[3] Tsai L W, Hsieh J W, Fan K C. Vehicle detection using normalized color and edge map[J]. IEEE Transactions on Image Processing, 2007, 16(3): 850–864. [DOI:10.1109/TIP.2007.891147]

[4] Jazayeri A, Cai H, Zheng J Y, et al. Vehicle detection and tracking in car video based on motion model[J]. IEEE Transactions on Intelligent Transportation Systems, 2011, 12(2): 583–595. [DOI:10.1109/TITS.2011.2113340]

[5] LeCun Y, Kavukcuoglu K, Farabet C. Convolutional networks and applications in vision[C]//Proceedings of 2010 IEEE International Symposium on Circuits and Systems. Paris:IEEE, 2010:253-256.[DOI:10.1109/ISCAS.2010.5537907]

[6] Ouyang W L, Wang X G. Joint deep learning for pedestrian detection[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, VIC:IEEE, 2013:2056-2063.[DOI:10.1109/ICCV.2013.257]

[7] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE International Conference on Computer Vision and Pattern Recognition. Columbus, OH:IEEE, 2014:580-587.[DOI:10.1109/CVPR.2014.81]

[8] He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904–1916. [DOI:10.1109/TPAMI.2015.2389824]

[9] Girshick R. Fast R-CNN[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago:IEEE, 2015:1440-1448.[DOI:10.1109/ICCV.2015.169]

[10] Bouchard G. Clustering and Classification Employing Softmax Function Including Efficient Bounds:US, 8065246[P]. 2011-11-22

[11] Uijlings J R R, van de Sande K E A, Gevers T, et al. Selective search for object recognition[J]. International Journal of Computer Vision, 2013, 104(2): 154–171. [DOI:10.1007/s11263-013-0620-5]

[12] Zitnick C L, Dollár P. Edge boxes:locating object proposals from edges[C]//Proceedings of the 13th European Conference on Computer Vision-ECCV 2014. Switzerland:Springer International Publishing, 2014:391-405.[DOI:10.1007/978-3-319-10602-1_26]