Published: 2019-10-16
DOI: 10.11834/jig.190009
2019 | Volume 24 | Number 10

Image Analysis and Recognition

Multi-scale pedestrian detection fusing multi-layer features

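Zeng Jiexian^1,2, Fang Qi¹, Fu Xiang¹, Leng Lu¹

1. School of Software, Nanchang Hangkong University, Nanchang 330063, China;

2. School of Science and Technology, Nanchang Hangkong University, Gongqingcheng 332020, China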

Received: 2019-01-14; Revised: 2019-05-03; Preprint: 2019-05-10

Funding: National Natural Science Foundation of China (61763033, 61662049, 61741312)

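About the first author: Zeng Jiexian, male, born in 1958, professor. His main research interests are image processing, pattern recognition, and computer vision. E-mail: zengjx58@163.com;
Fang Qi, male, master's student. His main research interests are image processing and pattern recognition. E-mail: 646767305@qq.com;
Leng Lu, male, associate professor. His main research interests are image processing, pattern recognition, and computer vision. E-mail: drluleng@gmail.com.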

CLC number: TP391.4

Document code: A

Article ID: 1006-8961(2019)10-1683-09

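Abstract

Objective Pedestrian detection is widely used in autonomous driving and video surveillance and is an active research topic. Current deep-learning-based pedestrian detectors produce false detections and missed detections when the resolution is low and pedestrians are small, so a multi-scale pedestrian detection algorithm that fuses multi-layer features is proposed. Method First, for the pedestrian detection problem, part of the deep residual network is removed, and feature maps are extracted from only three stages of the network. Next, the feature map from the last layer is enlarged by a factor of two with nearest-neighbor upsampling and fused by element-wise addition, combining features rich in high-level semantic information with features rich in low-level detail. Finally, the three fused feature levels are each fed into a region proposal network, and softmax classification yields the candidate boxes that contain pedestrians. Result On the Caltech pedestrian detection dataset, at a false positives per image (FPPI) rate of 10%, the miss rate of the proposed algorithm is 57.88%, which is 3.07 percentage points lower than the 60.95% of the multi-scale convolutional neural network (MS-CNN), one of the best existing models. Conclusion Deep features carry high-level semantic information and have large receptive fields, whereas shallow features carry positional information and have small receptive fields. Fusing the two strengthens the deep features by giving them richer information about target position. The fused multi-layer feature maps contain different degrees of detail and semantic information and are effective for detecting pedestrians at different scales, so detecting pedestrians on the fused features improves detection performance.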

Keywords

object detection; pedestrian detection; feature fusion; multi-scale pedestrians; multi-layer features

Multi-scale pedestrian detection algorithm with multi-layer features

Zeng Jiexian^1,2, Fang Qi¹, Fu Xiang¹, Leng Lu¹

1. School of Software, Nanchang Hangkong University, Nanchang 330063, China;

2. School of Science and Technology, Nanchang Hangkong University, Gongqingcheng 332020, China

Supported by: National Natural Science Foundation of China (61763033, 61662049, 61741312)

Abstract

Objective To understand a picture fully, humans classify images and grasp the information in each one, including the location and category of every object. This task is called object detection and is one of the basic research areas in computer vision. Object detection comprises subtasks such as pedestrian detection and skeleton detection, and pedestrian detection is both a key and a difficult part of it. This study investigates pedestrian detection in traffic scenes, one of the most valuable topics in the field and a key technology for intelligent video surveillance, driverless vehicles, and intelligent transportation. In recent years the topic has been a research focus in both academia and industry, and with the rapid development of artificial intelligence, computer vision technologies are now widely deployed. Because real scenes are complex and pedestrians appear at different scales, multi-scale pedestrian detection has great research value. Current pedestrian detection algorithms based on deep learning suffer from false detections and missed detections when the resolution is low and pedestrians are small. A multi-scale pedestrian detection algorithm based on multi-layer features is therefore proposed. The proposed convolutional neural network improves the accuracy of pedestrian detection, and considerable progress has been made in practical applications. The academic enthusiasm brought about by deep learning has enabled great progress in pedestrian detection in complex scenes, and deep learning is expected to remain a major driver of pedestrian detection.
Method The deep residual network is mainly used for multi-class classification. After analyzing the network, only the feature maps of three stages are used, and the residual units of the last stage and the fully connected layer are deleted. The feature map extracted by the last layer is enlarged by a factor of two with nearest-neighbor upsampling and then added to the feature map of the adjacent shallower stage. In this way, features rich in high-level semantic information are combined with features rich in low-level detail to improve detection. The three fused feature levels are fed into region proposal networks, and the proposal boxes containing pedestrians are obtained through softmax classification. Four experiments are designed, three of which verify the validity of the proposed method, and the results are compared with those of mainstream algorithms. The comparative experiments indicate that simple layer-wise detection does not improve the results and that fusing all layers into a single feature map is unsatisfactory. Adjacent-layer fusion is therefore selected, and its multi-scale detection results are compared directly with those obtained from the deepest layer alone. Adjacent-layer fusion gives the best results of all experiments, and its miss rate is lower than that of the mainstream algorithms. The network is fully convolutional and is trained end to end with random downsampling and backpropagation. Each image contains many positive and negative candidate boxes. Because negative samples far outnumber positive samples, optimizing over all samples directly would bias the loss toward the negatives. This study therefore selects 256 anchors per image to compute the loss, with a positive-to-negative ratio of 1:1. All new layers in the network are initialized randomly with weights drawn from a zero-mean Gaussian distribution with a standard deviation of 0.01. The other layers are initialized from a model pre-trained for classification, and the entire training process runs for two epochs.
Result On the Caltech pedestrian detection dataset, at a false positives per image (FPPI) rate of 10%, the miss rate of the proposed algorithm is 57.88%, which is 3.07 percentage points lower than the 60.95% of MS-CNN (multi-scale convolutional neural network), one of the best models. Comparative experiments are also conducted. The overall miss rate of Ped-RPN is 64.55%, which is worse than that of the proposed algorithm. The miss rate of the layer-wise detection method (Ped-mutiRPN) is 77.15%, which is worse than that of Ped-RPN. Ped-fused-RPN is a detection algorithm that fuses multiple layers; its miss rate is 61.32%, which is also worse than that of the proposed algorithm.
Conclusion Small-scale pedestrians appear blurred in images, which makes them very hard to detect and degrades overall multi-scale detection. To address the sharp drop in small-scale detection performance, this paper proposes integrating deep semantic information with shallow detail features so that the features at all scales carry rich semantic information. Deep features have high-level semantic information and large receptive fields, whereas shallow features have positional information and small receptive fields. Fusing the two enhances the deep features by giving them rich information about target position. The merged feature maps have different levels of detail and semantic information and work well for detecting pedestrians of different scales.

Key words

object detection; pedestrian detection; feature fusion; multi-scale pedestrians; multi-layer features

0 Introduction

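Pedestrian detection can be defined as determining whether an input image (or video frame) contains pedestrians and, if so, giving their positions^[1]. As an important component of human-computer interaction, person re-identification, video surveillance, robotics, driver assistance, and driverless systems, pedestrian detection has attracted wide attention from researchers^[2].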

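Pedestrian detection algorithms generally consist of three stages: candidate window selection, feature extraction, and classifier design. Traditional methods use hand-crafted features and mostly rely on sliding-window scanning to extract candidate windows, which is relatively fast. To obtain better detection results, researchers proposed replacing the traditional sliding-window strategy with region proposal algorithms (such as selective search^[3]), which improve detection accuracy while reducing time complexity.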

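Pedestrians vary widely in clothing, pose, scene, and illumination, and researchers have proposed a series of methods to represent their features accurately. Dalal et al.^[4] proposed the histogram of oriented gradients (HOG) feature, which divides the image into cells, collects the gradient or edge-orientation histogram of all pixels in each cell to form a feature descriptor, and feeds the descriptor into a classifier to detect pedestrians. HOG accelerated the development of pedestrian detection, and many improved algorithms followed. Tian et al.^[5] improved the HOG feature and proposed the MultiHOG feature, which greatly improves both detection accuracy and speed on the Inria dataset. Che et al.^[6] proposed an improved HOG feature extraction algorithm that, combined with a support vector machine (SVM) classifier, slightly improves detection accuracy and can be applied in practice in metro environments. Cao et al.^[7] designed the neighboring and non-neighboring features (NNNF), motivated by the observation that the internal texture of a pedestrian is roughly symmetric in the horizontal direction and structurally similar; the method achieves very good results on the Caltech dataset. Dollár et al.^[8] proposed aggregated channel features (ACF), which combine color and gradient and greatly increase detection speed while improving accuracy. Nam et al.^[9] improved ACF and proposed a decorrelated variant (LDCF). These methods represent pedestrians well and improve detection, but their accuracy is still insufficient for practical use in complex scenes such as autonomous driving and video surveillance. To address mutual occlusion between pedestrians, Zeng et al.^[10] proposed a deformable part model (DPM) that combines single- and double-pedestrian models and effectively improves detection accuracy under occlusion, although accuracy on more complex occlusion still needs improvement. Wu^[11] proposed a cascade of multiple classifiers to reduce the effect of occlusion on accuracy; its two stages with nine classifiers in total effectively improve detection accuracy under occlusion but still lag behind current detectors based on convolutional neural networks.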

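Krizhevsky et al.^[12] used deep convolutional neural networks (CNNs) to reduce the top-5 error rate of the classification task to 15.3% in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Features learned by deep networks are more robust than traditional hand-crafted features, so applying deep learning to pedestrian detection became an urgent problem. Ma et al.^[13] proposed an algorithm that combines a traditional pedestrian detector (LDCF) with a CNN: LDCF first extracts candidate windows, a CNN then extracts features, and an SVM finally classifies the features to obtain the detection result. Wen et al.^[14] trained a pedestrian detector based on R-CNN (Region-CNN) with augmented data and used a multi-window fusion algorithm to improve detection accuracy. These methods must rescale the image to a uniform size, which severely distorts it and lowers accuracy. In addition, the candidate windows overlap heavily, so features are extracted repeatedly, and candidate window extraction takes up most of the detection time, which severely limits speed. Zhao et al.^[15] proposed an end-to-end pedestrian detection algorithm based on Faster R-CNN that places candidate window extraction, feature extraction, and classification in a unified framework and is relatively fast. Real scenes, however, contain many small-scale pedestrians. Luo et al.^[16] proposed an improved region proposal network to raise the detection accuracy of small targets: a small detection network slides over the convolutional feature map and corrects the positions and scales of the anchor boxes, which improves small-scale pedestrian detection but performs poorly on medium- and large-scale pedestrians. Reference [17] uses only the feature map output by the last convolutional layer; because this feature map has low resolution, small-scale pedestrian detection performance drops sharply.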

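In summary, designing an algorithm that is fast and detects pedestrians of different scales well is the current focus of pedestrian detection research. Unlike reference [17], this paper proposes a method that detects pedestrians of different sizes on feature maps of different resolutions.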

1 Pedestrian detection algorithm based on the region proposal network (RPN)

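In reference [18], the detection algorithm consists of two stages: candidate window extraction and AlexNet-Pedestrian. Candidate window extraction takes 511 ms and the AlexNet-Pedestrian stage takes 19 ms, so candidate window extraction accounts for about 96% of the 530 ms total and severely limits the speed of the algorithm.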

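The deep residual network (ResNet) was proposed by He et al.^[19] and won first place in both the localization and classification tasks of ILSVRC 2015. ResNet is mainly used for large-scale multi-class image recognition, whereas pedestrian detection is a comparatively simple, smaller problem. This paper therefore removes part of ResNet to form its own convolutional neural network, which increases detection speed. For pedestrian detection, this paper also improves the region proposal network (RPN) by designing anchor boxes specifically for pedestrians to improve network performance. The proposed model is a fully convolutional network whose backbone consists of the four parts conv1, conv2, conv3, and conv4 of ResNet and is used to extract features. The extracted features are fed into the RPN, which outputs rectangular boxes with pedestrian confidence scores, thereby detecting pedestrians.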

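In pedestrian detection, complex backgrounds often cause many missed detections, most of which are small- and medium-scale pedestrians. Such pedestrians are common in ordinary traffic scenes. To reduce the effect of pedestrian scale on detection, this paper designs three detection models: 1) an RPN pedestrian detection model based on single-layer features; 2) an RPN pedestrian detection model based on a single layer of fused features; 3) an RPN pedestrian detection model based on multiple layers of fused features. Model 1) is used to study the effectiveness of pedestrian detection; model 2) is used to study the effect of feature fusion on pedestrian detection; model 3) is used to study the results of detecting pedestrians of different scales on feature maps of different scales and resolutions.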

1.1 Region proposal network (RPN)

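The RPN (region proposal network) was proposed by Ren et al.^[20] in 2015. It is a single-class object detection network whose core idea is to use a convolutional neural network to generate boxes with confidence scores directly; the underlying method is essentially a sliding window. The network structure is shown in Fig. 1. Suppose the input image has size $M \times N$ and the feature map extracted by ResNet has size $m \times n$ (feature maps from different stages differ in size; reference [20] takes the last residual block of stage 4, whose output is 1/16 the size of the original image).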

Fig. 1 Region proposal network (RPN) structure diagram

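A 3×3 convolution with 512 output channels is applied to the extracted feature map, so each 3×3 region yields a 512-dimensional feature vector. This vector is used to predict anchor boxes of multiple scales and aspect ratios centered at the position corresponding to the center of the 3×3 kernel. Prediction includes classification and regression. For classification, a 1×1 convolution with 2$k$ output channels ($k$ is the number of anchors at each center position) is applied to the feature vector to obtain a 2$k$-dimensional vector, which is fed into a softmax classifier to obtain the probabilities that each of the $k$ anchor boxes is an object or not. For regression, a 1×1 convolution with 4$k$ output channels is applied to the same 512-dimensional vector to obtain a 4$k$-dimensional vector, which represents the regression offsets of the starting-point abscissa, ordinate, width, and height of the $k$ anchor boxes. These operations together accomplish object detection.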
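The per-location prediction described above can be sketched in plain Python. This is only an illustration under stated assumptions: the weights are random stand-ins for trained 1×1 convolutions, the function names are hypothetical, and the layout in which the two class scores of one anchor sit next to each other is an assumption, not something the text specifies.

```python
import math
import random

K = 9            # anchors per location (k in the text)
FEAT_DIM = 512   # size of the feature vector produced by the 3x3 convolution

def linear(weights, x):
    """A 1x1 convolution at a single location is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax_pair(a, b):
    """Two-way softmax over the (non-object, object) scores of one anchor."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def rpn_head(feature, w_cls, w_reg):
    """Return k probability pairs and k groups of 4 regression offsets."""
    scores = linear(w_cls, feature)     # 2k values
    offsets = linear(w_reg, feature)    # 4k values
    # Assumed layout: scores[2a], scores[2a + 1] belong to anchor a.
    probs = [softmax_pair(scores[2 * a], scores[2 * a + 1]) for a in range(K)]
    boxes = [offsets[4 * a:4 * a + 4] for a in range(K)]
    return probs, boxes

rng = random.Random(0)
feature = [rng.gauss(0, 1) for _ in range(FEAT_DIM)]
# Random stand-ins for the trained 1x1 convolution weights (illustrative only).
w_cls = [[rng.gauss(0, 0.01) for _ in range(FEAT_DIM)] for _ in range(2 * K)]
w_reg = [[rng.gauss(0, 0.01) for _ in range(FEAT_DIM)] for _ in range(4 * K)]
probs, boxes = rpn_head(feature, w_cls, w_reg)
```

Each of the 9 entries of `probs` is a pair that sums to 1, and each entry of `boxes` holds the 4 offsets of one anchor.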

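With a feature map of size $m \times n$, a total of $m \times n$ 512-dimensional vectors are obtained. At each center position of the 3×3 kernel, anchor boxes of multiple scales (set here to (8, 16, 32)) and multiple aspect ratios (set here to (0.5, 1, 2)) are predicted, so each position predicts $k$ ($k$ = 9) windows and $m \times n \times k$ windows are predicted in total.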
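The anchor count can be checked with a small enumeration. The scales and aspect ratios are taken from the text; the stride of 16, the base size of 16 pixels, and the reading of the ratio as height divided by width are assumptions made for this sketch, and the function names are hypothetical.

```python
import math

SCALES = (8, 16, 32)   # from the text
RATIOS = (0.5, 1, 2)   # from the text
STRIDE = 16            # assumed: the feature map is 1/16 of the input
BASE = 16              # assumed base size that each scale multiplies

def anchors_at(cx, cy):
    """Return the k = 9 anchors (cx, cy, w, h) centered at one position."""
    out = []
    for s in SCALES:
        area = float((BASE * s) ** 2)
        for r in RATIOS:              # r is treated as height / width (assumption)
            w = math.sqrt(area / r)
            h = w * r
            out.append((cx, cy, w, h))
    return out

def all_anchors(m, n):
    """Enumerate the anchors of an m x n feature map (m rows, n columns)."""
    boxes = []
    for row in range(m):
        for col in range(n):
            cx = (col + 0.5) * STRIDE
            cy = (row + 0.5) * STRIDE
            boxes.extend(anchors_at(cx, cy))
    return boxes

m, n = 30, 40                    # a 480 x 640 image at stride 16
anchors = all_anchors(m, n)
print(len(anchors))              # m * n * k = 30 * 40 * 9 = 10800
```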

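In the training stage, the inputs are the image, the window labels, and the window positions, widths, and heights; the output is the loss value. The loss measures the quality of the model, and the goal of training is to minimize it.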

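Each box is assigned a binary label, 0 or 1, that indicates whether it contains a pedestrian. An anchor box is a positive sample in two cases: 1) it has the highest intersection over union (IoU) with a ground-truth box, even if that IoU is below 0.5; 2) its IoU with any ground-truth box is greater than 0.5. One feature map can therefore contain several positive anchor boxes. The remaining anchor boxes, whose IoU is below 0.5, are negative samples. The total loss function of the region proposal network can thus be expressed as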

$ \begin{array}{*{20}{c}} {\mathit{\boldsymbol{L}}(\{ {p_i}\} , \{ {\mathit{\boldsymbol{t}}_i}\} ) = \frac{1}{{{\mathit{\boldsymbol{N}}_{{\rm{cls}}}}}}\sum\limits_i {{\mathit{\boldsymbol{L}}_{{\rm{cls}}}}} ({p_i}, p_i^*) + }\\ {\lambda \frac{1}{{{\mathit{\boldsymbol{N}}_{{\rm{reg}}}}}}\sum\limits_i {p_i^*{\mathit{\boldsymbol{L}}_{{\rm{reg}}}}} ({\mathit{\boldsymbol{t}}_i}, \mathit{\boldsymbol{t}}_i^*)} \end{array} $

(1)

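In Eq. (1), $i$ is the index of an anchor box, ${p_i}$ is the predicted probability that the $i$-th anchor box contains a pedestrian, and ${p_i^*}$ is the ground-truth label of the $i$-th anchor box: ${p_i^*}$ is 1 when the anchor box is a positive sample and 0 otherwise. ${{\mathit{\boldsymbol{t}}_i}}$ is a vector of the four parameters of the predicted box, namely the center coordinates, width, and height, and ${\mathit{\boldsymbol{t}}_i^*}$ is the coordinate vector of the ground-truth box associated with a positive anchor. The classification loss ${{\mathit{\boldsymbol{L}}_{{\rm{cls}}}}}$ is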

$ {\mathit{\boldsymbol{L}}_{{\rm{cls}}}}({p_i}, p_i^*) = - \ln \left[ {p_i^*{p_i} + (1 - p_i^*)(1 - {p_i})} \right] $

(2)
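The labels $p_i^*$ that enter Eq. (2) come from the IoU rules stated before Eq. (1). A minimal sketch of that assignment follows, assuming boxes are given as corner coordinates (x1, y1, x2, y2); the function names and the toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_labels(anchors, gt_boxes, thresh=0.5):
    """Return 1 for positive anchors and 0 for negative anchors."""
    labels = [0] * len(anchors)
    if not gt_boxes:
        return labels
    table = [[iou(a, g) for g in gt_boxes] for a in anchors]
    # Case 2: IoU above the threshold with any ground-truth box.
    for i, row in enumerate(table):
        if max(row) > thresh:
            labels[i] = 1
    # Case 1: the best-overlapping anchor of each ground-truth box,
    # even when that overlap is below the threshold.
    for j in range(len(gt_boxes)):
        best = max(range(len(anchors)), key=lambda i: table[i][j])
        if table[best][j] > 0:
            labels[best] = 1
    return labels

anchors = [(0, 0, 10, 10), (0, 0, 4, 10), (20, 20, 30, 30)]
gt_boxes = [(0, 0, 10, 10), (22, 22, 40, 40)]
labels = assign_labels(anchors, gt_boxes)
print(labels)   # [1, 0, 1]
```

The third anchor overlaps its ground-truth box with an IoU of only about 0.18, yet it is labeled positive because it is the best match for that box.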

The bounding-box regression loss is

$ {\mathit{\boldsymbol{L}}_{{\rm{reg}}}}\left( {{\mathit{\boldsymbol{t}}_i}, \mathit{\boldsymbol{t}}_i^*} \right) = \sum\limits_{j \in \left\{ {x, y, w, h} \right\}} {{\rm{smooth}}_{L1}\left( {{t_j}, t_j^*} \right)} $

(3)

$ {\rm{smooth}}_{L1}\left( {{t_j}, t_j^*} \right) = \left\{ \begin{array}{ll} 0.5{\left( {{t_j} - t_j^*} \right)^2} & \left| {{t_j} - t_j^*} \right| < 1\\ \left| {{t_j} - t_j^*} \right| - 0.5 & {\rm{otherwise}} \end{array} \right. $

(4)
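Eqs. (1) to (4) can be written out directly. The sketch below gates the regression term by the label $p_i^*$, as the text following Eq. (4) describes. The defaults chosen for $\lambda$, $N_{\rm cls}$, and $N_{\rm reg}$ are assumptions, since the text does not give their values, and the function names are hypothetical.

```python
import math

def cls_loss(p, p_star):
    """Eq. (2): log loss over the two classes."""
    return -math.log(p_star * p + (1 - p_star) * (1 - p))

def smooth_l1(t, t_star):
    """Eq. (4): quadratic near zero, linear elsewhere."""
    d = abs(t - t_star)
    return 0.5 * d * d if d < 1 else d - 0.5

def reg_loss(t, t_star):
    """Eq. (3): sum of smooth L1 terms over the x, y, w, h components."""
    return sum(smooth_l1(a, b) for a, b in zip(t, t_star))

def rpn_loss(p, p_star, t, t_star, lam=1.0, n_cls=None, n_reg=None):
    """Eq. (1). lam, n_cls and n_reg default to assumed values."""
    n_cls = n_cls or len(p)
    n_reg = n_reg or len(p)
    l_cls = sum(cls_loss(pi, si) for pi, si in zip(p, p_star)) / n_cls
    # The factor si (p_i^*) switches regression off for negative anchors.
    l_reg = sum(si * reg_loss(ti, tsi)
                for si, ti, tsi in zip(p_star, t, t_star)) / n_reg
    return l_cls + lam * l_reg

p = [0.5, 0.5]
p_star = [1, 0]
t = [(0.0, 0.0, 0.0, 0.0), (5.0, 5.0, 5.0, 5.0)]
t_star = [(0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
total = rpn_loss(p, p_star, t, t_star)
```

In this toy input the second anchor is negative, so its large coordinate error contributes nothing to the regression term.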

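The factor ${p_i^*}{\mathit{\boldsymbol{L}}_{{\rm{reg}}}}$ means that the regression loss is active only for positive samples ($p_i^* = 1$) and vanishes for negative samples ($p_i^* = 0$). The regression uses four coordinates, namely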

$ {\mathit{\boldsymbol{t}}_i} = \left\{ {{t_x}, {t_y}, {t_w}, {t_h}} \right\}, \mathit{\boldsymbol{t}}_i^* = \{ t_x^*, t_y^*, t_w^*, t_h^*\} $

(5)

$ \begin{array}{*{20}{c}} {{t_x} = \frac{{x - {x_a}}}{{{w_a}}}, {\rm{ }}{t_y} = \frac{{y - {y_a}}}{{{h_a}}}}\\ {{t_w} = \ln \left( {\frac{w}{{{w_a}}}} \right), {\rm{ }}{t_h} = \ln \left( {\frac{h}{{{h_a}}}} \right)} \end{array} $

(6)

$ \begin{array}{*{20}{c}} {t_x^* = \frac{{{x^*} - {x_a}}}{{{w_a}}}, {\rm{ }}t_y^* = \frac{{{y^*} - {y_a}}}{{{h_a}}}}\\ {t_w^* = \ln \left( {\frac{{{w^*}}}{{{w_a}}}} \right), {\rm{ }}t_h^* = \ln \left( {\frac{{{h^*}}}{{{h_a}}}} \right)} \end{array} $

(7)

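where $x$, $y$, $w$, $h$ are the center coordinates, width, and height of the predicted box; ${x_a}, {y_a}, {w_a}, {h_a}$ are the center coordinates, width, and height of the anchor box; and ${x^*}, {y^*}, {w^*}, {h^*}$ are the center coordinates, width, and height of the ground-truth box.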
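The parameterization of Eqs. (6) and (7) and its inverse can be sketched as follows. Boxes are (center x, center y, width, height) tuples; the function names and the numeric example are hypothetical.

```python
import math

def encode(box, anchor):
    """Eqs. (6)/(7): express a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert the encoding to recover a box from regression offsets."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 120.0, 40.0, 80.0)
gt_box = (110.0, 110.0, 50.0, 100.0)
t_star = encode(gt_box, anchor)       # regression target for this anchor
recovered = decode(t_star, anchor)    # matches gt_box up to rounding error
```

Dividing the offsets by the anchor size and taking logarithms of the size ratios keeps the targets on a similar numeric range for small and large anchors.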

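The region proposal network performs reasonably well on small objects, yet detecting small-scale pedestrians has long been a difficult problem in pedestrian detection.

Simply combining the region proposal network with an ordinary deep residual network gives multi-scale pedestrian detection accuracy slightly higher than that of traditional algorithms, but the results are not particularly good compared with more advanced algorithms. This paper makes several improvements to address this problem.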

1.2 Pedestrian detection based on a single-layer-feature region proposal network

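To demonstrate the effectiveness of the region proposal network for multi-scale pedestrian detection, a simple experiment combining a deep residual network with the RPN is designed; its result is named Ped-RPN.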

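Fig. 2 shows the structure of a simple deep residual network (ResNet) combined with the RPN. The image is fed into the deep residual network to extract features. R2 is the output of the 9th layer, produced by a 3×3 convolution with 256 channels. R3 is the output of the 21st layer, where the convolution kernel of the block is 3×3×512 and 512 is the number of channels. R4 is the output of the 90th layer, where the convolution kernel of the block is 3×3×1 024. The extracted feature map first passes through a 3×3 convolution with 512 channels, which reduces the original 1 024-channel feature map to 512 channels. It then passes through two 1×1 convolutions with 256 channels, whose outputs go to the cls layer for softmax classification and to the bbox layer for regression, respectively. Finally, non-maximum suppression is applied to the anchor boxes to obtain the candidate windows that contain pedestrians.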

Fig. 2 Structure of the deep residual network (ResNet) combined with the RPN

2 Pedestrian detection based on multi-layer feature maps

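Traffic scenes contain many small-scale pedestrians. Ped-RPN performs sliding-window scanning on the last feature map (R4), which is 1/16 the size of the original image. Moving one pixel on the feature map corresponds to moving 16 pixels in the original image, so the sliding window can easily skip over a pedestrian, which hinders localization. In addition, the feature resolution is too low for good classification.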

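To solve this problem, this paper uses RPNs on feature maps of different sizes and resolutions to detect pedestrians of different scales. The backbone of Ped-RPN outputs only one group of feature maps, P4, whereas the backbone of Ped-mutiRPN outputs three groups, P2, P3, and P4. In ResNet, the feature maps output by R2, R3, and R4 are 1/4, 1/8, and 1/16 the size of the input and have 256, 512, and 1 024 channels, respectively. To make feature fusion convenient, 1×1 convolutions with 256 channels are used to unify the channel dimension, giving the features P4, R3la, and R2la. P3 is obtained by fusing P4 and R3la: P4 is first enlarged by a factor of two with nearest-neighbor upsampling to give P4Up, R3la and P4Up are added to give P3Sum, and a 3×3 convolution is then applied to remove the aliasing caused by the different resolutions, giving P3. P2 is obtained in the same way. Three RPNs are then used to detect pedestrians of different scales, and their results are finally merged. The network structure is shown in Fig. 3.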

Fig. 3 Ped-fused-mutiRPN network structure diagram
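The upsample-and-add step that produces P3Sum from P4 and R3la can be sketched on a single channel. This is an illustration only: the 1×1 convolution that unifies the channels and the 3×3 convolution that follows the addition are omitted, the feature values are made up, and the function names are hypothetical.

```python
def upsample2x_nearest(fmap):
    """Nearest-neighbor upsampling: each value fills a 2 x 2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add_maps(a, b):
    """Element-wise addition; both maps must have the same size."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("feature maps must have the same size")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

p4 = [[1.0, 2.0],
      [3.0, 4.0]]                       # coarse, semantically strong map
r3la = [[0.5] * 4 for _ in range(4)]    # finer map after channel unification
p4_up = upsample2x_nearest(p4)          # 2 x 2 -> 4 x 4
p3_sum = add_maps(r3la, p4_up)
print(p3_sum[0])   # [1.5, 1.5, 2.5, 2.5]
```

Repeating the same two steps with P3 and R2la would give the sum from which P2 is computed.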

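Features are usually fused in one of two ways: addition or concatenation. Addition requires the features to have the same spatial size and the same number of channels, whereas concatenation requires only the same spatial size. Because concatenation produces fused features with too many channels, which slows subsequent detection, this paper fuses features by addition.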
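The difference between the two fusion modes shows up in the output shape. The helper below is a hypothetical shape calculator, with shapes written as (channels, height, width); the 60×80 spatial size is an assumed example (a 480×640 input at stride 8).

```python
def fused_shape(shape_a, shape_b, mode):
    """Return the shape of the fused feature for 'add' or 'concat'."""
    ca, ha, wa = shape_a
    cb, hb, wb = shape_b
    if (ha, wa) != (hb, wb):
        raise ValueError("both modes need the same spatial size")
    if mode == "add":
        if ca != cb:
            raise ValueError("addition also needs the same channel count")
        return (ca, ha, wa)
    if mode == "concat":
        return (ca + cb, ha, wa)
    raise ValueError("mode must be 'add' or 'concat'")

a = (256, 60, 80)
b = (256, 60, 80)
print(fused_shape(a, b, "add"))      # (256, 60, 80)
print(fused_shape(a, b, "concat"))   # (512, 60, 80)
```

With concatenation the next convolution sees twice as many input channels, which is the extra cost the text refers to.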

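The detection part consists of a feature layer, a classification layer, and a regression layer. The feature layer is a single convolutional layer with 512 kernels of size 3×3. The classification layer is a convolutional layer with 18 kernels of size 1×1 (2 scores for each of the $k$ = 9 anchors) followed by a softmax layer. The regression layer is a single convolutional layer with 36 kernels of size 1×1 (4 offsets for each of the 9 anchors).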

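The proposed network is fully convolutional and is trained end to end with random downsampling and backpropagation. Every image contains many positive and negative candidate boxes. Because negative samples far outnumber positive samples, optimizing over all samples directly would bias the loss toward the negatives. This paper therefore selects 256 anchors per image to compute the loss, with a positive-to-negative ratio of 1:1.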
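The 256-anchor mini-batch with a 1:1 ratio can be sketched as below. Filling the batch with extra negatives when an image has fewer than 128 positives is an assumption borrowed from common RPN practice, not something the text states; the function name and the seed parameter are hypothetical.

```python
import random

def sample_anchors(labels, batch_size=256, pos_fraction=0.5, seed=0):
    """Randomly pick positive and negative anchor indices for one image."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    # Assumption: when positives are scarce, negatives fill the batch.
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)

labels = [1] * 300 + [0] * 5000
pos_idx, neg_idx = sample_anchors(labels)
print(len(pos_idx), len(neg_idx))   # 128 128
```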

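All new layers in the network are initialized randomly, with weights drawn from a zero-mean Gaussian distribution with a standard deviation of 0.01. The other layers are initialized from a model pre-trained for classification. On the dataset, the learning rate is 0.001 for the first 60 000 samples and 0.000 1 for the following 20 000. The momentum is 0.9 and the weight decay is 0.000 5^[19]. The model is implemented with the Mxnet^[20] framework.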
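The training hyperparameters can be collected in a short sketch. The numbers come from the text; the particular form of the momentum update is one common formulation and is an assumption here, as are the function names. This is plain Python for illustration, not the framework code used for the experiments.

```python
import random

def learning_rate(n):
    """Step schedule: 0.001 for the first 60 000 units, then 0.0001."""
    return 0.001 if n < 60000 else 0.0001

def init_weights(count, std=0.01, seed=0):
    """Draw weights from a zero-mean Gaussian with the given std."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(count)]

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay (assumed form)."""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

weights = init_weights(10000)
w_new, v_new = sgd_momentum_step(1.0, 0.5, 0.0, learning_rate(0))
```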

3 Experiments

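The Caltech pedestrian database^[21] is one of the most popular and most widely used pedestrian detection datasets. It consists of about 10 hours of video captured by a vehicle-mounted camera at 30 frames/s. The dataset provides about 250 000 annotated frames with a total of 350 000 bounding boxes and 2 300 unique pedestrians, and the video resolution is 640×480 pixels.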

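The dataset contains 11 sets, of which the first 6 are the training set and the last 5 are the test set. One of every 4 frames is taken as a training image, which ensures enough training data while avoiding the redundancy of training on every frame. As in reference [22], one of every 10 frames is taken for the test set, giving 4 205 images in total.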

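ResNet101 is divided into 5 stages. The deepest layer of each stage is generally considered to have the strongest features, so {C1, C2, C3, C4, C5} denote the outputs of the deepest layers of conv1, conv2, conv3, conv4, and conv5. C1 is generally not used because of its memory cost. For pedestrian detection, the conv5 stage is deleted: the feature map extracted by the fifth stage has a ratio of 1:32 to the input image, so each one-pixel step on this feature map corresponds to a 32-pixel step in the original image, which easily causes small-scale pedestrians narrower or shorter than 32 pixels to be missed. This paper therefore deletes the last stage and uses only the first 4 stages of ResNet101 as the main body of the network. The result of fusing P2, P3, and P4 has also been verified experimentally and is named Ped-fused-RPN.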
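The stride argument can be made concrete for the 640×480 Caltech frames. The strides of C2 to C4 follow the 1/4, 1/8, and 1/16 ratios given earlier and the 1:32 ratio stated for the fifth stage; rounding up with `ceil` for sizes that do not divide evenly is an assumption, and the function names are hypothetical.

```python
import math

STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}

def feature_map_size(width, height, stride):
    """Spatial size of a feature map for a given input size and stride."""
    return math.ceil(width / stride), math.ceil(height / stride)

def cells_spanned(size_px, stride):
    """How many feature-map cells an object of size_px pixels covers."""
    return size_px / stride

sizes = {name: feature_map_size(640, 480, s) for name, s in STRIDES.items()}
print(sizes["C4"])             # (40, 30)
print(sizes["C5"])             # (20, 15)
print(cells_spanned(30, 32))   # 0.9375
```

A pedestrian 30 pixels tall covers less than one cell at stride 32 but more than seven cells at stride 4, which is why the shallower maps are better suited to small pedestrians.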

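The comparative experiments of the mainstream algorithms are all evaluated on this dataset and plotted as ROC (receiver operating characteristic) curves. The horizontal axis of the curve is the false alarm rate for pedestrians (false positives per image, FPPI) and the vertical axis is the miss rate; the faster the curve falls, the better the model detects.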
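The two axes of the curve can be computed from scored detections as sketched below. The sketch assumes that each detection has already been matched against the ground truth and marked as a true or false positive; the matching rule itself is outside its scope, and the function name and the toy data are hypothetical.

```python
def miss_rate_curve(detections, num_gt, num_images):
    """Return (FPPI, miss rate) points, one per score threshold.

    detections: list of (score, is_true_positive) pairs.
    """
    points = []
    tp = fp = 0
    for _score, is_tp in sorted(detections, key=lambda d: -d[0]):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_images, 1 - tp / num_gt))
    return points

detections = [(0.9, True), (0.8, False), (0.7, True), (0.3, False)]
curve = miss_rate_curve(detections, num_gt=4, num_images=10)
print(curve[-1])   # (0.2, 0.5)
```

Lowering the score threshold moves along the curve toward more false positives per image and a lower miss rate.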

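Fig. 4 compares the proposed algorithm with current mainstream multi-scale pedestrian detection algorithms^[21]; the proposed algorithm clearly outperforms the others. The values in the legend are the miss rates of the methods at a false alarm rate of 10%.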

Fig. 4 Comparison of the proposed algorithm with current mainstream algorithms

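Table 1 lists the results of the different methods in this paper for multi-scale pedestrian detection. Ped-RPN feeds the features output by the last layer of the backbone network (ResNet) into the region proposal network and then detects pedestrians. It works well on large-scale pedestrians, but it misses essentially all small targets and has a high miss rate on medium-sized targets. Although it is better than traditional methods, its overall performance is not particularly good.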

Table 1 Miss rates (%) of the proposed methods for pedestrians of different sizes

Method	Very large (over 100)	Large (over 80)	Medium (30~80)	Small (20~30)	All scales (over 20)
Ped-RPN	0.94	2.01	54.77	100	65.45
Ped-mutiRPN	1.10	7.96	68.89	100	77.15
Ped-fused-RPN	1.92	4.34	45.93	88.71	61.32
Ped-fused-mutiRPN	1.84	3.63	41.87	76.96	57.88
Note: Bold font indicates the best result; the values in parentheses are pedestrian sizes in pixels.

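Ped-mutiRPN detects layer by layer: the features extracted by the four parts of the deep residual network are fed separately into RPNs for pedestrian detection. Its results are worse than those of Ped-RPN (an overall miss rate of 77.15% versus 65.45%), so layer-wise detection alone performs worse across scales than detection on the semantic information of the high-level features.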

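The Ped-mutiRPN results show that detecting separately on the feature maps extracted from the four blocks does not improve detection. This paper holds that the rich semantic information of the high layers works well for large-scale pedestrians, while the detail information of the low layers helps detect small-scale pedestrians, so fusion is adopted to improve detection.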

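Ped-fused-RPN is the result of fusing P2, P3, and P4, and the table shows that it does not perform as well as the proposed method. Because the proposed network is a sliding-window detector with a fixed window size, sliding over different layers increases its robustness to scale changes. In addition, although the fused feature map has more anchor boxes, its results are still worse than those of the proposed method, which indicates that increasing the number of anchor boxes does not improve accuracy.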

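Ped-fused-mutiRPN is the proposed method, and the results show that the fused feature maps clearly improve the detection of pedestrians at different scales.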

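Following the standard pedestrian detection protocol, Table 2 compares miss rates (MR) at a false positives per image (FPPI) value of 10^-1. As Table 2 shows, the proposed method clearly outperforms the other methods on the multi-scale pedestrian detection dataset, with a miss rate 3.07 percentage points lower than that of MS-CNN (57.88% versus 60.95%).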

Table 2 Miss rates (%) of each algorithm at an FPPI of 10%

Method	Miss rate
VJ	99.53
HOG	90.36
SCF+ AlexNet	70.33
ACF++	69.07
RPN-ped	67.25
CompACT-Deep	64.44
SA-FasterRCNN	62.59
MS-CNN	60.95
Proposed	57.88
Note: Bold font indicates the best result.

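Table 3 compares the running times of several commonly used algorithms on the dataset. The proposed algorithm runs comparatively fast, a clear improvement.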

Table 3 Comparison of the running times of different algorithms

Method	Hardware	Time/(s/frame)	Miss rate/%
LDCF	CPU	0.6	67.24
ACF++	Titan Z GPU	1.3	69.07
CompACT-Deep	Tesla K40 GPU	0.5	64.44
RPN+BF	Tesla K40 GPU	0.5	64.66
Proposed	GTX 1080	0.23	57.88
Note: Bold font indicates the best result.

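The experimental results show that, on the pedestrian dataset used in this paper, combining the RPN with the deep residual network and fusing multi-layer features detects multi-scale pedestrians better than the region proposal network alone, with a satisfactory running time, which indicates that the proposed algorithm can detect pedestrians of various sizes well.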

4 Conclusion

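To address the sharp drop in small-scale pedestrian detection performance, a method that fuses high-level semantic information with low-level detail features is proposed, so that the features at all scales carry rich semantic information and pedestrians of different scales are detected well. High-level features have low resolution and strong semantic information, whereas low-level features have high resolution and weak semantic information. Fusing the two strengthens the high-level features and also gives them richer information about target position. The procedure is as follows. The semantically rich feature P4 is used directly to detect large-scale pedestrians. P4 is then enlarged by a factor of two with nearest-neighbor upsampling and fused with the higher-resolution R3 features, and the fused feature P3 is used to detect medium-scale pedestrians. Finally, P3 is enlarged by a factor of two and fused with the R2 features, which have the highest resolution and the richest detail, and the fused feature P2 is used to detect small-scale pedestrians.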

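The proposed algorithm detects pedestrians of different scales on feature maps from different layers. Because the anchor boxes constrain the search, it is not an unconstrained multi-scale detection and does not add a large amount of useless computation. Experiments on the Caltech pedestrian dataset show that the algorithm detects small-scale pedestrians fairly accurately and outperforms current mainstream algorithms. On medium- and large-scale pedestrians it is no worse than other algorithms, and its overall multi-scale detection performance improves on the mainstream algorithms.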

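Limitations and future work: this paper handles the excess of negative samples by random downsampling, which can prevent some hard examples from taking part in training and lower detection performance. Using a cascaded RPN that first filters out easily classified samples and then performs fine classification remains for future research.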

References

[1] Su S Z, Li S Z, Chen S Y, et al. A survey on pedestrian detection[J]. Acta Electronica Sinica, 2012, 40(4): 814–820. [DOI:10.3969/j.issn.0372-2112.2012.04.031]

[2] Dollár P, Wojek C, Schiele B, et al. Pedestrian detection:an evaluation of the state of the art[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(4): 743–761. [DOI:10.1109/TPAMI.2011.155]

[3] Benenson R, Omran M, Hosang J, et al. Ten years of pedestrian detection, what have we learned?[C]//Proceedings of European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014: 613-627.[DOI: 10.1007/978-3-319-16181-5_47]

[4] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005: 886-893.[DOI: 10.1109/CVPR.2005.177]

[5] Tian X X, Bao H, Xu C. Improved HOG algorithm of pedestrian detection[J]. Computer Science, 2014, 41(9): 320–324. [DOI:10.11896/j.issn.1002-137X.2014.09.062]

[6] Che Z F, Miao Z J, Wang M S. Investigation and application of pedestrian detection in metro video monitoring system[J]. Modern Urban Transit, 2010(2): 31–33, 36. [DOI:10.3969/j.issn.1672-7533.2010.02.011]

[7] Cao J L, Pang Y W, Li X L. Pedestrian detection inspired by appearance constancy and shape symmetry[J]. IEEE Transactions on Image Processing, 2016, 25(12): 5538–5551. [DOI:10.1109/TIP.2016.2609807]

[8] Dollár P, Belongie S, Perona P. The fastest pedestrian detector in the west[C]. The British Machine Vision Conference, 2010.

[9] Nam W, Dollár P, Han J H. Local decorrelation for improved pedestrian detection[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2014: 424-432.

[10] Zeng J X, Cheng X. Pedestrian detection combined with single and couple pedestrian DPM models in traffic scene[J]. Acta Electronica Sinica, 2016, 44(11): 2668–2675. [DOI:10.3969/j.issn.0372-2112.2016.11.015]

[11] Wu Z. Occluded pedestrian detection that combines multiple classifiers in street scene[D]. Nanchang: Nanchang Hangkong University, 2016. http://xuewen.cnki.net/CMFD-1017711737.nh.html

[12] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA: Curran Associates Inc., 2012: 1097-1105.

[13] Ma Z G, Gao P P. Research on the cascade pedestrian detection model based on LDCF and CNN[C]//Proceedings of 2018 IEEE International Conference on Big Data and Smart Computing. Shanghai, China: IEEE, 2018: 314-320.[DOI: 10.1109/BigComp.2018.00053]

[14] Wen Z, Du Y H, Yoshida T, et al. DRI-RCNN:an approach to deceptive review identification using recurrent convolutional neural network[J]. Information Processing & Management, 2018, 54(4): 576–592. [DOI:10.1016/j.ipm.2018.03.007]

[15] Zhao X T, Li W, Zhang Y F, et al. A faster RCNN-based pedestrian detection system[C]//Proceedings of 2016 IEEE 84th Vehicular Technology Conference. Montreal, QC, Canada: IEEE, 2016: 1-5.[DOI: 10.1109/VTCFall.2016.7880852]

[16] Luo J, Zeng J X, Leng L, et al. Pedestrian detection based on improved region proposal network[J]. Journal of Nanchang Hangkong University:Natural Sciences, 2018, 32(2): 1–7, 43. [DOI:10.3969/j.issn.1001-4926.2018.02.001]

[17] Xiang X Z, Lv N, Guo X L, et al. Engineering vehicles detection based on modified faster R-CNN for power grid surveillance[J]. Sensors, 2018, 18(7): 2258. [DOI:10.3390/s18072258]

[18] Zhang X G, Chen G Y, Saruta K, et al. Deep convolutional neural networks for all-day pedestrian detection[M]//Kim K, Joukov N. Information Science and Applications 2017. Singapore: Springer, 2017: 171-178.[DOI: 10.1007/978-981-10-4154-9_21]

[19] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 770-778.[DOI: 10.1109/CVPR.2016.90]

[20] Ren S, He K, Girshick R, et al. Faster R-CNN:towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015, 39(6): 1137–1149. [DOI:10.1109/TPAMI.2016.2577031]

[21] Wojek C, Dollar P, Schiele B. Pedestrian detection:an evaluation of the state of the art[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(4): 743–761. [DOI:10.1109/TPAMI.2011.155]

[22] Li S H, Lin J Z, Li G Q, et al. Vehicle type detection based on deep learning in traffic scene[J]. Procedia Computer Science, 2018, 131: 564–572. [DOI:10.1016/j.procs.2018.04.281]