Published: 2021-01-16
DOI: 10.11834/jig.200417
2021 | Volume 26 | Number 1

Object Detection and Tracking

提升预测框定位稳定性的视频目标检测

郝腾龙, 李熙莹

1. 中山大学智能工程学院智能交通研究中心, 广州 510006;

2. 广东省智能交通系统重点实验室, 广州 510006;

3. 视频图像智能分析与应用技术公安部重点实验室, 广州 510006

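Received: 2020-07-26; Revised: 2020-10-13; Preprint: 2020-10-20

Funding: National Key Research and Development Program of China (2018YFB1601100, 2018YFB1601101)

First author: Hao Tenglong (郝腾龙), born 1998, male, master's student; research interests: intelligent transportation and computer vision. E-mail: haotlong@mail2.sysu.edu.cn

Corresponding author: Li Xiying (李熙莹), female, associate professor and master's supervisor; research interests: intelligent transportation, computer vision, and video big data. E-mail: stslxy@mail.sysu.edu.cn

CLC number: TP391

Document code: A

Article ID: 1006-8961(2021)01-0113-10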

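Abstract

Objective Most research on object detection from video focuses on improving the localization accuracy of bounding boxes, and little of it addresses localization stability. Yet bounding box stability strongly affects downstream algorithms such as multi-object tracking and vehicle driving control. To improve it, this paper proposes an expanded non-maximum suppression (Exp_NMS) method and a frame bounding box smooth (FBBS) strategy.

Method The detection stage uses the YOLO (you only look once) v3 network. In the non-maximum suppression stage, the output is obtained by fusing information from multiple bounding boxes, which strengthens bounding box stability in a continuous video stream. The bounding boxes are then smoothed by exploiting the association between adjacent video frames, which further improves stability.

Result Experiments are carried out on the UA-DETRAC (University at Albany detection and tracking benchmark dataset) dataset, with a Kalman-filter multi-object tracking algorithm used for auxiliary verification. On top of the MOT (multiple object tracking) metrics, an average track-tortuosity (AT) metric is designed to quantify bounding box stability and the smoothness of tracking trajectories. The results show that the proposed method barely affects localization accuracy while substantially improving stability, and tracking quality improves accordingly. On the test videos, MOTA (multiple object tracking accuracy) rises by 6.0 percentage points, IDs (identity switches) fall by 16.8%, tracking FP (false positives) errors fall by 45.83%, AT falls by 36.57%, and mAP (mean average precision) falls by only 0.07 percentage points.

Conclusion Strategies are designed from two angles, non-maximum suppression and adjacent-frame information association. Experiments confirm that the method effectively improves bounding box stability while leaving localization accuracy essentially unchanged.

Keywords

convolutional neural network; object detection from video; stability of bounding box; non-maximum suppression strategy; adjacent-frames information association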

Video object detection method for improving the stability of bounding box

Hao Tenglong, Li Xiying

1. Research Centre of Intelligent Transportation System, School of Engineering, Sun Yat-sen University, Guangzhou 510006, China;

2. Key Laboratory of Intelligent Transportation System of Guangdong Province, Guangzhou 510006, China;

3. Key Laboratory of Video and Image Intelligent Analysis and Application Technology, Ministry of Public Security, People's Republic of China, Guangzhou 510006, China

Supported by: National Key Research and Development Program of China (2018YFB1601100, 2018YFB1601101)

Abstract

Objective With the development of convolutional neural networks (CNNs), the speed and accuracy of CNN-based object detection algorithms have improved remarkably. However, when these algorithms are applied to video frame by frame, the bounding box of the same target changes abruptly between adjacent frames, which reflects poor bounding box stability. This problem has received little attention because it does not arise in single-image object detection. In object detection from video (VID), stability refers to whether the bounding box of the same target changes smoothly and uniformly over successive frames, whereas accuracy refers to the degree of overlap between the bounding box and the actual position. Mean average precision (mAP), the commonly used evaluation index, considers only accuracy and ignores stability. Yet bounding box stability is extremely important in engineering applications. In self-driving systems, system stability is directly related to driving safety. Self-driving research is now moving into the L5 stage, in which vehicle control must sense and predict the movement of surrounding vehicles and pedestrians to make decisions rather than simply react to specific external conditions. Object detection is the basic algorithm with which a self-driving system senses its surroundings. Poor stability negatively impacts every algorithm that analyzes the detection results, ultimately reducing the stability of the entire system and creating potential safety hazards. Designing strategies to solve this problem is therefore necessary. This paper proposes the expanded non-maximum suppression (Exp_NMS) and frame bounding box smoothing (FBBS) strategies.

Method The Exp_NMS and FBBS strategies are built on the YOLO (you only look once) v3 object detection algorithm. The video is sent to the YOLOv3 network frame by frame for detection; Exp_NMS then eliminates redundant bounding boxes, and FBBS smooths the results. The original NMS strategy may directly discard some bounding boxes and thereby harm stability, so Exp_NMS obtains its result by fusing the information of multiple bounding boxes. FBBS follows the idea of adjacent-frame information association, which is widely used in VID algorithms. Unlike conventional strategies, FBBS passes information between adjacent frames with least squares regression rather than with additional inputs such as optical flow. FBBS partly corrects multi-detection and missed-detection errors and has a stronger effect on the stability problem.

Result Scenarios in engineering applications are variable and complicated, so the training dataset should cover as many scenarios as possible. This paper uses MIO-TCD (miovision traffic camera dataset), collected from thousands of real traffic scenarios, as the object detection training dataset, and uses UA-DETRAC (University at Albany detection and tracking benchmark dataset) as the test dataset because MIO-TCD cannot be used to evaluate multi-object tracking results. YOLOv3 and a Kalman filter multi-object tracking algorithm are used for the verification experiments, since bounding box stability has a significant effect on tracking and most tracking algorithms are based on the Kalman filter. A parameter called average track-tortuosity (AT) is designed to measure bounding box stability and the smoothness of the tracking trajectory. The experiments show that the method significantly improves bounding box stability with almost no effect on accuracy, and that tracking accuracy improves as well. With both Exp_NMS and FBBS enabled, multiple object tracking accuracy (MOTA) increases by 6.0 percentage points and identity switches decrease by 16.8%. The number of tracking false positive errors decreases by 45.83%, AT decreases by 36.57%, and mAP decreases by only 0.07 percentage points.

Conclusion By analyzing the causes and manifestations of the bounding box stability problem, this paper designs two strategies from the perspectives of NMS and adjacent-frame information association. The experimental results show that the two strategies significantly enhance bounding box stability while leaving accuracy essentially unchanged.

Key words

convolutional neural network(CNN); object detection from video(VID); stability of bounding box; non-maximum suppression (NMS); adjacent-frames information association

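0 Introduction

The target positions output by convolutional neural network (CNN) object detectors contain pixel-level localization errors. The causes of these errors are complex, so the errors appear random. When a detector is applied to video, the errors make the bounding box jump between consecutive frames and the bounding box becomes unstable. In engineering applications this instability affects many algorithms, including multi-object tracking and behavior analysis and prediction.

Take self-driving systems, currently a popular application area, as an example: system stability bears directly on driving safety. System stability drew researchers' attention even in the early stages of self-driving development. Self-driving testing is, to some extent, testing of system stability (Yu et al., 2019), and simulation testing under extreme conditions (Mao and Liang, 2020) is organized around the stability of self-driving algorithms. Self-driving research is now moving into the L5 stage, in which vehicle control no longer simply reacts to specific external conditions but must sense and predict the motion of surrounding vehicles and pedestrians to make decisions (Houston et al., 2020). Object detection from video (VID) is the basic algorithm with which a self-driving system perceives its surroundings. If the detection results are unstable, algorithms that build on them, such as vehicle behavior analysis and vehicle behavior prediction, are seriously affected. At best this degrades the stability of the self-driving system; at worst it leads the system to wrong decisions and creates safety hazards.

Deep-learning-based video object detection has mainly grown out of object detection on static images. The stability problem does not exist for static images, so it has received little attention and lacks a direct evaluation metric. In video object detection, localization stability and localization accuracy are different performance requirements. Localization stability (stability of bounding box) refers to whether the bounding box of the same target changes smoothly and uniformly over consecutive frames. Localization accuracy (accuracy of bounding box) refers to the degree of overlap between a target's bounding box and its actual position. The most widely used evaluation metric for video object detection is mAP (mean average precision) computed from intersection over union (IOU) (Havard et al., 2017), but IOU measures localization accuracy and does not reflect localization stability.

Most video object detection algorithms target localization accuracy. T-CNN (Kang et al., 2018) introduces tracking-based correction and optical flow to reduce missed detections and improve accuracy. The FGFA (flow-guided feature aggregation) algorithm (Zhu et al., 2017) runs detection only on selected key frames and completes the feature maps of non-key frames with optical flow obtained from FlowNet (Dosovitskiy et al., 2015), using the properties of optical flow to improve accuracy. After FlowNet 2.0 was released (Ilg et al., 2017), researchers improved the rules for propagating features between frames and thereby raised both running speed and accuracy (Zhang et al., 2019). The spatial-temporal memory module (STMM) (Xiao and Lee, 2018) improves accuracy by capturing the motion information contained in the video. When such algorithms raise localization accuracy, localization stability also improves slightly. However, the price of higher accuracy is often steep: running speed, training cost, and annotation cost are all significantly affected, which lowers the practical value of the algorithm. Moreover, some engineering scenarios, such as vehicle trajectory extraction and speed calculation, do not need very high localization accuracy and only place high demands on localization stability. Using the above algorithms there raises cost and wastes part of the computing power.

This paper therefore focuses on localization stability. It adds dedicated processing strategies to the detection pipeline so that bounding box stability improves when a static-image object detector is applied to video. The proposed method has almost no effect on the running speed of the overall detection algorithm, requires no additional training data, and is highly portable.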

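1 Proposed method

Bounding boxes are unstable because deep-network object detection localizes objects with pixel-level deviations. Several factors contribute to this deviation, including training data annotation, shortcomings of the anchor mechanism, and insufficient learning capacity of the network. Because the causes are complex, it is hard to optimize for any single one, and research that starts from the causes tends to drift toward improving localization accuracy. This paper therefore does not analyze the detection network or the training data and instead optimizes the steps that follow the network output.

Building on the YOLO (you only look once) v3 object detection algorithm (Redmon and Farhadi, 2018), this paper designs an expanded non-maximum suppression (Exp_NMS) strategy and adds a frame bounding box smooth (FBBS) strategy that associates information across adjacent frames. The overall pipeline is shown in Fig. 1. The $k$-th frame is fed in, and YOLO detection followed by Exp_NMS gives the detection result for frame $k$. The FBBS strategy then smooths and corrects this result to produce the final detection.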

Fig. 1 Algorithm flow chart

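1.1 Expanded non-maximum suppression strategy

Part of the reason for poor bounding box stability is that non-maximum suppression (NMS) discards some predictions outright. As shown in Fig. 2, for low-confidence detections a tiny change in the image between consecutive frames can flip the confidence ordering of two bounding boxes. The keep/discard decision in the NMS stage then flips as well, so the detection ends up toward the upper left in frame 1 and toward the lower right in frame 2.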

Fig. 2 Reason why the original NMS reduces the stability of the bounding box

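For this reason, this paper designs Exp_NMS. Its core idea is that a low-confidence bounding box should be corrected by fusing the information of multiple bounding boxes, which increases its localization stability in a continuous video stream.

Exp_NMS has two stages. The first is the grouping stage: all bounding boxes are grouped according to a preset IOU threshold $N_{t}$, so that every box in a group has an IOU greater than the threshold with the highest-confidence box of that group.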
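The grouping stage can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the source states only the grouping criterion, so the greedy order (take the most confident remaining box as group leader), the `Box` layout, and the function names `iou` and `group_boxes` are assumptions made here. The default threshold 0.45 is the $N_{t}$ value listed in Table 2.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2, confidence)
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def group_boxes(boxes: List[Box], n_t: float = 0.45) -> List[List[Box]]:
    """Greedy grouping (assumed procedure): the most confident remaining
    box leads a group and collects every box whose IOU with it exceeds n_t."""
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    groups = []
    while remaining:
        leader = remaining.pop(0)
        members = [b for b in remaining if iou(leader, b) > n_t]
        remaining = [b for b in remaining if iou(leader, b) <= n_t]
        groups.append([leader] + members)
    return groups
```

Unlike plain NMS, the non-leader boxes are kept in the group rather than dropped, so the later expansion stage can still use them.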

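The second stage is the expansion stage, in which each group outputs one bounding box according to the following expansion rules:

1) Prescreening. Check whether the highest confidence in the group exceeds $P_{t}$. If it does, output the highest-confidence box as the result of the group. Otherwise, filter the group by removing every box whose confidence differs from the highest confidence by more than the threshold $S_{t}$, and carry out steps 2) to 4).

2) Compute the intersection region $\mathit{\boldsymbol{C}}$ of the remaining boxes.

3) Obtain the retained region $\mathit{\boldsymbol{T}}$ of each box. Suppose the group contains the two boxes shown in Fig. 3(a). The area is divided into three regions, $\mathit{\boldsymbol{C}}$, $\mathit{\boldsymbol{A}}_{1}$, and $\mathit{\boldsymbol{A}}_{2}$, and retained regions ${\mathit{\boldsymbol{T}}_{{\mathit{\boldsymbol{A}}_1}}}$ and ${\mathit{\boldsymbol{T}}_{{\mathit{\boldsymbol{A}}_2}}}$ must be found for $\mathit{\boldsymbol{A}}_{1}$ and $\mathit{\boldsymbol{A}}_{2}$ respectively. Taking $\mathit{\boldsymbol{A}}_{1}$ as an example, Eq. (1) gives the confidence $Pr_{{\mathit{\boldsymbol{A}}_1}}$ of region $\mathit{\boldsymbol{A}}_{1}$, Eq. (2) then gives the area of ${\mathit{\boldsymbol{T}}_{{\mathit{\boldsymbol{A}}_1}}}$, and Eq. (3) finally gives the position of ${\mathit{\boldsymbol{T}}_{{\mathit{\boldsymbol{A}}_1}}}$. The result is the red region in Fig. 3(b).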

Fig. 3 Exp_NMS expansion stage

((a) region division; (b) allocation of T)

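4) Compute the minimum bounding rectangle of all $\mathit{\boldsymbol{T}}$ regions and the $\mathit{\boldsymbol{C}}$ region and use it as the position of the output bounding box, shown as the black rectangle in Fig. 4. The output box inherits the highest confidence in the group.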

Fig. 4 Exp_NMS result
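Steps 1), 2), and 4) of the expansion rules are simple box operations and can be sketched as below. This is an illustrative sketch under assumptions: the box layout and the function names `prescreen`, `intersection`, and `bounding_rectangle` are hypothetical, regions are passed to step 4) as their own axis-aligned extents, and the defaults 0.9 and 0.01 are the $P_{t}$ and $S_{t}$ values from Table 2. Step 3), which places the retained regions, is not covered here.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x1, y1, x2, y2, confidence)
Box = Tuple[float, float, float, float, float]
Rect = Tuple[float, float, float, float]


def prescreen(group: List[Box], p_t: float = 0.9, s_t: float = 0.01):
    """Step 1: return (direct_output, kept_boxes).

    direct_output is the top box when its confidence exceeds p_t;
    otherwise it is None and kept_boxes holds the boxes to be fused."""
    top = max(group, key=lambda b: b[4])
    if top[4] > p_t:
        return top, []
    kept = [b for b in group if top[4] - b[4] <= s_t]
    return None, kept


def intersection(boxes: List[Box]) -> Optional[Rect]:
    """Step 2: common region C of the kept boxes, or None if it is empty."""
    x1 = max(b[0] for b in boxes)
    y1 = max(b[1] for b in boxes)
    x2 = min(b[2] for b in boxes)
    y2 = min(b[3] for b in boxes)
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


def bounding_rectangle(regions: List[Rect]) -> Rect:
    """Step 4: minimum axis-aligned rectangle enclosing all given regions."""
    return (min(r[0] for r in regions), min(r[1] for r in regions),
            max(r[2] for r in regions), max(r[3] for r in regions))
```

With $S_{t}$ = 0.01, only boxes whose confidence is nearly tied with the group maximum take part in the fusion, which is exactly the situation in Fig. 2 where the confidence ordering can flip from one frame to the next.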

$ \left\{ {\begin{array}{*{20}{l}} {\mathit{P}{\mathit{r}_{{A_1}}} = \frac{{\mathit{P}{\mathit{r}_{b1}} \times \left({S\left({{\mathit{\boldsymbol{A}}_1}} \right) + S(\mathit{\boldsymbol{C}})} \right) - \mathit{P}{\mathit{r}_C} \times S(\mathit{\boldsymbol{C}})}}{{S\left({{\mathit{\boldsymbol{A}}_1}} \right)}}}\\ {\mathit{P}{\mathit{r}_C} = 1} \end{array}} \right. $

(1)

where $\mathit{P}{\mathit{r}_{{\mathit{\boldsymbol{A}}_1}}}$ is the confidence of region $\mathit{\boldsymbol{A}}_{1}$, the unknown solved for in Eq. (1); $\mathit{P}\mathit{r}_{C}$ is the confidence of region $\mathit{\boldsymbol{C}}$, set to 1; $\mathit{P}{\mathit{r}_{b1}}$ is the confidence of bbox 1 as output by the YOLOv3 network, 0.7 in Fig. 3(a); ${S\left({{\mathit{\boldsymbol{A}}_1}} \right)}$ is the area of region $\mathit{\boldsymbol{A}}_{1}$; and ${S\left({\mathit{\boldsymbol{C}}} \right)}$ is the area of region $\mathit{\boldsymbol{C}}$.

$ S\left({{\mathit{\boldsymbol{T}}_{{A_1}}}} \right) = \mathit{P}{\mathit{r}_{{A_1}}} \times S\left({{\mathit{\boldsymbol{A}}_1}} \right) $

(2)

$ \left\{ \begin{array}{l} \frac{{{w_t}}}{{{w_{b1}} - {w_C}}} = \frac{{{h_t}}}{{{h_{b1}} - {h_C}}}\\ 0 < {w_t} < {w_{b1}} - {w_C}\\ 0 < {h_t} < {h_{b1}} - {h_C} \end{array} \right. $

(3)

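where $w_{b1}$ and $h_{b1}$ are the width and height of bbox 1, $w_{C}$ and $h_{C}$ are the width and height of region $\mathit{\boldsymbol{C}}$, and $w_{t}$ and $h_{t}$ are the unknowns, whose meaning is shown in Fig. 3(b).

The core idea of Exp_NMS is to fuse the information of multiple bounding boxes, which is done by dividing the boxes into regions and computing a confidence for each region. For the sake of localization stability, the confidence of a region is taken to equal the fraction of the region's area occupied by the true target, which is the relation in Eq. (2). For bbox 1 in Fig. 3(a), dividing the box into the two regions $\mathit{\boldsymbol{A}}_{1}$ and $\mathit{\boldsymbol{C}}$ gives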

$ \begin{array}{*{20}{c}} {\mathit{P}{\mathit{r}_{b1}} \times \left({S\left({{\mathit{\boldsymbol{A}}_1}} \right) + S(\mathit{\boldsymbol{C}})} \right) = }\\ {\mathit{P}{\mathit{r}_C} \times S(\mathit{\boldsymbol{C}}) + \mathit{P}{\mathit{r}_{{A_1}}} \times S\left({{\mathit{\boldsymbol{A}}_1}} \right)} \end{array} $

(4)

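Solving for ${\mathit{P}{\mathit{r}_{{A_1}}}}$ still requires the value of $\mathit{P}\mathit{r}_{C}$. Assuming that region $\mathit{\boldsymbol{C}}$ consists entirely of the true target, the relation in Eq. (2) gives $\mathit{P}\mathit{r}_{C}$ = 1, and rearranging Eq. (4) yields Eq. (1). Equations (1) and (2) then give the area of ${\mathit{\boldsymbol{T}}_{{\mathit{\boldsymbol{A}}_1}}}$, and Eq. (3) allocates ${\mathit{\boldsymbol{T}}_{{\mathit{\boldsymbol{A}}_1}}}$ so that it follows the aspect ratio of region $\mathit{\boldsymbol{A}}_{1}$.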
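As a numerical check of Eqs. (1) and (2), the sketch below evaluates both for one made-up case. The function names and the areas 60 and 40 are illustrative assumptions; only the confidence 0.7 comes from Fig. 3(a). The source does not say how a negative result from Eq. (1) is handled, which can occur when $\mathit{P}{\mathit{r}_{b1}}$ is small relative to the share of $\mathit{\boldsymbol{C}}$ in the box, so the sketch returns the raw value.

```python
def region_confidence(pr_b1: float, s_a1: float, s_c: float, pr_c: float = 1.0) -> float:
    """Eq. (1): confidence of region A1, with Pr_C fixed to 1 by default."""
    return (pr_b1 * (s_a1 + s_c) - pr_c * s_c) / s_a1


def retained_area(pr_a1: float, s_a1: float) -> float:
    """Eq. (2): area of the retained region T_A1."""
    return pr_a1 * s_a1


# Illustrative numbers: bbox 1 has confidence 0.7 and splits into
# A1 (area 60) and C (area 40); the areas are made up for this example.
pr_a1 = region_confidence(0.7, 60.0, 40.0)   # (0.7 * 100 - 40) / 60
s_t_a1 = retained_area(pr_a1, 60.0)          # Pr_A1 * S(A1)
print(pr_a1, s_t_a1)
```

Substituting Eq. (1) into Eq. (2) with $\mathit{P}\mathit{r}_{C}$ = 1 shows that the retained area is simply $\mathit{P}{\mathit{r}_{b1}} \times (S(\mathit{\boldsymbol{A}}_1) + S(\mathit{\boldsymbol{C}})) - S(\mathit{\boldsymbol{C}})$: the confidence-weighted area of the whole box minus the part already covered by $\mathit{\boldsymbol{C}}$.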

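1.2 Frame bounding box smooth strategy

In video, bounding box instability appears as large changes in the bounding box of the same target between adjacent frames. Associating the information of preceding and following frames (adjacent-frames information association) can therefore improve stability effectively. Following this idea, this paper designs the FBBS strategy. The overall approach is to correct the detection result of the current frame with the detection results of the two preceding and two following frames. The steps are:

1) Bounding box grouping. All detection results of five consecutive frames after Exp_NMS are grouped according to a preset IOU threshold $BN_{t}$, so that boxes of the same group that lie in adjacent frames have an IOU greater than $BN_{t}$. Ideally a group contains five boxes, one per frame, all belonging to the same target.

2) Multi-detection rejection. Groups with fewer than three boxes are removed. In such a group the target was not detected in three or more of the five frames, so the group is regarded as a multi-detection error of the detector. Steps 3) to 5) are then carried out for each remaining group.

3) Bounding box sequence extraction. All boxes in the group form the point set $\left\{ {\left({{n_{{\rm{frame}}{\kern 1pt} {\rm{1}}}}, {x_1}} \right), \left({{n_{{\rm{frame}}{\kern 1pt} 2}}, {x_2}} \right), \left({{n_{{\rm{frame}}{\kern 1pt} 3}}, {x_3}} \right), \cdots } \right\}$, where ${n_{{\rm{frame}}{\kern 1pt} i}}$ is the index of the frame containing the $i$-th box and $x_{i}$ is the $x$ coordinate of the center of the $i$-th box.

4) Straight-line trajectory fitting. A straight line is fitted to the point set from step 3), giving slope $k$ and intercept $b$. The corrected $x$ coordinate of the box center in the middle frame is $\hat x = k \times {n_{{\rm{frame\;mid}}}} + b$, where $n_{\rm{frame\;mid}}$ is the index of the middle frame.

5) Bounding box position correction. Steps 3) and 4) are repeated for the center $y$ coordinate, the width $w$, and the height $h$ to obtain the corrected $\hat y$, $\hat w$, and $\hat h$ for the middle frame. The tuple $\left({\hat x, \hat y, \hat w, \hat h} \right)$ is the corrected box position, and the confidence, class, and other attributes are inherited from the highest-confidence box in the group.

FBBS is an adjacent-frame information association strategy that needs no additional optical flow input. While improving bounding box stability, it also partly fills in detections missed in some frames and filters out spurious detections that appear in only some frames.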
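Steps 2) to 5) for a single group can be sketched as follows, assuming the grouping of step 1) has already been done. The English abstract states that FBBS uses least squares regression, so the line fit below is ordinary least squares written out by hand. The tuple layout `(frame, cx, cy, w, h)` and the names `fit_line` and `smooth_middle_box` are assumptions for illustration, not the authors' code.

```python
from typing import List, Optional, Tuple

# Assumed layout of one detection in a group: (frame, cx, cy, w, h)
Detection = Tuple[int, float, float, float, float]


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares fit y = k * x + b; returns (k, b).
    Requires at least two distinct x values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    k = sxy / sxx
    return k, mean_y - k * mean_x


def smooth_middle_box(group: List[Detection], mid_frame: int) -> Optional[Tuple[float, ...]]:
    """Return the corrected (cx, cy, w, h) for the middle frame, or None
    when the group has fewer than 3 boxes (treated as a multi-detection)."""
    if len(group) < 3:
        return None
    corrected = []
    for idx in range(1, 5):  # cx, cy, w, h are fitted independently
        k, b = fit_line([(det[0], det[idx]) for det in group])
        corrected.append(k * mid_frame + b)
    return tuple(corrected)
```

Because the fitted line is evaluated at the middle frame index regardless of whether that frame contributed a box, a group with detections in frames 1, 2, 4, and 5 still yields a box for frame 3, which is how the strategy can fill in an isolated missed detection.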

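1.3 Comparison of algorithm effect

Fig. 5 draws the bounding boxes of the same vehicle over five consecutive frames in a single image. The correction is most visible at the left border of the boxes: after correction the boxes change more smoothly and localization stability is higher.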

Fig. 5 Comparison before and after smooth correction

((a) result before smooth correction; (b) result after smooth correction)

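Bounding box stability shows up directly in the coordinate trajectory of a target over consecutive frames: the poorer the stability, the more jagged the trajectory, and the higher the stability, the smoother and more uniform the trajectory. To make the effect more visible, the trajectory of the box center and the trajectories of the box corners for the vehicle in Fig. 5 are plotted as line charts over 20 consecutive frames, in Fig. 6 and Fig. 7 respectively. Both figures show that the corrected curves are smoother and change more uniformly, which indicates higher stability.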

Fig. 6 Comparison of the center point trajectory before and after correction

Fig. 7 Comparison of the bounding box trajectory before and after correction

((a) coordinate line chart of the top-left point; (b) coordinate line chart of the top-right point; (c) coordinate line chart of the bottom-left point; (d) coordinate line chart of the bottom-right point)

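2 Experimental design

2.1 Dataset selection

The datasets used in this paper are UA-DETRAC (University at Albany detection and tracking benchmark dataset) (Lyu et al., 2018) and MIO-TCD (miovision traffic camera dataset) (Luo et al., 2018).

UA-DETRAC is a multi-task dataset consisting of 100 real traffic videos recorded in 24 scenes. It is annotated with vehicle positions, vehicle classes, and vehicle tracking IDs, and can be used to verify both detection localization accuracy and multi-object tracking quality.

MIO-TCD is a traffic object detection dataset consisting of images collected in thousands of real traffic scenes. It is annotated with vehicle positions and vehicle classes and can be used to verify detection localization accuracy.

UA-DETRAC contains relatively few traffic scenes, and its training and validation sets are highly similar. Training the detection network on UA-DETRAC and then validating the proposed algorithm on it would not reflect practical engineering use, because most engineering applications involve many scenes and it is hard to annotate large amounts of data for each specific scene. This paper therefore uses MIO-TCD as the training set for YOLOv3 and selects the UA-DETRAC videos with a stable camera and many vehicles as the validation set. The selected validation files are listed in Table 1.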

Table 1 Validation set file list

Validation set files used in the experiments
MVI_20035	MVI_39401	MVI_40131	MVI_40181
MVI_40201	MVI_40714	MVI_40742	MVI_40771
MVI_40775	MVI_40793	MVI_40851	MVI_40852
MVI_40855	MVI_40864	MVI_40871	MVI_40891
MVI_40903	MVI_40905	MVI_40981	MVI_41063
MVI_63563	MVI_40191	MVI_40772	MVI_40854
MVI_40901	MVI_63553

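Because the two datasets use different vehicle class labels, class information is erased from the output of the detection network and from the validation set.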

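2.2 Evaluation metrics

No established metric measures bounding box stability, so this paper designs its evaluation metrics by indirect verification, based on the two main ways in which instability manifests itself.

1) Tracking drift. Multi-object tracking usually predicts target trajectories with a Kalman filter. Unstable bounding boxes across frames easily cause the Kalman prediction to drift, which ultimately degrades the tracking result. This paper therefore uses the MOT metrics from the multi-object tracking field to evaluate the output of the Kalman tracking algorithm and thus indirectly verify the improvement in bounding box stability.

2) Trajectory jumps. A vehicle trajectory in video is the polyline obtained by connecting, frame by frame, the center points of the bounding boxes that share a tracking ID, and unstable boxes reduce the smoothness of this trajectory. As shown in Fig. 8, a single vehicle trajectory is a polyline made of several line segments, and abrupt slope changes create angles between consecutive segments, marked ${\theta _1}, {\theta _2}, {\theta _3}$. Real vehicle trajectories are straight lines or smooth curves and contain no such angles. Based on this analysis, the track-tortuosity $T$ is defined to measure the smoothness of a vehicle trajectory, computed as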

$ T = \frac{{\sum\limits_{i = 1}^{a - 1} {{\theta _{i + 1, i}}} }}{{a - 1}} $

(5)

Fig. 8 Vehicle trajectory in the video

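where $T$ is the track-tortuosity, $a$ is the total number of line segments in the vehicle's trajectory (each segment connects the center points of the boxes with the same tracking ID in adjacent frames), and $\theta _{i + 1, i}$ is the angle between the $(i+1)$-th segment and the $i$-th segment.

By Eq. (5), when the vehicle trajectory is a straight line or a smooth curve there are no angles caused by slope changes and the tortuosity is 0°. The more severe the slope changes, the larger the tortuosity, up to a maximum of 180°.

To evaluate the smoothness of the trajectories of multiple targets in a video, the average track-tortuosity $AT$ is defined as the mean tortuosity over all trajectories extracted within a given video segment, computed as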

$ AT = \frac{{\sum\limits_{i = 1}^n {{T_i}} }}{n} $

(6)

where $T_{i}$ is the track-tortuosity of the $i$-th vehicle and $n$ is the total number of vehicle trajectories.
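Eqs. (5) and (6) can be sketched as below. The angle between consecutive segments is taken here as the angle between their direction vectors, which gives 0° for a straight track and 180° for a full reversal, in line with the range stated above. The function names are hypothetical, and two choices are assumptions not specified in the source: zero-length segments (a center that does not move between frames) are skipped, and a track with fewer than two usable segments gets a tortuosity of 0.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def track_tortuosity(centers: List[Point]) -> float:
    """Eq. (5): mean angle in degrees between consecutive trajectory segments."""
    segments = [(x2 - x1, y2 - y1)
                for (x1, y1), (x2, y2) in zip(centers, centers[1:])]
    angles = []
    for (ux, uy), (vx, vy) in zip(segments, segments[1:]):
        norm_u, norm_v = math.hypot(ux, uy), math.hypot(vx, vy)
        if norm_u == 0 or norm_v == 0:
            continue  # assumption: ignore segments with no movement
        cos_theta = (ux * vx + uy * vy) / (norm_u * norm_v)
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        angles.append(math.degrees(math.acos(cos_theta)))
    return sum(angles) / len(angles) if angles else 0.0


def average_tortuosity(tracks: List[List[Point]]) -> float:
    """Eq. (6): mean track-tortuosity over all trajectories."""
    values = [track_tortuosity(track) for track in tracks]
    return sum(values) / len(values)
```

For example, a straight track such as `[(0, 0), (1, 0), (2, 0)]` scores 0°, a single right-angle turn scores 90°, and averaging the two tracks gives an AT of 45°.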

In addition, mAP is used to measure how the proposed algorithm changes localization accuracy.

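2.3 Algorithm selection and experimental environment

For detection, this paper uses the YOLOv3 network, which is currently one of the most widely used detectors in engineering applications. To show the trend in bounding box stability more directly and avoid interference from other factors, tracking uses a multi-object tracking algorithm that combines a Kalman filter with Hungarian matching.

The hardware environment is an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz with an RTX2060 GPU. The software environment is the Windows operating system, and the algorithms are implemented in C++.

The training and validation datasets were downloaded from the official websites of the respective datasets. YOLOv3 is trained from the official pretrained weights. Training and experimental parameters are listed in Table 2, and the sources of the algorithms and evaluation code are listed in Table 3.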

Table 2 Experimental parameter list

Module	Parameter	Value
YOLOv3 training parameters	batch	48
	subdivisions	16
	learning_rate	0.001
YOLOv3 inference parameters	bbox confidence threshold	0.2
NMS parameters	IOU threshold	0.45
Exp_NMS parameters	P_t	0.9
	N_t	0.45
	S_t	0.01
FBBS parameters	BN_t	0.5

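Table 3 Algorithm code source list

Algorithm or evaluation tool	Source	Language
YOLOv3	Darknet framework (Alexey et al., 2020)	C++
Kalman tracking	Darknet framework (Alexey et al., 2020)	C++
mAP computation and TP/FP counting tool	Open-source mAP tool (Cartucho et al., 2018)	Python
MOT evaluation tool	Downloaded from the MOT website (Milan et al., 2016)	MATLAB
AT computation tool	Written by the authors	MATLAB
Note: TP (true positives) is the number of samples detected as positive that are actually positive; FP (false positives) is the number of samples detected as positive that are actually negative; MOT (multiple object tracking) refers to the multi-object tracking evaluation metrics.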

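3 Experimental results and analysis

3.1 mAP analysis

Table 4 lists the mAP obtained with each combination of strategies. Overall, every strategy has only a small effect on mAP, which indicates that the proposed method barely affects localization accuracy.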

Table 4 mAP results of different strategy combinations (%)

Metric	NMS	Exp_NMS	FBBS+NMS	FBBS+Exp_NMS
mAP₅₀	74.86	74.96	74.70	74.73
mAP₅₅	73.01	73.09	72.86	72.85
mAP₆₀	70.27	70.27	70.16	70.07
mAP₆₅	66.36	66.20	66.31	66.07
mAP₇₀	60.30	60.11	60.10	59.90
mAP₇₅	51.12	50.90	50.97	50.82
mAP₈₀	35.78	35.56	35.93	35.85
mAP₈₅	16.67	16.49	17.14	17.11
mAP₉₀	2.97	2.94	3.28	3.28
mAP₉₅	0.04	0.04	0.05	0.05
mAP_{50:95}	45.14	45.06	45.15	45.07

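Exp_NMS has little effect on localization accuracy for two reasons. First, accuracy depends mainly on the detection network and the training method, and NMS itself plays a minor role. Second, compared with NMS, Exp_NMS keeps the same basic idea of removing redundant boxes and does not adopt the dense-target improvements of Soft-NMS (Bodla et al., 2017) or DIoU-NMS (Zheng et al., 2019). It only adjusts the result slightly for the sake of stability, and the adjustment is too small to change accuracy substantially.

The FBBS strategy partly filters out spurious detections in some frames and fills in missed detections in others, so in theory it should raise mAP. However, its smoothing considers only stability and not accuracy, so the corrected box may drift away from the true target and lower mAP. This is reflected in the TP (true positive) and FP (false positive) counts in Table 5: FBBS markedly reduces FP errors, but the TP count is also affected.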

Table 5 Number of TP and FP

Metric	Strategy combination	TP count	FP count
mAP₅₀	NMS	381 706	63 320
mAP₅₀	NMS+FBBS	380 567	32 764
mAP₇₅	NMS	288 536	156 440
mAP₇₅	NMS+FBBS	286 364	126 967

3.2 MOT and AT analysis

Bounding box stability is reflected mainly in the MOT metrics and the AT value. The results for each combination of strategies are given in Table 6. The metrics are MOTA (multiple object tracking accuracy), MOTP (multiple object tracking precision), MT (mostly tracked targets), ML (mostly lost targets), IDs (identity switches), FM (number of fragmentations), FP (false positives), and FN (false negatives).

Table 6 MOT evaluation index results

Metric	NMS	Exp_NMS	FBBS+NMS	FBBS+Exp_NMS
MOTA/%↑	59.9	62.3	65.6	65.9
MOTP/%↑	79.2	79.3	79.2	79.2
MT↑	1 757	1 763	1 741	1 740
ML↓	189	186	191	194
IDs↓	12 278	13 951	10 472	10 215
FM↓	4 194	4 181	2 271	2 222
FP↓	65 317	51 788	36 809	35 385
FN↓	123 987	123 518	125 640	125 750
AT/(°)↓	77.03	66.90	50.70	48.86
Note: ↑ means higher is better; ↓ means lower is better.

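For the MOT metrics, comparing FBBS+Exp_NMS with the NMS baseline in Table 6, MOTA rises by 6.0 percentage points (from 59.9% to 65.9%), MOTP shows no notable change, IDs fall by 16.8%, and FP errors fall by 45.83%, so tracking quality improves substantially. Given that mAP shows no notable change, the gain in tracking quality can be attributed to better bounding box stability, which makes the Kalman filter predictions more accurate.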
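The quoted changes can be reproduced from the NMS and FBBS+Exp_NMS columns of Table 6 and from the mAP_{50:95} row of Table 4. The short script below is only an arithmetic check; the helper name `relative_drop` is an assumption. It also makes explicit that the MOTA and mAP figures are differences in percentage points, whereas the IDs, FP, and AT figures are relative reductions.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Values copied from Table 6 (NMS vs. FBBS+Exp_NMS) and Table 4.
ids_drop = relative_drop(12278, 10215)   # identity switches
fp_drop = relative_drop(65317, 35385)    # tracking false positives
at_drop = relative_drop(77.03, 48.86)    # average track-tortuosity
mota_gain = 65.9 - 59.9                  # percentage points
map_loss = 45.14 - 45.07                 # percentage points

print(f"IDs -{ids_drop:.1f}%  FP -{fp_drop:.2f}%  AT -{at_drop:.2f}%")
print(f"MOTA +{mota_gain:.1f} points  mAP -{map_loss:.2f} points")
```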

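For the AT metric, the average track-tortuosity decreases markedly, from 77.03° to 48.86°. The trajectories obtained with the proposed algorithm are therefore smoother, and bounding box stability is improved.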

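4 Conclusion

To address poor bounding box stability in video object detection, this paper designs the Exp_NMS and FBBS strategies, which improve localization stability without significantly affecting localization accuracy.

Bounding box stability manifests itself over a continuous video stream and is hard to quantify. This paper therefore defines the average track-tortuosity $AT$ as a metric for quantitative analysis of stability. It also applies indirect verification: the effect of bounding box stability on the Kalman filter and on the smoothness of vehicle tracking trajectories indirectly demonstrates that the proposed algorithm improves stability. UA-DETRAC videos with a stable camera and many vehicles were selected as the validation set, MIO-TCD was used as the training set of the detection network, and experiments were run with the YOLOv3 detector and a Kalman filter tracker. With Exp_NMS and FBBS used together, tracking quality improves and AT decreases markedly, which shows that bounding box stability is improved.

Future work will focus on improving the FBBS strategy. Because the current FBBS has no optical flow or other additional input, the smoothed bounding box may shift slightly, which slightly harms localization accuracy. The next step is to consider how to add extra input to mitigate this drawback while keeping the running speed within a range acceptable for engineering applications.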

References

Alexey, Redmon J, Sinigardi S, Cyy, Hager T, Zhang V, Maaz M, IlyaOvodov, Kahn P, Veitch-Michaelis J, Dujardin A, Duohappy, Acxz, Aughey J, Özipek E, White J, Smith D, Aven, Shibata T K C, Giordano M, Daras G, Hagege R, Gąsiorzewski B, Babaei A, Vhavle H, Arends E, Cho D C, Lin C H, Baranski A and 7FM. 2020. AlexeyAB/darknet: YOLOv4 pre-release[CP/OL].[2020-07-09]. http://doi.org/10.5281/zenodo.3829035

Bodla N, Singh B, Chellappa R and Davis L S. 2017. Soft-NMS-improving object detection with one line of code//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5562-5570[DOI:10.1109/ICCV.2017.593]

Cartucho J, Ventura R and Veloso M. 2018. Robust object recognition through symbiotic deep learning in mobile robots//Proceedings of 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems. Madrid, Spain: IEEE: 2336-2341[DOI:10.1109/IROS.2018.8594067]

Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D and Brox T. 2015. FlowNet: learning optical flow with convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2758-2766[DOI:10.1109/ICCV.2015.316]

Havard W, Besacier L and Rosec O. 2017. SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO dataset[EB/OL].[2020-07-09]. https://arxiv.org/pdf/1707.08435.pdf

Houston J, Zuidhof G, Bergamini L, Ye Y W, Jain A, Omari S, Iglovikov V and Ondruska P. 2020. One thousand and one hours: self-driving motion prediction dataset[EB/OL].[2020-07-09]. https://arxiv.org/pdf/2006.14480v1.pdf

Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A and Brox T. 2017. FlowNet 2.0: evolution of optical flow estimation with deep networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1647-1655[DOI:10.1109/CVPR.2017.179]

Kang K, Li H S, Yan J J, Zeng X Y, Yang B, Xiao T, Zhang C, Wang Z, Wang R H, Wang X G and Ouyang W L. 2018. T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10): 2896-2907 [DOI:10.1109/TCSVT.2017.2736553]

Luo Z M, Branchaud-Charron F, Lemaire C, Konrad J, Li S Z, Mishra A, Achkar A, Eichel J and Jodoin P M. 2018. MIO-TCD: a new benchmark dataset for vehicle classification and localization. IEEE Transactions on Image Processing, 27(10): 5129-5141 [DOI:10.1109/TIP.2018.2848705]

Lyu S, Chang M C, Du D W, Li W B, Wei Y, del Coco M, Carcagn P, Schumann A, Munjal B, Dang D Q T, Choi D H, Bochinski E, Galasso F, Bunyak F, Seetharaman G, Baek J W, Lee J T, Palaniappan K, Lim K T, Moon K, Kim K J, Sommer L, Brandlmaier M, Kang M S, Jeon M, Al-Shakarji N M, Acatay O, Kim P K, Amin S, Sikora T, Dinh T, Senst T, Che V G H, Lim Y C, Song Y M and Chung Y S. 2018. UA-DETRAC 2018: report of AVSS2018 & IWT4S challenge on advanced traffic monitoring//Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance. Auckland, New Zealand: IEEE: 1-6[DOI:10.1109/AVSS.2018.8639089]

Mao T, Liang W. 2020. The design and evaluation of abnormal event generation system for autonomous driving algorithms. Transactions of Beijing Institute of Technology, 40(7): 753-759 (毛婷, 梁玮. 2020. 自动驾驶算法的异常事件生成系统设计与评估. 北京理工大学学报, 40(7): 753-759) [DOI:10.15918/j.tbit1001-0645.2019.025]

Milan A, Leal-Taixé L, Reid I, Roth S and Schindler K. 2016. MOT16: a benchmark for multi-object tracking[EB/OL].[2020-07-09]. https://arxiv.org/pdf/1603.00831.pdf

Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL].[2019-08-20]. https://arxiv.org/pdf/1804.02767.pdf

Xiao F Y and Lee Y J. 2018. Video object detection with an aligned spatial-temporal memory[EB/OL].[2020-08-29]. https://arxiv.org/pdf/1712.06317.pdf

Yu Z P, Xing X Y, Chen J Y. 2019. Review on automated vehicle testing technology and its application. Journal of Tongji University (Natural Science), 47(4): 540-547 (余卓平, 邢星宇, 陈君毅. 2019. 自动驾驶汽车测试技术与应用进展. 同济大学学报(自然科学版), 47(4): 540-547) [DOI:10.11908/j.issn.0253-374x.2019.04.013]

Zhang S Y, Wang T, Wang C Y, Wang Y, Shan G C and Snoussi H. 2019. Video object detection base on RGB and optical flow analysis//Proceedings of the 2nd China Symposium on Cognitive Computing and Hybrid Intelligence. Xi'an, China: IEEE: 280-284[DOI:10.1109/CCHI.2019.8901921]

Zheng Z H, Wang P, Liu W, Li J Z, Ye R G and Ren D W. 2019. Distance-IoU loss: faster and better learning for bounding box regression[EB/OL].[2020-07-09]. https://arxiv.org/pdf/1911.08287.pdf

Zhu X Z, Wang Y J, Dai J F, Yuan L and Wei Y C. 2017. Flow-guided feature aggregation for video object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 408-417[DOI:10.1109/ICCV.2017.52]