Video object detection method for improving the stability of bounding box
2021, Vol. 26, No. 1, Pages 113-122
Print publication date: 2021-01-16
Accepted: 2020-10-20
DOI: 10.11834/jig.200417
Tenglong Hao, Xiying Li. Video object detection method for improving the stability of bounding box[J]. Journal of Image and Graphics, 2021, 26(1): 113-122.
Objective
At present, most research on object detection from video focuses on improving the localization accuracy of bounding boxes, while little work addresses their localization stability. However, bounding box stability has an important influence on downstream algorithms such as multi-object tracking and vehicle driving control. To improve it, this paper proposes an expanded non-maximum suppression (Exp_NMS) method and a frame bounding box smoothing (FBBS) strategy.
Method
The object detection stage uses the YOLO (you only look once) v3 network. In the non-maximum suppression stage, the result is obtained by fusing the information of multiple predicted boxes, which strengthens the stability of bounding boxes in a continuous video stream. The boxes are then smoothed by exploiting the information association between adjacent video frames, further improving their localization stability.
Result
The UA-DETRAC (University at Albany detection and tracking benchmark dataset) dataset is selected for the analysis experiments, and a Kalman filter multi-object tracking algorithm is used for auxiliary verification. On the basis of the MOT (multiple object tracking) evaluation metrics, we design the average track-tortuosity (AT) to measure bounding box stability and tracking trajectory smoothness intuitively and quantitatively. Experimental results show that the proposed method barely affects localization accuracy while greatly improving localization stability, and the corresponding tracking quality is significantly better: on the test videos, MOTA (multiple object tracking accuracy) rises by 6.0%, IDs (identity switches) fall by 16.8%, tracking FP (false positive) errors drop by 45.83%, AT decreases by 36.57%, and mAP (mean average precision) declines by only 0.07%.
Conclusion
We design strategies from the two perspectives of non-maximum suppression and inter-frame information association. Experiments verify that the proposed method effectively improves bounding box localization stability while leaving localization accuracy essentially unaffected.
Objective
With the development of convolutional neural networks (CNNs), the speed and accuracy of CNN-based object detection algorithms have improved remarkably. However, when such algorithms are applied to video frame by frame, the bounding boxes of the same target change intensively in adjacent frames, reflecting the poor stability of the bounding box. This problem has received minimal attention because it does not arise in object detection on single images. In object detection from video (VID), stability refers to whether the bounding box of the same target changes smoothly and uniformly in successive video frames, whereas accuracy refers to the degree of overlap between the bounding box and the actual position. Mean average precision (mAP), the commonly used evaluation index, considers only accuracy and ignores stability. However, the stability of the bounding box is extremely important for engineering applications. In self-driving systems, system stability is directly related to driving safety. Self-driving research is currently entering the L5 stage, in which vehicle driving control needs to sense and predict the movement of surrounding vehicles and pedestrians to make decisions, rather than simply reacting in accordance with specific external conditions. Object detection is the basic algorithm with which a self-driving system senses its surroundings. Poor stability negatively impacts all algorithms that analyze the object detection results, ultimately reducing the stability of the entire self-driving system and creating potential safety hazards. Thus, designing strategies to solve this problem is necessary. We propose the expanded non-maximum suppression (Exp_NMS) and frame bounding box smoothing (FBBS) strategies in this paper.
Method
We design the Exp_NMS and FBBS strategies on the basis of the YOLO (you only look once) v3 object detection algorithm. The overall process sends the video frame by frame into the YOLOv3 network for object detection, then uses Exp_NMS to eliminate redundant bounding boxes and FBBS to smooth the results. In the Exp_NMS strategy, the result is obtained by fusing the information of multiple bounding boxes, because the original NMS strategy may directly discard some boxes and thus cause poor stability.
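The abstract does not spell out the fusion rule, so the following is only a minimal sketch of the Exp_NMS idea, assuming a score-weighted average of the coordinates of each suppressed group and an illustrative IoU threshold of 0.5; the paper's exact rule may differ.

import numpy as np

def iou(box, boxes):
    # IoU of one box [x1, y1, x2, y2] against an (N, 4) array of boxes
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def exp_nms(boxes, scores, iou_thr=0.5):
    # Instead of discarding boxes suppressed by the top-scoring box,
    # fuse each suppressed group into one box by a score-weighted average.
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]
    boxes, scores = boxes[order], scores[order]
    keep_boxes, keep_scores = [], []
    while len(boxes) > 0:
        group = iou(boxes[0], boxes) >= iou_thr   # top box plus all it suppresses
        w = scores[group][:, None]
        fused = (boxes[group] * w).sum(axis=0) / w.sum()
        keep_boxes.append(fused)
        keep_scores.append(scores[0])
        boxes, scores = boxes[~group], scores[~group]
    return np.array(keep_boxes), np.array(keep_scores)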
In the FBBS strategy, we use adjacent-frame information association, an idea widely used in VID algorithms. Unlike conventional strategies, FBBS uses least squares regression to achieve information transmission between adjacent frames rather than relying on additional information such as optical flow. FBBS has a certain corrective effect on multiple-detection and missed-detection errors and an even better effect on the stability problem.
Result
The scenarios in engineering applications are variable and complicated, so the training dataset should cover as many scenarios as possible. This paper uses MIO-TCD (MIOvision Traffic Camera Dataset), which was collected from thousands of real traffic scenes, as the object detection training dataset and UA-DETRAC (University at Albany detection and tracking benchmark dataset) as the test dataset. Because MIO-TCD cannot be used to evaluate multi-object tracking results, this paper runs verification experiments with YOLOv3 and a Kalman filter multi-object tracking algorithm; the stability of the bounding box has a significant effect on tracking, and most tracking algorithms are based on the Kalman filter. This paper designs a parameter called average track-tortuosity (AT) to measure the stability of the bounding box and the smoothness of the tracking trajectory. Experimental results show that our method significantly improves the stability of the bounding box with almost no loss of accuracy, and the accuracy of the tracking algorithm improves as well: with Exp_NMS and FBBS, multiple object tracking accuracy increases by 6.0%, track ID switches fall by 16.8%, the number of tracking false positive errors drops by 45.83%, AT decreases by 36.57%, and mAP falls by only 0.07%.
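The paper's exact AT formula is not reproduced in this abstract; the sketch below shows one conventional tortuosity measure (arc length over end-to-end displacement, averaged across tracks) purely to illustrate what such a metric quantifies.

import numpy as np

def track_tortuosity(points):
    # points: (T, 2) array of box centers for one track, in frame order
    pts = np.asarray(points, dtype=float)
    steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    arc = steps.sum()                            # total path length
    chord = np.linalg.norm(pts[-1] - pts[0])     # straight-line displacement
    return arc / chord if chord > 0 else 1.0     # 1.0 = perfectly straight track

def average_tortuosity(tracks):
    # tracks: iterable of (T_i, 2) center sequences; returns the mean value
    return float(np.mean([track_tortuosity(t) for t in tracks]))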
Conclusion
In this paper, we analyze the causes and manifestations of the bounding box stability problem and design two strategies from the perspectives of NMS and adjacent-frame information association. The experimental results show that the two strategies significantly enhance the stability of the bounding box with almost no loss of accuracy.
Keywords: convolutional neural network (CNN); object detection from video (VID); stability of bounding box; non-maximum suppression (NMS); adjacent-frame information association
Alexey, Redmon J, Sinigardi S, Cyy, Hager T, Zhang V, Maaz M, IlyaOvodov, Kahn P, Veitch-Michaelis J, Dujardin A, Duohappy, Acxz, Aughey J, Özipek E, White J, Smith D, Aven, Shibata T K C, Giordano M, Daras G, Hagege R, Gąsiorzewski B, Babaei A, Vhavle H, Arends E, Cho D C, Lin C H, Baranski A and 7FM. 2020. AlexeyAB/darknet: YOLOv4 pre-release [CP/OL]. [2020-07-09]. http://doi.org/10.5281/zenodo.3829035
Bodla N, Singh B, Chellappa R and Davis L S. 2017. Soft-NMS: improving object detection with one line of code//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5562-5570 [DOI: 10.1109/ICCV.2017.593]
Cartucho J, Ventura R and Veloso M. 2018. Robust object recognition through symbiotic deep learning in mobile robots//Proceedings of 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems. Madrid, Spain: IEEE: 2336-2341 [DOI: 10.1109/IROS.2018.8594067]
Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D and Brox T. 2015. FlowNet: learning optical flow with convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2758-2766 [DOI: 10.1109/ICCV.2015.316]
Havard W, Besacier L and Rosec O. 2017. SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO dataset [EB/OL]. [2020-07-09]. https://arxiv.org/pdf/1707.08435.pdf
Houston J, Zuidhof G, Bergamini L, Ye Y W, Jain A, Omari S, Iglovikov V and Ondruska P. 2020. One thousand and one hours: self-driving motion prediction dataset [EB/OL]. [2020-07-09]. https://arxiv.org/pdf/2006.14480v1.pdf
Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A and Brox T. 2017. FlowNet 2.0: evolution of optical flow estimation with deep networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1647-1655 [DOI: 10.1109/CVPR.2017.179]
Kang K, Li H S, Yan J J, Zeng X Y, Yang B, Xiao T, Zhang C, Wang Z, Wang R H, Wang X G and Ouyang W L. 2018. T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10): 2896-2907 [DOI: 10.1109/TCSVT.2017.2736553]
Luo Z M, Branchaud-Charron F, Lemaire C, Konrad J, Li S Z, Mishra A, Achkar A, Eichel J and Jodoin P M. 2018. MIO-TCD: a new benchmark dataset for vehicle classification and localization. IEEE Transactions on Image Processing, 27(10): 5129-5141 [DOI: 10.1109/TIP.2018.2848705]
Lyu S, Chang M C, Du D W, Li W B, Wei Y, del Coco M, Carcagn P, Schumann A, Munjal B, Dang D Q T, Choi D H, Bochinski E, Galasso F, Bunyak F, Seetharaman G, Baek J W, Lee J T, Palaniappan K, Lim K T, Moon K, Kim K J, Sommer L, Brandlmaier M, Kang M S, Jeon M, Al-Shakarji N M, Acatay O, Kim P K, Amin S, Sikora T, Dinh T, Senst T, Che V G H, Lim Y C, Song Y M and Chung Y S. 2018. UA-DETRAC 2018: report of AVSS2018 & IWT4S challenge on advanced traffic monitoring//Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance. Auckland, New Zealand: IEEE: 1-6 [DOI: 10.1109/AVSS.2018.8639089]
Mao T and Liang W. 2020. The design and evaluation of abnormal event generation system for autonomous driving algorithms. Transactions of Beijing Institute of Technology, 40(7): 753-759 [DOI: 10.15918/j.tbit1001-0645.2019.025]
Milan A, Leal-Taixé L, Reid I, Roth S and Schindler K. 2016. MOT16: a benchmark for multi-object tracking [EB/OL]. [2020-07-09]. https://arxiv.org/pdf/1603.00831.pdf
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement [EB/OL]. [2019-08-20]. https://arxiv.org/pdf/1804.02767.pdf
Xiao F Y and Lee Y J. 2020. Video object detection with an aligned spatial-temporal memory [EB/OL]. [2020-08-29]. https://arxiv.org/pdf/1712.06317.pdf
Yu Z P, Xing X Y and Chen J Y. 2019. Review on automated vehicle testing technology and its application. Journal of Tongji University (Natural Science), 47(4): 540-547 [DOI: 10.11908/j.issn.0253-374x.2019.04.013]
Zhang S Y, Wang T, Wang C Y, Wang Y, Shan G C and Snoussi H. 2019. Video object detection based on RGB and optical flow analysis//Proceedings of the 2nd China Symposium on Cognitive Computing and Hybrid Intelligence. Xi'an, China: IEEE: 280-284 [DOI: 10.1109/CCHI.2019.8901921]
Zheng Z H, Wang P, Liu W, Li J Z, Ye R G and Ren D W. 2019. Distance-IoU loss: faster and better learning for bounding box regression [EB/OL]. [2020-07-09]. https://arxiv.org/pdf/1911.08287.pdf
Zhu X Z, Wang Y J, Dai J F, Yuan L and Wei Y C. 2017. Flow-guided feature aggregation for video object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 408-417 [DOI: 10.1109/ICCV.2017.52]