Snap pyramid network (SPNet): an efficient real-time detection system for complex scenes
2020, Vol. 25, No. 5, Pages 977-992
Received: 2019-06-21
Revised: 2019-07-12
Accepted: 2019-07-19
Published in print: 2020-05-16
DOI: 10.11834/jig.190303

Objective
To address the inability of current mainstream one-stage network frameworks to balance detection accuracy and detection efficiency when detecting complex target scenes in high-frame-rate video, this paper improves the one-stage network architecture, uses optical flow to track feature maps, and proposes a snap pyramid network (SPNet) for efficient detection of complex scenes.
Method
We inspect the feature extraction network and the interior of the pyramid network, fully visualizing the feature matrices and convolution structures to identify the key factors that allow a one-stage model to better detect small and dense targets. We then build SPNet, a complex-scene detection network consisting of a backbone network (MainNet) with a built-in sub-network tracker (TrackNet). In MainNet, we design a feature weight control (FWC) module, improve the basic block, and design a multi-scale pyramid structure that combines MainNet's core network (BackBone) with a feature pyramid network (FPN), effectively improving the detection accuracy and robustness for the small, dense targets present in video key frames. In TrackNet, an optical flow tracker is built into the BackBone: high-precision optical flow vectors map the feature maps produced by the BackBone's convolutions, replacing the traditional fully convolutional feature architecture and effectively improving detection efficiency; a sketch of the per-frame scheduling this implies follows.
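The abstract states only that TrackNet replaces full convolution on non-key frames with flow-mapped features, so the following is a minimal Python sketch of the per-frame scheduling this implies: the heavy BackBone runs on key frames, while intermediate frames reuse its features. Every name here (`backbone`, `fpn_head`, `flow_fn`, `warp_fn`, `key_interval`) is a hypothetical placeholder, not the authors' code.

```python
def detect_video(frames, backbone, fpn_head, flow_fn, warp_fn, key_interval=10):
    """Hypothetical SPNet-style scheduler: the heavy BackBone runs only on
    key frames; intermediate frames reuse its features via optical flow."""
    key_frame, key_feats, results = None, None, []
    for i, frame in enumerate(frames):
        if i % key_interval == 0:                # key frame: full feature extraction
            key_frame, key_feats = frame, backbone(frame)
            feats = key_feats
        else:                                    # adjacent frame: flow + feature warp
            flow = flow_fn(frame, key_frame)     # dense flow, current -> key frame
            feats = warp_fn(key_feats, flow)
        results.append(fpn_head(feats))          # only the light FPN head runs here
    return results
```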
Result
SPNet balances detection performance on small and dense targets: it achieves an average precision of 52.8% on the MS COCO (Microsoft common objects in context) detection dataset and 75.96% on PASCAL VOC, with a small-object average precision of 13.9% on MS COCO. On the VOT (visual object tracking) dataset it reaches an average precision of 42.1%, and the detection speed rises to 50-70 frames/s.
Conclusion
The proposed fast pyramid object detection framework restructures the one-stage detection network and uses optical flow to fully reuse convolutional feature information. It focuses on detection capability in complex scenes and detection efficiency on video streams, achieving a good balance between accuracy and speed.
Objective
With the great breakthroughs of deep convolutional neural networks (CNNs), numerous state-of-the-art networks have been created that significantly improve image classification and power modern CNN-based object detectors. Image classification and detection research has matured and entered the industrial stage. However, detecting objects in complex samples and high-frame-rate videos remains a challenging task in computer vision, especially when each frame is filled with huge numbers of small and dense instances. State-of-the-art networks do not treat the tradeoff between accuracy and efficiency in detecting small, dense targets as a priority. Thus, in this study, we propose a deep hybrid network, namely, the snap pyramid network (SPNet).
Method
Our model incorporates a dense optical flow technique into an enhanced one-stage architecture. First, a complete inner visualization of the feature-extraction network and the pyramid network is built to mine out the critical factors for detecting small, dense objects. Through this method, contextual information is found to be a significant key and should thus be fully utilized in feature extraction. Moreover, sharing context across the network's multiple convolutional templates is essential for propagating the high semantic information of the deep templates, which helps the shallow templates precisely predict target locations. Accordingly, we present our hybrid network, SPNet, which is composed of two parts, namely, MainNet and TrackNet.
For MainNet, inception and feature weight control (FWC) modules are designed to modify the conventional network architecture. The whole MainNet consists of the BackBone network (the core of MainNet, which efficiently extracts feature information) and a feature pyramid network (FPN) that predicts classification scores and candidate box positions. Inception greatly reduces the parameter count. FWC raises the weight of essential features that help detect prospective targets while suppressing non-target features and other disturbances. To further accelerate training, swish activation and group normalization are also employed. The training speed and validation accuracy of MainNet are better than those of YOLO (you only look once) and SSD (single shot multibox detector). As a result, the performance and robustness of small, dense object detection are substantially improved in the key frames of a video.
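The abstract describes FWC only as a channel-reweighting mechanism (boosting target-relevant features, suppressing distractors), so the sketch below assumes a squeeze-and-excitation-style gate combined with the swish activation and group normalization mentioned above; the actual FWC design may differ.

```python
import torch
import torch.nn as nn

class FWC(nn.Module):
    """Hypothetical feature weight control block: a squeeze-and-excitation-style
    channel gate using the swish (SiLU) activation and group normalization that
    the paper reports employing; the true FWC design may differ."""
    def __init__(self, channels, reduction=16, groups=32):
        super().__init__()
        # channels must be divisible by `groups` for GroupNorm
        self.norm = nn.GroupNorm(groups, channels)
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: one value per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SiLU(),                                 # swish activation
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel weights in (0, 1)
        )

    def forward(self, x):
        x = self.norm(x)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # boost useful channels, damp noise
```

For example, `FWC(256)(torch.randn(1, 256, 32, 32))` gates a 256-channel feature map without changing its shape, so the block can be dropped between existing convolutional stages.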
For TrackNet, a dense optical flow field technique is applied to adjacent frames to obtain a FlowMap. Next, to substantially improve detection efficiency, the FlowMap is mapped onto the FeatureMap through a pyramid structure instead of the traditional fully convolutional network. This markedly shortens computation time because optical flow calculation on a GPU (graphics processing unit) is much faster than the feature extraction of a convolutional network. Then, for the adjacent frames, only the FPN, a lightweight network, needs to be computed.
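The abstract does not specify how the FlowMap is applied to the FeatureMap, so the following is a minimal sketch under the usual deep-feature-flow assumption: a dense flow field (here OpenCV's CPU Farneback method for brevity; the paper computes flow on the GPU) is downscaled to the feature resolution and used to bilinearly warp the key frame's features. The function names, `grid_sample` usage, and scaling constants are implementation choices, not details from the paper.

```python
import cv2
import torch
import torch.nn.functional as F

def dense_flow(curr_gray, key_gray):
    """Dense flow (H, W, 2) computed FROM the current frame BACK TO the key
    frame, so sampling key features at p + flow(p) aligns them to the current frame."""
    return cv2.calcOpticalFlowFarneback(curr_gray, key_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

def warp_features(feat, flow):
    """Warp key-frame features (1, C, h, w) into the current frame using a
    pixel-level flow field (H, W, 2); assumes bilinear feature propagation."""
    _, _, h, w = feat.shape
    H, W = flow.shape[:2]
    f = torch.from_numpy(flow).permute(2, 0, 1)[None]              # (1, 2, H, W)
    f = F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
    f = f * torch.tensor([w / W, h / H]).view(1, 2, 1, 1)          # to feature units
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    gx = (xs + f[0, 0]) / (w - 1) * 2 - 1                          # normalize to [-1, 1]
    gy = (ys + f[0, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((gx, gy), dim=-1)[None]                     # (1, h, w, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```

On a non-key frame, the warped features feed directly into the FPN head, so only the flow computation and this sampling step run instead of the full BackBone.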
Result
The model is trained and validated thoroughly on MS COCO (Microsoft common objects in context). The results demonstrate that the proposed SPNet makes a tradeoff between accuracy and efficiency when detecting small, dense objects in a video stream, obtaining 52.8% accuracy on MS COCO and 75.96% on PASCAL VOC, and speeding up to 70 frames/s.
Conclusion
Experimental results show that SPNet can effectively detect small, difficult targets in complex scenes. Reusing features through optical flow also greatly improves detection efficiency while preserving good detection results. The investigation shows that studying the internal structure of a network and exploring its internal processes are crucial to improving network performance; the gains here come from the complete visualization of the network structure and the overall optimization of the network architecture.