An overview of deep learning based pedestrian detection algorithms

Yan Luo; Chongyang Zhang; Yonghong Tian; Jie Guo; Jun Sun

doi:10.11834/jig.200831

Review

2022, Vol. 27, No. 7, pp. 2094-2111
Print publication date: 2022-07-16

Accepted: 2021-04-08

Yan Luo, Chongyang Zhang, Yonghong Tian, Jie Guo, Jun Sun. An overview of deep learning based pedestrian detection algorithms[J]. Journal of Image and Graphics, 2022, 27(7): 2094-2111. DOI: 10.11834/jig.200831.

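Summary

Pedestrian detection has shown great application value in intelligent transportation systems, intelligent security surveillance, intelligent robotics and other fields, and has become one of the important research directions in computer vision. Benefiting from the rapid development of deep learning, general object detection models based on deep convolutional neural networks have been steadily extended to pedestrian detection and have achieved good performance. However, because of the inherent particularity and complexity of pedestrian targets, and especially given occlusion and scale variation in complex scenes, deep learning based pedestrian detection methods still face severe challenges in both accuracy and efficiency. To address these problems, this paper takes deep learning based pedestrian detection as its subject and, on the basis of a thorough literature survey, categorizes pedestrian detection algorithms from three perspectives: anchor-based methods, anchor-free methods, and improvements to general techniques (for example, loss function improvements and non-maximum suppression methods). Representative methods in each category are examined in detail and compared. The paper summarizes the common datasets currently used in pedestrian detection and analyzes the application scenario of each dataset from the perspective of its data composition. It also discusses the performance of the various algorithms on different datasets and compares their strengths and weaknesses across datasets. Finally, it offers predictions and an outlook on the open problems and future research directions in pedestrian detection. How to alleviate the feature loss caused by occlusion, how to cope with scale variation under a single viewpoint, how to improve detector efficiency, and how to exploit multi-modal information effectively to improve pedestrian detection accuracy are all directions worth further study.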

Abstract

Computer vision has developed rapidly and now underpins tasks such as image classification and face recognition. Traditional machine learning methods carry out these tasks with hand-crafted image features designed to determine the location and category of a target. However, manual feature design is costly. Deep learning can instead learn effective features automatically from labeled or unlabeled data, in a supervised or unsupervised manner, and has advanced image recognition and object detection. Deep learning based pedestrian detection is one branch of this development. Pedestrian detection aims to identify pedestrian targets in an input single-frame image or image sequence and to determine where each pedestrian is located in the image.

Because scenes are complicated and pedestrian targets have distinctive characteristics, deep learning based pedestrian detection faces two key challenges. 1) Occlusion. Under severe occlusion, the structural information of the human body is badly degraded. As a result, the visual features of occluded pedestrians differ from those of unoccluded ones, which leads to false negatives during inference. Because occlusion patterns are diverse, it is difficult for a pedestrian detection algorithm to determine accurately which part is occluded and to localize the pedestrian. 2) Scale variation. Detection performance also depends on whether the scene is crowded or sparse. For a tiny target, the lack of sufficient semantic information makes the detector likely to misjudge it as background noise. At the same time, for a large-scale target it is difficult to find a set of anchors that match it well during training. Moreover, large-scale pedestrian instances often have clear internal texture and skeleton features, while small-scale ones often have only blurred edge information. A unified framework that handles both large and small targets is therefore required.

This paper reviews deep learning based pedestrian detection algorithms and analyzes current improvements to the mainstream detection frameworks from three aspects: anchor-based algorithms, anchor-free algorithms, and improvements to general techniques (e.g., loss functions and non-maximum suppression).

For anchor-based methods, the review focuses on pedestrian detectors built on the Faster region-based convolutional neural network (Faster R-CNN) or single shot multibox detector (SSD) baselines, in which region proposals are first generated and then refined to obtain the final detections. Whether built on single-stage or two-stage anchor-based detectors, these algorithms add modules tailored to pedestrians. We group them into four categories. 1) Part-based methods: local part features carry more information about pedestrian occlusion and deformation, so methods such as occlusion-aware R-CNN (OR-CNN) extract part-level features to improve the detection of occluded pedestrians. Instead of using extra part detectors or manually delineated part regions, several methods such as the mask-guided attention network (MGAN) use an attention mechanism to enhance the features of visible pedestrian regions while suppressing those of occluded regions. 2) Hybrid methods: methods such as Bi-box and PedHunter build two-branch networks for both part and full-body prediction and introduce a fusion mechanism so that local and global pedestrian features are both used robustly. 3) Cascaded methods: to improve localization quality, cascade structures have also been applied to pedestrian detection. Cascade R-CNN, the autoregressive network (AR-Ped) and the asymptotic localization fitting network (ALFNet) stack multiple head predictors for multi-stage regression of the proposals, so that the pedestrian detection boxes are gradually refined toward better localization. 4) Multi-scale methods: these methods build robust feature representations by fusing high-level and low-level features, as in the feature pyramid network (FPN), to tackle scale variance in pedestrian detection.

For anchor-free methods, the review presents two representative detectors: the point-based center scale predictor (CSP) and the line-based topological line localization (TLL). Neither uses pre-defined anchor boxes, so both fall into the anchor-free paradigm. Anchor-free methods avoid the redundant background information introduced by pre-defined boxes and therefore perform relatively better on small-scale and occluded pedestrians.

The review also summarizes improvements to general techniques that apply to both anchor-based and anchor-free detectors. Loss function modifications, represented by repulsion loss (RepLoss), are designed to bring a proposal closer to its matched ground-truth box while keeping it away from other ground-truth boxes. Another key technique is non-maximum suppression (NMS), which is usually used to remove duplicate detections. Representative methods are adaptive NMS and R²NMS, which usually aim to find a more suitable post-processing threshold for the pedestrian detector to deal with occlusion.

Common datasets such as Caltech and CityPersons, together with their challenging subsets (e.g., reasonable and heavy), are introduced in detail. Using the log-average miss rate as the evaluation metric, the overview compares the performance of the methods on different subsets that target various challenges and provides an experimental analysis.
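The log-average miss rate used for these comparisons can be illustrated with a short Python sketch. It assumes the definition commonly used with the Caltech benchmark: the miss rate is sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space between 10^-2 and 10^0, and the samples are averaged geometrically. The function name `log_average_miss_rate`, its arguments, and the choice to count a miss rate of 1.0 when the curve has no point at or below a reference FPPI are assumptions made for illustration, not details taken from the paper.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Summarize a miss-rate-vs-FPPI curve as a single number.

    fppi      -- false positives per image, sorted in ascending order
    miss_rate -- miss rate at each corresponding fppi value
    Both inputs describe one detector's curve (hypothetical input format).
    """
    # Reference FPPI values evenly spaced in log space from 1e-2 to 1e0.
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]

    sampled = []
    for ref in refs:
        # Use the last curve point whose FPPI does not exceed the reference.
        # Assumption: if the curve has no such point, count a miss rate of 1.0.
        mr = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m
            else:
                break
        # Clamp away from zero so the logarithm is defined.
        sampled.append(max(mr, 1e-10))

    # Averaging in log space is a geometric mean of the sampled miss rates.
    return math.exp(sum(math.log(m) for m in sampled) / len(sampled))


if __name__ == "__main__":
    # A flat curve with miss rate 0.5 everywhere gives approximately 0.5.
    print(log_average_miss_rate([0.001, 0.1, 1.0], [0.5, 0.5, 0.5]))
```

Lower values are better. Because the samples are averaged in log space, a detector that keeps its miss rate low at the strict end of the FPPI range is rewarded more than an arithmetic mean would reward it.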

Keywords

pedestrian detection; deep learning; convolutional neural network (CNN); occlusion target detection; small-scale target detection
