Multi-pedestrian tracking optimized by social force model under first-person perspective
2020, Vol. 25, No. 9: 1869-1881
Received: 2019-12-18; Revised: 2020-03-06; Accepted: 2020-03-13; Published in print: 2020-09-16
DOI: 10.11834/jig.190632

Objective
Multi-pedestrian tracking remains one of the most challenging tasks in computer vision, and camera motion, frequent occlusion, and collision behavior make pedestrian tracking in first-person video inefficient and inaccurate. To address this, we propose a multi-pedestrian tracking algorithm for the first-person perspective that is optimized by a social force model.
Method
We adopt a detection-based tracking algorithm that simplifies tracking into a matching problem over detected targets, and we apply a social force optimization after preliminary tracking to correct the tracking errors caused by frequent occlusion and collision behavior. First, a single shot multi-box detector (SSD) with a revised feature extraction strategy and reset default-box aspect ratios detects pedestrians in the input first-person video sequence; appearance features of the pedestrians are then extracted with a convolutional neural network (CNN) model, and preliminary tracking results are obtained by computing pedestrian feature similarities. Next, the tracking results are optimized with social forces: (1) pedestrian grouping behavior is defined, the group of each tracked pedestrian is computed, and a group identifier is attached so that pedestrians in the same group remain accurately tracked under occlusion; (2) a pedestrian domain is defined and repulsion between pedestrian groups is computed, yielding accurate tracking after collision avoidance.
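The grouping step described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the thresholds, the union-find clustering, and the function name `group_pedestrians` are all assumptions; the paper only states that pedestrians are assigned group identifiers based on grouping behavior.

```python
import math

# Illustrative thresholds (assumptions, not values from the paper).
DIST_THRESH = 1.5   # max distance for two pedestrians to share a group
VEL_THRESH = 0.5    # max velocity difference within a group

def group_pedestrians(positions, velocities):
    """Assign a group id to each pedestrian: union-find over pairs
    that walk close together with similar velocity."""
    n = len(positions)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(positions[i], positions[j])
            dvel = math.dist(velocities[i], velocities[j])
            if dist < DIST_THRESH and dvel < VEL_THRESH:
                union(i, j)

    return [find(i) for i in range(n)]
```

Maintaining such a shared identifier is what lets a briefly occluded pedestrian be re-associated with their group rather than spawning a new track.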
Result
In comparison experiments against other tracking algorithms on six first-person video sequences from the public ETH (eidgenössische technische hochschule), MOT16 (multi-object tracking 16), and ADL (adelaide) datasets, the proposed algorithm runs at a near-real-time 20.8 frames/s while its overall tracking performance, measured by MOTA (multiple object tracking accuracy), is 2.5% higher than that of other near-real-time algorithms.
Conclusion
The proposed social force-optimized multi-pedestrian tracking algorithm for first-person video both tracks multiple pedestrians accurately in first-person scenes and satisfies the demands of practical applications.
Objective
Pedestrian tracking and first-person vision are challenging tasks in the field of computer vision. First-person vision focuses on analyzing and processing first-person videos, thereby helping camera wearers make the right decisions. Its particularities include the following. First, the foreground and background of the video are difficult to distinguish because the camera is always moving. Second, the shooting location is not fixed, and the lighting changes considerably. Third, processing must run in real time. Fourth, embedded processing capability is needed for application to smart glasses and similar devices. These conditions cause pedestrian occlusion and collision-avoidance behavior, leading to low tracking efficiency and accuracy. Therefore, this study proposes a social force-optimized multi-pedestrian tracking algorithm for first-person videos that resolves frequent occlusions and collisions, thereby improving tracking efficiency and accuracy.
Method
We use a detection-based tracking algorithm that simplifies tracking into a matching problem over detected targets; after initial tracking, the social force model is used to handle frequent occlusion and collision-avoidance behavior. The feature extraction strategy of the single shot multi-box detector (SSD) is first adjusted: features are extracted from the feature maps conv4_3, conv6_1, conv6_2, conv7_1, conv7_2, conv8_2, and conv9_2. Drawing on the dense and residual connections of DenseNet to reuse features, the input and output of conv6_2 are combined and fed to conv7_2. The aspect ratio of the default box is then reset: on the basis of the large Caltech pedestrian dataset, the default boxes are simplified to a single aspect ratio of 0.41, which reduces computation and interference in pedestrian detection. For appearance, a wide residual network is constructed by adding two convolutional layers, a max-pooling layer, and six residual blocks to a network pretrained on a large-scale ReID dataset; this model extracts the appearance features of the detected pedestrian boxes. Preliminary tracking results are obtained by computing pedestrian feature similarity: the degree of location matching is calculated first, followed by the appearance matching and fused matching degrees, and the Kuhn-Munkres algorithm associates the detection results. Lastly, the social force model is introduced to optimize the preliminary tracking results. The first step defines pedestrian grouping behavior: the group of each tracked pedestrian is computed and a group identifier is added, so that under occlusion pedestrians in the same group are still tracked accurately by maintaining the group identification. The second step defines the pedestrian domain and computes repulsion for pedestrian groups that cross it; after collision-avoidance behavior occurs, the tracking boxes still closely follow their pedestrian targets.
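The data-association step named above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the fusion weight `alpha`, the helper names, and the exhaustive search (which finds the same minimum-cost one-to-one assignment as the Kuhn-Munkres algorithm, but is only practical for small numbers of targets) are all assumptions.

```python
from itertools import permutations

def fused_cost(track, det, alpha=0.5):
    """Fuse a positional distance with an appearance distance.
    alpha is an assumed weighting, not the paper's value."""
    pos_d = sum((a - b) ** 2 for a, b in zip(track["pos"], det["pos"])) ** 0.5
    # Cosine distance, assuming unit-normalized appearance features.
    app_d = 1.0 - sum(a * b for a, b in zip(track["feat"], det["feat"]))
    return alpha * pos_d + (1 - alpha) * app_d

def associate(tracks, dets):
    """Minimum-cost one-to-one assignment of detections to tracks
    (brute-force stand-in for the Kuhn-Munkres algorithm)."""
    n = len(tracks)
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(dets)), n):
        cost = sum(fused_cost(tracks[i], dets[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = list(enumerate(perm)), cost
    return best  # list of (track_index, detection_index) pairs
```

In practice the Kuhn-Munkres (Hungarian) algorithm solves the same assignment problem in polynomial time, which is why the paper uses it rather than exhaustive search.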
Result
Compared with other tracking algorithms on six first-person video sequences from the public datasets eidgenössische technische hochschule (ETH), multi-object tracking 16 (MOT16), and adelaide (ADL), the proposed algorithm runs at a near-real-time speed of 20.8 frames per second, and its multiple object tracking accuracy (MOTA) is improved by 2.5%. Among the six tracking indicators, four reach the optimum and two are suboptimal. Lifted multicut and person (LMP_p) obtains the best score on the mostly tracked (MT) indicator, but only at the cost of operating efficiency. Simple online and realtime tracking (SORT) performs well on the Hz index, but its other performance indicators are average. In the comparison of operating efficiency, the proposed method runs at approximately 20 frames per second on the six sequences, reaching quasi-real-time performance second only to SORT. SORT, however, trades accuracy for operating efficiency and therefore often suffers from problems such as tracking failure.
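The headline MOTA figure above follows the standard definition from the MOT benchmark (Milan et al., 2016): one minus the ratio of all errors (missed targets, false positives, and identity switches) to the total number of ground-truth objects. A direct transcription:

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy:
    MOTA = 1 - (FN + FP + IDSW) / GT,
    where the counts are summed over all frames and num_gt is the
    total number of ground-truth objects across the sequence."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt
```

MOTA can be negative when the tracker makes more errors than there are ground-truth objects, which is why it is reported as the overall accuracy indicator rather than a simple percentage of correct matches.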
Conclusion
This study explores several issues in first-person pedestrian tracking and proposes social force-optimized multi-pedestrian tracking in first-person videos. The core idea is to simplify tracking into a matching problem over detection results: a single shot multi-box detector (SSD) detects pedestrians, and their appearance characteristics serve as the main basis for data association. The social force model is then used for optimization to resolve the tracking errors caused by frequent occlusion and collision avoidance. Moreover, the model performs well under difficult conditions such as hard-to-distinguish foreground and background, unobtrusive features, numerous pedestrian targets, and lighting changes. Experimental results on numerous first-person video sequences show that, compared with existing mainstream general-purpose tracking methods, the proposed method has higher tracking accuracy and better real-time performance. These results validate the effectiveness of the proposed method for multi-pedestrian tracking in first-person videos.
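The social force model invoked throughout is due to Helbing and Molnár (1995, reference below). In simplified form, it describes each pedestrian α's change of velocity as the sum of a driving term toward the desired velocity and repulsive terms from other pedestrians β and boundaries B; the grouping and domain-repulsion computations above build on these repulsive terms. The notation here is a condensed paraphrase of the original paper, not a reproduction of this paper's equations:

```latex
\frac{\mathrm{d}\vec{v}_\alpha}{\mathrm{d}t}
  = \underbrace{\frac{1}{\tau_\alpha}\bigl(v_\alpha^{0}\,\vec{e}_\alpha - \vec{v}_\alpha\bigr)}_{\text{drive toward desired velocity}}
  + \sum_{\beta \neq \alpha} \vec{F}_{\alpha\beta}
  + \sum_{B} \vec{F}_{\alpha B}
```

Here $v_\alpha^{0}$ is the desired speed, $\vec{e}_\alpha$ the desired direction, $\tau_\alpha$ a relaxation time, and $\vec{F}_{\alpha\beta}$, $\vec{F}_{\alpha B}$ the repulsive forces from other pedestrians and borders.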
Alahi A, Goel K, Ramanathan V, Robicquet A, Li F F and Savarese S. 2016. Social LSTM: human trajectory prediction in crowded spaces//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 961-971 [DOI:10.1109/CVPR.2016.110]
Betancourt A, Morerio P, Regazzoni C S and Rauterberg M. 2015. The evolution of first person vision methods: a survey. IEEE Transactions on Circuits and Systems for Video Technology, 25(5): 744-760 [DOI:10.1109/TCSVT.2015.2409731]
Bewley A, Ge Z Y, Ott L, Ramos F and Upcroft B. 2016. Simple online and realtime tracking//Proceedings of 2016 IEEE International Conference on Image Processing. Phoenix: IEEE: 3464-3468 [DOI:10.1109/ICIP.2016.7533003]
Bose B, Wang X G and Grimson E. 2007. Multi-class object tracking algorithm that handles fragmentation and grouping//Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis: IEEE: 1-8 [DOI:10.1109/CVPR.2007.383175]
Choi W. 2015. Near-online multi-target tracking with aggregated local flow descriptor//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago: IEEE: 3029-3037 [DOI:10.1109/ICCV.2015.347]
Ess A, Leibe B, Schindler K and van Gool L. 2008. A mobile vision system for robust multi-person tracking//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage: IEEE: 1-8 [DOI:10.1109/CVPR.2008.4587581]
Helbing D and Molnár P. 1995. Social force model for pedestrian dynamics. Physical Review E, 51(5): 4282-4286 [DOI:10.1103/PhysRevE.51.4282]
Huang G, Liu Z, van der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 2261-2269 [DOI:10.1109/CVPR.2017.243]
Keuper M, Tang S Y, Yu Z J, Andres B, Brox T and Schiele B. 2016. A multi-cut formulation for joint segmentation and tracking of multiple objects [EB/OL]. [2019-12-01]. https://arxiv.org/pdf/1607.06317.pdf
Kuhn H W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2): 83-97 [DOI:10.1002/nav.3800020109]
Leal-Taixé L, Milan A, Reid I, Roth S and Schindler K. 2015. MOTChallenge 2015: towards a benchmark for multi-target tracking [EB/OL]. [2019-12-01]. https://arxiv.org/pdf/1504.01942.pdf
Li J W, Zhou X L, Chan S X and Chen S Y. 2018. A novel video target tracking method based on adaptive convolutional neural network feature. Journal of Computer-Aided Design and Computer Graphics, 30(2): 273-281 [DOI:10.3724/SP.J.1089.2018.16268]
Li X, Ma C, Wu B Y, He Z Y and Yang M H. 2019. Target-aware deep tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE: 1369-1378 [DOI:10.1109/CVPR.2019.00146]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer: 21-37 [DOI:10.1007/978-3-319-46448-0_2]
Milan A, Leal-Taixé L, Reid I, Roth S and Schindler K. 2016. MOT16: a benchmark for multi-object tracking [EB/OL]. [2019-12-01]. https://arxiv.org/pdf/1603.00831.pdf
Su S, Pyo Hong J, Shi J B and Soo Park H. 2017. Predicting behaviors of basketball players from first person videos//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 1206-1215 [DOI:10.1109/CVPR.2017.133]
Wang D J, Zhang R, Yin D and Zhang Z R. 2013. Median flow aided online multi-instance learning visual tracking. Journal of Image and Graphics, 18(1): 93-100 [DOI:10.11834/jig.20130112]
Wang H Y, Yang Y T, Zhang Z, Yan G L, Wang J Q, Li X L, Chen W G and Hua J. 2017. Deep-learning-aided multi-pedestrian tracking algorithm. Journal of Image and Graphics, 22(3): 349-357 [DOI:10.11834/jig.20170309]
Wang M H, Liang Y, Liu F M and Luo X N. 2015. Object tracking based on component-level appearance model. Journal of Software, 26(10): 2733-2747 [DOI:10.13328/j.cnki.jos.004737]
Wojke N, Bewley A and Paulus D. 2017. Simple online and realtime tracking with a deep association metric//Proceedings of 2017 IEEE International Conference on Image Processing. Beijing: IEEE: 3645-3649 [DOI:10.1109/ICIP.2017.8296962]
Yagi T, Mangalam K, Yonetani R and Sato Y. 2018. Future person localization in first-person videos//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 7593-7602 [DOI:10.1109/CVPR.2018.00792]
Yamaguchi K, Berg A C, Ortiz L E and Berg T L. 2011. Who are you with and where are you going?//Proceedings of CVPR 2011. Providence: IEEE: 1345-1352 [DOI:10.1109/CVPR.2011.5995468]
Yu F W, Li W B, Li Q Q, Liu Y, Shi X H and Yan J J. 2016. POI: multiple object tracking with high performance detection and appearance feature//Proceedings of the European Conference on Computer Vision. Amsterdam: Springer: 36-42 [DOI:10.1007/978-3-319-48881-3_3]
Zhang L L, Lin L, Liang X D and He K M. 2016. Is faster R-CNN doing well for pedestrian detection?//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer: 443-457 [DOI:10.1007/978-3-319-46475-6_28]
Zhang S F, Wen L Y, Bian X, Lei Z and Li S Z. 2018. Occlusion-aware R-CNN: detecting pedestrians in a crowd//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer: 657-674 [DOI:10.1007/978-3-030-01219-9_39]