Social force-optimized multi-pedestrian tracking from a first-person perspective

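Yang Tingzhao1, Liu Li1,2, Fu Xiaodong1,2, Liu Lijun1,2, Huang Qingsong1,2 (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; 2. Computer Technology Application Key Laboratory of Yunnan Province, Kunming 650500)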

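Abstract
Objective Multi-pedestrian tracking has long been one of the most challenging tasks in computer vision. In first-person videos, camera motion and frequent occlusions and collisions between pedestrians lower both the efficiency and the accuracy of pedestrian tracking. To address this, this paper proposes a multi-pedestrian tracking algorithm for the first-person perspective that is optimized with a social force model. Method A detection-based tracking algorithm is adopted, which reduces tracking to matching detected targets; after preliminary tracking, social force optimization corrects the tracking errors caused by frequent occlusion and collision behavior. First, a single shot multi-box detector (SSD) with a modified feature extraction strategy and reset aspect ratios detects pedestrians in the input first-person video sequence, the appearance features of the pedestrians are extracted with a convolutional neural network (CNN) model, and preliminary tracking results are obtained by computing the similarity of pedestrian features. Then, the tracking results are optimized with social forces in two steps. In the first step, pedestrian grouping behavior is defined, a group is computed for each tracked pedestrian, and a group identifier is added so that pedestrians in the same group are tracked accurately under occlusion. In the second step, a pedestrian domain is defined and repulsion is computed for the pedestrian groups so that tracking remains accurate after collision avoidance. Result In comparative experiments against other tracking algorithms on six first-person video sequences from the public datasets ETH (Eidgenössische Technische Hochschule), MOT16 (multi-object tracking 16), and ADL (Adelaide), the proposed algorithm runs at a near-real-time 20.8 frames/s, and its overall tracking performance, measured by MOTA (multiple object tracking accuracy), is 2.5% higher than that of other near-real-time algorithms. Conclusion The proposed social force-optimized multi-pedestrian tracking algorithm for first-person videos tracks multiple pedestrians accurately in first-person scenes and meets the needs of practical applications well.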
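The grouping step of the social force optimization can be illustrated with a short sketch. The abstract does not give the grouping criterion, so the rule used here is an assumption: two tracked pedestrians join the same group when they are close together and move with similar velocity. The function name `group_pedestrians`, the track layout `(x, y, vx, vy)`, and both thresholds are hypothetical and are not taken from the paper.

```python
import math


def group_pedestrians(tracks, max_dist=1.5, max_speed_diff=0.5):
    """Assign a group label to every track.

    tracks: dict mapping track id -> (x, y, vx, vy), all in the same units.
    Assumed rule (not from the paper): two pedestrians share a group when
    their positions and their velocities are both within the thresholds.
    Returns a dict mapping track id -> smallest track id in its group.
    """
    parent = {tid: tid for tid in tracks}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    ids = sorted(tracks)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            xa, ya, vxa, vya = tracks[a]
            xb, yb, vxb, vyb = tracks[b]
            close = math.hypot(xa - xb, ya - yb) <= max_dist
            similar = math.hypot(vxa - vxb, vya - vyb) <= max_speed_diff
            if close and similar:
                ra, rb = find(a), find(b)
                if ra != rb:
                    # Keep the smaller id as the group label.
                    parent[max(ra, rb)] = min(ra, rb)
    return {tid: find(tid) for tid in ids}


tracks = {
    1: (0.0, 0.0, 1.0, 0.0),    # walks right
    2: (0.5, 0.0, 1.0, 0.1),    # walks beside pedestrian 1
    3: (10.0, 0.0, -1.0, 0.0),  # far away, walks left
}
print(group_pedestrians(tracks))
```

A tracker could store the returned label with each track as its group identifier and use it to keep identities stable while one group member is occluded. The repulsion computed for groups that cross a pedestrian's domain is not modeled in this sketch.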
Keywords
Multi-pedestrian tracking optimized by social force model under first-person perspective

Yang Tingzhao1, Liu Li1,2, Fu Xiaodong1,2, Liu Lijun1,2, Huang Qingsong1,2(1.Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;2.Computer Technology Application Key Laboratory of Yunnan Province, Kunming 650500, China)

Abstract
Objective Pedestrian tracking and first-person vision are challenging tasks in computer vision. First-person vision focuses on analyzing and processing first-person videos to help camera wearers make the right decisions. It poses four particular difficulties. First, the foreground and background of the video are difficult to distinguish because the camera is always moving. Second, the shooting location is not fixed, and the lighting changes considerably. Third, the video must be processed in real time. Fourth, the processing must be feasible on embedded hardware when the target application is smart glasses or similar devices. In such videos, frequent pedestrian occlusion and collision avoidance behavior further lower tracking efficiency and accuracy. Therefore, this study proposes a social force-optimized multi-pedestrian tracking algorithm for first-person videos that handles frequent occlusions and collisions and thereby improves tracking efficiency and accuracy. Method We use a detection-based tracking algorithm, which reduces the tracking problem to matching detected targets. After initial tracking, a social force model is used to correct the errors caused by frequent occlusion and collision avoidance behavior. First, the feature extraction strategy of the single shot multi-box detector (SSD) is adjusted so that features are extracted from low-level feature maps such as conv4_3, conv6_1, conv6_2, conv7_1, conv7_2, conv8_2, and conv9_2. Then, borrowing the dense and residual connections of DenseNet, we apply a union operation to the input and output of conv6_2 and feed the result to conv7_2 so that features are reused. Next, the aspect ratio of the default boxes is reset: on the basis of the large-scale Caltech pedestrian dataset, the default boxes are simplified to an aspect ratio of 0.41.
These steps simplify computation and reduce interference in pedestrian detection. On the basis of a large-scale ReID dataset, a convolutional neural network model for pedestrian appearance features is built by adding two convolutional layers, a max pooling layer, and six residual modules to the pretrained network, which yields a wide residual network. This network extracts the appearance features of the pedestrian target boxes. Preliminary tracking results are obtained by computing the similarity of pedestrian features: the location matching degree is calculated first, followed by the appearance feature matching degree and the fused matching degree. The Kuhn-Munkres algorithm then associates the detection results. Lastly, a social force model is introduced to optimize the preliminary tracking results. The first step defines the grouping behavior of pedestrians; the group of each tracked pedestrian is computed, and a group identifier is added. Under occlusion, pedestrians in the same group are still tracked accurately because the group identifier is maintained. The second step defines the pedestrian domain and computes the repulsion for pedestrian groups that cross the domain, so that the tracking boxes continue to follow their pedestrians closely after collision avoidance. Result The algorithm is compared with other tracking algorithms on six first-person video sequences from the public datasets Eidgenössische Technische Hochschule (ETH), multi-object tracking 16 (MOT16), and Adelaide (ADL). It runs at a near-real-time speed of 20.8 frames per second, and its multiple object tracking accuracy (MOTA) is improved by 2.5%. Of the six tracking indicators, the proposed algorithm obtains the best result on four and the second-best result on the other two.
On those two indicators, lifted multicut and person (LMP_p) obtained the best mostly tracked (MT) result, but at the cost of operating efficiency, and simple online and realtime tracking (SORT) was the fastest on the Hz indicator, but its other indicators are average. In the efficiency comparison, the proposed method runs at approximately 20 frames per second on the six video sequences, a near-real-time speed that is second only to SORT. However, SORT trades accuracy for speed and therefore often suffers from problems such as tracking failure. Conclusion This study examines several issues in first-person pedestrian tracking and proposes social force-optimized multi-pedestrian tracking for first-person videos. The core idea is to reduce the tracking problem to matching detection results: an SSD detects pedestrians, and the appearance features of the pedestrians serve as the main basis for data association. A social force model then corrects the tracking errors caused by frequent occlusion and collision avoidance. The method also copes well with foreground and background that are hard to distinguish, inconspicuous features, numerous pedestrian targets, and lighting changes. Experimental results on first-person video sequences show that, compared with existing mainstream general-purpose tracking methods, the proposed method has higher tracking accuracy and better real-time performance. These results validate the effectiveness of the proposed method for multi-pedestrian tracking in first-person videos.
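The association step described in the Method part can be sketched as follows. The abstract does not state the matching formulas, so this sketch assumes intersection over union (IoU) as the location matching degree, cosine similarity as the appearance matching degree, and a weighted sum of the two as the fused matching degree. The names `fused_cost`, `kuhn_munkres`, and `associate` and the parameters `weight` and `max_cost` are hypothetical. The assignment uses a standard O(n^3) formulation of the Kuhn-Munkres (Hungarian) algorithm, not the authors' code.

```python
import math


def iou(a, b):
    """Location matching (assumed): IoU of boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Appearance matching (assumed): cosine similarity of feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def fused_cost(tracks, detections, weight=0.5):
    """Cost matrix 1 - fused matching degree; items are (box, feature) pairs."""
    return [[1.0 - (weight * iou(tb, db) + (1.0 - weight) * cosine(tf, df))
             for db, df in detections]
            for tb, tf in tracks]


def kuhn_munkres(cost):
    """Minimum-cost assignment; returns sorted (row, column) index pairs."""
    if not cost or not cost[0]:
        return []
    transposed = len(cost) > len(cost[0])
    if transposed:  # the loop below needs rows <= columns
        cost = [list(col) for col in zip(*cost)]
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)  # row and column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)    # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column
                break
        while j0:  # flip the augmenting path back to column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    pairs = [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]
    if transposed:
        pairs = [(r, c) for c, r in pairs]
    return sorted(pairs)


def associate(tracks, detections, weight=0.5, max_cost=0.7):
    """Match tracks to detections and drop pairs whose fused cost is too high."""
    cost = fused_cost(tracks, detections, weight)
    return [(t, d) for t, d in kuhn_munkres(cost) if cost[t][d] <= max_cost]


tracks = [((0, 0, 10, 20), (1.0, 0.0)), ((50, 0, 60, 20), (0.0, 1.0))]
detections = [((51, 0, 61, 20), (0.1, 0.9)), ((1, 0, 11, 20), (0.9, 0.1))]
print(associate(tracks, detections))
```

In this toy input, each detection overlaps one existing track and has a similar feature vector, so the assignment pairs track 0 with detection 1 and track 1 with detection 0. The `max_cost` gate is an added assumption that leaves weak matches unassigned, for example so that new tracks can be started for them.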
Keywords
