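1. Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing 100094;
2. Key Laboratory of Space Utilization, Chinese Academy of Sciences, Beijing 100094;
3. University of Chinese Academy of Sciences, Beijing 100049
 CLC number: P23 Document code: A Article ID: 1006-8961(2021)11-2741-10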

# Keywords

Integrating multiple features for tracking vehicles in satellite videos
Han Mingfei1,2, Li Shengyang1,2,3, Wan Xue1,2, Xuan Shiyu1,2, Zhao Zifei1,2,3, Tan Hong1,2, Zhang Wanfeng1,2
1. Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing 100094, China;
2. Key Laboratory of Space Utilization, Chinese Academy of Sciences, Beijing 100094, China;
3. University of Chinese Academy of Sciences, Beijing 100049, China
# Abstract

Objective Video satellites are a new type of remote sensing system capable of capturing both dynamic video and conventional still images. Compared with conventional very-high-resolution (VHR) remote sensing systems, a video satellite observes the Earth with real-time temporal resolution, which has led to studies in traffic density estimation, object detection, and 3D reconstruction. Owing to this high temporal resolution, satellite video has strong potential for monitoring traffic, animal migration, and ships entering and leaving ports. Despite extensive research on conventional video, relatively little work has addressed object tracking in satellite video. Existing object tracking methods primarily target relatively large objects, such as trains and planes. Several researchers have explored replacing or fusing the motion feature to predict object position more accurately. However, few studies have addressed the problem caused by the limited information available for smaller objects, such as vehicles. Tracking vehicles in satellite video poses three main challenges. The first is the small size of the target: while a single frame can be as large as 12 000×4 000 pixels, moving targets such as cars occupy only 10 to 30 pixels. The second is the lack of clear texture, because vehicle targets contain limited and/or confusing information. The third is that, unlike aircraft and ships, vehicles are more likely to appear against complex backgrounds, which makes tracking harder. For instance, a vehicle may make quick turns, be partially occluded, or undergo sudden changes in illumination. Selecting or constructing a single image feature that handles all of these situations is difficult. To tackle these challenges, we propose using multiple complementary image features, merged into a unified framework based on a lightweight kernelized correlation filter.
Method First, two complementary features with a degree of invariance and discriminative ability, histogram of oriented gradients (HOG) and raw pixels, are used as descriptors of the target image patch. HOG is tied to the edge information of vehicles, such as orientation, and offers some discriminative ability. A HOG-based tracker can distinguish the target even under partial occlusion or changes in illumination or road color. However, it cannot correctly separate the target from similarly shaped objects in its surroundings, because the information it encodes is insufficient. The raw pixel feature, in contrast, describes the entire content of the image patch without processing, so more information is retained, which matters given the small size of vehicles. It is invariant to the in-plane motion of a rigid object with little texture and therefore tolerates changes in vehicle orientation. However, it fails to track vehicles that are partially occluded or that undergo changes in road color and illumination. Second, a response map merging strategy is proposed to fuse the complementary image features by maintaining two trackers, one using the HOG feature to discriminate the target and the other using the raw pixel feature to improve invariance. In this manner, a peak response may arise at a new position that reflects both invariance and discriminative ability. Finally, because the target carries limited information and the observation model has limited discriminative ability, responses usually show a multipeak pattern when a disturbance exists. A model updater based on a response distribution criterion (RDC) is therefore used to measure the distribution of the merged responses. Using a correlation filter also facilitates multiple-vehicle tracking owing to its computational speed and online training mechanism.
Result Our model is compared with six state-of-the-art correlation filter-based models. Experiments are performed on eight satellite videos captured at different locations worldwide under challenging situations, such as illumination variation, quick turns, partial occlusion, and road color change. Precision plots and success plots are adopted for evaluation. Ablation experiments are performed to demonstrate the effectiveness of the proposed method, and quantitative assessments show that it strikes an effective balance between the two base trackers. Visualization results on three videos further illustrate how this balance is achieved. Our method outperforms all six state-of-the-art methods.
Conclusion In this paper, a new tracker that fuses complementary image features is proposed for vehicle tracking in satellite videos. To overcome the difficulties posed by the small size of the target, the lack of texture, and the complex background in satellite video tracking, HOG and raw pixel features are combined by merging the response maps of two trackers, which increases both discriminative ability and invariance. Experiments on eight satellite videos under challenging circumstances demonstrate that our method outperforms other state-of-the-art algorithms in precision plots and success plots.

# Key words

object tracking; satellite video; kernelized correlation filter; feature fusion; vehicle tracking

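# 1.1 Feature extraction

HOG is a histogram of gradient orientations computed over local regions, and it is closely tied to the edge description of a vehicle (such as its orientation). The scale-invariant feature transform (SIFT) feature can also capture the gradient information of a target, but it is computationally expensive and has limited ability to extract corner points from blurred images, so it is unsuitable for object tracking in satellite video. HOG offers some discriminative ability but suffers from insufficient information. For example, after a vehicle in the video suddenly changes direction, HOG cannot correctly distinguish the target from surrounding objects with similar shape and orientation. Nevertheless, a HOG-based tracker can correctly separate the target from background interference under partial occlusion, illumination change, or road color change.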
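
As a rough illustration of the two descriptors discussed here, the following sketch computes a single-cell histogram of gradient orientations and a flattened raw-pixel vector for a small patch. It is a simplified stand-in for HOG (no cell grid, block normalization, or bin interpolation). The function names, the 9-bin setting, and the toy patches are assumptions made for illustration, not the authors' implementation.

```python
import math


def orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    height, width = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Central differences approximate the image gradient.
            gx = patch[r][c + 1] - patch[r][c - 1]
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / bin_width) % n_bins] += magnitude
    total = sum(hist)
    # L1-normalize so the histogram does not depend on overall contrast.
    return [v / total for v in hist] if total > 0 else hist


def raw_pixel_feature(patch):
    """Raw-pixel descriptor: the patch values, flattened without further processing."""
    return [float(v) for row in patch for v in row]


# Toy 5x5 patches (hypothetical data): one vertical and one horizontal edge.
vertical_edge = [[0, 0, 0, 1, 1] for _ in range(5)]
horizontal_edge = [[0] * 5, [0] * 5, [0] * 5, [1] * 5, [1] * 5]

hist_vertical = orientation_histogram(vertical_edge)
hist_horizontal = orientation_histogram(horizontal_edge)
raw_vertical = raw_pixel_feature(vertical_edge)
```

A vertical edge places all gradient energy in the bin around 0 degrees and a horizontal edge in the bin around 90 degrees, which is why such a histogram is sensitive to vehicle orientation, whereas the raw-pixel vector simply keeps every intensity value.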

# 1.2 Observation model

 $\mathop {\min }\limits_{\boldsymbol{a}} {\left({\boldsymbol{y} - \boldsymbol{Ka}} \right)^{\rm{T}}}\left({\boldsymbol{y} - \boldsymbol{Ka}} \right) + \lambda {\boldsymbol{a}^{\rm{T}}}\boldsymbol{Ka}$ (1)

 $\boldsymbol{F}(\boldsymbol{a})=\frac{\boldsymbol{F}(\boldsymbol{y})}{\boldsymbol{F}\left(\boldsymbol{k}^{x x}\right)+\lambda}$ (2)

 $\boldsymbol{m} = {\boldsymbol{F}^{ - 1}}\left({\boldsymbol{F}\left({{\boldsymbol{k}^{\tilde xz}}} \right) \cdot \boldsymbol{F}\left(\boldsymbol{a} \right)} \right)$ (3)
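
Eq. (2) and Eq. (3) can be illustrated with a small one-dimensional sketch: the filter coefficients are solved element-wise in the Fourier domain, and the response to a new sample is obtained with one inverse transform. The sketch assumes a linear kernel, a 1D signal in place of an image patch, and a naive DFT to stay dependency-free. The names `train`, `detect`, `linear_kernel_hat`, and the toy signal are hypothetical and not taken from the paper.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def linear_kernel_hat(x, z):
    """F(k^{xz}) for a linear kernel (an assumption): conj(F(x)) * F(z), element-wise."""
    return [xf.conjugate() * zf for xf, zf in zip(dft(x), dft(z))]


def train(x, y, lam):
    """Eq. (2): F(a) = F(y) / (F(k^{xx}) + lambda)."""
    return [yf / (kf + lam) for yf, kf in zip(dft(y), linear_kernel_hat(x, x))]


def detect(alpha_hat, x_model, z):
    """Eq. (3): m = F^{-1}(F(k^{xz}) * F(a)); returns the real part of the response."""
    product = [kf * af for kf, af in zip(linear_kernel_hat(x_model, z), alpha_hat)]
    return [v.real for v in dft(product, inverse=True)]


n = 32
# Toy training signal (hypothetical data) and a Gaussian label peaked at index 0.
x = [math.sin(0.7 * t) + 0.5 * math.cos(2.3 * t) + (t * 37 % 11) / 11.0 for t in range(n)]
y = [math.exp(-min(t, n - t) ** 2 / (2 * 2.0 ** 2)) for t in range(n)]

# New sample: the training signal cyclically shifted by `shift` positions.
shift = 5
z = x[-shift:] + x[:-shift]

alpha_hat = train(x, y, lam=1e-4)
response = detect(alpha_hat, x, z)
peak = max(range(n), key=lambda i: response[i])
```

Shifting the training signal by five samples moves the response peak to index 5, which is how a correlation-filter tracker reads off the displacement of the target between frames.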

# 1.3 Response map fusion

 $\boldsymbol{m}_{\mathrm{M}}(\boldsymbol{x}, \boldsymbol{y})=\frac{\left(\boldsymbol{m}_{\mathrm{H}}(\boldsymbol{x}, \boldsymbol{y})+\boldsymbol{m}_{\mathrm{G}}(\boldsymbol{x}, \boldsymbol{y})\right)}{2}$ (4)

 $\mathrm{pos}_t = \mathop{\arg\max}\limits_{(\boldsymbol{x}, \boldsymbol{y})} \boldsymbol{m}_{\mathrm{M}}(\boldsymbol{x}, \boldsymbol{y})$ (5)
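
A minimal sketch of Eq. (4) and Eq. (5), reading Eq. (5) as selecting the location of the maximum of the merged map. The two response maps below are made-up numbers chosen so that the HOG response is slightly higher on a distractor while the raw-pixel response favors the true target. The function names are hypothetical.

```python
def fuse_responses(m_hog, m_raw):
    """Eq. (4): element-wise mean of the HOG and raw-pixel response maps."""
    return [
        [(h + g) / 2.0 for h, g in zip(row_h, row_g)]
        for row_h, row_g in zip(m_hog, m_raw)
    ]


def peak_position(response):
    """Eq. (5): (row, column) of the maximum of a response map."""
    best_value, best_pos = None, None
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos


# Hypothetical 3x4 response maps: target at (1, 1), distractor at (2, 3).
m_hog = [
    [0.10, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.75],
]
m_raw = [
    [0.10, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.20],
]

m_merged = fuse_responses(m_hog, m_raw)
pos_hog = peak_position(m_hog)
pos_merged = peak_position(m_merged)
```

In this toy case the HOG map alone peaks on the distractor at (2, 3), while the merged map peaks on the target at (1, 1), showing how averaging lets one tracker compensate for the other.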

# 1.4 Model update

 $\boldsymbol{F}\left(\boldsymbol{a}_{t}\right)= \eta \frac{\boldsymbol{F}(\boldsymbol{y})}{\boldsymbol{F}\left(\boldsymbol{k}^{z z}\right)+\lambda}+(1-\eta) \boldsymbol{F}\left(\boldsymbol{a}_{t-1}\right)$ (6)

 $\tilde{\boldsymbol{x}}_{t}=\eta \boldsymbol{z}+(1-\eta) \tilde{\boldsymbol{x}}_{t-1}$ (7)

 $RDC = \sqrt {\sum\limits_{i = 1}^t {{{\left({{S_m}(i) - \mu } \right)}^2}} }$ (8)

 $\eta = \left\{ \begin{array}{l} \zeta \;\;\;RDC > r\\ 0\;\;\;\;\text{otherwise} \end{array} \right.$ (9)
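
The gated update of Eq. (6) to Eq. (9) can be sketched as follows. The sketch reads $S_m(i)$ in Eq. (8) as the entries of the merged response map and $\mu$ as their mean, and it assumes that $\eta$ weights the newly observed sample in both interpolations. Both readings are interpretations, and the threshold `R_THRESHOLD`, the rate `ZETA`, and the toy maps are hypothetical values.

```python
import math


def rdc(merged_response):
    """Eq. (8): root of the summed squared deviations from the mean response."""
    values = [v for row in merged_response for v in row]
    mu = sum(values) / len(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values))


def learning_rate(merged_response, r, zeta):
    """Eq. (9): eta = zeta when RDC > r, otherwise 0 (the model is left unchanged)."""
    return zeta if rdc(merged_response) > r else 0.0


def interpolate(previous, new, eta):
    """Linear interpolation of the form used in Eq. (6) and Eq. (7); eta weights the new sample."""
    return [eta * n + (1.0 - eta) * p for p, n in zip(previous, new)]


R_THRESHOLD = 0.5  # hypothetical value of r
ZETA = 0.02        # hypothetical value of zeta

# A sharp single peak (reliable) versus a flat multipeak pattern (disturbed).
single_peak = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
multi_peak = [[0.5, 0.4, 0.5], [0.4, 0.5, 0.4], [0.5, 0.4, 0.5]]

eta_reliable = learning_rate(single_peak, R_THRESHOLD, ZETA)
eta_unreliable = learning_rate(multi_peak, R_THRESHOLD, ZETA)

template_prev, template_new = [1.0, 2.0], [3.0, 4.0]
template_kept = interpolate(template_prev, template_new, eta_unreliable)
template_updated = interpolate(template_prev, template_new, eta_reliable)
```

With these toy numbers the sharp response exceeds the threshold and triggers a small update, while the multipeak response yields $\eta = 0$ and the previous template is kept, which avoids learning from a frame in which the target location is ambiguous.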

# 2.1 Dataset

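Table 1 Challenging situations present in each video of Fig. 2

 Video | Illumination change | Quick turn | Partial occlusion | Shadow | Road color change | Similar vehicles
 Fig. 2(a) √
 Fig. 2(b) √ √ √
 Fig. 2(c) √ √
 Fig. 2(d) √ √
 Fig. 2(e) √ √
 Fig. 2(f) √ √
 Fig. 2(g) √ √ √
 Fig. 2(h) √ √
 Note: "√" indicates that the video contains this type of situation.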

# 2.3 Results and analysis

Table 2 Comparison of accuracy among different methods

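| Method | AUC/% (CLE) | AUC/% (VOR) |
| --- | --- | --- |
| MoFusion-HOG | 91.42 | **68.38** |
| MoFusion-Raw | 63.99 | 48.21 |
| Ours | **94.35** | 64.26 |

Note: Bold indicates the best result in each column.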

