A spatio-temporal encoded network for single object tracking
2022, Vol. 27, No. 9, Pages 2733-2748
Received: 2021-12-17
Revised: 2022-05-16
Accepted: 2022-05-23
Published in print: 2022-09-16
DOI: 10.11834/jig.211157
Objective
With the emergence of deep neural networks, visual tracking has developed rapidly. The spatio-temporal properties of video in the tracking task, especially temporal appearance consistency, leave a large space for exploration. This paper proposes a novel, simple, and practical tracking algorithm, the temporal-aware network (TAN), which starts from the video perspective and encodes the temporal and spatial features of a sequence simultaneously.
Method
TAN embeds a new temporal aggregation module (TAM) to exchange and fuse information from multiple historical frames, so it can adapt to appearance changes of the target, such as deformation and rotation, without any model update strategy. To build a simple and practical tracking framework, a target estimation strategy is designed: the four corners of the target are detected, the diagonal corner pairs form two groups of candidate boxes, and the final target location is determined with a box selection strategy, which effectively copes with difficulties such as occlusion. After offline training, the proposed tracker TAN performs tracking by fully feed-forward inference without any model update.
Result
On the public datasets OTB50 (online object tracking: a benchmark), OTB100, TrackingNet, LaSOT (a high-quality benchmark for large-scale single object tracking), and UAV123 (a benchmark and simulator for UAV tracking), the proposed tracker reaches the leading level among small network models while maintaining a high processing speed (70 frames per second). Compared with multiple state-of-the-art trackers, TAN achieves a good balance between performance and speed; even against trackers that use complicated template update strategies or online update mechanisms, TAN still shows superior performance. Ablation studies further verify the effectiveness of each proposed module.
Conclusion
The proposed tracker is trained completely offline, its forward inference requires no online model update strategy, and it adapts to appearance changes of the target, achieving better performance than other lightweight trackers.
Objective
Visual tracking has developed dramatically with the rise of deep neural networks. Single object tracking aims to track an arbitrary object throughout a video stream given only the bounding box of the object in the initial frame. It is an essential task for many computer vision applications such as surveillance systems, robotics, and human-computer interaction. Simple, small-scale, easy-to-use, and purely feed-forward trackers are preferred in resource-constrained application scenarios, yet most existing methods pursue top performance alone. Instead, we break this paradox from another perspective, modeling the key temporal cues inside our network and dispensing with model update processes and large-scale models. The intrinsic properties of video streams deserve further exploration in the research community, and viewing tracking as a video analysis task helps formulate the basis of the tracking task itself. First, the object obeys a spatial displacement constraint: its locations in adjacent frames will not differ greatly unless dramatic camera motion occurs. Almost all visual trackers follow this path, searching a new frame within a window that overlaps the target location in the last frame.
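As a rough illustration of this constraint (a minimal sketch; the window size and the shift-instead-of-pad border policy are our own assumptions, not details from the paper), the search region for a new frame can simply be cropped around the previous target center:

```python
import numpy as np

def crop_search_region(frame: np.ndarray, cx: int, cy: int, size: int) -> np.ndarray:
    """Crop a size x size search window centered on the last target center.

    Border handling is simplified: the window is shifted to stay inside
    the frame rather than padded.
    """
    h, w = frame.shape[:2]
    x1 = min(max(cx - size // 2, 0), max(w - size, 0))
    y1 = min(max(cy - size // 2, 0), max(h - size, 0))
    return frame[y1:y1 + size, x1:x1 + size]
```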
Second, the video exhibits temporal appearance consistency: the target information changes smoothly across the preceding frames. This can be regarded as temporal context, which provides clear cues for the subsequent predictions.
However, this second property has not been fully explored in the literature. Existing methods leverage temporal appearance consistency in two ways. 1) Use the target information only in the first frame, modeling visual tracking as a matching problem between the given initial patch and the follow-up frames. Siamese-network-based methods are the most popular and effective methods in this category. They apply a one-shot learning scheme for visual tracking, where the object patch in the first frame is treated as an exemplar and the patches in the search regions of the consecutive frames are regarded as candidate instances; the task becomes finding the most similar instance in each frame.
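This matching scheme can be sketched as a cross-correlation between exemplar and search features (a minimal SiamFC-style sketch under our own shape assumptions, not the exact design of any particular tracker):

```python
import torch
import torch.nn.functional as F

def match(exemplar_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Slide the exemplar features over the search features.

    exemplar_feat: (C, h, w) features of the first-frame target patch.
    search_feat:   (C, H, W) features of the current search region.
    Returns an (H - h + 1, W - w + 1) response map whose peak marks the target.
    """
    response = F.conv2d(search_feat.unsqueeze(0),    # input:  (1, C, H, W)
                        exemplar_feat.unsqueeze(0))  # kernel: (1, C, h, w)
    return response.squeeze()
```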
This paradigm completely ignores the other historical frames, handles each frame independently, and causes tremendous information loss. 2) Use both the given initial patch and the historical target patches, taken from every frame or from selected frames, to predict the object location in a new frame; this category includes traditional and deep-neural-network-based methods. Traditional trackers, such as those based on the correlation filter (CF), learn their models or classifiers from the first frame and update them in the subsequent frames with a small learning rate. Diverse deep-neural-network-based methods first learn their models offline with vast training data and then fine-tune them online at the initial frame and other frames. However, balancing accuracy and latency remains an open problem, especially for deep-neural-network-based methods. Moreover, network fine-tuning is forbidden in some practical applications, for example when the tracker is deployed on inference chips, which hinders the wide deployment of these methods.
Method
We propose a novel and straightforward tracker that reformulates the visual tracking problem from the perspective of video analysis. A new temporal-aware network (TAN) is designed to encode target information from multiple frames, aiming to exploit the temporal appearance consistency and the spatial displacement constraint in the forward path without any online model update. To exchange and fuse information from the historical input frames, we introduce temporal aggregation modules into TAN, which empower the tracker with the ability to learn spatio-temporal features. To balance the computational burden of the multi-frame inputs against tracking accuracy, we employ the shallow ResNet-18 as our feature extraction backbone and achieve a high speed of over 70 frames per second.
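One way such a temporal aggregation module could fuse multi-frame features is a learned per-location weighting over time (a minimal PyTorch sketch; the attention-style weighting and the channel width are our assumptions, not the module's published architecture):

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Fuses per-frame feature maps from T historical frames into one map."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # A 1x1 conv scores how informative each frame is at each location.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) -- features of T historical frames
        b, t, c, h, w = feats.shape
        scores = self.score(feats.flatten(0, 1)).view(b, t, 1, h, w)
        weights = torch.softmax(scores, dim=1)  # normalize across time
        return (weights * feats).sum(dim=1)     # (B, C, H, W)
```

Stacking a few such modules would let features from older frames inform the current prediction without any online update.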
Our tracker runs completely feed-forward and adapts to the target's appearance changes with the offline-trained, temporally encoded TAN; in previous trackers, temporal appearance consistency is maintained by the first frame or by historical frames, which requires expensive online fine-tuning to stay adaptive. To further construct a complete and simple tracking pipeline, we design a novel anchor-free and proposal-free target estimation method, which detects the four corners of the target (top-left, top-right, bottom-left, and bottom-right) with a corner detection head in TAN. Since the target location can be determined either by a pair of top-left and bottom-right corners or by a pair of top-right and bottom-left corners, we use a center score map, rather than complicated embedding constraints, to indicate the confidence of these two candidate bounding boxes, which easily locates the target. Thanks to the corner-based target estimation mechanism, our tracker is capable of handling challenging scenarios that involve significant appearance changes.
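The selection between the two diagonal corner pairs can be sketched as follows (illustrative only; the peak picking and the way the center score map arbitrates are simplified assumptions about the described strategy, and well-formed corner detections are assumed):

```python
import numpy as np

def select_box(tl, tr, bl, br, center):
    """Pick one of the two diagonal corner pairs via a center score map.

    tl, tr, bl, br, center: (H, W) heatmaps from the corner/center heads.
    Returns (x1, y1, x2, y2) in heatmap coordinates.
    """
    def peak(m):
        y, x = np.unravel_index(np.argmax(m), m.shape)
        return x, y

    (x_tl, y_tl), (x_br, y_br) = peak(tl), peak(br)
    (x_tr, y_tr), (x_bl, y_bl) = peak(tr), peak(bl)
    box_a = (x_tl, y_tl, x_br, y_br)  # top-left + bottom-right diagonal
    box_b = (x_bl, y_tr, x_tr, y_bl)  # top-right + bottom-left diagonal

    def center_conf(b):
        cx, cy = (b[0] + b[2]) // 2, (b[1] + b[3]) // 2
        return center[cy, cx]  # center score at the box center

    return box_a if center_conf(box_a) >= center_conf(box_b) else box_b
```

In practice, the heatmaps would come from the corner detection head and the chosen box would be mapped back to image coordinates.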
Result
Without bells and whistles, our method shows its potential on several public datasets, including OTB50 (online object tracking: a benchmark), OTB100, TrackingNet, LaSOT (a high-quality benchmark for large-scale single object tracking), and UAV123 (a benchmark and simulator for UAV tracking). Our real-time speed and simplified pipeline make TAN suitable for real applications, especially on resource-limited platforms where large models and online model updates are not supported.
Conclusion
The proposed tracker provides a new perspective for single object tracking by mining the video nature of the task, especially its temporal appearance consistency.
Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional siamese networks for object tracking//Proceedings of European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 850-865 [DOI: 10.1007/978-3-319-48881-3_56]
Bhat G, Danelljan M, van Gool L and Timofte R. 2019. Learning discriminative model prediction for tracking//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6181-6190 [DOI: 10.1109/ICCV.2019.00628]
Chen C, Deng Z H, Gao Y L and Wang S T. 2020. Single target tracking algorithm based on multi-fuzzy kernel fusion. Journal of Frontiers of Computer Science and Technology, 14(5): 848-860 [DOI: 10.3778/j.issn.1673-9418.1901063]
Chen Z D, Zhong B N, Li G R, Zhang S P and Ji R R. 2020. Siamese box adaptive network for visual tracking//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6667-6676 [DOI: 10.1109/CVPR42600.2020.00670]
Dai K N, Zhang Y H, Wang D, Li J H, Lu H C and Yang X Y. 2020. High-performance long-term tracking with meta-updater//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6297-6306 [DOI: 10.1109/CVPR42600.2020.00633]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2019. ATOM: accurate tracking by overlap maximization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4655-4664 [DOI: 10.1109/CVPR.2019.00479]
Danelljan M, van Gool L and Timofte R. 2020. Probabilistic regression for visual tracking//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 7181-7190 [DOI: 10.1109/CVPR42600.2020.00721]
Ding H and Zhang W S. 2012. Multi-target tracking approach combined with SPA occlusion segmentation. Journal of Image and Graphics, 17(1): 90-98 [DOI: 10.11834/jig.20120113]
Du F, Liu P, Zhao W and Tang X L. 2020. Correlation-guided attention for corner detection based visual tracking//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6835-6844 [DOI: 10.1109/CVPR42600.2020.00687]
Fan H, Lin L T, Yang F, Chu P, Deng G, Yu S J, Bai H X, Xu Y, Liao C Y and Ling H B. 2019. LaSOT: a high-quality benchmark for large-scale single object tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5369-5378 [DOI: 10.1109/CVPR.2019.00552]
Fan H and Ling H B. 2019. Siamese cascaded region proposal networks for real-time visual tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7944-7953 [DOI: 10.1109/CVPR.2019.00814]
Gao P, Yuan R Y, Wang F, Xiao L Y, Fujita H and Zhang Y. 2020. Siamese attentional keypoint network for high performance visual tracking. Knowledge-Based Systems, 193: #105448 [DOI: 10.1016/j.knosys.2019.105448]
Gong H Y, Ren H G, Shi T and Li F J. 2018. Sparse subspace single target tracking algorithm based on improved particle filtering. Modern Electronics Technique, 41(13): 10-13 [DOI: 10.16652/j.issn.1004-373x.2018.13.003]
Guo D Y, Wang J, Cui Y, Wang Z H and Chen S Y. 2020. SiamCAR: siamese fully convolutional classification and regression for visual tracking//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6268-6276 [DOI: 10.1109/CVPR42600.2020.00630]
Hare S, Golodetz S, Saffari A, Vineet V, Cheng M M, Hicks S L and Torr P H S. 2016. Struck: structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10): 2096-2109 [DOI: 10.1109/TPAMI.2015.2509974]
Held D, Thrun S and Savarese S. 2016. Learning to track at 100 fps with deep regression networks//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 749-765 [DOI: 10.1007/978-3-319-46448-0_45]
Henriques J F, Caseiro R, Martins P and Batista J. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 583-596 [DOI: 10.1109/TPAMI.2014.2345390]
Huang L H, Zhao X and Huang K Q. 2019a. Bridging the gap between detection and tracking: a unified approach//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3998-4008 [DOI: 10.1109/ICCV.2019.00410]
Huang L H, Zhao X and Huang K Q. 2019b. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5): 1562-1577 [DOI: 10.1109/TPAMI.2019.2957464]
Law H and Deng J. 2019. CornerNet: detecting objects as paired keypoints [EB/OL]. [2019-05-18]. https://arxiv.org/pdf/1808.01244.pdf
Li B, Wu W, Wang Q, Zhang F Y, Xing J L and Yan J J. 2019. SiamRPN++: evolution of siamese visual tracking with very deep networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4277-4286 [DOI: 10.1109/CVPR.2019.00441]
Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with siamese region proposal network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980 [DOI: 10.1109/CVPR.2018.00935]
Li Q, Qin Z K, Zhang W B and Zheng W. 2020. Siamese keypoint prediction network for visual object tracking [EB/OL]. [2020-06-07]. https://arxiv.org/pdf/2006.04078.pdf
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Müller M, Smith N and Ghanem B. 2016. A benchmark and simulator for UAV tracking//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 445-461 [DOI: 10.1007/978-3-319-46448-0_27]
Müller M, Bibi A, Giancola S, Alsubaihi S and Ghanem B. 2018. TrackingNet: a large-scale dataset and benchmark for object tracking in the wild//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 300-327 [DOI: 10.1007/978-3-030-01246-5_19]
Nam H, Baek M and Han B. 2016. Modeling and propagating CNNs in a tree structure for visual tracking [EB/OL]. [2020-08-25]. https://arxiv.org/pdf/1608.07242.pdf
Ning J F, Zhao Y B and Shi W Z. 2014. Multiple instance learning based object tracking with multi-channel Haar-like feature. Journal of Image and Graphics, 19(7): 1038-1045 [DOI: 10.11834/jig.20140707]
Park E and Berg A C. 2018. Meta-tracker: fast and robust online adaptation for visual object trackers//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 587-604 [DOI: 10.1007/978-3-030-01219-9_35]
Ren X Y, Liao Y T, Zhang G L and Zhang T X. 2002. A new correlation tracking method. Journal of Image and Graphics, 7(6): 553-557 [DOI: 10.3969/j.issn.1006-8961.2002.06.006]
Song J F, Miao Q G, Wang C X, Xu H and Yang J. 2021. Multi-scale single object tracking based on the attention mechanism. Journal of Xidian University, 48(5): 110-116 [DOI: 10.19665/j.issn1001-2400.2021.05.014]
Tao R, Gavves E and Smeulders A W M. 2016. Siamese instance search for tracking//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1420-1429 [DOI: 10.1109/CVPR.2016.158]
Wang Q, Zhang L, Bertinetto L, Hu W M and Torr P H S. 2019. Fast online object tracking and segmentation: a unifying approach//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1328-1338 [DOI: 10.1109/CVPR.2019.00142]
Wang X and Tang Z M. 2010. Application of particle filter based on feature fusion in small IR target tracking. Journal of Image and Graphics, 15(1): 91-97 [DOI: 10.11834/jig.20100115]
Wu Y, Lim J and Yang M H. 2013. Online object tracking: a benchmark//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 2411-2418 [DOI: 10.1109/CVPR.2013.312]
Xu Y D, Wang Z Y, Li Z X, Yuan Y and Yu G. 2020. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines//Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 12549-12556 [DOI: 10.1609/aaai.v34i07.6944]
Yang T Y and Chan A B. 2018. Learning dynamic memory networks for object tracking//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 153-169 [DOI: 10.1007/978-3-030-01240-3_10]
Yang T Y and Chan A B. 2021. Visual tracking via dynamic memory networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1): 360-374 [DOI: 10.1109/TPAMI.2019.2929034]
Yang T Y, Xu P F, Hu R B, Chai H and Chan A B. 2020. ROAM: recurrently optimizing tracking model//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6717-6726 [DOI: 10.1109/CVPR42600.2020.00675]
Yu Y C, Xiong Y L, Huang W L and Scott M R. 2020. Deformable siamese attention networks for visual object tracking//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6727-6736 [DOI: 10.1109/CVPR42600.2020.00676]
Zhang Z P, Peng H W, Fu J L, Li B and Hu W M. 2020. Ocean: object-aware anchor-free tracking//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 771-787 [DOI: 10.1007/978-3-030-58589-1_46]
Zhou X Y, Wang L, Ma Y X and Chen P B. 2021. Single object tracking of LiDAR point cloud combined with auxiliary deep neural network. Chinese Journal of Lasers, 48(21): #2110001 [DOI: 10.3788/CJL202148.2110001]
Zhou W Z, Wen L Y, Zhang L B, Du D W, Luo T J and Wu Y J. 2020. SiamMan: siamese motion-aware network for visual tracking [EB/OL]. [2020-01-18]. https://arxiv.org/pdf/1912.05515.pdf
Zhu Z, Huang G, Zou W, Du D L and Huang C. 2017. UCT: learning unified convolutional networks for real-time visual tracking//Proceedings of 2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy: IEEE: 1973-1982 [DOI: 10.1109/ICCVW.2017.231]
Zhu Z, Wang Q, Li B, Wu W, Yan J J and Hu W M. 2018. Distractor-aware siamese networks for visual object tracking//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 103-119 [DOI: 10.1007/978-3-030-01240-3_7]
Zuo W M, Wu X H, Lin L, Zhang L and Yang M H. 2019. Learning support correlation filters for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5): 1158-1172 [DOI: 10.1109/TPAMI.2018.2829180]