Double template fusion based Siamese network for robust visual object tracking
2022, Vol. 27, No. 4: 1191-1203
Received: 2020-11-13
Revised: 2021-02-19
Accepted: 2021-02-26
Published in print: 2022-04-16
DOI: 10.11834/jig.200660
Objective
Visual object tracking algorithms fall into two main categories: correlation filter based trackers and Siamese network based trackers. The former achieve high accuracy but run slowly and cannot meet real-time requirements. The latter deliver excellent performance in both speed and accuracy; however, the vast majority of Siamese network based trackers still use a single fixed template, which makes it difficult to handle target occlusion, appearance changes, and similar distractors. To address these shortcomings of current Siamese trackers, this paper proposes an efficient and robust double template fusion tracker, the Siamese tracker with double template fusion (Siam-DTF).
Method
The annotated box in the first frame serves as the initial template. During tracking, the appearance template branch uses the appearance template search module to obtain an appropriate, high-quality appearance template for the target. Finally, the double template fusion module performs response map fusion and feature fusion. By combining the respective strengths of the initial template and the appearance template, the fusion module improves the robustness of the tracker.
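To make the fusion step concrete, the following minimal Python sketch shows one way the response maps and features of the two branches could be combined; the weighted average and all names (fuse_double_template, w) are illustrative assumptions, since the text states only that response maps and features are fused.

import numpy as np

# Hypothetical sketch of the double template fusion module. The paper
# specifies fusion of response maps and of features; the weighted
# average below is an assumed instantiation, not the released code.
def fuse_double_template(score_init, score_app, feat_init, feat_app, w=0.5):
    """Combine the initial-template and appearance-template branches.

    score_*: response maps from correlating each template with the search area
    feat_*:  template features extracted by the shared backbone
    w:       assumed weight favoring the stable initial template
    """
    fused_score = w * score_init + (1.0 - w) * score_app  # response map fusion
    fused_feat = w * feat_init + (1.0 - w) * feat_app     # feature fusion
    return fused_score, fused_feat

# Example with toy response maps and template features
s0, s1 = np.random.rand(17, 17), np.random.rand(17, 17)
f0, f1 = np.random.rand(256, 6, 6), np.random.rand(256, 6, 6)
score, feat = fuse_double_template(s0, s1, f0, f1, w=0.6)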
Result
Experiments compare the proposed method with nine recent trackers on three mainstream public tracking datasets. On OTB2015 (object tracking benchmark 2015), our method achieves an AUC (area under curve) score of 0.701 and a precision of 0.918, improving on the second best tracker, SiamRPN++ (Siamese region proposal network++), by 0.6% and 1.3%, respectively. On VOT2016 (visual object tracking 2016), our method obtains the highest expected average overlap (EAO) of 0.477 and the fewest failures (0.172); its EAO score exceeds that of the baseline SiamRPN++ by 1.6% and that of the second best tracker, SiamMask_E, by 1.1%. On VOT2018, our method reaches an EAO of 0.403 and an accuracy of 0.608, ranking second and first among all compared trackers, respectively. The average running speed of our method is 47 frames per second, well above the real-time requirement for tracking.
Conclusion
The proposed double template fusion tracker effectively overcomes the shortcomings of current Siamese network based trackers, improving tracking accuracy and robustness while preserving speed, and is well suited to engineering deployment and applications.
Objective
Visual object tracking (VOT) is a long-standing challenge in computer vision. Current trackers can be roughly divided into two categories: correlation filter trackers and Siamese network based trackers. Correlation filter trackers train a regressor based on circular correlation in the Fourier domain. Siamese network based trackers balance speed and accuracy by exploiting deep features. A Siamese network consists of two branches that implicitly encode the input patches into a shared embedding space and then fuse them into a single output tensor. However, most Siamese network based trackers rely on a single fixed template, which struggles with occlusion, appearance change, and distractors. We present an efficient and robust Siamese network based tracker with double template fusion, referred to as Siam-DTF. Its double template mechanism substantially improves robustness.
Method
Siam-DTF consists of three branches: the initial template z, the appearance template z_a, and the search area x. First, we design the appearance template search module (ATSM), which fully exploits the information of historical frames to efficiently obtain an appropriate, high-quality appearance template when the initial template no longer matches the current frame. The appearance template, which is flexible and adapts to appearance changes of the object, can represent the object well under hard tracking challenges. We choose the frame with the highest confidence among the historical frames to crop the appearance template. To filter out low-quality templates, we discard an appearance template if its predicted box has a low intersection over union or if its confidence score is lower than that of the initial template. To balance the accuracy and speed of the tracker, we adopt a sparse update strategy for the appearance template. Through theoretical analysis and experimental validation, we show that the change of the tracker's confidence score reflects the tracking quality well. When the maximum confidence of the current frame falls below the average confidence of the historical N frames by a certain margin m, we invoke the ATSM to update the appearance template. Finally, our fusion module produces more robust results from the two templates: the initial template branch and the appearance template branch are integrated through fusion of score maps and fusion of features.
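To make the quality filter and the sparse update rule concrete, here is a minimal Python sketch under stated assumptions: the helper names, the IoU threshold, and the default values of N and m are hypothetical, as the abstract specifies the rule but not its implementation.

from collections import deque

N = 10   # number of historical frames to average over (assumed value)
m = 0.1  # confidence margin (assumed value)
conf_history = deque(maxlen=N)  # max confidence of recent frames

def accept_appearance_template(cand_iou, cand_conf, init_conf, iou_thr=0.6):
    """Quality filter: reject a candidate appearance template whose
    predicted box overlaps poorly with the target or whose confidence
    is below the initial template's (iou_thr is an assumed threshold)."""
    return cand_iou >= iou_thr and cand_conf >= init_conf

def need_appearance_update(current_max_conf):
    """Sparse update trigger: invoke the ATSM only when the current
    frame's max confidence drops below the average confidence of the
    last N frames by more than the margin m."""
    trigger = (len(conf_history) == N and
               current_max_conf < sum(conf_history) / N - m)
    conf_history.append(current_max_conf)
    return trigger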
Result
We compare Siam-DTF against nine recent trackers, including correlation filter trackers and Siamese network based trackers, on three public tracking benchmarks: object tracking benchmark 2015 (OTB2015), VOT2016, and VOT2018. On OTB2015, the quantitative evaluation metrics are area under curve (AUC) and precision, and Siam-DTF ranks first on both. Compared with the baseline tracker Siamese region proposal network++ (SiamRPN++), Siam-DTF improves AUC by 0.6% and precision by 1.3%. By exploiting the deep features of historical frames, Siam-DTF also surpasses the correlation filter tracker efficient convolution operators (ECO) in precision by 0.8%. On VOT2016 and VOT2018, the evaluation metrics are accuracy (average overlap during successful tracking) and robustness (failure times); overall performance is measured by expected average overlap (EAO), which accounts for both. On VOT2016, Siam-DTF achieves the best EAO score of 0.477 and the lowest failure rate of 0.172, outperforming the baseline SiamRPN++ and the second best tracker SiamMask_E in EAO by 1.6% and 1.1%, respectively; it also reduces the failure rate of SiamRPN++ from 0.200 to 0.172, indicating strong robustness. On VOT2018, Siam-DTF obtains the best accuracy of 0.608 and the second best EAO score of 0.403. As for speed, Siam-DTF runs efficiently at 47 frames per second (FPS). These consistent results demonstrate the strong generalization ability of Siam-DTF.
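For reference, the success-plot AUC reported on OTB2015 is conventionally computed as the mean success rate over a sweep of overlap thresholds; the sketch below follows this standard OTB recipe and is not code from the paper.

import numpy as np

def success_auc(ious, num_thresholds=21):
    """OTB success-plot AUC: the mean, over overlap thresholds in
    [0, 1], of the fraction of frames whose predicted-vs-ground-truth
    IoU exceeds the threshold."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

# Example: five frames with these per-frame IoU values
print(success_auc([0.8, 0.75, 0.6, 0.0, 0.9]))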
Conclusion
We propose an efficient and robust Siamese tracker with double template fusion, referred to as Siam-DTF. Siam-DTF fully exploits the information of historical frames to obtain an appearance template with good adaptability. Consistent results on all three benchmarks demonstrate the effectiveness and strong generalization of Siam-DTF.
Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional Siamese networks for object tracking//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 850-865 [DOI: 10.1007/978-3-319-48881-3_56]
Bhat G, Danelljan M, Van Gool L and Timofte R. 2019. Learning discriminative model prediction for tracking//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 6181-6190 [DOI: 10.1109/ICCV.2019.00628]
Bolme D S, Beveridge J R, Draper B A and Lui Y M. 2010. Visual object tracking using adaptive correlation filters//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2544-2550 [DOI: 10.1109/CVPR.2010.5539960]
Chen B X and Tsotsos J K. 2019. Fast visual object tracking with rotated bounding boxes [EB/OL]. [2020-09-30]. https://arxiv.org/pdf/1907.03892.pdf
Dai K N, Wang D, Lu H C, Sun C and Li J H. 2019. Visual tracking via adaptive spatially-regularized correlation filters//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 4665-4674 [DOI: 10.1109/CVPR.2019.00480]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2019. ATOM: accurate tracking by overlap maximization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 4655-4664 [DOI: 10.1109/CVPR.2019.00479]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2017a. ECO: efficient convolution operators for tracking//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6931-6939 [DOI: 10.1109/CVPR.2017.733]
Danelljan M, Häger G, Khan F S and Felsberg M. 2017b. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8): 1561-1575 [DOI: 10.1109/TPAMI.2016.2609928]
Danelljan M, Häger G, Khan F S and Felsberg M. 2015. Learning spatially regularized correlation filters for visual tracking//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 4310-4318 [DOI: 10.1109/ICCV.2015.490]
Danelljan M, Robinson A, Khan F S and Felsberg M. 2016. Beyond correlation filters: learning continuous convolution operators for visual tracking//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 472-488 [DOI: 10.1007/978-3-319-46454-1_29]
Fan H and Ling H B. 2019. Siamese cascaded region proposal networks for real-time visual tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 7944-7953 [DOI: 10.1109/CVPR.2019.00814]
He A F, Luo C, Tian X M and Zeng W J. 2018. Towards a better match in Siamese network based visual object tracker//Proceedings of 2018 European Conference on Computer Vision. Munich, Germany: Springer: 132-147 [DOI: 10.1007/978-3-030-11009-3_7]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Held D, Guillory D, Rebsamen B, Thrun S and Savarese S. 2016. A probabilistic framework for real-time 3D segmentation using spatial, temporal, and semantic cues//Proceedings of 2016 Conference on Robotics: Science and Systems. Cambridge, USA: MIT [DOI: 10.15607/RSS.2016.XII.024]
Henriques J F, Caseiro R, Martins P and Batista J. 2012. Exploiting the circulant structure of tracking-by-detection with kernels//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 702-715 [DOI: 10.1007/978-3-642-33765-9_50]
Henriques J F, Caseiro R, Martins P and Batista J. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 583-596 [DOI: 10.1109/TPAMI.2014.2345390]
Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Čehovin L, Vojíř T, Häger G, Lukežič A, Fernández G, Gupta A, Petrosino A, Memarmoghadam A, Garcia-Martin A, Montero A S, Vedaldi A, Robinson A, Ma A J, Varfolomieiev A, Alatan A, Erdem A, Ghanem B, Liu B, Han B, Martinez B, Chang C M, Xu C S, Sun C, Kim D, Chen D P, Du D W, Mishra D, Yeung D Y, Gundogdu E, Erdem E, Khan F, Porikli F, Zhao F, Bunyak F, Battistone F, Zhu G, Roffo G, Subrahmanyam G R K S, Bastos G, Seetharaman G, Medeiros H, Li H D, Qi H G, Bischof H, Possegger H, Lu H C, Lee H, Nam H, Chang H J, Drummond I, Valmadre J, Jeong J C, Cho J I, Lee J Y, Zhu J K, Feng J Y, Gao J, Choi J Y, Xiao J J, Kim J W, Jeong J, Henriques J F, Lang J, Choi J, Martinez J M, Xing J L, Gao J Y, Palaniappan K, Lebeda K, Gao K, Mikolajczyk K, Qin L, Wang L J, Wen L Y, Bertinetto L, Rapuru M K, Poostchi M, Maresca M, Danelljan M, Mueller M, Zhang M D, Arens M, Valstar M, Tang M, Baek M, Khan M H, Wang N Y, Fan N N, Al-Shakarji N, Miksik O, Akin O, Moallem P, Senna P, Torr P H S, Yuen P C, Huang Q M, Martin-Nieto R, Pelapur R, Bowden R, Laganière R, Stolkin R, Walsh R, Krah S B, Li S K, Zhang S P, Yao S Z, Hadfield S, Melzi S, Lyu S W, Li S Y, Becker S, Golodetz S, Kakanuru S, Choi S, Hu T, Mauthner T, Zhang T Z, Pridmore T, Santopietro V, Hu W M, Li W B, Hübner W, Lan X Y, Wang X M, Li X, Li Y, Demiris Y, Wang Y F, Qi Y K, Yuan Z J, Cai Z X, Xu Z, He Z Y and Chi Z Z. 2016. The visual object tracking VOT2016 challenge results//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 777-823 [DOI: 10.1007/978-3-319-48881-3_54]
Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Zajc L Č, Vojíř T, Bhat G, Lukežič A, Eldesokey A, Fernández G, García-Martín Á, Iglesias-Arias Á, Alatan A A, González-García A, Petrosino A, Memarmoghadam A, Vedaldi A, Muhič A, He A F, Smeulders A, Perera A G, Li B, Chen B Y, Kim C, Xu C S, Xiong C Z, Tian C, Luo C, Sun C, Hao C, Kim D, Mishra D, Chen D M, Wang D, Wee D, Gavves E, Gundogdu E, Velasco-Salido E, Khan F S, Yang F, Zhao F, Li F, Battistone F, De Ath G, Subrahmanyam G R K S, Bastos G, Ling H B, Galoogahi H K, Lee H, Li H J, Zhao H J, Fan H, Zhang H G, Possegger H, Li H Q, Lu H C, Zhi H, Li H Y, Lee H, Chang H J, Drummond I, Valmadre J, Martin J S, Chahl J, Choi J Y, Li J, Wang J Q, Qi J Q, Sung J, Johnander J, Henriques J, Choi J, Van De Weijer J, Herranz J R, Martínez J M, Kittler J, Zhuang J F, Gao J Y, Grm K, Zhang L C, Wang L J, Yang L X, Rout L, Si L, Bertinetto L, Chu L T, Che M Q, Maresca M E, Danelljan M, Yang M H, Abdelpakey M, Shehata M, Kang M, Lee N, Wang N, Miksik O, Moallem P, Vicente-Moñivar P, Senna P, Li P X, Torr P, Raju P M, Qian R H, Wang Q, Zhou Q, Guo Q, Martín-Nieto R, Gorthi R K, Tao R, Bowden R, Everson R, Wang R L, Yun S, Choi S, Vivas S, Bai S, Huang S P, Wu S H, Hadfield S, Wang S W, Golodetz S, Ming T, Xu T Y, Zhang T Z, Fischer T, Santopietro V, Štruc V, Wei W, Zuo W M, Feng W, Wu W, Zou W, Hu W M, Zhou W G, Zeng W J, Zhang X F, Wu X H, Wu X J, Tian X M, Li Y, Lu Y, Law Y W, Wu Y, Demiris Y, Yang Y C, Jiao Y F, Li Y H, Zhang Y H, Sun Y X, Zhang Z, Zhu Z, Feng Z H, Wang Z H and He Z Q. 2018. The sixth visual object tracking VOT2018 challenge results//Proceedings of 2018 European Conference on Computer Vision. Munich, Germany: Springer: 3-53 [DOI: 10.1007/978-3-030-11009-3_1]
Krizhevsky A, Sutskever I and Hinton G E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90 [DOI: 10.1145/3065386]
Lee K H and Hwang J N. 2015. On-road pedestrian tracking across multiple driving recorders. IEEE Transactions on Multimedia, 17(9): 1429-1438 [DOI: 10.1109/TMM.2015.2455418]
Li B, Wu W, Wang Q, Zhang F Y, Xing J L and Yan J J. 2019a. SiamRPN++: evolution of Siamese visual tracking with very deep networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 4277-4286 [DOI: 10.1109/CVPR.2019.00441]
Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018a. High performance visual tracking with Siamese region proposal network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980 [DOI: 10.1109/CVPR.2018.00935]
Li P X, Chen B Y, Ouyang W L, Wang D, Yang X Y and Lu H C. 2019b. GradNet: gradient-guided network for visual object tracking//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 6161-6170 [DOI: 10.1109/ICCV.2019.00626]
Li P X, Wang D, Wang L J and Lu H C. 2018b. Deep visual tracking: review and experimental comparison. Pattern Recognition, 76: 323-338 [DOI: 10.1016/j.patcog.2017.11.007]
Liu L W, Xing J L, Ai H Z and Ruan X. 2012. Hand posture recognition using finger geometric feature//Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012). Tsukuba, Japan: IEEE: 565-568
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Shen J B, Tang X, Dong X P and Shao L. 2020. Visual object tracking by hierarchical attention Siamese network. IEEE Transactions on Cybernetics, 50(7): 3068-3080 [DOI: 10.1109/TCYB.2019.2936503]
Smeulders A W M, Chu D M, Cucchiara R, Calderara S, Dehghan A and Shah M. 2014. Visual tracking: an experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7): 1442-1468 [DOI: 10.1109/TPAMI.2013.230]
Wang Q, Teng Z, Xing J L, Gao J, Hu W M and Maybank S. 2018. Learning attentions: residual attentional Siamese network for high performance online visual tracking//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4854-4863 [DOI: 10.1109/CVPR.2018.00510]
Wang Q, Zhang L, Bertinetto L, Hu W M and Torr P H S. 2019. Fast online object tracking and segmentation: a unifying approach//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1328-1338 [DOI: 10.1109/CVPR.2019.00142]
Wu Y, Lim J and Yang M H. 2015. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1834-1848 [DOI: 10.1109/TPAMI.2014.2388226]
Xing J L, Ai H Z and Lao S H. 2010. Multiple human tracking based on multi-view upper-body detection and discriminative learning//Proceedings of the 20th International Conference on Pattern Recognition. Istanbul, Turkey: IEEE: 1698-1701 [DOI: 10.1109/ICPR.2010.420]
Xu T Y, Feng Z H, Wu X J and Kittler J. 2019. Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE Transactions on Image Processing, 28(11): 5596-5609 [DOI: 10.1109/TIP.2019.2919201]
Zhang G C and Vela P A. 2015. Good features to track for visual SLAM//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 1373-1382 [DOI: 10.1109/CVPR.2015.7298743]
Zhang Z P and Peng H W. 2019. Deeper and wider Siamese networks for real-time visual tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 4586-4595 [DOI: 10.1109/CVPR.2019.00472]
Zhu Z, Wang Q, Li B, Wu W, Yan J J and Hu W M. 2018. Distractor-aware Siamese networks for visual object tracking//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 103-119 [DOI: 10.1007/978-3-030-01240-3_7]