Object tracking using enhanced second-order network modulation
Vol. 26, Issue 3, Pages 516-526 (2021)
Received: 27 April 2020
Revised: 03 June 2020
Accepted: 10 June 2020
Published: 16 March 2021
DOI: 10.11834/jig.200145
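Objective
Appearance models play a decisive role in the performance of visual object tracking. Tracking algorithms based on network modulation build an efficient subnetwork that learns the appearance of the target in the reference frame and uses it for robust matching of the target in test frames; they perform well on several object tracking datasets. However, these algorithms overlook the importance of high-order information for robustly modeling object appearance, so they are prone to tracking drift when the object's appearance changes at large scale. This paper therefore proposes a second-order pooling modulation subnetwork enhanced by global context information, which learns high-order features to improve tracker performance.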
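Method
First, a convolutional neural network (CNN) extracts features from the reference frame and the test frame. Then, long short-term memory (LSTM) networks running in different directions capture the global context information of each pixel in the extracted features, and a second-order pooling network extracts high-order information. Finally, a modulation mechanism guides the test frame to learn the optimal intersection-over-union prediction. In addition, to improve the stability of the tracker, online tracking adaptively updates the object appearance features with an exponentially weighted average.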
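The exponentially weighted averaging update mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `ema_update`, the update rate `lr`, and the flat-list feature representation are all assumptions made for the example, since the source does not state them.

```python
from typing import List


def ema_update(model: List[float], observation: List[float], lr: float = 0.1) -> List[float]:
    """Blend the stored appearance feature with the newest observation.

    lr is a hypothetical update rate; the source does not give its value.
    """
    if len(model) != len(observation):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= lr <= 1.0:
        raise ValueError("lr must lie in [0, 1]")
    return [(1.0 - lr) * m + lr * o for m, o in zip(model, observation)]


# Toy usage: the stored appearance drifts toward the new observations.
appearance = [1.0, 0.0]
for frame_feature in ([0.0, 1.0], [0.0, 1.0]):
    appearance = ema_update(appearance, frame_feature, lr=0.5)
```

Each update multiplies the contribution of older frames by (1 - lr), so the weight of a past frame decays exponentially with its age. This fits the assumption that the object's appearance changes smoothly between frames.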
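Result
Experiments show that on the OTB100 (object tracking benchmark) dataset the proposed method achieves a success rate of 67.9%, exceeding the ATOM (accurate tracking by overlap maximization) tracker by 1.5%; on the VOT (visual object tracking) 2018 dataset its expected average overlap (EAO) is 0.44, exceeding ATOM by 4%.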
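Conclusion
By building a second-order pooling modulation subnetwork enhanced by global context information, this paper learns an effective appearance model and brings the tracker to state-of-the-art performance.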
Objective
An appearance model plays a key role in the performance of visual object tracking. In recent years, tracking algorithms based on network modulation have learned an appearance model by building an effective subnetwork, and thus they can match the target in the search frames more robustly. These algorithms exhibit excellent performance on many object tracking benchmarks. However, they disregard the importance of high-order feature information, which causes drift when the target appearance undergoes large-scale changes. This study uses a global contextual attention-enhanced second-order network to model target appearance. This network helps enhance the nonlinear modeling capability in visual tracking.
Method
The tracker has two components, target estimation and classification, and can be regarded as a two-stage tracker. Compared with methods based on Siamese networks, this method is relatively slow. The target estimation component is trained off-line to predict the overlap between the target and the estimated bounding boxes. The tracker presents an effective network architecture for visual tracking that includes two novel module designs. The first design, pixel-wise global contextual attention (pGCA), leverages bidirectional long short-term memory (Bi-LSTM) to sweep row-wise and column-wise across feature maps and fully capture the global context information of each pixel. The second design, second-order pooling modulation (SPM), uses the feature covariance matrix of the template frame to learn a second-order modulation vector. The modulation vector then multiplies the intermediate feature maps of the query image channel-wise to transfer target-specific information from the template frame to the query frame. This study selects the widely adopted ResNet-50, pretrained on the ImageNet classification task, as the backbone network. Given the input template image X_0 with bounding box b_0 and the query image X, this study selects the feature maps of the third and fourth layers for subsequent processing. The feature maps are fed into the pGCA module and the precise region of interest pooling (PrPool) module, which are used to obtain the features of the annotated area. These features are then concatenated to yield multi-scale features enhanced by global context information. Moreover, to handle the misaligned features caused by large-scale deformation between the query and template images, the tracker injects two deformable convolution blocks into the bottom branch for feature alignment. The fused feature is then passed through two SPM branches, generating two modulation vectors that multiply the corresponding feature layers on the bottom branch of the search frame channel-wise. Fusing features through network modulation, rather than through the correlation used in Siamese networks, is more helpful to the performance of the tracker. Thereafter, the modulated features are fed into two PrPool layers and then concatenated. The output features are finally fed into the intersection over union (IoU) predictor module, which is composed of three fully connected layers. Given the annotated ground truth, the tracker minimizes the estimation error to train all network parameters in an end-to-end manner. The classification component is a two-layer fully convolutional neural network. In contrast with the estimation component, it is trained online to predict a target confidence score. Thus, this component can provide a rough 2D location of the object. During online learning, the objective function is optimized using the conjugate gradient method instead of stochastic gradient descent to support real-time tracking. For robustness, this study uses an averaging strategy, which has been widely used in discriminative correlation filters, to update the object appearance in this component. This strategy assumes that the appearance of the object changes smoothly and consistently across successive frames, and it can fully utilize the information of the previous frame. In the overall tracking process, the classification component first produces a response map with dimensions of 14×14×1, which gives a rough location of the target. The tracker can distinguish the specific foreground and background in accordance with the response map. Gaussian sampling is then used to obtain several predicted target bounding boxes. The predicted bounding boxes are fed to the estimation component, which has been trained off-line, and the box with the highest estimation score is taken as the tracking result.
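The coarse-to-fine flow at the end of the paragraph above (peak of the response map, Gaussian sampling of candidate boxes, selection by the highest estimation score) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names, the `stride` value, the sampling spread `sigma`, and the number of candidates are hypothetical, and the scorer is a stand-in that uses IoU with a known toy box because the off-line-trained estimation network is not available here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def peak_location(response: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (row, col) of the largest value in a 2D response map."""
    cells = ((r, c) for r in range(len(response)) for c in range(len(response[r])))
    return max(cells, key=lambda rc: response[rc[0]][rc[1]])


def sample_boxes(rough: Box, n: int, sigma: float, rng: random.Random) -> List[Box]:
    """Draw n candidates by Gaussian perturbation of a rough box."""
    cx, cy, w, h = rough
    boxes = [rough]  # keep the unperturbed box as the first candidate
    while len(boxes) < n:
        boxes.append((
            cx + rng.gauss(0.0, sigma * w),
            cy + rng.gauss(0.0, sigma * h),
            max(1.0, w * (1.0 + rng.gauss(0.0, sigma))),
            max(1.0, h * (1.0 + rng.gauss(0.0, sigma))),
        ))
    return boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two center-format boxes."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def select_box(boxes: List[Box], score_fn: Callable[[Box], float]) -> Box:
    """Return the candidate with the highest estimated score."""
    return max(boxes, key=score_fn)


# Toy run: a 14x14 response map with a single peak.
response = [[0.0] * 14 for _ in range(14)]
response[5][9] = 1.0
row, col = peak_location(response)

stride = 16.0  # hypothetical pixels per response-map cell
rough_box = ((col + 0.5) * stride, (row + 0.5) * stride, 40.0, 60.0)
candidates = sample_boxes(rough_box, n=10, sigma=0.1, rng=random.Random(0))

# Stand-in scorer: IoU with a known toy box, used only because the real
# estimation network is not part of this sketch.
toy_target: Box = (150.0, 90.0, 42.0, 58.0)
best = select_box(candidates, lambda box: iou(box, toy_target))
```

Passing the scorer as a function keeps the two stages separate: the classification stage only proposes where to look, and any overlap predictor can be plugged in to choose among the sampled boxes.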
Result
The effectiveness and robustness of the proposed method are validated on the OTB100 (object tracking benchmark) and the challenging VOT2018 (visual object tracking) datasets. On OTB100, the proposed method achieves the best performance in terms of success plots and precision plots, with an area under the curve (AUC) score of 67.9% and a precision score of 87.9%, outperforming the state-of-the-art ATOM (accurate tracking by overlap maximization) by 1.5% in terms of AUC score. On VOT2018, the expected average overlap (EAO) score of the proposed method ranks first at 0.4411, outperforming the second-best method, ATOM, which has an EAO score of 0.4011, by 0.04 in absolute terms.
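The reported EAO gap can be read either as an absolute difference or as a gain relative to ATOM's score. The short calculation below uses only the two EAO values quoted above; the variable names are chosen for this illustration.

```python
eao_proposed = 0.4411  # EAO of the proposed tracker on VOT2018
eao_atom = 0.4011      # EAO of ATOM on VOT2018

absolute_gain = eao_proposed - eao_atom   # difference in EAO units
relative_gain = absolute_gain / eao_atom  # gain as a fraction of ATOM's EAO

print(f"absolute gain: {absolute_gain:.4f}")  # absolute gain: 0.0400
print(f"relative gain: {relative_gain:.1%}")  # relative gain: 10.0%
```

The 4% figure therefore corresponds to the absolute difference of 0.04, while the gain relative to ATOM's score is roughly 10%.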
Conclusion
This study proposes a visual tracker that uses network modulation and includes pGCA and SPM modules. The pGCA module leverages Bi-LSTM to capture the global context information of each pixel. The SPM module uses the feature covariance matrix of the template frame to learn a second-order modulation vector that models target appearance; it reduces the information loss of the first frame and enhances the correlation between features. For robustness, the tracker uses an averaging strategy to update the object appearance in the classification component. As a result, the proposed tracker significantly outperforms state-of-the-art methods in terms of accuracy and efficiency.
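The core idea of second-order pooling modulation (template covariance matrix, then a per-channel modulation vector, then channel-wise multiplication of the query features) can be sketched in plain Python. This is a simplified illustration under loud assumptions: in the tracker the mapping from covariance matrix to modulation vector is learned, whereas here it is replaced by a fixed row mean followed by a sigmoid, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # C channels x N spatial positions


def covariance(features: Matrix) -> Matrix:
    """Channel covariance (C x C) of a C x N feature matrix."""
    n = len(features[0])
    centered = []
    for row in features:
        mean = sum(row) / n
        centered.append([v - mean for v in row])
    return [[sum(a * b for a, b in zip(ri, rj)) / n for rj in centered]
            for ri in centered]


def modulation_vector(cov: Matrix) -> List[float]:
    """Map each covariance row to one gate in (0, 1).

    Assumption: a fixed row mean plus sigmoid stands in for learned layers.
    """
    return [1.0 / (1.0 + math.exp(-sum(row) / len(row))) for row in cov]


def modulate(query: Matrix, gates: List[float]) -> Matrix:
    """Scale every channel of the query features by its gate."""
    if len(query) != len(gates):
        raise ValueError("one gate per query channel is required")
    return [[g * v for v in row] for g, row in zip(gates, query)]


# Toy usage: 2 channels, 3 template positions, 2 query positions.
template = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
query = [[1.0, 1.0], [1.0, 1.0]]

cov = covariance(template)
gates = modulation_vector(cov)
modulated = modulate(query, gates)
```

Because the gates are computed from pairwise channel statistics of the template rather than from per-channel averages, each query channel is rescaled according to how it co-varies with the other channels in the target region.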