Object tracking using enhanced second-order network modulation

Xianhai Wang; Huihui Song; Kaihua Zhang; Qingshan Liu

doi:10.11834/jig.200145

Image Analysis and Recognition

Vol. 26, Issue 3, Pages: 516-526 (2021)
Received: 27 April 2020
Revised: 03 June 2020
Accepted: 10 June 2020
Published: 16 March 2021

Xianhai Wang, Huihui Song, Kaihua Zhang, Qingshan Liu. Object tracking using enhanced second-order network modulation[J]. Journal of Image and Graphics, 2021, 26(3): 516-526. DOI: 10.11834/jig.200145.

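Abstract (translated from the Chinese)

Objective

The appearance model plays a decisive role in the performance of visual object tracking. Tracking algorithms based on network modulation build an efficient subnetwork that learns the appearance of the target in the reference frame and uses it to match the target robustly in the test frame; these algorithms perform well on several object tracking datasets. However, they overlook the importance of high-order information for robust modeling of object appearance, so they tend to drift when the appearance of the object changes on a large scale. To address this, this paper proposes a second-order pooling modulation subnetwork enhanced by global context information, which learns high-order features to improve tracker performance.

Method

First, a convolutional neural network (CNN) extracts features from the reference frame and the test frame. Next, long short-term memory (LSTM) networks running in different directions capture the global context information of each pixel in the extracted features, and a second-order pooling network then extracts high-order information. Finally, a modulation mechanism guides the test frame to learn the optimal intersection over union (IoU) prediction. In addition, to improve the stability of the tracker, online tracking adaptively updates the object appearance features with an exponentially weighted average.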
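The exponentially weighted average update mentioned above can be illustrated with a minimal sketch. The function name `ema_update` and the rate `lr` are hypothetical; the abstract does not state the actual update rate or which features are averaged, so this only shows the general form of such an update.

```python
import numpy as np

def ema_update(model_feat, new_feat, lr=0.01):
    """Blend the stored appearance feature with the newest observation.

    lr is an assumed update rate; the source does not give its value.
    """
    return (1.0 - lr) * model_feat + lr * new_feat

# Toy example: a 4-dimensional appearance feature updated over three frames.
model = np.zeros(4)
for frame_feat in (np.ones(4), np.ones(4), np.ones(4)):
    model = ema_update(model, frame_feat, lr=0.5)
```

A small `lr` makes the stored appearance change slowly, which matches the assumption that the object appearance varies smoothly between consecutive frames.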

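Result

Experiments show that on the OTB100 (object tracking benchmark) dataset the proposed method achieves a success rate of 67.9%, exceeding the ATOM (accurate tracking by overlap maximization) tracker by 1.5%; on the VOT (visual object tracking) 2018 dataset its expected average overlap (EAO) is 0.44, exceeding ATOM by 4%.

Conclusion

By building a second-order pooling modulation subnetwork enhanced by global context information, this paper learns an efficient appearance model and brings the tracker to state-of-the-art performance.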

Abstract

Objective

An appearance model plays a key role in the performance of visual object tracking. In recent years, tracking algorithms based on network modulation have learned an appearance model by building an effective subnetwork, and thus they can match the target in the search frames more robustly. These algorithms exhibit excellent performance on many object tracking benchmarks. However, these tracking methods disregard the importance of high-order feature information, which causes drift when the target appearance changes on a large scale. This study uses a global contextual attention-enhanced second-order network to model target appearance. This network helps enhance the nonlinear modeling capability in visual tracking.

Method

The tracker has two components, a target estimation component and a classification component, so it can be regarded as a two-stage tracker. Compared with methods based on Siamese networks, this method is relatively slow. The target estimation component is trained off-line to predict the overlap between the target and the estimated bounding boxes. The tracker uses a network architecture for visual tracking that includes two novel module designs. The first design, pixel-wise global contextual attention (pGCA), leverages bidirectional long short-term memory (Bi-LSTM) to sweep row-wise and column-wise across the feature maps and fully capture the global context information of each pixel. The second design, second-order pooling modulation (SPM), uses the feature covariance matrix of the template frame to learn a second-order modulation vector. The modulation vector then multiplies the intermediate feature maps of the query image channel-wise to transfer target-specific information from the template frame to the query frame. In addition, this study selects the widely adopted ResNet-50, pretrained on the ImageNet classification task, as the backbone network. Given the input template image with its bounding box and the query image, this study selects the feature maps of the third and fourth layers for subsequent processing. The feature maps are fed into the pGCA module and the precise region of interest pooling (PrPool) module, which are used to obtain the features of the annotated area. The resulting maps are then concatenated to yield multi-scale features enhanced by global context information. Moreover, to handle the misaligned features caused by large-scale deformation between the query and template images, the tracker injects two deformable convolution blocks into the bottom branch for feature alignment. The fused feature is then passed through two SPM branches, generating two modulation vectors that multiply the corresponding feature layers on the bottom branch of the search frame channel-wise. Fusing the features through network modulation, rather than through the correlation used in Siamese networks, is more helpful to tracker performance. Thereafter, the modulated features are fed into two PrPool layers and then concatenated. The output features are finally fed into the intersection over union (IoU) predictor module, which is composed of three fully connected layers. Given the annotated ground truth, the tracker minimizes the estimation error to train all the network parameters in an end-to-end manner. The classification component is a two-layer fully convolutional neural network. In contrast with the estimation component, it is trained online to predict a target confidence score. Thus, this component can provide a rough 2D location of the object. During online learning, the objective function is optimized using the conjugate gradient method instead of stochastic gradient descent for real-time tracking. For robustness, this study uses an averaging strategy, which has been widely used in discriminative correlation filters, to update object appearance in this component. This strategy assumes that the appearance of the object changes smoothly and consistently across successive frames. At the same time, the strategy can fully use the information of the previous frame. The overall tracking process uses the classification component to obtain a rough location of the target in the form of a response map with dimensions of 14×14×1. The tracker can distinguish the specific foreground from the background in accordance with the response map. Gaussian sampling is then used to obtain several predicted target bounding boxes. The estimation component is trained off-line before any predicted bounding box is selected as the tracking result. The predicted bounding boxes are fed to the estimation component, and the box with the highest score in the estimation component is taken as the tracking result.
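To make the second-order pooling modulation step more concrete, the following is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes the template features are already pooled into a (C, N) matrix, and it maps each row of the C×C covariance matrix to one scalar with a weighted sum followed by a sigmoid; the parameters `w` and `b`, the sigmoid, and all function names are hypothetical stand-ins for the learned layers, which the abstract does not specify.

```python
import numpy as np

def covariance(feat):
    """Channel covariance of template features with shape (C, N)."""
    centered = feat - feat.mean(axis=1, keepdims=True)
    return centered @ centered.T / feat.shape[1]

def modulation_vector(cov, w, b):
    """Map each covariance row to one scalar in (0, 1).

    w (C, C) and b (C,) are assumed learned parameters; the row-wise
    weighted sum plus sigmoid is an illustrative choice only.
    """
    logits = (w * cov).sum(axis=1) + b
    return 1.0 / (1.0 + np.exp(-logits))

def modulate(query_feat, vec):
    """Multiply query features (C, H, W) channel-wise by vec (C,)."""
    return query_feat * vec[:, None, None]

rng = np.random.default_rng(0)
template = rng.standard_normal((8, 25))    # 8 channels, 5x5 pooled positions
query = rng.standard_normal((8, 14, 14))   # 8 channels, 14x14 feature map
w = 0.1 * rng.standard_normal((8, 8))
b = np.zeros(8)

cov = covariance(template)
vec = modulation_vector(cov, w, b)
out = modulate(query, vec)
```

The covariance captures pairwise channel interactions of the template, which is the second-order information; the resulting vector only rescales query channels, so the spatial layout of the query feature map is unchanged.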

Result

The effectiveness and robustness of the proposed method are validated on the OTB100 (object tracking benchmark) and the challenging VOT2018 (visual object tracking) datasets. On OTB100, the proposed method achieves the best performance in terms of success plots and precision plots, with an area under the curve (AUC) score of 67.9% and a precision score of 87.9%, outperforming the state-of-the-art ATOM (accurate tracking by overlap maximization) by 1.5% in terms of AUC score. On VOT2018, the expected average overlap (EAO) score of our method ranks first at 0.4411, significantly outperforming the second best-performing method, ATOM, which has an EAO score of 0.4011, by 4 percentage points.
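For readers unfamiliar with the success-plot AUC reported here, the sketch below shows one common way to compute it from per-frame overlaps. It assumes boxes in (x, y, width, height) form and 21 evenly spaced overlap thresholds from 0 to 1, which is a typical benchmark convention rather than a detail stated in this abstract; the function names are illustrative.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_auc(overlaps, num_thresholds=21):
    """Mean fraction of frames whose overlap exceeds each threshold."""
    overlaps = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = [(overlaps > t).mean() for t in thresholds]
    return float(np.mean(success))

# Toy sequence: predicted boxes versus ground-truth boxes for three frames.
predictions = [(0, 0, 10, 10), (5, 0, 10, 10), (30, 30, 5, 5)]
ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
overlaps = [iou(p, g) for p, g in zip(predictions, ground_truth)]
auc = success_auc(overlaps)
```

The success rate at a single threshold depends heavily on where that threshold is set, which is why the area under the whole curve is the figure usually reported.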

Conclusion

This study proposes a visual tracker that uses network modulation and includes pGCA and SPM modules. The pGCA module leverages Bi-LSTM to capture the global context information of each pixel. The SPM module uses the feature covariance matrix of the template frame to learn a second-order modulation vector that models target appearance; it reduces the information loss of the first frame and enhances the correlation between features. For robustness, the tracker uses an averaging strategy to update object appearance in the classification component. As a result, the proposed tracker significantly outperforms state-of-the-art methods in terms of accuracy and efficiency.


