RGBT tracking based on dynamic modal interaction and adaptive feature fusion
2022, Vol. 27, No. 10: 3010-3021
Received: 2021-04-21; Revised: 2021-09-18; Accepted: 2021-09-25; Published in print: 2022-10-16
DOI: 10.11834/jig.210287
Objective
Visible and thermal infrared data are highly complementary, so RGBT (RGB-thermal) tracking has received increasing attention. Traditional RGBT tracking methods simply fuse the features of the two modalities, which limits tracking performance to some extent. This paper proposes a method based on dynamic interaction and fusion that collaboratively learns modality-specific and complementary representations for RGBT tracking.
Method
First, the features of the different modalities interact to generate multi-modal features, and an attention mechanism is applied when learning each modality's specific features to improve their discriminability. Second, rich spatial and semantic information is obtained by fusing multi-modal features from different layers, and a complementary feature learning module is designed to learn the complementary features of the different modalities. Finally, a dynamic weighting loss is proposed that adaptively optimizes the parameters of the whole network according to constraints on the consistency and uncertainty of the prediction results of the two modality-specific branches.
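The abstract describes the complementary feature learning module only at a high level. Below is a minimal PyTorch sketch of one plausible reading of it, assuming a sigmoid gate over globally pooled features; the class name GatedComplementaryFusion, the channel layout, and the residual combination are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedComplementaryFusion(nn.Module):
    """Hypothetical sketch of the gated complementary feature learning
    described in the abstract. Channel sizes and the exact gate form
    are assumptions, not the authors' implementation."""

    def __init__(self, channels: int):
        super().__init__()
        # Gate: takes one modality's specific features concatenated with
        # the fused cross-modality features, outputs a scalar in (0, 1).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(2 * channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_specific, f_cross):
        # f_specific: modality-specific features, shape (B, C, H, W)
        # f_cross:    fused cross-modality features, shape (B, C, H, W)
        g = self.gate(torch.cat([f_specific, f_cross], dim=1))  # (B, 1)
        # Scalar weighting of the cross-modality features, matching the
        # described "gate output value times cross-modality features".
        f_comp = g.view(-1, 1, 1, 1) * f_cross
        # Complementary features enhance the modality-specific branch.
        return f_specific + f_comp


# Element-wise multiplication as the interaction step that suppresses
# clutter noise (one plausible reading of the interaction module).
def modality_interaction(f_rgb, f_tir):
    return f_rgb * f_tir
```

The scalar gate lets the network down-weight the cross-modality features when they conflict with the modality-specific evidence, instead of injecting them unconditionally.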
Result
Experiments are conducted on two benchmark RGBT tracking datasets. On the RGBT234 dataset, the proposed method achieves a precision rate (PR) of 79.2% and a success rate (SR) of 55.8%; on the GTOT (grayscale-thermal object tracking) dataset, it achieves a PR of 86.1% and an SR of 70.9%. Comparative experiments on RGBT234 and GTOT further verify the effectiveness of the algorithm, and the results show that the proposed method improves RGBT tracking performance.
Conclusion
The proposed RGBT tracking algorithm effectively exploits the complementarity between the two modalities and achieves good tracking accuracy.
Objective
Visual target tracking can be applied to computer vision tasks such as video surveillance, unmanned autopilot systems, and human-computer interaction. Thermal infrared cameras have the advantages of a long range of action, strong penetrating ability, and the capacity to reveal hidden objects. As a branch of visual tracking, RGBT (RGB-thermal) tracking aims to estimate the status of the target in a video sequence by aggregating complementary data from two different modalities, given the ground-truth bounding box in the first frame of the sequence. Previous RGBT tracking algorithms are constrained by traditional handcrafted features or are insufficient to explore and utilize the complementary information from different modalities. To exploit the complementary information between the two modalities, we propose a dynamic interaction and fusion method for RGBT tracking.
Method
Generally, RGB images capture the visual appearance information (e.g., colors and textures) of the target, while thermal images acquire temperature information that is robust to lighting conditions and background clutter. To obtain more powerful representations, we can introduce the useful information of the other modality. However, the fusion of different modalities is commonly limited to simple addition or concatenation, which cannot suppress the noisy information contained in the obtained modality features. First, a modality interaction module is designed to suppress clutter noise by means of a multiplication operation. Second, a fusion module is designed to gather the cross-modality features of all layers; it captures different abstractions of the target representation for more accurate localization. Third, a gate-mechanism-guided complementary learning structure computes the complementary features of the different modalities. As the input of the gate, we use the modality-specific features and the cross-modality features obtained from the fusion module; the output of the gate is a numerical value, and the complementary features are obtained by a dot product between this value and the cross-modality features. Finally, a dynamic weighting loss is presented to optimize the parameters of the network adaptively, in terms of constraints on the consistency and uncertainty of the prediction results of the two modality-specific branches. Our method is evaluated on two standard RGBT tracking datasets, GTOT (grayscale-thermal object tracking) and RGBT234. Two evaluation indicators (precision rate and success rate) are used to measure tracking performance. Our model is built with the open-source toolbox PyTorch and optimized with stochastic gradient descent; the implementation runs on a platform with a 4.2 GHz Intel Core i7-7700K CPU and an NVIDIA GeForce GTX 1080Ti GPU.
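The abstract does not give the exact form of the dynamic weighting loss. The following is a minimal PyTorch sketch of one plausible formulation, assuming entropy-based uncertainty weights and a KL-divergence consistency term; the function name dynamic_weighting_loss, the exp(-entropy) weighting, and the 0.1 trade-off factor are all our illustrative assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_weighting_loss(logits_rgb, logits_tir, target):
    """Hypothetical sketch: weight each modality-specific branch by its
    prediction uncertainty and penalize inconsistency between branches.

    logits_rgb, logits_tir: (B, 2) classification scores of each branch
    target:                 (B,)  class labels (target vs. background)
    """
    p_rgb = F.softmax(logits_rgb, dim=1)
    p_tir = F.softmax(logits_tir, dim=1)

    # Uncertainty: a branch with high predictive entropy gets less weight.
    ent_rgb = -(p_rgb * p_rgb.clamp_min(1e-8).log()).sum(dim=1)
    ent_tir = -(p_tir * p_tir.clamp_min(1e-8).log()).sum(dim=1)
    w_rgb = torch.exp(-ent_rgb)
    w_tir = torch.exp(-ent_tir)
    norm = w_rgb + w_tir
    w_rgb, w_tir = w_rgb / norm, w_tir / norm

    # Per-branch classification losses, dynamically reweighted per sample.
    ce_rgb = F.cross_entropy(logits_rgb, target, reduction="none")
    ce_tir = F.cross_entropy(logits_tir, target, reduction="none")
    cls_loss = (w_rgb * ce_rgb + w_tir * ce_tir).mean()

    # Consistency: penalize disagreement between the two branches.
    consistency = F.kl_div(p_rgb.clamp_min(1e-8).log(), p_tir,
                           reduction="batchmean")
    return cls_loss + 0.1 * consistency  # 0.1 is an assumed trade-off
```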
Result
We conduct comparative experiments on the RGBT234 and GTOT datasets. On the GTOT dataset, our method (86.1%, 70.9%) exceeds the baseline tracker (80.6%, 65.6%) by 5.5% in precision rate (PR) and 5.3% in success rate (SR). On the RGBT234 dataset, our method (79.2%, 55.8%) is 7.0% higher in PR and 6.3% higher in SR than the baseline tracker. Compared with the second-best tracking method on the RGBT234 dataset, our method is 2.6% higher than DAPNet (76.6%) in PR and 2.1% higher than DAPNet (53.7%) in SR. We also conduct component analysis experiments on the two datasets; the results show that each module improves tracking performance.
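For reference, PR and SR in these benchmarks follow standard conventions: PR is the fraction of frames whose predicted center lies within a pixel threshold of the ground truth (commonly 20 px for RGBT234 and 5 px for GTOT), and SR is the area under the success-plot curve over overlap thresholds. A minimal NumPy sketch of these definitions (the function names and the 21-point threshold grid are our choices):

```python
import numpy as np

def precision_rate(pred_centers, gt_centers, threshold=20.0):
    """PR: fraction of frames whose predicted center is within `threshold`
    pixels of the ground-truth center; both inputs have shape (N, 2)."""
    dist = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return (dist <= threshold).mean()

def success_rate(pred_boxes, gt_boxes):
    """SR: area under the success-plot curve, i.e. the mean fraction of
    frames whose IoU exceeds each overlap threshold; boxes are (x, y, w, h)."""
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2],
                    gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3],
                    gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = (pred_boxes[:, 2] * pred_boxes[:, 3]
             + gt_boxes[:, 2] * gt_boxes[:, 3] - inter)
    iou = inter / np.maximum(union, 1e-8)
    thresholds = np.linspace(0, 1, 21)
    return np.mean([(iou > t).mean() for t in thresholds])
```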
Conclusion
Our RGBT tracking algorithm obtains rich semantic and spatial information through the modality interaction and fusion modules, and uses a gate mechanism to explore the complementarity between the different modalities. The dynamic weighting loss adaptively optimizes the parameters of the model in accordance with the constraints on the prediction results of the two modality-specific branches.
References
Chen Y P, Kalantidis Y, Li J S, Yan S C and Feng J S. 2018. A²-nets: double attention networks [EB/OL]. [2021-04-21]. https://arxiv.org/pdf/1810.11579.pdf
Cheng Y H, Cai R, Li Z W, Zhao X and Huang K Q. 2017. Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1475-1483 [DOI: 10.1109/CVPR.2017.161]
Danelljan M, Khan F S, Felsberg M and Van de Weijer J. 2014. Adaptive color attributes for real-time visual tracking//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1090-1097 [DOI: 10.1109/CVPR.2014.143]
Danelljan M, Robinson A, Khan F S and Felsberg M. 2016. Beyond correlation filters: learning continuous convolution operators for visual tracking//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 472-488 [DOI: 10.1007/978-3-319-46454-1_29]
Fu J, Liu J, Tian H J, Li Y, Bao Y J, Fang Z W and Lu H Q. 2019. Dual attention network for scene segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3141-3149 [DOI: 10.1109/CVPR.2019.00326]
Gao Y, Li C L, Zhu Y B, Tang J, He T and Wang F T. 2019. Deep adaptive fusion network for high performance RGBT tracking//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop. Seoul, Korea (South): IEEE: 91-99 [DOI: 10.1109/ICCVW.2019.00017]
Hare S, Golodetz S, Saffari A, Vineet V, Cheng M M, Hicks S L and Torr P H S. 2016. Struck: structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10): 2096-2109 [DOI: 10.1109/TPAMI.2015.2509974]
Henriques J F, Caseiro R, Martins P and Batista J. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 583-596 [DOI: 10.1109/TPAMI.2014.2345390]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Jung I, Son J, Baek M and Han B. 2018. Real-time MDNet//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 89-104 [DOI: 10.1007/978-3-030-01225-0_6]
Kim H U, Lee D Y, Sim J Y and Kim C S. 2015. SOWP: spatially ordered and weighted patch descriptor for visual tracking//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 3011-3019 [DOI: 10.1109/ICCV.2015.345]
Lan X Y, Ye M, Zhang S P and Yuen P. 2018. Robust collaborative discriminative learning for RGB-infrared tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1): 7008-7015
Li C L, Cheng H, Hu S Y, Liu X B, Tang J and Lin L. 2016. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12): 5743-5756 [DOI: 10.1109/TIP.2016.2614135]
Li C L, Liang X Y, Lu Y J, Zhao N and Tang J. 2019. RGB-T object tracking: benchmark and baseline [EB/OL]. [2021-04-21]. https://arxiv.org/pdf/1805.08982.pdf
Li C L, Lin L, Zuo W M, Tang J and Yang M H. 2019a. Visual tracking via dynamic graph learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11): 2770-2782 [DOI: 10.1109/TPAMI.2018.2864965]
Li C L, Liu L, Lu A D, Ji Q and Tang J. 2020a. Challenge-aware RGBT tracking//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 222-237 [DOI: 10.1007/978-3-030-58542-6_14]
Li C L, Lu A D, Zheng A H, Tu Z Z and Tang J. 2019b. Multi-adapter RGBT tracking//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision Workshop. Seoul, Korea (South): IEEE: 2262-2270 [DOI: 10.1109/ICCVW.2019.00279]
Li C L, Wu X H, Zhao N, Cao X C and Tang J. 2018b. Fusing two-stream convolutional neural networks for RGB-T object tracking. Neurocomputing, 281: 78-85 [DOI: 10.1016/j.neucom.2017.11.068]
Li C L, Zhao N, Lu Y J, Zhu C L and Tang J. 2017. Weighted sparse representation regularized graph learning for RGB-T object tracking//Proceedings of the 25th ACM International Conference on Multimedia. California, USA: ACM: 1856-1864 [DOI: 10.1145/3123266.3123289]
Li C L, Zhu C L, Huang Y, Tang J and Wang L. 2018c. Cross-modal ranking with soft consistency and noisy labels for robust RGB-T tracking//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 831-847 [DOI: 10.1007/978-3-030-01261-8_49]
Li X T, Zhao H L, Han L, Tong Y H and Yang K Y. 2020b. GFF: gated fully fusion for semantic segmentation [EB/OL]. [2021-04-21]. https://arxiv.org/pdf/1904.01803.pdf
Liu H P and Sun F C. 2012. Fusion tracking in color and infrared images using joint sparse representation. Science China Information Sciences, 55(3): 590-599 [DOI: 10.1007/s11432-011-4536-9]
Lukežic A, Vojír T, Zajc L C, Matas J and Kristan M. 2017. Discriminative correlation filter with channel and spatial reliability//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4847-4856 [DOI: 10.1109/CVPR.2017.515]
Nam H and Han B. 2016. Learning multi-domain convolutional neural networks for visual tracking//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4293-4302 [DOI: 10.1109/CVPR.2016.465]
Ruan W J, Chen J, Wu Y, Wang J Q, Liang C, Hu R M and Jiang J J. 2019. Multi-correlation filters with triangle-structure constraints for object tracking. IEEE Transactions on Multimedia, 21(5): 1122-1134 [DOI: 10.1109/TMM.2018.2872897]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2021-04-21]. https://arxiv.org/pdf/1409.1556.pdf
Tu Z Z, Lin C, Li C L, Tang J and Luo B. 2020. M5L: multi-modal multi-margin metric learning for RGBT tracking [EB/OL]. [2021-04-21]. https://arxiv.org/pdf/2003.07650.pdf
Valmadre J, Bertinetto L, Henriques J, Vedaldi A and Torr P H S. 2017. End-to-end representation learning for correlation filter based tracking//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5000-5008 [DOI: 10.1109/CVPR.2017.531]
Wang Q L, Wu B G, Zhu P F, Li P H, Zuo W M and Hu Q H. 2020. ECA-net: efficient channel attention for deep convolutional neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11531-11539 [DOI: 10.1109/CVPR42600.2020.01155]
Wu Y, Blasch E, Chen G S, Bai L and Ling H B. 2011. Multiple source data fusion via sparse representation for robust visual tracking//Proceedings of the 14th International Conference on Information Fusion. Chicago, USA: IEEE: 1-8
Xu N W, Xu G, Zhang X C and Bavirisetti D P. 2018. Relative object tracking algorithm based on convolutional neural network for visible and infrared video sequences//Proceedings of the 4th International Conference on Virtual Reality. Hong Kong, China: ACM: 44-49 [DOI: 10.1145/3198910.3198918]
Xu Q, Mei Y M, Liu J P and Li C L. 2022. Multimodal cross-layer bilinear pooling for RGBT tracking. IEEE Transactions on Multimedia, 24: 567-580 [DOI: 10.1109/TMM.2021.3055362]
Yao R, Xia S X, Zhang Z and Zhang Y N. 2017. Real-time correlation filter tracking by efficient dense belief propagation with structure preserving. IEEE Transactions on Multimedia, 19(4): 772-784 [DOI: 10.1109/TMM.2016.2631727]
Yuan Y, Yang H, Fang Y M and Lin W S. 2015. Visual object tracking by structure complexity coefficients. IEEE Transactions on Multimedia, 17(8): 1125-1136 [DOI: 10.1109/TMM.2015.2440996]
Zhang J M, Ma S G and Sclaroff S. 2014. MEEM: robust tracking via multiple experts using entropy minimization//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 188-203 [DOI: 10.1007/978-3-319-10599-4_13]
Zhang S L, Yu X, Sui Y, Zhao S C and Zhang L. 2015. Object tracking with multi-view support vector machines. IEEE Transactions on Multimedia, 17(3): 265-278 [DOI: 10.1109/TMM.2015.2390044]
Zhang X M, Zhang X H, Du X D, Zhou X M and Yin J. 2018. Learning multi-domain convolutional network for RGB-T visual tracking//Proceedings of the 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics. Beijing, China: IEEE: 1-6 [DOI: 10.1109/CISP-BMEI.2018.8633180]
Zhang Z P and Peng H W. 2019. Deeper and wider siamese networks for real-time visual tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4586-4595 [DOI: 10.1109/CVPR.2019.00472]
Zhong W, Lu H C and Yang M H. 2012. Robust object tracking via sparsity-based collaborative model//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 1838-1845 [DOI: 10.1109/CVPR.2012.6247882]
Zhu Y B, Li C L, Tang J and Luo B. 2020. Quality-aware feature aggregation network for robust RGBT tracking. IEEE Transactions on Intelligent Vehicles, 99: #2980735 [DOI: 10.1109/TIV.2020.2980735]
Zhu Y B, Li C L, Luo B, Tang J and Wang X. 2019. Dense feature aggregation and pruning for RGBT tracking//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 465-472 [DOI: 10.1145/3343031.3350928]