An efficient Transformer-based object-capturing video annotation method
Vol. 28, Issue 10, Pages: 3176-3190 (2023)
Published: 16 October 2023
DOI: 10.11834/jig.220823
Zhao Jie, Yuan Yongsheng, Zhang Pengyu, Wang Dong. 2023. Lightweight Transformer-based data annotation algorithm for object tracking (in Chinese). Journal of Image and Graphics, 28(10): 3176-3190
Zhao Jie, Yuan Yongsheng, Zhang Pengyu, Wang Dong. 2023. An efficient Transformer-based object-capturing video annotation method. Journal of Image and Graphics, 28(10):3176-3190
Objective
Deep-model-based tracking algorithms usually require large-scale, high-quality annotated training datasets, whereas manually annotating video data frame by frame costs a great deal of labor and time. This paper proposes a lightweight Transformer-based video annotation algorithm, the Transformer-based label network (TLNet), to achieve efficient frame-by-frame annotation of large-scale, sparsely labeled video datasets.
Method
The algorithm uses a Transformer model to process sequential object appearance and motion information and to fuse the forward and backward tracking results. A quality evaluation sub-network screens out frames where tracking has failed so that they can be annotated manually, while a regression sub-network refines the initial annotations of the remaining frames and outputs more precise bounding boxes. The algorithm generalizes well: it is decoupled from any specific tracking algorithm and can employ any existing lightweight tracker to achieve efficient automatic video annotation.
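A minimal sketch of this three-stage annotation strategy, assuming a generic track function standing in for the high-speed tracker and a score_and_refine callable standing in for the two sub-networks; the names, signatures, and the 0.5 threshold are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def annotate_video(
    frames: Sequence,                                    # decoded video frames
    sparse_labels: dict,                                 # {frame_index: Box}, manual sparse labels
    track: Callable[[Sequence, Box], List[Box]],         # any high-speed tracker
    score_and_refine: Callable[..., Tuple[float, Box]],  # TLNet-style evaluation + regression
    quality_threshold: float = 0.5,                      # assumed cut-off for "failed" frames
):
    first, last = min(sparse_labels), max(sparse_labels)
    segment = frames[first:last + 1]
    # Step 1: forward and backward tracking, initialized from the sparse manual labels.
    forward = track(segment, sparse_labels[first])
    backward = track(segment[::-1], sparse_labels[last])[::-1]

    annotations, manual_queue = dict(sparse_labels), []
    for i, frame in enumerate(segment, start=first):
        if i in sparse_labels:                           # keep existing manual labels untouched
            continue
        # Step 2: frame-level quality evaluation on the fused bidirectional results.
        score, refined_box = score_and_refine(frame, forward[i - first], backward[i - first])
        if score < quality_threshold:
            manual_queue.append(i)                       # failed frame, send back to annotators
        else:
            annotations[i] = refined_box                 # Step 3: keep the refined bounding box
    return annotations, manual_queue
```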
Result
Annotations are generated on two large-scale tracking datasets. For the LaSOT (large-scale single object tracking) dataset, the automatic annotation process takes only about 43 hours, and the mean intersection over union (mIoU) with the ground-truth annotations increases from 0.824 to 0.871. For the TrackingNet dataset, three tracking algorithms are retrained with our automatic annotations and evaluated on three datasets; the models trained with our annotations outperform those trained with the original TrackingNet annotations.
Conclusion
The proposed TLNet mines sequential object appearance and motion information, performs frame-level quality evaluation of the forward and backward tracking results, and further refines the bounding boxes. The method is decoupled from any specific tracking algorithm, generalizes well, saves more than 90% of the manual annotation cost, and efficiently generates high-quality video annotations.
Objective
Robust tracking methods nowadays benefit from deep models, which in turn depend on large-scale, high-quality annotated video datasets during training. However, manually annotating videos frame by frame is labor-intensive and costly. Existing video annotation methods usually interpolate the labels of sparsely annotated datasets using geometric information or off-the-shelf trackers. A shortcoming of these methods is that they lack a quality evaluation mechanism to filter out noisy annotations, which makes the generated annotations unreliable. In addition, some interactive annotation tools still require heavy human involvement, which complicates the annotation process. The end-to-end deep model VASR (video annotation via selection and refinement) can generate reliable annotations by selecting and refining tracking results. However, its training data consist of intermediate tracking results and object masks, so it is tightly coupled to specific tracking methods, and its annotation pipeline is complicated and time consuming. To meet the demand of visual tracking methods for large-scale annotated video datasets, simplify the annotation process, and reduce the computational cost, we develop an efficient end-to-end model, the Transformer-based label network (TLNet), which automatically generates dense annotations for sparsely labeled tracking datasets. Because the Transformer model is well suited to sequential information, we adopt it to fuse the bidirectional, sequential input features.
Method
In this study, an efficient video annotation method is developed to generate reliable annotations quickly. The annotation strategy consists of three steps. First, high-speed trackers perform forward and backward tracking. Second, the resulting initial annotations are evaluated at the frame level, and noisy frames are filtered out for manual annotation. Third, the annotations of the remaining frames are refined to produce more precise bounding boxes. To realize the evaluation and refinement functions, we design an efficient Transformer-based model called TLNet. The model consists of three modules: a feature extractor, a feature fusion module, and prediction heads. To extract the visual features of the object, a pixel-wise cross-correlation operation is applied between the template and the search region. A Transformer is then introduced to process the sequential information, including object appearance and motion cues, after the visual and motion features are combined; it is also used to fuse the bidirectional tracking results, i.e., forward and backward tracking. The model contains two sub-networks: a quality evaluation network and a regression network. The former produces a confidence score for each frame's initial annotation so that failed frames can be filtered out and returned to human annotators; the latter refines the initial annotations of the remaining frames and outputs more precise bounding boxes. Our method generalizes well because only video frames and bidirectional tracking results are used as input for model training; no intermediate tracking outputs such as confidence scores or response maps are required. In other words, the method is decoupled from any specific tracking algorithm and can incorporate any existing high-speed tracker to perform the forward and backward tracking. In this way, efficient and reliable video annotation is achieved, and the whole annotation pipeline remains simple and easy to use.
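A rough PyTorch sketch of this model structure (feature extraction with pixel-wise cross-correlation, Transformer-based fusion of the bidirectional sequence, and the two prediction heads). The backbone choice, channel sizes, token layout, and head designs are assumptions for illustration and are not claimed to match the authors' exact configuration.

```python
import torch
import torch.nn as nn

def pixelwise_xcorr(template_feat, search_feat):
    """Pixel-wise cross-correlation: each template location acts as a 1x1 kernel.

    template_feat: (B, C, Ht, Wt), search_feat: (B, C, Hs, Ws)
    returns: (B, Ht*Wt, Hs, Ws) correlation maps.
    """
    b, c, hs, ws = search_feat.shape
    t = template_feat.flatten(2)                     # (B, C, Ht*Wt)
    s = search_feat.flatten(2)                       # (B, C, Hs*Ws)
    corr = torch.einsum("bcn,bcm->bnm", t, s)        # (B, Ht*Wt, Hs*Ws)
    return corr.view(b, -1, hs, ws)

class TLNetSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.proj = nn.LazyConv2d(d_model, kernel_size=1)      # compress correlation maps
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal_fusion = nn.TransformerEncoder(encoder_layer, n_layers)
        self.motion_embed = nn.Linear(8, d_model)              # forward + backward boxes (4 + 4)
        self.quality_head = nn.Linear(d_model, 1)              # per-frame confidence score
        self.box_head = nn.Linear(d_model, 4)                  # refined (x, y, w, h)

    def forward(self, template_feat, search_feats, fwd_boxes, bwd_boxes):
        # search_feats: (B, T, C, Hs, Ws); fwd_boxes, bwd_boxes: (B, T, 4).
        b, t = search_feats.shape[:2]
        corr = [pixelwise_xcorr(template_feat, search_feats[:, i]) for i in range(t)]
        vis = torch.stack([self.proj(c).mean(dim=(-2, -1)) for c in corr], dim=1)  # (B, T, D)
        motion = self.motion_embed(torch.cat([fwd_boxes, bwd_boxes], dim=-1))      # (B, T, D)
        tokens = self.temporal_fusion(vis + motion)                                # (B, T, D)
        return self.quality_head(tokens).sigmoid(), self.box_head(tokens)
```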
Result
We generate annotations on two large-scale tracking datasets, LaSOT and TrackingNet, with a separate evaluation protocol for each. For LaSOT, the mean intersection over union (mIoU) is used to evaluate annotation quality directly, and the accuracy (Acc), recall, and true negative rate (TNR) are used to evaluate the filtering ability of the quality evaluation network. Compared with VASR, which takes about two weeks (336 hours) to annotate the dataset, our method needs only about 43 hours. Specifically, 5.4% of the frames are recognized as failed frames, with an Acc of 96.7% and a TNR of 76.1%. After the noisy frames are filtered out and replaced with manual annotations, the mIoU of the annotations increases from 0.824 to 0.864, and the regression network further improves it to 0.871. For TrackingNet, whose ground truth itself contains noisy annotations, we adopt an indirect evaluation: three tracking algorithms, ATOM, DiMP, and PrDiMP, are retrained with our annotations and evaluated on three tracking datasets, LaSOT, TrackingNet, and GOT-10k. The results show that models trained with our annotations outperform those trained with the original ground truth, which indicates that our method generates high-quality annotations efficiently. Moreover, the ablation study demonstrates the effectiveness of the Transformer model and the other design choices.
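To make the LaSOT protocol concrete, the sketch below computes mIoU for the generated boxes and the Acc, recall, and TNR of the frame-level filtering decisions. Treating kept frames as the positive class and failed frames as the negative class is an assumption; the abstract does not spell out this mapping.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred: List[Box], gt: List[Box]) -> float:
    """mIoU between generated annotations and ground-truth boxes."""
    return sum(iou(p, g) for p, g in zip(pred, gt)) / len(gt)

def filtering_metrics(pred_keep: List[bool], true_keep: List[bool]) -> Tuple[float, float, float]:
    """Acc, recall, and TNR of the frame-level keep/fail decisions.

    pred_keep[i] is True when the quality evaluation network keeps frame i;
    true_keep[i] is True when the original annotation of frame i is actually accurate.
    """
    tp = sum(p and t for p, t in zip(pred_keep, true_keep))              # accurate frame kept
    tn = sum((not p) and (not t) for p, t in zip(pred_keep, true_keep))  # failed frame filtered out
    fp = sum(p and (not t) for p, t in zip(pred_keep, true_keep))
    fn = sum((not p) and t for p, t in zip(pred_keep, true_keep))
    acc = (tp + tn) / len(true_keep)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return acc, recall, tnr
```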
Conclusion
We develop an efficient video annotation method that mines sequential object appearance and motion information and fuses the forward and backward tracking results. With the frame-level quality evaluation performed by the quality evaluation network and the bounding box refinement performed by the regression network, reliable annotations are generated while more than 90% of the manual annotation cost is saved. Because it is decoupled from specific tracking algorithms, our method generalizes well and can employ any existing high-speed tracker to achieve efficient annotation. We expect the proposed method to make the annotation process more reliable, faster, and simpler.
video annotation; single object tracking; Transformer model; cross-correlation; sequential information fusion
Ba J L, Kiros J R and Hinton G E. 2016. Layer normalization [EB/OL]. [2022-08-16]. https://arxiv.org/pdf/1607.06450.pdf
Bertinetto L, Valmadre J, Henriques J F, Vedaldi A and Torr P H S. 2016. Fully-convolutional siamese networks for object tracking//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 850-865 [DOI: 10.1007/978-3-319-48881-3_56]
Bhat G, Danelljan M, van Gool L and Timofte R. 2019. Learning discriminative model prediction for tracking//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6181-6190 [DOI: 10.1109/ICCV.2019.00628]
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A and Zagoruyko S. 2020. End-to-end object detection with transformers//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 213-229 [DOI: 10.1007/978-3-030-58452-8_13]
Chen X, Kang B, Wang D, Li D D and Lu H C. 2022. Efficient visual tracking via hierarchical cross-attention transformer [EB/OL]. [2022-10-30]. https://arxiv.org/pdf/2203.13537.pdf
Chen X, Yan B, Zhu J W, Wang D, Yang X Y and Lu H C. 2021. Transformer tracking//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 8122-8131 [DOI: 10.1109/CVPR46437.2021.00803]
Cui Y T, Jiang C, Wang L M and Wu G S. 2022. MixFormer: end-to-end tracking with iterative mixed attention//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13598-13608 [DOI: 10.1109/CVPR52688.2022.01324]
Dai K N, Zhao J, Wang L J, Wang D, Li J H, Lu H C, Qian X S and Yang X Y. 2021. Video annotation for visual tracking via selection and refinement//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10276-10285 [DOI: 10.1109/ICCV48922.2021.01013]
Danelljan M, Bhat G, Khan F S and Felsberg M. 2019. ATOM: accurate tracking by overlap maximization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4655-4664 [DOI: 10.1109/CVPR.2019.00479]
Danelljan M, van Gool L and Timofte R. 2020. Probabilistic regression for visual tracking//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 7181-7190 [DOI: 10.1109/CVPR42600.2020.00721]
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding [EB/OL]. [2022-08-16]. https://arxiv.org/pdf/1810.04805.pdf
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. [2022-08-16]. https://arxiv.org/pdf/2010.11929.pdf
Fan H, Lin L T, Yang F, Chu P, Deng G, Yu S J, Bai H X, Xu Y, Liao C Y and Ling H B. 2019. LaSOT: a high-quality benchmark for large-scale single object tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5369-5378 [DOI: 10.1109/CVPR.2019.00552]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Huang L H, Zhao X and Huang K Q. 2021. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5): 1562-1577 [DOI: 10.1109/TPAMI.2019.2957464]
Kuznetsova A, Talati A, Luo Y W, Simmons K and Ferrari V. 2021. Efficient video annotation with visual interpolation and frame selection guidance//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 3069-3078 [DOI: 10.1109/WACV48630.2021.00311]
Li B, Wu W, Wang Q, Zhang F Y, Xing J L and Yan J J. 2019. SiamRPN++: evolution of siamese visual tracking with very deep networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4277-4286 [DOI: 10.1109/CVPR.2019.00441]
Li B, Yan J J, Wu W, Zhu Z and Hu X L. 2018. High performance visual tracking with siamese region proposal network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8971-8980 [DOI: 10.1109/CVPR.2018.00935]
Loshchilov I and Hutter F. 2019. Decoupled weight decay regularization [EB/OL]. [2022-08-16]. https://arxiv.org/pdf/1711.05101.pdf
Meng L and Yang X. 2019. A survey of object tracking algorithms. Acta Automatica Sinica, 45(7): 1244-1260 (in Chinese) [DOI: 10.16383/j.aas.c180277]
Mueller M, Smith N and Ghanem B. 2017. Context-aware correlation filter tracking//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1387-1395 [DOI: 10.1109/CVPR.2017.152]
Müller M, Bibi A, Giancola S, Alsubaihi S and Ghanem B. 2018. TrackingNet: a large-scale dataset and benchmark for object tracking in the wild//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 300-317 [DOI: 10.1007/978-3-030-01246-5_19]
Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I and Savarese S. 2019. Generalized intersection over union: a metric and a loss for bounding box regression//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 658-666 [DOI: 10.1109/CVPR.2019.00075]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Vondrick C and Ramanan D. 2011. Video annotation and tracking with active learning//Proceedings of the 24th International Conference on Neural Information Processing Systems. Granada, Spain: Curran Associates Inc.: 28-36
Vondrick C, Ramanan D and Patterson D. 2010. Efficiently scaling up video annotation with crowdsourced marketplaces//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Greece: Springer: 610-623 [DOI: 10.1007/978-3-642-15561-1_44]
Wang M M, Yang X Q and Liu Y. 2022. A spatio-temporal encoded network for single object tracking. Journal of Image and Graphics, 27(9): 2733-2748 (in Chinese) [DOI: 10.11834/jig.211157]
Wang Z Q, Xu J, Liu L, Zhu F and Shao L. 2019. RANet: ranking attention network for fast video object segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3977-3986 [DOI: 10.1109/ICCV.2019.00408]
Wu Y, Lim J and Yang M H. 2013. Online object tracking: a benchmark//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 2411-2418 [DOI: 10.1109/CVPR.2013.312]
Xu Y D, Wang Z Y, Li Z X, Yuan Y and Yu G. 2020. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 12549-12556 [DOI: 10.1609/aaai.v34i07.6944]
Yan B, Peng H W, Fu J L, Wang D and Lu H C. 2021a. Learning spatio-temporal transformer for visual tracking//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10428-10437 [DOI: 10.1109/ICCV48922.2021.01028]
Yan B, Zhang X Y, Wang D, Lu H C and Yang X Y. 2021b. Alpha-refine: boosting tracking performance by precise bounding box estimation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 5285-5294 [DOI: 10.1109/CVPR46437.2021.00525]
Yuen J, Russell B, Liu C and Torralba A. 2009. LabelMe video: building a video database with human annotations//Proceedings of the 12th IEEE International Conference on Computer Vision. Kyoto, Japan: IEEE: 1451-1458 [DOI: 10.1109/ICCV.2009.5459289]
Zhang Z P, Peng H W, Fu J L, Li B and Hu W M. 2020. Ocean: object-aware anchor-free tracking//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 771-787 [DOI: 10.1007/978-3-030-58589-1_46]
Zhu J Z, Wang D and Lu H C. 2019. Learning background-temporal-aware correlation filter for real-time visual tracking. Journal of Image and Graphics, 24(4): 536-549 (in Chinese) [DOI: 10.11834/jig.180320]