Objective Deep-model-based tracking algorithms usually require large-scale, high-quality annotated training datasets, while manually annotating video data frame by frame costs considerable labor and time. This paper proposes a lightweight Transformer-based video annotation algorithm (Transformer-based label network, TLNet) that efficiently produces frame-by-frame annotations for large-scale sparsely labeled video datasets. Method The algorithm uses a Transformer model to process sequential object appearance and motion information and to fuse the forward and backward tracking results. A quality evaluation sub-network filters out frames where tracking failed, which are then annotated manually; a regression sub-network optimizes the initial annotations of the remaining frames and outputs more precise bounding boxes. The algorithm generalizes well: it is decoupled from any specific tracking algorithm and can employ any existing lightweight tracker to achieve efficient automatic video annotation. Result Annotations are generated on two large-scale tracking datasets. For the LaSOT (large-scale single object tracking) dataset, the automatic annotation process takes only about 43 h, and the mean intersection over union (mIoU) with the ground truth increases from 0.824 to 0.871. For the TrackingNet dataset, three tracking algorithms are retrained with our annotations and evaluated on three datasets; the models trained with our annotations outperform those trained with the original TrackingNet annotations. Conclusion The proposed TLNet mines sequential object appearance and motion information, performs frame-level quality evaluation of the forward and backward tracking results, and further optimizes the bounding boxes. The method is decoupled from specific tracking algorithms, generalizes well, saves more than 90% of the manual annotation cost, and efficiently generates high-quality video annotations.
An efficient Transformer-based video annotation method for object tracking
Objective High-performance robust tracking methods nowadays typically rely on deep models, which in turn depend on large-scale, high-quality annotated video datasets for training. However, manually annotating videos frame by frame is labor-intensive and costly. Existing video annotation methods usually interpolate sparsely labeled datasets based on geometric information or off-the-shelf tracking methods. A shortcoming of these methods is that they lack a quality evaluation mechanism to filter out noisy annotations, which makes the resulting annotations unreliable. Additionally, some interactive annotation tools require heavy human involvement, which complicates the annotation process. The end-to-end deep model VASR (video annotation via selection and refinement) can generate reliable annotations through selection and refinement; however, because its training data consist of intermediate tracking results and object masks, it is tightly coupled to specific tracking methods, and its annotation-generation process is complicated and time-consuming. To meet the need of visual tracking methods for large-scale annotated video datasets, to simplify the annotation process, and to reduce the computational cost, we develop TLNet (Transformer-based label network), an efficient end-to-end model that automatically generates frame-by-frame annotations for sparsely labeled tracking datasets. Because the Transformer model is well suited to processing sequential information, we adopt it to fuse the bidirectional, sequential input features. Method In this study, an efficient video annotation method is developed to generate reliable annotations quickly.
The annotation strategy consists of three steps. First, high-speed trackers perform forward and backward tracking. Second, the resulting original annotations are evaluated at frame level, and noisy frames are filtered out for manual annotation. Third, the remaining frames are optimized to produce more precise bounding boxes. To realize the evaluation and optimization functions described above, we propose an efficient Transformer-based model called TLNet. The model consists of three kinds of modules: a feature extractor, a feature fusion module, and prediction heads. To extract the visual features of the object, a pixelwise cross-correlation operation is applied between the template and the search region. A Transformer is introduced to process the sequential information, including object appearance and motion cues, after the visual and motion features are combined; it is also used to fuse the bidirectional tracking results, i.e., forward tracking and backward tracking. The model contains two sub-networks: a quality evaluation network and a regression network. The former outputs a confidence score for each frame's original annotation so that failed frames can be filtered out and returned to manual annotators; the latter optimizes the original annotations of the remaining frames and outputs more precise bounding boxes. Our method generalizes well because only video frames and bidirectional tracking results are used as input for model training, and no intermediate tracking artifacts such as confidence scores or response maps are used. Specifically, our method is decoupled from any specific tracking algorithm and can integrate any existing high-speed tracker to perform the forward and backward tracking. In this way, efficient and reliable video annotation can be achieved.
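The three-step strategy above can be sketched as follows. This is a minimal illustration with hypothetical interfaces: `tracker.track` stands in for any high-speed tracker, and `tlnet` stands in for the quality evaluation and regression networks, which in the real method operate on image features rather than these toy inputs.

```python
def annotate_video(frames, sparse_labels, tracker, tlnet, score_thresh=0.5):
    """Sketch of the annotation pipeline (hypothetical interfaces).

    frames: the video frames; sparse_labels: sparse key-frame boxes.
    Returns refined per-frame annotations plus the indices of failed
    frames that must be sent back to human annotators.
    """
    # Step 1: forward and backward tracking from the sparse key-frame labels
    fwd = tracker.track(frames, sparse_labels, direction="forward")
    bwd = tracker.track(frames, sparse_labels, direction="backward")

    annotations, failed = {}, []
    for i, frame in enumerate(frames):
        # Step 2: frame-level quality evaluation of the original annotation;
        # Step 3: bounding-box refinement by the regression head
        confidence, refined_box = tlnet(frame, fwd[i], bwd[i])
        if confidence < score_thresh:
            failed.append(i)            # returned to manual annotators
        else:
            annotations[i] = refined_box
    return annotations, failed
```

Because the tracker and the TLNet model only communicate through boxes and frames here, any high-speed tracker can be swapped in, mirroring the decoupling property described above.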
The complete annotation process is also designed to be simple and easy to use. Result Our method is applied to generate annotations on two large-scale tracking datasets, LaSOT and TrackingNet, with a separate evaluation protocol for each. For the LaSOT dataset, we use the mean intersection over union (mIoU) to evaluate annotation quality directly, and accuracy (Acc), recall, and true negative rate (TNR) to evaluate the filtering ability of the quality evaluation network. Compared with the VASR method, which takes about two weeks (336 hours in total), our method needs only about 43 hours to generate annotations. Specifically, 5.4% of frames are recognized as failed frames, with an Acc of 96.7% and a TNR of 76.1%. After the noisy frames are filtered out and replaced with manual annotations, the mIoU of the annotations increases from 0.824 to 0.864, and the regression network further improves it to 0.871. For the TrackingNet dataset, whose ground truth itself contains noisy annotations, we evaluate annotation quality indirectly: three different tracking algorithms, ATOM, DiMP, and PrDiMP, are retrained with our annotations and evaluated on three tracking datasets, LaSOT, TrackingNet, and GOT-10k. The results demonstrate that our annotations train better tracking models than the original ground truth does, and that our method can generate high-quality annotations efficiently. Moreover, an ablation study demonstrates the effectiveness of the Transformer model and the other design choices. Conclusion We develop an efficient video annotation method that mines sequential object appearance and motion information and fuses the forward and backward tracking results.
Reliable annotations can be generated, and more than 90% of the manual annotation cost can be saved, on the basis of the frame-level quality evaluation performed by our quality evaluation network and the bounding-box optimization performed by our regression network. Because it is decoupled from specific tracking algorithms, our method generalizes well and can apply any existing high-speed tracker to achieve efficient annotation. We expect the proposed annotation method to make the annotation process more reliable, faster, and simpler.
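For reference, the mIoU metric used in the LaSOT evaluation protocol above is the standard box-overlap measure averaged over frames; a minimal implementation (boxes assumed in `[x1, y1, x2, y2]` form) might look like:

```python
def iou(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def miou(pred_boxes, gt_boxes):
    """Mean IoU between generated annotations and ground-truth boxes."""
    ious = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(ious) / len(ious)
```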