Vision Transformer-based recognition tasks: a critical review

Zhou Lijuan; Mao Jianing

doi:10.11834/jig.220895

Vol. 28, Issue 10, Pages: 2969-3003 (2023)
Received: 02 September 2022, Revised: 22 November 2022, Published: 16 October 2023

Zhou Lijuan, Mao Jianing. 2023. Vision Transformer-based recognition tasks: a critical review. Journal of Image and Graphics, 28(10): 2969-3003. DOI: 10.11834/jig.220895.

Abstract (Chinese version)

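Transformer models have achieved strong results in natural language processing, and because they connect vision and language more effectively, they have also attracted great interest from the computer vision community. This paper summarizes more than one hundred representative vision Transformer methods for a range of recognition tasks, compares model performance within each task, and on that basis summarizes the strengths, weaknesses, and open challenges of the models for each task category. According to recognition granularity, it covers global recognition methods, such as image classification and video classification, and local recognition methods, such as object detection and visual segmentation. Given how widely existing methods are applied to three specific recognition tasks, it also summarizes methods for face recognition, action recognition, and pose estimation. It further reviews the state of research on general-purpose methods that serve multiple vision tasks or are domain-agnostic. Transformer-based models have enabled many end-to-end methods and continue to pursue a balance between accuracy and computational cost. For global recognition tasks, Transformer models have explored how to split images into patch sequences and how to represent token features; for local recognition tasks, Transformer models perform well because they capture global information more effectively. In face recognition and action recognition, attention mechanisms reduce errors in feature representation and can handle rich and diverse features. Transformers can address feature misalignment in pose estimation, help improve regression-based methods, and reduce the ambiguity introduced by depth mapping in 3D estimation. Extensive exploration demonstrates the effectiveness of vision Transformers in recognition tasks, and improvements in feature representation or network structure help raise performance.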

Abstract

Transformer models benefit both natural language processing and computer vision because they model long-range dependencies, compute adaptively through self-attention, scale to large models and large datasets, and connect vision and language more effectively. Vision Transformer methods that bring the Transformer into vision tasks have therefore developed rapidly. Existing surveys summarize and analyze methods across many applications, but those applications are heterogeneous and the methods differ widely. Moreover, comparative analysis usually contrasts Transformers with traditional convolutional neural networks (CNNs), while comparisons and connections among Transformer models themselves receive less attention. We summarize and compare more than 100 popular vision Transformer methods for various recognition tasks. We review global recognition methods for image and video classification, and local recognition methods for object detection and visual segmentation. We also summarize methods for three specific recognition tasks: face recognition, action recognition, and pose estimation. Furthermore, we summarize general-purpose methods that serve multiple vision tasks or are domain-independent, and that can be applied to image classification, object detection, and other vision tasks. The performance of these Transformer-based models is also compared and analyzed on public datasets. Image classification methods mostly represent features through visual tokens and class tokens. Models represented by the vision Transformer (ViT) and data-efficient image Transformers (DeiT) have shown their potential on the ImageNet datasets. Object detection requires locating target objects in the input visual data and predicting the coordinates and labels of a series of bounding boxes.
Object detection is represented by the detection Transformer (DETR), which removes the indirection of earlier methods that classify and regress through proposals, anchors, or windows. Subsequent work, such as conditional DETR, deformable DETR, and unsupervised pre-training DETR (UP-DETR), improves the feature maps, computational complexity, and convergence speed of DETR to some extent. Transformer-based models have also been applied to salient object detection, 3D detection in point clouds, and few-shot object detection. Semantic segmentation assigns a class label to every pixel in the image, refining the bounding-box prediction of object detection to the pixel level. However, semantic segmentation determines only the class of each pixel and cannot by itself distinguish multiple instances among pixels of the same class. Transformers have also been used to improve U-Net for medical image segmentation. The Transformer can be combined with a pyramid network, or paired with different decoder structures for pixel-by-pixel segmentation, such as segmentation Transformer progressive upsampling (SETR-PUP) and segmentation Transformer multi-level feature aggregation (SETR-MLA). Mask classification, commonly used in instance segmentation, can also be applied to semantic segmentation with Transformer structures such as Segmenter. Instance segmentation resembles a combination of object detection and semantic segmentation. In contrast to the bounding box output by object detection, instance segmentation outputs a mask, which delineates object edges and distinguishes different instances of similar objects. To some extent, this compensates for the limits of semantic segmentation.
Transformers bring more end-to-end methods to instance segmentation and improve mask quality during the segmentation process. For face recognition, the Transformer offers an alignment-free approach and can handle noise related to facial expressions as well as racial bias. Action recognition classifies the human actions in input videos; it resembles image classification but additionally requires processing the temporal dimension. Beyond two-stream networks and three-dimensional convolution, Transformers model long-term temporal and spatial dependencies for action recognition. Pose estimation is usually treated as the problem of localizing human body keypoints and identifying the spatial relationships between body parts. It comprises 2D pose estimation and 3D pose estimation: the former determines the two-dimensional coordinates of body parts, while the latter adds depth information to those coordinates. Transformers refine keypoint features for pose estimation and improve the modeling of intra-frame joint relationships and inter-frame temporal relationships. Research on Transformer-based multi-task models focuses on integrating image classification, object detection, and semantic segmentation. Other popular models that work across the vision and language domains have also been proposed. Extensive research has shown the effectiveness of vision Transformers in recognition tasks, and improvements in feature representation or network structure benefit performance. Future research directions include effective and efficient methods for positional encoding, self-supervised learning, multimodal integration, and reducing computational cost while preserving accuracy.
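
The visual-token and class-token representation mentioned for image classification can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed papers and rests on several assumptions made only for illustration: the helper names `patchify` and `self_attention` are hypothetical, the query, key, and value projections are identity maps rather than learned weights, the class token is a zero vector rather than a learned embedding, and positional encoding is omitted.

```python
import math

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity Q/K/V projections (a real model learns these).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(wt * v[i] for wt, v in zip(weights, tokens)) for i in range(d)])
    return out

# Toy 4 x 4 "image" split into 2 x 2 patches: 4 visual tokens of dimension 4.
image = [[float(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
visual_tokens = patchify(image, patch=2)

# Assumption: zero class token; in a trained model this is a learned vector.
class_token = [0.0] * len(visual_tokens[0])
sequence = [class_token] + visual_tokens

attended = self_attention(sequence)
class_feature = attended[0]  # this output would feed a classification head
```

Because every token attends to every other token, the class-token output aggregates information from all patches in a single layer, which is the global-context property the abstract attributes to Transformer models.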

Keywords

References

Arnab A ， Dehghani M ， Heigold G ， Sun C ， Lučić M and Schmid C . 2021 . ViViT： a video vision Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 6816 - 6826 ［ DOI： 10.1109/ICCV48922.2021.00676 http://dx.doi.org/10.1109/ICCV48922.2021.00676 ］

Atito S ， Awais M and Kittler J . 2021 . SiT： self-supervised vision Transformer ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2104.03602.pdf https://arxiv.org/pdf/2104.03602.pdf

Bai R W ， Li M ， Meng B ， Li F F ， Jiang M ， Ren J X and Sun D G . 2022 . Hierarchical graph convolutional skeleton Transformer for action recognition // Proceedings of 2022 IEEE International Conference on Multimedia and Expo . Taipei， China ： IEEE： 01 - 06 ［ DOI： 10.1109/ICME52920.2022.9859781 http://dx.doi.org/10.1109/ICME52920.2022.9859781 ］

Bar A ， Wang X ， Kantorov V ， Reed C J ， Herzig R ， Chechik G ， Rohrbach A ， Darrell T and Globerson A . 2021 . DETReg： unsupervised pretraining with region priors for object detection ［EB/OL］. ［ 2022-01-24 ］. https://arxiv.org/pdf/2106.04550.pdf https://arxiv.org/pdf/2106.04550.pdf

Barsoum E ， Zhang C ， Ferrer C C and Zhang Z Y . 2016 . Training deep networks for facial expression recognition with crowd-sourced label distribution // Proceedings of the 18th ACM International Conference on Multimodal Interaction . Tokyo， Japan ： ACM： 279 - 283 ［ DOI： 10.1145/2993148.2993165 http://dx.doi.org/10.1145/2993148.2993165 ］

Bertasius G ， Wang H and Torresani L . 2021 . Is space-time attention all you need for video understanding? // Proceedings of the 38th International Conference on Machine Learning . Virtual Event ： PMLR： 813 - 824

Brown T B ， Mann B ， Ryder N ， Subbiah M ， Kaplan J ， Dhariwal P ， Neelakantan A ， Shyam P ， Sastry G ， Askell A ， Agarwal S ， Herbert-Voss A ， Krueger G ， Henighan T ， Child R ， Ramesh A ， Ziegler D M ， Wu J ， Winter C ， Hesse C ， Chen M ， Sigler E ， Litwin M ， Gray S ， Chess B ， Clark J ， Berner C ， McCandlish S ， Radford A ， Sutskever I and Amodei D . 2020 . Language models are few-shot learners // Proceedings of the 34th International Conference on Neural Information Processing Systems . Vancouver， Canada ： Curran Associates Inc.： 1877 - 1901

Caesar H ， Uijlings J and Ferrari V . 2018 . COCO-stuff： thing and stuff classes in context // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City， USA ： IEEE： 1209 - 1218 ［ DOI： 10.1109/CVPR.2018.00132 http://dx.doi.org/10.1109/CVPR.2018.00132 ］

Cai Y M ， Cai G Y and Cai J . 2021 . Action-Transformer for action recognition in short videos // Proceedings of the 11th International Conference on Intelligent Control and Information Processing . Dali， China ： IEEE： 278 - 283 ［ DOI： 10.1109/ICICIP53388.2021.9642184 http://dx.doi.org/10.1109/ICICIP53388.2021.9642184 ］

Cao H ， Wang Y Y ， Chen J ， Jiang D S ， Zhang X P ， Tian Q and Wang M N . 2021 . Swin-unet： Unet-like pure Transformer for medical image segmentation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2105.05537.pdf https://arxiv.org/pdf/2105.05537.pdf

Cao J L ， Li Y L ， Sun H Q ， Xie J ， Huang K Q and Pang Y W . 2022 . A survey on deep learning based visual object detection . Journal of Image and Graphics ， 27 （ 6 ）： 1697 - 1722 (in Chinese)

Carion N ， Massa F ， Synnaeve G ， Usunier N ， Kirillov A and Zagoruyko S . 2020 . End-to-end object detection with Transformers // Proceedings of the 16th European Conference on Computer Vision . Glasgow， UK ： Springer： 213 - 229 ［ DOI： 10.1007/978-3-030-58452-8_13 http://dx.doi.org/10.1007/978-3-030-58452-8_13 ］

Chang Y ， Hu M H ， Zhai G T and Zhang X P . 2021 . Transclaw U-Net： claw U-Net with Transformers for medical image segmentation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2107.05188.pdf https://arxiv.org/pdf/2107.05188.pdf

Chen B Y ， Li P X ， Li B P ， Li C M ， Bai L ， Lin C ， Sun M ， Yan J J and Ouyang W L . 2021a . PSViT： better vision Transformer via token pooling and attention sharing ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2108.03428.pdf https://arxiv.org/pdf/2108.03428.pdf

Chen B Z ， Liu Y S ， Zhang Z ， Lu G and Zhang D . 2022a . TransAttUnet： multi-level attention-guided U-Net with Transformer for medical image segmentation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2107.05274.pdf https://arxiv.org/pdf/2107.05274.pdf

Chen C F R ， Fan Q F and Panda R . 2021c . CrossViT： cross-attention multi-scale vision Transformer for image classification // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 347 - 356 ［ DOI： 10.1109/ICCV48922.2021.00041 http://dx.doi.org/10.1109/ICCV48922.2021.00041 ］

Chen G ， Zhang S Q and Zhao X M . 2022 . Video sequence-based human facial expression recognition using Transformer networks . Journal of Image and Graphics ， 27 （ 10 ）： 3022 - 3030 (in Chinese) ［ DOI： 10.11834/jig.210248 http://dx.doi.org/10.11834/jig.210248 ］

Chen H Y ， Li C ， Li X Y ， Wang G ， Hu W M ， Li Y X ， Liu W L ， Sun C H ， Yao Y D ， Teng Y Y and Grzegorzek M . 2022b . GasHis-Transformer： a multi-scale visual Transformer approach for gastric histopathology image classification ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2104.14528v5.pdf https://arxiv.org/pdf/2104.14528v5.pdf

Chen J N ， Lu Y Y ， Yu Q H ， Luo X D ， Adeli E ， Wang Y ， Lu L ， Yuille A L and Zhou Y Y . 2021e . TransUNet： Transformers make strong encoders for medical image segmentation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2102.04306.pdf https://arxiv.org/pdf/2102.04306.pdf

Chen J W and Ho C M . 2022 . MM-ViT： multi-modal video Transformer for compressed video action recognition // Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision . Waikoloa， USA ： IEEE： 786 - 797 ［ DOI： 10.1109/WACV51458.2022.00086 http://dx.doi.org/10.1109/WACV51458.2022.00086 ］

Chen Z S ， Xie L X ， Niu J W ， Liu X F ， Wei L H and Tian Q . 2021b . Visformer： the vision-friendly Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 569 - 578 ［ DOI： 10.1109/ICCV48922.2021.00063 http://dx.doi.org/10.1109/ICCV48922.2021.00063 ］

Chen Z Y ， Zhu Y S ， Zhao C Y ， Hu G S ， Zeng W ， Wang J Q and Tang M . 2021d . DPT： deformable patch-based Transformer for visual recognition // Proceedings of the 29th ACM International Conference on Multimedia . Virtual Event， China ： ACM： 2899 - 2907 ［ DOI： 10.1145/3474085.3475467 http://dx.doi.org/10.1145/3474085.3475467 ］

Cheng B W ， Schwing A G and Kirillov A . 2021 . Per-pixel classification is not all you need for semantic segmentation . Advances in Neural Information Processing Systems ， 34 ： 17864 - 17875

Codella N ， Rotemberg V ， Tschandl P ， Celebi M E ， Dusza S ， Gutman D ， Helba B ， Kalloo A ， Liopyris K ， Marchetti M ， Kittler H and Halpern A . 2019 . Skin lesion analysis toward melanoma detection 2018： a challenge hosted by the international skin imaging collaboration （ISIC）［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/1902.03368.pdf https://arxiv.org/pdf/1902.03368.pdf

Cordts M ， Omran M ， Ramos S ， Rehfeld T ， Enzweiler M ， Benenson R ， Franke U ， Roth S and Schiele B . 2016 . The cityscapes dataset for semantic urban scene understanding // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition . Las Vegas， USA ： IEEE： 3213 - 3223 ［ DOI： 10.1109/CVPR.2016.350 http://dx.doi.org/10.1109/CVPR.2016.350 ］

Dai Y ， Gao Y F and Liu F Y . 2021a . TransMed： Transformers advance multi-modal medical image classification . Diagnostics ， 11 （ 8 ）： # 1384 ［ DOI： 10.3390/diagnostics11081384 http://dx.doi.org/10.3390/diagnostics11081384 ］

Dai Z G ， Cai B L ， Lin Y G and Chen J Y . 2021b . UP-DETR： unsupervised pre-training for object detection with Transformers // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 1601 - 1610 ［ DOI： 10.1109/CVPR46437.2021.00165 http://dx.doi.org/10.1109/CVPR46437.2021.00165 ］

Deng J ， Dong W ， Socher R ， Li L J ， Li K and Li F F . 2009 . ImageNet： a large-scale hierarchical image database // Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition . Miami， USA ： IEEE： 248 - 255 ［ DOI： 10.1109/CVPR.2009.5206848 http://dx.doi.org/10.1109/CVPR.2009.5206848 ］

Devlin J ， Chang M W ， Lee K and Toutanova K . 2019 . BERT： pre-training of deep bidirectional Transformers for language understanding // Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 1 （Long and Short Papers） . Minneapolis， USA ： Association for Computational Linguistics： 4171 - 4186 ［ DOI： 10.18653/v1/n19-1423 http://dx.doi.org/10.18653/v1/n19-1423 ］

Dong B ， Wang W H ， Fan D P ， Li J P ， Fu H Z and Shao L . 2023 . Polyp-PVT： polyp segmentation with pyramid vision Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2108.06932.pdf https://arxiv.org/pdf/2108.06932.pdf

Dong B ， Zeng F ， Wang T C ， Zhang X Y and Wei Y C . 2021 . SOLQ： segmenting objects by learning queries ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2106.02351.pdf https://arxiv.org/pdf/2106.02351.pdf

Dosovitskiy A ， Beyer L ， Kolesnikov A ， Weissenborn D ， Zhai X H ， Unterthiner T ， Dehghani M ， Minderer M ， Heigold G ， Gelly S ， Uszkoreit J and Houlsby N . 2021 . An image is worth 16 × 16 words： Transformers for image recognition at scale ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2010.11929.pdf https://arxiv.org/pdf/2010.11929.pdf

Frank S ， Bugliarello E and Elliott D . 2021 . Vision-and-language or vision-for-language？ On cross-modal influence in multimodal Transformers // Proceedings of 2021 Conference on Empirical Methods in Natural Language Processing . Punta Cana， Dominican Republic ： Association for Computational Linguistics： 9847 - 9857 ［ DOI： 10.18653/v1/2021.emnlp-main.775 http://dx.doi.org/10.18653/v1/2021.emnlp-main.775 ］

Gao P ， Zheng M H ， Wang X G ， Dai J F and Li H S . 2021a . Fast convergence of detr with spatially modulated co-attention // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 3601 - 3610 ［ DOI： 10.1109/ICCV48922.2021.00360 http://dx.doi.org/10.1109/ICCV48922.2021.00360 ］

Gao Y H ， Zhou M and Metaxas D N . 2021b . UTNet： a hybrid Transformer architecture for medical image segmentation // Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention——MICCAI 2021 . Strasbourg， France ： Springer： 61 - 71 ［ DOI： 10.1007/978-3-030-87199-4_6 http://dx.doi.org/10.1007/978-3-030-87199-4_6 ］

Girdhar R ， Carreira J J ， Doersch C and Zisserman A . 2019 . Video action Transformer network // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach， USA ： IEEE： 244 - 253 ［ DOI： 10.1109/CVPR.2019.00033 http://dx.doi.org/10.1109/CVPR.2019.00033 ］

Graham B ， El-Nouby A ， Touvron H ， Stock P ， Joulin A ， Jégou H and Douze M . 2021 . LeViT： a vision Transformer in ConvNet’s clothing for faster inference // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 12239 - 12249 ［ DOI： 10.1109/ICCV48922.2021.01204 http://dx.doi.org/10.1109/ICCV48922.2021.01204 ］

Guan T R ， Wang J ， Lan S Y ， Chandra R ， Wu Z X ， Davis L and Manocha D . 2022 . M3DETR： multi-representation， multi-scale， mutual-relation 3D object detection with Transformers // Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision . Waikoloa， USA ： IEEE： 2293 - 2303 ［ DOI： 10.1109/WACV51458.2022.00235 http://dx.doi.org/10.1109/WACV51458.2022.00235 ］

Guo J Y ， Han K ， Wu H ， Tang Y H ， Chen X H ， Wang Y H and Xu C . 2022 . CMT： convolutional neural networks meet vision Transformers ［EB/OL］. ［ 2022-01-21 ］. https://arxiv.org/pdf/2107.06263.pdf https://arxiv.org/pdf/2107.06263.pdf

Guo R H ， Niu D T ， Qu L and Li Z B . 2021 . SOTR： segmenting objects with Transformers // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 7137 - 7146 ［ DOI： 10.1109/ICCV48922.2021.00707 http://dx.doi.org/10.1109/ICCV48922.2021.00707 ］

Hampali S ， Sarkar S D ， Rad M and Lepetit V . 2021 . HandsFormer： keypoint Transformer for monocular 3D pose estimation of hands and object in interaction ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2104.14639v1.pdf https://arxiv.org/pdf/2104.14639v1.pdf

Han K ， Wang Y H ， Chen H T ， Chen X H ， Guo J Y ， Liu Z H ， Tang Y H ， Xiao A ， Xu C J ， Xu Y X ， Yang Z H ， Zhang Y M and Tao D C . 2022 . A survey on vision Transformer . IEEE Transactions on Pattern Analysis and Machine Intelligence ， 45 （ 1 ）： 87 - 110 ［ DOI： 10.1109/TPAMI.2022.3152247 http://dx.doi.org/10.1109/TPAMI.2022.3152247 ］

Hatamizadeh A ， Tang Y C ， Nath V ， Yang D ， Myronenko A ， Landman B ， Roth H R and Xu D G . 2022 . UNETR： Transformers for 3D medical image segmentation // Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision . Waikoloa， USA ： IEEE： 1748 - 1758 ［ DOI： 10.1109/WACV51458.2022.00181 http://dx.doi.org/10.1109/WACV51458.2022.00181 ］

He J ， Chen J N ， Liu S ， Kortylewski A ， Yang C ， Bai Y T and Wang C H . 2022 . TransFG： a Transformer architecture for fine-grained recognition . Proceedings of the AAAI Conference on Artificial Intelligence ， 36 （ 1 ）： 852 - 860 ［ DOI： 10.1609/aaai.v36i1.19967 http://dx.doi.org/10.1609/aaai.v36i1.19967 ］

Heo B ， Yun S ， Han D ， Chun S ， Choe J and Oh S J . 2021 . Rethinking spatial dimensions of vision Transformers // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 11916 - 11925 ［ DOI： 10.1109/ICCV48922.2021.01172 http://dx.doi.org/10.1109/ICCV48922.2021.01172 ］

Hu H Z ， Zhao W C ， Zhou W G ， Wang Y C and Li H Q . 2021a . SignBERT： pre-training of hand-model-aware representation for sign language recognition // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 11067 - 11076 ［ DOI： 10.1109/ICCV48922.2021.01090 http://dx.doi.org/10.1109/ICCV48922.2021.01090 ］

Hu J ， Cao L J ， Lu Y ， Zhang S C ， Wang Y ， Li K ， Huang F Y ， Shao L and Ji R R . 2021b . ISTR： end-to-end instance segmentation with Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2105.00637.pdf https://arxiv.org/pdf/2105.00637.pdf

Hu R H and Singh A . 2021 . UniT： Multimodal multitask learning with a unified Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 1419 - 1429 ［ DOI： 10.1109/ICCV48922.2021.00147 http://dx.doi.org/10.1109/ICCV48922.2021.00147 ］

Huang G B ， Ramesh M ， Berg T and Learned-Miller E . 2008 . Labeled faces in the wild： a database for studying face recognition in unconstrained environments ［EB/OL］. ［ 2022-03-26 ］. http://tamaraberg.com/papers/lfw.pdf http://tamaraberg.com/papers/lfw.pdf

Hwang S ， Heo M ， Oh S W and Kim S J . 2021 . Video instance segmentation using inter-frame communication Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2106.03299.pdf https://arxiv.org/pdf/2106.03299.pdf

Ionescu C ， Papava D ， Olaru V and Sminchisescu C . 2014 . Human3.6M： large scale datasets and predictive methods for 3D human sensing in natural environments . IEEE Transactions on Pattern Analysis and Machine Intelligence ， 36 （ 7 ）： 1325 - 1339 ［ DOI： 10.1109/TPAMI.2013.248 http://dx.doi.org/10.1109/TPAMI.2013.248 ］

Islam M A ， Jia S and Bruce N D B . 2020 . How much position information do convolutional neural networks encode？［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2001.08248.pdf https://arxiv.org/pdf/2001.08248.pdf

Ji G P ， Chou Y C ， Fan D P ， Chen G ， Fu H Z ， Jha D and Shao L . 2021a . Progressively normalized self-attention network for video polyp segmentation // Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention —— MICCAI 2021 . Strasbourg， France ： Springer： 142 - 152 ［ DOI： 10.1007/978-3-030-87193-2_14 http://dx.doi.org/10.1007/978-3-030-87193-2_14 ］

Ji Y F ， Zhang R M ， Wang H J ， Li Z ， Wu L Y ， Zhang S T and Luo P . 2021b . Multi-compound Transformer for accurate biomedical image segmentation // Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention —— MICCAI 2021 . Strasbourg， France ： Springer： 326 - 336 ［ DOI： 10.1007/978-3-030-87193-2_31 http://dx.doi.org/10.1007/978-3-030-87193-2_31 ］

Jiang B ， Yu J H ， Zhou L ， Wu K L and Yang Y . 2021 . Two-pathway Transformer network for video action recognition // Proceedings of 2021 IEEE International Conference on Image Processing . Anchorage， USA ： IEEE： 1089 - 1093 ［ DOI： 10.1109/ICIP42928.2021.9506453 http://dx.doi.org/10.1109/ICIP42928.2021.9506453 ］

Jin H ， Yang J M and Zhang S . 2021 . Efficient action recognition with introducing R（2+1）D convolution to improved Transformer // Proceedings of the 4th International Conference on Information Communication and Signal Processing . Shanghai， China ： IEEE： 379 - 383 ［ DOI： 10.1109/icicsp54369.2021.9611970 http://dx.doi.org/10.1109/icicsp54369.2021.9611970 ］

Kay W ， Carreira J ， Simonyan K ， Zhang B ， Hillier C ， Vijayanarasimhan S ， Viola F ， Green T ， Back T ， Natsev P ， Suleyman M and Zisserman A . 2017 . The kinetics human action video dataset ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/1705.06950.pdf https://arxiv.org/pdf/1705.06950.pdf

Ke L ， Danelljan M ， Li X ， Tai Y W ， Tang C K and Yu F . 2021 . Mask Transfiner for high-quality instance segmentation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2111.13673.pdf https://arxiv.org/pdf/2111.13673.pdf

Khan S ， Naseer M ， Hayat M ， Zamir S W ， Khan F S and Shah M . 2022 . Transformers in vision： a survey . ACM Computing Surveys ， 54 （ 10 s）： # 200 ［ DOI： 10.1145/3505244 http://dx.doi.org/10.1145/3505244 ］

Kong J ， Bian Y H and Jiang M . 2022 . MTT： multi-scale temporal Transformer for skeleton-based action recognition . IEEE Signal Processing Letters ， 29 ： 528 - 532 ［ DOI： 10.1109/LSP.2022.3142675 http://dx.doi.org/10.1109/LSP.2022.3142675 ］

Kumar N ， Verma R ， Sharma S ， Bhargava S ， Vahadane A and Sethi A . 2017 . A dataset and a technique for generalized nuclear segmentation for computational pathology . IEEE Transactions on Medical Imaging ， 36 （ 7 ）： 1550 - 1560 ［ DOI： 10.1109/Tmi.2017.2677499 http://dx.doi.org/10.1109/Tmi.2017.2677499 ］

Lanchantin J ， Wang T L ， Ordonez V and Qi Y J . 2021 . General multi-label image classification with Transformers // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 16473 - 16483 ［ DOI： 10.1109/CVPR46437.2021.01621 http://dx.doi.org/10.1109/CVPR46437.2021.01621 ］

Li H T ， Sui M Z ， Zhao F ， Zha Z J and Wu F . 2021a . MVT： mask vision Transformer for facial expression recognition in the wild ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2106.04520.pdf https://arxiv.org/pdf/2106.04520.pdf

Li K ， Wang S J ， Zhang X ， Xu Y F ， Xu W J and Tu Z W . 2021b . Pose recognition with cascade Transformers // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 1944 - 1953 ［ DOI： 10.1109/CVPR46437.2021.00198 http://dx.doi.org/10.1109/CVPR46437.2021.00198 ］

Li S C ， Cao Q G ， Liu L B ， Yang K L ， Liu S N ， Hou J and Yi S . 2021c . GroupFormer： group activity recognition with clustered spatial-temporal Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 13648 - 13657 ［ DOI： 10.1109/ICCV48922.2021.01341 http://dx.doi.org/10.1109/ICCV48922.2021.01341 ］

Li S H ， Sui X ， Luo X D ， Xu X X ， Liu Y and Goh R . 2021d . Medical image segmentation using squeeze-and-expansion Transformers // Proceedings of the 30th International Joint Conference on Artificial Intelligence . Montreal， Canada ：［s.n.］： 807 - 815 ［ DOI： 10.24963/ijcai.2021/112 http://dx.doi.org/10.24963/ijcai.2021/112 ］

Li W H ， Liu H ， Ding R W ， Liu M Y and Wang P C . 2022c . Lifting Transformer for 3D human pose estimation in video ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2103.14304v2.pdf https://arxiv.org/pdf/2103.14304v2.pdf

Li X Y ， Hou Y H ， Wang P C ， Gao Z M ， Xu M L and Li W Q . 2022a . Trear： Transformer-based RGB-D egocentric action recognition . IEEE Transactions on Cognitive and Developmental Systems ， 14 （ 1 ）： 246 - 252 ［ DOI： 10.1109/TCDS.2020.3048883 http://dx.doi.org/10.1109/TCDS.2020.3048883 ］

Li Y ， Sun Y F ， Cui Z ， Shan S G and Yang J . 2021e . Learning fair face representation with progressive cross Transformer ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2108.04983.pdf https://arxiv.org/pdf/2108.04983.pdf

Li Y H ， Mao H Z ， Girshick R and He K M . 2022b . Exploring plain vision Transformer backbones for object detection ［EB/OL］. ［ 2022-10-08 ］. https://arxiv.org/pdf/2203.16527.pdf https://arxiv.org/pdf/2203.16527.pdf

Lin M ， Li C M ， Bu X Y ， Sun M ， Lin C ， Yan J J ， Ouyang W L and Deng Z D . 2021a . DETR for crowd pedestrian detection ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2012.06785.pdf https://arxiv.org/pdf/2012.06785.pdf

Lin T Y ， Maire M ， Belongie S ， Hays J ， Perona P ， Ramanan D ， Dollár P and Zitnick C L . 2014 . Microsoft COCO： common objects in context // Proceedings of the 13th European Conference on Computer Vision . Zurich， Switzerland ： Springer： 740 - 755 ［ DOI： 10.1007/978-3-319-10602-1_48 http://dx.doi.org/10.1007/978-3-319-10602-1_48 ］

Lin W D ， Deng Y Y ， Gao Y ， Wang N ， Zhou J H ， Liu L Q ， Zhang L and Wang P . 2021b . CAT： cross-attention Transformer for one-shot object detection ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2104.14984.pdf https://arxiv.org/pdf/2104.14984.pdf

Ling X F ， Liang J X ， Wang D and Yang J . 2021 . A facial expression recognition system for smart learning based on YOLO and vision Transformer // Proceedings of the 7th International Conference on Computing and Artificial Intelligence . Tianjin， China ： ACM： 178 - 182 ［ DOI： 10.1145/3467707.3467733 http://dx.doi.org/10.1145/3467707.3467733 ］

Liu F F ， Wei H R ， Zhao W Z ， Li G Z ， Peng J Q and Li Z H . 2021a . WB-DETR： Transformer-based detector without backbone // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 2959 - 2967 ［ DOI： 10.1109/ICCV48922.2021.00297 http://dx.doi.org/10.1109/ICCV48922.2021.00297 ］

Liu J ， Shahroudy A ， Perez M ， Wang G ， Duan L Y and Kot A C . 2020 . NTU RGB+D 120： a large-scale benchmark for 3D human activity understanding . IEEE Transactions on Pattern Analysis and Machine Intelligence ， 42 （ 10 ）： 2684 - 2701 ［ DOI： 10.1109/TPAMI.2019.2916873 http://dx.doi.org/10.1109/TPAMI.2019.2916873 ］

Liu N ， Zhang N ， Wan K Y ， Shao L and Han J W . 2021b . Visual saliency Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 4702 - 4712 ［ DOI： 10.1109/ICCV48922.2021.00468 http://dx.doi.org/10.1109/ICCV48922.2021.00468 ］

Liu S L ， Zhang L ， Yang X ， Su H and Zhu J . 2021c . Query2label： a simple Transformer way to multi-label classification ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2107.10834.pdf https://arxiv.org/pdf/2107.10834.pdf

Liu W T and Lu X M . 2022 . Research progress of Transformer based on computer vision . Computer Engineering and Applications ， 58 （ 6 ）： 1 - 16 (in Chinese)

Liu X L ， Wang Q M ， Hu Y ， Tang X ， Zhang S W ， Bai S and Bai X . 2022a . End-to-end temporal action detection with Transformer ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2106.10271.pdf https://arxiv.org/pdf/2106.10271.pdf

Liu Y ， Zhang Y ， Wang Y X ， Hou F ， Yuan J ， Tian J ， Zhang Y ， Shi Z C ， Fan J P and He Z Q . 2022b . A survey of visual Transformers . IEEE Transactions on Neural Networks and Learning Systems ［ DOI： 10.1109/TNNLS.2022.3227717 http://dx.doi.org/10.1109/TNNLS.2022.3227717 ］

Liu Z ， Lin Y T ， Cao Y ， Hu H ， Wei Y X ， Zhang Z ， Lin S and Guo B N . 2021d . Swin Transformer： hierarchical vision Transformer using shifted windows // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 9992 - 10002 ［ DOI： 10.1109/ICCV48922.2021.00986 http://dx.doi.org/10.1109/ICCV48922.2021.00986 ］

Liu Z ， Zhang Z ， Cao Y ， Hu H and Tong X . 2021e . Group-free 3D object detection via Transformers // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 2929 - 2938 ［ DOI： 10.1109/ICCV48922.2021.00294 http://dx.doi.org/10.1109/ICCV48922.2021.00294 ］

Lu Z H ， He S ， Zhu X T ， Zhang L ， Song Y Z and Xiang T . 2021 . Simpler is better： few-shot semantic segmentation with classifier weight Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 8721 - 8730 ［ DOI： 10.1109/ICCV48922.2021.00862 http://dx.doi.org/10.1109/ICCV48922.2021.00862 ］

Ma T ， Mao M Y ， Zheng H H ， Gao P ， Wang X D ， Han S M ， Ding E R ， Zhang B C and Doermann D . 2021 . Oriented object detection with Transformer ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2106.03146.pdf https://arxiv.org/pdf/2106.03146.pdf

Mao W A ， Ge Y T ， Shen C H ， Tian Z ， Wang X L and Wang Z B . 2021 . TFPose： direct human pose estimation with Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2103.15320.pdf https://arxiv.org/pdf/2103.15320.pdf

Mazzia V ， Angarano S ， Salvetti F ， Angelini F and Chiaberge M . 2022 . Action Transformer： a self-attention model for short-time pose-based human action recognition . Pattern Recognition ， 124 ： # 108487 ［ DOI： 10.1016/j.patcog.2021.108487 http://dx.doi.org/10.1016/j.patcog.2021.108487 ］

Meng D P ， Chen X K ， Fan Z J ， Zeng G ， Li H Q ， Yuan Y H ， Sun L and Wang J D . 2021 . Conditional DETR for fast training convergence // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 3631 - 3640 ［ DOI： 10.1109/ICCV48922.2021.00363 http://dx.doi.org/10.1109/ICCV48922.2021.00363 ］

Meng Y ， Shi M Q and Yang W L . 2022 . Skeleton action recognition based on Transformer adaptive graph convolution . Journal of Physics： Conference Series ， 2170 ： # 012007 ［ DOI： 10.1088/1742-6596/2170/1/012007 http://dx.doi.org/10.1088/1742-6596/2170/1/012007 ］

Misra I ， Girdhar R and Joulin A . 2021 . An end-to-end Transformer model for 3D object detection // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 2886 - 2897 ［ DOI： 10.1109/ICCV48922.2021.00290 http://dx.doi.org/10.1109/ICCV48922.2021.00290 ］

Mottaghi R ， Chen X J ， Liu X B ， Cho N G ， Lee S W ， Fidler S ， Urtasun R and Yuille A . 2014 . The role of context for object detection and semantic segmentation in the wild // Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition . Columbus， USA ： IEEE： 891 - 898 ［ DOI： 10.1109/CVPR.2014.119 http://dx.doi.org/10.1109/CVPR.2014.119 ］

Munir F ， Azam S and Jeon M . 2021 . SSTN： self-supervised domain adaptation thermal object detection for autonomous driving // Proceedings of 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems （IROS） . Prague， Czech Republic ： IEEE： 206 - 213 ［ DOI： 10.1109/IROS51168.2021.9636353 http://dx.doi.org/10.1109/IROS51168.2021.9636353 ］

Neimark D ， Bar O ， Zohar M and Asselmann D . 2021 . Video Transformer network // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops . Montreal， Canada ： IEEE： 3156 - 3165 ［ DOI： 10.1109/ICCVW54120.2021.00355 http://dx.doi.org/10.1109/ICCVW54120.2021.00355 ］

Nguyen X B ， Bui D T ， Duong C N ， Bui T D and Luu K . 2021 . Clusformer： a Transformer based clustering approach to unsupervised large-scale face and visual landmark recognition // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 10842 - 10851 ［ DOI： 10.1109/CVPR46437.2021.01070 http://dx.doi.org/10.1109/CVPR46437.2021.01070 ］

Pan X R ， Xia Z F ， Song S J ， Li L E and Huang G . 2021 . 3D object detection with Pointformer // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 7459 - 7468 ［ DOI： 10.1109/CVPR46437.2021.00738 http://dx.doi.org/10.1109/CVPR46437.2021.00738 ］

Petit O ， Thome N ， Rambour C ， Themyr L ， Collins T and Soler L . 2021 . U-Net Transformer： self and cross attention for medical image segmentation // Proceedings of the 12th International Workshop on Machine Learning in Medical Imaging . Strasbourg， France ： Springer： 267 - 276 ［ DOI： 10.1007/978-3-030-87589-3_28 http://dx.doi.org/10.1007/978-3-030-87589-3_28 ］

Plizzari C ， Cannici M and Matteucci M . 2021 . Spatial temporal Transformer network for skeleton-based action recognition // Proceedings of 2021 International Conference on Pattern Recognition. ICPR International Workshops and Challenges . Switzerland ： Springer： 694 - 701 ［ DOI： 10.1007/978-3-030-68796-0_50 http://dx.doi.org/10.1007/978-3-030-68796-0_50 ］

Qiu H L ， Hou B ， Ren B and Zhang X H . 2022a . Spatio-temporal tuples Transformer for skeleton-based action recognition ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2201.02849.pdf https://arxiv.org/pdf/2201.02849.pdf

Qiu Y ， Liu Y ， Zhang L and Xu J . 2022b . Boosting salient object detection with Transformer-based asymmetric bilateral U-Net ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2108.07851.pdf https://arxiv.org/pdf/2108.07851.pdf

Radford A ， Narasimhan K ， Salimans T and Sutskever I . 2018 . Improving language understanding by generative pre-training ［EB/OL］. ［ 2022-03-26 ］. https://www.gwern.net/docs/www/s3-us-west-2.amazonaws.com/d73fdc5ffa8627bce44dcda2fc012da638ffb158.pdf https://www.gwern.net/docs/www/s3-us-west-2.amazonaws.com/d73fdc5ffa8627bce44dcda2fc012da638ffb158.pdf

Radford A ， Wu J ， Child R ， Luan D ， Amodei D and Sutskever I . 2019 . Language models are unsupervised multitask learners ［EB/OL］. ［ 2022-03-26 ］. https://www.gwern.net/docs/ai/nn/transformer/gpt/2019-radford.pdf https://www.gwern.net/docs/ai/nn/transformer/gpt/2019-radford.pdf

Sha Y Y ， Zhang Y H ， Ji X Q and Hu L . 2021 . Transformer-unet： raw image processing with unet ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2109.08417.pdf https://arxiv.org/pdf/2109.08417.pdf

Shahroudy A ， Liu J ， Ng T T and Wang G . 2016 . NTU RGB+D： a large scale dataset for 3D human activity analysis // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition . Las Vegas， USA ： IEEE： 1010 - 1019 ［ DOI： 10.1109/CVPR.2016.115 http://dx.doi.org/10.1109/CVPR.2016.115 ］

Shao Z C ， Bian H ， Chen Y ， Wang Y F ， Zhang J ， Ji X Y and Zhang Y B . 2021 . TransMIL： Transformer based correlated multiple instance learning for whole slide image classification ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2106.00908.pdf https://arxiv.org/pdf/2106.00908.pdf

Shen Z Q ， Fu R D ， Lin C N and Zheng S H . 2021 . COTR： convolution in Transformer network for end to end polyp detection // Proceedings of the 7th International Conference on Computer and Communications . Chengdu， China ： IEEE： 1757 - 1761 ［ DOI： 10.1109/ICCC54389.2021.9674267 http://dx.doi.org/10.1109/ICCC54389.2021.9674267 ］

Sheng H L ， Cai S J ， Liu Y ， Deng B ， Huang J ， Hua X S and Zhao M J . 2021 . Improving 3D object detection with channel-wise Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 2723 - 2732 ［ DOI： 10.1109/ICCV48922.2021.00274 http://dx.doi.org/10.1109/ICCV48922.2021.00274 ］

Shi F ， Lee C ， Qiu L ， Zhao Y Z ， Shen T Y ， Muralidhar S ， Han T ， Zhu S C and Narayanan V . 2021 . STAR： sparse Transformer-based action recognition ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2107.07089.pdf https://arxiv.org/pdf/2107.07089.pdf

Shuai H ， Wu L L and Liu Q S . 2022 . Adaptive multi-view and temporal fusing Transformer for 3D human pose estimation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2110.05092.pdf https://arxiv.org/pdf/2110.05092.pdf

Sirinukunwattana K ， Pluim J P W ， Chen H ， Qi X J ， Heng P A ， Guo Y B ， Wang L Y ， Matuszewski B J ， Bruni E ， Sanchez U ， Böhm A ， Ronneberger O ， Cheikh B B ， Racoceanu D ， Kainz P ， Pfeiffer M ， Urschler M ， Snead D R J and Rajpoot N M . 2017 . Gland segmentation in colon histology images： the glas challenge contest . Medical Image Analysis ， 35 ： 489 - 502 ［ DOI： 10.1016/j.media.2016.08.008 http://dx.doi.org/10.1016/j.media.2016.08.008 ］

Song J G . 2021 . UFO-ViT： high performance linear vision Transformer without softmax ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2109.14382.pdf https://arxiv.org/pdf/2109.14382.pdf

Soomro K ， Zamir A R and Shah M . 2012 . UCF101： a dataset of 101 human actions classes from videos in the wild ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/1212.0402.pdf https://arxiv.org/pdf/1212.0402.pdf

Stoffl L ， Vidal M and Mathis A . 2021 . End-to-end trainable multi-instance pose estimation with Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2103.12115.pdf https://arxiv.org/pdf/2103.12115.pdf

Strudel R ， Garcia R ， Laptev I and Schmid C . 2021 . Segmenter： Transformer for semantic segmentation // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 7242 - 7252 ［ DOI： 10.1109/ICCV48922.2021.00717 http://dx.doi.org/10.1109/ICCV48922.2021.00717 ］

Sun G ， Liu Y ， Liang J and Gool L V . 2021a . Boosting few-shot semantic segmentation with Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2108.02266.pdf https://arxiv.org/pdf/2108.02266.pdf

Sun Z Q ， Cao S C ， Yang Y M and Kitani K . 2021b . Rethinking Transformer-based set prediction for object detection // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 3591 - 3600 ［ DOI： 10.1109/ICCV48922.2021.00359 http://dx.doi.org/10.1109/ICCV48922.2021.00359 ］

Tang L and Li B . 2022 . CoSformer： detecting co-salient object with Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2104.14729.pdf https://arxiv.org/pdf/2104.14729.pdf

Touvron H ， Cord M ， Douze M ， Massa F ， Sablayrolles A and Jégou H . 2021 . Training data-efficient image Transformers and distillation through attention // Proceedings of the 38th International Conference on Machine Learning . Virtual Event ： PMLR： 10347 - 10357

Valanarasu J M J ， Oza P ， Hacihaliloglu I and Patel V M . 2021 . Medical Transformer： gated axial-attention for medical image segmentation // Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 . Strasbourg， France ： Springer： 36 - 46 ［ DOI： 10.1007/978-3-030-87193-2_4 http://dx.doi.org/10.1007/978-3-030-87193-2_4 ］

Vaswani A ， Shazeer N ， Parmar N ， Uszkoreit J ， Jones L ， Gomez A N ， Kaiser Ł and Polosukhin I . 2017 . Attention is all you need // Proceedings of the 31st International Conference on Neural Information Processing Systems . Long Beach， USA ： Curran Associates Inc.： 6000 - 6010

Wang J ， Yu X H and Gao Y S . 2022a . Feature fusion vision Transformer for fine-grained visual categorization ［EB/OL］. ［ 2022-02-28 ］. https://arxiv.org/pdf/2107.02341.pdf https://arxiv.org/pdf/2107.02341.pdf ［ DOI： 10.48550/arXiv.2107.02341 http://dx.doi.org/10.48550/arXiv.2107.02341 ］

Wang L B ， Li R ， Duan C X and Fang S H . 2022b . Transformer meets DCFAM： a novel semantic segmentation scheme for fine-resolution remote sensing images ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2104.12137v1.pdf https://arxiv.org/pdf/2104.12137v1.pdf

Wang Q T ， Peng J L ， Shi S Z ， Liu T X ， He J B and Weng R L . 2021c . IIP-Transformer： intra-inter-part Transformer for skeleton-based action recognition ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2110.13385.pdf https://arxiv.org/pdf/2110.13385.pdf

Wang T ， Yuan L ， Chen Y P ， Feng J S and Yan S C . 2021d . PnP-DETR： towards efficient visual analysis with Transformers // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 4641 - 4650 ［ DOI： 10.1109/ICCV48922.2021.00462 http://dx.doi.org/10.1109/ICCV48922.2021.00462 ］

Wang W H ， Xie E Z ， Li X ， Fan D P ， Song K T ， Liang D ， Lu T ， Luo P and Shao L . 2021a . Pyramid vision Transformer： a versatile backbone for dense prediction without convolutions // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 548 - 558 ［ DOI： 10.1109/ICCV48922.2021.00061 http://dx.doi.org/10.1109/ICCV48922.2021.00061 ］

Wang W X ， Chen C ， Ding M ， Yu H ， Zha S and Li J Y . 2021e . TransBTS： multimodal brain tumor segmentation using Transformer // Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 . Strasbourg， France ： Springer： 109 - 119 ［ DOI： 10.1007/978-3-030-87193-2_11 http://dx.doi.org/10.1007/978-3-030-87193-2_11 ］

Wang Y Q ， Xu Z L ， Wang X L ， Shen C H ， Cheng B S ， Shen H and Xia H X . 2021b . End-to-end video instance segmentation with Transformers // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 8737 - 8746 ［ DOI： 10.1109/CVPR46437.2021.00863 http://dx.doi.org/10.1109/CVPR46437.2021.00863 ］

Wu B C ， Xu C F ， Dai X L ， Wan A ， Zhang P Z ， Yan Z C ， Tomizuka M ， Gonzalez J ， Keutzer K and Vajda P . 2021a . Visual Transformers： where do Transformers really belong in vision models? // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 579 - 589 ［ DOI： 10.1109/ICCV48922.2021.00064 http://dx.doi.org/10.1109/ICCV48922.2021.00064 ］

Wu K ， Peng H W ， Chen M H ， Fu J L and Chao H Y . 2021b . Rethinking and improving relative position encoding for vision Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 10013 - 10021 ［ DOI： 10.1109/ICCV48922.2021.00988 http://dx.doi.org/10.1109/ICCV48922.2021.00988 ］

Wu S T ， Wu T Y ， Lin F J ， Tian S W and Guo G D . 2021c . Fully Transformer networks for semantic image segmentation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2106.04108.pdf https://arxiv.org/pdf/2106.04108.pdf

Wu W L ， Kan M N ， Liu X ， Yang Y ， Shan S G and Chen X L . 2017 . Recursive spatial Transformer （ReST） for alignment-free face recognition // Proceedings of 2017 IEEE International Conference on Computer Vision . Venice， Italy ： IEEE： 3792 - 3800 ［ DOI： 10.1109/ICCV.2017.407 http://dx.doi.org/10.1109/ICCV.2017.407 ］

Xia X ， Li J S ， Wu J ， Wang X ， Xiao X F ， Zheng M and Wang R . 2022 . TRT-ViT： TensorRT-oriented vision Transformer ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2205.09579.pdf https://arxiv.org/pdf/2205.09579.pdf

Xie E Z ， Wang W J ， Wang W H ， Sun P Z ， Xu H ， Liang D and Luo P . 2021a . Segmenting transparent objects in the wild with Transformer // Proceedings of the 30th International Joint Conference on Artificial Intelligence . Montreal， Canada ：［s.n.］： 1194 - 1200 ［ DOI： 10.24963/ijcai.2021/165 http://dx.doi.org/10.24963/ijcai.2021/165 ］

Xie E Z ， Wang W H ， Yu Z D ， Anandkumar A ， Álvarez J M and Luo P . 2021b . Segformer： simple and efficient design for semantic segmentation with Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2105.15203.pdf https://arxiv.org/pdf/2105.15203.pdf

Xie J T ， Zeng R R ， Wang Q L ， Zhou Z Q and Li P H . 2021c . So-ViT： mind visual tokens for vision Transformer ［EB/OL］. ［ 2022-01-21 ］. https://arxiv.org/pdf/2104.10935v1.pdf https://arxiv.org/pdf/2104.10935v1.pdf

Xu Y F ， Zhang Z J ， Zhang M D ， Sheng K K ， Li K ， Dong W M ， Zhang L Q ， Xu C S and Sun X . 2021 . Evo-ViT： slow-fast token evolution for dynamic vision Transformer ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2108.01390.pdf https://arxiv.org/pdf/2108.01390.pdf

Yang J W ， Li C Y ， Zhang P C ， Dai X Y ， Xiao B ， Yuan L and Gao J F . 2021a . Focal self-attention for local-global interactions in vision Transformers ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2107.00641.pdf https://arxiv.org/pdf/2107.00641.pdf

Yang S ， Quan Z B ， Nie M and Yang W K . 2021b . TransPose： keypoint localization via Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 11782 - 11792 ［ DOI： 10.1109/ICCV48922.2021.01159 http://dx.doi.org/10.1109/ICCV48922.2021.01159 ］

Yu X D ， Shi D H ， Wei X ， Ren Y ， Ye T Q and Tan W M . 2022 . SOIT： segmenting objects with instance-aware Transformers . Proceedings of the AAAI Conference on Artificial Intelligence ， 36 （ 3 ）： 3188 - 3196 ［ DOI： 10.1609/aaai.v36i3.20227 http://dx.doi.org/10.1609/aaai.v36i3.20227 ］

Yuan L ， Chen Y P ， Wang T ， Yu W H ， Shi Y J ， Jiang Z H ， Tay F E H ， Feng J S and Yan S C . 2021 . Tokens-to-token ViT： training vision Transformers from scratch on ImageNet // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 538 - 547 ［ DOI： 10.1109/ICCV48922.2021.00060 http://dx.doi.org/10.1109/ICCV48922.2021.00060 ］

Yue X Y ， Sun S Y ， Kuang Z H ， Wei M ， Torr P ， Zhang W and Lin D H . 2021 . Vision Transformer with progressive sampling // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 377 - 386 ［ DOI： 10.1109/ICCV48922.2021.00044 http://dx.doi.org/10.1109/ICCV48922.2021.00044 ］

Zhang B W ， Yu J H ， Fifty C ， Han W ， Dai A M ， Pang R M and Sha F . 2021a . Co-training Transformer with videos and images improves action recognition ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2112.07175.pdf https://arxiv.org/pdf/2112.07175.pdf

Zhang G J ， Luo Z P ， Cui K W and Lu S J . 2021b . Meta-DETR： few-shot object detection via unified image-level meta-learning ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2103.11731v2.pdf https://arxiv.org/pdf/2103.11731v2.pdf

Zhang H ， Hao Y B and Ngo C W . 2021c . Token shift Transformer for video classification // Proceedings of the 29th ACM International Conference on Multimedia . Virtual Event， China ： ACM： 917 - 925 ［ DOI： 10.1145/3474085.3475272 http://dx.doi.org/10.1145/3474085.3475272 ］

Zhang J M ， Yang K L ， Constantinescu A ， Peng K Y ， Müller K and Stiefelhagen R . 2021e . Trans4Trans： efficient Transformer for transparent object segmentation to help visually impaired people navigate in the real world // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops . Montreal， Canada ： IEEE： 1760 - 1770 ［ DOI： 10.1109/ICCVW54120.2021.00202 http://dx.doi.org/10.1109/ICCVW54120.2021.00202 ］

Zhang J Y ， Huang J X ， Luo Z P ， Zhang G J and Lu S J . 2023 . DA-DETR： domain adaptive detection Transformer by hybrid attention ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2103.17084v1.pdf https://arxiv.org/pdf/2103.17084v1.pdf

Zhang P C ， Dai X Y ， Yang J W ， Xiao B ， Yuan L ， Zhang L and Gao J F . 2021f . Multi-scale vision longformer： a new vision Transformer for high-resolution image encoding // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 2978 - 2988 ［ DOI： 10.1109/ICCV48922.2021.00299 http://dx.doi.org/10.1109/ICCV48922.2021.00299 ］

Zhang Q L and Yang Y B . 2021 . ResT： an efficient Transformer for visual recognition . Advances in Neural Information Processing Systems ， 34 ： 15475 - 15485

Zhang Y ， Cao J ， Zhang L ， Liu X C ， Wang Z Y ， Ling F and Chen W Q . 2022 . A free lunch from ViT： adaptive attention multi-scale fusion Transformer for fine-grained visual recognition // ICASSP 2022 - 2022 IEEE International Conference on Acoustics， Speech and Signal Processing . Singapore， Singapore ： IEEE： 3234 - 3238 ［ DOI： 10.1109/ICASSP43922.2022.9747591 http://dx.doi.org/10.1109/ICASSP43922.2022.9747591 ］

Zhang Y Y ， Li X Y ， Liu C H ， Shuai B ， Zhu Y ， Brattoli B ， Chen H ， Marsic I and Tighe J . 2021d . VidTr： video Transformer without convolutions // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 13557 - 13567 ［ DOI： 10.1109/ICCV48922.2021.01332 http://dx.doi.org/10.1109/ICCV48922.2021.01332 ］

Zhang Z Z and Zhang W X . 2022 . Pyramid medical Transformer for medical image segmentation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2104.14702.pdf https://arxiv.org/pdf/2104.14702.pdf

Zhao H ， Wang Q M ， Jia Z Z ， Chen Y M and Zhang J X . 2021a . Bayesian based facial expression recognition Transformer model in uncertainty // Proceedings of 2021 International Conference on Digital Society and Intelligent Systems . Chengdu， China ： IEEE： 157 - 161 ［ DOI： 10.1109/dsins54396.2021.9670628 http://dx.doi.org/10.1109/dsins54396.2021.9670628 ］

Zhao J J ， Li X Y ， Liu C H ， Shuai B ， Chen H ， Snoek C G M and Tighe J . 2022 . TubeR： tube-Transformer for action detection ［EB/OL］. ［ 2022-02-21 ］. https://arxiv.org/pdf/2104.00969v2.pdf https://arxiv.org/pdf/2104.00969v2.pdf

Zhao J W ， Yan K ， Zhao Y F ， Guo X W ， Huang F Y and Li J . 2021c . Transformer-based dual relation graph for multi-label image recognition // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 163 - 172 ［ DOI： 10.1109/ICCV48922.2021.00023 http://dx.doi.org/10.1109/ICCV48922.2021.00023 ］

Zhao W X ， Tian Y J ， Ye Q X ， Jiao J B and Wang W Q . 2021b . GraFormer： graph convolution Transformer for 3D pose estimation ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2109.08364.pdf https://arxiv.org/pdf/2109.08364.pdf

Zheng C ， Zhu S J ， Mendieta M ， Yang T J N ， Chen C and Ding Z M . 2021a . 3D human pose estimation with spatial and temporal Transformers // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 11636 - 11645 ［ DOI： 10.1109/ICCV48922.2021.01145 http://dx.doi.org/10.1109/ICCV48922.2021.01145 ］

Zheng M H ， Gao P ， Zhang R R ， Li K C ， Wang X G ， Li H S and Dong H . 2021b . End-to-end object detection with adaptive clustering Transformer ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2011.09315.pdf https://arxiv.org/pdf/2011.09315.pdf

Zheng S X ， Lu J C ， Zhao H S ， Zhu X T ， Luo Z K ， Wang Y B ， Fu Y W ， Feng J F ， Xiang T ， Torr P H S and Zhang L . 2021c . Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 6877 - 6886 ［ DOI： 10.1109/CVPR46437.2021.00681 http://dx.doi.org/10.1109/CVPR46437.2021.00681 ］

Zhong Y Y and Deng W H . 2021 . Face Transformer for recognition ［EB/OL］. ［ 2022-02-15 ］. https://arxiv.org/pdf/2103.14803.pdf https://arxiv.org/pdf/2103.14803.pdf

Zhou B L ， Zhao H ， Puig X ， Fidler S ， Barriuso A and Torralba A . 2017 . Scene parsing through ADE20K dataset // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition . Honolulu， USA ： IEEE： 5122 - 5130 ［ DOI： 10.1109/cvpr.2017.544 http://dx.doi.org/10.1109/cvpr.2017.544 ］

Zhu X Z ， Su W J ， Lu L W ， Li B ， Wang X G and Dai J F . 2021 . Deformable DETR： deformable Transformers for end-to-end object detection ［EB/OL］. ［ 2022-03-26 ］. https://arxiv.org/pdf/2010.04159.pdf https://arxiv.org/pdf/2010.04159.pdf
