Vision Transformer-based recognition tasks: a critical review
2023, Vol. 28, No. 10, Pages: 2969-3003
Print publication date: 2023-10-16
DOI: 10.11834/jig.220895
Zhou Lijuan, Mao Jianing. 2023. Vision Transformer-based recognition tasks: a critical review. Journal of Image and Graphics, 28(10):2969-3003
The Transformer model has achieved excellent results in natural language processing and, because it connects vision and language particularly well, has also aroused great interest in the computer vision community. This paper reviews more than 100 representative vision Transformer methods across a variety of recognition tasks, compares and analyzes the performance of the models within each task, and on this basis summarizes the strengths, weaknesses, and open challenges of the models in each task category. According to recognition granularity, we examine global recognition-based methods such as image classification and video classification, as well as local recognition-based methods such as object detection and visual segmentation. Considering how widely existing methods are used in three specific recognition tasks, we also summarize the methods for face recognition, action recognition, and pose estimation, and we review the state of general-purpose methods that serve multiple vision tasks or are domain independent. Transformer-based models have enabled many end-to-end approaches and continually pursue a balance between accuracy and computational cost. For global recognition tasks, Transformer models explore patch-sequence splitting and token-based feature representation; for local recognition tasks, they perform well because they capture global information more effectively. In face recognition and action recognition, the attention mechanism reduces errors in feature representation and can handle rich and diverse features. The Transformer can resolve the feature misalignment problem in pose estimation, which benefits regression-based methods, and it also reduces the ambiguity introduced by depth mapping in 3D estimation. Extensive exploration demonstrates the effectiveness of the vision Transformer in recognition tasks, and improvements in aspects such as feature representation and network structure help boost its performance.
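To make the patch-sequence splitting and token-based feature representation just mentioned concrete, the following minimal PyTorch sketch shows how a ViT-style model cuts an image into patch tokens, prepends a class token, and adds positional embeddings. The sizes (224 × 224 input, 16 × 16 patches, 768-dimensional tokens) follow the common ViT-Base configuration; the module and its names are illustrative assumptions, not the implementation of any specific model in this survey.

    # Minimal sketch of ViT-style patch tokenization (after Dosovitskiy et al., 2021).
    # Shapes and hyperparameters are illustrative.
    import torch
    import torch.nn as nn

    class PatchTokenizer(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # A strided convolution splits the image into non-overlapping
            # patches and linearly projects each patch to a token.
            self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
            # Learnable class token and positional embeddings, as in ViT.
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

        def forward(self, x):                                  # x: (B, 3, 224, 224)
            tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, 768)
            cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, 768)
            return torch.cat([cls, tokens], dim=1) + self.pos_embed

    tokens = PatchTokenizer()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 197, 768])

The class token output by the final encoder layer is what classification heads typically read, which is why the abstract highlights visual and class tokens as the core representation in image classification.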
Owing to its ability to model long-distance dependencies, its adaptively computed self-attention, its scalability to large models and big data, and the bridge it builds between vision and language, the Transformer model has clearly benefited both natural language processing and computer vision. To bring the Transformer into vision tasks, vision Transformer methods have been developed intensively. The current literature can be summarized and analyzed in terms of application-specific methods; however, these applications are heterogeneous across methods, comparative analyses usually contrast the Transformer with traditional convolutional neural networks (CNNs), and comparisons that link multiple Transformer models to one another are rare. We summarize and compare more than 100 popular vision Transformer methods for various recognition tasks. Global recognition-based methods are reviewed for the classification of images and videos, and local recognition-based methods for object detection and visual segmentation. On top of these, we summarize the methods for three widely studied specific recognition tasks: face recognition, action recognition, and pose estimation. Furthermore, single-task and domain-independent general-purpose methods, which can serve image classification, object detection, and other related vision tasks, are summarized, and the performance of these Transformer-based models is compared and analyzed on public datasets. In image classification, features are mostly represented by visual (patch) tokens together with a class token; models in the line of the vision Transformer (ViT) and the data-efficient image Transformer (DeiT) have shown their potential on the ImageNet datasets. Object detection requires locating the target objects in the input visual data and predicting the coordinates and labels of a series of bounding boxes. It is exemplified by the detection Transformer (DETR), which removes the indirection of earlier pipelines that classify and regress through proposals, anchors, or windows. Subsequent work, such as conditional DETR, deformable DETR, and unsupervised pre-training DETR (UP-DETR), improves the feature maps, computational complexity, and convergence speed of DETR to a certain extent. Transformer-based models have also been applied to salient object detection, point cloud 3D detection, and few-shot object detection. Semantic segmentation assigns a class label to every pixel in the image; whereas object detection predicts and refines bounding boxes, semantic segmentation determines only pixel classes, so distinguishing multiple instances among pixels of similar classes remains challenging. The Transformer has also attracted attention as a way to improve U-Net for medical image segmentation. It can be linked with a pyramid network, or different decoder structures can be designed for pixel-by-pixel segmentation, such as the segmentation Transformer with progressive upsampling (SETR-PUP) and with multi-level feature aggregation (SETR-MLA). Mask classification methods, which are common in instance segmentation, can also be used for semantic segmentation with a Transformer structure such as Segmenter. Instance segmentation resembles a combination of object detection and semantic segmentation.
Compared with the bounding box output by object detection, the output of instance segmentation is a mask, which delineates object edges and distinguishes different instances of similar objects, thereby extending what semantic segmentation can do. The Transformer brings more end-to-end methods to instance segmentation, and the quality of the masks can be exploited and improved during segmentation. For face recognition, the Transformer provides alignment-free methods and can handle noise related to facial expressions as well as racial bias. Action recognition classifies human actions in input videos; it is similar to image classification, but additional processing of the temporal dimension is unavoidable. Going beyond two-stream networks and three-dimensional convolutions, the Transformer is used to model long-term temporal and spatial dependencies for action recognition. Pose estimation is usually formulated as locating human body keypoints and identifying the spatial relationships between body parts. It comprises 2D pose estimation, which determines the two-dimensional coordinates of body parts, and 3D pose estimation, which adds depth information to those coordinates. The Transformer is used to refine keypoint features for pose estimation, and the modeling of intra-frame joint relationships and inter-frame temporal relationships is improved as well. Research on Transformer-based multi-task models focuses on integrating image classification, object detection, and semantic segmentation, and other popular models that span the vision and language domains have also been proposed. Extensive research has shown the effectiveness of the vision Transformer in recognition tasks, and optimizing the feature representation or the network structure benefits its performance. Future research directions point to effective and efficient methods that preserve accuracy in the context of positional encoding, self-supervised learning, multimodal integration, and computational cost reduction.
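As an illustration of the query-based pose estimation idea summarized above, the sketch below lets a small set of learnable keypoint queries cross-attend to backbone feature tokens and regresses each query directly to 2D coordinates, in the spirit of regression-based methods such as TFPose. The module, its names, and all sizes are illustrative assumptions under this reading, not the implementation of any cited paper.

    # Hedged sketch of direct keypoint regression with Transformer decoder queries.
    import torch
    import torch.nn as nn

    class KeypointRegressor(nn.Module):
        def __init__(self, num_keypoints=17, dim=256, num_layers=3):
            super().__init__()
            # One learnable query per body keypoint (17 follows the COCO convention).
            self.queries = nn.Parameter(torch.randn(num_keypoints, dim))
            layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
            self.head = nn.Linear(dim, 2)  # (x, y), normalized to [0, 1]

        def forward(self, memory):                 # memory: (B, HW, dim) backbone tokens
            q = self.queries.unsqueeze(0).expand(memory.shape[0], -1, -1)
            refined = self.decoder(q, memory)      # queries cross-attend to image features
            return self.head(refined).sigmoid()    # (B, 17, 2) keypoint coordinates

    coords = KeypointRegressor()(torch.randn(2, 49 * 49, 256))
    print(coords.shape)  # torch.Size([2, 17, 2])

Because every keypoint query attends to all image tokens and to the other queries, this style of decoder models the intra-frame joint relationships mentioned above; stacking a second, temporal Transformer over per-frame outputs is one common way the surveyed methods extend it to inter-frame relationships.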
vision Transformer (ViT); self-attention; vision recognition; deep learning; image processing; video understanding
Arnab A, Dehghani M, Heigold G, Sun C, Lučić M and Schmid C. 2021. ViViT: a video vision Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 6816-6826 [DOI: 10.1109/ICCV48922.2021.00676]
Atito S, Awais M and Kittler J. 2021. SiT: self-supervised vision Transformer [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2104.03602.pdf
Bai R W, Li M, Meng B, Li F F, Jiang M, Ren J X and Sun D G. 2022. Hierarchical graph convolutional skeleton Transformer for action recognition//Proceedings of 2022 IEEE International Conference on Multimedia and Expo. Taipei, China: IEEE: 01-06 [DOI: 10.1109/ICME52920.2022.9859781]
Bar A, Wang X, Kantorov V, Reed C J, Herzig R, Chechik G, Rohrbach A, Darrell T and Globerson A. 2021. DETReg: unsupervised pretraining with region priors for object detection [EB/OL]. [2022-01-24]. https://arxiv.org/pdf/2106.04550.pdf
Barsoum E, Zhang C, Ferrer C C and Zhang Z Y. 2016. Training deep networks for facial expression recognition with crowd-sourced label distribution//Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo, Japan: ACM: 279-283 [DOI: 10.1145/2993148.2993165]
Bertasius G, Wang H and Torresani L. 2021. Is space-time attention all you need for video understanding?//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR: 813-824
Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I and Amodei D. 2020. Language models are few-shot learners//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 1877-1901
Caesar H, Uijlings J and Ferrari V. 2018. COCO-stuff: thing and stuff classes in context//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1209-1218 [DOI: 10.1109/CVPR.2018.00132]
Cai Y M, Cai G Y and Cai J. 2021. Action-Transformer for action recognition in short videos//Proceedings of the 11th International Conference on Intelligent Control and Information Processing. Dali, China: IEEE: 278-283 [DOI: 10.1109/ICICIP53388.2021.9642184]
Cao H, Wang Y Y, Chen J, Jiang D S, Zhang X P, Tian Q and Wang M N. 2021. Swin-unet: Unet-like pure Transformer for medical image segmentation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2105.05537.pdf
Cao J L, Li Y L, Sun H Q, Xie J, Huang K Q and Pang Y W. 2022. A survey on deep learning based visual object detection. Journal of Image and Graphics, 27(6): 1697-1722
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A and Zagoruyko S. 2020. End-to-end object detection with Transformers//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 213-229 [DOI: 10.1007/978-3-030-58452-8_13]
Chang Y, Hu M H, Zhai G T and Zhang X P. 2021. Transclaw U-Net: claw U-Net with Transformers for medical image segmentation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2107.05188.pdf
Chen B Y, Li P X, Li B P, Li C M, Bai L, Lin C, Sun M, Yan J J and Ouyang W L. 2021a. PSViT: better vision Transformer via token pooling and attention sharing [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2108.03428.pdf
Chen B Z, Liu Y S, Zhang Z, Lu G and Zhang D. 2022a. TransAttUnet: multi-level attention-guided U-Net with Transformer for medical image segmentation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2107.05274.pdf
Chen C F R, Fan Q F and Panda R. 2021c. CrossViT: cross-attention multi-scale vision Transformer for image classification//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 347-356 [DOI: 10.1109/ICCV48922.2021.00041]
Chen G, Zhang S Q and Zhao X M. 2022. Video sequence-based human facial expression recognition using Transformer networks. Journal of Image and Graphics, 27(10): 3022-3030 [DOI: 10.11834/jig.210248]
Chen H Y, Li C, Li X Y, Wang G, Hu W M, Li Y X, Liu W L, Sun C H, Yao Y D, Teng Y Y and Grzegorzek M. 2022b. GasHis-Transformer: a multi-scale visual Transformer approach for gastric histopathology image classification [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2104.14528v5.pdf
Chen J N, Lu Y Y, Yu Q H, Luo X D, Adeli E, Wang Y, Lu L, Yuille A L and Zhou Y Y. 2021e. TransUNet: Transformers make strong encoders for medical image segmentation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2102.04306.pdf
Chen J W and Ho C M. 2022. MM-ViT: multi-modal video Transformer for compressed video action recognition//Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 786-797 [DOI: 10.1109/WACV51458.2022.00086]
Chen Z S, Xie L X, Niu J W, Liu X F, Wei L H and Tian Q. 2021b. Visformer: the vision-friendly Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 569-578 [DOI: 10.1109/ICCV48922.2021.00063]
Chen Z Y, Zhu Y S, Zhao C Y, Hu G S, Zeng W, Wang J Q and Tang M. 2021d. DPT: deformable patch-based Transformer for visual recognition//Proceedings of the 29th ACM International Conference on Multimedia. Virtual Event, China: ACM: 2899-2907 [DOI: 10.1145/3474085.3475467]
Cheng B W, Schwing A G and Kirillov A. 2021. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34: 17864-17875
Codella N, Rotemberg V, Tschandl P, Celebi M E, Dusza S, Gutman D, Helba B, Kalloo A, Liopyris K, Marchetti M, Kittler H and Halpern A. 2019. Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (ISIC) [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/1902.03368.pdf
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S and Schiele B. 2016. The cityscapes dataset for semantic urban scene understanding//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 3213-3223 [DOI: 10.1109/CVPR.2016.350]
Dai Y, Gao Y F and Liu F Y. 2021a. TransMed: Transformers advance multi-modal medical image classification. Diagnostics, 11(8): #1384 [DOI: 10.3390/diagnostics11081384]
Dai Z G, Cai B L, Lin Y G and Chen J Y. 2021b. UP-DETR: unsupervised pre-training for object detection with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 1601-1610 [DOI: 10.1109/CVPR46437.2021.00165]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [DOI: 10.1109/CVPR.2009.5206848]
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional Transformers for language understanding//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, USA: Association for Computational Linguistics: 4171-4186 [DOI: 10.18653/v1/n19-1423]
Dong B, Wang W H, Fan D P, Li J P, Fu H Z and Shao L. 2023. Polyp-PVT: polyp segmentation with pyramid vision Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2108.06932.pdf
Dong B, Zeng F, Wang T C, Zhang X Y and Wei Y C. 2021. SOLQ: segmenting objects by learning queries [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2106.02351.pdf
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2010.11929.pdf
Frank S, Bugliarello E and Elliott D. 2021. Vision-and-language or vision-for-language? On cross-modal influence in multimodal Transformers//Proceedings of 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic: Association for Computational Linguistics: 9847-9857 [DOI: 10.18653/v1/2021.emnlp-main.775]
Gao P, Zheng M H, Wang X G, Dai J F and Li H S. 2021a. Fast convergence of DETR with spatially modulated co-attention//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 3601-3610 [DOI: 10.1109/ICCV48922.2021.00360]
Gao Y H, Zhou M and Metaxas D N. 2021b. UTNet: a hybrid Transformer architecture for medical image segmentation//Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Strasbourg, France: Springer: 61-71 [DOI: 10.1007/978-3-030-87199-4_6]
Girdhar R, Carreira J J, Doersch C and Zisserman A. 2019. Video action Transformer network//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 244-253 [DOI: 10.1109/CVPR.2019.00033]
Graham B, El-Nouby A, Touvron H, Stock P, Joulin A, Jégou H and Douze M. 2021. LeViT: a vision Transformer in ConvNet's clothing for faster inference//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 12239-12249 [DOI: 10.1109/ICCV48922.2021.01204]
Guan T R, Wang J, Lan S Y, Chandra R, Wu Z X, Davis L and Manocha D. 2022. M3DETR: multi-representation, multi-scale, mutual-relation 3D object detection with Transformers//Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 2293-2303 [DOI: 10.1109/WACV51458.2022.00235]
Guo J Y, Han K, Wu H, Tang Y H, Chen X H, Wang Y H and Xu C. 2022. CMT: convolutional neural networks meet vision Transformers [EB/OL]. [2022-01-21]. https://arxiv.org/pdf/2107.06263.pdf
Guo R H, Niu D T, Qu L and Li Z B. 2021. SOTR: segmenting objects with Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 7137-7146 [DOI: 10.1109/ICCV48922.2021.00707]
Hampali S, Sarkar S D, Rad M and Lepetit V. 2021. HandsFormer: keypoint Transformer for monocular 3D pose estimation of hands and object in interaction [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2104.14639v1.pdf
Han K, Wang Y H, Chen H T, Chen X H, Guo J Y, Liu Z H, Tang Y H, Xiao A, Xu C J, Xu Y X, Yang Z H, Zhang Y M and Tao D C. 2022. A survey on vision Transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 87-110 [DOI: 10.1109/TPAMI.2022.3152247]
Hatamizadeh A, Tang Y C, Nath V, Yang D, Myronenko A, Landman B, Roth H R and Xu D G. 2022. UNETR: Transformers for 3D medical image segmentation//Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 1748-1758 [DOI: 10.1109/WACV51458.2022.00181]
He J, Chen J N, Liu S, Kortylewski A, Yang C, Bai Y T and Wang C H. 2022. TransFG: a Transformer architecture for fine-grained recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36(1): 852-860 [DOI: 10.1609/aaai.v36i1.19967]
Heo B, Yun S, Han D, Chun S, Choe J and Oh S J. 2021. Rethinking spatial dimensions of vision Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 11916-11925 [DOI: 10.1109/ICCV48922.2021.01172]
Hu H Z, Zhao W C, Zhou W G, Wang Y C and Li H Q. 2021a. SignBERT: pre-training of hand-model-aware representation for sign language recognition//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 11067-11076 [DOI: 10.1109/ICCV48922.2021.01090]
Hu J, Cao L J, Lu Y, Zhang S C, Wang Y, Li K, Huang F Y, Shao L and Ji R R. 2021b. ISTR: end-to-end instance segmentation with Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2105.00637.pdf
Hu R H and Singh A. 2021. UniT: multimodal multitask learning with a unified Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 1419-1429 [DOI: 10.1109/ICCV48922.2021.00147]
Huang G B, Ramesh M, Berg T and Learned-Miller E. 2008. Labeled faces in the wild: a database for studying face recognition in unconstrained environments [EB/OL]. [2022-03-26]. http://tamaraberg.com/papers/lfw.pdf
Hwang S, Heo M, Oh S W and Kim S J. 2021. Video instance segmentation using inter-frame communication Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2106.03299.pdf
Ionescu C, Papava D, Olaru V and Sminchisescu C. 2014. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7): 1325-1339 [DOI: 10.1109/TPAMI.2013.248]
Islam M A, Jia S and Bruce N D B. 2020. How much position information do convolutional neural networks encode? [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2001.08248.pdf
Ji G P, Chou Y C, Fan D P, Chen G, Fu H Z, Jha D and Shao L. 2021a. Progressively normalized self-attention network for video polyp segmentation//Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Strasbourg, France: Springer: 142-152 [DOI: 10.1007/978-3-030-87193-2_14]
Ji Y F, Zhang R M, Wang H J, Li Z, Wu L Y, Zhang S T and Luo P. 2021b. Multi-compound Transformer for accurate biomedical image segmentation//Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Strasbourg, France: Springer: 326-336 [DOI: 10.1007/978-3-030-87193-2_31]
Jiang B, Yu J H, Zhou L, Wu K L and Yang Y. 2021. Two-pathway Transformer network for video action recognition//Proceedings of 2021 IEEE International Conference on Image Processing. Anchorage, USA: IEEE: 1089-1093 [DOI: 10.1109/ICIP42928.2021.9506453]
Jin H, Yang J M and Zhang S. 2021. Efficient action recognition with introducing R(2+1)D convolution to improved Transformer//Proceedings of the 4th International Conference on Information Communication and Signal Processing. Shanghai, China: IEEE: 379-383 [DOI: 10.1109/icicsp54369.2021.9611970]
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M and Zisserman A. 2017. The kinetics human action video dataset [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/1705.06950.pdf
Ke L, Danelljan M, Li X, Tai Y W, Tang C K and Yu F. 2021. Mask Transfiner for high-quality instance segmentation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2111.13673.pdf
Khan S, Naseer M, Hayat M, Zamir S W, Khan F S and Shah M. 2022. Transformers in vision: a survey. ACM Computing Surveys, 54(10s): #200 [DOI: 10.1145/3505244]
Kong J, Bian Y H and Jiang M. 2022. MTT: multi-scale temporal Transformer for skeleton-based action recognition. IEEE Signal Processing Letters, 29: 528-532 [DOI: 10.1109/LSP.2022.3142675]
Kumar N, Verma R, Sharma S, Bhargava S, Vahadane A and Sethi A. 2017. A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Transactions on Medical Imaging, 36(7): 1550-1560 [DOI: 10.1109/TMI.2017.2677499]
Lanchantin J, Wang T L, Ordonez V and Qi Y J. 2021. General multi-label image classification with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 16473-16483 [DOI: 10.1109/CVPR46437.2021.01621]
Li H T, Sui M Z, Zhao F, Zha Z J and Wu F. 2021a. MVT: mask vision Transformer for facial expression recognition in the wild [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2106.04520.pdf
Li K, Wang S J, Zhang X, Xu Y F, Xu W J and Tu Z W. 2021b. Pose recognition with cascade Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 1944-1953 [DOI: 10.1109/CVPR46437.2021.00198]
Li S C, Cao Q G, Liu L B, Yang K L, Liu S N, Hou J and Yi S. 2021c. GroupFormer: group activity recognition with clustered spatial-temporal Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 13648-13657 [DOI: 10.1109/ICCV48922.2021.01341]
Li S H, Sui X, Luo X D, Xu X X, Liu Y and Goh R. 2021d. Medical image segmentation using squeeze-and-expansion Transformers//Proceedings of the 30th International Joint Conference on Artificial Intelligence. Montreal, Canada: [s.n.]: 807-815 [DOI: 10.24963/ijcai.2021/112]
Li W H, Liu H, Ding R W, Liu M Y and Wang P C. 2022c. Lifting Transformer for 3D human pose estimation in video [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2103.14304v2.pdf
Li X Y, Hou Y H, Wang P C, Gao Z M, Xu M L and Li W Q. 2022a. Trear: Transformer-based RGB-D egocentric action recognition. IEEE Transactions on Cognitive and Developmental Systems, 14(1): 246-252 [DOI: 10.1109/TCDS.2020.3048883]
Li Y, Sun Y F, Cui Z, Shan S G and Yang J. 2021e. Learning fair face representation with progressive cross Transformer [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2108.04983.pdf
Li Y H, Mao H Z, Girshick R and He K M. 2022b. Exploring plain vision Transformer backbones for object detection [EB/OL]. [2022-10-08]. https://arxiv.org/pdf/2203.16527.pdf
Lin M, Li C M, Bu X Y, Sun M, Lin C, Yan J J, Ouyang W L and Deng Z D. 2021a. DETR for crowd pedestrian detection [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2012.06785.pdf
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Lin W D, Deng Y Y, Gao Y, Wang N, Zhou J H, Liu L Q, Zhang L and Wang P. 2021b. CAT: cross-attention Transformer for one-shot object detection [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2104.14984.pdf
Ling X F, Liang J X, Wang D and Yang J. 2021. A facial expression recognition system for smart learning based on YOLO and vision Transformer//Proceedings of the 7th International Conference on Computing and Artificial Intelligence. Tianjin, China: ACM: 178-182 [DOI: 10.1145/3467707.3467733]
Liu F F, Wei H R, Zhao W Z, Li G Z, Peng J Q and Li Z H. 2021a. WB-DETR: Transformer-based detector without backbone//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 2959-2967 [DOI: 10.1109/ICCV48922.2021.00297]
Liu J, Shahroudy A, Perez M, Wang G, Duan L Y and Kot A C. 2020. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2684-2701 [DOI: 10.1109/TPAMI.2019.2916873]
Liu N, Zhang N, Wan K Y, Shao L and Han J W. 2021b. Visual saliency Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 4702-4712 [DOI: 10.1109/ICCV48922.2021.00468]
Liu S L, Zhang L, Yang X, Su H and Zhu J. 2021c. Query2label: a simple Transformer way to multi-label classification [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2107.10834.pdf
Liu W T and Lu X M. 2022. Research progress of Transformer based on computer vision. Computer Engineering and Applications, 58(6): 1-16
Liu X L, Wang Q M, Hu Y, Tang X, Zhang S W, Bai S and Bai X. 2022a. End-to-end temporal action detection with Transformer [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2106.10271.pdf
Liu Y, Zhang Y, Wang Y X, Hou F, Yuan J, Tian J, Zhang Y, Shi Z C, Fan J P and He Z Q. 2022b. A survey of visual Transformers. IEEE Transactions on Neural Networks and Learning Systems [DOI: 10.1109/TNNLS.2022.3227717]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021d. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Liu Z, Zhang Z, Cao Y, Hu H and Tong X. 2021e. Group-free 3D object detection via Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 2929-2938 [DOI: 10.1109/ICCV48922.2021.00294]
Lu Z H, He S, Zhu X T, Zhang L, Song Y Z and Xiang T. 2021. Simpler is better: few-shot semantic segmentation with classifier weight Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 8721-8730 [DOI: 10.1109/ICCV48922.2021.00862]
Ma T, Mao M Y, Zheng H H, Gao P, Wang X D, Han S M, Ding E R, Zhang B C and Doermann D. 2021. Oriented object detection with Transformer [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2106.03146.pdf
Mao W A, Ge Y T, Shen C H, Tian Z, Wang X L and Wang Z B. 2021. TFPose: direct human pose estimation with Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2103.15320.pdf
Mazzia V, Angarano S, Salvetti F, Angelini F and Chiaberge M. 2022. Action Transformer: a self-attention model for short-time pose-based human action recognition. Pattern Recognition, 124: #108487 [DOI: 10.1016/j.patcog.2021.108487]
Meng D P, Chen X K, Fan Z J, Zeng G, Li H Q, Yuan Y H, Sun L and Wang J D. 2021. Conditional DETR for fast training convergence//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 3631-3640 [DOI: 10.1109/ICCV48922.2021.00363]
Meng Y, Shi M Q and Yang W L. 2022. Skeleton action recognition based on Transformer adaptive graph convolution. Journal of Physics: Conference Series, 2170: #012007 [DOI: 10.1088/1742-6596/2170/1/012007]
Misra I, Girdhar R and Joulin A. 2021. An end-to-end Transformer model for 3D object detection//Proceedings of 2021 International Conference on Computer Vision. Montreal, Canada: IEEE: 2886-2897 [DOI: 10.1109/ICCV48922.2021.00290]
Mottaghi R, Chen X J, Liu X B, Cho N G, Lee S W, Fidler S, Urtasun R and Yuille A. 2014. The role of context for object detection and semantic segmentation in the wild//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 891-898 [DOI: 10.1109/CVPR.2014.119]
Munir F, Azam S and Jeon M. 2021. SSTN: self-supervised domain adaptation thermal object detection for autonomous driving//Proceedings of 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Prague, Czech Republic: IEEE: 206-213 [DOI: 10.1109/IROS51168.2021.9636353]
Neimark D, Bar O, Zohar M and Asselmann D. 2021. Video Transformer network//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 3156-3165 [DOI: 10.1109/ICCVW54120.2021.00355]
Nguyen X B, Bui D T, Duong C N, Bui T D and Luu K. 2021. Clusformer: a Transformer based clustering approach to unsupervised large-scale face and visual landmark recognition//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 10842-10851 [DOI: 10.1109/CVPR46437.2021.01070]
Pan X R, Xia Z F, Song S J, Li L E and Huang G. 2021. 3D object detection with pointformer//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 7459-7468 [DOI: 10.1109/CVPR46437.2021.00738]
Petit O, Thome N, Rambour C, Themyr L, Collins T and Soler L. 2021. U-Net Transformer: self and cross attention for medical image segmentation//Proceedings of the 12th International Workshop on Machine Learning in Medical Imaging. Strasbourg, France: Springer: 267-276 [DOI: 10.1007/978-3-030-87589-3_28]
Plizzari C, Cannici M and Matteucci M. 2021. Spatial temporal Transformer network for skeleton-based action recognition//Proceedings of 2021 International Conference on Pattern Recognition. ICPR International Workshops and Challenges. Switzerland: Springer: 694-701 [DOI: 10.1007/978-3-030-68796-0_50]
Qiu H L, Hou B, Ren B and Zhang X H. 2022a. Spatio-temporal tuples Transformer for skeleton-based action recognition [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2201.02849.pdf
Qiu Y, Liu Y, Zhang L and Xu J. 2022b. Boosting salient object detection with Transformer-based asymmetric bilateral U-Net [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2108.07851.pdf
Radford A, Narasimhan K, Salimans T and Sutskever I. 2018. Improving language understanding by generative pre-training [EB/OL]. [2022-03-26]. https://www.gwern.net/docs/www/s3-us-west-2.amazonaws.com/d73fdc5ffa8627bce44dcda2fc012da638ffb158.pdf
Radford A, Wu J, Child R, Luan D, Amodei D and Sutskever I. 2019. Language models are unsupervised multitask learners [EB/OL]. [2022-03-26]. https://www.gwern.net/docs/ai/nn/transformer/gpt/2019-radford.pdf
Sha Y Y, Zhang Y H, Ji X Q and Hu L. 2021. Transformer-unet: raw image processing with unet [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2109.08417.pdf
Shahroudy A, Liu J, Ng T T and Wang G. 2016. NTU RGB+D: a large scale dataset for 3D human activity analysis//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1010-1019 [DOI: 10.1109/CVPR.2016.115]
Shao Z C, Bian H, Chen Y, Wang Y F, Zhang J, Ji X Y and Zhang Y B. 2021. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2106.00908.pdf
Shen Z Q, Fu R D, Lin C N and Zheng S H. 2021. COTR: convolution in Transformer network for end to end polyp detection//Proceedings of the 7th International Conference on Computer and Communications. Chengdu, China: IEEE: 1757-1761 [DOI: 10.1109/ICCC54389.2021.9674267]
Sheng H L, Cai S J, Liu Y, Deng B, Huang J, Hua X S and Zhao M J. 2021. Improving 3D object detection with channel-wise Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 2723-2732 [DOI: 10.1109/ICCV48922.2021.00274]
Shi F, Lee C, Qiu L, Zhao Y Z, Shen T Y, Muralidhar S, Han T, Zhu S C and Narayanan V. 2021. STAR: sparse Transformer-based action recognition [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2107.07089.pdf
Shuai H, Wu L L and Liu Q S. 2022. Adaptive multi-view and temporal fusing Transformer for 3D human pose estimation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2110.05092.pdf
Sirinukunwattana K, Pluim J P W, Chen H, Qi X J, Heng P A, Guo Y B, Wang L Y, Matuszewski B J, Bruni E, Sanchez U, Böhm A, Ronneberger O, Cheikh B B, Racoceanu D, Kainz P, Pfeiffer M, Urschler M, Snead D R J and Rajpoot N M. 2017. Gland segmentation in colon histology images: the glas challenge contest. Medical Image Analysis, 35: 489-502 [DOI: 10.1016/j.media.2016.08.008]
Song J G. 2021. UFO-ViT: high performance linear vision Transformer without softmax [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2109.14382.pdf
Soomro K, Zamir A R and Shah M. 2012. UCF101: a dataset of 101 human actions classes from videos in the wild [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/1212.0402.pdf
Stoffl L, Vidal M and Mathis A. 2021. End-to-end trainable multi-instance pose estimation with Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2103.12115.pdf
Strudel R, Garcia R, Laptev I and Schmid C. 2021. Segmenter: Transformer for semantic segmentation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 7242-7252 [DOI: 10.1109/ICCV48922.2021.00717]
Sun G, Liu Y, Liang J and Gool L V. 2021a. Boosting few-shot semantic segmentation with Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2108.02266.pdf
Sun Z Q, Cao S C, Yang Y M and Kitani K. 2021b. Rethinking Transformer-based set prediction for object detection//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 3591-3600 [DOI: 10.1109/ICCV48922.2021.00359]
Tang L and Li B. 2022. CoSformer: detecting co-salient object with Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2104.14729.pdf
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A and Jégou H. 2021. Training data-efficient image Transformers and distillation through attention//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR: 10347-10357
Valanarasu J M J, Oza P, Hacihaliloglu I and Patel V M. 2021. Medical Transformer: gated axial-attention for medical image segmentation//Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Strasbourg, France: Springer: 36-46 [DOI: 10.1007/978-3-030-87193-2_4]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang J, Yu X H and Gao Y S. 2022a. Feature fusion vision Transformer for fine-grained visual categorization [EB/OL]. [2022-02-28]. https://arxiv.org/pdf/2107.02341.pdf [DOI: 10.48550/arXiv.2107.02341]
Wang L B, Li R, Duan C X and Fang S H. 2022b. Transformer meets DCFAM: a novel semantic segmentation scheme for fine-resolution remote sensing images [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2104.12137v1.pdf
Wang Q T, Peng J L, Shi S Z, Liu T X, He J B and Weng R L. 2021c. IIP-Transformer: intra-inter-part Transformer for skeleton-based action recognition [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2110.13385.pdf
Wang T, Yuan L, Chen Y P, Feng J S and Yan S C. 2021d. PnP-DETR: towards efficient visual analysis with Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 4641-4650 [DOI: 10.1109/ICCV48922.2021.00462]
Wang W H, Xie E Z, Li X, Fan D P, Song K T, Liang D, Lu T, Luo P and Shao L. 2021a. Pyramid vision Transformer: a versatile backbone for dense prediction without convolutions//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 548-558 [DOI: 10.1109/ICCV48922.2021.00061]
Wang W X, Chen C, Ding M, Yu H, Zha S and Li J Y. 2021e. TransBTS: multimodal brain tumor segmentation using Transformer//Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Strasbourg, France: Springer: 109-119 [DOI: 10.1007/978-3-030-87193-2_11]
Wang Y Q, Xu Z L, Wang X L, Shen C H, Cheng B S, Shen H and Xia H X. 2021b. End-to-end video instance segmentation with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 8737-8746 [DOI: 10.1109/CVPR46437.2021.00863]
Wu B C, Xu C F, Dai X L, Wan A, Zhang P Z, Yan Z C, Tomizuka M, Gonzalez J, Keutzer K and Vajda P. 2021a. Visual Transformers: where do Transformers really belong in vision models?//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 579-589 [DOI: 10.1109/ICCV48922.2021.00064]
Wu K, Peng H W, Chen M H, Fu J L and Chao H Y. 2021b. Rethinking and improving relative position encoding for vision Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10013-10021 [DOI: 10.1109/ICCV48922.2021.00988]
Wu S T, Wu T Y, Lin F J, Tian S W and Guo G D. 2021c. Fully Transformer networks for semantic image segmentation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2106.04108.pdf
Wu W L, Kan M N, Liu X, Yang Y, Shan S G and Chen X L. 2017. Recursive spatial Transformer (ReST) for alignment-free face recognition//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3792-3800 [DOI: 10.1109/ICCV.2017.407]
Xia X, Li J S, Wu J, Wang X, Xiao X F, Zheng M and Wang R. 2022. TRT-ViT: TensorRT-oriented vision Transformer [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2205.09579.pdf
Xie E Z, Wang W J, Wang W H, Sun P Z, Xu H, Liang D and Luo P. 2021a. Segmenting transparent objects in the wild with Transformer//Proceedings of the 30th International Joint Conference on Artificial Intelligence. Montreal, Canada: [s.n.]: 1194-1200 [DOI: 10.24963/ijcai.2021/165]
Xie E Z, Wang W H, Yu Z D, Anandkumar A, Álvarez J M and Luo P. 2021b. SegFormer: simple and efficient design for semantic segmentation with Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2105.15203.pdf
Xie J T, Zeng R R, Wang Q L, Zhou Z Q and Li P H. 2021c. So-ViT: mind visual tokens for vision Transformer [EB/OL]. [2022-01-21]. https://arxiv.org/pdf/2104.10935v1.pdf
Xu Y F, Zhang Z J, Zhang M D, Sheng K K, Li K, Dong W M, Zhang L Q, Xu C S and Sun X. 2021. Evo-ViT: slow-fast token evolution for dynamic vision Transformer [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2108.01390.pdf
Yang J W, Li C Y, Zhang P C, Dai X Y, Xiao B, Yuan L and Gao J F. 2021a. Focal self-attention for local-global interactions in vision Transformers [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2107.00641.pdf
Yang S, Quan Z B, Nie M and Yang W K. 2021b. TransPose: keypoint localization via Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 11782-11792 [DOI: 10.1109/ICCV48922.2021.01159]
Yu X D, Shi D H, Wei X, Ren Y, Ye T Q and Tan W M. 2022. SOIT: segmenting objects with instance-aware Transformers. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3): 3188-3196 [DOI: 10.1609/aaai.v36i3.20227]
Yuan L, Chen Y P, Wang T, Yu W H, Shi Y J, Jiang Z H, Tay F E H, Feng J S and Yan S C. 2021. Tokens-to-token ViT: training vision Transformers from scratch on ImageNet//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 538-547 [DOI: 10.1109/ICCV48922.2021.00060]
Yue X Y, Sun S Y, Kuang Z H, Wei M, Torr P, Zhang W and Lin D H. 2021. Vision Transformer with progressive sampling//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 377-386 [DOI: 10.1109/ICCV48922.2021.00044]
Zhang B W, Yu J H, Fifty C, Han W, Dai A M, Pang R M and Sha F. 2021a. Co-training Transformer with videos and images improves action recognition [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2112.07175.pdf
Zhang G J, Luo Z P, Cui K W and Lu S J. 2021b. Meta-DETR: few-shot object detection via unified image-level meta-learning [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2103.11731v2.pdf
Zhang H, Hao Y B and Ngo C W. 2021c. Token shift Transformer for video classification//Proceedings of the 29th ACM International Conference on Multimedia. Virtual Event, China: ACM: 917-925 [DOI: 10.1145/3474085.3475272]
Zhang J M, Yang K L, Constantinescu A, Peng K Y, Müller K and Stiefelhagen R. 2021e. Trans4Trans: efficient Transformer for transparent object segmentation to help visually impaired people navigate in the real world//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 1760-1770 [DOI: 10.1109/ICCVW54120.2021.00202]
Zhang J Y, Huang J X, Luo Z P, Zhang G J and Lu S J. 2023. DA-DETR: domain adaptive detection Transformer by hybrid attention [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2103.17084v1.pdf
Zhang P C, Dai X Y, Yang J W, Xiao B, Yuan L, Zhang L and Gao J F. 2021f. Multi-scale vision longformer: a new vision Transformer for high-resolution image encoding//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 2978-2988 [DOI: 10.1109/ICCV48922.2021.00299]
Zhang Q L and Yang Y B. 2021. ResT: an efficient Transformer for visual recognition. Advances in Neural Information Processing Systems, 34: 15475-15485
Zhang Y, Cao J, Zhang L, Liu X C, Wang Z Y, Ling F and Chen W Q. 2022. A free lunch from ViT: adaptive attention multi-scale fusion Transformer for fine-grained visual recognition//ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE: 3234-3238 [DOI: 10.1109/ICASSP43922.2022.9747591]
Zhang Y Y, Li X Y, Liu C H, Shuai B, Zhu Y, Brattoli B, Chen H, Marsic I and Tighe J. 2021d. VidTr: video Transformer without convolutions//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 13557-13567 [DOI: 10.1109/ICCV48922.2021.01332]
Zhang Z Z and Zhang W X. 2022. Pyramid medical Transformer for medical image segmentation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2104.14702.pdf
Zhao H, Wang Q M, Jia Z Z, Chen Y M and Zhang J X. 2021a. Bayesian based facial expression recognition Transformer model in uncertainty//Proceedings of 2021 International Conference on Digital Society and Intelligent Systems. Chengdu, China: IEEE: 157-161 [DOI: 10.1109/dsins54396.2021.9670628]
Zhao J J, Li X Y, Liu C H, Bing S, Chen H, Snoek C G M and Tighe J. 2022. TubeR: tube-Transformer for action detection [EB/OL]. [2022-02-21]. https://arxiv.org/pdf/2104.00969v2.pdf
Zhao J W, Yan K, Zhao Y F, Guo X W, Huang F Y and Li J. 2021c. Transformer-based dual relation graph for multi-label image recognition//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 163-172 [DOI: 10.1109/ICCV48922.2021.00023]
Zhao W X, Tian Y J, Ye Q X, Jiao J B and Wang W Q. 2021b. GraFormer: graph convolution Transformer for 3D pose estimation [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2109.08364.pdf
Zheng C, Zhu S J, Mendieta M, Yang T J N, Chen C and Ding Z M. 2021a. 3D human pose estimation with spatial and temporal Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 11636-11645 [DOI: 10.1109/ICCV48922.2021.01145]
Zheng M H, Gao P, Zhang R R, Li K C, Wang X G, Li H S and Dong H. 2021b. End-to-end object detection with adaptive clustering Transformer [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2011.09315.pdf
Zheng S X, Lu J C, Zhao H S, Zhu X T, Luo Z K, Wang Y B, Fu Y W, Feng J F, Xiang T, Torr P H S and Zhang L. 2021c. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 6877-6886 [DOI: 10.1109/CVPR46437.2021.00681]
Zhong Y Y and Deng W H. 2021. Face Transformer for recognition [EB/OL]. [2022-02-15]. https://arxiv.org/pdf/2103.14803.pdf
Zhou B L, Zhao H, Puig X, Fidler S, Barriuso A and Torralba A. 2017. Scene parsing through ADE20K dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5122-5130 [DOI: 10.1109/cvpr.2017.544]
Zhu X Z, Su W J, Lu L W, Li B, Wang X G and Dai J F. 2021. Deformable DETR: deformable Transformers for end-to-end object detection [EB/OL]. [2022-03-26]. https://arxiv.org/pdf/2010.04159.pdf