Comprehensive review of visual-language-oriented multimodal pre-training methods
2022, Vol. 27, No. 9, Pages 2652-2682
Received: 2022-03-10
Revised: 2022-06-15
Accepted: 2022-06-22
Published in print: 2022-09-16
DOI: 10.11834/jig.220173
In multimodal machine learning, manually annotated data produced for a specific task is expensive, and models trained for one task are difficult to transfer to another, requiring extensive retraining; training many tasks is therefore inefficient and wasteful of resources. Pre-trained models are trained on large-scale data, typically with self-supervision, extracting and fusing information from the different modalities in a dataset to learn the general-purpose knowledge representations it contains, which then serve a wide range of related downstream vision-language multimodal tasks; this approach has gradually become the mainstream method across subfields of artificial intelligence. Relying on large-scale image-text pairs and video data collected from the Internet, together with advances in pre-training methods exemplified by self-supervised learning, vision-language multimodal pre-trained models have largely broken down the barriers between different vision-language tasks, improving the efficiency of multi-task training and boosting performance on individual tasks. This paper surveys progress in vision-language multimodal pre-training. We first summarize common pre-training datasets and pre-training methods, then give a systematic overview of both the latest and the classic methods, dividing them by input source into image-text pre-training models and video-text multimodal models, describing the commonalities and differences among the methods, and summarizing each model's experimental results on specific downstream tasks. Finally, we summarize the challenges facing vision-language pre-training and its future development trends.
Multimodal machine learning is challenged by labor-intensive, costly annotation and by constraints on transferring data and models across tasks, which require a large amount of retraining and result in low efficiency and wasteful resource allocation when training multiple tasks. To learn internal knowledge representations and meet the requirements of related downstream vision-language multimodal tasks, pre-trained models are trained on large-scale data through self-supervision, extracting and integrating information from the multiple modalities of the dataset. Because human labels are expensive, the exploration of pre-trained models has focused on cheaply labeled data: the model is first pre-trained on cheap labels and then fine-tuned with a smaller amount of expensive human annotation. Since cheaply labeled data carry less information and more noise, pre-training usually requires large-scale data and long training times. A model pre-trained on large-scale unlabeled data not only transfers more general knowledge to the target task but also provides a better parameter initialization. Future multimodal contexts have potential in areas such as learning from demonstration, sentiment analysis, and task-oriented large-scale human-computer interaction. Multimodal pre-trained models can serve as a pathway from weak, local artificial intelligence toward more general intelligence, making it possible to transfer multi-task learning results to unsupervised multi-domain data automatically and quickly. Plain-text pre-trained models can cover only part of the data available online, leaving richer data underutilized; multimodal contexts benefit information gathering, context perception, knowledge learning, and demonstration. To move toward general-purpose artificial intelligence models, pre-training has been developing from single-modal to multimodal, and the intensive growth of pre-trained models has extended to the field of visual-textual interaction since 2019. Thanks to the large-scale image-text pairs and video data available online and to the growth of pre-training techniques such as self-supervised learning, vision-language multimodal pre-trained models have bridged the gap between different vision-language tasks, optimizing multi-task training and improving the performance of specific tasks. Current multimodal research is challenged by intelligent-system organization, multimodal information perception, and bridging the semantic gap. We review existing pre-training datasets and pre-training methods, and give a systematic overview of the latest and the classic approaches. The commonalities and differences between the methods are critically analyzed, and the experimental results of each model on specific downstream tasks are summarized. Finally, the challenges and future research directions of vision-language pre-training are discussed.
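One self-supervised objective central to the image-text pre-training methods surveyed here is contrastive image-text alignment, which pulls the embeddings of matched image-caption pairs together and pushes mismatched pairs apart. The following is a minimal illustrative NumPy sketch of such a symmetric contrastive (InfoNCE-style) loss, not the implementation of any particular model; the batch size, embedding dimension, and temperature value are arbitrary choices for illustration:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each embedding to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: arrays of shape (batch, dim); row i of each forms a matching pair.
    """
    img = l2_normalize(np.asarray(img_emb, dtype=float))
    txt = l2_normalize(np.asarray(txt_emb, dtype=float))
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])      # matching pairs lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss aligns each image with its own caption against all other captions in the batch; in practice such objectives are applied with learned image and text encoders and very large batches, often alongside masked-modeling and matching objectives.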
Radford A, Wu J, Child R, Luan D, Amodei D and Sutskever I. 2019. Language models are unsupervised multitask learners [EB/OL ] . [2022-04-28 ] . http://www.persagen.com/files/misc/radford2019language.pdf http://www.persagen.com/files/misc/radford2019language.pdf
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y Q, Li W and Liu P J. 2020. Exploring the limits of transfer learning with a unified text-to-text tansformer. Journal of Machine Learning Research (JMLR), 21(140): 1-67
Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 91-99
Ruan L D and Jin Q. 2021. Survey: transformer based video-language pre-training [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2109.09920.pdf https://arxiv.org/pdf/2109.09920.pdf
Sharma P, Ding N, Goodman S and Soricut R. 2018. Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne, Australia: ACL: 2556-2565 [ DOI: 10.18653/v1/P18-1238 http://dx.doi.org/10.18653/v1/P18-1238 ]
Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J and Catanzaro B. 2020. Megatron-LM: training multi-billion parameter language models using model parallelism [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/1909.08053.pdf https://arxiv.org/pdf/1909.08053.pdf
Song K, Tan X, Qin T, Lu J F and Liu T Y. 2019. MASS: masked sequence to sequence pre-training for language generation//Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: PMLR: 5926-5936
Song Y L and Soleymani M. 2019. Polysemous visual-semantic embedding for cross-modal retrieval//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1979-1988 [ DOI: 10.1109/CVPR.2019.00208 http://dx.doi.org/10.1109/CVPR.2019.00208 ]
Soomro K, Zamir A R and Shah M. 2012. UCF101: a dataset of 101 human actions classes from videos in the wild [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/1212.0402.pdf https://arxiv.org/pdf/1212.0402.pdf
Stroud J C, Lu Z C, Sun C, Deng J, Sukthankar R, SchmidC and Ross D A. 2021. Learning video representations from textual web supervision [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2007.14937.pdf https://arxiv.org/pdf/2007.14937.pdf
Su W J, Zhu X Z, Cao Y, Li B, Lu L L, Wei R and Dai J F. 2020. VL-bert: pre-training of generic visual-linguistic representations[EB/OL ] . [2022-04-28 ] . https://openreview.net/attachment?id=SygXPaEYvH&name=original_pdf https://openreview.net/attachment?id=SygXPaEYvH&name=original_pdf
Suhr A, Lewis M, Yeh J and Artzi Y. 2017. A corpus of natural language for visual reasoning//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada: ACL: 217-223 [ DOI: 10.18653/v1/P17-2034 http://dx.doi.org/10.18653/v1/P17-2034 ]
Suhr A, Zhou S, Zhang A, Zhang I, Bai H J and Artzi Y. 2019. A corpus for reasoning about natural language grounded in photographs//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: ACL: 6418-6428 [ DOI: 10.18653/v1/P19-1644 http://dx.doi.org/10.18653/v1/P19-1644 ]
Sun C, Baradel F, Murphy K and Schmid C. 2019a. Contrastive bidirectional transformer for temporal representation learning [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/1906.05743v1.pdf https://arxiv.org/pdf/1906.05743v1.pdf
Sun C, Myers A, Vondrick C, Murphy K and Schmid C. 2019b. VideoBERT: a joint model for video and language representation learning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea(South): IEEE: 7463-7472 [ DOI: 10.1109/ICCV.2019.00756 http://dx.doi.org/10.1109/ICCV.2019.00756 ]
Tan H and Bansal M. 2019. LXMERT: learning cross-modality encoder representations from transformers//Proceedings of 2019 Conference on Empirical Methods in Natural Language P rocessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: ACL: 5100-5111 [ DOI: 10.18653/v1/D19-1514 http://dx.doi.org/10.18653/v1/D19-1514 ]
Tang Y S, Ding D J, Rao Y M, Zheng Y, Zhang D Y, Zhao L L, Lu J W and Zhou J. 2019. COIN: a large-scale dataset for comprehensive instructional video analysis//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1207-1216 [ DOI: 10.1109/CVPR.2019.00130 http://dx.doi.org/10.1109/CVPR.2019.00130 ]
Tang Z N, Lei J and Bansal M. 2021. DeCEMBERT: learning from noisy instructional videos via dense captions and entropy minimization//Proceedings of 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: ACL: 2415-2426 [ DOI: 10.18653/v1/2021.naacl-main.193 http://dx.doi.org/10.18653/v1/2021.naacl-main.193 ]
Tapaswi M, Zhu Y K, Stiefelhagen R, Torralba A, Urtasun R and Fidler S. 2016. MovieQA: understanding stories in movies through question-answering//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4631-4640 [ DOI: 10.1109/CVPR.2016.501 http://dx.doi.org/10.1109/CVPR.2016.501 ]
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A and Jégou H. 2021a. Training data-efficient image transformers and distillation through attention//Proceedings of the 38th International Conference on Machine Learning. [s. l.]: PMLR: 10347-10357
Touvron H, Cord M, Sablayrolles A, Synnaeve G and Jégou H. 2021b. Going deeper with image transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 32-42 [ DOI: 10.1109/ICCV48922.2021.00010 http://dx.doi.org/10.1109/ICCV48922.2021.00010 ]
Van den Oord A, Vinyals O and Kavukcuoglu K. 2017. Neural discrete representation learning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc. : 6309-6318
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc. : 6000-6010
Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 315 6-3164 [ DOI: 10.1109/CVPR.2015.7298935 http://dx.doi.org/10.1109/CVPR.2015.7298935 ]
Wang L W, Li Y and Lazebnik S. 2016. Learning deep structure-preserving image-text embeddings//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 5005-5013 [ DOI: 10.1109/CVPR.2016.541 http://dx.doi.org/10.1109/CVPR.2016.541 ]
Wang L W, Li Y, Huang J and Lazebnik S. 2019. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2): 394-407 [DOI: 10.1109/TPAMI.2018.2797921]
Wang Z R, Yu J H, Yu A W, Dai Z H, Tsvetkov Y and Cao Y. 2022. SimVLM: simple visual language model pretraining with weak supervision [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2108.10904.pdf https://arxiv.org/pdf/2108.10904.pdf
Wu Y H, Schuster M, Chen Z F, Le Q V, Norouzi M, Macherey M, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X B, KaiserŁ, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M and Dean J. 2016. Google's neural machine translation system: bridging the gap between human and machine translation [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/1609.08144.pdf https://arxiv.org/pdf/1609.08144.pdf
Xia Q L, Huang H Y, Duan N, Zhang D D, Ji L, Sui Z F, Cui E, Bharti T and Zhou M. 2021. XGPT: cross-modal generative pre-training for image captioning//Proceedings of the 10th CCF International Conference on Natural Language Processing and Chinese Computing. Qingdao, China: Springer: 786-797 [ DOI: 10.1007/978-3-030-88480-2_63 http://dx.doi.org/10.1007/978-3-030-88480-2_63 ]
Xie N, Lai F, Doran D and Kadav A. 2019a. Visual entailment task for visually-grounded language learning [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/1811.10582.pdf https://arxiv.org/pdf/1811.10582.pdf
Xie N, Lai F, Doran D and Kadav A. 2019b. Visual entailment: a novel task for fine-grained image understanding [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/1901.06706.pdf https://arxiv.org/pdf/1901.06706.pdf
Xie S N, Sun C, Huang J, Tu Z W and Murphy K. 2018. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 318-335 [ DOI: 10.1007/978-3-030-01267-0_19 http://dx.doi.org/10.1007/978-3-030-01267-0_19 ]
Xu D J, Zhao Z, Xiao J, Wu F, Zhang H W, He X N and Zhuang Y T. 2017. Video question answering via gradually refined attention over appearance and motion//Proceedings of the 25th ACM International Conference on Multimedia. Mountain View, USA: ACM: 1645-1653 [ DOI: 10.1145/3123266.3123427 http://dx.doi.org/10.1145/3123266.3123427 ]
Xu H, Ghosh G, Huang P Y, Arora P, Aminzadeh M, Feichtenhofer C, Metze F and Zettlemoyer L. 2021a. VLM: task-agnostic video-language model pre-training for video understanding//Findings of Association for Computational Linguistics. Online: ACL: 4227-4239 [ DOI: 10.18653/v1/2021.findings-acl.370 http://dx.doi.org/10.18653/v1/2021.findings-acl.370 ]
Xu J, Mei T, Yao T and Rui Y. 2016. MSR-VTT: a large video description dataset for bridging video and language//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 5288-5296 [ DOI: 10.1109/CVPR.2016.571 http://dx.doi.org/10.1109/CVPR.2016.571 ]
Xu K, Ba J L, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R S and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: PMLR: 2048-2057
Xu Y, Xu Y H, Lv T C, Cui L, We F R, Wang G X, Lu Y J, Florencio D, Zhang C, Che W X, Zhang M and Zhou L D. 2021b. LayoutLMv2: multi-modal pre-training for visually-rich document understanding//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Online: ACL: 2579-2591 [ DOI: 10.18653/v1/2021.acl-long.201 http://dx.doi.org/10.18653/v1/2021.acl-long.201 ]
Xu Y H, Li M H, Cui L, Huang S H, Wei F R and Zhou M. 2020. LayoutLM: pre-training of text and layout for document image understanding//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. California, USA: ACM: 1192-1200 [ DOI: 10.1145/3394486.3403172 http://dx.doi.org/10.1145/3394486.3403172 ]
Yang Z L, Dai Z H, Yang Y M, Carbonell J, Salakhutdinov R and Le Q V. 2019. XLNet: generalized autoregressive pretraining for language understanding//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: ACM: 5753-5763
Yao Y, Zhang A, Zhang Z Y, Liu Z Y, Chua T S and Sun M S. 2022. CPT: colorful prompt tuning for pre-trained vision-language models[EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2109.11797.pdf https://arxiv.org/pdf/2109.11797.pdf
Yosinski J, Clune J, Bengio Y and Lipson H. 2014. How transferable are features in deep neural networks//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 3320-3328
Young P, Lai A, Hodosh M and Hockenmaier J. 2014. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions//Proceedings of the Transactions of the Association for Computational Linguistics. Cambridge, USA: MIT Press: 67-78 [ DOI: 10.1162/tacl_a_00166 http://dx.doi.org/10.1162/tacl_a_00166 ]
Yu F, Tang J J, Yin W C, Sun Y, Tian H, Wu H and Wang H F. 2021. ERNIE-VIL: knowledge enhanced vision-language representations through scene graphs. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4): 3208-3216
Yu H N, Wang J, Huang Z H, Yang Y and Xu W. 2016a. Video paragraph captioning using hierarchical recurrent neural networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4584-4593 [ DOI: 10.1109/CVPR.2016.496 http://dx.doi.org/10.1109/CVPR.2016.496 ]
Yu L C, Poirson P, Yang S, Berg A C and Berg T L. 2016b. Modeling context in referring expressions//Proceedings of the 14th European Conference on Computer Vision (ECCV). Amsterdam, the Netherlands: Springer: 69-85 [ DOI: 10.1007/978-3-319-46475-6_5 http://dx.doi.org/10.1007/978-3-319-46475-6_5 ]
Yu Y, Kim J and Kim G. 2018. A joint sequence fusion model for video question answering and retrieval//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 487-503 [ DOI: 10.1007/978-3-030-01234-2_29 http://dx.doi.org/10.1007/978-3-030-01234-2_29 ]
Yuan L, Hou Q B, Jiang Z H, Feng J S and Yan S C. 2021. VOLO: vision outlooker for visual recognition [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2106.13112.pdf https://arxiv.org/pdf/2106.13112.pdf
Zellers R, Bisk Y, Farhadi A and Choi Y. 2019a. From recognition to cognition: visual commonsense reasoning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 6713-6724 [ DOI: 10.1109/CVPR.2019.00688 http://dx.doi.org/10.1109/CVPR.2019.00688 ]
Zellers R, Holtzman A, Rashkin H, Bisk Y, Farhadi A, Roesner F and Choi Y J. 2019b. Defending against neural fake news [EB/OL ] . [2022-04-28 ] . https://proceedings.neurips.cc/paper/2019/file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf https://proceedings.neurips.cc/paper/2019/file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf
Zellers R, Lu X M, Hesse J, Yu Y, Park J S, Cao J, Farhadi A and Choi Y J. 2021. Merlot: multimodal neural script knowledge models//Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021). Online: 23634-23651
Zhang P C, Li X J, Hu X W, Yang J W, Zhang L, Wang L J, Choi Y J and Gao J F. 2021a. VinVL: revisiting visual representations in vision-language models//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 5575-5584 [ DOI: 10.1109/CVPR46437.2021.00553 http://dx.doi.org/10.1109/CVPR46437.2021.00553 ]
Zhang S Y, Jiang T, Wang T, Kuang K, Zhao Z, Zhu J K, Yu J, Yang H X and Wu F. 2020a. DeVLBert: learning deconfounded visio-linguistic representations//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 4373-4382 [ DOI: 10.1145/3394171.3413518 http://dx.doi.org/10.1145/3394171.3413518 ]
Zhang Z Y, Gu Y X, Han X, Chen S Q, Xiao C J, Sun Z B, Yao Y, Qi F C, Guan J, Ke P, Cai Y Z, Zeng G Y, Tan Z X, Liu Z Y, Huang M L, Han W T, Liu Y, Zhu X Y and Sun M S. 2021b. CPM-2: large-scale cost-effective pre-trained language models [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2106.10715.pdf https://arxiv.org/pdf/2106.10715.pdf
Zhang Z Y, Han X, Zhou H, Ke P, Gu Y X, Ye D M, Qin Y J, Su Y S, Ji H Z, Guan J, Qi F C, Wang X Z, Zheng Y N, Zeng G Y, Cao H Q, Chen S Q, Li D X, Sun Z B, Liu Z Y, Huang M L, Han W T, Tang J, Li J Z, Zhu X Y and Sun M S. 2020b. CPM: a large-scale generative Chinese pre-trained language model[EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2012.00413.pdf https://arxiv.org/pdf/2012.00413.pdf
Zhou L W, Kalantidis Y, Chen X L, Corso J J and Rohrbach M. 2019. Grounded video description//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 6571-6580 [ DOI: 10.1109/CVPR.2019.00674 http://dx.doi.org/10.1109/CVPR.2019.00674 ]
Zhou L W, Liu J J, Cheng Y, Gan Z and Zhang L. 2021. CUPID: adaptive curation of pre-training data for video-and-language representation learning [EB/OL ] . [2022-04-28 ] . https://arxiv.org/pdf/2104.00285.pdf https://arxiv.org/pdf/2104.00285.pdf
Zhou L W, Palangi H, Zhang L, Hu H D, Corso J and Gao J F. 2020. Unified vision-language pre-training for image captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 13041-13049 [DOI: 10.1609/aaai.v34i07.7005]
Zhou L W, Xu C L and Corso J. 2018a. Towards automatic learning of procedures from web instructional videos. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1): 7590-7598 [DOI: 10.1609/aaai.v32i1.12342]
Zhou L W, Zhou Y B, Corso J J, Socher R and Xiong C M. 2018b. End-to-end dense video captioning with masked transformer//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 8739-8748 [ DOI: 10.1109/CVPR.2018.00911 http://dx.doi.org/10.1109/CVPR.2018.00911 ]
Zhu L C and Yang Y. 2020. ActBERT: learning global-local video-text representations//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 8743-8752 [ DOI: 10.1109/CVPR42600.2020.00877 http://dx.doi.org/10.1109/CVPR42600.2020.00877 ]
Zhu Y K, Groth O, Bernstein M and Li F F. 2016. Visual7 W: grounded question answering in images//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4995-5004 [ DOI: 10.1109/CVPR.2016.540 http://dx.doi.org/10.1109/CVPR.2016.540 ]
Zhu Y K, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A and Fidler S. 2015. Aligning books and movies: towards story-like visual explanations by watching movies and reading books//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 19-27 [ DOI: 10.1109/ICCV.2015.11 http://dx.doi.org/10.1109/ICCV.2015.11 ]
Zhukov D, Alayrac J B, Cinbis R G, Fouhey D, Laptev I and Sivic J. 2019. Cross-task weakly supervised learning from instructional videos//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 3532-3540 [ DOI: 10.1109/CVPR.2019.00365 http://dx.doi.org/10.1109/CVPR.2019.00365 ]