视觉语言多模态预训练综述
Comprehensive review of visual-language-oriented multimodal pre-training methods
2022年27卷第9期 页码: 2652-2682
纸质出版日期: 2022-09-16
录用日期: 2022-06-22
DOI: 10.11834/jig.220173
张浩宇, 王天保, 李孟择, 赵洲, 浦世亮, 吴飞. 视觉语言多模态预训练综述[J]. 中国图象图形学报, 2022,27(9):2652-2682.
Haoyu Zhang, Tianbao Wang, Mengze Li, Zhou Zhao, Shiliang Pu, Fei Wu. Comprehensive review of visual-language-oriented multimodal pre-training methods[J]. Journal of Image and Graphics, 2022,27(9):2652-2682.
在多模态机器学习领域,为特定任务而制作的人工标注数据昂贵,且不同任务难以进行迁移,从而需要大量重新训练,导致训练多个任务时效率低下、资源浪费。预训练模型通过以自监督为代表的方式进行大规模数据训练,对数据集中不同模态的信息进行提取和融合,以学习其中蕴涵的通用知识表征,从而服务于广泛的相关下游视觉语言多模态任务,这一方法逐渐成为人工智能各领域的主流方法。依靠互联网所获取的大规模图文对与视频数据,以及以自监督学习为代表的预训练方法的进步,视觉语言多模态预训练模型在很大程度上打破了不同视觉语言任务之间的壁垒,提升了多个任务训练的效率并促进了具体任务的性能表现。本文总结视觉语言多模态预训练领域的进展,首先对常见的预训练数据集和预训练方法进行汇总,然后对目前最新方法以及经典方法进行系统概述,按输入来源分为图像—文本预训练模型和视频—文本多模态模型两大类,阐述了各方法之间的共性和差异,并将各模型在具体下游任务上的实验情况进行汇总。最后,总结了视觉语言预训练面临的挑战和未来发展趋势。
In multimodal machine learning, human-annotated data created for specific tasks are expensive, and models trained for one task are difficult to transfer to another, so large amounts of retraining are needed, which makes training multiple tasks inefficient and wasteful of resources. Pre-trained models address this by training on large-scale data, typically with self-supervised objectives, extracting and fusing the information of the different modalities in a dataset to learn general knowledge representations that can serve a wide range of downstream visual-language multimodal tasks. Because human labels are expensive, research on pre-trained models focuses on cheaply labeled data: a model is first pre-trained on cheap labels and then fine-tuned with a smaller amount of expensive human annotation. Since cheaply labeled data carry less information and more noise, pre-training usually requires large-scale data and long training schedules. A model pre-trained on large-scale unlabeled data not only transfers the more general knowledge it has learned to the target task, but also provides a better parameter initialization. Multimodal pre-training also shows potential for applications such as learning from demonstration, sentiment analysis and large-scale task-oriented human-computer interaction, and can be regarded as a pathway from weak, local artificial intelligence toward more general intelligence, because multi-task learning results can be transferred to unlabeled data in new domains automatically and quickly. Text-only pre-trained models cover only part of the data available online, leaving richer data under-utilized; multimodal settings benefit from information gathering, context perception, knowledge learning and demonstration. To move toward general-purpose artificial intelligence models, pre-training has therefore been developing from single-modal to multi-modal, and since 2019 the rapid growth of pre-trained models has extended to visual-textual interaction. Thanks to the large-scale image-text pairs and video data available on the internet and to advances in pre-training techniques such as self-supervised learning, visual-language multimodal pre-trained models have largely bridged the gaps between different visual-language tasks, improving the efficiency of multi-task training and the performance on specific tasks. Current multimodal research still faces challenges in building intelligent systems, perceiving multimodal information and bridging the semantic gap. We review existing pre-training datasets and pre-training methods and give a systematic overview of both the latest and the classical models, divided by input source into image-text and video-text pre-training models. The commonalities and differences between the methods are analyzed critically, and the experimental results of each model on specific downstream tasks are summarized. Finally, the challenges and future research directions of visual-language pre-training are discussed.
多模态机器学习；视觉语言多模态；预训练；自监督学习；图像文本预训练；视频文本预训练
multimodal machine learning; visual language multimodality; pre-training; self-supervised learning; image-text pre-training; video-text pre-training