Multimodal referring expression comprehension based on image and text: a review
2023, Vol. 28, No. 5, Pages 1308-1325
Print publication date: 2023-05-16
DOI: 10.11834/jig.221024
Wang Li’an, Miao Peihan, Su Wei, Li Xi, Ji Naye, Jiang Yanbing. 2023. Multimodal referring expression comprehension based on image and text: a review. Journal of Image and Graphics, 28(05): 1308-1325
Referring expression comprehension (REC), a multimodal task combining vision and language, aims to understand an input referring expression and localize the object it describes in an image, and has attracted attention from both the computer vision and natural language processing communities. REC builds a bridge between human language and the visual content of the physical world, and can be widely applied in artificial intelligence systems such as visual understanding systems and dialogue systems. The key to solving this task lies in fully understanding the semantics of complex referring expressions, then using this semantic information to perform relational reasoning and object filtering over an image containing multiple objects, and finally localizing the unique target object in the image. This paper surveys the REC task from a computer vision perspective. It first introduces the general processing pipeline of the task, and then classifies and summarizes existing REC methods: according to the granularity of the visual representation, they are divided into methods based on region-level convolutional visual representations, grid-level convolutional visual representations, and image-patch-level visual representations, and are further categorized by how the visual-textual feature fusion module is modeled. The paper also introduces the mainstream datasets and evaluation metrics for the task. Finally, it analyzes the challenges facing existing methods from three aspects, namely model inference speed, model interpretability, and the model's ability to reason over expressions, and provides a comprehensive outlook on the development of REC. By summarizing existing research and future trends of the REC task, this survey aims to offer researchers in related fields a comprehensive reference and directions for exploration.
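To make the pipeline described above concrete, the following is a minimal PyTorch-style sketch of the generic REC flow: linguistic feature extraction, visual feature extraction, visual-linguistic fusion, and box prediction. The class name RECPipeline and the specific encoder choices (a GRU text encoder, a tiny CNN stub, an MLP fusion head) are illustrative placeholders, not the design of any particular surveyed method.

```python
# A minimal sketch of the generic REC pipeline: linguistic feature extraction,
# visual feature extraction, visual-linguistic fusion, and box prediction.
# The GRU text encoder, the tiny CNN stub, and the MLP fusion head are
# illustrative placeholders, not the design of any specific surveyed method.
import torch
import torch.nn as nn

class RECPipeline(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Linguistic feature extraction: embed tokens, run a GRU, keep the last state.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)
        # Visual feature extraction: a small CNN standing in for a pre-trained backbone.
        self.visual_enc = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model))
        # Visual-linguistic fusion + localization head: regress (cx, cy, w, h) in [0, 1].
        self.fusion = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, image, token_ids):
        _, h = self.text_enc(self.embed(token_ids))   # h: (1, B, d_model)
        text_feat = h[-1]                              # (B, d_model)
        vis_feat = self.visual_enc(image)              # (B, d_model)
        return self.fusion(torch.cat([vis_feat, text_feat], dim=-1))  # (B, 4)

# One 224x224 image and a 6-token expression produce one normalized box.
model = RECPipeline()
box = model(torch.randn(1, 3, 224, 224), torch.randint(0, 10000, (1, 6)))
print(box.shape)  # torch.Size([1, 4])
```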
As a subtask of visual grounding (VG), referring expression comprehension (REC) aims to locate, in a given image, the object described by an input referring expression. By supporting artificial intelligence (AI) tasks built on multimodal data, REC facilitates interaction among humans, machines, and the physical world. Through visual understanding systems and dialogue systems, it can serve domains such as navigation, autonomous driving, robotics, and early education, and it also benefits related tasks including 1) image retrieval, 2) image captioning, and 3) visual question answering. Over the past two decades, object detection in computer vision has developed dramatically and can locate all objects belonging to predefined, fixed categories. REC, in contrast, must find the particular object specified by the referring expression, which makes reasoning over multiple objects its core challenge. The general REC process can be divided into three modules: linguistic feature extraction, visual feature extraction, and visual-linguistic fusion. The most important of the three is visual-linguistic fusion, which realizes the interaction between linguistic and visual features and the screening of candidate objects. In addition, much current research focuses on the design of the visual feature extraction module, which is to a certain extent the basic module of an REC model: visual input carries richer information than text input but also more redundant, interfering information that must be suppressed, so effective object localization hinges on extracting effective visual features. We divide existing REC methods into three categories. 1) Methods based on region-level convolutional visual representations. According to how the visual-linguistic fusion module is modeled, they can be further divided into five sub-categories: (1) early fusion, (2) attention-mechanism fusion, (3) expression-decomposition fusion, (4) graph-network fusion, and (5) Transformer-based fusion. Because object proposals must be generated for the input image in advance, these methods still suffer from high computational cost and low speed, and the performance of the REC model is also limited by the quality of the proposals. 2) Methods based on grid-level convolutional visual representations. Their multimodal fusion modules fall into two categories: (1) filtering-based fusion and (2) Transformer-based fusion. Since no object proposals need to be generated, inference can be accelerated by at least ten times. 3) Methods based on image-patch-level visual representations. The two kinds of methods above rely on pre-trained object detection networks or convolutional networks as visual feature extractors, so the extracted visual features may not match the visual elements that REC actually requires. Therefore, more recent research integrates the visual feature extraction module with the visual-linguistic fusion module and takes pixel-level image patches as input: instead of relying on a pre-trained convolutional neural network (CNN) feature extractor, visual features are generated directly under the guidance of the text input so as to meet the requirements of the REC task. The REC task is further introduced and clarified on the basis of four popular datasets and the corresponding evaluation methods.
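As a rough illustration of the image-patch-granularity, Transformer-based fusion described above (in the spirit of TransVG-style models), the sketch below projects image patches and expression tokens into a shared space, prepends a learnable localization token, runs a Transformer encoder over the joint sequence, and regresses the box from that token. All names and hyperparameters are illustrative assumptions; positional embeddings and training losses are omitted for brevity.

```python
# A simplified sketch of image-patch-granularity REC with Transformer-based
# fusion: patches and expression tokens are fused jointly, and a learnable
# localization token regresses the box. Hyperparameters are illustrative;
# positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

class PatchTransformerREC(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, patch=16, nhead=8, layers=6):
        super().__init__()
        self.patch_proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)  # patchify
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.loc_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [REG]-style token
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, image, token_ids):
        b = image.size(0)
        patches = self.patch_proj(image).flatten(2).transpose(1, 2)  # (B, N_patches, d)
        words = self.token_embed(token_ids)                          # (B, N_words, d)
        loc = self.loc_token.expand(b, -1, -1)                       # (B, 1, d)
        fused = self.encoder(torch.cat([loc, patches, words], dim=1))
        return self.box_head(fused[:, 0])                            # box from the [REG] slot

model = PatchTransformerREC()
box = model(torch.randn(1, 3, 224, 224), torch.randint(0, 10000, (1, 8)))
print(box.shape)  # torch.Size([1, 4])
```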
Furthermore, three kinds of challenging problems in the REC context remain to be resolved: 1) the model's inference speed, 2) the interpretability of the model, and 3) the model's ability to reason over expressions. Future research directions of REC, including its extension to the video and 3D domains, are analyzed and predicted from the two aspects of model design and domain development.
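Regarding the evaluation methods mentioned above, REC systems are commonly scored by the fraction of expressions for which the predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The helper below is a small sketch of that protocol assuming (x1, y1, x2, y2) boxes; it is not tied to any particular dataset's official toolkit.

```python
# Sketch of the common REC metric: a prediction counts as correct when its
# IoU with the ground-truth box reaches 0.5; accuracy is the hit rate.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, thresh=0.5):
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Two predictions, one of which overlaps its ground truth enough.
print(accuracy_at_iou([(10, 10, 50, 50), (0, 0, 20, 20)],
                      [(12, 12, 52, 52), (60, 60, 90, 90)]))  # 0.5
```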
Keywords: visual grounding (VG); referring expression comprehension (REC); vision and language; visual representation granularity; multi-modal feature fusion