深度学习图像描述方法分析与展望
Deep-learning-based image captioning: analysis and prospects
2023, Vol. 28, No. 9: 2788-2816
Print publication date: 2023-09-16
DOI: 10.11834/jig.220660
赵永强, 金芝, 张峰, 赵海燕, 陶政为, 豆乘风, 徐新海, 刘东红. 2023. 深度学习图像描述方法分析与展望. 中国图象图形学报, 28(09):2788-2816
Zhao Yongqiang, Jin Zhi, Zhang Feng, Zhao Haiyan, Tao Zhengwei, Dou Chengfeng, Xu Xinhai, Liu Donghong. 2023. Deep-learning-based image captioning: analysis and prospects. Journal of Image and Graphics, 28(09):2788-2816
图像描述任务是利用计算机自动为已知图像生成一个完整、通顺、适用于对应场景的描述语句,实现从图像到文本的跨模态转换。随着深度学习技术的广泛应用,图像描述算法的精确度和推理速度都得到了极大提升。本文在广泛文献调研的基础上,将基于深度学习的图像描述算法研究分为两个层面,一是图像描述的基本能力构建,二是图像描述的应用有效性研究。这两个层面又可以细分为传递更加丰富的特征信息、解决暴露偏差问题、生成多样性的图像描述、实现图像描述的可控性和提升图像描述推理速度等核心技术挑战。针对上述层面所对应的挑战,本文从注意力机制、预训练模型和多模态模型的角度分析了传递更加丰富的特征信息的方法,从强化学习、非自回归模型和课程学习与计划采样的角度分析了解决暴露偏差问题的方法,从图卷积神经网络、生成对抗网络和数据增强的角度分析了生成多样性的图像描述的方法,从内容控制和风格控制的角度分析了图像描述可控性的方法,从非自回归模型、基于网格的视觉特征和基于卷积神经网络解码器的角度分析了提升图像描述推理速度的方法。此外,本文还对图像描述领域的通用数据集、评价指标和已有算法性能进行了详细介绍,并对图像描述中待解决的问题与未来研究趋势进行预测和展望。
The task of image captioning is to use a computer to automatically generate a complete, fluent caption suited to the scene of a given image, realizing the cross-modal conversion from image to text. Describing the visual content of an image accurately and quickly is a fundamental goal of artificial intelligence and has a wide range of applications in research and production. Image captioning can support many aspects of social development, such as text captions for images and videos, visual question answering, visual storytelling, web image analysis, and keyword-based image search. Image captions can also assist people with visual impairments, making the computer another pair of eyes for them. With the wide application of deep learning technology, the accuracy and inference speed of image captioning algorithms have been greatly improved. On the basis of extensive literature research, we find that deep-learning-based image captioning algorithms still face several key technical challenges: delivering rich feature information, solving the exposure bias problem, generating diverse image captions, making caption generation controllable, and improving inference speed.

The main framework of image captioning models is the encoder-decoder architecture. The encoder first converts an input image into a fixed-length feature vector, and the decoder then converts this feature vector into an image caption. Therefore, the richer the feature information available to the model, the higher its accuracy and the better the generated captions. According to the research ideas of existing algorithms, this study reviews image captioning algorithms that deliver rich feature information from three aspects: the attention mechanism, pretraining models, and multimodal models.

Many image captioning algorithms cannot keep the training and prediction processes of a model consistent, so the model suffers from exposure bias. Under exposure bias, errors accumulate during word generation and subsequent words become increasingly biased, seriously affecting the accuracy of the image captioning model. According to the different problem-solving methods, this study reviews research on alleviating exposure bias in image captioning from three perspectives: reinforcement learning, nonautoregressive models, and curriculum learning with scheduled sampling.

Image captioning is inherently ambiguous because multiple suitable captions may be generated for one image. Existing methods tend to use common high-frequency expressions to generate relatively "safe" sentences, so the resulting captions are often simple, generic, and lacking in critical details, which leads to a lack of diversity. According to the different research ideas, this study reviews methods for generating diverse image captions from three aspects: graph convolutional neural networks, generative adversarial networks, and data augmentation. Most current image captioning models also lack controllability, which distinguishes them from human intelligence.
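To make the encoder-decoder pipeline described above concrete, the following minimal sketch (assuming PyTorch and torchvision; the ResNet-18 backbone, layer sizes, and vocabulary size are illustrative choices, not those of any surveyed model) shows a CNN encoder producing a fixed-length feature vector and an LSTM decoder trained with teacher forcing, the training regime in which exposure bias arises because the decoder always conditions on ground-truth words rather than on its own predictions.

```python
# Illustrative sketch only; assumes torch and torchvision (>= 0.13 API) are installed.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a CNN maps the image to a fixed-length feature vector.
        cnn = models.resnet18(weights=None)  # in practice, pretrained weights are used
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.feat_proj = nn.Linear(512, embed_dim)
        # Decoder: an LSTM generates the caption word by word.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) ground-truth token ids.
        feats = self.encoder(images).flatten(1)        # (B, 512) fixed-length feature vector
        feats = self.feat_proj(feats).unsqueeze(1)     # (B, 1, E), fed as the first "token"
        words = self.embed(captions)                   # (B, T, E)
        # Teacher forcing: the decoder sees ground-truth words during training, while at
        # inference it must consume its own predictions -- the source of exposure bias that
        # scheduled sampling and reinforcement learning methods aim to mitigate.
        inputs = torch.cat([feats, words], dim=1)      # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                        # (B, T+1, vocab) next-word logits

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```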
Researchers have proposed algorithms that actively control image caption generation, which are mainly divided into two categories: content-controlled and style-controlled image captioning. Content-controlled captioning aims to control the described image content, such as different regions or objects of the image, so that the model can describe the content that users are interested in. Style-controlled captioning aims to generate captions in different styles, such as humorous, romantic, and antique. This study reviews the related algorithms of both categories.

Most existing image captioning models adopt the encoder-decoder architecture, in which the encoder extracts visual features with a convolutional neural network and the decoder generates text with a recurrent neural network. According to the existing research ideas, methods for improving the inference speed of image captioning models are divided into three categories: the first uses nonautoregressive models, the second uses grid-based visual features, and the third uses a convolutional-neural-network-based decoder.

In addition, this study provides a detailed introduction to the general datasets and evaluation metrics used in image captioning. The general datasets mainly include Flickr8K, Flickr30K, MS COCO (Microsoft common objects in context), TextCaps, Localized Narratives, and Nocaps. The evaluation metrics mainly include bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), consensus-based image description evaluation (CIDEr), semantic propositional image caption evaluation (SPICE), compact bilinear pooling, text-to-image grounding for image caption evaluation (TIGEr), relevance-extraness-omission (REO), and fidelity and adequacy ensured (FAIEr).

Finally, this study discusses the open problems and future research directions in image captioning: how to improve visual feature extraction for image captioning, how to improve the diversity of image captions, how to improve the interpretability of deep learning models, how to realize transfer between multiple languages in image captioning, how to automatically generate or design optimal network architectures, and how to build datasets and evaluation metrics that are well suited to image captioning. Image captioning remains a research hot spot in computer vision and natural language processing; many algorithms addressing different problems are proposed every year, and further research directions will continue to emerge.
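As a concrete illustration of how the n-gram-based metrics listed above score a candidate caption against human references, the following self-contained sketch (plain Python; the tokenized example sentences are invented for illustration, and the implementation omits the smoothing used by the official evaluation servers) computes BLEU from modified n-gram precision and a brevity penalty.

```python
# Minimal illustrative BLEU, not the official evaluation toolkit.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """candidate: list of tokens; references: list of token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(clipped / total) if clipped > 0 else float("-inf"))
    # Brevity penalty: penalize candidates shorter than the closest reference length.
    ref_len = min((len(r) for r in references), key=lambda l: (abs(l - len(candidate)), l))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

candidate = "the dog is running on the grass".split()
references = ["a dog is running on the grass".split(),
              "the dog runs across the grass".split()]
print(round(bleu(candidate, references), 3))  # about 0.88
```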
图像描述；深度学习；基本能力；应用有效性；核心技术挑战
image caption; deep learning; basic capabilities; application effectiveness; key technical challenges
Anderson P, Fernando B, Johnson M and Gould S. 2016. SPICE: semantic propositional image caption evaluation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 382-398 [DOI: 10.1007/978-3-319-46454-1_24]
Anderson P, He X D, Buehler C, Teney D, Johnson M, Gould S and Zhang L. 2018. Bottom-up and top-down attention for image captioning and visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6077-6086 [DOI: 10.1109/CVPR.2018.00636]
Aslam A. 2022. Detecting objects in less response time for processing multimedia events in smart cities//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New Orleans, USA: IEEE: 2043-2053 [DOI: 10.1109/CVPRW56347.2022.00222]
Banerjee S and Lavie A. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: ACL: 65-73
Bengio S, Vinyals O, Jaitly N and Shazeer N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 1171-1179
Berthelier A, Chateau T, Duffner S, Garcia C and Blanc C. 2021. Deep model compression and architecture optimization for embedded systems: a survey. Journal of Signal Processing Systems, 93(8): 863-878 [DOI: 10.1007/s11265-020-01596-1]
Bhatnagar B L, Xie X H, Petrov I A, Sminchisescu C, Theobalt C and Pons-Moll G. 2022. BEHAVE: dataset and method for tracking human object interactions//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 15914-15925 [DOI: 10.1109/CVPR52688.2022.01547]
Bujimalla S, Subedar M and Tickoo O. 2020. B-SCST: Bayesian self-critical sequence training for image captioning [EB/OL]. [2022-06-09]. https://arxiv.org/pdf/2004.02435.pdf
Cao P P, Zhu Z Q, Wang Z Y, Zhu Y P and Niu Q. 2022. Applications of graph convolutional networks in computer vision. Neural Computing and Applications, 34(16): 13387-13405 [DOI: 10.1007/s00521-022-07368-1]
Chan D M, Myers A, Vijayanarasimhan S, Ross D A, Seybold B and Canny J F. 2022. What’s in a caption? Dataset-specific linguistic diversity and its effect on visual description models and metrics//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New Orleans, USA: IEEE: 4739-4748 [DOI: 10.1109/CVPRW56347.2022.00520]
Chen F H, Ji R R, Sun X S, Wu Y J and Su J S. 2018. GroupCap: group-based image captioning with structured relevance and diversity constraints//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1345-1353 [DOI: 10.1109/CVPR.2018.00146]
Chen L, Jiang Z H, Xiao J and Liu W. 2021. Human-like controllable image captioning with verb-specific semantic roles//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 16841-16851 [DOI: 10.1109/CVPR46437.2021.01657]
Chen L, Zhang H W, Xiao J, Nie L Q, Shao J, Liu W and Chua T S. 2017. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6298-6306 [DOI: 10.1109/CVPR.2017.667]
Chen S Z, Jin Q, Wang P and Wu Q. 2020. Say as you wish: fine-grained control of image caption generation with abstract scene graphs//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9959-9968 [DOI: 10.1109/CVPR42600.2020.00998]
Chen T L, Zhang Z Y, Cheng Y, Awadallah A and Wang Z Y. 2022a. The principle of diversity: training stronger vision transformers calls for reducing all levels of redundancy//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 12010-12020 [DOI: 10.1109/CVPR52688.2022.01171]
Chen X L, Fang H, Lin T Y, Vedantam R, Gupta S, Dollár P and Zitnick C L. 2015. Microsoft COCO captions: data collection and evaluation server [EB/OL]. [2022-06-09]. https://arxiv.org/pdf/1504.00325.pdf
Chen Y Z, Yang X H, Wei Z H, Heidari A A, Zheng N G, Li Z C, Chen H L, Hu H G, Zhou Q W and Guan Q. 2022b. Generative adversarial networks in medical image augmentation: a review. Computers in Biology and Medicine, 144: #105382 [DOI: 10.1016/j.compbiomed.2022.105382]
Cheng J, Wang L, Wu J J, Hu X P, Jeon G, Tao D C and Zhou M C. 2022. Visual relationship detection: a survey. IEEE Transactions on Cybernetics, 52(8): 8453-8466 [DOI: 10.1109/TCYB.2022.3142013]
Cornia M, Baraldi L and Cucchiara R. 2019. Show, control and tell: a framework for generating controllable and grounded captions//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8299-8308 [DOI: 10.1109/CVPR.2019.00850]
Dai B, Fidler S, Urtasun R and Lin D H. 2017. Towards diverse and natural image descriptions via a conditional GAN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2989-2998 [DOI: 10.1109/ICCV.2017.323]
Dai B and Lin D H. 2017. Contrastive learning for image captioning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 898-907
Deng C R, Ding N, Tan M K and Wu Q. 2020. Length-controllable image captioning//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 712-729 [DOI: 10.1007/978-3-030-58601-0_42]
Deshpande A, Aneja J, Wang L W, Schwing A G and Forsyth D. 2019. Fast, diverse and accurate image captioning guided by part-of-speech//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 10687-10696 [DOI: 10.1109/CVPR.2019.01095]
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota, USA: ACL: 4171-4186 [DOI: 10.18653/v1/N19-1423]
Dong X Z, Long C J, Xu W J and Xiao C X. 2021. Dual graph convolutional networks with transformer and curriculum learning for image captioning//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM: 2615-2624 [DOI: 10.1145/3474085.3475439]
Fei Z. 2021. Partially non-autoregressive image captioning//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 1309-1316 [DOI: 10.1609/aaai.v35i2.16219]
Fei Z. 2022. Attention-aligned transformer for image captioning//Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 607-615 [DOI: 10.1609/aaai.v36i1.19940]
Fei Z C. 2019. Fast image caption generation with position alignment [EB/OL]. [2022-06-09]. https://arxiv.org/pdf/1912.06365.pdf
Gan C, Gan Z, He X D, Gao J F and Deng L. 2017. Stylenet: generating attractive visual captions with styles//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 955-964 [DOI: 10.1109/CVPR.2017.108]
Gao J L, Meng X, Wang S Q, Li X, Wang S S, Ma S W and Gao W. 2019. Masked non-autoregressive image captioning [EB/OL]. [2022-06-09]. https://arxiv.org/pdf/1906.00717.pdf
Gu J T, Bradbury J, Xiong C M, Li V O K and Socher R. 2018. Non-autoregressive neural machine translation//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR [DOI: 10.48550/arxiv.1711.02281]
Guo L T, Liu J, Yao P, Li J W and Lu H Q. 2019. MSCap: multi-style image captioning with unpaired stylized text//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4199-4208 [DOI: 10.1109/CVPR.2019.00433]
Hafiz A M. 2022. Image classification by reinforcement learning with two-state Q-learning [EB/OL]. [2022-06-09]. https://arxiv.org/pdf/2007.01298.pdf
Han K, Wang Y H, Chen H T, Chen X H, Guo J Y, Liu Z H, Tang Y H, Xiao A, Xu C J, Xu Y X, Yang Z H, Zhang Y M and Tao D C. 2023. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 87-110 [DOI: 10.1109/TPAMI.2022.3152247]
He Z W, Wang X, Wang R, Shi S M and Tu Z P. 2022. Bridging the data gap between training and inference for unsupervised neural machine translation//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland: ACL: 6611-6623 [DOI: 10.18653/v1/2022.acl-long.456]
Hodosh M, Young P and Hockenmaier J. 2013. Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47(1): 853-899 [DOI: 10.1613/jair.3994]
Huang M B, Huang Z J, Li C L, Chen X, Xu H, Li Z G and Liang X D. 2022. Arch-graph: acyclic architecture relation predictor for task-transferable neural architecture search//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 11871-11881 [DOI: 10.1109/CVPR52688.2022.01158]
Huynh L, Nguyen P, Matas J, Rahtu E and Heikkilä J. 2022. Lightweight monocular depth with a novel neural architecture search method//Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 326-336 [DOI: 10.1109/WACV51458.2022.00040]
Jiang H Z, Misra I, Rohrbach M, Learned-Miller E and Chen X L. 2020. In defense of grid features for visual question answering//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10264-10273 [DOI: 10.1109/CVPR42600.2020.01028]
Jiang M, Hu J J, Huang Q Y, Zhang L, Diesner J and Gao J F. 2019a. REO-relevance, extraness, omission: a fine-grained evaluation for image captioning//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: ACL: 1475-1480 [DOI: 10.18653/v1/D19-1156]
Jiang M, Huang Q Y, Zhang L, Wang X, Zhang P C, Gan Z, Diesner J and Gao J F. 2019b. TIGEr: text-to-image grounding for image caption evaluation//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: ACL: 2141-2152 [DOI: 10.18653/v1/D19-1220]
Jiang X Z, Liang Y B, Chen W Z and Duan N. 2022. XLM-K: improving cross-lingual language model pre-training with multilingual knowledge//Proceedings of 2022 AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 10840-10848 [DOI: 10.1609/aaai.v36i10.21330]
Jiao L C, Zhang R H, Liu F, Yang S Y, Hou B, Li L L and Tang X. 2022. New generation deep learning for video object detection: a survey. IEEE Transactions on Neural Networks and Learning Systems, 33(8): 3195-3215 [DOI: 10.1109/TNNLS.2021.3053249]
Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S and Li F F. 2017. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32-73 [DOI: 10.1007/s11263-016-0981-7]
Li B, Xia F, Weng Y X, Sun B, Li S T and Huang X S. 2022a. PSG: prompt-based sequence generation for acronym extraction//Proceedings of the Workshop on Scientific Document Understanding Co-Located with the 36th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI
Li G D, Zhai Y C, Lin Z H and Zhang Y. 2021a. Similar scenes arouse similar emotions: parallel data augmentation for stylized image captioning//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM: 5363-5372 [DOI: 10.1145/3474085.3475662]
Li H Y, Wang N N, Zhu M R, Yang X and Gao X B. 2022. Recent advances in neural architecture search: a survey. Journal of Software, 33(1): 129-149
李航宇, 王楠楠, 朱明瑞, 杨曦, 高新波. 2022. 神经结构搜索的研究进展综述. 软件学报, 33(1): 129-149 [DOI: 10.13328/j.cnki.jos.006306]
Li N N and Chen Z Z. 2020. Learning compact reward for image captioning [EB/OL]. [2022-06-09]. https://arxiv.org/abs/2003.10925.pdf
Li Y H, Pan Y W, Yao T, Chen J W and Mei T. 2021b. Scheduled sampling in vision-language pretraining with decoupled encoder-decoder network//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 8518-8526 [DOI: 10.1609/aaai.v35i10.17034]
Li Y H, Yao T, Pan Y W, Chao H Y and Mei T. 2019. Pointing novel objects in image captioning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12489-12498 [DOI: 10.1109/CVPR.2019.01278]
Li Y W, Adamczewski K, Li W, Gu S H, Timofte R and Van Gool L. 2022b. Revisiting random channel pruning for neural network compression//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 191-201 [DOI: 10.1109/CVPR52688.2022.00029]
Lin C Y. 2004. ROUGE: a package for automatic evaluation of summaries [EB/OL]. [2022-06-09]. https://aclanthology.org/W04-1013.pdf
Lin X D, Bertasius G, Wang J, Chang S F, Parikh D and Torresani L. 2021. VX2TEXT: end-to-end learning of video-based text generation from multimodal inputs//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 7001-7011 [DOI: 10.1109/CVPR46437.2021.00693]
Liu F L, Ren X C, Liu Y X, Wang H F and Sun X. 2018. SimNet: stepwise image-topic merging network for generating detailed and comprehensive image captions//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: ACL: 137-149 [DOI: 10.18653/v1/D18-1013]
Lu J S, Batra D, Parikh D and Lee S. 2019. VILBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: #2
Luo Y P, Ji J Y, Sun X S, Cao L J, Wu Y J, Huang F Y, Lin C W and Ji R R. 2021. Dual-level collaborative transformer for image captioning//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 2286-2293
Mason R and Charniak E. 2014. Nonparametric method for data-driven image captioning//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, USA: ACL: 592-598
Mathews A, Xie L X and He X M. 2018. SemStyle: learning to generate stylised image captions using unaligned text//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8591-8600 [DOI: 10.1109/CVPR.2018.00896]
Mou C, Wang Q and Zhang J. 2022. Deep generalized unfolding networks for image restoration//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 17378-17389 [DOI: 10.1109/CVPR52688.2022.01688]
Paolicelli V, Tavera A, Masone C, Berton G and Caputo B. 2022. Learning semantics for visual place recognition through multi-scale attention//Proceedings of the 21st International Conference on Image Analysis and Processing. Lecce, Italy: Springer: 454-466 [DOI: 10.1007/978-3-031-06430-2_38]
Papineni K, Roukos S, Ward T and Zhu W J. 2002. BLEU: a method for automatic evaluation of machine translation//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, USA: ACL: 311-318 [DOI: 10.3115/1073083.1073135]
Plummer B A, Wang L W, Cervantes C M, Caicedo J C, Hockenmaier J and Lazebnik S. 2015. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2641-2649 [DOI: 10.1109/ICCV.2015.303]
Qi D, Su L, Song J, Cui E, Bharti T and Sacheti A. 2020. ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data [EB/OL]. [2022-06-09]. https://arxiv.org/pdf/2001.07966.pdf
Qin Y, Du J J, Zhang Y H and Lu H T. 2019. Look back and predict forward in image captioning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8359-8367 [DOI: 10.1109/CVPR.2019.00856]
Ren P Z, Xiao Y, Chang X J, Huang P Y, Li Z H, Chen X J and Wang X. 2022. A comprehensive survey of neural architecture search: challenges and solutions. ACM Computing Surveys, 54(4): #76 [DOI: 10.1145/3447582]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Rennie S J, Marcheret E, Mroueh Y, Ross J and Goel V. 2017. Self-critical sequence training for image captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1179-1195 [DOI: 10.1109/CVPR.2017.131]
Seo P H, Sharma P, Levinboim T, Han B and Soricut R. 2020. Reinforcing an image caption generator using off-line human feedback//Proceedings of 2020 AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 2693-2700 [DOI: 10.1609/aaai.v34i03.5655]
Sharma P, Ding N, Goodman S and Soricut R. 2018. Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne, Australia: ACL: 2556-2565 [DOI: 10.18653/v1/P18-1238]
Shetty R, Rohrbach M, Hendricks L A, Fritz M and Schiele B. 2017. Speaking the same language: matching machine to human captions by adversarial training//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4155-4164 [DOI: 10.1109/ICCV.2017.445]
Sidorov O, Hu R H, Rohrbach M and Singh A. 2020. TextCaps: a dataset for image captioning with reading comprehension//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 742-758 [DOI: 10.1007/978-3-030-58536-5_44]
Song Z L, Zhou X F, Dong L H, Tan J L and Guo L. 2021. Direction relation transformer for image captioning//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM: 5056-5064 [DOI: 10.1145/3474085.3475607]
Stefanini M, Cornia M, Baraldi L, Cascianelli S, Fiameni G and Cucchiara R. 2023. From show to tell: a survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 539-559 [DOI: 10.1109/TPAMI.2022.3148210]
Sun J X, Deng Q Y, Li Q, Sun M Y, Ren M and Sun Z A. 2022. AnyFace: free-style text-to-face synthesis and manipulation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 18666-18675 [DOI: 10.1109/CVPR52688.2022.01813]
Tang L, Li H X, Yan C Q, Zheng X W and Ji R R. 2021. Survey on neural architecture search. Journal of Image and Graphics, 26(2): 245-264
唐浪, 李慧霞, 颜晨倩, 郑侠武, 纪荣嵘. 2021. 深度神经网络结构搜索综述. 中国图象图形学报, 26(2): 245-264 [DOI: 10.11834/jig.200202]
Ushiku Y, Yamaguchi M, Mukuta Y and Harada T. 2015. Common subspace for model and similarity: phrase learning for caption generation from images//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2668-2676 [DOI: 10.1109/ICCV.2015.306]
Vedantam R, Zitnick C L and Parikh D. 2015. CIDEr: consensus-based image description evaluation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 4566-4575 [DOI: 10.1109/CVPR.2015.7299087]
Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3156-3164 [DOI: 10.1109/CVPR.2015.7298935]
Vo D M, Chen H, Sugimoto A and Nakayama H. 2022. NOC-REK: novel object captioning with retrieved vocabulary from external knowledge//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 17979-17987 [DOI: 10.1109/CVPR52688.2022.01747]
Waghmare P M and Shinde S V. 2022. Image caption generation using neural network models and LSTM hierarchical structure//Das A K, Nayak J, Naik B, Dutta S and Pelusi D, eds. Computational Intelligence in Pattern Recognition. Singapore: Springer: 109-117 [DOI: 10.1007/978-981-16-2543-5_10]
Wang B, Huang M, Liu L J, Huang Q S and Shan W Q. 2022. Multi-layer focused inception-V3 models for fine-grained visual recognition. Acta Electronica Sinica, 50(1): 72-78
王波, 黄冕, 刘利军, 黄青松, 单文琦. 2022. 基于多层聚焦Inception-V3卷积网络的细粒度图像分类. 电子学报, 50(1): 72-78 [DOI: 10.12263/DZXB.20200443]
Wang J N, Xu W J, Wang Q Z and Chan A B. 2021a. Group-based distinctive image captioning with memory attention//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM: 5020-5028 [DOI: 10.1145/3474085.3475215]
Wang Q Z and Chan A B. 2018. CNN+CNN: convolutional decoders for image captioning [EB/OL]. [2022-06-09]. https://arxiv.org/pdf/1805.09019.pdf
Wang S J, Yao Z W, Wang R P, Wu Z Q and Chen X L. 2021b. FAIEr: fidelity and adequacy ensured image caption evaluation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 14045-14054 [DOI: 10.1109/CVPR46437.2021.01383]
Wang X, Chen Y D and Zhu W W. 2022. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9): 4555-4576 [DOI: 10.1109/TPAMI.2021.3069908]
Wang Z Q, Zhang Y S, Yu Y, Min J and Tian H. 2022. Review of deep learning based salient object detection. Journal of Image and Graphics, 27(7): 2112-2128
王自全, 张永生, 于英, 闵杰, 田浩. 2022. 深度学习背景下视觉显著性物体检测综述. 中国图象图形学报, 27(7): 2112-2128 [DOI: 10.11834/jig.200649]
Wang Z W, Huang Z and Luo Y. 2020. Human consensus-oriented image captioning//Proceedings of the 29th International Joint Conference on Artificial Intelligence. Yokohama, Japan: IJCAI: 659-665
Xu G H, Niu S C, Tan M K, Luo Y C, Du Q and Wu Q. 2021a. Towards accurate text-based image captioning with content diversity exploration//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 12632-12641 [DOI: 10.1109/CVPR46437.2021.01245]
Xu K, Ba J L, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R S and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention//Proceedings of the 32nd International Conference on International Conference on Machine Learning. Lille, France: JMLR.org: 2048-2057
Xu L Y, Zhang X C, Zhao X J, Chen H F, Chen F and Choi J D. 2021b. Boosting cross-lingual transfer via self-learning with uncertainty estimation//Proceedings of 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, USA: ACL: 6716-6723 [DOI: 10.18653/v1/2021.emnlp-main.538]
Xu R X, Luo F L, Wang C Y, Chang B B, Huang J, Huang S F and Huang F. 2022. From dense to sparse: contrastive pruning for better pre-trained language model compression//Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 11547-11555 [DOI: 10.1609/aaai.v36i10.21408]
Yan K, Ji L, Luo H S, Zhou M, Duan N and Ma S. 2021a. Control image captioning spatially and temporally//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Stroudsburg, USA: ACL: 2014-2025 [DOI: 10.18653/v1/2021.acl-long.157]
Yan X, Fei Z C, Li Z K, Wang S H, Huang Q M and Tian Q. 2021b. Semi-autoregressive image captioning//Proceedings of the 29th ACM International Conference on Multimedia. Lisbon, Portugal: ACM: 2708-2716 [DOI: 10.1145/3474085.3475179]
Yang X, Tang K H, Zhang H W and Cai J F. 2019. Auto-encoding scene graphs for image captioning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 10677-10686 [DOI: 10.1109/CVPR.2019.01094]
Yang X, Wang S S, Dong J, Dong J F, Wang M and Chua T S. 2022. Video moment retrieval with cross-modal neural architecture search. IEEE Transactions on Image Processing, 31: 1204-1216 [DOI: 10.1109/TIP.2022.3140611]
Yang X W, Zhang H M, Jin D, Liu Y R, Wu C H, Tan J C, Xie D L, Wang J and Wang X. 2020. Fashion captioning: towards generating accurate descriptions with semantic rewards//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 1-17 [DOI: 10.1007/978-3-030-58601-0_1]
Yang Y Z, Teo C L, Daumé H and Aloimonos Y. 2011. Corpus-guided sentence generation of natural images//Proceedings of 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, UK: ACL: 444-454
Yao L L, Wang W Y and Jin Q. 2022. Image difference captioning with pre-training and contrastive learning//Proceedings of the 36th AAAI Conference on Artificial Intelligence. [s.l.]: AAAI: 3108-3116
Yin G J, Sheng L, Liu B, Yu N H, Wang X G and Shao J. 2019. Context and attribute grounded dense captioning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6234-6243 [DOI: 10.1109/CVPR.2019.00640]
Yin Y H, Huang S Y and Zhang X. 2022. BM-NAS: bilevel multimodal neural architecture search//Proceedings of the 36th AAAI Conference on Artificial Intelligence. [s.l.]: AAAI: 8901-8909
Yu H B, Luo Y Z, Shu M, Huo Y Y, Yang Z B, Shi Y F, Guo Z L, Li H Y, Hu X, Yuan J R and Nie Z Q. 2022. DAIR-V2X: a large-scale dataset for vehicle-infrastructure cooperative 3D object detection//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 21329-21338 [DOI: 10.1109/CVPR52688.2022.02067]
Zhang T J, Yin F and Luo Z Q. 2022a. Fast generic interaction detection for model interpretability and compression//Proceedings of the 10th International Conference on Learning Representations. [s.l.]: ICLR
Zhang X Y, Sun X S, Luo Y P, Ji J Y, Zhou Y Y, Wu Y J, Huang F Y and Ji R R. 2021. RSTNet: captioning with adaptive attention on visual and non-visual words//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 15460-15469 [DOI: 10.1109/CVPR46437.2021.01521]
Zhang Y F, Jiang M and Zhao Q. 2022b. Query and attention augmentation for knowledge-based explainable reasoning//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 15555-15564 [DOI: 10.1109/CVPR52688.2022.01513]
Zhang Z Z, Zhang H, Zhao L, Chen T, Arik S Ö and Pfister T. 2022c. Nested hierarchical transformer: towards accurate, data-efficient and interpretable visual understanding//Proceedings of the 36th AAAI Conference on Artificial Intelligence. [s.l.]: AAAI: 3417-3425 [DOI: 10.1609/aaai.v36i3.20252]
Zhao B R, Cui Q, Song R J, Qiu Y Y and Liang J J. 2022. Decoupled knowledge distillation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 11943-11952 [DOI: 10.1109/CVPR52688.2022.01165]
Zheng Y, Li Y L and Wang S J. 2019. Intention oriented image captions with guiding objects//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8387-8396 [DOI: 10.1109/CVPR.2019.00859]
Zhou Y N, Wang M, Liu D Q, Hu Z Z and Zhang H W. 2020. More grounded image captioning by distilling image-text matching model//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 4776-4785 [DOI: 10.1109/CVPR42600.2020.00483]
Zhou Y N, Zhang Y, Hu Z Z and Wang M. 2021. Semi-autoregressive transformer for image captioning//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 3132-3136 [DOI: 10.1109/ICCVW54120.2021.00350]