From image to language: image captioning and description
Vol. 26, Issue 4, Pages: 727-750 (2021)
Published: 16 April 2021
Accepted: 03 August 2020
DOI: 10.11834/jig.200177
Yunlan Tan, Pengjie Tang, Li Zhang, Yupan Luo. From image to language: image captioning and description [J]. Journal of Image and Graphics, 26(4): 727-750 (2021)
The task of image captioning and description is to have a computer automatically re-express an image in the form of natural language. This research has broad application prospects in fields such as visual assistance for humans and the development of intelligent human-computer environments, and it also supports work on image retrieval, high-level visual semantic reasoning, and personalized description. Image data are highly nonlinear and complex, whereas human natural language is abstract and logically rigorous, so it is very challenging for a computer to abstract and summarize image content automatically. This paper describes the task of simple image captioning and description, analyzes simple description methods based on handcrafted features, and reviews and summarizes deep feature-based methods, including those based on global visual features, visual feature selection and optimization, and optimization strategies. For the task of fine-grained description, the main current models and methods for image dense captioning and structured description are analyzed. In addition, image description methods that integrate sentimental information and personalized expression are examined. In the course of this analysis, the shortcomings of current image captioning and description methods are pointed out, and possible research trends and solutions are proposed. Commonly used datasets in this field, such as MS COCO 2014 (Microsoft common objects in context) and Flickr30K, are introduced in detail, and the performance of representative models for simple description, dense and paragraph description, and sentimental description on these datasets is compared and analyzed. Owing to the complexity of visual data and the abstractness of natural language, many problems remain to be solved, especially for image description with sentiment and personalized expression, in areas such as feature extraction and representation, the selection and embedding of semantic words, dataset construction, and description evaluation.
Image captioning and description belong to high-level visual understanding: an image is translated into natural language with appropriate words, suitable sentence patterns, and correct grammar. The task is interesting and has wide application prospects in early education, assistance for the visually impaired, automatic explanation and reminding, the development of intelligent interactive environments, and even the design of intelligent robots. It also supports research on image retrieval, object detection, visual semantic reasoning, and personalized description. At present, the task has attracted the attention of many researchers, and a large number of effective models have been proposed and developed. However, the task is difficult and challenging because a model has to bridge visual information and natural language and close the semantic gap between data of different modalities. In this work, the development timeline, popular frameworks and models, frequently used datasets, and the corresponding performance of image captioning and description methods are surveyed comprehensively. In addition, the remaining questions and limitations of current works are investigated and analyzed in depth. Overall, this survey covers four parts: 1) simple image captioning and description (generally, one sentence is generated per image), including handcrafted feature-based methods and deep feature-based approaches; 2) image dense captioning (generally, multiple but relatively independent sentences are generated) and refined paragraph description (generally, a paragraph with a certain structure and logic is generated); 3) personalized and sentimental image captioning and description (generally, sentences with a personalized style and sentimental words are generated); and 4) the corresponding evaluation datasets, metrics, and performances of popular models.
For the first part, the research history of image captioning and description is introduced, including the template-based and visual-semantic retrieval-based frameworks built on handcrafted visual features. Classical and significant works, such as the semantic space sharing model and the visual semantic component reorganization model, are described in detail. Then, the current popular works based on deep learning techniques are carefully sorted out and elaborated. According to how visual information is used, deep feature-based models for image captioning and description can be classified into three categories: 1) global visual feature-based models, 2) visual feature selection and optimization-based models, and 3) optimization strategy-oriented models.
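To make the first category concrete, a minimal sketch of a global visual feature-based encoder-decoder is given below: a CNN encodes the whole image into a single feature vector that conditions an LSTM language decoder. This is an illustrative skeleton written against PyTorch/torchvision, not a reproduction of any specific published model; the class name and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class GlobalFeatureCaptioner(nn.Module):
    """Illustrative CNN encoder + LSTM decoder conditioned on one global image feature."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet152(weights=None)                       # encoder backbone (pretraining omitted here)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # keep everything up to global average pooling
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)   # project the 2048-d feature to embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).flatten(1)                      # one global vector per image: [B, 2048]
        feat = self.img_proj(feat).unsqueeze(1)                     # [B, 1, embed_dim]
        # Feed the image feature as the first step, then the shifted word embeddings (teacher forcing).
        inputs = torch.cat([feat, self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                     # word logits at every time step
```

Training such a model typically minimizes the cross-entropy between these logits and the ground-truth caption words.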
For each category, the representative works, including the proposed models, their strengths, and their remaining problems, are analyzed and discussed. Models based on selected or optimized visual features, such as attention regions, attributes, and concepts used as prior knowledge, are usually more intuitive and achieve better performance, especially when advanced optimization strategies such as reinforcement learning are employed, in which case the generated sentences tend to contain more accurate words and richer semantics; nevertheless, a few methods based only on global visual features perform comparably well.
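As an illustration of the optimization strategy-oriented line of work, the sketch below follows the general self-critical policy-gradient idea often used to directly optimize a non-differentiable sentence-level reward such as CIDEr. The helpers `model.sample`, `model.greedy_decode`, and `sentence_reward` are hypothetical placeholders, not any library's API.

```python
import torch

def self_critical_loss(model, images, references, sentence_reward):
    """REINFORCE with a greedy-decoding baseline (the core of self-critical training).

    Assumed, hypothetical helpers: `model.sample` returns sampled token ids and their
    summed log-probabilities; `model.greedy_decode` returns greedily decoded ids;
    `sentence_reward` scores captions against the references (e.g., CIDEr).
    """
    sampled_ids, log_probs = model.sample(images)       # stochastic captions and log p(caption), shape [B]
    with torch.no_grad():
        greedy_ids = model.greedy_decode(images)        # baseline captions
    reward = sentence_reward(sampled_ids, references)   # [B]
    baseline = sentence_reward(greedy_ids, references)  # [B]
    advantage = reward - baseline                       # positive when sampling beats greedy decoding
    return -(advantage.detach() * log_probs).mean()     # gradient ascent on expected reward
```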
Beyond simple captioning and description, popular works on dense captioning and refined description for images are presented and sorted out in the second part. Dense captioning models generate more sentences per image and offer more detailed descriptions. However, the semantic relevance among different visual objects, scenes, and actions is usually ignored and not embedded into the sentences, although a few approaches exploit possible visual relations to predict more accurate words. For refined paragraph description, the hierarchical architecture with multiple recurrent neural network layers is the most widely employed basic framework, and hierarchical attention mechanisms, visual attributes, and reinforcement learning strategies have also been introduced into the related models to further improve performance. Nevertheless, the semantic relevance and logic among different visual objects remain to be further explored and represented, and the coherence and logicality of the generated paragraphs need to be further polished and refined.
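The hierarchical recurrent framework mentioned above can be sketched as follows: a sentence-level RNN emits one topic vector (and a stop signal) per sentence from a pooled image feature, and a word-level RNN expands each topic into a sentence. This is a rough, assumed skeleton for illustration, not a faithful reproduction of any particular paragraph-generation model.

```python
import torch
import torch.nn as nn

class HierarchicalParagraphDecoder(nn.Module):
    """Sentence-level RNN emits topic vectors; a word-level RNN turns each topic into words."""
    def __init__(self, vocab_size, feat_dim=2048, topic_dim=512, hidden_dim=512, max_sents=6):
        super().__init__()
        self.max_sents = max_sents
        self.sent_rnn = nn.LSTMCell(feat_dim, hidden_dim)      # advances one step per sentence
        self.topic = nn.Linear(hidden_dim, topic_dim)           # hidden state -> sentence topic vector
        self.stop = nn.Linear(hidden_dim, 1)                    # logit for "stop generating sentences"
        self.embed = nn.Embedding(vocab_size, topic_dim)
        self.word_rnn = nn.LSTM(topic_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, pooled_feat, sent_tokens):
        # pooled_feat: [B, feat_dim] pooled image/region feature; sent_tokens: [B, S, T] ground-truth words.
        B, S, T = sent_tokens.shape
        h = pooled_feat.new_zeros(B, self.sent_rnn.hidden_size)
        c = torch.zeros_like(h)
        logits, stops = [], []
        for s in range(min(S, self.max_sents)):
            h, c = self.sent_rnn(pooled_feat, (h, c))            # sentence-level state update
            stops.append(self.stop(h))                           # supervised with continue/stop labels
            topic = self.topic(h).unsqueeze(1)                   # [B, 1, topic_dim]
            words = self.embed(sent_tokens[:, s, :-1])           # teacher forcing within sentence s
            out, _ = self.word_rnn(torch.cat([topic, words], 1))
            logits.append(self.word_out(out))                    # word logits for sentence s
        return torch.stack(logits, 1), torch.cat(stops, 1)       # per-sentence word logits, stop logits
```

In practice, the word-level outputs are trained with per-sentence cross-entropy, while the stop logits let the model learn how many sentences a paragraph should contain.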
Additionally, given the human habit of describing an image, personal experience is usually embedded into the description, so the generated sentences often contain personalized and sentimental information. Therefore, several significant works on personalized image captioning and sentimental description are also introduced and discussed in this paper. In particular, the discovery, representation, and embedding of personalized information and sentiment in these models are surveyed and analyzed in depth. The limitations and open problems of this task, including the granularity and intensity of sentiments and the evaluation metrics for personalized and sentimental description, are worthy of further research and exploration.
In addition to the classical frameworks and popular models, the related public evaluation datasets and metrics are summarized and presented. First, the datasets for simple image captioning and description, including Microsoft common objects in context (MS COCO 2014), Flickr30K, and Flickr8K, and the performance of several popular models on them are briefly introduced. Afterward, the datasets for image dense captioning and paragraph description, including Visual Genome and VG-P (Paragraph), and the performance of current works on them are described. Next, the datasets for image description with personalized and sentimental expression, including SentiCap and FlickrStyle10K, are briefly introduced, and the performance of the main models is reported and discussed. Finally, the frequently used evaluation methods, including traditional metrics and specially targeted metrics, are described and compared.
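As a small illustration of how the traditional n-gram-overlap metrics in this family work, the function below computes a clipped (modified) n-gram precision of a candidate caption against its reference captions; it is a simplified teaching sketch without brevity penalty or geometric averaging, not a substitute for the official evaluation toolkits.

```python
from collections import Counter

def clipped_ngram_precision(candidate, references, n=2):
    """Modified n-gram precision: candidate n-gram counts are clipped by the
    maximum count of that n-gram in any reference (the core idea behind BLEU-style metrics)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Example: one candidate caption scored against two tokenized references.
cand = "a man rides a brown horse".split()
refs = ["a man is riding a horse".split(), "a person rides a brown horse".split()]
print(clipped_ngram_precision(cand, refs, n=2))  # bigram precision in [0, 1]
```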
In conclusion, breakthroughs have been made in image captioning and description in recent years, and the quality of generated sentences has greatly improved. However, more effort is still needed to generate more coherent and accurate sentences with richer semantics. Possible trends and solutions for image captioning and description are reconsidered and put forward in this study. To push the task toward practical applications, the semantic gap between visual data and natural language should be narrowed by generating structured paragraphs with sentiment and logical semantics for images. Several problems, including visual feature refinement and usage, sentiment and logic mining and embedding, the collection of suitable training datasets, and the design of metrics for evaluating personalization, sentiment, and paragraph description, remain to be addressed.
Keywords: image captioning; deep feature; visual description; paragraph generation; image sentiment; logical semantics