Deep-learning-based image captioning: analysis and prospects

Zhao Yongqiang1,2, Jin Zhi1,2, Zhang Feng3, Zhao Haiyan1,2, Tao Zhengwei1,2, Dou Chengfeng1,2, Xu Xinhai3, Liu Donghong3 (1. School of Computer Science, Peking University, Beijing 100871, China; 2. Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China; 3. Academy of Military Sciences, Beijing 100097, China)

Abstract

Image captioning is the task of having a computer automatically generate a complete, fluent caption that suits the scene in a given image, thereby realizing a multimodal conversion from image to text. Describing the visual content of an image accurately and quickly is a fundamental goal of artificial intelligence and has a wide range of applications in research and production. Image captioning can be applied to many aspects of social development, such as text captions for images and videos, visual question answering, visual storytelling, web image analysis, and keyword-based image search. Image captions can also assist individuals born with visual impairments, making the computer another pair of eyes for them. The accuracy and inference speed of image captioning algorithms have improved greatly with the wide application of deep learning. On the basis of an extensive literature review, we find that deep-learning-based image captioning algorithms still face five key technical challenges: delivering rich feature information, solving the exposure bias problem, generating diverse captions, making caption generation controllable, and improving inference speed.

The main framework of image captioning models is the encoder-decoder architecture. An encoder first converts the input image into a fixed-length feature vector, and a decoder then converts that feature vector into an image caption. Therefore, the richer the feature information available to the model, the higher its accuracy and the better the generated caption. According to the research ideas behind the existing algorithms, this study reviews image captioning algorithms that deliver rich feature information from three aspects: attention mechanisms, pretraining models, and multimodal models.

Many image captioning algorithms cannot keep the training and prediction processes of a model consistent, which gives rise to exposure bias. When a model has exposure bias, errors accumulate during word generation, so the subsequent words become biased and the accuracy of the captioning model is seriously affected. According to the problem-solving methods used, this study reviews research on the exposure bias problem in image captioning from three perspectives: reinforcement learning, nonautoregressive models, and curriculum learning and scheduled sampling.

Image captioning is inherently ambiguous because multiple suitable captions may exist for one image. Existing image captioning methods tend to use common high-frequency expressions to generate relatively "safe" sentences. The resulting captions are relatively simple and empty and lack critical details, which easily leads to a lack of diversity. According to the research ideas involved, this study reviews existing methods for generating diverse image captions from three aspects: graph convolutional neural networks, generative adversarial networks, and data augmentation.

The majority of current image captioning models lack controllability, which distinguishes them from human intelligence. Researchers have proposed algorithms that address this problem by actively controlling caption generation. These algorithms fall mainly into two categories: content-controlled image captioning and style-controlled image captioning. Content-controlled image captioning aims to control which image content is described, such as particular areas or objects of the image, so that the model can describe the content in which users are interested. Style-controlled image captioning aims to generate captions in different styles, such as humorous, romantic, and antique. This study reviews the algorithms for both content-controlled and style-controlled image captioning.

Existing image captioning models are mostly encoder-decoder architectures. The encoder stage uses visual feature extraction based on a convolutional neural network, whereas the decoder stage uses a method based on a recurrent neural network. According to the existing research ideas, methods for improving the inference speed of image captioning models are divided into three categories. The first category uses nonautoregressive models. The second category uses grid-based visual features. The third category uses a decoder based on a convolutional neural network.

In addition, this study provides a detailed introduction to the general datasets and evaluation metrics used in image captioning. The general datasets mainly include Flickr8K, Flickr30K, MS COCO (Microsoft Common Objects in Context), TextCaps, Localized Narratives, and Nocaps. The evaluation metrics mainly include the following: bilingual evaluation understudy (BLEU); recall-oriented understudy for gisting evaluation (ROUGE); metric for evaluation of translation with explicit ordering (METEOR); consensus-based image description evaluation (CIDEr); semantic propositional image caption evaluation (SPICE); compact bilinear pooling; text-to-image grounding for image caption evaluation; relevance, extraness, omission; and fidelity and adequacy ensured.

Finally, this study discusses in depth the open problems and future research directions in image captioning: how to improve the performance of visual feature extraction, how to improve the diversity of image captions, how to improve the interpretability of deep learning models, how to realize transfer between multiple languages, how to automatically generate or design the optimal network architecture, and how to develop datasets and evaluation metrics that are suitable for image captioning. Image captioning is a research hot spot in computer vision and natural language processing. At present, many algorithms for solving different problems are proposed annually, and other research directions will be developed in the future.
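
As a small illustration of the n-gram-overlap idea behind the BLEU metric named above, the following Python sketch computes the clipped (modified) n-gram precision of a candidate caption against a set of reference captions. It is a minimal sketch, not the evaluation code used in this study: the function names `ngrams` and `clipped_precision` and the example captions are hypothetical, and the full BLEU score additionally combines several n-gram orders with a geometric mean and applies a brevity penalty, both of which are omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the largest count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    # For every n-gram, record the maximum count seen in one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical captions, invented for illustration only.
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "the dog runs across a field".split(),
]

unigram_p = clipped_precision(candidate, references, 1)
bigram_p = clipped_precision(candidate, references, 2)
print(f"unigram precision: {unigram_p:.2f}, bigram precision: {bigram_p:.2f}")
```

The clipping step is what prevents a degenerate caption that repeats one frequent word from scoring well: a candidate made of seven copies of "the" matched against the reference "the cat is on the mat" receives a unigram precision of 2/7 rather than 7/7.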