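Deep-learning-based image captioning: analysis and prospects

Zhao Yongqiang1,2, Jin Zhi1,2, Zhang Feng3, Zhao Haiyan1,2, Tao Zhengwei1,2, Dou Chengfeng1,2, Xu Xinhai3, Liu Donghong3 (1. School of Computer Science, Peking University, Beijing 100871; 2. Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871; 3. Academy of Military Sciences, Beijing 100097)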

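Abstract
Image captioning is the task of using a computer to automatically generate, for a given image, a complete and fluent caption appropriate to the depicted scene, thereby converting information across modalities from image to text. With the wide adoption of deep learning, both the accuracy and the inference speed of image captioning algorithms have improved greatly. Based on an extensive literature survey, this paper divides research on deep-learning-based image captioning into two levels: building the basic capability of image captioning, and studying its effectiveness in applications. These two levels can be further broken down into core technical challenges: conveying richer feature information, addressing exposure bias, generating diverse captions, making captions controllable, and increasing inference speed. For these challenges, the paper analyzes methods for conveying richer feature information from the perspectives of attention mechanisms, pretrained models, and multimodal models; methods for addressing exposure bias from the perspectives of reinforcement learning, non-autoregressive models, and curriculum learning with scheduled sampling; methods for generating diverse captions from the perspectives of graph convolutional neural networks, generative adversarial networks, and data augmentation; methods for controllable captioning from the perspectives of content control and style control; and methods for increasing inference speed from the perspectives of non-autoregressive models, grid-based visual features, and convolutional-neural-network-based decoders. In addition, the paper gives a detailed account of the common datasets, evaluation metrics, and performance of existing algorithms in image captioning, and anticipates open problems and future research trends.
Keywords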
Deep-learning-based image captioning: analysis and prospects

Zhao Yongqiang1,2, Jin Zhi1,2, Zhang Feng3, Zhao Haiyan1,2, Tao Zhengwei1,2, Dou Chengfeng1,2, Xu Xinhai3, Liu Donghong3 (1. School of Computer Science, Peking University, Beijing 100871, China; 2. Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China; 3. Academy of Military Sciences, Beijing 100097, China)

Abstract
Image captioning is the task of having a computer automatically generate a complete, fluent caption that suits the scene of a given image, thereby realizing cross-modal conversion from image to text. Describing the visual content of an image accurately and quickly is a fundamental goal of artificial intelligence and has a wide range of applications in research and production. Image captioning can be applied to text captions for images and videos, visual question answering, storytelling from images, web image analysis, and keyword-based image search. Image captions can also assist individuals born with visual impairments, making the computer another pair of eyes for them. With the wide application of deep learning, the accuracy and inference speed of image captioning algorithms have improved greatly. On the basis of an extensive literature survey, we find that deep-learning-based image captioning algorithms still face key technical challenges: delivering rich feature information, solving the exposure bias problem, generating diverse image captions, making image captions controllable, and improving inference speed.

The main framework of image captioning models is the encoder-decoder architecture. First, an encoder converts the input image into a fixed-length feature vector. Then, a decoder converts this fixed-length feature vector into an image caption. Consequently, the richer the feature information carried by the model, the higher its accuracy and the better the generated caption. Following the different research ideas of existing algorithms, this study reviews image captioning algorithms that deliver rich feature information from three aspects: attention mechanisms, pretrained models, and multimodal models.

Many image captioning algorithms cannot keep the training and prediction processes of a model consistent, so the model suffers from exposure bias. When a model has exposure bias, errors accumulate during word generation: the words that follow an error become biased as well, which seriously affects the accuracy of the image captioning model. According to the different problem-solving methods, this study reviews research on solving the exposure bias problem in image captioning from three perspectives: reinforcement learning, non-autoregressive models, and curriculum learning with scheduled sampling.

Image captioning is an ambiguous problem because multiple suitable captions may exist for one image. Existing image captioning methods tend to use common high-frequency expressions to generate relatively "safe" sentences. The resulting captions are relatively simple and empty and lack critical details, which easily leads to a lack of diversity in image captions. According to the different research ideas, this study reviews existing methods for generating diverse image captions from three aspects: graph convolutional neural networks, generative adversarial networks, and data augmentation.

The majority of current image captioning models lack controllability, which differentiates them from human intelligence. Researchers have proposed algorithms that address this problem by actively controlling caption generation. These algorithms fall into two main categories: content-controlled image captions and style-controlled image captions. Content-controlled image captions aim to control which image content is described, such as different areas or objects of the image, so that the model can describe the content in which users are interested. Style-controlled image captions aim to generate captions in different styles, such as humorous, romantic, and antique. This study reviews the algorithms related to both content-controlled and style-controlled image captions.

Existing image captioning models are mostly encoder-decoder architectures. The encoder stage uses a convolutional-neural-network-based visual feature extraction method, whereas the decoder stage uses a recurrent-neural-network-based method. According to the different existing research ideas, methods for improving the inference speed of image captioning models are divided into three categories. The first category uses non-autoregressive models, the second uses grid-based visual features, and the third uses a convolutional-neural-network-based decoder.

In addition, this study provides a detailed introduction to the general datasets and evaluation metrics in image captioning. The general datasets mainly include Flickr8K, Flickr30K, MS COCO (Microsoft common objects in context), TextCaps, Localized Narratives, and Nocaps. The evaluation metrics mainly include the following: bilingual evaluation understudy (BLEU); recall-oriented understudy for gisting evaluation (ROUGE); metric for evaluation of translation with explicit ordering (METEOR); consensus-based image description evaluation (CIDEr); semantic propositional image caption evaluation (SPICE); compact bilinear pooling; text-to-image grounding for image caption evaluation; relevance, extraness, omission; and fidelity and adequacy ensured.

Finally, this study discusses in depth the problems to be solved and the future research directions in image captioning: how to improve the performance of visual feature extraction, how to improve the diversity of image captions, how to improve the interpretability of deep learning models, how to realize transfer between multiple languages in image captions, how to automatically generate or design the optimal network architecture, and how to develop datasets and evaluation metrics that are suitable for image captions. Image captioning is a research hot spot in computer vision and natural language processing. Many algorithms addressing different problems are proposed every year, and further research directions are expected to emerge.
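
As an illustration of the exposure bias problem and of scheduled sampling named in the abstract, the following is a minimal Python sketch, not the method of any surveyed algorithm. It assumes an inverse-sigmoid decay schedule with an illustrative constant k = 100; the function names `truth_feed_probability` and `mix_decoder_inputs` and the toy word lists are hypothetical. Early in training the decoder is mostly fed ground-truth previous words; later it is increasingly fed its own predictions, which brings training closer to the conditions met at prediction time.

```python
import math
import random


def truth_feed_probability(step: int, k: float = 100.0) -> float:
    """Inverse-sigmoid decay: probability of feeding the ground-truth
    previous word at a given training step (k is an assumed constant)."""
    return k / (k + math.exp(step / k))


def mix_decoder_inputs(gold_prev, model_prev, p_truth, rng):
    """For each position, feed the ground-truth previous word with
    probability p_truth, otherwise the model's own previous prediction."""
    return [
        gold if rng.random() < p_truth else pred
        for gold, pred in zip(gold_prev, model_prev)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    gold = ["a", "dog", "runs", "on", "grass"]   # toy reference caption
    model = ["a", "cat", "sits", "on", "grass"]  # toy model predictions
    for step in (0, 500, 1000):
        p = truth_feed_probability(step)
        print(step, round(p, 3), mix_decoder_inputs(gold, model, p, rng))
```

With pure teacher forcing (probability 1.0) the decoder never sees its own mistakes during training, which is the mismatch the abstract calls exposure bias; lowering the probability over time is one way to reduce that mismatch.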
Keywords
