Image description generation method based on inhibitor learning

Du Haijun, Liu Xueliang (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)

Abstract
Objective Image captioning, a popular research area at the intersection of computer vision and natural language processing, aims to generate sentences that accurately express the content of an image. In previously proposed methods, the generated sentences suffer from inaccurate descriptions and a lack of coherence. To address this, we propose a novel fusion training method based on the encoder-decoder framework and generative adversarial networks, which improves the accuracy and coherence of the generated sentences by optimizing the generated captions both globally and locally. Method A convolutional neural network is used as the encoder to extract image features, and the extracted features, together with the ground-truth description of the image, serve as the input of the decoder. A long short-term memory (LSTM) network is used as the decoder to generate captions. At each decoding step, the ground-truth description and the word generated at the previous step are used, respectively, as the input of the next step, so two captions are generated simultaneously. We compute the similarity between the caption generated from the ground-truth description and the ground truth itself, as well as the score that the discriminator assigns to the caption generated from the previous step's output. The two are combined into a new fused optimization function that guides the training of the generator. Result On the CUB-200 dataset, compared with the method without the inhibitor, our method improves the scores on six evaluation metrics, BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR, by 0.8%, 1.2%, 1.6%, 0.9%, 1.8%, and 1.0%, respectively. On the Oxford-102 dataset, compared with the method without the inhibitor, our method improves the scores on seven evaluation metrics, CIDEr, BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR, by 3.8%, 1.5%, 1.7%, 1.4%, 1.5%, 0.5%, and 0.1%, respectively. On the MSCOCO dataset, our method achieves the best values on the BLEU-2 and BLEU-3 metrics, 50.4% and 36.8%, respectively. Conclusion The proposed method takes the sequential usage relationships between the words of a caption into account and uses an inhibitor to optimize the local information of the caption, effectively alleviating the low accuracy and coherence of captions generated by previous methods, and can be well applied to image understanding and image caption generation.
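As a concrete illustration of the two decoding modes described above, the following PyTorch sketch runs one LSTM decoder either teacher-forced on the ground truth (the inhibitor path) or fed with its own previous prediction (the generator path). This is a minimal sketch under stated assumptions, not the authors' implementation; the names, the single-layer LSTMCell, and the greedy argmax feedback are all illustrative choices.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """One LSTM decoder used in two modes: teacher-forced (inhibitor)
    and free-running (generator); the two modes share all parameters."""
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cell = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab_size)
        self.dim = dim

    def forward(self, img_feat, gt_words=None, steps=20):
        if gt_words is not None:
            steps = gt_words.size(1)       # follow the ground-truth length
        h = torch.zeros(img_feat.size(0), self.dim)
        c = torch.zeros(img_feat.size(0), self.dim)
        x, logits = img_feat, []           # the image feature starts decoding
        for t in range(steps):
            h, c = self.cell(x, (h, c))
            step_logits = self.out(h)
            logits.append(step_logits)
            if gt_words is not None:
                # Inhibitor path: next input is the ground-truth word.
                x = self.embed(gt_words[:, t])
            else:
                # Generator path: next input is the model's own prediction.
                x = self.embed(step_logits.argmax(dim=-1))
        return torch.stack(logits, dim=1)  # (batch, steps, vocab_size)
```

Both captions of the abstract then come from the same module: decoder(feat, gt_words) yields the inhibitor (teacher-forced) output, while decoder(feat) yields the generator output that is passed to the discriminator.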
Keywords
Image description generation method based on inhibitor learning

Du Haijun, Liu Xueliang(School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)

Abstract
Objective Image description, a popular research field in computer vision and natural language processing, focuses on generating sentences that accurately describe the image content. Image description has wide applications in infant education, film production, and road navigation. Previous deep learning-based methods for image description have achieved great success. In the encoder-decoder framework, a convolutional neural network is used as the feature extractor, a recurrent neural network is used as the caption generator, and cross entropy is applied to calculate the generation loss. However, the descriptions produced by these methods are often incoherent and inaccurate. Some researchers have proposed regularization methods based on the encoder-decoder framework to strengthen the relationship between the image and the generated description. However, the generated descriptions remain incoherent because local information and high-level semantic concepts are missed. Therefore, we propose a novel fusion training method based on the encoder-decoder framework and generative adversarial networks, in which global and local information are handled by the generator (scored by a discriminator) and the inhibitor, respectively. This method encourages human-level linguistic coherence while keeping the semantic concepts of the image and its description close. Method The model is composed of an image feature extractor, an inhibitor, a generator, and a discriminator. First, ResNet-152 is used as the image feature extractor. In ResNet-152, a key module, the bottleneck, is made up of a 1×1 convolution layer with 64 channels, a 3×3 convolution layer with 64 channels, a 1×1 convolution layer with 256 channels, and a shortcut connection. To preserve the time complexity per layer, an important design principle is that the number of filters is doubled whenever the feature map size is halved. The shortcut connection is introduced to address the vanishing gradient problem. The last layer of ResNet-152 is replaced by a fully connected layer so that the image feature has the same dimension as the embedded words. Given an input image, the extractor outputs a 512-dimensional feature vector. Second, the inhibitor is composed of a long short-term memory (LSTM) network. Its input at the first time step is the image feature from the extractor; at every subsequent step, the input is the embedded word vector from the ground-truth description. The inhibitor output and the ground truth are used to calculate the local score, which represents the coherence of the generated sentences. Third, the generator has the same structure as the inhibitor, and the two share parameters, but their inputs differ: in the generator, the output of the previous time step is used as the input of the current step. The generator output is the image description and forms part of the discriminator input. Fourth, the discriminator also consists of an LSTM network; each word of the description produced by the generator is fed to the discriminator at one time step. The discriminator output at the last time step is combined with the image feature obtained by the feature extractor to calculate the global score, which measures the semantic similarity between the generated description and the image. Finally, the fusion loss combines the local and global scores.
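The following sketch shows how the extractor and discriminator described above could be wired in PyTorch/torchvision (assuming torchvision >= 0.13 for the weights argument). Beyond the stated ResNet-152 backbone and 512-dimensional feature, the dimensions and the sigmoid scoring head are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# ResNet-152 backbone with its last layer replaced by a fully connected
# layer, so the image feature matches the 512-d word-embedding dimension.
extractor = models.resnet152(weights="IMAGENET1K_V1")
extractor.fc = nn.Linear(extractor.fc.in_features, 512)

class Discriminator(nn.Module):
    """Reads a generated caption word by word; its last hidden state is
    combined with the image feature to produce the global score."""
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, caption_ids, img_feat):
        # h[-1] is the hidden state after the last caption word.
        _, (h, _) = self.lstm(self.embed(caption_ids))
        fused = torch.cat([h[-1], img_feat], dim=-1)
        return torch.sigmoid(self.score(fused))  # global score in (0, 1)
```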
By controlling the weights of the local and global scores in the fusion loss, coherence and accuracy receive different degrees of attention, so different descriptions may be generated for the same image. On the basis of the fusion loss, the model parameters are optimized by backpropagation. In the experiments, the feature extractor is pretrained on the ImageNet dataset, and the parameters of its last layer are fine-tuned during training. As training proceeds, the generated sentences improve in both coherence and accuracy. Result Model performance is evaluated on three datasets, namely, MSCOCO-2014, Oxford-102, and CUB-200-2011. On the CUB-200-2011 dataset, our method shows improvement over the method using maximum likelihood estimation (MLE) as the optimization function (CIDEr +1.6%, BLEU-3 +0.2%, BLEU-2 +0.8%, BLEU-1 +0.7%, and ROUGE-L +0.5%). Model performance declines when the inhibitor is removed (BLEU-4 -0.8%, BLEU-3 -1.2%, BLEU-2 -1.6%, BLEU-1 -0.9%, ROUGE-L -1.8%, and METEOR -1.0%). On the Oxford-102 dataset, our method gains additional improvements over the MLE baseline (CIDEr +3.6%, BLEU-4 +0.7%, BLEU-3 +0.6%, BLEU-2 +0.4%, BLEU-1 +0.2%, ROUGE-L +0.6%, and METEOR +0.7%). Model performance declines substantially after removing the inhibitor (CIDEr -3.8%, BLEU-4 -1.5%, BLEU-3 -1.7%, BLEU-2 -1.4%, BLEU-1 -1.5%, ROUGE-L -0.5%, and METEOR -0.1%). On the MSCOCO-2014 dataset, our method achieves leading results on several evaluation metrics compared with previously proposed methods (BLEU-3 +0.4%, BLEU-2 +0.4%, BLEU-1 +0.4%, and ROUGE-L +0.3%). Conclusion The new optimization function and fusion training method take into account the dependency relationships between words and the semantic relevance between the generated description and the image. The method in this study obtains better scores on multiple evaluation metrics than the method using MLE as the optimization function. Experimental results indicate that the inhibitor has a positive effect on model performance on the evaluation metrics while optimizing the coherence of the generated descriptions. The method increases the coherence of the generated captions while ensuring that they match the image content.
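The weighted combination of the two scores described above could be sketched as follows. The single trade-off coefficient lam, the cross-entropy form of the local similarity, and the log form of the global term are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def fusion_loss(tf_logits, gt_words, global_score, lam=0.5):
    """tf_logits: inhibitor (teacher-forced) outputs, (batch, T, vocab);
    gt_words: ground-truth word ids, (batch, T);
    global_score: discriminator score for the generated caption, (batch, 1);
    lam: weight trading local coherence against global accuracy."""
    # Local term: similarity between the inhibitor output and the ground truth.
    local = F.cross_entropy(tf_logits.reshape(-1, tf_logits.size(-1)),
                            gt_words.reshape(-1))
    # Global term: push the generated caption toward a high discriminator score.
    glob = -torch.log(global_score + 1e-8).mean()
    return lam * local + (1.0 - lam) * glob
```

Raising lam emphasizes the local (coherence) term, while lowering it emphasizes the global (image-description matching) term, which is consistent with the observation that different weightings may yield different descriptions for the same image.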
Keywords
