Published: 2020-02-16
DOI: 10.11834/jig.190222
2020 | Volume 25 | Number 2

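Image Understanding and Computer Vision

Image caption generation method incorporating inhibitor learning

Du Haijun, Liu Xueliang

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China

Received: 2019-05-27; Revised: 2019-07-24; Preprint: 2019-07-31

Supported by: National Natural Science Foundation of China (61632007, 61502139)

First author: Du Haijun, born in 1993, male, master's student. His main research interests are machine learning and image caption generation. E-mail: duhaijun@mail.hfut.edu.cn

CLC number: TP301.6

Document code: A

Article ID: 1006-8961(2020)02-0335-10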

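Abstract

Objective Image caption generation is an active research area spanning computer vision and natural language processing; its goal is to generate sentences that accurately express the content of an image. Sentences produced by existing methods are often inaccurate and lack coherence. We therefore propose a new fusion training method based on the encoder-decoder framework and generative adversarial networks, which optimizes generated captions both globally and locally to improve their accuracy and coherence. Method A convolutional neural network serves as the encoder and extracts image features; these features, together with the ground-truth description of the image, are fed to the decoder. A long short-term memory network serves as the decoder and generates the caption. At each step of generation, the ground-truth word and the word generated at the previous step are each used as the next input, so that two sets of captions are produced simultaneously. We compute the similarity between the ground-truth description and the caption generated from ground-truth inputs, as well as the score that the discriminator assigns to the caption generated from the previous step's outputs. The two are combined into a new fused objective that guides the training of the generator. Result On the CUB-200 dataset, compared with the variant without the inhibitor, the proposed method improves BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR by 0.8, 1.2, 1.6, 0.9, 1.8, and 1.0 percentage points, respectively. On the Oxford-102 dataset, compared with the variant without the inhibitor, it improves CIDEr, BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR by 3.8, 1.5, 1.7, 1.4, 1.5, 0.5, and 0.1 percentage points, respectively. On the MSCOCO dataset, the method achieves the best BLEU-2 and BLEU-3 scores, 50.4% and 36.8%. Conclusion The method takes the relationship between consecutive words in a caption into account and uses the inhibitor to optimize local caption information. This effectively addresses the low accuracy and coherence of captions generated by previous methods, and the method is well suited to image understanding and image caption generation.

Keywords

image caption generation; inhibitor learning; reinforcement learning; generative adversarial networks; fusion training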

Image description generation method based on inhibitor learning

Du Haijun, Liu Xueliang

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China

Supported by: National Natural Science Foundation of China(61632007, 61502139)

Abstract

Objective Image description, an active research topic in computer vision and natural language processing, aims to generate sentences that accurately describe image content. It has wide applications in early childhood education, film production, and road navigation. Deep-learning-based methods for image description have achieved great success. In the encoder-decoder framework, a convolutional neural network is used as the feature extractor, a recurrent neural network is used as the caption generator, and cross entropy is used as the generation loss. However, the descriptions produced by these methods are often disordered and inaccurate. Some researchers have proposed regularization methods based on the encoder-decoder framework to strengthen the relationship between the image and the generated description. However, the generated descriptions remain incoherent because local information and high-level semantic concepts are missing. Therefore, we propose a novel fusion training method based on the encoder-decoder framework and generative adversarial networks, in which global information is handled by the adversarially trained generator and local information by an inhibitor. This method encourages human-level linguistic coherence while narrowing the semantic gap between the image and the description. Method The model is composed of an image feature extractor, an inhibitor, a generator, and a discriminator. First, ResNet-152 is used as the image feature extractor. In ResNet-152, a key module named the bottleneck is made up of a 1×1 convolution layer with 64 filters, a 3×3 convolution layer with 64 filters, a 1×1 convolution layer with 256 filters, and a shortcut connection. To preserve the time complexity per layer, the number of filters is doubled whenever the feature map size is halved. The shortcut connection is introduced to address the vanishing gradient problem. 
The last layer in ResNet-152 is replaced by a fully connected layer to align the dimension of the image feature with that of the embedded words. For an input image, the output of the extractor is a 512-dimensional image feature vector. Second, the inhibitor is a long short-term memory (LSTM) network. Its input at the first time step is the image feature from the extractor; at every subsequent step, the input is the embedded word vector from the ground truth. The outputs of the inhibitor and the ground truth are used to calculate the local score, which represents the coherence of the generated sentences. Third, the generator has the same structure as the inhibitor and shares its parameters, but its input differs: in the generator, the output of the previous time step is used as the input of the current step. The generator output is the image description and forms part of the discriminator input. Fourth, the discriminator likewise consists of an LSTM. Each word in the generated description is the discriminator input at the corresponding time step. The discriminator output at the last time step is combined with the image feature from the extractor to calculate the global score, which measures the semantic similarity between the generated description and the image. Finally, the fusion loss combines the local and global scores. Controlling their weights in the fusion loss gives coherence and accuracy different degrees of attention, so different descriptions may be generated for the same image. The model parameters are optimized by backpropagating the fusion loss. In the experiments, the feature extractor is pretrained on ImageNet, and the parameters of its last layer are fine-tuned during training. 
As training proceeds, the generated sentences become increasingly coherent and accurate. Result Model performance is evaluated on three datasets, namely, MSCOCO-2014, Oxford-102, and CUB-200-2011. On the CUB-200-2011 dataset, our method improves on the variant that uses maximum likelihood estimation (MLE) as the optimization function (CIDEr +2.1%, BLEU-3 +0.2%, BLEU-2 +0.8%, BLEU-1 +0.7%, and ROUGE-L +0.5%). Model performance declines when the inhibitor is removed (BLEU-4 -0.8%, BLEU-3 -1.2%, BLEU-2 -1.6%, BLEU-1 -0.9%, ROUGE-L -1.8%, and METEOR -1.0%). On the Oxford-102 dataset, our method also improves on the variant that uses MLE as the optimization function (CIDEr +3.6%, BLEU-4 +0.7%, BLEU-3 +0.6%, BLEU-2 +0.4%, BLEU-1 +0.2%, ROUGE-L +0.6%, and METEOR +0.7%). Model performance declines substantially after the inhibitor is removed (CIDEr -3.8%, BLEU-4 -1.5%, BLEU-3 -1.7%, BLEU-2 -1.4%, BLEU-1 -1.5%, ROUGE-L -0.5%, and METEOR -0.1%). On the MSCOCO-2014 dataset, our method leads several previously proposed methods on several evaluation metrics (BLEU-3 +0.4%, BLEU-2 +0.4%, BLEU-1 +0.4%, and ROUGE-L +0.3%). Conclusion The new optimization function and fusion training method take into account both the dependency relationships between words and the semantic relevance between the generated description and the image. The method obtains better scores on multiple evaluation metrics than the variant using MLE as the optimization function. Experimental results indicate that the inhibitor improves model performance on the evaluation metrics while optimizing the coherence of generated descriptions. The method increases the coherence of generated captions while keeping the captions consistent with the image content.

Key words

image description generation; inhibiting learning; reinforcement learning; generative adversarial networks(GAN); fusion training

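0 Introduction

The image caption generation task is to generate a textual description for a given image that accurately expresses the objects in the image and their relationships, with grammar and wording as close as possible to human standards. It draws on techniques from both computer vision and natural language processing. The computer must first recognize the elements in the image, including the size, color, type, and texture of objects, and then generate a description consistent with the image content from the recognized features. Among early methods, Farhadi et al. (2010) extracted image features with traditional image processing techniques and then generated descriptions with predefined rules. Karpathy and Li (2015) incorporated object detection into caption generation to improve caption quality, and Xu et al. (2015) applied an attention mechanism. These methods all seek better image features to improve caption quality but neglect the optimization of relationships between consecutive generated words.

Encoder-decoder models are usually optimized by maximum likelihood estimation, which pushes the output distribution of the model toward that of the training data. Vinyals et al. (2015) used a convolutional neural network (CNN) to extract image features and passed the features, together with the ground-truth description, to a recurrent-neural-network decoder that translates them into a descriptive sentence. Such methods update gradients with a predesigned objective function; if that function is poorly designed, it may mislead training and keep the model from reaching the expected performance. Generative adversarial networks were introduced to address this: a dynamically changing discriminative model guides the training of the generative model, making parameter updates more reasonable. However, the generator's gradient information comes from the discriminator, and the discriminator may give the same result for different discrete inputs, so the generator may fail to obtain an effective gradient.

To solve these problems, this paper proposes a fusion training method based on the encoder-decoder framework and generative adversarial networks. By fusing multiple optimization methods, generated captions are optimized more comprehensively and in a more targeted way, making them more natural and reasonable.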

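1 Related work

The performance of image captioning models is now generally quantified with the evaluation metrics BLEU (bilingual evaluation understudy; Papineni et al., 2002), CIDEr (consensus-based image description evaluation; Vedantam et al., 2015), ROUGE (recall-oriented understudy for gisting evaluation; Lin, 2004), and METEOR (metric for evaluation of translation with explicit ordering; Denkowski and Lavie, 2014). Some researchers therefore use the evaluation metrics directly as optimization objectives in order to obtain higher scores. However, some metrics are computed in a non-differentiable way, so gradients cannot be obtained directly. To solve this problem, reinforcement learning has been introduced into caption generation models, so that the non-differentiable metrics are converted into differentiable value functions that are used to optimize the network.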

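1.1 Caption generation based on reinforcement learning

Reinforcement learning involves five basic concepts: agent, environment, state, action, and reward. In policy-based reinforcement learning, the agent takes an action according to the current state, and the state is updated; the environment then returns a reward, which may be positive or negative, according to the current state and action. Finally, the agent chooses its next action according to the current state and the reward. In this loop, the action, state, and reward are updated in turn, so that the agent's actions earn increasingly high rewards. Ranzato et al. (2016) were the first to apply reinforcement learning to discrete sequence generation, directly optimizing METEOR and BLEU, which improved model robustness. Bahdanau et al. (2017) added an actor-critic mechanism that distinguishes the influence of different actions on the gradient direction and reduces the variance of the reward, making training more stable. Rennie et al. (2017) adopted a self-critical mechanism that uses the result of greedy decoding as a baseline to guide the gradient update. These methods achieve high scores by optimizing evaluation metrics directly, but they ignore the grammatical rules of the caption itself during generation, which may leave the generated captions lacking in coherence and diversity.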
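As a reference point for the self-critical mechanism described above, the policy-gradient estimate commonly written for it is sketched here. The symbols are not taken from this paper and are introduced only for illustration: $w^{s}$ is a caption sampled from the model $p_{\theta}$, $\hat{w}$ is the caption obtained by greedy decoding, and $r(\cdot)$ is the reward, for example a CIDEr score.

$ \nabla_{\theta} L(\theta) \approx-\left(r\left(w^{s}\right)-r(\hat{w})\right) \nabla_{\theta} \log p_{\theta}\left(w^{s}\right) $

A sampled caption that scores higher than the greedy one has its probability increased, and one that scores lower has it decreased; subtracting the greedy reward acts as a baseline that reduces the variance of the estimate.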

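1.2 Caption generation based on adversarial training

In GAN-based image description methods, the generator of the generative adversarial network acts as the agent in reinforcement learning, and the action is the word produced by the decoder at each step. The discriminator plays the role of the environment: it assigns a reward to the actions taken and updates the state. The generator computes its objective from this reward and the probability of the word generated at each step and updates the network parameters, so that sentences with high rewards become more probable and sentences with low rewards less probable. Dai et al. (2017) applied a conditional generative adversarial network to image captioning; the method extracts image features with the network of Simonyan and Zisserman (2015) and fuses them with random noise as the decoder input to obtain sentences related to the image content. Shetty et al. (2017) generated captions with adversarial training and achieved some success in caption diversity, making the captions of images with similar content more distinctive.

Results obtained with adversarial training have an advantage in vocabulary richness and let the model give more specific descriptions of similar images. However, these methods likewise do not optimize how plausibly words are combined in the generated sentences.

To address these problems of GAN-based image captioning, we propose a new fusion training method based on the encoder-decoder framework and generative adversarial networks. The network is trained in two ways simultaneously, optimizing the global quality and the local quality of generated captions respectively, so that the captions stay consistent with the image content while conforming more closely to the grammar of natural language.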

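2 Proposed method

The structure of the proposed image captioning network is shown in Fig. 1. The input image first passes through a CNN that extracts image features; three identical copies of the feature are then passed to the inhibitor, the generator, and the discriminator. The inhibitor computes the local loss from the input image feature and caption, while the generative adversarial network (GAN) formed by the generator and the discriminator computes the global loss, that is, the score of the generated caption. Finally, the two results are fused into the final loss.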

Fig. 1 Image captioning network

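2.1 Image feature extraction

Given the strong performance of ResNet-152 (He et al., 2016) on image classification, we use a feature extraction module based on this architecture to extract features from the raw image. For training, the last layer of the original residual network is replaced with a fully connected layer so that the network outputs a 512-dimensional vector, which serves as the image feature in all subsequent modules.

2.2 Grammar constraint module based on supervised learning

This module constrains model training using supervised learning, in which real samples serve as references during training and the difference between real and generated samples adjusts the optimization direction, so that the output distribution of the model approaches the real data distribution. The inhibitor in Fig. 1 consists of a single-layer LSTM (long short-term memory). During training, the ground-truth caption is first converted into word vectors, which together with the image feature form the inputs of the model, one word vector per step; the state of the model at the current step is passed on as the initial state of the next step. The hidden vector output by the LSTM cell at each step is then taken as the model output for that step, and the hidden vectors of all steps, concatenated in temporal order, form the model's prediction of the caption for the current image. Finally, the loss is computed against the ground-truth caption as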

$ \begin{aligned} {loss}_{\mathrm{sup}}^{\mathrm{G}}=&-\underset{\boldsymbol{X} \sim P_{X}(x)}{\mathrm{E}_{\boldsymbol{I} \sim P_{I}(i)}}(\boldsymbol{I} \times \log (G(\boldsymbol{X}))+\\ &\log (\boldsymbol{I}) \times G(\boldsymbol{X})) \end{aligned} $

(1)

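where $\boldsymbol{X}$ is the image feature, $P_{X}(x)$ is the distribution of images, $\boldsymbol{I}$ is the ground-truth caption, $P_{I}(i)$ is the distribution of ground-truth captions, $G$ is the generator, and $G(\boldsymbol{X})$ is the caption predicted by the generator for input $\boldsymbol{X}$.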

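Feeding the ground-truth caption to the model at every step reduces the effect of error accumulation. With this training scheme and the constraint of Eq. (1), the model learns the grammatical rules between words more accurately, which increases the coherence of the generated captions.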
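The teacher-forced objective can be illustrated with a small numerical sketch. It assumes that each step already yields a probability distribution over the vocabulary and keeps only the log-likelihood term $\boldsymbol{I} \times \log (G(\boldsymbol{X}))$ of Eq. (1), which for one-hot ground-truth words reduces to the standard cross entropy. The function name and the toy numbers are illustrative, not the authors' implementation.

```python
import math


def teacher_forced_nll(step_probs, target_ids):
    """Mean negative log-likelihood of the ground-truth words.

    step_probs[t] is the predicted distribution over the vocabulary at step t,
    produced while the ground-truth word of the previous step is fed as input.
    target_ids[t] is the vocabulary index of the ground-truth word at step t.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution per target word is required")
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        total -= math.log(probs[target] + eps)
    return total / len(target_ids)


# Toy example: vocabulary of 3 words, caption of 2 words (assumed values).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = teacher_forced_nll(probs, [0, 1])
```

Because every step is conditioned on the correct previous word, a mistake at one step does not change the inputs of later steps, which is the reduction in error accumulation mentioned above.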

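2.3 Reward optimization module based on generative adversarial networks and reinforcement learning

In the grammar constraint module, the inhibitor focuses on the dependencies between generated words, using a special training scheme and a corresponding objective to guide the model in learning the grammatical rules of sentences. The reward optimization module, in contrast, focuses on the overall quality of the generated caption. It is built around a generative adversarial network combined with reinforcement learning and improves model performance through adversarial training of the generator and the discriminator.

As shown in Fig. 1, this module contains two mutually independent neural networks, a caption generator and a reward discriminator, each consisting of a single-layer LSTM. During training, the generator first receives the CNN image feature as its input; the hidden state of the LSTM at each step is then taken as the output of that step and fed in as the input of the next step. After a number of steps, a temporally ordered sequence is obtained, which is the generator's prediction of the image caption. The prediction is then passed to the discriminator, which encodes the input caption with its LSTM; only the output at the last step is kept as the encoding. Finally, the encoding and the image feature are jointly fed to an activation function to compute the score. Note that the score predicted by the discriminator applies to the whole caption rather than to any particular word.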
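The final scoring step can be sketched as follows. The text says only that the caption encoding and the image feature are fed to an activation function; the sigmoid of their dot product used here is one common choice and is an assumption, as are the function name and the toy vectors.

```python
import math


def discriminator_score(image_feature, caption_encoding):
    """Score in (0, 1) for a whole caption given the image feature.

    Assumed form: sigmoid of the dot product between the image feature and the
    last-step LSTM encoding of the caption.
    """
    if len(image_feature) != len(caption_encoding):
        raise ValueError("vectors must have the same dimension")
    logit = sum(x * h for x, h in zip(image_feature, caption_encoding))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 2-dimensional vectors (the paper uses 512 dimensions).
matched = discriminator_score([1.0, 0.5], [2.0, 1.0])       # logit = 2.5
mismatched = discriminator_score([1.0, 0.5], [-2.0, -1.0])  # logit = -2.5
```

A single scalar is produced per caption, consistent with the note above that the score applies to the whole caption and not to individual words.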

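During optimization, the generator computes its gradient from the score predicted by the discriminator. Because the predicted score is a continuous value, the vanishing gradient problem is effectively avoided. The score is computed as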

$ {Reward}(\boldsymbol{X}, \boldsymbol{S})=D(\boldsymbol{X}, \boldsymbol{S}) $

(2)

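where $\boldsymbol{S}$ is the caption input to the discriminator, $\boldsymbol{X}$ is the image feature, $D$ is the discriminator, and $Reward$ is the score the discriminator assigns to a given input. The discriminator output is used directly as the score of the generated caption. The objective of the discriminator is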

$ \begin{aligned} {loss}_{\mathrm{Adv}}^{\mathrm{D}}=&-\underset{\boldsymbol{X} \sim P_{X}(x)}{\mathrm{E}_{\boldsymbol{I} \sim P_{I}(i)}}(\log ({Reward}(\boldsymbol{X}, \boldsymbol{I}))+\\ & \log (1-{Reward}(\boldsymbol{X}, G(\boldsymbol{X})))+\\ & \log (1-{Reward}(\boldsymbol{X}, \boldsymbol{I}^{\prime}))) \end{aligned} $

(3)

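where $\boldsymbol{I}^{\prime}$ is a mismatched caption, $Reward(\boldsymbol{X}, \boldsymbol{I})$ is the score the discriminator assigns to the image feature paired with its own caption, and $Reward(\boldsymbol{X}, \boldsymbol{I}^{\prime})$ is the score it assigns to the image feature paired with a wrong caption. These terms give the discriminator more positive and negative feedback and improve its ability to distinguish real captions from wrong ones. The objective of the generator is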

$ \begin{aligned} {loss}_{\mathrm{Adv}}^{\mathrm{G}}=&-\underset{\boldsymbol{X} \sim P_{X}(x)}{\mathrm{E}_{\boldsymbol{I} \sim P_{I}(i)}}(\log (G(\boldsymbol{X})) \times\\ & {Reward}(\boldsymbol{X}, G(\boldsymbol{X})) \times \alpha+\\ &(\log (G(\boldsymbol{X})) \times \boldsymbol{I}+\log (\boldsymbol{I}) \times G(\boldsymbol{X})) \times \beta) \\ &\{\alpha \in \mathbf{R}, 0 \leqslant \alpha \leqslant 1\} \\ &\{\beta \in \mathbf{R}, 0 \leqslant \beta \leqslant 1\} \end{aligned} $

(4)

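where $\alpha$ and $\beta$ are weight coefficients, real numbers in [0, 1]. As Eq. (4) shows, $\alpha$ weights the reward term and therefore the attention paid to the caption as a whole, while $\beta$ weights the supervised term and therefore the attention paid to local word relationships. $\boldsymbol{X}$ is the image feature and $G$ is the generator. Adjusting $\alpha$ and $\beta$ controls how much the optimization attends to the global and local quality of the generated caption, so that the network can optimize captions in a more targeted way.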
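Eqs. (3) and (4) can be made concrete for a single image with the sketch below. Several simplifications are assumptions of this sketch and not statements about the authors' code: the discriminator scores are given as plain numbers, the generator's $\log (G(\boldsymbol{X}))$ is represented by the summed log-probabilities of the words it produced, the reward is treated as a constant, and only the log-likelihood part of the supervised term is kept (the $\log (\boldsymbol{I}) \times G(\boldsymbol{X})$ part is omitted).

```python
import math

EPS = 1e-12  # guards against log(0)


def discriminator_loss(score_real, score_generated, score_mismatched):
    """Eq. (3) for one image: high score for the real pair, low for the others."""
    return -(math.log(score_real + EPS)
             + math.log(1.0 - score_generated + EPS)
             + math.log(1.0 - score_mismatched + EPS))


def generator_fusion_loss(sampled_log_probs, reward, teacher_log_probs,
                          alpha=0.5, beta=0.5):
    """Eq. (4) for one image (simplified).

    sampled_log_probs: log-probabilities of the words in the free-running caption.
    reward: discriminator score of that caption, held constant.
    teacher_log_probs: log-probabilities of the ground-truth words under
        teacher forcing (the inhibitor path).
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    global_term = sum(sampled_log_probs) * reward  # whole-caption reward
    local_term = sum(teacher_log_probs)            # word-level supervision
    return -(alpha * global_term + beta * local_term)


# Toy values (assumed): a confident discriminator has a lower loss.
d_good = discriminator_loss(0.9, 0.1, 0.1)
d_poor = discriminator_loss(0.5, 0.5, 0.5)
g_loss = generator_fusion_loss([math.log(0.5), math.log(0.5)], 0.8,
                               [math.log(0.7), math.log(0.8)])
```

Setting `alpha=1, beta=0` leaves only the adversarial reward term, and `alpha=0, beta=1` leaves only the supervised term, which makes the role of the two weights easy to inspect.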

3 Experiments and analysis

3.1 Datasets

The experiments use the MSCOCO, Oxford-102, and CUB-200-2011 datasets; their statistics are listed in Table 1.

Table 1 Comparison of the parameters of different datasets

Dataset	Training data	Validation data	Test data	Descriptions	Vocabulary size	Image subject
MSCOCO	82 783	40 504	40 775	5	9 957	General
Oxford	5 000	2 000	1 189	10	1 994	Flowers
CUB-200	11 788	-	2 961	10	2 147	Birds
Note: "Descriptions" is the number of description sentences for each image in the dataset.

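Training a deep neural network requires a large amount of data to tune the parameters to a good state so that the output distribution of the model approaches that of the real data. However, large-scale datasets of images with text descriptions are difficult to obtain. MSCOCO is large, but its scenes are complex, so the model needs more data. In Oxford-102 and CUB-200-2011 the targets are flowers or birds and the scenes are simple, but the total amount of data is small.

Data augmentation is therefore an essential step for improving model performance. We use random cropping and random horizontal flipping. With this step, the model receives different inputs for the same training image, which indirectly provides more training data, prevents overfitting, and improves generalization.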
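The two augmentations can be sketched in plain Python on an image stored as a list of pixel rows. This is an illustration under assumed names and sizes (a 4×4 toy image and a 3×3 crop); the crop size relative to the original image and the flip probability are not given in the text.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left to right with probability p (p is an assumption)."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = random_horizontal_flip(random_crop(image, 3, 3, rng), rng)
```

Each call can return a different window and orientation of the same image, which is how the model sees varied inputs for one training sample.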

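In the test phase, following the protocol of Karpathy and Li (2015) and Vinyals et al. (2015), we use the entire validation set to measure model performance. For MSCOCO and CUB-200-2011, words that occur at least 5 times in the training set are included in the vocabulary; for Oxford-102, the minimum count is 4.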
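The vocabulary rule above amounts to a frequency threshold, sketched below. The whitespace tokenization, the lower-casing, and the special tokens (`<pad>`, `<start>`, `<end>`, `<unk>`) are assumptions added for illustration; the text specifies only the minimum counts.

```python
from collections import Counter


def build_vocab(captions, min_count):
    """Map each word that occurs at least min_count times to an integer id."""
    counts = Counter(word for caption in captions
                     for word in caption.lower().split())
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    specials = ["<pad>", "<start>", "<end>", "<unk>"]  # hypothetical tokens
    return {word: idx for idx, word in enumerate(specials + kept)}


def encode(caption, vocab):
    """Replace words outside the vocabulary with the <unk> id."""
    unk = vocab["<unk>"]
    return [vocab.get(word, unk) for word in caption.lower().split()]


# Toy corpus; the paper uses min_count=5 (MSCOCO, CUB-200-2011) or 4 (Oxford-102).
captions = ["a small bird", "a white bird", "a bird"]
vocab = build_vocab(captions, min_count=2)
ids = encode("a red bird", vocab)
```

A threshold of this kind keeps the output layer small, but it also means that a misspelled word that occurs often enough in the training captions still enters the vocabulary.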

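3.2 Evaluation metrics

To evaluate the generated results more accurately, we use seven evaluation metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L, and CIDEr. We also compare different models and different objective functions to show more directly the improvement that the proposed method brings to the generated results.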
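To show what the BLEU-n scores measure, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on. It is a partial illustration: full BLEU also combines the precisions for n = 1 to N by a geometric mean and applies a brevity penalty, both omitted here. The example sentences are one ground-truth and generated pair from Table 5 with the final period removed; the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


reference = "this is a white bird with a grey wing and a yellow pointy beak".split()
generated = "this is a white bird with grey wings and a black beak".split()
p1 = modified_precision(generated, [reference], 1)
p2 = modified_precision(generated, [reference], 2)
```

Unigram precision rewards correct word choice, while higher-order precisions also require the right word order, which is why the BLEU-2 to BLEU-4 scores are sensitive to how well neighboring words fit together.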

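3.3 Training details

In the inhibitor and the discriminator, the LSTM hidden dimension is set to 512. After data augmentation, each image has size 3×224×224 and is processed by the ResNet-152 CNN to obtain a 512-dimensional image feature. Input captions are embedded as word vectors, also of dimension 512. Each training batch contains 256 samples.

Training is divided into a pre-training stage and an adversarial training stage. In pre-training, the inhibitor is trained for 20 epochs and the discriminator for 5 epochs, using the Adam optimizer with an initial learning rate of 0.001. In both networks, the hidden-layer output passes through a fully connected layer that maps it to the vocabulary size. In adversarial training, the learning rate is changed to 0.000 1, the generator and the discriminator are trained alternately, and $\alpha$ and $\beta$ in Eq. (4) are both set to 0.5. Note that the outputs must be sampled to produce a complete sentence: we use multinomial sampling during training and the same method as Sutskever et al. (2014) to select predictions at test time.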
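The difference between the two decoding modes can be sketched for a single step. Multinomial sampling draws a word in proportion to its predicted probability. For test time, greedy selection is shown only as the simplest deterministic stand-in; the method cited from Sutskever et al. (2014) is a beam search, of which greedy selection is the special case with beam width 1, and the beam width used here is not stated in the text. Function names are illustrative.

```python
import random


def sample_multinomial(probs, rng):
    """Training-time choice: draw an index with probability probs[index]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def select_greedy(probs):
    """Deterministic stand-in for test time: index of the most probable word."""
    return max(range(len(probs)), key=probs.__getitem__)


rng = random.Random(0)
step_probs = [0.1, 0.6, 0.3]  # toy distribution over a 3-word vocabulary
sampled = [sample_multinomial(step_probs, rng) for _ in range(10)]
best = select_greedy(step_probs)
```

Sampling lets the generator explore captions other than the single most likely one, which the reward from the discriminator can then reinforce or suppress.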

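3.4 Analysis of experimental results

On CUB-200-2011 and Oxford-102, the proposed method is compared with a variant that uses cross entropy as the loss (method M) and a variant that removes the inhibitor from the proposed method (method R); a higher score indicates a better model. The results are shown in Tables 2 and 3. Note that the caption generation networks in methods M and R have the same structure and hyperparameters as ours.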

Table 2 Comparison of different methods on the CUB-200-2011 dataset

Method	CIDEr	BLEU-4	BLEU-3	BLEU-2	BLEU-1	ROUGE-L	METEOR
M	0.350	0.438	0.579	0.741	0.899	0.579	0.292
R	0.373	0.426	0.569	0.733	0.897	0.566	0.279
Ours	0.371	0.434	0.581	0.749	0.906	0.584	0.289
Note: bold type marks the best value for each metric.

Table 3 Comparison of different methods on the Oxford-102 dataset

Method	CIDEr	BLEU-4	BLEU-3	BLEU-2	BLEU-1	ROUGE-L	METEOR
M	0.627	0.724	0.777	0.840	0.905	0.792	0.430
R	0.625	0.716	0.766	0.830	0.892	0.793	0.427
Ours	0.663	0.731	0.783	0.844	0.907	0.798	0.437
Note: bold type marks the best value for each metric.

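Tables 2 and 3 show that, compared with method M on CUB-200-2011, the proposed method improves CIDEr, BLEU-3, BLEU-2, BLEU-1, and ROUGE-L by 2.1, 0.2, 0.8, 0.7, and 0.5 percentage points, respectively. On Oxford-102, it improves CIDEr, BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR by 3.6, 0.7, 0.6, 0.4, 0.2, 0.6, and 0.7 percentage points, respectively. This indicates that the output distribution of the proposed method is closer to the real distribution and that the generated captions are semantically more similar to the images. Compared with method R on CUB-200-2011, the proposed method improves BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR by 0.8, 1.2, 1.6, 0.9, 1.8, and 1.0 percentage points, respectively. On Oxford-102, it improves CIDEr, BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR by 3.8, 1.5, 1.7, 1.4, 1.5, 0.5, and 0.1 percentage points, respectively. This indicates that, during optimization, the inhibitor benefits the metric scores of the generated captions while optimizing their grammar, bringing the predicted distribution closer to the real one.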
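The reported gains can be checked directly against Table 2. The sketch below recomputes, in percentage points, the difference between the proposed method and method R on CUB-200-2011; the scores are copied from Table 2, while the function and variable names are illustrative.

```python
METRICS = ["CIDEr", "BLEU-4", "BLEU-3", "BLEU-2", "BLEU-1", "ROUGE-L", "METEOR"]
CUB_R = [0.373, 0.426, 0.569, 0.733, 0.897, 0.566, 0.279]     # method R, Table 2
CUB_OURS = [0.371, 0.434, 0.581, 0.749, 0.906, 0.584, 0.289]  # proposed, Table 2


def gains_in_points(ours, baseline):
    """Score differences scaled to percentage points, rounded to one decimal."""
    return {metric: round((o - b) * 100, 1)
            for metric, o, b in zip(METRICS, ours, baseline)}


gains = gains_in_points(CUB_OURS, CUB_R)
for metric, gain in gains.items():
    print(f"{metric}: {gain:+.1f}")
```

The recomputed gains for BLEU-4 through METEOR match the values quoted above; CIDEr is the one metric on which method R scores slightly higher (0.373 versus 0.371).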

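Table 4 compares different methods on the MSCOCO dataset. DeepVS (deep visual-semantic) is an image captioning method proposed by Karpathy and Li (2015); it incorporates object detection and obtains better image features by accurately recognizing the objects in an image, which raises the metric scores of the generated captions. NIC (neural image caption), proposed by Vinyals et al. (2015), applies the encoder-decoder framework to image captioning, with a CNN as the encoder and an LSTM as the decoder, forming an end-to-end caption generation network. gLSTM (guiding long-short term memory), proposed by Jia et al. (2015), adds extra guidance information when generating captions with an LSTM to improve caption quality. RLF (reinforcement learning with feedback), proposed by Ling and Fidler (2017), adds a feedback network on top of reinforcement learning to guide caption generation. IRBO (image description generation by modeling the relationship between objects), proposed by Bai et al. (2018), extracts image features by object detection and models the relationships between objects; captions are generated from the objects and their relationships. FFGS (feature fusion with gating structure), proposed by Yuan et al. (2017), fuses image features of different dimensions with a gating structure and feeds them to an LSTM network for caption generation.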

Table 4 Comparison of different methods on the MSCOCO dataset

Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR
DeepVS	—	—	0.321	0.23	0.195
NIC	—	—	0.329	0.277	0.237
gLSTM	—	—	0.358	0.264	0.227
RLF	0.662	0.45	0.301	0.203	—
IRBO	0.696	0.433	0.175	0.086	—
FFGS	0.701	0.503	0.358	0.255	0.241
Ours	0.683	0.504	0.368	0.271	0.236
Note: bold type marks the best result; a dash indicates an unknown result.

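As Table 4 shows, on MSCOCO the proposed method achieves the best BLEU-2 and BLEU-3 scores. Both the proposed method and the compared methods in Table 4 use an encoder-decoder framework; the difference is that the proposed method attends to both the global and the local features of the generated caption. We attribute the best BLEU-2 and BLEU-3 scores to the inhibitor's optimization of local caption quality. On BLEU-1 and BLEU-4, our model does not achieve the best value because of the generative adversarial network (GAN): the discriminator parameters change continually during training, which adds noise to the training of the generator and ultimately makes the generated captions differ from the target captions.

Taken together, Tables 2 to 4 show that the proposed method brings the distribution of the generative model closer to the real distribution, produces captions that better match the image content, and improves several evaluation metrics. The proposed inhibitor constrains the prediction process of the generative model and improves the accuracy of the generated results.

Captions generated by the model are shown in Table 5. On the one hand, the generated captions are highly consistent with the ground-truth captions in overall semantics. For CUB-200-2011, the generated captions not only identify the main color of the bird but also accurately describe details such as the beak, wings, tail, and neck. For Oxford-102, they accurately describe the color and shape of the petals and the color of the stamens. Spelling errors in some generated words arise from nonstandard wording in the datasets, which caused misspelled words to be included when the vocabulary was built. On the other hand, the generated captions use subjects, predicates, objects, and other common sentence components in accordance with natural language usage and contain no obvious grammatical errors. While accurately expressing the content of the image, the generated captions differ somewhat from the ground-truth captions and thus exhibit diversity.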

Table 5 Comparison of ground-truth captions and captions generated by the model

Image	Ground-truth caption	Generated caption
	this bird has a black body whit stripe on his wings and a yellow head and chest.	a black bird with a black head and a yellow beak.
	this small bird has a white crest and a black belly, with grey wings.	the bird has a small bill that is black and white.
	a bird with a black head as well as brown and white on it’s belly.	a small black bird with a white belly and a black beak.
	a medium bird with a cream head and chest and a gray body and wings.	a bird with a long pointed bill, a white breast and belly, and brown wings.
	this bird has a large, curved, black bill, a white crown, and gray tarsuses and feet.	a white bird with grey wings and a yellow beak.
	the grey waterbird has prominent round cheeks and a round bill.	a medium sized bird with a white underbelly and a long beak.
	this is a white bird with a grey wing and a yellow pointy beak.	this is a white bird with grey wings and a black beak.
	a small bird with a gray back, white belly, and pink and orange throat and face.	this is a colorful bird with a white belly and a black head.
	a small bird with a small long pointed beek flutters above a feeder.	the bird has a small bill that is black as well as a small bill.
	this flower is blue in color with petals that have veins.	this flower is pink and white in color with petals that are multi colored.
	the flower has petals that are overlapping and yellow with a large brown center.	this flower is yellow and black in color with petals that are oval shaped.
	this flower is yellow and white in color with petals that are ruffled and wavy.	the petals of this flower are white with a short stigma.
	large yellow almost transparent petals folding in with a green pedicel.	this flower is yellow in color with petals that are closely wrapped around the center.
	this flower has large green sepals surrounding pointed purple petals which bend slightly backwards.	this flower has petals that are purple and very stringy.
	this flower has multiple black stamen and soft blue petals which are wide with rounded edges.	this flower has purple petals as well as a white stamen.
	a flower with flat pink petals and cluster of yellow stamen.	this flower has petals that are pink and has red stamen.

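4 Conclusion

This paper proposes a new fusion training method for image captioning based on the encoder-decoder framework and generative adversarial networks. By introducing an inhibitor, it optimizes the plausibility of local word combinations in generated captions in a targeted way, addressing the unsatisfactory results of earlier methods that ignore local information. During training, two different captions are used as inputs, their losses are computed with different loss functions, and the two optimization methods are fused to optimize the whole network. The two loss functions target the overall semantics and the local grammar of the generated caption, respectively. While keeping the caption consistent with the image content, the method constrains the combination of neighboring words and raises the scores on several evaluation metrics. Comparisons with existing methods on CUB-200-2011, Oxford-102, and MSCOCO show that the proposed inhibitor effectively improves the accuracy of generated captions and helps optimize the plausibility of local word combinations.

A limitation of the proposed method is that, because it relies on generative adversarial networks and reinforcement learning, training is unstable and the diversity of generated captions is limited. Future work will address the sentence structure of generated captions and the stability of training, aiming to make the captions more natural, accurate, and diverse.

References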

Bahdanau D, Brakel P, Xu K, Goyal A, Lowe R, Pineau J, Courville A and Bengio Y. 2017. An actor-critic algorithm for sequence prediction//Proceedings of 2017 International Conference on Learning Representations.[s.l.]: [s.n.]

Bai L, Yang L N, Huo L and Li T S. 2018. Image description generation by modeling the relationship between objects//Proceedings of 2018 International Conference on Wavelet Analysis and Pattern Recognition. Chengdu, China: IEEE: 215-222[DOI: 10.1109/ICWAPR.2018.8521291]

Dai B, Fidler S, Urtasun R and Lin D. 2017. Towards diverse and natural image descriptions via a conditional GAN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2989-2998[DOI: 10.1109/ICCV.2017.323]

Denkowski M and Lavie A. 2014. Meteor universal: language specific translation evaluation for any target language//Proceedings of the 9th Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics: 376-380

Farhadi A, Hejrati M, Sadeghi M A, Young P, Rashtchian C, Hockenmaier J and Forsyth D. 2010. Every picture tells a story: generating sentences from images//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Crete: Springer: 15-29

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]

Jia X, Gavves E, Fernando B and Tuytelaars T. 2015. Guiding the long-short term memory model for image caption generation//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2407-2415[DOI: 10.1109/ICCV.2015.277]

Karpathy A and Li F F. 2015. Deep visual-semantic alignments for generating image descriptions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 3128-3137[DOI: 10.1109/CVPR.2015.7298932]

Lin C Y. 2004. ROUGE: a package for automatic evaluation of summaries//Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004. Barcelona, Spain: Association for Computational Linguistics: 74-81

Ling H and Fidler S. 2017. Teaching machines to describe images via natural language feedback//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, California, USA: Curran Associates Inc.: 5075-5085

Papineni K, Roukos S, Ward T and Zhu W J. 2002. BLEU: a method for automatic evaluation of machine translation//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, Pennsylvania: Association for Computational Linguistics: 311-318[DOI: 10.3115/1073083.1073135]

Ranzato M A, Chopra S, Auli M and Zaremba W. 2016. Sequence level training with recurrent neural networks//Proceedings of the 4th International Conference on Learning Representations.[s.l.]: [s.n.]

Rennie S J, Marcheret E, Mroueh Y, Ross J and Goel V. 2017. Self-critical sequence training for image captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 1179-1195[DOI: 10.1109/CVPR.2017.131]

Shetty R, Rohrbach M, Hendricks L A, Fritz M and Schiele B. 2017. Speaking the same language: matching machine to human captions by adversarial training//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4155-4164[DOI: 10.1109/ICCV.2017.445]

Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition//Proceedings of 2015 International Conference on Learning Representations.[s.l.]: [s.n.]

Sutskever I, Vinyals O and Le Q V. 2014. Sequence to sequence learning with neural networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Quebec, Canada: Curran Associates Inc.: 3104-3112

Vedantam R, Zitnick C L and Parikh D. 2015. CIDEr: consensus-based image description evaluation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 4566-4575[DOI: 10.1109/CVPR.2015.7299087]

Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 3156-3164[DOI: 10.1109/CVPR.2015.7298935]

Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: PMLR: 2048-2057

Yuan A H, Li X L and Lu X Q. 2017. FFGS: feature fusion with gating structure for image caption generation//Proceedings of the 2nd CCF Chinese Conference on Computer Vision. Tianjin, China: Springer: 638-649[DOI: 10.1007/978-981-10-7299-4_53]