Image description generation method based on inhibitor learning
2020, Vol. 25, No. 2, pp. 333-342
Received: 2019-05-27
Revised: 2019-07-24
Accepted: 2019-07-31
Published in print: 2020-02-16
DOI: 10.11834/jig.190222
Objective
Image caption generation is a popular research area involving computer vision and natural language processing; its goal is to generate sentences that accurately express the content of an image. In previously proposed methods, the generated sentences suffer from inaccurate description and a lack of coherence. To address this, we propose a new fusion training method based on the encoder-decoder framework and generative adversarial networks. By optimizing the generated caption globally and locally, the accuracy and coherence of the generated sentences are improved.
Method
A convolutional neural network is used as the encoder to extract image features, and the extracted features together with the corresponding ground-truth descriptions are fed to the decoder. A long short-term memory (LSTM) network serves as the decoder for caption generation. At each time step of caption generation, the ground-truth description and the caption generated at the previous time step are used separately as the input of the next time step, so that two captions are generated simultaneously. The similarity between the caption generated from the ground truth and the ground truth itself is computed, as is the score that the discriminator assigns to the caption generated from the previous outputs. The two terms are combined into a new fused optimization function that guides the training of the generator.
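As a concrete illustration, the following is a minimal sketch in PyTorch-style Python of how the two terms described above could be fused into a single training objective: a local term measuring how closely the teacher-forced caption reproduces the ground truth, and a global term given by the discriminator's score for the freely generated caption. The tensor shapes, the weighting parameter lam, and the use of negative cross entropy as the similarity measure are illustrative assumptions, not necessarily the exact form used in the paper.

import torch
import torch.nn.functional as F

def fused_objective(tf_logits, gt_tokens, disc_score, lam=0.5):
    # tf_logits: (B, T, V) logits from the teacher-forced (inhibitor) pass
    # gt_tokens: (B, T) ground-truth word indices
    # disc_score: (B, 1) discriminator scores for the freely generated captions
    # Local term: similarity between the teacher-forced caption and the ground
    # truth, approximated here by negative cross entropy (an assumption).
    local = -F.cross_entropy(tf_logits.reshape(-1, tf_logits.size(-1)),
                             gt_tokens.reshape(-1))
    # Global term: how "real" the discriminator judges the generated caption to be.
    global_term = torch.log(disc_score + 1e-8).mean()
    # The generator is trained to maximize the weighted fusion of both terms,
    # i.e., to minimize its negative.
    return -(lam * local + (1.0 - lam) * global_term)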
Result
On the CUB-200 dataset, compared with the variant without the inhibitor, the proposed method improves the scores of six evaluation metrics, BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR, by 0.8%, 1.2%, 1.6%, 0.9%, 1.8%, and 1.0%, respectively. On the Oxford-102 dataset, compared with the variant without the inhibitor, the proposed method improves the scores of seven evaluation metrics, CIDEr, BLEU-4, BLEU-3, BLEU-2, BLEU-1, ROUGE-L, and METEOR, by 3.8%, 1.5%, 1.7%, 1.4%, 1.5%, 0.5%, and 0.1%, respectively. On the MSCOCO dataset, the proposed method achieves the best values on the BLEU-2 and BLEU-3 metrics, namely 50.4% and 36.8%.
Conclusion
The proposed method takes into account the contextual relationships between the words of an image caption and uses an inhibitor to optimize the local information of the caption. It effectively alleviates the low accuracy and poor coherence of captions generated by previous methods and is well suited to image understanding and image caption generation.
Objective
Image description, a popular research field in computer vision and natural language processing, focuses on generating sentences that accurately describe image content. Image description has wide applications in infant education, film production, and road navigation. Previous deep learning methods for image description have achieved great success. On the basis of the encoder-decoder framework, a convolutional neural network is used as the feature extractor and a recurrent neural network is used as the caption generator, with cross entropy applied to calculate the generation loss. However, the descriptions produced by these methods are often incoherent and inaccurate. Some researchers have proposed regularization methods based on the encoder-decoder framework to strengthen the relationship between the image and the generated description. However, coherence problems remain in the generated descriptions because of missing local information and high-level semantic concepts. Therefore, we propose a novel fusion training method based on the encoder-decoder framework and generative adversarial networks, which enables global and local information to be evaluated by a generator and an inhibitor, respectively. This method encourages linguistic coherence close to the human level while narrowing the semantic gap between the image and its description.
Method
The model is composed of an image feature extractor, an inhibitor, a generator, and a discriminator. First, ResNet-152 is used as the image feature extractor. In ResNet-152, the key module, named the bottleneck, is made up of a 1×1 convolution layer with 64 filters, a 3×3 convolution layer with 64 filters, a 1×1 convolution layer with 256 filters, and a shortcut connection. To preserve the time complexity per layer, an important design principle is that the number of filters is doubled whenever the spatial size of the feature map is halved. The shortcut connection is introduced to address the vanishing gradient problem. The last layer of ResNet-152 is replaced by a fully connected layer to align the dimension of the image feature with that of the word embeddings. For an input image, the extractor outputs a 512-dimensional image feature vector.
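The extractor described above can be summarized in a short, hedged sketch. The 512-dimensional output, the ImageNet-pretrained ResNet-152 backbone, and the replacement of the final layer follow the text; the layer-slicing details and the torchvision API call are implementation assumptions.

import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    # ResNet-152 backbone whose original classification layer is removed and
    # replaced by a fully connected layer, so every image is mapped to a
    # 512-dimensional feature matching the word-embedding dimension.
    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = models.resnet152(weights="IMAGENET1K_V1")  # ImageNet weights (torchvision >= 0.13 API)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # keep conv stages + global pooling
        self.fc = nn.Linear(resnet.fc.in_features, feat_dim)          # new last layer (2048 -> 512)

    def forward(self, images):                # images: (B, 3, 224, 224)
        x = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(x)                     # (B, 512)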
Second, the inhibitor is composed of a long short-term memory (LSTM) network. The input at the first time step is the image feature from the extractor; at every subsequent time step, the input is the embedded word vector of the corresponding ground-truth word. The outputs of the inhibitor and the ground truth are used to calculate the local score, which reflects the coherence of the generated sentences.
Third, the structure of the generator is the same as that of the inhibitor, and the two share parameters; however, the input of the generator differs from that of the inhibitor. In the generator, the output of the previous time step is used as the input of the current time step. The generator output is the image description and forms part of the discriminator input.
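A hedged sketch of a single LSTM decoder playing both roles may clarify the shared-parameter design: fed the ground-truth words, it acts as the inhibitor; fed its own previous predictions, it acts as the generator. The dimensions, greedy decoding, and fixed maximum length are assumptions.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    # One LSTM shared by the inhibitor (teacher-forced pass) and the
    # generator (free-running pass); the first input is the image feature.
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, gt_tokens=None, max_len=20):
        h = img_feat.new_zeros(img_feat.size(0), self.hidden_dim)
        c = img_feat.new_zeros(img_feat.size(0), self.hidden_dim)
        x, logits = img_feat, []                 # first input is the image feature
        steps = gt_tokens.size(1) if gt_tokens is not None else max_len
        for t in range(steps):
            h, c = self.cell(x, (h, c))
            step_logits = self.out(h)
            logits.append(step_logits)
            if gt_tokens is not None:            # inhibitor: next input is the ground-truth word
                nxt = gt_tokens[:, t]
            else:                                # generator: next input is its own previous output
                nxt = step_logits.argmax(dim=-1)
            x = self.embed(nxt)
        return torch.stack(logits, dim=1)        # (B, T, vocab_size)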
Fourth, the discriminator likewise consists of an LSTM. Each word of the description produced by the generator is fed to the discriminator at the corresponding time step. The discriminator output at the last time step is combined with the image feature obtained from the feature extractor to calculate the global score, which measures the semantic similarity between the generated description and the image.
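A corresponding sketch of the discriminator follows; concatenating the final LSTM state with the image feature and scoring it through a sigmoid head is a simplifying assumption about how the two signals are combined.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    # Reads the generated caption word by word; its last hidden state is
    # combined with the image feature to produce a global matching score.
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, caption_tokens, img_feat):        # caption_tokens: (B, T), img_feat: (B, 512)
        _, (h_n, _) = self.lstm(self.embed(caption_tokens))
        fused = torch.cat([h_n[-1], img_feat], dim=-1)   # last-step state + image feature
        return torch.sigmoid(self.score(fused))          # global score in (0, 1)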
Finally, the fusion loss combines the local and global scores. By controlling the weights of the local and global scores in the fusion loss, coherence and accuracy receive different degrees of attention, and different descriptions may be generated for the same image. On the basis of the fusion loss, the model optimizes its parameters by backpropagation. In the experiments, the feature extractor is pretrained on the ImageNet dataset, and the parameters of its last layer are fine-tuned during formal training. As the number of training iterations increases, the generated sentences perform increasingly well in terms of coherence and accuracy.
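Putting the sketches above together, one possible training step could look as follows, reusing the hypothetical FeatureExtractor, CaptionDecoder, Discriminator, and fused_objective defined earlier. Freezing the backbone while fine-tuning only the new fully connected layer follows the description above; the optimizer choice, vocabulary size, and greedy decoding are assumptions. Because argmax sampling blocks gradients from the discriminator term back to the decoder, a full implementation would typically add a policy-gradient or similar estimator at that point; the sketch only shows the data flow.

import torch

extractor = FeatureExtractor()
decoder = CaptionDecoder(vocab_size=10000)   # vocabulary size is an assumption
disc = Discriminator(vocab_size=10000)

for p in extractor.backbone.parameters():    # freeze the pretrained backbone
    p.requires_grad = False
optimizer = torch.optim.Adam(
    list(extractor.fc.parameters()) + list(decoder.parameters()), lr=1e-4)

def train_step(images, gt_tokens, lam=0.5):
    feats = extractor(images)                          # (B, 512) image features
    tf_logits = decoder(feats, gt_tokens=gt_tokens)    # inhibitor (teacher-forced) pass
    gen_tokens = decoder(feats).argmax(dim=-1)         # generator (free-running) pass
    score = disc(gen_tokens, feats)                    # global score from the discriminator
    loss = fused_objective(tf_logits, gt_tokens, score, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()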
Result
Model performance is evaluated on three datasets, namely, MSCOCO-2014, Oxford-102, and CUB-200-2011. On the CUB-200-2011 dataset, our method shows improvement over the method that uses the maximum likelihood estimate (MLE) as the optimization function (CIDEr +1.6%, BLEU-3 +0.2%, BLEU-2 +0.8%, BLEU-1 +0.7%, and ROUGE-L +0.5%). Model performance declines when the inhibitor is removed from the model (BLEU-4 -0.8%, BLEU-3 -1.2%, BLEU-2 -1.6%, BLEU-1 -0.9%, ROUGE-L -1.8%, and METEOR -1.0%). On the Oxford-102 dataset, our method gains additional improvements over the method that uses MLE as the optimization function (CIDEr +3.6%, BLEU-4 +0.7%, BLEU-3 +0.6%, BLEU-2 +0.4%, BLEU-1 +0.2%, ROUGE-L +0.6%, and METEOR +0.7%). Model performance declines substantially after removing the inhibitor (CIDEr -3.8%, BLEU-4 -1.5%, BLEU-3 -1.7%, BLEU-2 -1.4%, BLEU-1 -1.5%, ROUGE-L -0.5%, and METEOR -0.1%). On the MSCOCO-2014 dataset, our method achieves a leading position in several evaluation metrics compared with several previously proposed methods (BLEU-3 +0.4%, BLEU-2 +0.4%, BLEU-1 +0.4%, and ROUGE-L +0.3%).
Conclusion
The new optimization function and fusion training method take into account the dependency relationships between words as well as the semantic relevance between the generated description and the image. The proposed method obtains better scores on multiple evaluation metrics than the method that uses MLE as the optimization function. Experimental results indicate that the inhibitor has a positive effect on model performance in terms of the evaluation metrics while improving the coherence of the generated descriptions. In short, the method increases the coherence of the generated captions on the premise that the captions produced by the generator match the image content.