Cross-lingual image captioning based on semantic matching and language evaluation
2022, Vol. 27, No. 11, Pages 3343-3355
Print publication date: 2022-11-16
Accepted: 2021-11-02
DOI: 10.11834/jig.210588
Jing Zhang, Dan Guo, Peipei Song, Kun Li, Meng Wang. Cross-lingual image captioning based on semantic matching and language evaluation[J]. Journal of Image and Graphics, 2022,27(11):3343-3355.
Objective
Owing to the lack of paired image-caption data in the target language domain, existing cross-lingual captioning methods translate a pivot (source) language into the target language. Because of semantic noise introduced during this translation, the generated sentences are often disfluent and only weakly related to the visual content of the image. To address these problems, this paper proposes a cross-lingual image captioning model that incorporates semantic matching and language evaluation.
Method
First, an encoder-decoder image captioning framework is adopted as the baseline network. Second, to exploit the semantic knowledge contained in the image and its pivot-language description, a source-domain semantic matching module is constructed; to learn the language conventions of the target language domain, a target-language evaluation module is also constructed. Based on these two modules, the captioning model is optimized under semantic matching constraints and language guidance: 1) the image & pivot-language semantic matching module maps the image, the pivot-language description, and the target-language description into a common embedding space to measure the semantic consistency of the feature representations of each modality; 2) the target-language evaluation module scores the generated sentences according to the style of the target language.
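As a rough illustration of the semantic matching idea described above, the following minimal PyTorch sketch projects image, pivot-language caption, and generated target-language caption features into a common embedding space and scores their cosine similarity. The module name, layer sizes, and reward form are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the authors' exact architecture):
# project image, pivot-language caption, and generated target-language
# caption features into one embedding space and use cosine similarity
# as a semantic-consistency score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticMatcher(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=1024, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # image -> shared space
        self.txt_proj = nn.Linear(txt_dim, emb_dim)   # sentence -> shared space

    def embed_image(self, img_feat):
        return F.normalize(self.img_proj(img_feat), dim=-1)

    def embed_text(self, txt_feat):
        return F.normalize(self.txt_proj(txt_feat), dim=-1)

    def matching_reward(self, img_feat, pivot_feat, gen_feat):
        """Semantic-consistency score of a generated caption with respect to
        the image and its pivot-language description."""
        v = self.embed_image(img_feat)    # (B, emb_dim)
        p = self.embed_text(pivot_feat)   # (B, emb_dim)
        g = self.embed_text(gen_feat)     # (B, emb_dim)
        # Average similarity to the image and to the pivot caption.
        return 0.5 * (F.cosine_similarity(g, v, dim=-1)
                      + F.cosine_similarity(g, p, dim=-1))

# Toy usage with random features (batch of 4 samples).
matcher = SemanticMatcher()
img, pivot, gen = torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 1024)
print(matcher.matching_reward(img, pivot, gen))  # one score per sample
```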
Result
For the cross-lingual English image captioning task, the proposed method was tested on the MS COCO (Microsoft common objects in context) dataset. Compared with the best-performing methods, it improves the BLEU (bilingual evaluation understudy)-2, BLEU-3, BLEU-4, and METEOR (metric for evaluation of translation with explicit ordering) scores by 1.4%, 1.0%, 0.7%, and 1.3%, respectively. For the cross-lingual Chinese image captioning task, the method was tested on the AIC-ICC (image Chinese captioning from artificial intelligence challenge) dataset. Compared with the best-performing methods, it improves the BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and CIDEr (consensus-based image description evaluation) scores by 5.7%, 2.0%, 1.6%, 1.3%, 1.2%, and 3.4%, respectively.
Conclusion
In the proposed model, the image & pivot-language semantic matching module guides the model to learn richer semantic knowledge, and the target-language evaluation module constrains the model to generate more fluent sentences. The proposed model is well suited to cross-lingual image caption generation.
Objective
With the development of deep learning, image captioning has achieved great success. Image captioning can not only be applied to infant education, web search, and human-computer interaction, but can also help visually impaired people access information they cannot see. Most image captioning work has targeted English captions. However, the benefits of image captioning should also be extended to non-English speakers. The main challenge of cross-lingual image captioning is the lack of paired image-caption datasets in the target language, and collecting a large-scale captioning dataset for every target language is impractical. Thanks to existing large-scale English captioning datasets and translation models, using a pivot language (e.g., English) to bridge the image and the target language (e.g., Chinese) is currently the mainstream framework for cross-lingual image captioning. However, such a language-pivoted approach suffers from disfluency and poor semantic relevance to images. We therefore propose a cross-lingual image captioning model based on semantic matching and language evaluation.
Method
First, our model is built on a standard encoder-decoder framework, which extracts convolutional neural network (CNN)-based image features and generates the description with a recurrent neural network. The pivot-language (source-language) descriptions are transformed into target-language sentences via a translation API and are regarded as pseudo captioning labels of the images. Our model is initialized with these pseudo-labels. However, the captions generated by the initialized model tend to pile up high-frequency vocabulary, copy the language style of the pseudo-labels, or bear little relevance to the image content. It is worth noting that the human-written pivot-language caption is a correct description of the image and carries semantics consistent with it. Therefore, considering the semantic guidance of the image content and the pivot language, a semantic matching module is proposed based on the source corpus. Moreover, the language style of the generated captions differs greatly from human-written sentences in the target language. To learn the language style of the target language, a language evaluation module guided by the target-language corpus is proposed. These two modules impose semantic matching and language-style constraints on the optimization of the captioning model. The methodological contributions are as follows: 1) the semantic matching module is an embedding network built on source-domain images and language labels; to coordinate the semantic matching among the image, the pivot-language caption, and the generated sentence, these multimodal data are mapped into a common embedding space for semantic-relevance calculation, which strengthens the semantic link between the generated sentence and the visual content of the image; 2) the language evaluation module, based on the target-domain corpus, encourages the style of generated sentences to resemble the target-language style. Under the joint rewards of semantic matching and language evaluation, our model is optimized to generate more image-related sentences. The semantic matching reward and the language evaluation reward are applied in a reinforcement learning manner.
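To illustrate how the two rewards might jointly drive optimization in a reinforcement learning (self-critical) manner, the sketch below combines a semantic matching score and a language evaluation score into one reward and applies a REINFORCE-with-baseline update. The weight `alpha` and all function names are hypothetical and only convey the training scheme described above, not the paper's exact implementation.

```python
# Hedged sketch of joint-reward self-critical training (assumed form).
import torch

def joint_reward(match_score, lang_score, alpha=0.5):
    """Combine semantic-matching and language-evaluation rewards (alpha is assumed)."""
    return alpha * match_score + (1.0 - alpha) * lang_score

def self_critical_loss(log_probs, sampled_reward, greedy_reward):
    """log_probs: (B,) summed token log-probabilities of sampled captions.
    sampled_reward / greedy_reward: (B,) joint rewards of the sampled and
    greedy (baseline) captions."""
    advantage = sampled_reward - greedy_reward        # baseline-subtracted reward
    return -(advantage.detach() * log_probs).mean()   # policy-gradient loss

# Toy example with random numbers for a batch of 4 sampled captions.
log_probs = torch.randn(4, requires_grad=True)
r_sample = joint_reward(torch.rand(4), torch.rand(4))
r_greedy = joint_reward(torch.rand(4), torch.rand(4))
loss = self_critical_loss(log_probs, r_sample, r_greedy)
loss.backward()
print(float(loss))
```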
Result
To verify the effectiveness of the proposed model, we carried out two sub-task experiments. 1) The cross-lingual English image captioning task is evaluated on the Microsoft common objects in context (MS COCO) image-English dataset, with training on the image Chinese captioning from artificial intelligence challenge (AIC-ICC) dataset and the MS COCO English corpus. Compared with the state-of-the-art method, our scores for bilingual evaluation understudy (BLEU)-2, BLEU-3, BLEU-4, and metric for evaluation of translation with explicit ordering (METEOR) increase by 1.4%, 1.0%, 0.7%, and 1.3%, respectively. 2) The cross-lingual Chinese image captioning task is evaluated on the AIC-ICC image-Chinese dataset, with training on the MS COCO image-English dataset and the AIC-ICC Chinese corpus. Compared with the state-of-the-art method, the scores for BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and consensus-based image description evaluation (CIDEr) increase by 5.7%, 2.0%, 1.6%, 1.3%, 1.2%, and 3.4%, respectively.
Conclusion
The semantic matching module guides the model to learn the relevant semantics of the image and its pivot-language description. The language evaluation module learns the data distribution and language style of the target corpus. The semantic and language rewards are both beneficial for cross-lingual image captioning: they not only optimize the semantic relevance of the generated sentence but also further improve its fluency.
Keywords: cross-lingual; image captioning; reinforcement learning; neural network; pivot language