Image captioning combining hierarchical decoders and a dynamic fusion mechanism
2022, Vol. 27, No. 9, Pages: 2775-2787
Received: 2022-01-20; Revised: 2022-05-05; Accepted: 2022-05-12; Published in print: 2022-09-16
DOI: 10.11834/jig.211252

Objective
The attention mechanism is widely used in image captioning models: it automatically attends to different regions of the image to dynamically generate the textual description. However, it commonly suffers from a defocusing problem: when generating a word, the model sometimes attends to unimportant regions of an object, sometimes to the object's context, and sometimes ignores important targets in the image, so the resulting description is not accurate enough. To address this problem, we propose an image captioning model that combines hierarchical decoders with a dynamic fusion mechanism to improve captioning accuracy.
Method
We extend the Transformer architecture; the overall model consists of three modules: image feature encoding, multi-level text decoding, and adaptive fusion. The multi-level decoding structure repeatedly refines the predicted text, providing reliable feedback for the attention mechanism so that it is continually corrected and generates more accurate descriptions. In addition, a text fusion module adaptively fuses the coarse-to-fine descriptions so that the outputs of the lower-level decoders participate directly in word prediction, which not only alleviates vanishing gradients during training but also keeps the generated descriptions rich in detail and syntactically diverse.
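As an illustration of the multi-level decoding idea, the sketch below stacks standard Transformer decoders so that each level refines the hidden states produced by the previous one while attending to the same encoded image features. This is a minimal PyTorch sketch under our own naming (CascadedDecoders, layers_per_decoder, and the placeholder dimensions are assumptions), not the authors' released code.

```python
import torch
import torch.nn as nn

class CascadedDecoders(nn.Module):
    """Minimal sketch of multi-level text decoding: each decoder refines the
    previous decoder's states, conditioned on the same image features."""
    def __init__(self, d_model=512, nhead=8, num_decoders=2, layers_per_decoder=3):
        super().__init__()
        self.decoders = nn.ModuleList([
            nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
                num_layers=layers_per_decoder)
            for _ in range(num_decoders)
        ])

    def forward(self, word_embeddings, image_memory, tgt_mask=None):
        # word_embeddings: (batch, seq_len, d_model) embedded partial caption
        # image_memory:    (batch, num_regions, d_model) encoded image features
        outputs = []
        hidden = word_embeddings
        for decoder in self.decoders:
            # Later decoders consume the refined states of earlier ones, so their
            # attention over image regions is conditioned on a better draft.
            hidden = decoder(hidden, image_memory, tgt_mask=tgt_mask)
            outputs.append(hidden)
        return outputs  # coarse-to-fine hidden states, one tensor per level
```

The list of per-level states returned here is what an adaptive fusion module (sketched after the Method section below) would combine into the final word prediction.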
Result
The model is evaluated with multiple metrics on the MS COCO (Microsoft common objects in context) and Flickr30K datasets and compared with 12 representative methods. The results show that our model outperforms all of the compared methods. On MS COCO, relative to the best-performing compared model, BLEU-1 (bilingual evaluation understudy) improves by 0.5 and CIDEr (consensus-based image description evaluation) by 1.0; on Flickr30K, BLEU-1 improves by 0.1 and CIDEr by 0.6. Ablation studies verify the effectiveness of the cascaded structure and the adaptive fusion module, and qualitative analysis also shows that our method generates more accurate image descriptions.
Conclusion
Our method achieves the best performance on multiple evaluation metrics across several datasets. It effectively improves the accuracy of the generated text sequences and ultimately produces accurate descriptions of image content.
Objective
Image captioning aims to automatically generate a lingual description of an image. It has a wide range of application scenarios, such as image indexing, medical imaging report generation, and human-machine interaction. To generate fluent sentences from the gathered information, an image captioning algorithm must recognize the scene, the entities, and their relationships in the image. Over the past decade, a deep encoder-decoder framework has been developed for this task: a convolutional neural network (CNN) based encoder extracts feature vectors of the image, and a recurrent neural network (RNN) based decoder generates the description. Recent progress in image captioning has been driven by the attention mechanism, which improves performance by attending to informative image regions. However, most attention models predict the next attended regions from the previously generated words alone. Due to this lack of relevant textual guidance, most existing works suffer from "attention defocus", i.e., they fail to concentrate on the correct image regions when generating the target words. As a result, contemporary models are prone to "hallucinating" objects or missing informative visual clues, and the attention model becomes less interpretable. We therefore propose an integrated hierarchical architecture with a dynamic fusion strategy.
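For readers unfamiliar with the attention mechanism referred to above, the following minimal sketch shows the standard soft attention that captioning decoders build on: the decoder state scores each image region, and the softmax-normalized weights produce a context vector. This is illustrative background only, not the formulation proposed in this paper; the dot-product scoring is our assumption.

```python
import torch
import torch.nn.functional as F

def soft_attention(query, region_features):
    """Single-step soft attention over image regions.
    query:           (batch, d) decoder state at the current time step
    region_features: (batch, k, d) features of k image regions
    """
    scores = torch.bmm(region_features, query.unsqueeze(-1)).squeeze(-1)   # (batch, k)
    weights = F.softmax(scores, dim=-1)                                    # attention over regions
    context = torch.bmm(weights.unsqueeze(1), region_features).squeeze(1)  # (batch, d)
    return context, weights
```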
Method
The estimated word provides useful knowledge for predicting better-grounded regions, although it is hard to localize the correct regions from the previously generated words in a single pass. To refine the attention mechanism and improve the predicted words, we design a hierarchical architecture composed of a series of captioning decoders; it is a hierarchical variant of the conventional encoder-decoder framework. Specifically, the first decoder is a standard image captioning model that generates a coarse description as a draft. To ground the generated words in the correct image regions, each later decoder takes the outputs of the earlier decoder as input. Since the earlier decoder provides more predictive information about the target word, attention accuracy improves in the later decoders: regions attended by an early decoder are validated and corrected by the later decoders in a coarse-to-fine manner, so the final predicted words are properly grounded. Furthermore, we introduce a dynamic fusion strategy that aggregates the coarse-to-fine predictions from the different decoders. A gating mechanism weighs the contribution of each decoder to the final word prediction. Unlike previous gating mechanisms that manage the weight of each pathway independently, the contributions are computed jointly with a softmax over all decoders, which incorporates contextual information from every decoder to estimate the overall weight distribution. The dynamic fusion strategy yields rich, fine-grained image descriptions and alleviates vanishing gradients, which makes the hierarchical architecture easier to train.
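A minimal sketch of how such a softmax-based fusion could look is given below. The module name (DynamicFusion), the per-decoder linear word heads, and the single-linear gating are assumptions made for illustration; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFusion(nn.Module):
    """Sketch of gated fusion: per-decoder word logits are combined with weights
    computed by a softmax over all decoder states, so every level contributes."""
    def __init__(self, d_model=512, vocab_size=10000, num_decoders=2):
        super().__init__()
        self.word_heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(num_decoders)])
        self.gate = nn.Linear(d_model, 1)  # scores one decoder level's state

    def forward(self, decoder_states):
        # decoder_states: list of (batch, seq_len, d_model), one per decoder level
        stacked = torch.stack(decoder_states, dim=2)           # (B, T, L, D)
        gate_scores = self.gate(stacked).squeeze(-1)           # (B, T, L)
        weights = F.softmax(gate_scores, dim=-1)                # joint weights across levels
        logits = torch.stack(
            [head(h) for head, h in zip(self.word_heads, decoder_states)], dim=2)  # (B, T, L, V)
        fused = (weights.unsqueeze(-1) * logits).sum(dim=2)     # (B, T, V) fused word logits
        return fused
```

Because every decoder's output feeds the fused prediction directly, gradients reach the lower decoders without passing through all later levels, which is one way the fusion can ease training of the cascade.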
Result
Our method is evaluated on Microsoft common objects in context (MS COCO) and Flickr30K, the common benchmarks for image captioning. MS COCO contains about 120 k images and Flickr30K about 31 k; each image in both datasets comes with five reference descriptions. The model is trained and tested on the Karpathy splits. The quantitative metrics are bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ordering (METEOR), and consensus-based image description evaluation (CIDEr). We compare our model with 12 recent methods. On MS COCO, our model surpasses the best of them by 0.5 in BLEU-1 and 1.0 in CIDEr, and it reaches a CIDEr of 69.94 on Flickr30K. Compared with the baseline (Transformer), our method gains 4.6 CIDEr on MS COCO and 3.8 on Flickr30K, which verifies that it effectively improves the accuracy of the predicted sentences. In addition, the qualitative results show that the proposed method produces richer, fine-grained descriptions than the other methods: it precisely counts objects of the same category and accurately describes small objects. To further verify the effectiveness of the hierarchical architecture, we visualize the attention maps; our method attends to discriminative parts of the target objects, whereas the baseline may focus on irrelevant background, which directly leads to false predictions.
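As a concrete illustration of the reported BLEU-1 metric, the toy function below computes clipped unigram precision with a brevity penalty for one candidate caption against its references. The actual experiments use the standard COCO caption evaluation toolkit; this sketch only shows what the score measures.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Toy BLEU-1: clipped unigram precision times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    cand_counts = Counter(cand)
    # For each word, the maximum count observed in any single reference.
    max_ref_counts = Counter()
    for ref in refs:
        for w, c in Counter(ref).items():
            max_ref_counts[w] = max(max_ref_counts[w], c)
    clipped = sum(min(c, max_ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / max(len(cand), 1)
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision

# Example: one candidate caption scored against two human references.
print(bleu1("a man riding a horse on the beach",
            ["a man is riding a horse on the beach", "a person rides a horse"]))
```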
Conclusion
We present a hierarchical architecture with a dynamic fusion strategy for image captioning. The hierarchical architecture consists of a sequence of captioning decoders that progressively refine the attention mechanism, and the dynamic fusion strategy aggregates the decoders to generate the final sentence with rich, fine-grained information. The ablation study demonstrates the effectiveness of each module in the proposed network, and comparative experiments on the MS COCO and Flickr30K datasets confirm the improvements.
References
Anderson P, He X D, Buehler C, Teney D, Johnson M, Gould S and Zhang L. 2018. Bottom-up and top-down attention for image captioning and visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6077-6086 [DOI: 10.1109/CVPR.2018.00636]
Banerjee S and Lavie A. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments//Proceedings of 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, USA: Association for Computational Linguistics: 65-72
Cornia M, Stefanini M, Baraldi L and Cucchiara R. 2020. Meshed-memory transformer for image captioning//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10575-10584 [DOI: 10.1109/CVPR42600.2020.01059]
Farhadi A, Hejrati M, Sadeghi M A, Young P, Rashtchian C, Hockenmaier J and Forsyth D. 2010. Every picture tells a story: generating sentences from images//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Greece: Springer: 15-29 [DOI: 10.1007/978-3-642-15561-1_2]
Gu J X, Cai J F, Wang G and Chen T. 2018. Stack-captioning: coarse-to-fine learning for image captioning//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI: 6837-6844
Guo L T, Liu J, Lu S C and Lu H Q. 2020. Show, tell, and polish: ruminant decoding for image captioning. IEEE Transactions on Multimedia, 22(8): 2149-2162 [DOI: 10.1109/TMM.2019.2951226]
Hendricks L A, Burns K, Saenko K, Darrell T and Rohrbach A. 2018. Women also snowboard: overcoming bias in captioning models//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 793-811 [DOI: 10.1007/978-3-030-01219-9_47]
Huang L, Wang W M, Chen J and Wei X Y. 2019a. Attention on attention for image captioning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4633-4642 [DOI: 10.1109/ICCV.2019.00473]
Huang L, Wang W M, Xia Y X and Chen J. 2019b. Adaptively aligned image captioning via adaptive attention time//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates, Inc.: 8942-8951
Jiang H Z, Misra I, Rohrbach M, Learned-Miller E and Chen X L. 2020. In defense of grid features for visual question answering//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10264-10273 [DOI: 10.1109/CVPR42600.2020.01028]
Karpathy A and Li F F. 2015. Deep visual-semantic alignments for generating image descriptions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3128-3137 [DOI: 10.1109/CVPR.2015.7298932]
Kuznetsova P, Ordonez V, Berg T L and Choi Y. 2014. TreeTalk: composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2: 351-362 [DOI: 10.1162/tacl_a_00188]
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Liu C X, Mao J H, Sha F and Yuille A. 2017. Attention correctness in neural image captioning//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI: 4176-4182
Lu J S, Xiong C M, Parikh D and Socher R. 2017. Knowing when to look: adaptive attention via a visual sentinel for image captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3242-3250 [DOI: 10.1109/CVPR.2017.345]
Luo H L and Yue L L. 2020. Image caption based on causal convolutional decoding with cross-layer multi-model feature fusion. Journal of Image and Graphics, 25(8): 1604-1617 [DOI: 10.11834/jig.190543]
Ma C Y, Kalantidis Y, AlRegib G, Vajda P, Rohrbach M and Kira Z. 2020. Learning to generate grounded visual captions without localization supervision//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 353-370 [DOI: 10.1007/978-3-030-58523-5_21]
Papineni K, Roukos S, Ward T and Zhu W J. 2002. BLEU: a method for automatic evaluation of machine translation//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, USA: ACL: 311-318 [DOI: 10.3115/1073083.1073135]
Plummer B A, Wang L W, Cervantes C M, Caicedo J C, Hockenmaier J and Lazebnik S. 2015. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2641-2649 [DOI: 10.1109/ICCV.2015.303]
Rennie S J, Marcheret E, Mroueh Y, Ross J and Goel V. 2017. Self-critical sequence training for image captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1179-1195 [DOI: 10.1109/CVPR.2017.131]
Rohrbach A, Hendricks L A, Burns K, Darrell T and Saenko K. 2018. Object hallucination in image captioning//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: ACL: 4035-4045 [DOI: 10.18653/v1/d18-1437]
Tan Y L, Tang P J, Zhang L and Luo Y P. 2021. From image to language: image captioning and description. Journal of Image and Graphics, 26(4): 727-750 [DOI: 10.11834/jig.200177]
Tang P J, Tan Y L and Li J Z. 2017. Image description based on the fusion of scene and object category prior knowledge. Journal of Image and Graphics, 22(9): 1251-1260 [DOI: 10.11834/jig.170052]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Vedantam R, Zitnick C L and Parikh D. 2015. CIDEr: consensus-based image description evaluation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 4566-4575 [DOI: 10.1109/CVPR.2015.7299087]
Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3156-3164 [DOI: 10.1109/CVPR.2015.7298935]
Wan B Y, Jiang W H, Fang Y M, Zhu M W, Li Q and Liu Y. 2022. Revisiting image captioning via maximum discrepancy competition. Pattern Recognition, 122: #108358 [DOI: 10.1016/j.patcog.2021.108358]
Xu K, Ba J L, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R S and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention//Proceedings of the 32nd International Conference on International Conference on Machine Learning. Lille, France: PMLR: 2048-2057
Zhang W Q, Shi H C, Tang S L, Xiao J, Yu Q and Zhuang Y T. 2021. Consensus graph representation learning for better grounded image captioning//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Virtual: AAAI: 3394-3402
Zhou L W, Kalantidis Y, Chen X L, Corso J J and Rohrbach M. 2019. Grounded video description//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6571-6580 [DOI: 10.1109/CVPR.2019.00674]
Zhou Y E, Wang M, Liu D Q, Hu Z Z and Zhang H W. 2020. More grounded image captioning by distilling image-text matching model//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 4776-4785 [DOI: 10.1109/CVPR42600.2020.00483]