Image captioning based on causal convolutional decoding with cross-layer multi-model feature fusion
2020, Vol. 25, No. 8, Pages 1604-1617
Received: 2019-10-24
Revised: 2020-02-05
Accepted: 2020-02-12
Published in print: 2020-08-16
DOI: 10.11834/jig.190543
Objective
The accuracy and plausibility of image captioning results depend on two aspects of a model's information processing: how richly the visual module extracts feature information and how well the language module handles sentences that describe complex scenes. However, existing image captioning models use only one encoder to extract image features, which easily loses feature information and prevents a comprehensive understanding of the semantics of the input image. Moreover, using an RNN (recurrent neural network) or LSTM (long short-term memory) to model sentences tends to ignore their basic hierarchical structure and learns long word sequences poorly. To address these problems, an image captioning model based on cross-layer multi-model feature fusion and causal convolutional decoding is proposed.
Method
In the visual feature extraction module, a cross-layer feature fusion structure from low layers to high layers is added to each individual model so that semantic features and detail features complement each other, and multiple encoders are trained to extract image features, which helps to fully describe and represent the image semantics. In the language module, causal convolution is used to model the long word sequences that describe complex scenes, yielding a set of word features. An attention mechanism then connects and matches the image features with the word features to learn the correlations between the text information and different regions of the image, and the prediction module combined with a Softmax function finally gives the prediction probability of each word.
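To make the encoder-side idea concrete, the following is a minimal sketch (PyTorch assumed) of a single encoder with cross-layer feature fusion: a low-level (detail) feature map is projected and pooled to the resolution of the high-level (semantic) map, and the two are fused with a 1×1 convolution. The VGG16 backbone, the split points, and the fusion operator are illustrative assumptions rather than the paper's exact configuration; in the full model, several such encoders would be trained and their outputs combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class CrossLayerFusionEncoder(nn.Module):
    """One visual encoder: fuses low-level detail features with high-level semantics."""
    def __init__(self, out_dim=512):
        super().__init__()
        vgg = models.vgg16().features          # load ImageNet weights in practice
        self.low_layers = vgg[:17]             # conv1_1 ... pool3: detail features (assumed split)
        self.high_layers = vgg[17:30]          # conv4_1 ... relu5_3: semantic features
        self.low_proj = nn.Conv2d(256, out_dim, kernel_size=1)
        self.high_proj = nn.Conv2d(512, out_dim, kernel_size=1)
        self.fuse = nn.Conv2d(2 * out_dim, out_dim, kernel_size=1)

    def forward(self, images):                 # images: (B, 3, 224, 224)
        low = self.low_layers(images)          # (B, 256, 28, 28) detail map
        high = self.high_layers(low)           # (B, 512, 14, 14) semantic map
        low = F.adaptive_avg_pool2d(self.low_proj(low), high.shape[-2:])
        fused = self.fuse(torch.cat([low, self.high_proj(high)], dim=1))
        return fused                           # (B, out_dim, 14, 14) region features
```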
Result
The model was validated on the MS COCO (Microsoft common objects in context) and Flickr30k datasets with several evaluation metrics, and the experimental results show that the proposed model performs well. Its BLEU (bilingual evaluation understudy)-1 score, which reflects the accuracy of the generated words, reaches 72.1%, and it outperforms other mainstream methods on several other metrics: its B-4 score exceeds that of the strong Hard-ATT ("hard" attention) method by 6.0%, its B-1 and CIDEr (consensus-based image description evaluation) scores exceed those of the emb-g (embedding guidance) LSTM method by 5.1% and 13.3%, respectively, and compared with the ConvCap (convolutional captioning) method, which also adopts a CNN (convolutional neural network) + CNN strategy, the proposed model improves B-1 by 0.3%.
Conclusion
The proposed model effectively extracts and preserves the semantic information in images with complex backgrounds and is able to process long word sequences, so it describes image content more accurately and expresses richer information.
Objective
The results of image captioning are influenced by the richness of the extracted image features, but existing methods use only one encoder for feature extraction and are thereby unable to fully learn the semantics of an image, which may lead to inaccurate captions for images with complicated content. Meanwhile, to generate accurate and reasonable captions, the ability of the language module to process sentences describing complex contexts plays an important role. However, the current mainstream methods that use an RNN (recurrent neural network) or LSTM (long short-term memory) tend to ignore the basic hierarchical structure of sentences and therefore do not express long sequences of words well. To address these issues, an image captioning model based on cross-layer multi-model feature fusion and causal convolutional decoding (CMFF/CD) is proposed in this paper.
Method
In the visual feature extraction stage, given the loss of feature information as image features propagate through the convolutional layers, a cross-layer feature fusion structure from the low to the high levels is added to realize information complementarity between the semantic and detail features. Afterward, multiple encoders are trained to extract features from the input image. When the information contained in an image is highly complex, these encoders play a supplementary role in fully describing and representing the image semantics. Each image in the training dataset corresponds to manually labeled sentences that are used to train the language decoder. As the sentences become longer and more complex, the learning ability of the language model decreases, making it harder to learn the relationships among objects. Causal convolution can model long sequences of words to express complex contexts and is therefore used in the proposed language module to obtain the word features. An attention mechanism is then proposed to match the image features with the word features, so that each word feature corresponds to an object feature in the image. The model thereby not only accurately describes the image content but also learns the correlation between the text information and different regions of the image. The prediction probability of each word is finally computed by the prediction module using the Softmax function.
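A minimal sketch (PyTorch assumed) of the decoder-side idea follows: left-padded ("causal") 1-D convolutions turn the word embeddings into word features without seeing future tokens, a dot-product attention step matches every word feature against the image region features, and a Softmax over the vocabulary yields the word probabilities. The dimensions, layer count, and exact attention form are illustrative assumptions, not the reported architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvDecoder(nn.Module):
    """Causal convolutional language module with attention over image regions."""
    def __init__(self, vocab_size, emb_dim=512, kernel_size=3, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.left_pad = kernel_size - 1        # pad only on the left = causal
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, emb_dim, kernel_size) for _ in range(num_layers)])
        self.out = nn.Linear(2 * emb_dim, vocab_size)

    def forward(self, words, regions):
        # words: (B, T) token ids; regions: (B, R, emb_dim) encoder region features
        x = self.embed(words).transpose(1, 2)                  # (B, emb_dim, T)
        for conv in self.convs:
            x = F.relu(conv(F.pad(x, (self.left_pad, 0))))     # no future context
        w = x.transpose(1, 2)                                  # (B, T, emb_dim) word features
        att = F.softmax(w @ regions.transpose(1, 2), dim=-1)   # (B, T, R) word-region weights
        ctx = att @ regions                                    # attended image context
        return F.log_softmax(self.out(torch.cat([w, ctx], dim=-1)), dim=-1)
```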
Result
The model was validated on the Microsoft common objects in context (MS COCO) and Flickr30k datasets by using various evaluation metrics. The experimental results demonstrate that the proposed model performs competitively, especially in describing complex scene images. Compared with other mainstream methods, the proposed model not only specifies the scene information of an image (e.g., restaurants) but also identifies specific objects in the scene and accurately describes their categories. Compared with attention fully convolutional network (ATT-FCN), spatial and channel-wise attention-convolutional neural network (SCA-CNN), and the part-of-speech (POS) method, the proposed model generates richer image information in its description sentences and handles long word sequences better. This model can describe "toothbrush, sink/tunnel/audience", "bed trailer, bus/mother/cruise ship", and other objects in an image, whereas other models based on the CNN + LSTM architecture cannot. Although the ConvCap model, which also uses the CNN + CNN architecture, can describe multiple objects in an image and assign them some attribute descriptions, the CMFF/CD model provides more accurate and detailed descriptions, such as "bread, peppers/curtain, blow dryer". In addition, while both models can describe "computer", the "desktop computer" description given by the proposed model is more precise than the "black computer" description derived by the ConvCap model. Meanwhile, the sentence structure produced by the proposed model is very close to human expression. Regarding the quality of the generated sentences, the bilingual evaluation understudy (BLEU)-1 score of the proposed model, which reflects the accuracy of word generation, reaches 72.1%. The model also obtains a 6.0% higher B-4 score than the Hard-ATT ("hard" attention) method, highlighting its ability to match local image features with word vectors and to fully exploit local information for detailed descriptions. It also outperforms the emb-gLSTM method in terms of B-1 and CIDEr (consensus-based image description evaluation) by 5.1% and 13.3%, respectively, and the ConvCap method, which also uses the CNN + CNN strategy, in terms of B-1 by 0.3%.
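As a reading aid for the reported scores, the snippet below (NLTK assumed; the published numbers come from the standard evaluation toolkits for these datasets) illustrates what the B-1 and B-4 figures measure: modified n-gram precision of a generated caption against its reference captions, using unigrams only for BLEU-1 and up to 4-grams for BLEU-4.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Two human reference captions and one generated caption (toy example).
references = [
    "a man is riding a bike down the street".split(),
    "a person rides a bicycle on a city road".split(),
]
candidate = "a man rides a bike down a city street".split()

smooth = SmoothingFunction().method1
b1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0), smoothing_function=smooth)
b4 = sentence_bleu(references, candidate, smoothing_function=smooth)  # default 4-gram weights
print(f"BLEU-1: {b1:.3f}  BLEU-4: {b4:.3f}")
```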
Conclusion
The proposed captioning model can effectively extract and preserve the semantic information in images with complex backgrounds and can process long sequences of words. In expressing the hierarchical relationships within complex background information, the proposed model effectively uses convolutional neural networks (i.e., causal convolution) to process the text information. The experimental results show that the proposed model produces highly accurate descriptions of image content and expresses rich information.
Aneja J, Deshpande A and Schwing A G. 2018. Convolutional image captioning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 5561-5570 [DOI: 10.1109/CVPR.2018.00583]
Chen L, Zhang H W, Xiao J, Nie L Q, Shao J, Liu W and Chua T S. 2017. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 6298-6306 [DOI: 10.1109/CVPR.2017.667]
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734 [DOI: 10.3115/v1/D14-1179]
Chung J, Gulcehre C, Cho K and Bengio Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1412.3555v1.pdf
Dauphin Y N, Fan A, Auli M and Grangier D. 2017. Language modeling with gated convolutional networks//Proceedings of the 34th International Conference on Machine Learning. Sydney, NSW, Australia: PMLR: 933-941
Denkowski M and Lavie A. 2014. Meteor universal: language specific translation evaluation for any target language//Proceedings of the Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics: 376-380 [DOI: 10.3115/v1/W14-3348]
Deshpande A, Aneja J, Wang L W, Schwing A G and Forsyth D. 2019. Fast, diverse and accurate image captioning guided by part-of-speech//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA: IEEE: 10687-10696 [DOI: 10.1109/CVPR.2019.01095]
Dognin P L, Melnyk I, Mroueh Y, Ross J and Sercu T. 2018. Improved image captioning with adversarial semantic alignment [EB/OL]. [2019-09-14]. https://arxiv.org/pdf/1805.00063v2.pdf
Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T and Saenko K. 2015. Long-term recurrent convolutional networks for visual recognition and description//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 2625-2634 [DOI: 10.1109/CVPR.2015.7298878]
Fang H, Gupta S, Iandola F, Srivastava R K, Deng L, Dollár P, Gao J F, He X D, Mitchell M, Platt J C, Zitnick C L and Zweig G. 2015. From captions to visual concepts and back//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 1473-1482 [DOI: 10.1109/CVPR.2015.7298754]
Gan Z, Gan C, He X D, Pu Y C, Tran K, Gao J F, Carin L and Deng L. 2017. Semantic compositional networks for visual captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 1141-1150 [DOI: 10.1109/CVPR.2017.127]
Gehring J, Auli M, Grangier D, Yarats D and Dauphin Y N. 2017. Convolutional sequence to sequence learning [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1705.03122.pdf
Gu J X, Wang G, Cai J F and Chen T. 2017. An empirical study of language CNN for image captioning//Proceedings of the International Conference on Computer Vision. Venice, Italy: IEEE: 1231-1240 [DOI: 10.1109/ICCV.2017.138]
Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735]
Jia X, Gavves E, Fernando B and Tuytelaars T. 2015. Guiding the long-short term memory model for image caption generation//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2407-2415 [DOI: 10.1109/ICCV.2015.277]
Johnson J, Karpathy A and Li F F. 2016. DenseCap: fully convolutional localization networks for dense captioning//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 4565-4574 [DOI: 10.1109/CVPR.2016.494]
Karpathy A and Li F F. 2015. Deep visual-semantic alignments for generating image descriptions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 3128-3137 [DOI: 10.1109/CVPR.2015.7298932]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: ACM: 1097-1105 [DOI: 10.1145/3065386]
Kuznetsova P, Ordonez V, Berg A C, Berg T L and Choi Y. 2012. Collective generation of natural image descriptions//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Republic of Korea: ACM: 359-368
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Liu C X, Mao J H, Sha F and Yuille A. 2016. Attention correctness in neural image captioning [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1605.09553.pdf
Mao J H, Xu W, Yang Y, Wang J, Huang Z H and Yuille A. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN) [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1412.6632v1.pdf
Mitchell M, Han X F, Dodge J, Mensch A, Goyal A, Berg A, Yamaguchi K, Berg T, Stratos K and Daumé H. 2012. Midge: generating image descriptions from computer vision detections//Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: ACM: 747-756
Papineni K, Roukos S, Ward T and Zhu W J. 2002. BLEU: a method for automatic evaluation of machine translation//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia: ACM: 311-318 [DOI: 10.3115/1073083.1073135]
Plummer B A, Wang L W, Cervantes C M, Caicedo J C, Hockenmaier J and Lazebnik S. 2015. Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2641-2649 [DOI: 10.1109/ICCV.2015.303]
Pu Y C, Gan Z, Henao R, Yuan X, Li C Y, Stevens A and Carin L. 2016. Variational autoencoder for deep learning of images, labels and captions//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: ACM: 2360-2368
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1409.1556.pdf
Tan Y H and Chan C. 2017. phi-LSTM: a phrase-based hierarchical LSTM model for image captioning//Proceedings of the 13th Asian Conference on Computer Vision. Taipei, China: Springer: 101-117 [DOI: 10.1007/978-3-319-54193-8_7]
Tang P J, Tan Y L and Li J Z. 2017. Image description based on the fusion of scene and object category prior knowledge. Journal of Image and Graphics, 22(9): 1251-1260 (汤鹏杰, 谭云兰, 李金忠. 2017. 融合图像场景及物体先验知识的图像描述生成模型. 中国图象图形学报, 22(9): 1251-1260) [DOI: 10.11834/jig.170052]
van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A and Kavukcuoglu K. 2016a. WaveNet: a generative model for raw audio [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1609.03499.pdf
van den Oord A, Kalchbrenner N, Vinyals O, Espeholt L, Graves A and Kavukcuoglu K. 2016b. Conditional image generation with PixelCNN decoders [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1606.05328v2.pdf
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1706.03762.pdf
Vedantam R, Zitnick C L and Parikh D. 2015. CIDEr: consensus-based image description evaluation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 4566-4575 [DOI: 10.1109/CVPR.2015.7299087]
Vinyals O, Toshev A, Bengio S and Erhan D. 2017. Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4): 652-663 [DOI: 10.1109/TPAMI.2016.2587640]
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention [EB/OL]. [2019-09-23]. https://arxiv.org/pdf/1502.03044v2.pdf
Yang Z L, Yuan Y, Wu Y X, Cohen W W and Salakhutdinov R. 2016. Review networks for caption generation//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: ACM: 2369-2377
Yao T, Pan Y W, Li Y H, Qiu Z F and Mei T. 2017. Boosting image captioning with attributes//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4904-4912 [DOI: 10.1109/ICCV.2017.524]
You Q Z, Jin H L, Wang Z W, Fang C and Luo J B. 2016. Image captioning with semantic attention//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 4651-4659 [DOI: 10.1109/CVPR.2016.503]
Zhao W Q, Yan H and Shao X Q. 2018. Object detection based on improved non-maximum suppression algorithm. Journal of Image and Graphics, 23(11): 1676-1685 (赵文清, 严海, 邵绪强. 2018. 改进的非极大值抑制算法的目标检测. 中国图象图形学报, 23(11): 1676-1685) [DOI: 10.11834/jig.180275]