Visual question answering model based on bottom-up attention and memory network
2020, Vol. 25, No. 5, Pages: 993-1006
Received: 2019-07-22; Revised: 2019-10-14; Accepted: 2019-10-21; Published in print: 2020-05-16
DOI: 10.11834/jig.190366
Objective
Most existing visual question answering (VQA) models adopt a top-down visual attention mechanism that treats the image content uniformly, without content-based weighting, and therefore cannot represent image information well. Moreover, lacking a long-term memory module, they cannot store information over long periods, so useful information is lost while reasoning about the answer and wrong answers are predicted. We therefore propose a VQA model that combines a bottom-up attention mechanism with a memory network and improves VQA accuracy by enhancing the representation and memory of image content.
Method
A pre-trained object detection model extracts the objects and salient regions of an image as image features, which are fed, together with the question representation, into a memory network. Guided by the question, the memory network retrieves useful information from the input image features and iterates and updates several times over the input image information and question representation to generate the final information representation. Finally, the information memorized by the network is fused with the question representation to infer the correct answer.
Result
Comparison and ablation experiments against existing mainstream algorithms on the public large-scale VQA (visual question answering) v2.0 dataset show that the proposed model markedly improves accuracy on the VQA task, reaching an overall accuracy of 64.0%. Compared with the MCB (multimodal compact bilinear) algorithm, overall accuracy increases by 1.7%; compared with the strong VQA-Machine algorithm, overall accuracy increases by 1%, with gains of 1.1%, 3.4%, and 0.6% on yes/no, counting, and other question types, respectively. The overall performance surpasses that of the compared algorithms, verifying the effectiveness of the proposed method.
Conclusion
The proposed VQA model, which combines a bottom-up attention mechanism with a memory network, is more consistent with the human visual attention mechanism and reduces information loss during answer reasoning, effectively improving VQA accuracy.
Objective
Visual question answering (VQA) lies at the intersection of computer vision and natural language processing and is one of the key research directions in artificial intelligence. It is an important research task because of its multimodal nature, clear evaluation protocol, and potential real-world applications. The prevailing approach to VQA rests on three components. First, question answering is posed as classification over a set of candidate answers; questions in current VQA datasets are mostly visual in nature, and the correct answers consist of a small set of keywords or short phrases. Second, most VQA models are deep neural networks that learn a joint embedding of image and question features, using convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to map the two inputs into fixed-size vector representations. Third, owing to the success of deep learning on supervised learning problems, the whole network is trained end-to-end from questions, images, and their ground-truth answers.
However, using global image features as the visual input introduces noise at the answer-prediction stage. Inspired by human visual attention and related research, visual attention mechanisms have been widely used in VQA to alleviate this problem. Most attention mechanisms in conventional VQA models are of the top-down variety: they are trained to attend selectively to the output of one or more CNN layers and to predict a weight for each grid region of the image. This approach ignores the image content itself, so image information cannot be represented more precisely. Human attention, in contrast, concentrates on objects and other salient image regions; to generate more human-like answers, such regions are a much more natural basis for attention. In addition, the lack of a long-term memory module means that information is lost while reasoning about the answer, so wrong answers may be inferred and VQA performance suffers. To address these problems, we propose a VQA model based on bottom-up attention and a memory network, which improves VQA accuracy by enhancing the representation and memory of image content.
Method
We use image features from bottom-up attention, which provide region-specific features rather than the traditional CNN grid-like feature maps. We implement bottom-up attention with Faster R-CNN (region-based CNN) on top of the ResNet-101 CNN, a natural expression of a bottom-up attention mechanism. To pre-train the bottom-up attention model, we first initialize Faster R-CNN with ResNet-101 pre-trained for classification on ImageNet and then train it on Visual Genome data. To aid the learning of good feature representations, we add an extra training output that predicts attribute classes in addition to object classes.
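The paper pre-trains Faster R-CNN with a ResNet-101 backbone on Visual Genome with an extra attribute head; as a rough stand-in for that detector, the sketch below uses torchvision's off-the-shelf Faster R-CNN (ResNet-50 FPN backbone, COCO weights) and keeps the highest-scoring boxes as salient regions. The per-region feature pooling and attribute head are omitted, and `max_regions` and `min_score` are illustrative parameters, not values from the paper.

```python
import torch
import torchvision

# Off-the-shelf stand-in for the paper's detector: Faster R-CNN with a ResNet-50 FPN
# backbone pretrained on COCO (the paper uses ResNet-101 trained on Visual Genome
# with an additional attribute-prediction head).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def salient_regions(image, max_regions=36, min_score=0.2):
    """Return boxes and scores of up to `max_regions` confident detections.

    `image` is a float tensor of shape (3, H, W) with values in [0, 1].
    """
    with torch.no_grad():
        out = detector([image])[0]           # dict with 'boxes', 'labels', 'scores'
    keep = out["scores"] >= min_score        # drop low-confidence proposals
    boxes, scores = out["boxes"][keep], out["scores"][keep]
    order = scores.argsort(descending=True)[:max_regions]
    return boxes[order], scores[order]
```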
To improve computational efficiency, questions are trimmed to a maximum of 14 words: extra words are simply discarded, and questions shorter than 14 words are end-padded with zero vectors. A bidirectional gated recurrent unit (GRU) extracts the question features, and its final state serves as the question embedding.
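A minimal sketch of this question-encoding step, assuming 300-dimensional word embeddings (e.g., GloVe, which appears in the reference list) and a hidden size of 512; both sizes are illustrative assumptions rather than values stated in this abstract.

```python
import torch
import torch.nn as nn

MAX_WORDS, EMB_DIM, HIDDEN = 14, 300, 512

class QuestionEncoder(nn.Module):
    """Trim/pad a question to 14 word embeddings and encode it with a bi-GRU."""

    def __init__(self, emb_dim=EMB_DIM, hidden=HIDDEN):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, word_embeddings):           # (batch, n_words, emb_dim)
        batch, n, d = word_embeddings.shape
        if n > MAX_WORDS:                         # discard extra words
            word_embeddings = word_embeddings[:, :MAX_WORDS]
        elif n < MAX_WORDS:                       # end-pad with zero vectors
            pad = word_embeddings.new_zeros(batch, MAX_WORDS - n, d)
            word_embeddings = torch.cat([word_embeddings, pad], dim=1)
        _, h_n = self.gru(word_embeddings)        # h_n: (2, batch, hidden)
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # final states of both directions
```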
The purpose of the memory network is to retrieve the information needed to answer the question from the input image facts and to retain it over time. To improve understanding of the question and image, especially when the question requires transitive (multi-step) reasoning, the memory network may pass over the input several times and update its memory after each pass. The memory network consists of two parts: an attention module and an episodic memory update module. Each iteration computes attention weights over the input vectors to generate a new episode and then updates the memory through the update module. The image features and question representation are fed into the memory network to obtain the final episodic memory.
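The abstract does not give the update equations, so the following is one plausible, DMN-style reading (in the spirit of Kumar et al., 2016 and Xiong et al., 2016 from the reference list): on each pass, attention weights over the region features are computed from the question and the current memory, their weighted sum forms the episode context, and a GRU cell updates the memory. The number of hops, layer sizes, and exact attention form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EpisodicMemory(nn.Module):
    """Iterative attention + memory update over region features (illustrative)."""

    def __init__(self, feat_dim, q_dim, hops=3):
        super().__init__()
        self.hops = hops
        self.attn = nn.Sequential(                  # scores one region given (v, q, m)
            nn.Linear(feat_dim + 2 * q_dim, 512), nn.Tanh(), nn.Linear(512, 1))
        self.update = nn.GRUCell(feat_dim, q_dim)   # memory update from episode context

    def forward(self, regions, question):           # regions: (B, K, feat_dim); question: (B, q_dim)
        memory = question                            # initialize memory with the question
        K = regions.size(1)
        expand = lambda x: x.unsqueeze(1).expand(-1, K, -1)
        for _ in range(self.hops):
            scores = self.attn(torch.cat([regions, expand(question), expand(memory)], dim=-1))
            weights = torch.softmax(scores, dim=1)   # attention over the K regions
            episode = (weights * regions).sum(dim=1) # context vector for this pass
            memory = self.update(episode, memory)    # update the episodic memory
        return memory
```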
Finally, the question representation and the final episodic memory are each passed through nonlinear fully connected layers and combined by simple concatenation, and the result is fed into the output classifier to infer the correct answer. Unlike the softmax classifier commonly used in most VQA models, we treat VQA as a multi-label classification task and use a sigmoid activation to predict a score for each of the $N$ candidate answers. The sigmoid allows multiple correct answers per question and normalizes the final scores to (0, 1), so the final stage can be regarded as a logistic regression that predicts the correctness of each candidate answer.
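A hedged sketch of this output stage as described above: the question embedding and the final memory each pass through a nonlinear fully connected layer, are concatenated, and a final sigmoid-activated layer scores each candidate answer; training against soft ground-truth targets with binary cross-entropy realizes the per-answer logistic regression view. All layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """Fuse question and memory, then score N candidate answers with sigmoids."""

    def __init__(self, q_dim, m_dim, num_answers, hidden=1024):
        super().__init__()
        self.q_proj = nn.Sequential(nn.Linear(q_dim, hidden), nn.ReLU())
        self.m_proj = nn.Sequential(nn.Linear(m_dim, hidden), nn.ReLU())
        self.out = nn.Linear(2 * hidden, num_answers)

    def forward(self, question, memory):
        fused = torch.cat([self.q_proj(question), self.m_proj(memory)], dim=-1)
        return torch.sigmoid(self.out(fused))        # one score in (0, 1) per candidate

# Training treats each candidate answer as its own binary (logistic) decision:
#   scores = classifier(question, memory)            # (batch, num_answers)
#   loss = nn.functional.binary_cross_entropy(scores, soft_targets)
```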
Result
The main performance metric is the standard VQA accuracy: the average ground-truth score of the predicted answers over all questions, which accounts for the occasional divergence of ground-truth answers among annotators.
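Concretely, the standard soft accuracy introduced with the VQA dataset (Antol et al., 2015) scores a predicted answer $a$ against the ten human annotations as

$$\mathrm{Acc}(a)=\min\left(1,\ \frac{\#\{\text{annotators who gave answer }a\}}{3}\right)$$

so an answer given by at least three annotators receives full credit, and the reported accuracy averages this score over all questions.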
The proposed VQA model is evaluated on the VQA v2.0 dataset; our test-server submissions are trained on the training and validation sets. The overall accuracy is 64.0%, with 80.9% on yes/no questions, 44.3% on counting questions, and 54.0% on other question types. Several existing mainstream algorithms are compared with our model on VQA v2.0, and our model achieves higher accuracy both overall and on each question type. Ablation experiments on VQA v2.0 further show that combining the bottom-up attention mechanism with the memory network outperforms using either module alone, demonstrating the effectiveness of the proposed algorithm.
Conclusion
In this study, we propose a VQA model that combines a bottom-up attention mechanism with a memory network, offering a new way of thinking about VQA. Our model draws on the advantages of both bottom-up attention and memory networks and is more consistent with the human visual attention mechanism. Moreover, it can retain useful information over long periods and reduces the loss of information while reasoning about answers, thereby effectively improving the accuracy of VQA.
Anderson P, He X D, Buehler C, Teney D, Johnson M, Gould S and Zhang L. 2018. Bottom-up and top-down attention for image captioning and visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 6077-6086 [DOI: 10.1109/CVPR.2018.00636]
Andreas J, Rohrbach M, Darrell T and Klein D. 2016. Neural module networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 39-48 [DOI: 10.1109/CVPR.2016.12]
Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. VQA: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2425-2433 [DOI: 10.1109/ICCV.2015.279]
Chandar S, Ahn S, Larochelle H, Vincent P, Tesauro G and Bengio Y. 2016. Hierarchical memory networks [EB/OL]. [2019-07-21]. https://arxiv.org/pdf/1605.07427.pdf
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734 [DOI: 10.3115/v1/D14-1179]
Fukui A, Park D H, Yang D, Rohrbach A, Darrell T and Rohrbach M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding//Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin, TX, USA: Association for Computational Linguistics: 457-468 [DOI: 10.18653/v1/D16-1044]
Goyal Y, Khot T, Summers-Stay D, Batra D and Parikh D. 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 6325-6334 [DOI: 10.1109/CVPR.2017.670]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Kim J H, Lee S W, Kwak D H, Heo M O, Kim J, Ha J W and Zhang B T. 2016. Multimodal residual learning for visual QA//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 361-369
Kingma D P and Ba J. 2014. Adam: a method for stochastic optimization [EB/OL]. [2019-07-21]. https://arxiv.org/pdf/1412.6980.pdf
Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S and Li F F. 2017. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32-73 [DOI: 10.1007/s11263-016-0981-7]
Kumar A, Irsoy O, Ondruska P, Iyyer M, Bradbury J, Gulrajani I, Zhong V, Paulus R and Socher R. 2016. Ask me anything: dynamic memory networks for natural language processing//Proceedings of the 33rd International Conference on Machine Learning. New York, NY, USA: JMLR.org: 1378-1387
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Lin Y T, Pang Z Y, Wang D H and Zhuang Y T. 2017. Task-driven visual saliency and attention-based visual question answering [EB/OL]. [2019-07-21]. https://arxiv.org/pdf/1702.06700.pdf
Lu J S, Lin X, Batra D and Parikh D. 2015. Deeper LSTM and normalized CNN visual question answering model [EB/OL]. [2019-07-21]. https://github.com/VT-vision-lab/VQA_LSTM_CNN
Lu J S, Yang J W, Batra D and Parikh D. 2016. Hierarchical question-image co-attention for visual question answering//Proceedings of the 30th Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS: 289-297
Ma C, Shen C H, Dick A, Wu Q, Wang P, van den Hengel A and Reid I. 2018. Visual question answering with memory-augmented networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 6975-6984 [DOI: 10.1109/CVPR.2018.00729]
Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1532-1543 [DOI: 10.3115/v1/D14-1162]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI: 10.1007/s11263-015-0816-y]
Shih K J, Singh S and Hoiem D. 2016. Where to look: focus regions for visual question answering//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 4613-4621 [DOI: 10.1109/CVPR.2016.499]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2019-07-21]. https://arxiv.org/pdf/1409.1556.pdf
Sukhbaatar S, Szlam A, Weston J and Fergus R. 2015. Weakly supervised memory networks [EB/OL]. [2019-07-21]. https://arxiv.org/pdf/1503.08895.pdf
Wang P, Wu Q, Shen C H and van den Hengel A. 2017. The VQA-machine: learning how to use existing vision algorithms to answer new questions//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 3909-3918 [DOI: 10.1109/CVPR.2017.416]
Weston J, Chopra S and Bordes A. 2015. Memory networks [EB/OL]. [2019-07-21]. https://arxiv.org/pdf/1410.3916.pdf
Wu Q, Shen C H, Wang P, Dick A and van den Hengel A. 2018. Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1367-1381 [DOI: 10.1109/TPAMI.2017.2708709]
Wu Q, Wang P, Shen C H, Dick A and van den Hengel A. 2016. Ask me anything: free-form visual question answering based on knowledge from external sources//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 4622-4630 [DOI: 10.1109/CVPR.2016.500]
Xiong C M, Merity S and Socher R. 2016. Dynamic memory networks for visual and textual question answering//Proceedings of the 33rd International Conference on Machine Learning. New York, NY, USA: JMLR.org: 2397-2406
Xu H J and Saenko K. 2016. Ask, attend and answer: exploring question-guided spatial attention for visual question answering//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 451-466 [DOI: 10.1007/978-3-319-46478-7]
Yang Z C, He X D, Gao J F, Deng L and Smola A. 2016. Stacked attention networks for image question answering//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 21-29 [DOI: 10.1109/CVPR.2016.10]
Yu D F. 2019. Attention Mechanism and High-level Semantics for Visual Question Answering. Hefei: University of Science and Technology of China. http://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CDFD&filename=1019057817.nh
Yu D F, Fu J L, Mei T and Rui Y. 2017. Multi-level attention networks for visual question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 4709-4717 [DOI: 10.1109/CVPR.2017.446]
Zhang W M. 2018. Research and Implement for Question Answering Based on Deep Learning and Knowledge Graph Embedding. Beijing: Beijing University of Posts and Telecommunications. http://cdmd.cnki.com.cn/Article/CDMD-10013-1018162133.htm
Zhou B L, Tian Y D, Sukhbaatar S, Szlam A and Fergus R. 2015. Simple baseline for visual question answering [EB/OL]. [2019-07-21]. https://arxiv.org/pdf/1512.02167.pdf
Zhou Y X and Yu J. 2018. Design of image question and answer system based on deep learning. Computer Applications and Software, 35(12): 199-208 [DOI: 10.3969/j.issn.1000-386x.2018.12.038]