Question-guided spatial relation graph reasoning model for visual question answering
2022, Vol. 27, No. 7: 2274-2286
Print publication date: 2022-07-16
Accepted: 2021-04-27
DOI: 10.11834/jig.200611
Hong Lan, Pufen Zhang. Question-guided spatial relation graph reasoning model for visual question answering[J]. Journal of Image and Graphics, 2022,27(7):2274-2286.
Objective
Existing visual question answering (VQA) models are built mainly around attention mechanisms and multimodal fusion. They neither explicitly model the semantic links between objects in the image scene nor give much weight to the spatial positions of objects relative to one another, which leaves their spatial relation reasoning ability weak. To address VQA questions that require spatial relation reasoning, this paper proposes to model the image structurally using the spatial relation attributes between visual objects and constructs a question-guided spatial relation graph reasoning model for visual question answering.
Method
Saliency-based attention is first applied: Faster R-CNN (region-based convolutional neural network) extracts the salient visual objects and their visual features from the image. The visual objects and their spatial relations are then modeled structurally as a spatial relation graph. Question-guided focused attention carries out question-based spatial relation reasoning; it is divided into node attention and edge attention, which respectively identify the visual objects and the spatial relations relevant to the question. The node attention and edge attention weights are used to build a gated graph reasoning network, whose message-passing mechanism and gated aggregation of feature information capture the deep interactions between nodes and yield spatially aware visual feature representations, thus achieving question-based spatial relation reasoning. Finally, the spatial relation-aware image features and the question features are fused across modalities to predict the correct answer.
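The following is a minimal sketch of how the spatial structure of the image could be turned into edge attributes of the spatial relation graph, assuming Faster R-CNN returns bounding boxes in (x1, y1, x2, y2) format. The function name `spatial_edge_features` and the five-dimensional encoding (normalized center offsets, log size ratios, and IoU) are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def spatial_edge_features(boxes):
    """Compute pairwise spatial relation features for K detected regions.

    boxes: (K, 4) array of Faster R-CNN boxes as (x1, y1, x2, y2).
    Returns a (K, K, 5) array of edge attributes for the spatial relation
    graph: normalized center offsets, log size ratios, and IoU.
    """
    k = boxes.shape[0]
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0               # box centers
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = np.clip(boxes[:, 2] - boxes[:, 0], 1e-6, None)   # box widths
    h = np.clip(boxes[:, 3] - boxes[:, 1], 1e-6, None)   # box heights

    edges = np.zeros((k, k, 5), dtype=np.float32)
    for i in range(k):
        for j in range(k):
            dx = (cx[j] - cx[i]) / w[i]                  # horizontal offset, scaled by box i
            dy = (cy[j] - cy[i]) / h[i]                  # vertical offset, scaled by box i
            dw = np.log(w[j] / w[i])                     # relative width
            dh = np.log(h[j] / h[i])                     # relative height
            ix1 = max(boxes[i, 0], boxes[j, 0])          # intersection rectangle
            iy1 = max(boxes[i, 1], boxes[j, 1])
            ix2 = min(boxes[i, 2], boxes[j, 2])
            iy2 = min(boxes[i, 3], boxes[j, 3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            iou = inter / (w[i] * h[i] + w[j] * h[j] - inter)
            edges[i, j] = (dx, dy, dw, dh, iou)
    return edges
```

A fully connected graph with these attributes, together with the region features as node states, is one straightforward way to realize the spatial relation graph described above; sparser graphs could be obtained by thresholding the offset or IoU terms.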
Result
The model is trained, validated, and tested on the VQA (visual question answering) v2 dataset. The experimental results show that, compared with models such as Prior, Language only, MCB (multimodal compact bilinear), ReasonNet, and Bottom-Up, the proposed model clearly improves every accuracy measure. Relative to ReasonNet, the overall answering accuracy rises by 2.73%, the accuracy on yes/no questions by 4.41%, on counting questions by 5.37%, and on other questions by 0.65%. Ablation experiments are also conducted and verify the effectiveness of the method.
Conclusion
The proposed question-guided spatial relation graph reasoning model for VQA matches the textual information of the question well with the image target regions and the relations between objects; in particular, it shows strong reasoning ability on questions that require spatial relation reasoning.
Objective
Current visual question answering (VQA) methods are mostly based on attention mechanisms and multimodal fusion. Deep learning has strongly advanced both computer vision and natural language processing (NLP), and interdisciplinary tasks between language and vision, such as VQA, have attracted growing attention. VQA is regarded as an AI-complete task and serves as a proxy for evaluating progress toward artificial intelligence (AI). A VQA model needs to fully understand the visual scene of the image, especially the interactions between multiple objects, so the task inherently requires visual reasoning over the relationships between image objects.
Method
Our question-guided spatial relation graph reasoning (QG-SRGR) model is proposed to address the issue of spatial relation reasoning in VQA by exploiting the inherent spatial relation properties between image objects. First, a saliency-based attention mechanism is used: the salient visual objects and their visual features are extracted with the faster region-based convolutional neural network (Faster R-CNN). Next, the visual objects and their spatial relations are structured as a spatial relation graph. The visual objects in the image are defined as the vertices of the graph, and the edges are dynamically constructed from the inherent spatial relations between the visual objects. Then, question-guided focused attention is used to conduct question-based spatial relation reasoning. Focused attention is divided into node attention and edge attention: node attention finds the visual objects most relevant to the question, and edge attention discovers the spatial relations most relevant to the question. Furthermore, a gated graph reasoning network (GGRN) is constructed from the node attention weights and the edge attention weights, and the features of neighboring nodes are aggregated by the GGRN. Therefore, deep interaction information between nodes is obtained, a visual feature representation with spatial awareness is learned, and question-based spatial relation reasoning is achieved. Finally, the spatial relation-aware image features and the question features are fused to predict the correct answer.
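To illustrate the reasoning stage described above, the sketch below combines question-guided node attention, edge attention, and one gated aggregation step in PyTorch. It is a hedged approximation rather than the paper's exact formulation: the module name `QuestionGuidedGraphStep`, the use of a GRU cell for the gated update, and the way the two attention maps weight the messages are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class QuestionGuidedGraphStep(nn.Module):
    """One illustrative round of question-guided graph reasoning.

    v: (K, d_v) region features; e: (K, K, d_e) spatial edge features;
    q: (d_q,) question embedding (e.g., from a GRU over word vectors).
    """

    def __init__(self, d_v, d_e, d_q, d_h):
        super().__init__()
        self.node_att = nn.Linear(d_v + d_q, 1)  # question-guided node attention
        self.edge_att = nn.Linear(d_e + d_q, 1)  # question-guided edge attention
        self.msg = nn.Linear(d_v, d_h)           # message transform
        self.gru = nn.GRUCell(d_h, d_v)          # gated update of node states

    def forward(self, v, e, q):
        k = v.size(0)
        # node attention: how relevant each object is to the question
        alpha = torch.softmax(
            self.node_att(torch.cat([v, q.expand(k, -1)], dim=-1)).squeeze(-1), dim=0)
        # edge attention: how relevant each spatial relation is to the question
        beta = torch.softmax(
            self.edge_att(torch.cat([e, q.expand(k, k, -1)], dim=-1)).squeeze(-1), dim=-1)
        # aggregate messages from neighbors, weighted by both attentions
        weights = beta * alpha.unsqueeze(0)      # (K, K): weight of node j sending to node i
        messages = weights @ self.msg(v)         # (K, d_h)
        return self.gru(messages, v)             # spatially aware node features, (K, d_v)
```

Stacking a few such steps and then fusing the attention-pooled node features with the question embedding (for instance, by elementwise product followed by a classifier over candidate answers) would mirror the fusion-and-prediction stage described above; this usage note is likewise an assumption for illustration only.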
Result
Our QG-SRGR model is trained, validated, and tested on the VQA v2.0 dataset. The results show that the overall accuracy on the Test-dev set is 66.43%, where the accuracy on yes/no questions is 83.58%, the accuracy on counting questions is 45.61%, and the accuracy on other question types is 56.62%. On the Test-std set, the corresponding accuracies are 66.65%, 83.86%, 45.36%, and 56.93%, respectively. On the Test-std set, the QG-SRGR model improves on the accuracy achieved by the ReasonNet model by 2.73%, 4.41%, 5.37%, and 0.65% for the overall, yes/no, counting, and other questions, respectively. In addition, ablation experiments are carried out on the validation set, and their results verify the effectiveness of our method.
Conclusion
Our proposed QG-SRGR model matches the textual information of the question well with the image target regions and the spatial relations of objects, especially for questions that require spatial relation reasoning, on which it demonstrates strong reasoning ability.
visual question answering (VQA); graph convolution neural network (GCN); attention mechanism; spatial relation reasoning; multimodal learning
Anderson P, He X D, Buehler C, Teney D, Johnson M, Gould S and Zhang L. 2018. Bottom-up and top-down attention for image captioning and visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6077-6086 [DOI: 10.1109/CVPR.2018.00636]
Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. VQA: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2425-2433 [DOI: 10.1109/ICCV.2015.279]
Ben-Younes H, Cadene R, Cord M and Thome N. 2017. MUTAN: multimodal tucker fusion for visual question answering//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2631-2639 [DOI: 10.1109/ICCV.2017.285]
Chen X L, Li L J, Li F F and Gupta A. 2018. Iterative visual reasoning beyond convolutions//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7239-7248 [DOI: 10.1109/CVPR.2018.00756]
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734 [DOI: 10.3115/v1/D14-1179]
Fukui A, Park D H, Yang D, Rohrbach A, Darrell T and Rohrbach M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding//Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin, USA: Association for Computational Linguistics: 457-468 [DOI: 10.18653/v1/D16-1044]
Gao P, Jiang Z K, You H X, Lu P, Hoi S C H, Wang X G and Li H S. 2019. Dynamic fusion with intra- and inter-modality attention flow for visual question answering//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 6632-6641 [DOI: 10.1109/CVPR.2019.00680]
Goyal Y, Khot T, Summers-Stay D, Batra D and Parikh D. 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6325-6334 [DOI: 10.1109/CVPR.2017.670]
Hamilton W L, Ying R and Leskovec J. 2017. Inductive representation learning on large graphs//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 1025-1035
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Ilievski I and Feng J S. 2017. Multimodal learning and reasoning for visual question answering//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 551-562
Kim J H, Jun J and Zhang B T. 2018. Bilinear attention networks//Proceedings of the 32nd Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc.: 1564-1574
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2020-11-03]. https://arxiv.org/pdf/1412.6980.pdf
Kipf T N and Welling M. 2017. Semi-supervised classification with graph convolutional networks [EB/OL]. [2020-11-03]. https://arxiv.org/pdf/1609.02907.pdf
Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S and Li F F. 2017. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32-73[DOI: 10.1007/s11263-016-0981-7]
Lu J S, Yang J W, Batra D and Parikh D. 2016. Hierarchical question-image co-attention for visual question answering//Proceedings of the 30th Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS: 289-297
Norcliffe-Brown W, Vafeias E and Parisot S. 2018. Learning conditioned graph structures for interpretable visual question answering//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc.: 8334-8343
Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1532-1543 [DOI: 10.3115/v1/D14-1162]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149[DOI: 10.1109/TPAMI.2016.2577031]
Sun Q and Fu Y W. 2019. Stacked self-attention networks for visual question answering//Proceedings of the 2019 International Conference on Multimedia Retrieval. Ottawa, Canada: ACM: 207-211 [DOI: 10.1145/3323873.3325044]
Wu Q, Teney D, Wang P, Shen C H, Dick A and Van Den Hengel A. 2017. Visual question answering: a survey of methods and datasets. Computer Vision and Image Understanding, 163: 21-40 [DOI: 10.1016/j.cviu.2017.05.001]
Yan R Y and Liu X L. 2020. Visual question answering model based on bottom-up attention and memory network. Journal of Image and Graphics, 25(5): 993-1006 [DOI: 10.11834/jig.190366]
Yang Z C, He X D, Gao J F, Deng L and Smola A. 2016. Stacked attention networks for image question answering//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 21-29 [DOI: 10.1109/CVPR.2016.10]
Yu D F, Fu J L, Mei T and Rui Y. 2017. Multi-level attention networks for visual question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4187-4195 [DOI: 10.1109/CVPR.2017.446]
Zhou B L, Tian Y D, Sukhbaatar S, Szlam A and Fergus R. 2015. Simple baseline for visual question answering [EB/OL]. [2020-11-03]. https://arxiv.org/pdf/1512.02167.pdf