Leading weight-driven re-position relation network for figure question answering
Vol. 28, Issue 2, Pages: 510-521 (2023)
Published: 16 February 2023
Accepted: 24 December 2021
DOI: 10.11834/jig.211026
Ying Li, Qingfeng Wu, Jiatong Liu, Jialong Zou. Leading weight-driven re-position relation network for figure question answering [J]. Journal of Image and Graphics, 28(2): 510-521 (2023)
Objective
Figure question answering (FQA) is an important multimodal learning task in computer vision. The simple pairwise matching of the traditional relation network (RN) model covers the relations between all pixels and therefore achieves good results, but it includes redundant information, and the quadratically growing number of relation-pair features places a heavy computational and parameter burden on the subsequent reasoning network. To address this problem, we propose a leading weight-driven re-position relation network model based on fused semantic feature extraction.
Method
First, richer semantic information of the statistical chart is extracted by fusing low-level and high-level image features for the scene task, and an attention-based text encoder is proposed to realize fused semantic feature extraction. The leading weights are then sorted to re-position the image features, yielding the re-position relation network model.
Result
Experiments are conducted on two datasets. On the FigureQA (an annotated figure dataset for visual reasoning) dataset, the overall accuracy of our method is 26.4%, 8.1%, and 0.46% higher than IMG+QUES (image+questions), RN, and ARN (appearance and relation networks), respectively; on the single validation set, it is 2.3% and 2.0% higher than LEAF-Net (locate, encode and attend for figure network) and FigureNet. On the DVQA (understanding data visualization via question answering) dataset, without OCR (optical character recognition), the overall accuracy is 8.6%, 0.12%, and 2.13% higher than SANDY (san with dynamic encoding model), ARN, and RN, respectively; with the Oracle version, it is 23.3%, 7.09%, and 4.8% higher than SANDY, LEAF-Net, and RN, respectively.
Conclusion
For the figure question answering task, the proposed algorithm improves accuracy on both of the open-source datasets DVQA and FigureQA.
Objective
Figure-based question and answer (Q&A) learns basic information representations of data in real scenes and, jointly with the text of the question, provides the evidence needed for reasoning. It is a widely studied multimodal learning task. Existing methods fall into two categories. 1) End-to-end neural network methods: a convolutional neural network processes the statistical chart to obtain an image feature map, a recurrent neural network encodes the question text into a sentence-level embedding vector, and a fusion inference model produces the answer. In recent years, to capture an overall representation of the fused multimodal features, attention mechanisms have been applied to the image feature matrix as input to the text encoder. However, interactions between relation features in the multimodal scene strongly hinder the extraction of effective semantic features. 2) Multi-module methods decompose the task into multiple steps: separate modules first extract feature information, the extracted information is fed to subsequent modules, and the final output is produced by the later modules in the pipeline. However, such methods rely on additional annotation information to train the individual modules, and their complexity is considerably higher. We therefore develop a leading weight-driven re-position relation network model based on fused semantic feature extraction.
Method
Our leading weight-driven re-position relation network consists of three modules: image feature extraction, an attention-based long short-term memory (LSTM) encoder, and a joint leading weight-driven re-position relation network. 1) For the image feature extraction module, features are extracted by fusing convolutional layers and up-sampling layers. To make the extracted image features better suited to the scene task, we fuse a convolutional neural network with a U-Net architecture to build a network model that captures the semantics of both low-level and high-level image features, as in the sketch below.
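The following is a minimal PyTorch sketch of such a fused extractor. The layer widths, depths, and the `FusedFeatureExtractor` name are illustrative assumptions, not the paper's exact configuration; the point is the U-Net-style skip connections that merge low-level detail with high-level semantics.

```python
# Minimal sketch of a fused low-/high-level feature extractor (assumed sizes).
import torch
import torch.nn as nn

class FusedFeatureExtractor(nn.Module):
    """U-Net-style encoder-decoder: skip connections fuse low-level
    detail with high-level semantics into one output feature map."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(base * 4, base * 2, 3, padding=1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())

    def forward(self, x):                     # x: (B, in_ch, H, W), H and W divisible by 4
        e1 = self.enc1(x)                     # low-level features, full resolution
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)                    # high-level semantics, 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # fuse via skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return d1                             # fused feature map, (B, base, H, W)
```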
2) For the attention-based LSTM module, we derive question-based reasoning feature representations via an attention mechanism. A plain LSTM only carries the influence of already-seen words forward to later words; to obtain a better sentence vector, we use attention to capture different contextual information, as shown in the sketch after this paragraph.
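Below is a minimal sketch of an attention-pooled LSTM question encoder in this spirit; the dimensions and the single-layer attention scorer are assumptions rather than the paper's reported settings.

```python
# Minimal sketch of an attention-based question encoder (assumed dims).
import torch
import torch.nn as nn

class AttnLSTMEncoder(nn.Module):
    """Runs an LSTM over the question tokens, then pools all hidden
    states with learned attention instead of keeping only the last state."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.attn = nn.Linear(hid_dim, 1)     # scores each time step

    def forward(self, tokens):                # tokens: (B, T) int64 word ids
        h, _ = self.lstm(self.embed(tokens))  # h: (B, T, hid_dim)
        alpha = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (B, T) weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1)  # (B, hid_dim) sentence vector
```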
3) For the joint leading weight-driven re-position relation network module, we propose a guided pairwise matching mechanism that steers the matching of relation features in the relation network: the inner product between each pixel's feature vector and the feature vectors of all pixels gives its similarity to every point, and averaging these similarities over the whole set yields the pixel's leading weight. Although this resolves the high-complexity problem and yields a sequence of matched relation-feature pairs, it loses the overall relational balance that the original exhaustive pairwise matching provides. A re-position operation is therefore carried out to restore that balance: a) remove from the relation-feature pair set each pixel's pairing with itself; b) swap positions within each pixel's relation-feature list following a fixed one-step exchange applied iteratively; and c) append the pixels' location information and the sentence-level embedding. In particular, each relation feature is composed of three parts: the feature vectors of the two pixels, the coordinate values of the two pixels, and the embedding representation of the question text. A sketch of this pair construction follows.
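The sketch below illustrates this pair construction under stated assumptions: the leading weight is taken as the mean inner-product similarity, a simple rank-neighbor (rolled) pairing stands in for the paper's iterative exchange rule, the `build_relation_pairs` name is hypothetical, and the RN's downstream reasoning MLPs are omitted.

```python
# Minimal sketch of leading weight-driven re-position pairing
# (pair construction only; the RN reasoning networks g/f are omitted).
import torch

def build_relation_pairs(feat_map, q_emb):
    """feat_map: (C, H, W) fused image features; q_emb: (D,) question vector.
    Returns one relation feature per pixel instead of all O(N^2) pairs."""
    C, H, W = feat_map.shape
    o = feat_map.flatten(1).t()                       # (N, C), N = H*W pixel vectors
    # Append each pixel's normalized (x, y) coordinates to its feature.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs.flatten() / W, ys.flatten() / H], dim=1)
    o = torch.cat([o, coords], dim=1)                 # (N, C+2)
    # Leading weight: mean inner-product similarity of a pixel to all pixels.
    sim = o @ o.t()                                   # (N, N) similarities
    weight = sim.mean(dim=1)                          # (N,) leading weights
    order = torch.argsort(weight, descending=True)    # rank pixels by weight
    # Re-position: pair rank k with rank k+1; self-pairs never arise since
    # the rolled partner always differs from the pixel itself.
    partner = torch.roll(order, shifts=-1)
    q = q_emb.unsqueeze(0).expand(o.size(0), -1)      # broadcast question embedding
    return torch.cat([o[order], o[partner], q], dim=1)  # (N, 2*(C+2)+D)
```

Pairing each pixel with a single weight-ranked partner keeps the number of relation features linear in the pixel count, versus the quadratic growth of exhaustive pairwise matching.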
Result
We compare our method with six recent methods on the two datasets. 1) On the FigureQA (an annotated figure dataset for visual reasoning) dataset, the overall accuracy is increased by 26.4%, 8.1%, and 0.46% relative to IMG+QUES (image+questions), relation networks (RN), and ARN (appearance and relation networks), respectively. 2) On the single validation set, the accuracy is increased by 2.3% and 2.0% relative to LEAF-Net (locate, encode and attend for figure network) and FigureNet, respectively. 3) On the DVQA (understanding data visualization via question answering) dataset, the overall accuracy is increased by 8.6%, 0.12%, and 2.13% relative to SANDY (san with dynamic encoding model), ARN, and RN, respectively. 4) For the Oracle version, the overall accuracy is increased by 23.3%, 7.09%, and 4.8% relative to SANDY, LEAF-Net, and RN, respectively.
Conclusion
Our model outperforms the baseline models on the two large open-source statistical-chart question answering datasets.
computer vision; figure question answering (FQA); multimodal fusion; attention mechanism; relation network (RN); deep learning
Andreas J, Rohrbach M, Darrell T and Klein D. 2016. Learning to compose neural networks for question answering//Proceedings of 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, USA: Association for Computational Linguistics: 1545-1554 [DOI: 10.18653/v1/N16-1181]
Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. VQA: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2425-2433 [DOI: 10.1109/ICCV.2015.279]
Brill E, Dumais S and Banko M. 2002. An analysis of the AskMSR question-answering system//Proceedings of 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Philadelphia, USA: Association for Computational Linguistics: 257-264 [DOI: 10.3115/1118693.1118726]
Chaudhry R, Shekhar S, Gupta U, Maneriker P, Bansal P and Joshi A. 2020. LEAF-QA: locate, encode and attend for figure question answering//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Snowmass, USA: IEEE: 3501-3510 [DOI: 10.1109/WACV45572.2020.9093269]
Echihabi A and Marcu D. 2003. A noisy-channel approach to question answering//Proceedings of the 41st Annual Meeting on Association for Computational Linguistics. Sapporo, Japan: Association for Computational Linguistics: 16-23 [DOI: 10.3115/1075096.1075099]
Fukui H, Hirakawa T, Yamashita T and Fujiyoshi H. 2019. Attention branch network: learning of attention mechanism for visual explanation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 10697-10706 [DOI: 10.1109/CVPR.2019.01096]
Gers F A, Schmidhuber J and Cummins F. 2000. Learning to forget: continual prediction with LSTM. Neural Computation, 12(10): 2451-2471 [DOI: 10.1162/089976600300015015]
Gibson R F. 2010. A review of recent research on mechanics of multifunctional composite materials and structures. Composite Structures, 92(12): 2793-2810 [DOI: 10.1016/j.compstruct.2010.05.003]
Goel D, Jain S, Vishwakarma D K and Bansal A. 2021. Automatic image colorization using U-Net//Proceedings of the 12th International Conference on Computing Communication and Networking Technologies (ICCCNT). Kharagpur, India: IEEE: 1-7 [DOI: 10.1109/ICCCNT51525.2021.9580001]
He W J, Liu Y Y, Feng J F, Zhang W W, Gu G H and Chen Q. 2020. Low-light image enhancement combined with attention map and U-Net network//Proceedings of the 3rd IEEE International Conference on Information Systems and Computer Aided Education (ICISCAE). Dalian, China: IEEE: 397-401 [DOI: 10.1109/ICISCAE51034.2020.9236828]
Johnson J, Hariharan B, Van Der Maaten L, Li F F, Zitnick C L and Girshick R. 2017. CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 1988-1997 [DOI: 10.1109/CVPR.2017.215]
Kafle K, Price B, Cohen S and Kanan C. 2018. DVQA: understanding data visualizations via question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5648-5656 [DOI: 10.1109/CVPR.2018.00592]
Kafle K, Shrestha R, Price B, Cohen S and Kanan C. 2020. Answering questions about data visualizations using efficient bimodal fusion//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Snowmass, USA: IEEE: 1487-1496 [DOI: 10.1109/WACV45572.2020.9093494]
Kahou S E, Michalski V, Atkinson A, Kádár Á, Trischler A and Bengio Y. 2017. FigureQA: an annotated figure dataset for visual reasoning//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: OpenReview.net
Luong T, Pham H and Manning C D. 2015. Effective approaches to attention-based neural machine translation//Proceedings of 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics: 1412-1421 [DOI: 10.18653/v1/D15-1166]
Mertens K C, Verbeke L P C, Westra T and De Wulf R R. 2004. Sub-pixel mapping and sub-pixel sharpening using neural network predicted wavelet coefficients. Remote Sensing of Environment, 91(2): 225-236 [DOI: 10.1016/j.rse.2004.03.003]
Miller J, Krauth K, Recht B and Schmidt L. 2020. The effect of natural distribution shift on question answering models//Proceedings of the 37th International Conference on Machine Learning [EB/OL]. [2021-10-22]. https://arxiv.org/pdf/2004.14444v1.pdf
Pal A, Chang S and Konstan J A. 2012. Evolution of experts in question answering communities//Proceedings of the 6th International AAAI Conference on Weblogs and Social Media. Dublin, Ireland: AAAI: 274-281
Ranjan P, Patil S and Ansari R A. 2020. U-Net based MRA framework for segmentation of remotely sensed images//Proceedings of 2020 International Conference on Artificial Intelligence and Signal Processing (AISP). Amaravati, India: IEEE: 1-4 [DOI: 10.1109/AISP48273.2020.9073131]
Reddy R, Ramesh R, Deshpande A and Khapra M M. 2019. FigureNet: a deep learning model for question-answering on scientific plots//Proceedings of 2019 International Joint Conference on Neural Networks (IJCNN). Budapest, Hungary: IEEE: 1-8 [DOI: 10.1109/IJCNN.2019.8851830]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Santoro A, Raposo D, Barrett D G T, Malinowski M, Pascanu R, Battaglia P and Lillicrap T. 2017. A simple neural network module for relational reasoning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 4937-4983
Shopovska I, Jovanov L and Philips W. 2018. RGB-NIR demosaicing using deep residual U-Net//The 26th Telecommunications Forum (TELFOR). Belgrade, Serbia: IEEE: 1-4 [DOI: 10.1109/TELFOR.2018.8611819]
Teney D, Anderson P, He X D and Van Den Hengel A. 2018. Tips and tricks for visual question answering: learnings from the 2017 challenge//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4223-4232 [DOI: 10.1109/CVPR.2018.00444]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang Y Q, Huang M L, Zhu X Y and Zhao L. 2016. Attention-based LSTM for aspect-level sentiment classification//Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin, USA: Association for Computational Linguistics: 606-615 [DOI: 10.18653/v1/D16-1058]
Wu Y X and Nakayama H. 2020. Graph-based heuristic search for module selection procedure in neural module network//Proceedings of the 15th Asian Conference on Computer Vision. Kyoto, Japan: Springer: 560-575 [DOI: 10.1007/978-3-030-69535-4_34]
Yan R Y and Liu X L. 2020. Visual question answering model based on bottom-up attention and memory network. Journal of Image and Graphics, 25(5): 993-1006 [DOI: 10.11834/jig.190366]
Yang J L, Guo X J and Chen Z H. 2021. Road extraction method from remote sensing images based on improved U-Net network. Journal of Image and Graphics, 26(12): 3005-3014 [DOI: 10.11834/jig.200579]
Zhang J B, Zhu X D, Chen Q, Dai L R, Wei S and Jiang H. 2017. Exploring question understanding and adaptation in neural-network-based question answering [EB/OL]. [2021-10-22]. https://arxiv.org/pdf/1703.04617.pdf
Zhou B L, Tian Y D, Sukhbaatar S, Szlam A and Fergus R. 2015. Simple baseline for visual question answering [EB/OL]. [2021-10-22]. https://arxiv.org/pdf/1512.02167.pdf
Zou J L, Wu G L, Xue T F and Wu Q F. 2020. An affinity-driven relation network for figure question answering//Proceedings of 2020 IEEE International Conference on Multimedia and Expo (ICME). London, UK: IEEE: 1-6 [DOI: 10.1109/ICME46284.2020.9102911]