融合知识表征的多模态Transformer场景文本视觉问答
Knowledge-representation-enhanced multimodal Transformer for scene text visual question answering
2022, Vol. 27, No. 9: 2761-2774
Received: 2022-01-05; Revised: 2022-06-01; Accepted: 2022-06-08; Published in print: 2022-09-16
DOI: 10.11834/jig.211213
Objective
Existing visual question answering (VQA) methods usually attend only to the visual objects in an image and neglect the key textual content it contains, which limits the depth and precision of image understanding. Given the importance of the text embedded in an image for understanding it, researchers proposed the scene text VQA task to quantify a model's ability to understand scene text, and constructed the corresponding benchmark datasets TextVQA (text visual question answering) and ST-VQA (scene text visual question answering). Focusing on the scene text VQA task, and addressing the performance bottleneck of existing self-attention-based methods caused by the risk of overfitting, this paper proposes a knowledge-representation-enhanced multimodal Transformer for scene text VQA that effectively improves the robustness and accuracy of the model.
Method
We improve the existing baseline model M4C (multimodal multi-copy mesh) by modeling two complementary types of prior knowledge: the spatial relationships between visual objects and the semantic relationships between text words. On this basis, we design a general knowledge-representation-enhanced attention module that encodes the two types of relationships in a unified manner, yielding the knowledge-representation-enhanced KR-M4C (knowledge-representation-enhanced M4C) method.
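At its core, the semantic relationship knowledge reduces to pairwise similarity between word embeddings. The sketch below illustrates this with cosine similarity over toy stand-in vectors; the paper uses GloVe embeddings, and the function name and the 4-D toy vectors here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def semantic_similarity(emb):
    """Pairwise cosine similarity between word embeddings.
    emb: (N, d) array, one embedding per word.
    Returns an (N, N) similarity matrix usable as relational knowledge."""
    norm = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norm, 1e-8)   # guard against zero vectors
    return unit @ unit.T

# Toy 3-word vocabulary with 4-D stand-in embeddings (not real GloVe vectors).
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [1.0, 0.1, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
S = semantic_similarity(emb)
# Words 0 and 1 are near-parallel (similarity close to 1);
# words 0 and 2 are orthogonal (similarity 0).
```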
Result
KR-M4C is compared with state-of-the-art methods on the TextVQA and ST-VQA scene text VQA benchmarks. On the TextVQA dataset, compared with the best result among the compared methods, our method improves test-set accuracy by 2.4% without additional training data, and by 1.1% when the ST-VQA dataset is added as training data. On the ST-VQA dataset, it improves the average normalized Levenshtein similarity on the test set by 5% over the best compared result. Ablation experiments on TextVQA further verify the effectiveness of the two types of prior knowledge and show that the proposed KR-M4C model improves the accuracy of the predicted answers.
Conclusion
The proposed KR-M4C method yields significant improvements on both the TextVQA and ST-VQA scene text VQA benchmarks and achieves the best results on this task.
Objective
Deep neural network technology has greatly advanced research in computer vision and natural language processing. Applications such as face recognition, optical character recognition (OCR), and machine translation are now widely used. These advances have also enabled machine learning to handle more complex multimodal tasks that involve both vision and language, e.g., visual captioning, image-text retrieval, referring expression comprehension, and visual question answering (VQA). Given an arbitrary image and a natural language question, the VQA task requires a fine-grained, question-guided understanding of the image content followed by complex reasoning to produce an answer. VQA can be regarded as a generalization of the other multimodal learning tasks; an effective VQA algorithm is thus a key step toward artificial general intelligence (AGI). Recent VQA methods have achieved human-level performance on the commonly used benchmarks. However, most existing VQA methods focus only on the visual objects in an image and neglect the textual content it contains. In many real-world scenarios, the text in an image carries essential information for scene understanding and reasoning, such as the number on a traffic sign or the brand of a product. Ignoring this textual information limits the applicability of VQA methods in practice, especially for visually-impaired users. Because of the importance of textual information for image interpretation, researchers have incorporated textual content into VQA and organized the scene text VQA task. In this task, the questions involve the textual content of the scene; the learned VQA model must establish unified associations among the question, the visual objects, and the scene text, and then reason over them to generate a correct answer. To address the scene text VQA task, the multimodal multi-copy mesh (M4C) model was proposed based on the Transformer architecture: it takes multimodal heterogeneous features as input, uses a multimodal Transformer to capture the interactions between the input features, and predicts the answer in an iterative manner. Despite its strengths, M4C has two weaknesses: 1) although each visual object and OCR object encodes its absolute spatial location, the relative spatial relationships between paired objects are not well represented, which makes accurate spatial reasoning difficult for the M4C model; 2) the predicted answer words are selected from either a dynamic OCR vocabulary or a fixed answer vocabulary, but M4C does not explicitly model the semantic relationships between these multi-source words, which makes it difficult to capture their potential semantic associations at the iterative answer prediction stage.
Method
To resolve the two weaknesses of M4C mentioned above, we improve the M4C baseline by introducing two additional types of knowledge, namely spatial relationships and semantic relationships, and present a knowledge-representation-enhanced M4C (KR-M4C) approach that integrates the two types of knowledge representations simultaneously. The spatial relationship knowledge encodes the relative spatial positions between each pair of objects (including visual objects and OCR objects) in terms of their bounding box coordinates. The semantic relationship knowledge encodes the semantic similarity between the text words and the predicted answer words, computed from their GloVe word embeddings. The two types of knowledge are encoded as unified knowledge representations. To integrate these representations adequately, the multi-head attention (MHA) module of M4C is modified into a knowledge-representation-enhanced MHA (KRMHA) module. By stacking KRMHA modules in depth, the KR-M4C model performs spatial and semantic reasoning that improves performance over the baseline M4C model.
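The abstract does not spell out the internal form of the KRMHA module. The sketch below shows one plausible reading under stated assumptions: pairwise relative spatial features are derived from bounding boxes with a hypothetical 4-D log-ratio encoding (in the spirit of relation networks), projected to a scalar per object pair, and added as a bias to the attention logits. All names and the exact encoding are illustrative, not the paper's implementation.

```python
import numpy as np

def relative_spatial_encoding(boxes):
    """Pairwise relative position features from [x1, y1, x2, y2] boxes.
    A hypothetical 4-D log-ratio encoding (center offsets and size ratios);
    the paper's actual encoding may differ. boxes: (N, 4) -> (N, N, 4)."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = np.maximum(boxes[:, 2] - boxes[:, 0], 1e-3)
    h = np.maximum(boxes[:, 3] - boxes[:, 1], 1e-3)
    dx = np.log(np.maximum(np.abs(cx[None, :] - cx[:, None]), 1e-3) / w[:, None])
    dy = np.log(np.maximum(np.abs(cy[None, :] - cy[:, None]), 1e-3) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def kr_attention(Q, K, V, knowledge_bias):
    """Scaled dot-product attention with an additive knowledge bias on the
    logits (one reading of knowledge-representation-enhanced attention).
    Q, K, V: (N, d); knowledge_bias: (N, N)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + knowledge_bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ V

# Toy usage: 3 objects; the bias comes from a (random stand-in for a learned)
# linear projection of the pairwise spatial features.
rng = np.random.default_rng(0)
boxes = np.array([[0, 0, 10, 10], [20, 0, 30, 10], [0, 20, 10, 30]], dtype=float)
rel = relative_spatial_encoding(boxes)   # (3, 3, 4)
w_proj = rng.normal(size=4)              # assumed learned projection weights
bias = rel @ w_proj                      # (3, 3) scalar bias per object pair
Q = K = V = rng.normal(size=(3, 8))
out = kr_attention(Q, K, V, bias)        # (3, 8)
```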
Result
The KR-M4C approach is verified by extensive experiments on two benchmark datasets, TextVQA (text VQA) and ST-VQA (scene text VQA), under the same experimental settings. The results are as follows: 1) without extra training data, KR-M4C obtains an accuracy improvement of 2.4% over the best existing method on the test set of TextVQA; 2) KR-M4C achieves an average normalized Levenshtein similarity (ANLS) score of 0.555 on the test set of ST-VQA, which is 5% higher than the result of SA-M4C. To further verify the synergistic effect of the two types of introduced knowledge, comprehensive ablation studies are carried out on TextVQA; the results support our hypothesis that the two types of knowledge are mutually beneficial to model performance. Finally, some visualized cases illustrate the effects of the two introduced knowledge representations: the spatial relationship knowledge improves the model's ability to localize key objects in the image, while the semantic relationship knowledge helps the model perceive contextual words during iterative answer decoding.
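The ANLS score reported above can be computed directly from its standard definition (Biten et al., 2019): each prediction is scored as one minus its normalized Levenshtein distance to the closest ground-truth answer, scores are truncated to zero when the normalized distance reaches the threshold τ = 0.5, and the result is averaged over questions. A minimal self-contained implementation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average normalized Levenshtein similarity.
    predictions: list of answer strings;
    ground_truths: list of lists of acceptable answers per question."""
    total = 0.0
    for pred, gts in zip(predictions, ground_truths):
        best = 0.0
        for gt in gts:
            nl = levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
            sim = 1.0 - nl if nl < tau else 0.0   # truncate noisy matches
            best = max(best, sim)
        total += best
    return total / len(predictions)

# Example: an exact match plus a one-character OCR slip ("c0ca" vs "coca").
score = anls(["stop", "c0ca cola"], [["stop"], ["coca cola"]])
```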
Conclusion
A novel KR-M4C method is introduced for the scene text VQA task. Benefiting from the two types of knowledge enhancement, KR-M4C achieves state-of-the-art performance on both the TextVQA and ST-VQA datasets.
Almazán J, Gordo A, Fornés A and Valveny E. 2014. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12): 2552-2566 [DOI: 10.1109/TPAMI.2014.2339814]
Anderson P, Fernando B, Johnson M and Gould S. 2016. SPICE: semantic propositional image caption evaluation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 382-398 [DOI: 10.1007/978-3-319-46454-1_24]
Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. VQA: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2425-2433 [DOI: 10.1109/ICCV.2015.279]
Ben-Younes H, Cadene R, Cord M and Thome N. 2017. MUTAN: multimodal tucker fusion for visual question answering//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2612-2620 [DOI: 10.1109/ICCV.2017.285]
Biten A F, Tito R, Mafla A, Gomez L, Rusiñol M, Jawahar C V, Valveny E and Karatzas D. 2019. Scene text visual question answering//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4291-4301 [DOI: 10.1109/ICCV.2019.00439]
Bojanowski P, Grave E, Joulin A and Mikolov T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5: 135-146 [DOI: 10.1162/tacl_a_00051]
Borisyuk F, Gordo A and Sivakumar V. 2018. Rosetta: large scale system for text detection and recognition in images//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London, UK: ACM: 71-79 [DOI: 10.1145/3219819.3219861]
Chen Y C, Li L J, Yu L C, El Kholy A, Ahmed F, Gan Z, Cheng Y and Liu J J. 2020. UNITER: universal image-text representation learning//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 104-120 [DOI: 10.1007/978-3-030-58577-8_7]
Cui Y H, Yu Z, Wang C Q, Zhao Z Z, Zhang J, Wang M and Yu J. 2021. ROSITA: enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration//Proceedings of the 29th ACM International Conference on Multimedia. Chengdu, China: ACM: 797-806 [DOI: 10.1145/3474085.3475251]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255.
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, USA: Association for Computational Linguistics: 4171-4186 [DOI: 10.18653/v1/N19-1423]
Fukui A, Park D H, Yang D, Rohrbach A, Darrell T and Rohrbach M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding//Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin, USA: Association for Computational Linguistics: 457-468 [DOI: 10.18653/v1/D16-1044]
Gao C Y, Zhu Q, Wang P, Li H, Liu Y L, Van Den Hengel A and Wu Q. 2021. Structured multimodal attentions for TextVQA [EB/OL]. [2021-11-26]. https://arxiv.org/pdf/2006.00753.pdf
Gao D F, Li K, Wang R P, Shan S G and Chen X L. 2020. Multi-modal graph neural network for joint reasoning on vision and scene text//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 12746-12756 [DOI: 10.1109/CVPR42600.2020.01276]
Gurari D, Li Q, Stangl A J, Guo A H, Lin C, Grauman K, Luo J B and Bigham J P. 2018. VizWiz grand challenge: answering visual questions from blind people//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3608-3617 [DOI: 10.1109/CVPR.2018.00380]
Han W, Huang H T and Han T. 2020. Finding the evidence: localization-aware answer prediction for text visual question answering//Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain: International Committee on Computational Linguistics: 3118-3131 [DOI: 10.18653/v1/2020.coling-main.278]
Hu H, Gu J Y, Zhang Z, Dai J F and Wei Y C. 2018. Relation networks for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3588-3597 [DOI: 10.1109/CVPR.2018.00378]
Hu R H, Singh A, Darrell T and Rohrbach M. 2020. Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9989-9999 [DOI: 10.1109/CVPR42600.2020.01001]
Jiang Y, Natarajan V, Chen X L, Rohrbach M, Batra D and Parikh D. 2018. Pythia v0.1: the winning entry to the VQA challenge 2018 [EB/OL]. [2021-11-28]. https://arxiv.org/pdf/1807.09956.pdf
Kant Y, Batra D, Anderson P, Schwing A, Parikh D, Lu J S and Agrawal H. 2020. Spatially aware multimodal transformers for TextVQA//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 715-732 [DOI: 10.1007/978-3-030-58545-7_41]
Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar V R, Lu S J, Shafait F, Uchida S and Valveny E. 2015. ICDAR 2015 competition on robust reading//Proceedings of the 13th International Conference on Document Analysis and Recognition. Tunis, Tunisia: IEEE: 1156-1160 [DOI: 10.1109/ICDAR.2015.7333942]
Karpathy A and Li F F. 2015. Deep visual-semantic alignments for generating image descriptions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3128-3137 [DOI: 10.1109/CVPR.2015.7298932]
Kim J H, Jun J and Zhang B T. 2018. Bilinear attention networks [EB/OL]. [2021-05-21]. https://arxiv.org/pdf/1805.07932.pdf
Kim J H, On K W, Lim W, Kim J, Ha J W and Zhang B T. 2017. Hadamard product for low-rank bilinear pooling//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview
Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S and Li F F. 2017. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32-73 [DOI: 10.1007/s11263-016-0981-7]
Kuznetsova A, Rom H, Alldrin N, Uijlings J, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Kolesnikov A, Duerig T and Ferrari V. 2020. The open images dataset v4. International Journal of Computer Vision, 128(7): 1956-1981 [DOI: 10.1007/s11263-020-01316-z]
Lee K H, Chen X, Hua G, Hu H D and He X D. 2018. Stacked cross attention for image-text matching//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 212-228 [DOI: 10.1007/978-3-030-01225-0_13]
Liu F, Xu G H, Wu Q, Du Q, Jia W and Tan M K. 2020. Cascade reasoning network for text-based visual question answering//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 4060-4069 [DOI: 10.1145/3394171.3413924]
Lu J S, Batra D, Parikh D and Lee S. 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [EB/OL]. [2021-08-06]. https://arxiv.org/pdf/1908.02265.pdf
Mishra A, Alahari K and Jawahar C V. 2013. Image retrieval using textual cues//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 3040-3047 [DOI: 10.1109/ICCV.2013.378]
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z M, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J J and Chintala S. 2019. PyTorch: an imperative style, high-performance deep learning library//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: NIPS: 8026-8037
Pennington J, Socher R and Manning C D. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL: 1532-1543 [DOI: 10.3115/v1/D14-1162]
Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS: 91-99
Singh A, Natarajan V, Shah M, Jiang Y, Chen X L, Batra D, Parikh D and Rohrbach M. 2019. Towards VQA models that can read//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8317-8326 [DOI: 10.1109/CVPR.2019.00851]
Tan H and Bansal M. 2019. LXMERT: learning cross-modality encoder representations from transformers//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics: 5100-5111 [DOI: 10.18653/v1/D19-1514]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: NIPS: 6000-6010
Veit A, Matera T, Neumann L, Matas J and Belongie S. 2016. COCO-Text: dataset and benchmark for text detection and recognition in natural images [EB/OL]. [2021-01-26]. https://arxiv.org/pdf/1601.07140.pdf
Xie S N, Girshick R, Dollár P, Tu Z W and He K M. 2017. Aggregated residual transformations for deep neural networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5987-5995 [DOI: 10.1109/CVPR.2017.634]
Yao T, Pan Y W, Li Y H and Mei T. 2018. Exploring visual relationship for image captioning//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 711-727 [DOI: 10.1007/978-3-030-01264-9_42]
Yu L C, Tan H, Bansal M and Berg T L. 2017a. A joint speaker-listener-reinforcer model for referring expressions//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3521-3529 [DOI: 10.1109/CVPR.2017.375]
Yu Z, Yu J, Cui Y H, Tao D C and Tian Q. 2019. Deep modular co-attention networks for visual question answering//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6274-6283 [DOI: 10.1109/CVPR.2019.00644]
Yu Z, Yu J, Fan J P and Tao D C. 2017b. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1839-1848 [DOI: 10.1109/ICCV.2017.202]
Zhou B L, Tian Y D, Sukhbaatar S, Szlam A and Fergus R. 2015. Simple baseline for visual question answering [EB/OL]. [2021-12-07]. https://arxiv.org/pdf/1512.02167.pdf