Knowledge-representation-enhanced multimodal Transformer for scene text visual question answering
Vol. 27, Issue 9, Pages 2761-2774 (2022)
Published: 16 September 2022
Accepted: 08 June 2022
DOI: 10.11834/jig.211213
Zhou Yu, Jun Yu, Junjie Zhu, Zhenzhong Kuang. Knowledge-representation-enhanced multimodal Transformer for scene text visual question answering[J]. Journal of Image and Graphics, 27(9): 2761-2774 (2022)
Objective
Existing visual question answering (VQA) methods usually attend only to the visual objects in an image and neglect its key textual content, which limits the depth and precision of image understanding. Given the importance of the text embedded in an image for interpreting it, researchers proposed the scene text visual question answering task, which targets the understanding of scene text and quantifies a model's ability to read it, and built the corresponding benchmark datasets TextVQA (text visual question answering) and ST-VQA (scene text visual question answering). This paper focuses on the scene text VQA task. To address the performance bottleneck caused by the overfitting risk of existing self-attention-based methods, we propose a knowledge-representation-enhanced multimodal Transformer for scene text VQA that effectively improves the robustness and accuracy of the model.
Method
We improve the existing baseline model M4C (multimodal multi-copy mesh) by modeling two complementary types of prior knowledge: the spatial relationships between visual objects and the semantic relationships between text words. On this basis, a general knowledge-representation-enhanced attention module is designed to encode the two types of relationships in a unified manner, yielding the knowledge-representation-enhanced KR-M4C (knowledge-representation-enhanced M4C) method.
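As a rough illustration of the two kinds of prior knowledge described above, the sketch below computes a relative-geometry feature from two bounding boxes and a GloVe-based similarity between two words. The specific geometry features, the use of cosine similarity, and the helper names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def spatial_relation(box_i, box_j):
    """Relative spatial features between two boxes given as (x1, y1, x2, y2).

    Illustrative choice: log-scaled center offsets and size ratios, as commonly
    used in relation-aware attention; the paper may encode relative positions
    differently.
    """
    cx_i, cy_i = (box_i[0] + box_i[2]) / 2, (box_i[1] + box_i[3]) / 2
    cx_j, cy_j = (box_j[0] + box_j[2]) / 2, (box_j[1] + box_j[3]) / 2
    w_i, h_i = box_i[2] - box_i[0], box_i[3] - box_i[1]
    w_j, h_j = box_j[2] - box_j[0], box_j[3] - box_j[1]
    return np.array([
        np.log(abs(cx_j - cx_i) / w_i + 1e-6),
        np.log(abs(cy_j - cy_i) / h_i + 1e-6),
        np.log(w_j / w_i),
        np.log(h_j / h_i),
    ])

def semantic_relation(glove_a, glove_b):
    """Cosine similarity between the GloVe embeddings of two words."""
    return float(np.dot(glove_a, glove_b) /
                 (np.linalg.norm(glove_a) * np.linalg.norm(glove_b) + 1e-8))
```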
Result
The proposed KR-M4C method is compared with state-of-the-art methods on the TextVQA and ST-VQA scene text VQA benchmarks. On TextVQA, compared with the best result among the compared methods, KR-M4C improves test-set accuracy by 2.4% without extra training data and by 1.1% when the ST-VQA dataset is added as training data. On ST-VQA, it improves the average normalized Levenshtein similarity on the test set by 5% over the best compared result. Ablation experiments on TextVQA further verify the effectiveness of the two types of prior knowledge and show that the proposed KR-M4C model improves the accuracy of the predicted answers.
Conclusion
The proposed KR-M4C method achieves significant performance gains on both the TextVQA and ST-VQA scene text VQA benchmarks and obtains the best results on this task.
Objective
Deep neural networks have driven rapid progress in computer vision and natural language processing, and applications such as face recognition, optical character recognition (OCR), and machine translation are now widely deployed. Recent advances also enable machine learning to handle more complex multimodal tasks that involve both vision and language, e.g., visual captioning, image-text retrieval, referring expression comprehension, and visual question answering (VQA). Given an arbitrary image and a natural language question, the VQA task requires a question-guided, fine-grained understanding of the image content followed by complex reasoning to produce an answer. VQA can be regarded as a generalization of the other multimodal learning tasks, so an effective VQA algorithm is a key step toward artificial general intelligence (AGI). Recent VQA methods have reached human-level performance on the commonly used benchmarks. However, most existing methods attend only to the visual objects in an image and neglect the textual content it contains. In many real-world scenarios, the text in an image conveys essential information for scene understanding and reasoning, such as the number on a traffic sign or the brand of a product. Ignoring textual information therefore limits the applicability of VQA methods in practice, especially for visually impaired users. Given the importance of textual information for image interpretation, researchers have incorporated textual content into VQA and organized the scene text VQA task. In this task, the questions involve the textual content in the image, and the learned model must establish unified associations among the question, the visual objects, and the scene text, and then reason over them to generate the correct answer. To address the scene text VQA task, the multimodal multi-copy mesh (M4C) model was proposed on top of the Transformer architecture. It takes multimodal heterogeneous features as input, uses a multimodal Transformer to capture the interactions among them, and predicts answers in an iterative manner. Despite its strengths, M4C still has two weaknesses: 1) although each visual object and OCR token encodes its absolute spatial location, the relative spatial relationships between paired objects are not modeled well, which makes accurate spatial reasoning difficult; 2) the predicted answer words are selected from either a dynamic OCR vocabulary or a fixed answer vocabulary, but M4C does not explicitly model the semantic relationships between these words from different sources, which makes it hard to capture their potential semantic associations during iterative answer prediction.
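To make the second weakness concrete, the sketch below shows the general idea behind M4C-style iterative answer prediction, where each decoding step picks the next word either from a fixed answer vocabulary or from the image's dynamic OCR tokens via a pointer-like score. The tensor shapes and function names are illustrative assumptions, not M4C's actual implementation.

```python
import torch

def decode_step(dec_state, fixed_vocab_proj, ocr_features):
    """One iterative decoding step in the spirit of M4C.

    Scores over a fixed answer vocabulary come from a linear classifier,
    and scores over the image's OCR tokens come from a dot product with
    their features (a pointer-style copy mechanism); the word with the
    highest combined score is emitted.
    """
    vocab_scores = fixed_vocab_proj(dec_state)          # (V,)
    ocr_scores = ocr_features @ dec_state               # (N_ocr,)
    all_scores = torch.cat([vocab_scores, ocr_scores])  # (V + N_ocr,)
    return all_scores.argmax()                          # index of the next word
```

Indices at or beyond the fixed-vocabulary size point into the dynamic OCR list, which is how copy-style models can output scene text that never appears in the fixed vocabulary.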
Method
To resolve the weaknesses of M4C mentioned above, we improve the baseline M4C model by introducing two additional types of knowledge, namely spatial relationships and semantic relationships, and propose a knowledge-representation-enhanced M4C (KR-M4C) approach that integrates the two types of knowledge representations simultaneously. The spatial relationship knowledge encodes the relative spatial position of each object pair (covering both visual objects and OCR objects) in terms of their bounding-box coordinates. The semantic relationship knowledge encodes the semantic similarity between the OCR text words and the predicted answer words, computed from their GloVe word embeddings. The two types of knowledge are encoded into a unified knowledge representation. To incorporate this representation adequately, the multi-head attention (MHA) module of M4C is modified into a knowledge-representation-enhanced MHA (KRMHA) module. By stacking KRMHA modules in depth, KR-M4C performs spatial and semantic reasoning and improves performance over the baseline M4C model.
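The following is a minimal PyTorch sketch of a knowledge-enhanced attention layer, assuming the unified knowledge representation is injected as an additive bias on the scaled dot-product logits, a common formulation in relation-aware Transformers. The class name and tensor layout are illustrative; the full paper defines the actual KRMHA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeBiasedAttention(nn.Module):
    """Multi-head attention with an additive knowledge bias.

    `knowledge_bias` holds a per-pair score derived from the spatial and
    semantic relations; adding it to the dot-product logits is one common
    way to make attention relation-aware. This is a sketch of the idea,
    not the paper's exact KRMHA module.
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, knowledge_bias):
        # x: (B, N, dim); knowledge_bias: (B, num_heads, N, N)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        attn = F.softmax(logits + knowledge_bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

Stacking such layers lets every attention head weigh object pairs by both their learned feature affinity and the injected spatial/semantic prior.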
Result
The KR-M4C approach is verified by extensive experiments on two benchmark datasets, TextVQA (text VQA) and ST-VQA (scene text VQA), under the same experimental settings. The results are as follows: 1) without extra training data, KR-M4C improves accuracy by 2.4% over the best existing method on the TextVQA test set; 2) KR-M4C achieves an average normalized Levenshtein similarity (ANLS) score of 0.555 on the ST-VQA test set, which is 5% higher than the result of SA-M4C. To further verify the synergistic effect of the two types of introduced knowledge, comprehensive ablation studies are carried out on TextVQA, and the results support our hypothesis that the two types of knowledge contribute to model performance both individually and jointly. Finally, visualized cases illustrate the effects of the two knowledge representations: the spatial relationship knowledge improves the ability to localize key objects in the image, while the semantic relationship knowledge helps the model perceive contextual words during iterative answer decoding.
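For reference, the ANLS score reported above follows the standard ST-VQA evaluation rule: the best similarity against any ground-truth answer is kept per question and zeroed out when the normalized edit distance exceeds 0.5. The helper below is a plain-Python sketch of that metric; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average normalized Levenshtein similarity over all questions.

    Each prediction is scored against every reference answer for its
    question; the best similarity is kept and set to zero when the
    normalized distance exceeds tau (0.5 in the standard protocol).
    """
    total = 0.0
    for pred, refs in zip(predictions, ground_truths):
        best = 0.0
        for ref in refs:
            nl = levenshtein(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```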
Conclusion
A novel KR-M4C method is introduced for the scene text VQA task. Thanks to the introduced knowledge enhancement, KR-M4C delivers significant improvements and achieves state-of-the-art results on both the TextVQA and ST-VQA benchmarks.
Keywords: scene text visual question answering; knowledge representation; attention mechanism; Transformer; multimodal fusion
Almazán J, Gordo A, Fornés A and Valveny E. 2014. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12): 2552-2566 [DOI: 10.1109/TPAMI.2014.2339814]
Anderson P, Fernando B, Johnson M and Gould S. 2016. SPICE: semantic propositional image caption evaluation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 382-398 [DOI: 10.1007/978-3-319-46454-1_24]
Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. VQA: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2425-2433 [DOI: 10.1109/ICCV.2015.279]
Ben-Younes H, Cadene R, Cord M and Thome N. 2017. MUTAN: multimodal tucker fusion for visual question answering//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2612-2620 [DOI: 10.1109/ICCV.2017.285]
Biten A F, Tito R, Mafla A, Gomez L, Rusiñol M, Jawahar C V, Valveny E and Karatzas D. 2019. Scene text visual question answering//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4291-4301 [DOI: 10.1109/ICCV.2019.00439]
Bojanowski P, Grave E, Joulin A and Mikolov T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5: 135-146 [DOI: 10.1162/tacl_a_00051]
Borisyuk F, Gordo A and Sivakumar V. 2018. Rosetta: large scale system for text detection and recognition in images//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London, UK: ACM: 71-79 [DOI: 10.1145/3219819.3219861]
Chen Y C, Li L J, Yu L C, El Kholy A, Ahmed F, Gan Z, Cheng Y and Liu J J. 2020. UNITER: universal image-text representation learning//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 104-120 [DOI: 10.1007/978-3-030-58577-8_7]
Cui Y H, Yu Z, Wang C Q, Zhao Z Z, Zhang J, Wang M and Yu J. 2021. ROSITA: enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration//Proceedings of the 29th ACM International Conference on Multimedia. Chengdu, China: ACM: 797-806 [DOI: 10.1145/3474085.3475251]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255.
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, USA: Association for Computational Linguistics: 4171-4186 [DOI: 10.18653/v1/N19-1423]
Fukui A, Park D H, Yang D, Rohrbach A, Darrell T and Rohrbach M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding//Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin, USA: Association for Computational Linguistics: 457-468 [DOI: 10.18653/v1/D16-1044]
Gao C Y, Zhu Q, Wang P, Li H, Liu Y L, Van Den Hengel A and Wu Q. 2021. Structured multimodal attentions for TextVQA [EB/OL]. [2021-11-26]. https://arxiv.org/pdf/2006.00753.pdf
Gao D F, Li K, Wang R P, Shan S G and Chen X L. 2020. Multi-modal graph neural network for joint reasoning on vision and scene text//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 12746-12756 [DOI: 10.1109/CVPR42600.2020.01276]
Gurari D, Li Q, Stangl A J, Guo A H, Lin C, Grauman K, Luo J B and Bigham J P. 2018. VizWiz grand challenge: answering visual questions from blind people//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3608-3617 [DOI: 10.1109/CVPR.2018.00380]
Han W, Huang H T and Han T. 2020. Finding the evidence: localization-aware answer prediction for text visual question answering//Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain: International Committee on Computational Linguistics: 3118-3131 [DOI: 10.18653/v1/2020.coling-main.278]
Hu H, Gu J Y, Zhang Z, Dai J F and Wei Y C. 2018. Relation networks for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3588-3597 [DOI: 10.1109/CVPR.2018.00378]
Hu R H, Singh A, Darrell T and Rohrbach M. 2020. Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9989-9999 [DOI: 10.1109/CVPR42600.2020.01001]
Jiang Y, Natarajan V, Chen X L, Rohrbach M, Batra D and Parikh D. 2018. Pythia v0.1: the winning entry to the VQA challenge 2018 [EB/OL]. [2021-11-28]. https://arxiv.org/pdf/1807.09956.pdf
Kant Y, Batra D, Anderson P, Schwing A, Parikh D, Lu J S and Agrawal H. 2020. Spatially aware multimodal transformers for TextVQA//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 715-732 [DOI: 10.1007/978-3-030-58545-7_41]
Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar V R, Lu S J, Shafait F, Uchida S and Valveny E. 2015. ICDAR 2015 competition on robust reading//Proceedings of the 13th International Conference on Document Analysis and Recognition. Tunis, Tunisia: IEEE: 1156-1160 [DOI: 10.1109/ICDAR.2015.7333942]
Karpathy A and Li F F. 2015. Deep visual-semantic alignments for generating image descriptions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3128-3137 [DOI: 10.1109/CVPR.2015.7298932]
Kim J H, Jun J and Zhang B T. 2018. Bilinear attention networks [EB/OL]. [2021-05-21]. https://arxiv.org/pdf/1805.07932.pdf
Kim J H, On K W, Lim W, Kim J, Ha J W and Zhang B T. 2017. Hadamard product for low-rank bilinear pooling//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview
Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S and Li F F. 2017. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32-73 [DOI: 10.1007/s11263-016-0981-7]
Kuznetsova A, Rom H, Alldrin N, Uijlings J, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Kolesnikov A, Duerig T and Ferrari V. 2020. The open images dataset v4. International Journal of Computer Vision, 128(7): 1956-1981 [DOI: 10.1007/s11263-020-01316-z]
Lee K H, Chen X, Hua G, Hu H D and He X D. 2018. Stacked cross attention for image-text matching//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 212-228 [DOI: 10.1007/978-3-030-01225-0_13]
Liu F, Xu G H, Wu Q, Du Q, Jia W and Tan M K. 2020. Cascade reasoning network for text-based visual question answering//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 4060-4069 [DOI: 10.1145/3394171.3413924]
Lu J S, Batra D, Parikh D and Lee S. 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [EB/OL]. [2021-08-06]. https://arxiv.org/pdf/1908.02265.pdf
Mishra A, Alahari K and Jawahar C V. 2013. Image retrieval using textual cues//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 3040-3047 [DOI: 10.1109/ICCV.2013.378]
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z M, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J J and Chintala S. 2019. PyTorch: an imperative style, high-performance deep learning library//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: NIPS: 8026-8037
Pennington J, Socher R and Manning C D. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL: 1532-1543 [DOI: 10.3115/v1/D14-1162]
Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS: 91-99
Singh A, Natarajan V, Shah M, Jiang Y, Chen X L, Batra D, Parikh D and Rohrbach M. 2019. Towards VQA models that can read//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8317-8326 [DOI: 10.1109/CVPR.2019.00851]
Tan H and Bansal M. 2019. LXMERT: learning cross-modality encoder representations from transformers//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics: 5100-5111 [DOI: 10.18653/v1/D19-1514]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: NIPS: 6000-6010
Veit A, Matera T, Neumann L, Matas J and Belongie S. 2016. COCO-Text: dataset and benchmark for text detection and recognition in natural images [EB/OL]. [2021-01-26]. https://arxiv.org/pdf/1601.07140.pdf
Xie S N, Girshick R, Dollár P, Tu Z W and He K M. 2017. Aggregated residual transformations for deep neural networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5987-5995 [DOI: 10.1109/CVPR.2017.634]
Yao T, Pan Y W, Li Y H and Mei T. 2018. Exploring visual relationship for image captioning//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 711-727 [DOI: 10.1007/978-3-030-01264-9_42]
Yu L C, Tan H, Bansal M and Berg T L. 2017a. A joint speaker-listener-reinforcer model for referring expressions//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3521-3529 [DOI: 10.1109/CVPR.2017.375]
Yu Z, Yu J, Cui Y H, Tao D C and Tian Q. 2019. Deep modular co-attention networks for visual question answering//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6274-6283 [DOI: 10.1109/CVPR.2019.00644]
Yu Z, Yu J, Fan J P and Tao D C. 2017b. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1839-1848 [DOI: 10.1109/ICCV.2017.202]
Zhou B L, Tian Y D, Sukhbaatar S, Szlam A and Fergus R. 2015. Simple baseline for visual question answering [EB/OL]. [2021-12-07]. https://arxiv.org/pdf/1512.02167.pdf