Knowledge-representation-enhanced multimodal Transformer for scene text visual question answering

Yu Zhou, Yu Jun, Zhu Junjie, Kuang Zhenzhong (Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China)

Abstract
Objective Existing visual question answering (VQA) methods usually attend only to the visual objects in an image and neglect the key textual content it contains, which limits the depth and precision of image understanding. Given the importance of the textual information embedded in images, researchers proposed the scene text VQA task to quantify a model's ability to understand scene text, and constructed the corresponding benchmark datasets TextVQA (text visual question answering) and ST-VQA (scene text visual question answering). This paper focuses on the scene text VQA task. To address the performance bottleneck caused by the overfitting risk of existing self-attention-based methods, we propose a knowledge-representation-enhanced multimodal Transformer for scene text VQA that effectively improves the robustness and accuracy of the model. Method We improve the existing baseline model M4C (multimodal multi-copy mesh) by modeling two complementary types of prior knowledge, namely the spatial relations between visual objects and the semantic relations between text words, and design a general knowledge-representation-enhanced attention module that encodes the two types of relations in a unified manner, yielding the knowledge-representation-enhanced KR-M4C (knowledge-representation-enhanced M4C) method. Result KR-M4C is compared with state-of-the-art methods on the TextVQA and ST-VQA scene text VQA benchmarks. On TextVQA, compared with the best competing result, our method improves test-set accuracy by 2.4% without additional training data and by 1.1% when ST-VQA is added as extra training data. On ST-VQA, it improves the average normalized Levenshtein similarity on the test set by 5% over the best competing result. Ablation experiments on TextVQA further verify the effectiveness of the two types of prior knowledge and show that the proposed KR-M4C model improves answer prediction accuracy. Conclusion The proposed KR-M4C method achieves significant performance gains on both the TextVQA and ST-VQA scene text VQA benchmarks and obtains the best results on this task.
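As a rough illustration of the idea behind a knowledge-representation-enhanced attention module, the sketch below shows one common way such relational priors can enter a Transformer layer: as an additive bias on the multi-head attention logits. This is a minimal, hypothetical PyTorch sketch, not the paper's actual KRMHA formulation; the class name KnowledgeBiasedAttention and the argument knowledge_bias are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact KRMHA): multi-head self-attention whose
# attention logits receive an additive bias derived from a pairwise "knowledge"
# tensor, e.g. spatial or semantic relations between input tokens.
import torch
import torch.nn as nn


class KnowledgeBiasedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, knowledge_bias: torch.Tensor) -> torch.Tensor:
        # x:              (batch, seq_len, dim)
        # knowledge_bias: (batch, num_heads, seq_len, seq_len), added to logits
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores + knowledge_bias  # inject the relational prior
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.proj(out)
```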
Keywords
Knowledge-representation-enhanced multimodal Transformer for scene text visual question answering

Yu Zhou, Yu Jun, Zhu Junjie, Kuang Zhenzhong(Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China)

Abstract
Objective Deep neural networks technology promotes the research and development of computer vision and natural language processing intensively.Multiple applications like human face recognition,optical character recognition (OCR),and machine translation have been widely used.Recent development enable the machine learning to deal with more complex multimodal learning tasks that involve vision and language modalities,e.g.,visual captioning,image-text retrieval,referring expression comprehension,and visual question answering (VQA).Given an arbitrary image and a natural language question,the VQA task is focused on an image-content-oriented and question-guided understanding of the fine-grained semantics and the following complex reasoning answer.The VQA task tends to be as the generalization of the rest of multimodal learning tasks.Thus,an effective VQA algorithms is a key step toward artificial general intelligence (AGI).Recent VQA have realized human-level performance on the commonly used benchmarks of VQA.However,most existing VQA methods are focused on visual objects of images only,while neglecting the recognition of textual content in the image.In many real-world scenarios,the image text can transmit essential information for scene understanding and reasoning,such as the number of a traffic sign or the brand awareness of a product.The ignorance of textual information is constraint of the applicability of the VQA methods in practice,especially for visually-impaired users.Due to the importance of the developing textual information for image interpretation,most researches intend to incorporate textual content into VQA for a scene text VQA task organizing.Specifically,the questions involve the textual contents in the scene text VQA task.The learned VQA model is required to establish unified associations among the question,visual object and the scene text.The reasoning is followed to generate a correct answer.To address the scene text VQA task,a model of multimodal multi-copy mesh (M4C) is faciliated based on the transformer architecture.Multimodal heterogeneous features are as input,a multimodal transformer is used to capture the interactions between input features,and the answers are predicted in an iterative manner.Despite of the strengths of M4C,it still has the two weaknesses as following:1) the relative spatial relationship cannot be illustrated well between paired objects although each visual object and OCR object encode its absolute spatial location.It is challenged to achieve accurate spatial reasoning for M4C model;2) the predicted words of answering are selected from either a dynamic OCR vocabulary or a fixed answer vocabulary.The semantic relationships are not explicitly considered in M4C between the multi-sources words.At the iterative answer prediction stage,it is challenged to understand the potential semantic associations between multiple sources derived words.Method To resolve the weaknesses of M4C mentioned above,we improve the reference of M4C model by introducing two added knowledge like the spatial relationship and semantic relationship,and a knowledge-representation-enhanced M4C (KR-M4C) approach is demonstrated to integrate the two types of knowledge representations simultaneously.Additionally,the spatial relationship knowledge encodes the relative spatial positions between each paired object (including the visual objects and OCR objects) in terms of their bounding box coordinates.The semantic relationship knowledge encodes the semantic similarity between the text words 
and the predicted answer words in accordance with the similarity calculated from their GloVe word embeddings.The two types of knowledge representation are encoded as unified knowledge representations.To match the knowledge representations adequately,the multi-head attention (MHA) module of M4C is modified to be a KRMHA module.By stacking the KRMHA modules in depth,the KR-M4C model performs spatial and semantic reasoning to improve the model performance over the reference M4C model.Result The KR-M4C approach is verified that our extended experiments are conducted on two benchmark datasets of text VQA (TextVQA) and scene text VQA (ST-VQA) based on same experimental settings.The demonstrated results are shown as below:1) excluded of extra training data,KR-M4C obtains an accuracy improvement of 2.4% over existing optimization on the test set of TextVQA;2) KR-M4C achieves an average normalized levenshtein similarity (ANLS) score of 0.555 on the test set of ST-VQA,which is 5% higher than theresult of SA-M4C.To verify the synergistic effect of two types of introduced knowledge further,comprehensive ablation studies are carried out on TextVQA,and the demonstrated results can support our hypothesis of those two types of knowledge are proactively and mutually to model performance.Finally,some visualized cases are provided to verify the effects of the two introduced knowledge representations.The spatial relationship knowledge improve the ability to localize key objects in the image,whilst the improved semantic relationship knowledge is perceived of the contextual words via the iterative answer decoding.Conclusion A novel KR-M4C method is introduced for the scene text VQA task.KR-M4C has its priority for the knowledge enhancement beyond the TextVQA and ST-VQA datasets.
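As a complement to the description above, the following minimal sketch illustrates how the two priors could be constructed: pairwise relative spatial features from bounding-box coordinates, and a pairwise cosine-similarity matrix over GloVe word embeddings. The function names and the exact feature parameterization are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of the two priors (assumed parameterization, not the paper's code).
import numpy as np


def relative_spatial_features(boxes: np.ndarray) -> np.ndarray:
    """boxes: (N, 4) array of [x1, y1, x2, y2] for visual/OCR objects.
    Returns an (N, N, 4) tensor of normalized center offsets and log size
    ratios, a common way to encode pairwise relative spatial relations."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = np.clip(boxes[:, 2] - boxes[:, 0], 1e-3, None)
    h = np.clip(boxes[:, 3] - boxes[:, 1], 1e-3, None)
    dx = (cx[None, :] - cx[:, None]) / w[:, None]
    dy = (cy[None, :] - cy[:, None]) / h[:, None]
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)


def semantic_similarity(glove_vecs: np.ndarray) -> np.ndarray:
    """glove_vecs: (M, d) GloVe embeddings of OCR/answer words.
    Returns the (M, M) cosine-similarity matrix used as a semantic prior."""
    norm = glove_vecs / np.linalg.norm(glove_vecs, axis=1, keepdims=True)
    return norm @ norm.T
```

Either matrix can then be projected or bucketized into the per-head bias expected by a knowledge-enhanced attention layer such as the one sketched earlier.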
Keywords
