A visual question answering model combining a bottom-up attention mechanism and a memory network

Yan Ruyu, Liu Xueliang (School of Computer and Information, Hefei University of Technology, Hefei 230601, China)

Abstract
Objective Most existing visual question answering (VQA) models adopt a top-down visual attention mechanism that treats image content uniformly, without content-aware weighting, and therefore cannot represent image information well. Moreover, lacking a long-term memory module, they cannot store information over long periods, so effective information is lost while reasoning about the answer and wrong answers are predicted. To address this, we propose a VQA model that combines a bottom-up attention mechanism with a memory network, improving VQA accuracy by enhancing the representation and memory of image content.

Method An object detection model is pre-trained to extract the objects and salient regions of an image as image features, which are fed together with the question representation into a memory network. The memory network retrieves the useful information in the input image features according to the question, and iterates and updates several times over the input image information and the question representation to generate the final information representation. Finally, the information memorized by the memory network is fused with the question representation to infer the correct answer.

Result Comparison and ablation experiments against existing mainstream algorithms on the large-scale public dataset VQA (visual question answering) v2.0 show that the proposed model significantly improves accuracy on the VQA task, with an overall accuracy of 64.0%. Compared with the MCB (multimodal compact bilinear) algorithm, the overall accuracy improves by 1.7%; compared with the well-performing VQA machine algorithm, the overall accuracy improves by 1%, and the accuracy on yes/no, counting, and other question types improves by 1.1%, 3.4%, and 0.6%, respectively. The overall performance is superior to that of the other compared algorithms, verifying the effectiveness of the proposed method.

Conclusion The proposed VQA model, which combines a bottom-up attention mechanism and a memory network, better matches the human visual attention mechanism, reduces information loss while reasoning about the answer, and effectively improves the accuracy of VQA.
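The retrieve-and-update loop described in the Method above can be made concrete with a minimal PyTorch sketch. The module names (`EpisodicMemory`, `VQAModel`), layer sizes, number of hops, and the GRU-cell memory update are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """Hypothetical episodic memory module: attends over region features
    conditioned on the question and the current memory, then updates the memory."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(3 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.update = nn.GRUCell(dim, dim)  # one possible memory update; assumed here

    def forward(self, regions, question, memory):
        # regions: (B, K, D) bottom-up region features; question, memory: (B, D)
        K = regions.size(1)
        q = question.unsqueeze(1).expand(-1, K, -1)
        m = memory.unsqueeze(1).expand(-1, K, -1)
        scores = self.attn(torch.cat([regions, q, m], dim=-1)).squeeze(-1)  # (B, K)
        weights = F.softmax(scores, dim=-1)                 # attention over regions
        episode = (weights.unsqueeze(-1) * regions).sum(1)  # (B, D) weighted summary
        return self.update(episode, memory)                 # new memory state

class VQAModel(nn.Module):
    def __init__(self, dim=512, num_answers=3129, hops=3):
        super().__init__()
        self.memory = EpisodicMemory(dim)
        self.hops = hops
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers))

    def forward(self, regions, question):
        memory = question  # initialize the memory with the question embedding
        for _ in range(self.hops):  # several retrieve-and-update iterations
            memory = self.memory(regions, question, memory)
        fused = torch.cat([question, memory], dim=-1)  # fuse memory and question
        return torch.sigmoid(self.classifier(fused))   # multi-label answer scores
```

For example, `VQAModel()(torch.randn(2, 36, 512), torch.randn(2, 512))` returns a `(2, 3129)` tensor of answer scores; 36 regions per image and 3 129 candidate answers are common choices in the VQA literature and are assumed here, not taken from the paper.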
Keywords
Visual question answering model based on bottom-up attention and memory network

Yan Ruyu, Liu Xueliang(School of Computer and Information, Hefei University of Technology, Hefei 230601, China)

Abstract
Objective Visual question answering (VQA) lies at the intersection of computer vision and natural language processing and is one of the key research directions in artificial intelligence. VQA is an important task for artificial intelligence research because of its multimodal nature, clear evaluation protocol, and potential real-world applications. The prevailing approach to VQA is based on three components. First, question answering is posed as classification over a set of candidate answers; questions in current VQA datasets are mostly visual in nature, and the correct answers are drawn from a small set of key words or phrases. Second, most VQA models are based on a deep neural network that implements a joint embedding of image and question features: convolutional neural networks (CNNs) and recurrent neural networks (RNNs) map the two inputs into fixed-size vector representations. Third, owing to the success of deep learning on supervised learning problems, the whole network is trained end-to-end from questions, images, and their ground-truth answers. However, using global image features as the visual input introduces noise at the answer-prediction stage. Inspired by human visual attention and the development of related research, visual attention mechanisms have been widely used in VQA to alleviate this problem. Most conventional visual attention mechanisms in VQA models are of the top-down variety: they are trained to attend selectively to the output of one or more CNN layers and to predict a weight for each region of a uniform grid. This approach disregards the actual content of the image, so image information cannot be represented more faithfully. Human attention, by contrast, focuses on objects and other salient image regions, which therefore form a much more natural basis for attention when the goal is to generate human-like answers. In addition, owing to the lack of a long-term memory module, information is lost while reasoning about the answer, so wrong answers can be inferred, which degrades VQA performance. To solve these problems, we propose a VQA model based on bottom-up attention and a memory network, which improves the accuracy of VQA by enhancing the representation and memory of image content.

Method We use image features from bottom-up attention to provide region-specific features rather than the traditional grid-like CNN feature maps. We implement bottom-up attention with Faster R-CNN (region-based CNN) in conjunction with the ResNet-101 CNN, a natural expression of a bottom-up attention mechanism. To pre-train the bottom-up attention model, we first initialize Faster R-CNN with ResNet-101 pre-trained for classification on ImageNet and then train it on Visual Genome data. To aid the learning of good feature representations, we introduce an additional training output that predicts attribute classes in addition to object classes. To improve computational efficiency, questions are trimmed to a maximum of 14 words: extra words are discarded, and questions shorter than 14 words are end-padded with zero vectors. We use a bidirectional gated recurrent unit (GRU) to encode the question and take its final state as the question embedding. The purpose of the memory network is to retrieve the information needed to answer the question from the input image facts and to memorize it over the long term.
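As a minimal sketch of the question encoding just described (trim to 14 words, end-pad with zero vectors, encode with a bidirectional GRU), the snippet below assumes 300-d word vectors and a 512-d hidden state; both dimensions, and the concatenation of the two directions' final states, are illustrative choices not fixed by the text:

```python
import torch
import torch.nn as nn

MAX_WORDS, EMB_DIM, HID_DIM = 14, 300, 512  # 14 words per the text; dims assumed

def pad_or_trim(word_vectors: torch.Tensor) -> torch.Tensor:
    """Trim a (num_words, EMB_DIM) question to 14 words, or end-pad with zeros."""
    n = word_vectors.size(0)
    if n >= MAX_WORDS:
        return word_vectors[:MAX_WORDS]
    pad = torch.zeros(MAX_WORDS - n, EMB_DIM)
    return torch.cat([word_vectors, pad], dim=0)

gru = nn.GRU(EMB_DIM, HID_DIM, batch_first=True, bidirectional=True)

def encode_question(batch_word_vectors):
    # batch_word_vectors: list of (num_words, EMB_DIM) word-embedding tensors
    x = torch.stack([pad_or_trim(v) for v in batch_word_vectors])  # (B, 14, EMB_DIM)
    _, h_n = gru(x)  # h_n: (2, B, HID_DIM), one final state per direction
    # Concatenate the forward and backward final states as the question embedding
    return torch.cat([h_n[0], h_n[1]], dim=-1)  # (B, 2 * HID_DIM)
```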
To improve understanding of the question and image, especially when the question requires transitive reasoning, the memory network may need to pass over the input several times and update the memory after each pass. The memory network is composed of two parts: an attention mechanism module and an episodic memory update module. Each iteration computes weights over the input vectors through the attention mechanism to generate a new episode and then updates the memory through the memory update module. The image features and question representation are fed into the memory network to obtain the final episodic memory. Finally, the representations of the question and the final episodic memory are passed through nonlinear fully connected layers, combined by simple concatenation, and fed into the output classifier to deduce the correct answer. Unlike most VQA models, which use a softmax classifier, we treat VQA as a multi-label classification task and use a sigmoid activation function to predict a score for each of the N candidate answers. The sigmoid allows multiple correct answers per question to be optimized and normalizes the final scores to (0, 1). The final stage can be regarded as a logistic regression that predicts the correctness of each candidate answer.

Result The main performance metric is the standard VQA accuracy, i.e., the average ground-truth score of the predicted answers over all questions, which accounts for the occasional divergence of ground-truth answers among annotators. The proposed VQA model is evaluated on the VQA v2.0 dataset; our VQA test-server submissions are trained on the training and validation sets. Results show an overall accuracy of 64.0%, with 80.9% on yes/no questions, 44.3% on counting questions, and 54.0% on other question types. Several existing mainstream algorithms are compared with our model on the VQA v2.0 dataset, and our model achieves higher accuracy both overall and on the different question types. Ablation experiments on VQA v2.0 show that combining the bottom-up attention mechanism with the memory network outperforms using either module alone, proving the effectiveness of the proposed algorithm.

Conclusion In this study, we propose a VQA model that combines a bottom-up attention mechanism and a memory network, providing a new way of thinking about VQA. Our model synthesizes the advantages of bottom-up attention and memory networks and is more in line with human visual attention. Moreover, it can remember effective information over the long term and reduces information loss while reasoning about answers, thus effectively improving the accuracy of VQA.
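For reference, the standard VQA accuracy used in the Result section above is commonly computed, in its simplified per-answer form, as min(#annotators agreeing / 3, 1), averaged over all questions; each question in VQA v2.0 has ten human answers. A small sketch of this simplified form:

```python
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified standard VQA accuracy: an answer counts as fully correct
    if at least 3 of the (typically 10) annotators gave it."""
    votes = Counter(a.lower().strip() for a in human_answers)
    return min(votes[predicted.lower().strip()] / 3.0, 1.0)

# Example: 4 of 10 annotators answered "2", so predicting "2" scores 1.0
print(vqa_accuracy("2", ["2"] * 4 + ["3"] * 3 + ["4"] * 3))  # -> 1.0
```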
Keywords
