Objective Existing visual question answering (VQA) models suffer from language priors, which limit their prediction accuracy. Although a model can learn simple correspondences between questions and answers from the statistical regularities of the dataset, it fails to learn the deeper correspondence between questions and answer types, and therefore often produces answers irrelevant to the question. To address this, we propose a method that uses answer masks to cover the irrelevant answers in the prediction results, forcing the model to attend to the correspondence between questions and answer types and improving its prediction accuracy. Method First, the answers in the dataset are clustered and a distinct answer mask is generated for each answer cluster. Then a pre-trained answer type recognition model predicts the answer type of each question, and the corresponding answer mask is selected according to this prediction to cover the prediction results of the baseline model, yielding the final answer. Result The proposed method is evaluated with four baseline models, UpDn (bottom-up and top-down), RUBi (reducing unimodal biases), LMH (learned-mixin +h), and CSS (counterfactual samples synthesizing), on three large public datasets. On the VQA-CP v2.0 dataset, our method improves the accuracy of UpDn by 2.15% and of LMH by 2.29%; the CSS model combined with our method reaches 60.14% accuracy, 2.02% higher than the original model and among the best results to date. Results on VQA v2.0 and VQA-CP v1.0 also show that our method improves the accuracy of most models, demonstrating good generalization. In addition, ablation experiments on VQA-CP v2.0 verify the effectiveness of the method. Conclusion The proposed method covers the prediction results of VQA models with answer masks, reducing the influence of irrelevant answers on the final result and enabling the model to learn the correspondence between questions and answer types. It effectively alleviates the phenomenon of answers being irrelevant to the question and improves the prediction accuracy of the model.
Answer mask-fused visual question answering model
Objective Visual question answering (VQA) has become an essential task in artificial intelligence in recent years. VQA sits at the intersection of natural language processing and computer vision: a VQA model must process text and image information simultaneously and fuse the two modalities to infer the answer. Popular VQA models have advanced considerably through deep neural networks and large training sets such as the VQA v2.0 dataset. However, under the influence of language priors, such models tend to learn only the surface relationship between questions and answers from dataset statistics. Because of the uneven distribution of answers, they generalize poorly and perform badly on the VQA-CP v2.0 dataset. Specifically, language priors cause prediction errors in which the predicted answer is irrelevant to the question. To address this irrelevance and improve generalization, we develop an answer-mask method that covers the irrelevant answers in the prediction results, forcing the model to learn the deeper relationship between questions and answer types and thereby improving its prediction accuracy. Method In our method, the prediction results of the baseline model are masked by an answer mask. All candidate answers are first clustered so that each answer type contains only a few related answers; the mask of a type then covers the many irrelevant answers in the prediction results while preserving accurate classification. Since the answers consist of non-contextual words and phrases, conventional encoders such as Word2Vec and GloVe are ineffective for encoding them. We therefore use CLIP as the encoder to extract answer features, and apply the k-means algorithm to cluster the extracted feature vectors.
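The clustering step above can be sketched as follows. This is a minimal illustration only: random vectors stand in for the CLIP text features of the candidate answers, and the number of clusters (`n_types = 10`) is an assumed value, not one taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for CLIP text features: in the paper's pipeline, each candidate
# answer string is encoded by the pre-trained CLIP text encoder. Random
# 512-d vectors are used here as placeholders so the sketch runs on its own.
rng = np.random.default_rng(0)
answer_features = rng.normal(size=(200, 512))  # one feature vector per candidate answer

n_types = 10  # assumed number of answer clusters, not a value from the paper
kmeans = KMeans(n_clusters=n_types, n_init=10, random_state=0)
answer_type = kmeans.fit_predict(answer_features)  # cluster (answer-type) id per answer
```

Each cluster id then serves as the answer type under which a mask vector is generated.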
After clustering, the original dataset is relabeled so that each answer carries its cluster type, and a different answer mask vector is generated for each answer type. An answer mask vector consists of 0s and 1s: the elements at the positions of the answers belonging to that type are set to 1 and all others to 0, so that the influence of irrelevant answers on the final results of the baseline model is eliminated. We also design an answer type recognition model, pre-trained on questions and answer types, which predicts the answer type corresponding to an input question. The accuracy of this model reflects the quality of the clustering, and its predictions determine which answer mask is selected. The baseline model encodes the image and the text, fuses the image and text features through a deep neural network, and obtains preliminary prediction results through a classifier. First, the corresponding answer mask vector is selected according to the prediction of the answer type recognition model. Then the mask is multiplied element-wise with the prediction results of the baseline model, covering the distribution of irrelevant answers in those results. Finally, the masked results yield the predicted answer. In this way, the model is trained to learn the correspondence between questions and answer types. Result We select UpDn, RUBi, LMH, and CSS as baseline models and carry out experiments on three large public datasets. Experiments on the VQA-CP v2.0 dataset demonstrate the method's potential: the accuracies of the UpDn, LMH, and CSS models are improved by 2.15%, 2.29%, and 2.02%, respectively, with the CSS model reaching the highest accuracy of 60.14%. In addition, our method largely preserves accuracy on VQA v2.0, where debiasing methods usually lose accuracy. The experimental results on VQA v2.0 show that the accuracy of most baseline models is further improved; among them, the accuracy of the CSS model increases by 3.18%. To demonstrate the generalization of our method, comparative experiments are further carried out on the VQA-CP v1.0 dataset. The results show that our method benefits most of the baseline models, which reflects its generalization ability. Furthermore, an ablation experiment on VQA-CP v2.0 shows that the answer mask further improves accuracy. Conclusion We develop an answer-mask method to cover irrelevant answers in the model's prediction results, alleviating their influence on the final answer. The model is made to learn the correspondence between questions and answer types, the problem of the model predicting answers irrelevant to the question is resolved to a certain extent, and the model's generalization and accuracy are improved.
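The mask construction and element-wise masking described in Method can be sketched as follows. This is a minimal NumPy illustration with toy numbers: the helper name `build_answer_masks`, the six candidate answers, the cluster assignment, and the baseline scores are our assumptions, not values from the paper.

```python
import numpy as np

def build_answer_masks(answer_type, n_types):
    """One 0/1 mask per answer type: 1 at positions of candidate answers
    belonging to that type, 0 everywhere else."""
    n_answers = len(answer_type)
    masks = np.zeros((n_types, n_answers), dtype=np.float32)
    masks[answer_type, np.arange(n_answers)] = 1.0
    return masks

# Toy example: six candidate answers grouped into three answer types.
answer_type = np.array([0, 0, 1, 2, 1, 2])
masks = build_answer_masks(answer_type, n_types=3)

# Baseline model's answer distribution for one question (e.g. softmax scores).
baseline_scores = np.array([0.05, 0.10, 0.40, 0.05, 0.30, 0.10])
predicted_type = 1  # output of the answer type recognition model

# Element-wise multiplication zeroes out answers of irrelevant types,
# so the argmax is taken only over answers of the predicted type.
masked_scores = baseline_scores * masks[predicted_type]
final_answer = int(np.argmax(masked_scores))  # index 2
```

The element-wise product keeps the baseline model's relative scores within the predicted answer type while removing all mass on other types, which is exactly the "covering" effect the abstract describes.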