
Li Ying, Wu Qingfeng, Liu Jiatong, Zou Jialong (School of Informatics, Xiamen University, Xiamen 361005)

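Abstract
Objective Figure question answering is an important research task in multi-modal learning for computer vision. The simple pairwise pairing used by the traditional relation network (RN) model covers the relations among all pixels and therefore achieves good results, but it includes redundant information, and the number of relation-pair features grows quadratically, which places a heavy burden on the subsequent reasoning network in both computation and parameter count. To address this problem, we propose a leading weight-driven re-position relation network model based on fused semantic feature extraction. Method First, richer semantic information is extracted from statistical charts by fusing the low-level and high-level image features of the scene task, and a text encoder based on an attention mechanism is proposed, so that feature extraction fuses both kinds of semantics. The leading weights are then sorted to further reconstruct the positions in the image, which yields the re-positioned relation network model. Result Experiments were carried out on two datasets. On the FigureQA (an annotated figure dataset for visual reasoning) dataset, compared with IMG+QUES (image+questions), RN, and ARN (appearance and relation networks), the overall accuracy of the proposed method improves by 26.4%, 8.1%, and 0.46%, respectively; on a single validation set, compared with LEAF-Net (locate, encode and attend for figure network) and FigureNet, accuracy improves by 2.3% and 2.0%, respectively. On the DVQA (understanding data visualization via question answering) dataset, for the version without OCR (optical character recognition), compared with SANDY (san with dynamic encoding model), ARN, and RN, overall accuracy improves by 8.6%, 0.12%, and 2.13%, respectively; for the Oracle version, compared with SANDY, LEAF-Net, and RN, overall accuracy improves by 23.3%, 7.09%, and 4.8%, respectively. Conclusion The proposed algorithm targets the figure question answering task and improves accuracy on both open-source datasets, DVQA and FigureQA.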
Leading weight-driven re-position relation network for figure question answering

Li Ying, Wu Qingfeng, Liu Jiatong, Zou Jialong (School of Informatics, Xiamen University, Xiamen 361005, China)

Objective Figure question answering (Q&A) learns a basic representation of the information in statistical charts from real scenes and, together with the text of the question, provides the basis for reasoning about the answer. It is widely used in multi-modal learning tasks. Existing methods fall into two broad categories. 1) The task is modeled directly with a neural network framework. A convolutional neural network processes the statistical chart to obtain a feature map of the image, a recurrent neural network encodes the question text into a sentence-level embedding vector, and a fusion inference model produces the answer. In recent years, to capture an overall representation of the fused multi-modal features, attention mechanisms have become popular, with the obtained image feature matrix used as input to the text encoder. However, the interaction between relation features in the multi-modal scene has a large negative impact on the extraction of effective semantic features. 2) A multi-module framework decomposes the task into several steps. Different modules first extract feature information, which is passed as input to subsequent modules, and the final output is produced by the later algorithm modules. However, this type of method relies on additional annotation to train the individual modules, and its complexity is considerably higher. We therefore develop a weight-driven re-located relation network model based on fused semantic feature extraction. Method The framework of the weight-driven re-located relation network consists of three modules: image feature extraction, an attention-based long short-term memory (LSTM) network, and a joint leading weight-driven re-located relation network.
1) In the image feature extraction module, image features are extracted by fusing convolutional layers with up-sampling layers. To make the extracted features better suited to the scene task, we combine a convolutional neural network with the U-Net architecture to build a network that extracts the semantics of both low-level and high-level image features. 2) In the attention-based LSTM module, an attention mechanism is used to build a question-based feature representation for reasoning. An LSTM alone retains only the influence of earlier words on later words; with the attention mechanism, different contextual information can be captured, which gives a better vector representation of the sentence. 3) In the joint leading weight-driven re-located relation network module, we propose a pair-matching mechanism that guides how relation features are matched in the relation network. The inner product of each pixel's feature vector with the feature vectors of all pixels gives the similarity between that pixel and every other point, and averaging these similarities over the whole group gives the weight of that pixel. Although the sequence of relation-feature pairs obtained in this way resolves the high complexity, it ignores the overall relational balance that the original pairwise pairing provides. A re-location operation is therefore applied to balance the overall relations: (i) remove from the set of relation-feature pairs every pair in which a pixel is paired with itself; (ii) swap positions in each pixel's relation-feature list, following a fixed one-exchange iterative rule; and (iii) add the location information of the pixels and the sentence-level embedding.
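The similarity-and-averaging step of module 3 can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `leading_weights` and `guided_pairs`, the parameter `k` (how many partners each pixel keeps), and the top-`k` selection rule itself are hypothetical, since the abstract does not specify them. The swap rule of the re-location operation is not reproduced because the abstract does not describe it in enough detail.

```python
import numpy as np


def leading_weights(features):
    """Mean inner-product similarity of each pixel to all pixels.

    features: (n, d) array, one row per pixel of the flattened feature map.
    Returns the (n,) mean similarities and the full (n, n) similarity matrix.
    """
    similarity = features @ features.T          # inner product of every pair
    return similarity.mean(axis=1), similarity  # average over the whole group


def guided_pairs(features, k):
    """Assumed pairing rule: keep the k most similar partners per pixel.

    Self-pairs are excluded, mirroring the first re-location step.
    Returns the pixel order by descending weight and an (n, k) partner table.
    """
    n = features.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of pixels")
    weights, similarity = leading_weights(features)
    sim = similarity.astype(float)
    np.fill_diagonal(sim, -np.inf)              # drop each pixel's self-pair
    partners = np.argsort(-sim, axis=1, kind="stable")[:, :k]
    order = np.argsort(-weights, kind="stable")  # sort pixels by weight
    return order, partners
```

Under this assumed rule, the reasoning network receives n·k relation pairs rather than the n² pairs of plain pairwise matching, which is the complexity reduction the abstract attributes to guided pairing.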
Specifically, each relation feature is composed of three parts: a) the feature vectors of the two pixels, b) the coordinate values of the two pixels, and c) the embedding representation of the question text. Result The method is compared with six recent methods on two datasets. 1) On the FigureQA (an annotated figure dataset for visual reasoning) dataset, compared with IMG+QUES (image+questions), relation network (RN), and ARN (appearance and relation network), overall accuracy increases by 26.4%, 8.1%, and 0.46%, respectively. 2) On a single validation set, compared with LEAF-Net (locate, encode and attend for figure network) and FigureNet, accuracy increases by 2.3% and 2.0%, respectively. 3) On the DVQA (understanding data visualization via question answering) dataset, for the version without optical character recognition (OCR), overall accuracy increases by 8.6%, 0.12%, and 2.13% compared with SANDY (san with dynamic encoding model), ARN, and RN, respectively. 4) For the Oracle version, compared with SANDY, LEAF-Net, and RN, overall accuracy increases by 23.3%, 7.09%, and 4.8%, respectively. Conclusion Our model outperforms the baseline models on two large open-source datasets for statistical chart Q&A.
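The three-part relation feature described in the Method section can be illustrated with a short NumPy sketch. Plain concatenation is an assumption here, since the abstract only says the feature is composed of the three parts; the helper names `relation_feature` and `pixel_coordinates` and the use of (row, column) grid indices as coordinates are likewise hypothetical.

```python
import numpy as np


def pixel_coordinates(height, width):
    """(row, col) coordinates for each pixel of a flattened feature map."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)


def relation_feature(feat_i, feat_j, coord_i, coord_j, question_emb):
    """Assumed composition: concatenate the three parts into one vector.

    a) the feature vectors of the two pixels,
    b) the coordinate values of the two pixels,
    c) the sentence-level embedding of the question text.
    """
    return np.concatenate([feat_i, feat_j, coord_i, coord_j, question_emb])
```

With feature vectors of length d, two-dimensional coordinates, and a question embedding of length q, each relation feature built this way has length 2d + 4 + q.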