Eye fixation prediction fusing multiple attention mechanisms

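Kong Li1, Hu Xuemin1, Wang Ding1, Liu Yanfang1, Zhang Yan1, Chen Long2 (1. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062; 2. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006)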

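Abstract
Objective Classical eye fixation prediction models usually fuse high-level and low-level features through skip connections, which makes it hard to weigh the importance of features from different levels, and they do not account for the tendency of the human eye to favor the central region when viewing an image. To address this, we propose an image feature extraction method that incorporates attention mechanisms and refine the extracted features with a Gaussian learning module, improving the accuracy of eye fixation prediction. Method We propose a new eye fixation prediction model based on a multiple attention mechanism (MAM). It combines three different attention mechanisms to weight, at the spatial, channel, and layer levels respectively, the features extracted by a ResNet-50 model augmented with dilated convolution. The network consists mainly of a feature extraction module, a multiple attention module, and a Gaussian learning optimization module. Dilated convolution effectively captures information from receptive fields of different sizes while keeping the resolution of the feature maps unchanged. The multiple attention module is designed to automatically optimize the detail-rich low-level information and the global semantic high-level information, to fully extract channel and spatial information from the feature maps, and to prevent over-reliance on the model's high-level features. The Gaussian learning module automatically selects a suitable Gaussian blur kernel to blur the saliency map, addressing the center bias of human viewing. Result Experiments on the public SALICON (saliency in context) dataset show that, compared with the similarly structured SAM-Res (saliency attentive model) and DINet (dilated inception network) models, the proposed method improves Kullback-Leibler divergence (KLD), sAUC (shuffled area under ROC curve), and information gain (IG) by 33%, 0.3%, and 6%, and by 53%, 0.5%, and 192%, respectively. Conclusion The experimental results show that the proposed eye fixation prediction model extracts features across space, channels, and layers through weighting, and that it outperforms mainstream models on most eye fixation prediction metrics.
Keywords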
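The three kinds of weighting named in the Method part can be illustrated with a minimal sketch. It is not the authors' implementation: in the described model the attention scores come from learned sub-networks, whereas here they are passed in as plain numbers, and the function names `channel_attention`, `spatial_attention`, and `layer_attention` are hypothetical. Features are nested lists indexed as `[channel][row][column]`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature, channel_scores):
    """Scale each channel (a 2-D map) by a sigmoid gate in (0, 1)."""
    return [[[v * sigmoid(s) for v in row] for row in ch]
            for ch, s in zip(feature, channel_scores)]

def spatial_attention(feature, spatial_scores):
    """Scale each position by a sigmoid gate shared across channels."""
    return [[[v * sigmoid(spatial_scores[i][j]) for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in feature]

def layer_attention(features, layer_scores):
    """Fuse same-shaped features from several layers with softmax weights.

    Assumption: one scalar score per layer; a plain skip connection would
    correspond to using the same fixed weight for every layer.
    """
    weights = softmax(layer_scores)
    channels = len(features[0])
    height = len(features[0][0])
    width = len(features[0][0][0])
    fused = [[[sum(w * f[c][i][j] for w, f in zip(weights, features))
               for j in range(width)] for i in range(height)]
             for c in range(channels)]
    return fused, weights

# Toy data: one channel, 2x2 maps standing in for low- and high-level features.
low = [[[1.0, 2.0], [3.0, 4.0]]]
high = [[[0.0, 0.0], [0.0, 8.0]]]
fused, weights = layer_attention([low, high], [0.0, 0.0])
```

With equal scores the two layers receive equal weights; raising one score shifts the fused map toward that layer, which is the kind of trade-off between levels that fixed skip connections cannot express.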
Eye fixation prediction combining with multiple attention mechanism

Kong Li1, Hu Xuemin1, Wang Ding1, Liu Yanfang1, Zhang Yan1, Chen Long2(1.School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China;2.School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China)

Abstract
Objective Human eye fixation prediction has attracted growing attention in image-related computer vision in recent years. It selects the distinctive salient regions of an image so that the visual structure can be captured better, and saliency models have been applied to salient object detection, object segmentation, and image cropping. Traditional methods rely on hand-crafted features based on low-level cues (e.g., contrast, texture, color) for saliency prediction. However, such features easily fail to simulate the complex activation of the human visual system, especially in complex scenes. Existing eye fixation prediction models often use skip connections to fuse high-level and low-level features, which makes it difficult to weigh the importance of features from different levels, and they ignore the fact that human gaze is biased toward the center: humans tend to look at the center of an image when it has no obvious salient region. We develop a layer attention mechanism that assigns different weights to the features of different layers so that layer features are extracted selectively, and we integrate channel and spatial attention mechanisms to selectively extract different channel and spatial features from the convolutional features. In addition, we introduce a Gaussian learning method to handle the center prior and improve prediction accuracy. Method Our eye fixation prediction model is based on a multiple attention mechanism network (MAM-Net), which uses three attention mechanisms to weight the feature information of different layers, channels, and image pixels extracted by a ResNet-50 model with dilated convolution. The network is mainly composed of a feature extraction module, the novel multiple attention mechanism (MAM) module, and a Gaussian learning optimization module.
1) A dilated convolution network captures long-range information by extracting local and global feature maps that cover many different receptive fields. 2) The MAM module combines features across the layers, channels, and image pixels of the feature maps and outputs an intermediate saliency map. 3) A Gaussian learning layer automatically selects the best kernel to blur the intermediate saliency map and generate the final saliency map. The MAM module aims to automatically optimize the detail-rich low-level features and the high-level global semantic features, fully extract channel and spatial information, and prevent over-reliance on high-level features. The Gaussian learning module performs the final optimization because human eyes tend to focus on the image center, which the predictions of common methods do not reflect. Because the kernel is learned, our method avoids the drawback of setting the Gaussian blur parameters manually from human priors. Result Experiments on the public saliency in context (SALICON) dataset show that, compared with the SAM-Res (saliency attentive model) and DINet (dilated inception network) models, our method improves the Kullback-Leibler divergence (KLD), shuffled area under the receiver operating characteristic (ROC) curve (sAUC), and information gain (IG) criteria by 33%, 0.3%, and 6%, and by 53%, 0.6%, and 192%, respectively. Conclusion We propose a novel attentive model for predicting human eye fixations on natural images. MAM-Net predicts the saliency map of an image from both high-level and low-level features. The channel and spatial attention mechanisms optimize the feature maps of different layers, and the layer attention mechanism weights the high-level and low-level features from which the saliency map is predicted. A Gaussian learning blur layer further optimizes the integrated saliency map by blurring it with different kernels.
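The blurring in step 3) and the KLD criterion from the Result part can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the paper: the described model learns its blur kernel during training, whereas here `sigma` is a fixed, hand-picked value, and the KLD follows the convention commonly used in saliency benchmarks (both maps normalized to sum to 1, with a small `eps` for numerical safety). All function names are hypothetical.

```python
import math

def gaussian_kernel(sigma, radius):
    """1-D Gaussian kernel of length 2*radius+1, normalized to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(kernel[k + radius] * row[min(max(j + k, 0), width - 1)]
                 for k in range(-radius, radius + 1))
             for j in range(width)] for row in image]

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian blur: rows first, then columns."""
    radius = max(1, int(math.ceil(3 * sigma)))
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def kld(pred, target, eps=1e-12):
    """KL divergence between target and pred, both normalized to sum to 1.

    Lower is better; 0 means the two distributions coincide.
    """
    p_sum = sum(map(sum, pred))
    t_sum = sum(map(sum, target))
    total = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p, t = p / p_sum, t / t_sum
            total += t * math.log(eps + t / (p + eps))
    return total

# Toy intermediate saliency map: a single peak in the middle of a 7x7 grid.
peak = [[0.0] * 7 for _ in range(7)]
peak[3][3] = 1.0
blurred = gaussian_blur(peak, sigma=0.5)
```

Blurring spreads the peak over its neighbors while keeping the total mass, and `kld(blurred, peak)` grows as `sigma` increases, which is why the choice of kernel affects the reported score.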
Keywords
