Multi-level feature fusion saliency detection method combining spatial attention

Chen Kai, Wang Yongxiong (University of Shanghai for Science and Technology)

Abstract
Objective: Multi-level features play an important role in saliency detection, and the extraction and fusion of multi-level features is one of the important research directions in this field. To address the problems of existing multi-level feature extraction methods, which neglect feature fusion and transmission and are sensitive to background interference, this paper proposes a new saliency detection model based on the feature pyramid network and spatial attention; the model achieves the fusion and transmission of multi-level features with a simple network structure. Method: First, to improve the quality of feature fusion, a multi-level feature fusion module is designed, which optimizes the fusion and transmission of high-level and low-level features through pooling and convolution at different scales. Second, to reduce noise such as background interference in low-level features, a spatial attention module is designed: spatial attention maps are obtained from the high-level features through pooling and convolution at different scales, and the attention maps supplement the low-level features with global semantic information, highlighting their foreground and suppressing background interference. Result: The proposed method is compared with nine related mainstream saliency detection methods on four public datasets: DUTS-test, DUT-OMRON, HKU-IS, and ECSSD. On the DUTS-test dataset, compared with the second-best model, MaxF increases by 1.04% and MAE decreases by 4.35%; the PR curve, S-measure, and other evaluation metrics are also better than those of the compared methods, the obtained saliency maps are closer to the ground truth, and the model also runs at a competitive speed. Conclusion: This paper achieves the fusion of multi-level features with a simple network structure; the feature fusion module improves the quality of feature fusion and transmission, and the spatial attention module realizes effective feature selection, highlighting salient regions and reducing the interference of background noise. Extensive experiments demonstrate the overall performance of the model and the effectiveness of each module.
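The abstract describes the spatial attention module only at a high level: attention maps are obtained from high-level features by pooling and convolution at different scales, then used to gate the low-level features. The following PyTorch sketch is one plausible reading of that description; the pooling scales, kernel sizes, sigmoid gating, and element-wise multiplication are assumptions for illustration, not the authors' exact design.

# Minimal sketch of a spatial attention module in the spirit of the abstract.
# Pooling scales, kernel sizes, and the sigmoid gating are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, high_channels, pool_scales=(2, 4)):
        super().__init__()
        self.pool_scales = pool_scales
        # One 3x3 convolution per pooling scale, each producing a one-channel map.
        self.convs = nn.ModuleList(
            [nn.Conv2d(high_channels, 1, kernel_size=3, padding=1)
             for _ in pool_scales]
        )

    def forward(self, high_feat, low_feat):
        # high_feat: high-level features with global semantics (N, C_h, h, w)
        # low_feat:  low-level features rich in detail         (N, C_l, H, W)
        maps = []
        for scale, conv in zip(self.pool_scales, self.convs):
            # Pool the high-level features at this scale, convolve, and
            # upsample the resulting map to the low-level resolution.
            pooled = F.avg_pool2d(high_feat, kernel_size=scale, stride=scale)
            attn = conv(pooled)
            attn = F.interpolate(attn, size=low_feat.shape[2:],
                                 mode='bilinear', align_corners=False)
            maps.append(attn)
        # Fuse the multi-scale maps and squash to (0, 1).
        attention = torch.sigmoid(torch.stack(maps, dim=0).sum(dim=0))
        # Gate the low-level features: foreground responses are emphasized,
        # background responses are suppressed.
        return low_feat * attention

# Usage with hypothetical VGG-16-like feature shapes:
high = torch.randn(1, 512, 14, 14)
low = torch.randn(1, 128, 112, 112)
out = SpatialAttention(high_channels=512)(high, low)   # (1, 128, 112, 112)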
Keywords
Saliency detection based on multiple features and spatial attention

Chen Kai, Wang Yongxiong (University of Shanghai for Science and Technology)

Abstract
Objective: Unlike semantic segmentation and edge detection, saliency detection focuses on finding the most attention-grabbing target in an image. Saliency maps are widely used as a preprocessing step in various computer vision tasks such as image retrieval, image segmentation, object recognition, object detection, and visual tracking. In computer graphics, saliency maps can also be used in non-photorealistic rendering, automatic image cropping, video summarization, and image retargeting. Early saliency detection methods mainly measure saliency scores from basic characteristics such as color, texture, and contrast. Although great progress has been made, hand-crafted features usually lack global information and tend to highlight the edges of salient targets rather than the whole region, so they struggle to describe complex scenes and structures. With the development of deep learning, the introduction of convolutional neural networks has freed saliency detection from the constraints of traditional hand-crafted features and currently achieves the best results. A fully convolutional network stacks convolution and pooling layers to obtain global semantic information, but when the receptive field is enlarged to obtain global semantic features, spatial structure information may be lost and the edges of salient targets may be destroyed, so a fully convolutional network alone cannot satisfy the requirements of complex saliency detection tasks. To obtain more accurate saliency maps, some works introduce hand-crafted features to retain the edges of salient targets and obtain the final saliency maps by combining the extracted edge features with the higher-level features of the fully convolutional network, but the extraction of hand-crafted features is time-consuming. As features are propagated from low level to high level, details may be gradually lost. Some works achieve good results by combining high-level features with low-level features and using the low-level features to enrich the details of the high-level features. In recent years, many models based on multi-level feature fusion have been proposed, including multi-stream structures, side-fusion structures, and bottom-up/top-down structures. However, these models focus on the network structure and ignore the importance of feature transmission and the differences between high-level and low-level features, which may cause the loss of global semantic information in high-level features and increase the interference from low-level features. Multi-level features play an important role in saliency detection, and multi-level feature extraction and fusion is one of its important research directions. To address the problems of feature fusion and sensitivity to background interference, this paper proposes a new saliency detection method based on the feature pyramid network and spatial attention, which achieves the fusion and transmission of multi-level features with a simple network architecture.

Method: To integrate features from different levels, we propose a multi-level feature fusion network architecture based on the feature pyramid network and spatial attention. The architecture takes the feature pyramid network, a classic bottom-up and top-down structure, as the backbone and focuses on optimizing the multi-level feature fusion and transmission process. The proposed network mainly consists of two parts. The first part is the bottom-up convolutional path used to extract features; multi-level features are extracted with VGG-16, one of the most effective feature extraction networks. The second part is the top-down upsampling path, in which the high-level features are upsampled stage by stage, fused with the low-level features of the corresponding scale, and transmitted forward. To reduce computation, the feature pyramid network leaves out the high-resolution features before the first pooling layer. To improve the quality of feature fusion, we design a multi-level feature fusion module, which optimizes the fusion and transmission of high-level and low-level features through pooling and convolution at different scales. To reduce the background interference in low-level features, we design a spatial attention module, which supplies global semantic information to the low-level features through attention maps obtained from the high-level features by pooling and convolution at different scales. The attention maps help the low-level features highlight the foreground and suppress the background.

Result: Experimental results on four standard datasets (DUTS-test, DUT-OMRON, HKU-IS, and ECSSD) show that the saliency maps obtained by the proposed method are very close to the ground-truth maps. On the DUTS-test dataset, MaxF increases by 1.04% and MAE decreases by 4.35% compared with the second-best method. The proposed method performs best in both simple and complex scenes. The network has good feature fusion and edge learning ability, effectively suppressing the background around salient regions and fusing the details of low-level features. The saliency maps produced by our method have more complete salient regions and clearer edges, and the results in terms of four common evaluation metrics are better than those obtained by nine state-of-the-art methods.

Conclusion: In this paper, the fusion of multi-level features is well realized with a simple network structure. The multi-level feature fusion module retains the location information of salient targets and improves the quality of feature fusion and transmission. The spatial attention module reduces background details, makes the salient regions more complete, realizes feature selection, and avoids the interference of background noise. Extensive experiments demonstrate the performance of the model and the effectiveness of each proposed module.
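As a companion to the attention sketch above, the following PyTorch sketch shows one step of the top-down fusion described in the Method section: the high-level features are upsampled, aligned in channels with the low-level features of the corresponding scale, fused, and transmitted forward. The channel counts, the 1x1 and 3x3 convolutions, and the concatenate-then-refine fusion are assumptions; the paper's fusion module additionally uses pooling and convolution at different scales, which this minimal sketch omits.

# Minimal sketch of one top-down fusion step; layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, high_channels, low_channels, out_channels):
        super().__init__()
        # 1x1 convolutions align the channel dimensions before fusion.
        self.reduce_high = nn.Conv2d(high_channels, out_channels, 1)
        self.reduce_low = nn.Conv2d(low_channels, out_channels, 1)
        # 3x3 convolution refines the fused features before passing them on.
        self.refine = nn.Sequential(
            nn.Conv2d(2 * out_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, high_feat, low_feat):
        # Upsample the high-level features to the low-level resolution.
        high = F.interpolate(self.reduce_high(high_feat),
                             size=low_feat.shape[2:],
                             mode='bilinear', align_corners=False)
        low = self.reduce_low(low_feat)
        # Concatenate and refine; the result is transmitted to the next
        # (finer) level of the top-down path.
        return self.refine(torch.cat([high, low], dim=1))

# One step of the top-down path with hypothetical VGG-16-like shapes:
c4 = torch.randn(1, 512, 28, 28)   # higher-level features
c3 = torch.randn(1, 256, 56, 56)   # lower-level features of the next scale
p3 = FeatureFusion(512, 256, 256)(c4, c3)   # (1, 256, 56, 56)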
Keywords