Saliency detection based on multiple features and spatial attention
Chen Kai, Wang Yongxiong (University of Shanghai for Science and Technology)
Objective: Unlike semantic segmentation and edge detection, saliency detection focuses on finding the most visually attractive target in an image. Saliency maps are widely used as a preprocessing step in computer vision tasks such as image retrieval, image segmentation, object recognition, object detection and visual tracking. In computer graphics, saliency maps are also used in non-photorealistic rendering, automatic image cropping, video summarization and image retargeting. Early saliency detection methods mainly measure the saliency score from basic characteristics such as color, texture and contrast. Although great progress has been made, hand-crafted features usually lack global information and tend to highlight the edges of salient targets rather than the whole region, which makes them poorly suited to describing complex scenes and structures. With the development of deep learning, convolutional neural networks have freed saliency detection from the constraints of traditional hand-crafted features and currently achieve the best results. A fully convolutional network (FCN) stacks convolution and pooling layers to obtain global semantic information. However, enlarging the receptive field to obtain global semantic features may lose spatial structure information and damage the edges of salient targets, so a plain FCN cannot meet the requirements of complex saliency detection tasks. To obtain more accurate saliency maps, some works introduce hand-crafted features to preserve the edges of salient targets and produce the final saliency map by combining the extracted hand-crafted edge features with the higher-level features of the FCN. However, extracting hand-crafted features is time-consuming, and details may be gradually lost as features pass from low levels to high levels.
Some works achieve good results by combining high-level features with low-level features, using the low-level features to enrich the details of the high-level ones. In recent years, many models based on multi-level feature fusion have been proposed, including multi-stream structures, side-fusion structures, and bottom-up and top-down structures. These models focus on the network structure and overlook how features are transmitted and how high-level and low-level features differ. As a result, the global semantic information of high-level features may be lost and the interference from low-level features may increase. Multi-level features play an important role in saliency detection, and multi-level feature extraction and fusion is one of its important research directions. To address the problems of feature fusion and sensitivity to background interference, this paper proposes a new saliency detection method based on the feature pyramid network and spatial attention, which achieves the fusion and transmission of multi-level features with a simple network architecture.
Method: To integrate features of different levels, we propose a multi-level feature fusion network architecture based on the feature pyramid network and spatial attention. The architecture takes the feature pyramid network, a classic bottom-up and top-down structure, as its backbone and focuses on optimizing the fusion and transmission of multi-level features. The network consists of two main parts. The first is the bottom-up convolutional part, which extracts features. The second is the top-down upsampling part: after each upsampling step, the high-level features are fused with the low-level features of the corresponding scale and passed on. To reduce computation, the feature pyramid network leaves out the high-resolution features that precede the first pooling layer.
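The top-down part described above can be illustrated with a minimal sketch. It is not the fusion module proposed in this paper: it assumes single-channel feature maps stored as nested lists, nearest-neighbour 2x upsampling, and plain element-wise addition as the fusion step. The function names `upsample2x`, `fuse` and `top_down` are hypothetical and chosen only for illustration.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D map given as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse(high, low):
    """Upsample the coarse map, then add it element-wise to the finer map.

    Element-wise addition is an assumption of this sketch, not the paper's module.
    """
    up = upsample2x(high)
    if len(up) != len(low) or len(up[0]) != len(low[0]):
        raise ValueError("low-level map must be exactly 2x the high-level map")
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, low)]


def top_down(pyramid):
    """Run the top-down pass over a pyramid ordered from fine to coarse.

    Each map must be half the height and width of the previous one.
    Returns the fused maps in the same fine-to-coarse order.
    """
    fused = [pyramid[-1]]                  # start from the coarsest level
    for low in reversed(pyramid[:-1]):     # walk towards finer levels
        fused.append(fuse(fused[-1], low))
    return fused[::-1]
```

In a real network each level has many channels, and the low-level features usually pass through a convolution before fusion; the sketch keeps only the upsample-then-fuse data flow.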
Multi-level features are extracted by VGG-16, a well-established feature extraction network. To improve the quality of feature fusion, we design a multi-level feature fusion module that optimizes the fusion and transmission of high-level features and the various low-level features through pooling and convolution at different scales. To reduce the background interference of low-level features, we design a spatial attention module that supplies global semantic information to the low-level features through attention maps, which are obtained from the high-level features by pooling and convolution at different scales. The attention maps help the low-level features highlight the foreground and suppress the background.
Result: Experimental results show that the saliency maps obtained by the proposed method are very close to the ground-truth maps on four standard datasets: DUTS-test, DUT-OMRON, HKU-IS and ECSSD. On the DUTS-test dataset, MaxF (maximum F-measure) increases by 1.04% and MAE (mean absolute error) decreases by 4.35% compared with the second-best method. The proposed method performs best in both simple and complex scenes. The network has good feature fusion and edge learning ability; it effectively suppresses the background around salient areas and fuses the details of low-level features. The saliency maps produced by our method have more complete salient areas and clearer edges, and the results in terms of four common evaluation metrics are better than those obtained by nine state-of-the-art methods.
Conclusion: In this paper, the fusion of multi-level features is achieved with a simple network structure. The multi-level feature fusion module retains the location information of salient targets and improves the quality of feature fusion and transmission. The spatial attention module reduces background detail and makes the salient areas more complete.
The spatial attention module performs feature selection and avoids interference from background noise. Extensive experiments demonstrate the performance of the model and the effectiveness of each proposed module.
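For readers unfamiliar with the two metrics quoted in the Result, the following sketch shows how MAE and MaxF are typically computed for a single saliency map. It rests on assumptions that the abstract does not state: the weight `beta2 = 0.3` is the value commonly used in the saliency detection literature, the threshold sweep uses 256 evenly spaced levels, and maps are flattened to lists of values in [0, 1]. The function names are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth.

    Both arguments are flat lists of floats in [0, 1] with equal length.
    """
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def max_f_measure(pred, gt, beta2=0.3, levels=256):
    """Largest F-beta score over evenly spaced binarisation thresholds.

    beta2 = 0.3 and levels = 256 are assumptions of this sketch.
    """
    best = 0.0
    positives = sum(1 for g in gt if g >= 0.5)       # salient ground-truth pixels
    for k in range(levels):
        t = k / (levels - 1)                         # threshold in [0, 1]
        tp = sum(1 for p, g in zip(pred, gt) if p >= t and g >= 0.5)
        if tp == 0:
            continue                                 # F-measure is 0 at this threshold
        detected = sum(1 for p in pred if p >= t)    # pixels predicted salient
        precision = tp / detected
        recall = tp / positives
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best
```

A lower MAE and a higher MaxF both indicate a prediction closer to the ground truth; benchmark figures are usually averaged over all images in a dataset.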