Saliency detection based on multi-level features and spatial attention
2020, Vol. 25, No. 6: 1130-1141
Received: 2019-09-05; Revised: 2019-11-21; Accepted: 2019-11-28; Published in print: 2020-06-16
DOI: 10.11834/jig.190436
Objective
Multi-level features play an important role in saliency detection, and their extraction and fusion is one of the important research directions in the field. Existing multi-level feature extraction methods tend to neglect feature fusion and transmission and are sensitive to background interference. To address these problems, this paper proposes a multi-level feature fusion saliency detection model that incorporates spatial attention, built on a feature pyramid network and an attention mechanism; the model achieves the fusion and transmission of multi-level features well with a simple network structure.
Method
To improve the quality of feature fusion, a multi-level feature fusion module is designed, which optimizes the fusion and transmission of high-level and low-level features through pooling and convolution at different scales. To reduce interference from background noise in the low-level features, a spatial attention module is designed: spatial attention maps are obtained from the high-level features through pooling and convolution at different scales, and these maps supply global semantic information to the low-level features, highlighting their foreground and suppressing background interference.
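To make the attention mechanism above concrete, the following PyTorch-style sketch shows one plausible form of such a module. It is a minimal reading of the description rather than the authors' implementation; the class name, channel counts, and pooling sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Hypothetical sketch: derive a spatial attention map from a high-level
    feature and use it to re-weight a higher-resolution low-level feature."""

    def __init__(self, high_channels, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        # One 3x3 convolution per pooling scale, each producing a 1-channel map.
        self.branches = nn.ModuleList(
            [nn.Conv2d(high_channels, 1, kernel_size=3, padding=1)
             for _ in pool_sizes])

    def forward(self, high_feat, low_feat):
        h, w = low_feat.shape[2:]
        responses = []
        for size, conv in zip(self.pool_sizes, self.branches):
            # Pool the high-level feature to a coarse size x size grid,
            # convolve, then upsample back to the low-level resolution.
            pooled = F.adaptive_avg_pool2d(high_feat, output_size=size)
            responses.append(F.interpolate(conv(pooled), size=(h, w),
                                           mode='bilinear', align_corners=False))
        # Fuse the multi-scale responses into a single attention map in [0, 1].
        attn = torch.sigmoid(sum(responses))
        # Highlight the foreground and suppress the background of the low-level feature.
        return low_feat * attn
```

Because the attention map is driven by the high-level feature, it injects global semantic cues into the low-level feature while leaving its spatial detail intact.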
Result
The proposed method is compared with nine related mainstream saliency detection methods on four public datasets: DUTS, DUT-OMRON (Dalian University of Technology and OMRON Corporation), HKU-IS, and ECSSD (extended complex scene saliency dataset). On the DUTS-test dataset, relative to the second-best model, the maximum F-measure (MaxF) of our method increases by 1.04% and the mean absolute error (MAE) decreases by 4.35%; it also outperforms the compared methods on the precision-recall (PR) curve, structure measure (S-measure), and other evaluation metrics. The resulting saliency maps are closer to the ground truth, and the model also runs at a competitive speed.
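For reference, the MAE quoted above is the average per-pixel absolute difference between a predicted saliency map and its ground-truth mask, both scaled to [0, 1]; a minimal sketch (the function name is ours) follows.

```python
import numpy as np

def mae(saliency_map: np.ndarray, ground_truth: np.ndarray) -> float:
    """Mean absolute error between a saliency map and its ground truth,
    both given as 2-D arrays with values in [0, 1]."""
    diff = np.abs(saliency_map.astype(np.float64) - ground_truth.astype(np.float64))
    return float(diff.mean())
```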
Conclusion
This paper achieves the fusion of multi-level features well with a simple network structure. The feature fusion module improves the quality of feature fusion and transmission, and the spatial attention module performs effective feature selection, highlighting salient regions and reducing interference from background noise. Extensive experiments demonstrate the overall performance of the model and the effectiveness of each module.
Objective
In contrast with semantic segmentation and edge detection, saliency detection focuses on finding the most attractive target in an image. Saliency maps are widely used as a preprocessing step in various computer vision tasks, such as image retrieval, image segmentation, object recognition, object detection, and visual tracking. In computer graphics, saliency maps are used in non-photorealistic rendering, automatic image cropping, video summarization, and image retargeting. Early saliency detection methods mostly measured the saliency score through basic characteristics, such as color, texture, and contrast. Although considerable progress has been achieved, handcrafted features typically lack global information, struggle to describe complex scenes and structures, and tend to highlight the edges of salient targets rather than the overall region. With the development of deep learning, the introduction of convolutional neural networks has freed saliency detection from the restraint of traditional handcrafted features and currently achieves the best results. Fully convolutional networks (FCNs) stack convolution and pooling layers to obtain global semantic information. However, spatial structure information may be lost and the edge information of salient targets may be destroyed when the receptive field is enlarged to obtain global semantic features; thus, a plain FCN cannot satisfy the requirements of complex saliency detection tasks. To obtain accurate saliency maps, some studies introduce handcrafted features to retain the edges of salient targets and obtain the final saliency maps by combining the extracted handcrafted edge features with the higher-level features of the FCN. However, extracting handcrafted features takes considerable time, and details may be gradually lost as features are transformed from low level to high level. Other studies have achieved good results by combining high- and low-level features, using low-level features to enrich the details of high-level features. Many models based on multi-level feature fusion have been proposed in recent years, including multi-flow, side-fusion, bottom-up, and top-down structures. These models focus on the network structure but disregard the importance of transmission and the differences between high- and low-level features, which may cause the loss of global semantic information in the high-level features and increase the interference from low-level features. Multi-level features play an important role in saliency detection, and the method of multi-level feature extraction and fusion is one of the important research directions in this area. To solve the problems of feature fusion and sensitivity to background interference, this study proposes a new saliency detection method based on feature pyramid networks and spatial attention. The method achieves the fusion and transmission of multi-level features with a simple network architecture.
Method
We propose a multi-level feature fusion network architecture based on a feature pyramid network and spatial attention to integrate different levels of features. The architecture adopts the feature pyramid network, the classic bottom-up and top-down structure, as the backbone and focuses on optimizing multi-level feature fusion and the transmission process. The proposed network consists of two parts. The first is the bottom-up convolution part, which is used to extract features. The second is the top-down upsampling part, in which each upsampled high-level feature is fused with the low-level features of the corresponding scale and transmitted forward. Following the feature pyramid network, the high-resolution features before the first pooling layer are removed to reduce computation. Multi-level features are extracted using VGG-16 (visual geometry group), one of the best-performing feature extraction networks. To improve the quality of feature fusion, a multi-level feature fusion module is designed that optimizes the fusion and transmission of high-level and low-level features through pooling and convolution at different scales. To reduce the background interference carried by low-level features, a spatial attention module is designed that supplies global semantic information to the low-level features through attention maps obtained from the high-level features, again via pooling and convolution at different scales. These attention maps help the low-level features highlight the foreground and suppress the background.
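The sketch below illustrates how one top-down step of such a network might look when the two modules are combined. It is our simplified reading under stated assumptions (a single-branch attention map, illustrative channel widths), not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseStep(nn.Module):
    """Hypothetical sketch of one top-down fusion step: upsample the high-level
    feature, re-weight the low-level feature with an attention map derived from
    the high-level feature, and merge the two by convolution."""

    def __init__(self, high_channels, low_channels, out_channels):
        super().__init__()
        # Project both inputs to a common channel width before merging.
        self.reduce_high = nn.Conv2d(high_channels, out_channels, kernel_size=1)
        self.reduce_low = nn.Conv2d(low_channels, out_channels, kernel_size=1)
        self.attn = nn.Conv2d(high_channels, 1, kernel_size=3, padding=1)
        self.merge = nn.Sequential(
            nn.Conv2d(2 * out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, high_feat, low_feat):
        h, w = low_feat.shape[2:]
        # Upsample the high-level feature to the resolution of the low-level one.
        up = F.interpolate(high_feat, size=(h, w), mode='bilinear',
                           align_corners=False)
        # The attention map carries global semantics and suppresses background
        # responses in the low-level feature.
        attn = torch.sigmoid(F.interpolate(self.attn(high_feat), size=(h, w),
                                           mode='bilinear', align_corners=False))
        low = self.reduce_low(low_feat) * attn
        # Concatenate and convolve to produce the feature passed to the next step.
        return self.merge(torch.cat([self.reduce_high(up), low], dim=1))
```

In a full decoder this step would be repeated from the deepest VGG-16 stage toward the shallowest, with each output serving as the high-level input of the next step.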
Result
Experimental results on four standard datasets, namely, DUTS, DUT-OMRON (Dalian University of Technology and OMRON Corporation), HKU-IS, and the extended complex scene saliency dataset (ECSSD), show that the saliency maps obtained by the proposed method are highly similar to the ground-truth maps. On the DUTS-test dataset, the maximum F-measure (MaxF) increases by 1.04% and the mean absolute error (MAE) decreases by 4.35% compared with the second-best method. The proposed method performs the best in both simple and complex scenes. The network exhibits good feature fusion and edge learning abilities, effectively suppressing the background around salient areas and fusing the details of low-level features. The saliency maps produced by our method have more complete salient areas and clearer edges, and the results on four common evaluation metrics are better than those obtained by nine state-of-the-art methods.
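The MaxF score reported above is obtained by binarizing the saliency map at a sweep of thresholds and keeping the largest F-measure against the binary ground truth; the sketch below assumes the β² = 0.3 weighting that is commonly used in salient object detection, and the function name is ours.

```python
import numpy as np

def max_f_measure(saliency_map: np.ndarray, ground_truth: np.ndarray,
                  beta2: float = 0.3, steps: int = 255) -> float:
    """Maximum F-measure over a sweep of binarization thresholds."""
    gt = ground_truth > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps, endpoint=False):
        pred = saliency_map > t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best
```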
Conclusion
In this study, the fusion of multi-level features is realized well with a simple network structure. The multi-level feature fusion module retains the location information of salient targets and improves the quality of feature fusion and transmission. The spatial attention module reduces background details and makes the salient areas more complete; it realizes feature selection and avoids interference from background noise. Extensive experiments verify the performance of the model and the effectiveness of each proposed module.