Dense weak attention model for salient object detection
2020, Vol. 25, No. 1, pp. 136-147
Received: 2019-05-15; Revised: 2019-07-01; Published in print: 2020-01-16
DOI: 10.11834/jig.190187
Objective
Research on salient object detection (SOD) with fully convolutional network (FCN) models has held that a larger decoding network achieves better detection than a smaller one, which leads to a huge number of parameters in the decoding stage. The visual attention mechanism alleviates the model-size problem to some extent. This paper divides attention mechanisms into two kinds, strong and weak: strong attention provides a stronger prior for decoding but carries a high risk; conversely, weak attention carries less risk but provides a weaker prior. On this basis, we propose and verify the view that a small network architecture using weak attention can reach the detection accuracy of a large network.
Method
We design two stages, global saliency prediction and weak-attention-based edge refinement, whose core is the proposed dense weak attention module. The module compensates for the shortcomings of weak attention: with only a few additional parameters, it provides prior information no weaker than that of strong attention.
Result
Under the same experimental environment, the proposed model achieves overall better detection results on five datasets. Meanwhile, the proposed method keeps the model size to 69.5 MB and reaches a real-time detection speed of 32 frames per second. The experimental results show that, compared with detection methods using strong attention, the proposed dense weak attention module gives the detection model better generalization ability.
Conclusion
The goal of this work is to improve detection efficiency with the weak attention mechanism, and to this end we design a weak attention module that balances efficiency and risk. The weak attention mechanism improves the efficiency of decoding features, thereby compressing the model size and speeding up detection, and it shows better generalization on existing test sets.
Objective
Salient object detection, also called saliency detection, aims to localize and segment the most conspicuous and eye-attracting objects or regions in an image. Several applications have benefited from saliency detection, such as image and video compression, context-aware image retargeting, scene parsing, image resizing, object detection, and segmentation. The detection process includes feature extraction and mapping to the saliency value. Most state-of-the-art salient object detection models use features extracted from a pre-trained classification network. Related works have shown that models based on fully convolutional networks (FCNs) can encode semantic-rich features, thereby improving the robustness and accuracy of saliency detection. An intuitive opinion states that a large, complex network performs better than a small, simple one; consequently, many current methods lack efficiency and require substantial storage resources. In the past few years, the attention mechanism has been employed to aid many visual tasks by reducing the decoding difficulty and producing lightweight networks. More specifically, the attention mechanism utilizes a pre-estimated attention mask to provide useful prior knowledge to the decoding process. This mechanism eases the mapping from features to the saliency value and eliminates the need to design a large and complex decoding network. However, the widely used strong attention applies a multiplicative operation between the attention mask and the features. When the attention mask is normalized, i.e., its values range from 0 to 1, a mask value of 0 irreversibly wipes out the distribution of the corresponding features. Thus, using strong attention may cause overfitting. On the contrary, weak attention applies an additive operation and is less risky but less efficient. Weak attention shifts the features in the feature space and does not destroy their distribution. However, the added information can be smoothed away by subsequent convolutional operations: the longer the sequence of convolutional layers, the less effect the attention mask exerts on the decoding features. This work contributes in three aspects: 1) we analyze the visual attention mechanism by dividing it into strong and weak attention and qualitatively explain how the attention mechanism improves decoding efficiency; 2) we discuss the principles of the two types of attention mechanism; and 3) we propose a dense weak attention module that improves the efficiency of feature utilization compared with existing methods.
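To make the distinction concrete, the following minimal sketch (our illustration, not the authors' code) contrasts the two operations in PyTorch; the shapes and names are assumptions chosen for clarity.

import torch

def strong_attention(features, mask):
    # Multiplicative: where the normalized mask is 0, the feature response
    # is irreversibly wiped out, destroying its distribution.
    return features * mask

def weak_attention(features, mask):
    # Additive: the mask only shifts features in feature space, preserving
    # their distribution (less risky, but a weaker prior).
    return features + mask

features = torch.randn(1, 64, 32, 32)            # illustrative feature maps
mask = torch.sigmoid(torch.randn(1, 1, 32, 32))  # normalized mask in [0, 1]
strong = strong_attention(features, mask)        # broadcasts over channels
weak = weak_attention(features, mask)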
Method
Instead of applying weak attention only before the first decoding convolutional layer, we apply it repeatedly and consecutively, i.e., before every decoding convolutional layer. The proposed method is called the dense weak attention module (DWAM), which yields an end-to-end detection model called the dense weak attention network. The proposed method inherits an FCN-like architecture, which consists of a sequence of convolutional, pooling, and activation layers. Fine-tuning the VGG-16 network, we divide the decoding network into two parts: global saliency detection and edge optimization using DWAM. A rough saliency map is predicted in the deepest branch of the network. Then, the saliency map is treated as an attention mask and concatenated to shallow features to predict a saliency map of increased resolution. To supervise the side saliency maps, we add cross-entropy loss layers after each side output, a process known as deep supervision, to optimize the network. We find that weak attention plays an important role in refining the detection result by providing effective prior information. With few additional parameters, we achieve improved detection accuracy and detection speed. To obtain a more robust prediction, atrous spatial pyramid pooling is used to enhance the ability to detect multiscale targets.
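As an illustration of the dense application described above, the following hypothetical PyTorch sketch shows one decoding stage in which the coarse saliency map is re-attached (by concatenation) before every decoding convolution rather than only the first; the class name, channel sizes, and layer count are our assumptions, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseWeakAttentionStage(nn.Module):
    # One decoding stage: the attention mask is re-injected before each
    # convolution so the prior is not smoothed away by the layer sequence.
    def __init__(self, in_ch, mid_ch=64, n_convs=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch + 1, mid_ch, 3, padding=1)] +
            [nn.Conv2d(mid_ch + 1, mid_ch, 3, padding=1)
             for _ in range(n_convs - 1)])
        self.side_out = nn.Conv2d(mid_ch, 1, 1)  # side saliency prediction

    def forward(self, shallow_feats, coarse_saliency):
        # Upsample the coarse map to the shallow features' resolution.
        mask = F.interpolate(coarse_saliency, size=shallow_feats.shape[2:],
                             mode='bilinear', align_corners=False)
        x = shallow_feats
        for conv in self.convs:
            x = F.relu(conv(torch.cat([x, mask], dim=1)))
        return torch.sigmoid(self.side_out(x))

Under deep supervision, each stage's side output would be compared against the ground-truth mask by a cross-entropy loss during training.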
Result
We compared the proposed method with seven FCN-based state-of-the-art techniques on five widely used benchmarks and adopted three evaluation criteria: mean absolute error (MAE), F-measure, and the precision-recall curve. Under the same conditions, the proposed model demonstrated more competitive results than the other state-of-the-art methods. The MAE of the proposed method is generally better than that of the other methods, which means that DWAM produces more accurate pixel-level results. DWAM's F-measure is approximately 2% to 6% higher than that of most state-of-the-art methods. In addition, the precision-recall curve shows that DWAM holds a slight advantage and a better balance between precision and recall. Meanwhile, the model size of the proposed method is only 69.5 MB, and the real-time detection speed reaches 32 frames per second.
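For reference, the two scalar metrics can be sketched as follows, assuming predictions and ground-truth masks are NumPy arrays scaled to [0, 1]; the fixed threshold and the beta^2 = 0.3 weighting are common benchmark conventions assumed here, not settings taken from the paper.

import numpy as np

def mae(pred, gt):
    # Mean absolute error: average per-pixel deviation.
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    # F-measure = (1 + beta^2) * P * R / (beta^2 * P + R)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0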
Conclusion
In this study, we proposed an efficient, fully convolutional salient object detection model that improves the efficiency of feature decoding and enhances generalization ability through the weak attention mechanism and deeply supervised training. Compared with existing methods, the results of the proposed method are more competitive and the detection speed is faster, even though the model remains small.