Sketch-based image retrieval based on fine-grained feature and deep convolutional neural network
2019, Vol. 24, No. 6, Pages 946-955
Received: 2018-09-07; Revised: 2018-11-14; Published in print: 2019-06-16
DOI: 10.11834/jig.180525
Objective
Traditional sketch-based image retrieval methods focus mainly on retrieving images of the same category and ignore the fine-grained features of sketches. We therefore propose a new sketch-based image retrieval method that combines fine-grained features with a deep convolutional neural network, attending both to holistic matching across the deep cross-domain gap and to fine-grained detail matching.
Method
First, a multi-branch hybrid convolutional neural network is constructed to process sketches and natural images differently. Second, an attention model is added to the network to capture fine-grained features. Finally, coarse and fine features are fused, and similarity is measured to obtain the retrieval results.
Result
Experiments on different datasets compare the proposed method with five baselines: the traditional scale-invariant feature transform (SIFT) and histogram of oriented gradients (HOG), and the deep sketch models Deep SaN (sketch-a-net), Deep 3DS (sketch), and Deep TSN (triplet sketch net), using Top-1 and Top-10 accuracy. On the shoe dataset, the proposed method improves Top-1 accuracy by 12%; on the chair dataset, it improves Top-1 by 11% and Top-10 by 3%, achieving higher accuracy than traditional sketch retrieval methods. In the experiments, the proposed method retrieves most target images at rank 1 from a sketch query, achieving instance-level sketch retrieval.
Conclusion
A new sketch-based image retrieval method is proposed that offers a new approach to cross-domain retrieval between sketches and natural images and performs instance-level sketch retrieval. Compared with existing methods, retrieval precision improves markedly, demonstrating the feasibility of the proposed method.
Objective
Content-based and text-based image retrieval have played a major role in practical computer vision applications. In several scenarios, however, retrieval becomes a problem when sample queries are unavailable or are difficult to describe with keywords. Compared with text, sketches intrinsically capture object appearance and structure. They are highly intuitive to humans and descriptive in nature, providing a convenient means to specify object appearance and structure. As a query modality, they offer a degree of precision and flexibility that is missing in traditional text-based image retrieval. Closely correlated with the proliferation of touch-screen devices, sketch-based image retrieval (SBIR) has become an increasingly prominent research topic in recent years. Conventional SBIR principally focuses on retrieving images of the same category and disregards the fine-grained features of sketches. SBIR is challenging because humans draw free-hand sketches without any reference, focusing only on salient object structures; hence, the shapes and scales in sketches are usually distorted compared with those in natural images. To deal with this problem, studies have developed methods to bridge the domain gap between sketches and natural images for SBIR. These approaches can be roughly divided into hand-crafted methods and cross-domain deep learning-based methods. Hand-crafted SBIR generates approximate sketches by extracting edge or contour maps from natural images; hand-crafted features are then extracted from the sketches and the edge maps and fed into "bag-of-words" pipelines to generate representations for retrieval. The major limitation of hand-crafted methods is that the domain gap between sketches and natural images cannot be well remedied, because matching edge maps to non-aligned sketches with large variations and ambiguity is difficult. To address this problem, we propose a novel sketch-based image retrieval method that combines fine-grained features with a deep convolutional neural network. This fine-grained SBIR (FG-SBIR) approach focuses not only on coarse holistic matching via a deep cross-domain network but also explicitly accounts for fine-grained detail matching.
Method
Most existing SBIR studies have focused on category-level sketch-to-photo retrieval, often employing a bag-of-words representation combined with a form of edge detection on photo images to bridge the domain gap. Previous work on the fine-grained SBIR problem is based on deformable part-based models and graph matching. However, the definition of fine-grained in that work differs from ours: a sketch is considered to match a photo if the depicted objects merely look similar. In addition, such hand-crafted feature-based approaches are inadequate for capturing the subtle intra-category and inter-instance differences, as demonstrated in our experiments. Our method proceeds in three steps. First, we construct a multi-branch hybrid deep convolutional neural network that processes sketches and natural images differently. Three branches are used: one sketch branch and two natural-image branches. The sketch branch has four convolutional and two pooling layers, whereas each natural-image branch has five and two, respectively. Adding a convolutional layer to obtain more abstract natural-image features resolves the inconsistency in abstraction level between the two domains, and the branch-specific designs reduce domain differences, as illustrated in the sketch below.
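As a concrete illustration of this branch design, here is a minimal PyTorch sketch, not the authors' released code: the three-branch layout and the four-versus-five convolutional layers follow the description above, while the kernel sizes, channel widths, and pooling positions are our own assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool=False):
    """3x3 convolution + ReLU, optionally followed by 2x2 max pooling."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class SketchBranch(nn.Module):
    """Sketch branch: four convolutional layers, two pooling layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64, pool=True),    # sketches are single-channel
            conv_block(64, 128),
            conv_block(128, 256, pool=True),
            conv_block(256, 256),
        )

    def forward(self, x):
        return self.features(x)

class ImageBranch(nn.Module):
    """Natural-image branch: five convolutional layers, two pooling layers.
    The extra convolutional layer raises the abstraction level of photo
    features to better match the more abstract sketch domain."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, pool=True),    # photos are RGB
            conv_block(64, 128),
            conv_block(128, 256, pool=True),
            conv_block(256, 256),
            conv_block(256, 256),            # extra layer for abstraction
        )

    def forward(self, x):
        return self.features(x)
```

The two natural-image branches (processing the positive and negative photos of a training triplet) would typically share weights in a Siamese fashion.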
Second, we extract detail information by adding an attention model to the network. Most attention models learn an attention mask that assigns different weights to different regions of an image. Soft attention is the most commonly used variant because it is differentiable and can thus be learned end-to-end with the rest of the network. Our attention model is specifically designed for FG-SBIR in that it is made robust against spatial misalignment through a shortcut-connection architecture.
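The following is a minimal sketch of soft attention with a shortcut connection, under our own assumptions about the details (a 1x1 convolution producing the mask, spatial softmax normalization, and an additive shortcut):

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Soft attention with a shortcut connection: the attended features are
    added back onto the input, which preserves useful signal even when
    sketch and photo features are spatially misaligned."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-location score

    def forward(self, x):                                    # x: (B, C, H, W)
        b, _, h, w = x.shape
        scores = self.score(x).view(b, 1, h * w)
        mask = torch.softmax(scores, dim=-1).view(b, 1, h, w)  # spatial weights
        attended = x * mask                                  # reweight regions
        return x + attended                                  # shortcut connection
```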
Third, we fuse coarse and fine semantic information to achieve retrieval; combining the two yields robust features. Finally, we train with a deep triplet loss, defined in the max-margin framework, to obtain good results; a sketch of both steps follows.
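The fusion step and the max-margin triplet loss can be sketched as below; concatenation as the fusion operator, L2 normalization, and the margin value of 0.3 are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def fuse(coarse, fine):
    """Fuse a coarse (holistic) feature vector with a fine (attention) one;
    concatenation followed by L2 normalization is one simple choice."""
    return F.normalize(torch.cat([coarse, fine], dim=1), p=2, dim=1)

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Max-margin triplet loss: the sketch (anchor) must be closer to its
    matching photo (positive) than to a non-matching photo (negative) by
    at least `margin`, otherwise a penalty is incurred."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared L2 distances
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```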
Result
The experiments are conducted on two benchmark datasets, one of shoes and one of chairs. We compare against two traditional hand-crafted feature-based models, namely, scale-invariant feature transform (SIFT) and histogram of oriented gradients (HOG), and three deep baseline models designed for sketches, namely, Deep SaN, Deep 3DS, and Deep TSN. The ratio of correctly predicting the true match at Top-1 and Top-10 serves as the evaluation metric. Comparing our full model with the five baselines shows that the proposed method obtains higher retrieval precision than the traditional methods: our model performs best on every metric and on both datasets. The improvement is particularly clear at Top-1, with an increase of approximately 12% on the shoe dataset; on the chair dataset, we obtain increases of approximately 11% at Top-1 and 3% at Top-10. In other words, the correct photo is usually returned as the first result, which is exactly what instance-level retrieval requires. The proposed model thus obtains good results on the FG-SBIR task; a sketch of the Top-k evaluation protocol follows.
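For reference, the Top-k metric ("the true match appears among the k nearest photos") can be computed as in this sketch, assuming one true photo per query sketch and embeddings produced by the network above:

```python
import torch

@torch.no_grad()
def top_k_accuracy(sketch_feats, photo_feats, k=1):
    """sketch_feats, photo_feats: (N, D) tensors in which photo i is the
    true match of sketch i. Returns the fraction of sketches whose true
    match ranks among the k nearest photos by Euclidean distance."""
    dists = torch.cdist(sketch_feats, photo_feats)           # (N, N) pairwise L2
    nearest = dists.argsort(dim=1)[:, :k]                    # indices of k nearest
    targets = torch.arange(len(sketch_feats)).unsqueeze(1)   # true match indices
    return (nearest == targets).any(dim=1).float().mean().item()

# Usage: report top_k_accuracy(s, p, k=1) and top_k_accuracy(s, p, k=10).
```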
Conclusion
The proposed sketch-based image retrieval method provides a new way of thinking about cross-domain retrieval between sketches and natural images, and the sketch convolutional neural network obtains good results on this task. Instance-level retrieval is more challenging than the well-studied category-level SBIR task but also more useful for commercial SBIR adoption. Achieving fine-grained retrieval across the sketch/image gap would normally require a deep network learned with demanding triplet annotations; we demonstrate how to sidestep these requirements while achieving good performance on this new and challenging task. By introducing attention modeling into the sketch convolutional neural network, the model can concentrate on the subtle differences between local regions of sketches and photo images and compute deep features containing both fine-grained detail and high-level semantics. The proposed sketch neural network is well suited to FG-SBIR.
Canny J. A computational approach to edge detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986, PAMI-8(6): 679-698. [DOI: 10.1109/TPAMI.1986.4767851]
Martin D R, Fowlkes C C, Malik J. Learning to detect natural image boundaries using local brightness, color, and texture cues[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(5): 530-549. [DOI: 10.1109/TPAMI.2004.1273918]
Zagoruyko S, Komodakis N. Learning to compare image patches via convolutional neural networks[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 4353-4361. [DOI: 10.1109/CVPR.2015.7299064]
Eitz M, Hildebrand K, Boubekeur T, et al. A descriptor for large scale image retrieval based on sketched feature lines[C]//Proceedings of the 6th Eurographics Symposium on Sketch-Based Interfaces and Modeling. New Orleans, Louisiana: ACM, 2009: 9-36. [DOI: 10.1145/1572741.1572747]
Hu R, Barnard M, Collomosse J. Gradient field descriptor for sketch based retrieval and localization[C]//Proceedings of 2010 IEEE International Conference on Image Processing. Hong Kong, China: IEEE, 2010: 1025-1028. [DOI: 10.1109/ICIP.2010.5649331]
Hu R, Collomosse J. A performance evaluation of gradient field HOG descriptor for sketch based image retrieval[J]. Computer Vision and Image Understanding, 2013, 117(7): 790-806. [DOI: 10.1016/j.cviu.2013.02.005]
Cao Y, Wang C H, Zhang L Q, et al. Edgel index for large-scale sketch-based image search[C]//CVPR 2011. Colorado Springs, CO, USA: IEEE, 2011: 761-768. [DOI: 10.1109/CVPR.2011.5995460]
Eitz M, Hildebrand K, Boubekeur T, et al. Sketch-based image retrieval: benchmark and bag-of-features descriptors[J]. IEEE Transactions on Visualization and Computer Graphics, 2011, 17(11): 1624-1636. [DOI: 10.1109/TVCG.2010.266]
Sun X H, Wang C H, Xu C, et al. Indexing billions of images for sketch-based retrieval[C]//Proceedings of the 21st ACM International Conference on Multimedia. Barcelona, Spain: ACM, 2013: 233-242. [DOI: 10.1145/2502081.2502281]
Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: ACM, 2012: 1097-1105. [DOI: 10.1145/3065386]
Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE Computer Society Press, 2015: 1-9. [DOI: 10.1109/CVPR.2015.7298594]
Yu Q, Yang Y X, Song Y Z, et al. Sketch-a-net that beats humans[C]//Proceedings of the British Machine Vision Conference, 2015: 7.1-7.12. [DOI: 10.5244/c.29.7]
Seddati O, Dupont S, Mahmoudi S. Quadruplet networks for sketch-based image retrieval[C]//Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. Bucharest, Romania: ACM, 2017: 184-191. [DOI: 10.1145/3078971.3078985]
Bui T, Ribeiro L, Ponti M, et al. Generalisation and sharing in triplet convnets for sketch based visual search[EB/OL]. [2018-08-20]. https://arxiv.org/pdf/1611.05301.pdf
Liu L, Shen F M, Shen Y M, et al. Deep sketch hashing: fast free-hand sketch-based image retrieval[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 2298-2307. [DOI: 10.1109/CVPR.2017.247]
Shen Y M, Liu L, Shen F M, et al. Zero-shot sketch-image hashing[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 3598-3607. [DOI: 10.1109/CVPR.2018.00379]
Li Y, Hospedales T M, Song Y Z, et al. Fine-grained sketch-based image retrieval by matching deformable part models[C]//Proceedings of the British Machine Vision Conference, 2014: 1-12. http://www.eecs.qmul.ac.uk/~tmh/papers/li2014sbirDpm.pdf
Sangkloy P, Burnell N, Ham C, et al. The sketchy database: learning to retrieve badly drawn bunnies[J]. ACM Transactions on Graphics, 2016, 35(4): #119. [DOI: 10.1145/2897824.2925954]
Yu Q, Liu F, Song Y Z, et al. Sketch me that shoe[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 799-807. [DOI: 10.1109/CVPR.2016.93]
Wang J, Song Y, Leung T, et al. Learning fine-grained image similarity with deep ranking[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 1386-1393. [DOI: 10.1109/CVPR.2014.180]
Li K, Pang K Y, Song Y Z, et al. Fine-grained sketch-based image retrieval: the role of part-aware attributes[C]//Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision. Lake Placid, NY, USA: IEEE, 2016: 1-9. [DOI: 10.1109/WACV.2016.7477615]
Song J F, Yu Q, Song Y Z, et al. Deep spatial-semantic attention for fine-grained sketch-based image retrieval[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 5552-5561. [DOI: 10.1109/ICCV.2017.592]
Mnih V, Heess N, Graves A, et al. Recurrent models of visual attention[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM, 2014. https://arxiv.org/abs/1406.6247
Sermanet P, Frome A, Real E. Attention for fine-grained categorization[EB/OL]. [2018-08-20]. https://arxiv.org/pdf/1412.7054.pdf
Lu J S, Xiong C M, Parikh D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3242-3250. [DOI: 10.1109/CVPR.2017.345]
Fukui A, Park D H, Yang D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding[EB/OL]. [2018-08-20]. https://arxiv.org/pdf/1606.01847.pdf
Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[EB/OL]. [2018-08-20]. https://arxiv.org/pdf/1511.04119.pdf
Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning, 2015: 2048-2057. https://arxiv.org/abs/1502.03044
Yang Z C, He X D, Gao J F, et al. Stacked attention networks for image question answering[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 21-29. [DOI: 10.1109/CVPR.2016.10]
Sundermeyer M, Schlüter R, Ney H. LSTM neural networks for language modeling[C]//Proceedings of the 13th Annual Conference of the International Speech Communication Association. Portland, OR, USA: ISCA, 2012: 601-608.
Laskar Z, Kannala J. Context aware query image representation for particular object retrieval[C]//Proceedings of the 20th Scandinavian Conference on Image Analysis. Tromsø, Norway: Springer, 2017: 88-99. [DOI: 10.1007/978-3-319-59129-2_8]