Fine-grained shoe image retrieval by part detection and semantic network
2020, Vol. 25, No. 8: 1578-1590
Received: 2019-09-09
Revised: 2020-01-07
Accepted: 2020-01-14
Published in print: 2020-08-16
DOI: 10.11834/jig.190467
Objective
Fine-grained image retrieval is a hot topic in fine-grained image analysis and computer vision. Taking shoe images as an example, traditional methods extract only coarse-grained features and lack key semantic attributes, so they cannot distinguish subtle differences between parts and are ineffective for fine-grained retrieval. To address the low retrieval performance of shoe image retrieval methods that rely only on simple style cues, this paper proposes a fine-grained shoe image retrieval method combining part detection and a semantic network.
Method
First, part detection is performed on the input query shoe image using the annotated shoe image training set. Then, a semantic network is trained on the part-detected shoe images and the defined semantic attributes to extract feature vectors from the query and training images, and principal component analysis is applied for dimensionality reduction. Finally, metric learning measures the feature vectors of the query image against each candidate image in the training set, and retrieval results are output in descending order of match score.
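The retrieval pipeline summarized above (feature extraction, PCA reduction, similarity ranking) can be sketched as follows. This is an illustrative sketch only: the semantic-network feature extractor is abstracted away as precomputed feature vectors, and cosine similarity stands in for the learned metric.

```python
import numpy as np

def pca_fit(X, k):
    """Fit PCA on gallery features X (n x d); return the mean and top-k components."""
    mean = X.mean(axis=0)
    # SVD of the centered data yields the principal directions in the rows of vt.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def retrieve(query_feat, gallery_feats, k=128):
    """Rank gallery images by similarity to the query in PCA-reduced space."""
    mean, comps = pca_fit(gallery_feats, k)
    q = (query_feat - mean) @ comps.T
    G = (gallery_feats - mean) @ comps.T
    # Cosine similarity as a stand-in for the learned metric.
    q = q / np.linalg.norm(q)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    return np.argsort(-(G @ q))  # gallery indices, best match first

# Toy example: five gallery images with random 64-dim "semantic" features.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 64))
query = gallery[2] + 0.01 * rng.normal(size=64)  # near-duplicate of image 2
print(retrieve(query, gallery, k=4))
```

Because the query is a near-duplicate of gallery image 2, that image is ranked first; the remaining images follow in decreasing cosine similarity.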
Result
Experiments on the UT-Zap50K dataset compare the proposed method with four current methods with strong retrieval performance, improving retrieval precision by nearly 6%. Compared with the SHOE-CNN (semantic hierarchy of attribute convolutional neural network) method on the same task, the proposed method also achieves higher retrieval accuracy.
Conclusion
To address the low accuracy of shoe image retrieval caused by traditional image features lacking fine visual description, a fine-grained shoe image retrieval method is proposed that improves both the precision and accuracy of shoe image retrieval and satisfies practical application needs.
Objective
Fine-grained image retrieval is a major issue in current fine-grained image analysis and computer vision. Traditional methods typically retrieve visually similar or near-duplicate images through large-scale, coarse-grained search, with low precision. Fine-grained image retrieval, a subtask of fine-grained image identification, is more demanding: the traditional retrieval task extracts only the coarse-grained features of images and lacks key semantic attributes, which makes it difficult to distinguish the nuances among parts, so it cannot be effectively used for fine-grained retrieval. The core difficulty is that coarse-grained feature extraction cannot represent fine-grained images effectively. Images of the same subclass can also differ significantly in shape, posture, and color; consequently, search results often fail to meet actual needs. Compared with conventional image analysis problems, fine-grained image retrieval is more challenging because differences between subcategories are small while differences within a subcategory are large. A fine-grained image retrieval method based on part detection and a semantic network for shoe images is therefore proposed to solve these problems.
Method
First, part detection is conducted on the query shoe image using an annotated training dataset of shoe images. Second, a semantic network is trained on the detected shoe parts and the defined semantic attributes of the training images, and feature vectors are extracted for the query and training images. Third, principal component analysis is used for dimensionality reduction. Finally, metric learning computes the similarity between the query image and each candidate, and the ranked results are output, completing the fine-grained retrieval. On the UT-Zap50K dataset, fine-grained shoe attributes are defined in combination with the part regions of shoes. The toe area has a shape attribute with five attribute values. The heel area has two attributes, shape and height, with 13 attribute values. The upper area has a height attribute with four attribute values and a closure-mode attribute with nine attribute values. Global footwear attributes cover colors and styles, with 20 attribute values.
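The attribute scheme above can be recorded as a simple data structure. The attribute names below are illustrative paraphrases of the description, not identifiers from the paper; only the value counts come from the text.

```python
# Fine-grained shoe attributes on UT-Zap50K, organized by part region.
# Names are paraphrases; the value counts follow the description above.
SHOE_ATTRIBUTES = {
    "toe":    {"shape": 5},              # toe shape: 5 values
    "heel":   {"shape_and_height": 13},  # heel shape + height: 13 values
    "upper":  {"height": 4,              # upper height: 4 values
               "closure_mode": 9},       # closure mode: 9 values
    "global": {"color_and_style": 20},   # colors and styles: 20 values
}

total = sum(n for part in SHOE_ATTRIBUTES.values() for n in part.values())
print(total)  # 51 attribute values in total
```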
Result
The proposed method is compared with four methods with strong retrieval performance on the UT-Zap50K dataset, improving retrieval accuracy by nearly 6%. Compared with the semantic hierarchy of attribute convolutional neural network (SHOE-CNN) retrieval method on the same task, the proposed method achieves higher retrieval accuracy. To illustrate its effectiveness, the proposed semantic network is also compared with traditional GIST features, the linear support vector machine (LSVM) method, and a deep learning baseline, with performance evaluated by top-K retrieval accuracy. Results show that the deep learning approach is much better than traditional GIST features and the LSVM method, and that combining the semantic network with a metric learning algorithm yields better retrieval accuracy than both the metric network and SHOE-CNN.
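Top-K retrieval accuracy, the metric used in this evaluation, can be computed as in this minimal sketch; the ranked lists and ground-truth indices below are made-up placeholders.

```python
def top_k_accuracy(ranked_lists, ground_truth, k):
    """Fraction of queries whose ground-truth item appears in the top k results."""
    hits = sum(1 for ranks, gt in zip(ranked_lists, ground_truth) if gt in ranks[:k])
    return hits / len(ground_truth)

# Hypothetical ranked gallery indices for 4 queries, and the correct index per query.
ranked = [[3, 1, 7], [5, 2, 9], [8, 4, 0], [6, 6, 1]]
truth = [1, 9, 2, 6]
print(top_k_accuracy(ranked, truth, k=2))  # 0.5: queries 1 and 4 hit within top-2
```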
Conclusion
A fine-grained shoe image retrieval method is proposed to address the low accuracy of shoe image retrieval caused by the lack of fine visual description in traditional image features. The method accurately detects the different parts of a shoe image and defines detailed semantic attributes for it; the visual attribute features of the shoe image are then obtained by training the semantic network. This solves the unsatisfactory accuracy caused by representing images with only coarse-grained features. Experimental results show that the proposed method can retrieve the same images as the query on the UT-Zap50K dataset, reaching 80% and 86% accuracy while maintaining running efficiency. However, the method still has shortcomings. On the one hand, part detection accuracy is low for some images because of the many styles and the complexity of shoes. On the other hand, the prediction of some semantic attributes is inaccurate, and the defined fine-grained semantic attributes are incomplete. Follow-up work will focus on these issues to improve retrieval accuracy and extend the method to different application scenarios.
Bourdev L, Maji S and Malik J. 2011. Describing people: a poselet-based approach to attribute classification//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 1543-1550 [DOI: 10.1109/ICCV.2011.6126413]
Cao C Q, Wang B, Zhang W R, Zeng X D, Yan X, Feng Z J, Liu Y T and Wu Z Y. 2019. An improved faster R-CNN for small object detection. IEEE Access, 7: 106838-106846 [DOI: 10.1109/ACCESS.2019.2932731]
Chen B H and Deng W H. 2019. Hybrid-attention based decoupled metric learning for zero-shot image retrieval//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2745-2754 [DOI: 10.1109/CVPR.2019.00286]
Chen H Z, Gallagher A and Girod B. 2012. Describing clothing by semantic attributes//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 609-623 [DOI: 10.1007/978-3-642-33712-3_44]
Felzenszwalb P, McAllester D and Ramanan D. 2008. A discriminatively trained, multiscale, deformable part model//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA: IEEE: 1-8 [DOI: 10.1109/CVPR.2008.4587597]
Guillaumin M, Mensink T, Verbeek J and Schmid C. 2009. TagProp: discriminative metric learning in nearest neighbor models for image auto-annotation//Proceedings of the 12th IEEE International Conference on Computer Vision. Kyoto, Japan: IEEE: 309-316 [DOI: 10.1109/ICCV.2009.5459266]
He K M, Zhang X Y, Ren S Q and Sun J. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 346-361 [DOI: 10.1109/TPAMI.2015.2389824]
Hosang J, Benenson R and Schiele B. 2017. Learning non-maximum suppression//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6469-6477 [DOI: 10.1109/CVPR.2017.685]
Huang J S, Liu S, Xing J L, Mei T and Yan S C. 2014. Circle and search: attribute-aware shoe retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 11(1): 1-21 [DOI: 10.1145/2632165]
Kiapour M H, Han X F, Lazebnik S, Berg A C and Berg T L. 2015. Where to buy it: matching street clothing photos in online shops//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 3343-3351 [DOI: 10.1109/ICCV.2015.382]
Kovashka A, Parikh D and Grauman K. 2012. WhittleSearch: image search with relative attribute feedback//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 2973-2980 [DOI: 10.1109/CVPR.2012.6248026]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: NIPS: 1097-1105 [DOI: 10.1145/3065386]
Li H J, Wang X H, Tang J H and Zhao C X. 2013. Combining global and local matching of multiple features for precise item image retrieval. Multimedia Systems, 19(1): 37-49 [DOI: 10.1007/s00530-012-0265-1]
Li Z D, Zhong Y and Cao D P. 2018. Deep convolution feature vector for fast face image retrieval. Journal of Computer-Aided Design and Computer Graphics, 30(12): 2311-2317 (in Chinese) [DOI: 10.3724/SP.J.1089.2018.17119]
Ozeki M and Okatani T. 2014. Understanding convolutional neural networks in terms of category-level attributes//Proceedings of the 12th Asian Conference on Computer Vision. Singapore: Springer: 362-375 [DOI: 10.1007/978-3-319-16808-1_25]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Wei X S, Luo J H, Wu J X and Zhou Z H. 2017. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing, 26(6): 2868-2881 [DOI: 10.1109/TIP.2017.2688133]
Wu S M, Liu L, Fu X D, Liu L J and Huang Q S. 2019. Human detection and multi-task learning for minority clothing recognition. Journal of Image and Graphics, 24(4): 562-572 (in Chinese) [DOI: 10.11834/jig.180500]
Xie L X, Wang J D, Zhang B and Tian Q. 2015. Fine-grained image search. IEEE Transactions on Multimedia, 17(5): 636-647 [DOI: 10.1109/TMM.2015.2408566]
Yu A and Grauman K. 2014. Fine-grained visual comparisons with local learning//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 192-199 [DOI: 10.1109/CVPR.2014.32]
Yu Y L, Ji Z, Fu Y W, Guo J, Pang Y and Zhang Z. 2018. Stacked semantics-guided attention model for fine-grained zero-shot learning//Proceedings of the 32nd Conference on Neural Information Processing Systems. Montréal, Canada: MIT Press: 5995-6004
Zhan H J, Shi B X and Kot A C. 2017. Cross-domain shoe retrieval with a semantic hierarchy of attribute classification network. IEEE Transactions on Image Processing, 26(12): 5867-5881 [DOI: 10.1109/TIP.2017.2736346]
Zhang N, Paluri M, Ranzato M A, Darrell T and Bourdev L. 2014. PANDA: pose aligned networks for deep attribute modeling//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1637-1644 [DOI: 10.1109/CVPR.2014.212]
Zhang S Y, Song Z J, Cao X C, Zhang H and Zhou J. 2019. Task-aware attention model for clothing attribute prediction. IEEE Transactions on Circuits and Systems for Video Technology: 1051-1064 [DOI: 10.1109/TCSVT.2019.2902268]
Zhou W G, Lu Y J, Li H Q, Song Y B and Tian Q. 2010. Spatial coding for large scale partial-duplicate web image search//Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy: ACM: 511-520 [DOI: 10.1145/1873951.1874019]
Zhu P K, Wang H X and Saligrama V. 2019. Zero shot detection. IEEE Transactions on Circuits and Systems for Video Technology, 30(4): 998-1010 [DOI: 10.1109/TCSVT.2019.2899569]