Fashion style retrieval based on deep multimodal fusion
Vol. 26, Issue 4, Pages: 857-871 (2021)
Published: 16 April 2021
Accepted: 26 October 2020
DOI: 10.11834/jig.200193
Zhuo Su, Sibo Ke, Ruomei Wang and Fan Zhou. 2021. Fashion style retrieval based on deep multimodal fusion. Journal of Image and Graphics, 26(4): 857-871
Objective
Fashion retrieval is a research hotspot at the intersection of computer vision and natural language processing. It aims to help users easily and quickly retrieve the clothes that meet their query conditions from large clothing collections. To make retrieval more flexible and convenient, methods studied in recent years usually offer an image query mode for intuitive retrieval and a text query mode for supplementary retrieval, that is, content-based image retrieval and text-based image retrieval. However, most of them focus on precise visual matching, and few pay attention to similarity in clothing style. In addition, the extracted features are usually high-dimensional, which leads to low retrieval efficiency. To solve these problems, we propose a fashion style retrieval method based on deep multimodal fusion.
Method
To address the low efficiency of the image query mode, we first propose a hierarchical deep hash retrieval model. Its deep image feature extraction network performs transfer learning on the pre-trained residual network ResNet, which learns deep image features at a lower cost, and its classification layer is transformed into a hash coding layer that generates compact hash features. The hash features are used for coarse retrieval, and in the fine retrieval stage the preliminary results are re-ranked using the deep image features.
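This coarse-to-fine design can be pictured with a short PyTorch sketch. It is a minimal illustration under our own assumptions (a ResNet-50 backbone, 48-bit codes, a shortlist of 100), not the paper's exact configuration:

```python
# Minimal sketch: a pre-trained ResNet whose classification layer is
# replaced by a hash coding layer; codes drive coarse retrieval, and the
# 2048-d deep features re-rank the shortlist. Sizes are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class HashResNet(nn.Module):
    def __init__(self, n_bits=48, n_classes=380):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop original fc
        self.hash_layer = nn.Linear(2048, n_bits)       # hash coding layer
        self.classifier = nn.Linear(n_bits, n_classes)  # supervision for fine-tuning

    def forward(self, x):
        deep = self.features(x).flatten(1)           # deep feature, for fine retrieval
        code = torch.sigmoid(self.hash_layer(deep))  # relaxed binary code, for coarse retrieval
        return deep, code, self.classifier(code)

@torch.no_grad()
def retrieve(model, query, db_codes, db_feats, shortlist=100, k=5):
    """Coarse: Hamming distance on binarized codes; fine: Euclidean re-ranking."""
    q_deep, q_code, _ = model(query.unsqueeze(0))
    q_bits = (q_code > 0.5).float()
    coarse = (q_bits != db_codes).sum(dim=1).topk(shortlist, largest=False).indices
    fine = (db_feats[coarse] - q_deep).norm(dim=1).topk(k, largest=False).indices
    return coarse[fine]
```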
To address the low efficiency of the text query mode and improve the scalability of the search engine, we propose a text classification semantic retrieval model. A text classification network based on long short-term memory (LSTM) classifies the query text in advance to narrow the retrieval scope; a text embedding feature extraction model based on doc2vec then retrieves by embedding similarity within the pre-classified categories.
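A sketch of this classify-then-retrieve flow, assuming a gensim Doc2Vec model and treating the LSTM classifier and the category index as given (the names `descriptions`, `lstm_classifier`, and `category_index` are hypothetical):

```python
# Minimal sketch: classify the query text first to narrow the scope,
# then rank by doc2vec similarity inside the predicted category only.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# descriptions: list of clothing description strings (assumed input).
docs = [TaggedDocument(words=d.lower().split(), tags=[i])
        for i, d in enumerate(descriptions)]
d2v = Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)

def text_search(query, k=5):
    category = lstm_classifier(query)        # pre-classification step
    candidates = category_index[category]    # doc ids inside that category
    q = d2v.infer_vector(query.lower().split())
    vecs = np.stack([d2v.dv[i] for i in candidates])
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [candidates[j] for j in np.argsort(-sims)[:k]]
```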
At the same time, to capture the similarity of clothing style, we propose a similar style context retrieval model. It measures clothing style similarity by analogy with the part-of-speech and collocation similarity of words: it borrows the training scheme of the word2vec model on text, treating clothing items as words and outfits as sentences.
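Since this reuses word2vec training unchanged, the model can be built directly with gensim; the hyperparameters below are illustrative:

```python
# Minimal sketch: item ids act as "words" and outfits as "sentences",
# so items worn in similar outfit contexts embed close together.
from gensim.models import Word2Vec

# outfits: list of outfits, each a list of item-id strings,
# e.g. [["item_12", "item_87", "item_301"], ...]
style_model = Word2Vec(sentences=outfits, vector_size=64, window=5,
                       min_count=1, sg=1, epochs=20)

# Cosine neighbours in this space serve as style-similar items:
similar_items = style_model.wv.most_similar("item_12", topn=5)
```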
Finally, we use a probability-driven method to quantify fashion style similarity without manual style annotation and compare different multimodal fusion methods, adopting the one that maximizes this similarity as the final return of the search engine: clothing with a similar style context is retrieved from the text-modal results, and all modal results together with the style context results are re-ranked based on image features.
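Under the same assumptions, the fusion step can be sketched as set expansion plus re-ranking in image feature space; the expansion width of 3 neighbours per text hit is our own choice:

```python
# Minimal sketch of the fusion: expand the text-modal hits with their
# style-context neighbours, then re-rank everything by distance to the
# query in image feature space. `img_feats` maps item id -> feature.
import numpy as np

def fuse(image_hits, text_hits, style_model, img_feats, q_feat, k=10):
    pool = set(image_hits) | set(text_hits)
    for item in text_hits:                          # style-context expansion
        pool.update(n for n, _ in style_model.wv.most_similar(item, topn=3))
    pool = list(pool)
    vecs = np.stack([img_feats[i] for i in pool])
    order = np.argsort(np.linalg.norm(vecs - q_feat, axis=1))
    return [pool[i] for i in order[:k]]
```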
Result
We choose Polyvore as the dataset, use the test set items as queries, and retrieve results from the training set, evaluating the returned results on several indicators. For the image retrieval mode, compared with the original ResNet model, the top-5 average retrieval precision of the hierarchical deep hash retrieval framework improves by 11.6%, and retrieval is 2.57 s/query faster. The average retrieval precision of the coarse-to-fine two-feature strategy is comparable to that of direct retrieval on deep image features. For the text retrieval mode, compared with the traditional text embedding model, the top-5 precision of the text classification semantic retrieval framework increases by 29.96%, and retrieval is 16.53 s/query faster. Finally, for the multimodal fusion results, we retrieve clothing with a similar style context based on the text-modal results and re-rank the final results in the image feature space; the average style similarity of the final results is 24%.
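For reference, a top-k precision figure like the ones above can be computed as below; whether a retrieved item counts as relevant (here: sharing the query's category label) is our assumption, since the abstract does not state the exact relevance criterion:

```python
# Hypothetical relevance criterion: a retrieved item is a hit when it
# shares the query's category label.
def top_k_precision(retrieved_ids, query_label, labels, k=5):
    hits = sum(labels[i] == query_label for i in retrieved_ids[:k])
    return hits / k
```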
Conclusion
We propose a fashion style retrieval method based on deep multimodal fusion whose hierarchical deep hash retrieval model serves as the image retrieval mode. Compared with most other models and retrieval methods, fine-tuning a pre-trained network toward generating hash codes, combined with the coarse-to-fine retrieval strategy, improves both retrieval precision and speed. As the text retrieval mode, the text classification semantic retrieval model uses the text classification network to narrow the retrieval scope and then retrieves with the text features extracted by the text feature extraction model, combining the outputs of the different models; compared with other text semantic retrieval methods, this mode also improves retrieval speed and precision. At the same time, to capture the similarity of fashion style, the similar style context retrieval model finds results similar in style to the query clothing, making the results more diverse.
multimodal fashion search; hash feature; text embedding; style similarity; deep hashing
Ak K E, Lim J H, Tham J Y and Kassim A. 2018. Efficient multi-attribute similarity learning towards attribute-based fashion search//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, USA: IEEE: 1671-1679[DOI:10.1109/WACV.2018.00186]
Chen S, He L L and Zheng J H. 2019. Clothing image retrieval method based on deep learning. Computer Systems and Applications, 28(3): 229-234[DOI:10.15888/j.cnki.csa.006826]
Furnas G W, Deerwester S, Dumais S T, Landauer T K, Harshman R A, Streeter L A and Lochbaum K E. 2017. Information retrieval using a singular value decomposition model of latent semantic structure. ACM SIGIR Forum, 51(2): 90-105[DOI:10.1145/3130348.3130358]
Han X T, Wu Z X, Jiang Y G and Davis L S. 2017. Learning fashion compatibility with bidirectional LSTMs//Proceedings of the 25th ACM International Conference on Multimedia. Mountain View, USA: ACM: 1078-1086[DOI:10.1145/3123266.3123394]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778[DOI:10.1109/CVPR.2016.90]
Kim D, Seo D, Cho S and Kang P. 2019. Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec. Information Sciences, 477: 15-29[DOI:10.1016/j.ins.2018.10.006]
Le Q and Mikolov T. 2014. Distributed representations of sentences and documents//Proceedings of the 31st International Conference on Machine Learning. Beijing, China: PMLR: 1188-1196
Li J, Lyu S H, Chen F, Yang G G and Dou Y. 2017. Image retrieval by combining recurrent neural network and visual attention mechanism. Journal of Image and Graphics, 22(2): 241-248[DOI:10.11834/jig.20170212]
Lin K, Yang H F, Hsiao J H and Chen C S. 2015. Deep learning of binary hash codes for fast image retrieval//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Boston, USA: IEEE: 27-35[DOI:10.1109/CVPRW.2015.7301269]
Liu H M, Wang R P, Shan S G and Chen X L. 2016a. Deep supervised hashing for fast image retrieval//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2064-2072[DOI:10.1109/CVPR.2016.227]
Liu Z W, Luo P, Qiu S, Wang X G and Tang X O. 2016b. DeepFashion: powering robust clothes recognition and retrieval with rich annotations//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1096-1104[DOI:10.1109/CVPR.2016.124]
Lowe D G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91-110[DOI:10.1023/B:VISI.0000029664.99615.94]
Luan T T, Zhu J H, Xu S Y, Wang J X, Shi X and Li Y C. 2019. Hashing method for image retrieval based on product quantization with Huffman coding. Journal of Image and Graphics, 24(3): 389-399[DOI:10.11834/jig.180264]
Lun Z L, Kalogerakis E and Sheffer A. 2015. Elements of style: learning perceptual shape style similarity. ACM Transactions on Graphics, 34(4): #84[DOI:10.1145/2766929]
Mikolov T, Chen K, Corrado G and Dean J. 2013. Efficient estimation of word representations in vector space//Proceedings of the 1st International Conference on Learning Representations. Scottsdale, USA: ICLR: 1-12
Peng Y F, Song X N, Wu H and Zi L L. 2019. Remote sensing image retrieval combined with deep learning and relevance feedback. Journal of Image and Graphics, 24(3): 420-434[DOI:10.11834/jig.180384]
Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1532-1543[DOI:10.3115/v1/D14-1162]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252[DOI:10.1007/s11263-015-0816-y]
Salton G, Wong A and Yang C S. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11): 613-620[DOI:10.1145/361219.361220]
Tautkute I, Trzciński T, Skorupa A P, Brocki Ł and Marasek K. 2019. DeepStyle: multimodal search engine for fashion and interior design. IEEE Access, 7: 84613-84628[DOI:10.1109/ACCESS.2019.2923552]
van Kaick O, Xu K, Zhang H, Wang Y Z, Sun S Y, Shamir A and Cohen-Or D. 2013. Co-hierarchical analysis of shape structures. ACM Transactions on Graphics, 32(4): #69[DOI:10.1145/2461912.2461924]
Wang Y X, Yang H, Qian X M, Ma L, Lu J, Li B and Fan X. 2019. Position focused attention network for image-text matching//Proceedings of the 28th International Joint Conference on Artificial Intelligence. [s. l.]: IJCAI: 3792-3798[DOI:10.24963/ijcai.2019/526]
Wang Y X, Zhu L, Qian X M and Han J W. 2018. Joint hypergraph learning for tag-based image retrieval. IEEE Transactions on Image Processing, 27(9): 4437-4451[DOI:10.1109/TIP.2018.2837219]
Yuan W F, Guo J M, Su Z, Luo X N and Zhou F. 2019. Clothing retrieval by deep multi-label parsing and hashing. Journal of Image and Graphics, 24(2): 159-169[DOI:10.11834/jig.180361]
Yumer M E and Kara L B. 2014. Co-constrained handles for deformation in shape collections. ACM Transactions on Graphics, 33(6): #187[DOI:10.1145/2661229.2661234]
Zhang Y H, Pan P, Zheng Y, Zhao K, Zhang Y Y, Ren X F and Jin R. 2018. Visual search at alibaba//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London, UK: ACM: 993-1001[DOI:10.1145/3219819.3219820]