Fashion style retrieval based on deep multimodal fusion
Vol. 26, Issue 4, Pages: 857-871 (2021)
Published: 16 April 2021
Accepted: 26 October 2020
DOI: 10.11834/jig.200193
Zhuo Su, Sibo Ke, Ruomei Wang and Fan Zhou. 2021. Fashion style retrieval based on deep multimodal fusion. Journal of Image and Graphics, 26(4): 857-871
Objective
Fashion retrieval is a research hotspot at the intersection of computer vision and natural language processing. It aims to help users easily and quickly retrieve the clothes that meet their query conditions from large clothing collections. To make retrieval more flexible and convenient, methods studied in recent years usually offer an image query mode for intuitive retrieval and a text query mode for supplementary retrieval, that is, content-based image retrieval and text-based image retrieval. However, most of them focus on precise visual matching, and few pay attention to similarity in clothing style. In addition, the extracted features are usually high-dimensional, which leads to low retrieval efficiency. To solve these problems, we propose a fashion style retrieval method based on deep multimodal fusion.
Method
To address the low efficiency of the image query mode, we first propose a hierarchical deep hash retrieval model. Its deep image feature extraction network performs transfer learning on the pre-trained residual network ResNet, which learns deep image features at a lower cost, and its classification layer is transformed into a hash coding layer that generates compact hash features. The hash features are used for coarse retrieval, and in the fine retrieval stage the preliminary results are re-ranked using the deep image features.
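This coarse-to-fine design can be pictured with a short PyTorch sketch. It is a minimal illustration under our own assumptions (a ResNet-50 backbone, 48-bit codes, a shortlist of 100), not the paper's exact configuration:

```python
# Minimal sketch: a pre-trained ResNet whose classification layer is
# replaced by a hash coding layer; codes drive coarse retrieval, and the
# 2048-d deep features re-rank the shortlist. Sizes are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class HashResNet(nn.Module):
    def __init__(self, n_bits=48, n_classes=380):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop original fc
        self.hash_layer = nn.Linear(2048, n_bits)       # hash coding layer
        self.classifier = nn.Linear(n_bits, n_classes)  # supervision for fine-tuning

    def forward(self, x):
        deep = self.features(x).flatten(1)           # deep feature, for fine retrieval
        code = torch.sigmoid(self.hash_layer(deep))  # relaxed binary code, for coarse retrieval
        return deep, code, self.classifier(code)

@torch.no_grad()
def retrieve(model, query, db_codes, db_feats, shortlist=100, k=5):
    """Coarse: Hamming distance on binarized codes; fine: Euclidean re-ranking."""
    q_deep, q_code, _ = model(query.unsqueeze(0))
    q_bits = (q_code > 0.5).float()
    coarse = (q_bits != db_codes).sum(dim=1).topk(shortlist, largest=False).indices
    fine = (db_feats[coarse] - q_deep).norm(dim=1).topk(k, largest=False).indices
    return coarse[fine]
```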
To address the low efficiency of the text query mode and improve the scalability of the search engine, we propose a text classification semantic retrieval model. A text classification network based on long short-term memory (LSTM) classifies the query text in advance to narrow the retrieval scope; a text embedding feature extraction model based on doc2vec then retrieves by embedding similarity within the pre-classified categories.
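A sketch of this classify-then-retrieve flow, assuming a gensim Doc2Vec model and treating the LSTM classifier and the category index as given (the names `descriptions`, `lstm_classifier`, and `category_index` are hypothetical):

```python
# Minimal sketch: classify the query text first to narrow the scope,
# then rank by doc2vec similarity inside the predicted category only.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# descriptions: list of clothing description strings (assumed input).
docs = [TaggedDocument(words=d.lower().split(), tags=[i])
        for i, d in enumerate(descriptions)]
d2v = Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)

def text_search(query, k=5):
    category = lstm_classifier(query)        # pre-classification step
    candidates = category_index[category]    # doc ids inside that category
    q = d2v.infer_vector(query.lower().split())
    vecs = np.stack([d2v.dv[i] for i in candidates])
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [candidates[j] for j in np.argsort(-sims)[:k]]
```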
At the same time, to capture the similarity of clothing style, we propose a similar style context retrieval model. It measures clothing style similarity by analogy with the part-of-speech and collocation similarity of words: it borrows the training scheme of the word2vec model on text, treating clothing items as words and outfits as sentences.
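Since this reuses word2vec training unchanged, the model can be built directly with gensim; the hyperparameters below are illustrative:

```python
# Minimal sketch: item ids act as "words" and outfits as "sentences",
# so items worn in similar outfit contexts embed close together.
from gensim.models import Word2Vec

# outfits: list of outfits, each a list of item-id strings,
# e.g. [["item_12", "item_87", "item_301"], ...]
style_model = Word2Vec(sentences=outfits, vector_size=64, window=5,
                       min_count=1, sg=1, epochs=20)

# Cosine neighbours in this space serve as style-similar items:
similar_items = style_model.wv.most_similar("item_12", topn=5)
```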
Finally, we use a probability-driven method to quantify fashion style similarity without manual style annotation and compare different multimodal fusion methods, adopting the one that maximizes this similarity as the final return of the search engine: clothing with a similar style context is retrieved from the text-modal results, and all modal results together with the style context results are re-ranked based on image features.
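Under the same assumptions, the fusion step can be sketched as set expansion plus re-ranking in image feature space; the expansion width of 3 neighbours per text hit is our own choice:

```python
# Minimal sketch of the fusion: expand the text-modal hits with their
# style-context neighbours, then re-rank everything by distance to the
# query in image feature space. `img_feats` maps item id -> feature.
import numpy as np

def fuse(image_hits, text_hits, style_model, img_feats, q_feat, k=10):
    pool = set(image_hits) | set(text_hits)
    for item in text_hits:                          # style-context expansion
        pool.update(n for n, _ in style_model.wv.most_similar(item, topn=3))
    pool = list(pool)
    vecs = np.stack([img_feats[i] for i in pool])
    order = np.argsort(np.linalg.norm(vecs - q_feat, axis=1))
    return [pool[i] for i in order[:k]]
```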
Result
We choose Polyvore as the dataset, use the test set items as queries, and retrieve results from the training set, evaluating the returned results on several indicators. For the image retrieval mode, compared with the original ResNet model, the top-5 average retrieval precision of the hierarchical deep hash retrieval framework improves by 11.6%, and retrieval is 2.57 s/query faster. The average retrieval precision of the coarse-to-fine two-feature strategy is comparable to that of direct retrieval on deep image features. For the text retrieval mode, compared with the traditional text embedding model, the top-5 precision of the text classification semantic retrieval framework increases by 29.96%, and retrieval is 16.53 s/query faster. Finally, for the multimodal fusion results, we retrieve clothing with a similar style context based on the text-modal results and re-rank the final results in the image feature space; the average style similarity of the final results is 24%.
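For reference, a top-k precision figure like the ones above can be computed as below; whether a retrieved item counts as relevant (here: sharing the query's category label) is our assumption, since the abstract does not state the exact relevance criterion:

```python
# Hypothetical relevance criterion: a retrieved item is a hit when it
# shares the query's category label.
def top_k_precision(retrieved_ids, query_label, labels, k=5):
    hits = sum(labels[i] == query_label for i in retrieved_ids[:k])
    return hits / k
```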
Conclusion
We propose a fashion style retrieval method based on deep multimodal fusion whose hierarchical deep hash retrieval model serves as the image retrieval mode. Compared with most other models and retrieval methods, fine-tuning a pre-trained network toward generating hash codes, combined with the coarse-to-fine retrieval strategy, improves both retrieval precision and speed. As the text retrieval mode, the text classification semantic retrieval model uses the text classification network to narrow the retrieval scope and then retrieves with the text features extracted by the text feature extraction model, combining the outputs of the different models; compared with other text semantic retrieval methods, this mode also improves retrieval speed and precision. At the same time, to capture the similarity of fashion style, the similar style context retrieval model finds results similar in style to the query clothing, making the results more diverse.
multimodal fashion search; hash feature; text embedding; style similarity; deep hashing
Ak K E, Lim J H, Tham J Y and Kassim A. 2018. Efficient multi-attribute similarity learning towards attribute-based fashion search//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, USA: IEEE: 1671-1679[DOI:10.1109/WACV.2018.00186]
Chen S, He L L and Zheng J H. 2019. Clothing image retrieval method based on deep learning. Computer Systems and Applications, 28(3): 229-234[DOI:10.15888/j.cnki.csa.006826]
Furnas G W, Deerwester S, Dumais S T, Landauer T K, Harshman R A, Streeter L A and Lochbaum K E. 2017. Information retrieval using a singular value decomposition model of latent semantic structure. ACM SIGIR Forum, 51(2): 90-105[DOI:10.1145/3130348.3130358]
Han X T, Wu Z X, Jiang Y G and Davis L S. 2017. Learning fashion compatibility with bidirectional LSTMs//Proceedings of the 25th ACM International Conference on Multimedia. Mountain View, USA: ACM: 1078-1086[DOI:10.1145/3123266.3123394]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778[DOI:10.1109/CVPR.2016.90]
Kim D, Seo D, Cho S and Kang P. 2019. Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec. Information Sciences, 477: 15-29[DOI:10.1016/j.ins.2018.10.006]
Le Q and Mikolov T. 2014. Distributed representations of sentences and documents//Proceedings of the 31st International Conference on Machine Learning. Beijing, China: PMLR: 1188-1196
Li J, Lyu S H, Chen F, Yang G G and Dou Y. 2017. Image retrieval by combining recurrent neural network and visual attention mechanism. Journal of Image and Graphics, 22(2): 241-248[DOI:10.11834/jig.20170212]
Lin K, Yang H F, Hsiao J H and Chen C S. 2015. Deep learning of binary hash codes for fast image retrieval//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Boston, USA: IEEE: 27-35[DOI:10.1109/CVPRW.2015.7301269]
Liu H M, Wang R P, Shan S G and Chen X L. 2016a. Deep supervised hashing for fast image retrieval//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2064-2072[DOI:10.1109/CVPR.2016.227]
Liu Z W, Luo P, Qiu S, Wang X G and Tang X O. 2016b. DeepFashion: powering robust clothes recognition and retrieval with rich annotations//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1096-1104[DOI:10.1109/CVPR.2016.124]
Lowe D G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91-110[DOI:10.1023/B:VISI.0000029664.99615.94]
Luan T T, Zhu J H, Xu S Y, Wang J X, Shi X and Li Y C. 2019. Hashing method for image retrieval based on product quantization with Huffman coding. Journal of Image and Graphics, 24(3): 389-399[DOI:10.11834/jig.180264]
Lun Z L, Kalogerakis E and Sheffer A. 2015. Elements of style: learning perceptual shape style similarity. ACM Transactions on Graphics, 34(4): #84[DOI:10.1145/2766929]
Mikolov T, Chen K, Corrado G and Dean J. 2013. Efficient estimation of word representations in vector space//Proceedings of the 1st International Conference on Learning Representations. Scottsdale, USA: ICLR: 1-12
Peng Y F, Song X N, Wu H and Zi L L. 2019. Remote sensing image retrieval combined with deep learning and relevance feedback. Journal of Image and Graphics, 24(3): 420-434[DOI:10.11834/jig.180384]
Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1532-1543[DOI:10.3115/v1/D14-1162]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252[DOI:10.1007/s11263-015-0816-y]
Salton G, Wong A and Yang C S. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11): 613-620[DOI:10.1145/361219.361220]
Tautkute I, Trzciński T, Skorupa A P, Brocki Ł and Marasek K. 2019. DeepStyle: multimodal search engine for fashion and interior design. IEEE Access, 7: 84613-84628[DOI:10.1109/ACCESS.2019.2923552]
van Kaick O, Xu K, Zhang H, Wang Y Z, Sun S Y, Shamir A and Cohen-Or D. 2013. Co-hierarchical analysis of shape structures. ACM Transactions on Graphics, 32(4): #69[DOI:10.1145/2461912.2461924]
Wang Y X, Yang H, Qian X M, Ma L, Lu J, Li B and Fan X. 2019. Position focused attention network for image-text matching//Proceedings of the 28th International Joint Conference on Artificial Intelligence. [s. l.]: IJCAI: 3792-3798[DOI:10.24963/ijcai.2019/526]
Wang Y X, Zhu L, Qian X M and Han J W. 2018. Joint hypergraph learning for tag-based image retrieval. IEEE Transactions on Image Processing, 27(9): 4437-4451[DOI:10.1109/TIP.2018.2837219]
Yuan W F, Guo J M, Su Z, Luo X N and Zhou F. 2019. Clothing retrieval by deep multi-label parsing and hashing. Journal of Image and Graphics, 24(2): 159-169[DOI:10.11834/jig.180361]
Yumer M E and Kara L B. 2014. Co-constrained handles for deformation in shape collections. ACM Transactions on Graphics, 33(6): #187[DOI:10.1145/2661229.2661234]
Zhang Y H, Pan P, Zheng Y, Zhao K, Zhang Y Y, Ren X F and Jin R. 2018. Visual search at alibaba//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London, UK: ACM: 993-1001[DOI:10.1145/3219819.3219820]