Visual-semantic dual-disentangling for generalized zero-shot learning
2023, Vol. 28, No. 9, pages 2913-2926
Print publication date: 2023-09-16
DOI: 10.11834/jig.220486
Han Ayou, Yang Guan, Liu Xiaoming, Liu Yang. 2023. Visual-semantic dual-disentangling for generalized zero-shot learning. Journal of Image and Graphics, 28(09):2913-2926
Objective
Traditional zero-shot learning (ZSL) aims to predict and classify data from unseen classes based on data from the seen classes together with related auxiliary information, whereas in generalized zero-shot learning (GZSL) the classes to be classified may belong to either the seen or the unseen classes, which better matches real application scenarios. In generation-based GZSL, the original and generated features do not necessarily encode the semantically relevant information referred to by the shared attributes, which biases the model toward the seen classes; in addition, the useful feature-related information contained in the semantics is ignored during classification. To disentangle the relevant visual features and semantic information, a visual-semantic dual-disentangling framework is proposed.
Method
First, a conditional variational auto-encoder is used to generate visual features for the unseen classes, and a feature disentanglement module then decomposes the features into semantic-consistent and semantic-irrelevant components. Next, a semantic disentanglement module is designed to decompose the semantic information into feature-relevant and feature-irrelevant semantics. A total correlation penalty is used to guarantee independence between the two disentangled components; the feature disentanglement module measures the semantic consistency of its decomposition through a relation network, and the semantic disentanglement module guarantees the feature relevance of its decomposition through cross-modal cross-reconstruction. Finally, the semantic-consistent features and the feature-relevant semantics separated by the two disentanglement modules are used jointly to learn a GZSL classifier.
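The pipeline is described here only in prose. As a non-authoritative illustration, the sketch below shows one way each disentanglement module could be realized in PyTorch: an encoder-decoder that splits its input into two latent components and reconstructs the input to limit information loss during disentanglement. All module names, layer widths, latent sizes, and the even latent split are assumptions made for concreteness (as are the 2048-d ResNet101 visual features and 312-d CUB attribute vectors), not the authors' released implementation.

```python
# Minimal sketch of a disentanglement module as described above.
# Names, layer sizes, and the equal split of the latent code are
# illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn

class DisentangleModule(nn.Module):
    """Encoder-decoder that splits an input into two latent components
    and reconstructs the input from their concatenation."""
    def __init__(self, in_dim, latent_dim, hidden=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        z_rel, z_irr = h.chunk(2, dim=-1)  # relevant / irrelevant halves
        x_rec = self.decoder(torch.cat([z_rel, z_irr], dim=-1))
        return z_rel, z_irr, x_rec

# Visual features (e.g., 2048-d ResNet101) and class semantics
# (e.g., 312-d CUB attributes) each get their own module.
feat_disentangler = DisentangleModule(in_dim=2048, latent_dim=512)
sem_disentangler = DisentangleModule(in_dim=312, latent_dim=128)

x = torch.randn(64, 2048)                 # real or CVAE-generated features
a = torch.randn(64, 312)                  # class attribute vectors
z_sc, z_si, x_rec = feat_disentangler(x)  # semantic-consistent / -irrelevant
s_fr, s_fi, a_rec = sem_disentangler(a)   # feature-relevant / -irrelevant
recon_loss = (nn.functional.mse_loss(x_rec, x)
              + nn.functional.mse_loss(a_rec, a))
```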
Result
Experiments on four public GZSL datasets, AwA2 (Animals with Attributes 2), CUB (Caltech-UCSD Birds-200-2011), SUN (SUN Attribute), and FLO (Oxford Flowers), achieved better results than the baseline; the harmonic mean improved by 1.6%, 3.2%, 6.2%, and 1.5% on AwA2, CUB, SUN, and FLO, respectively.
Conclusion
In GZSL classification, the proposed visual-semantic dual-disentangling method is experimentally shown to achieve better performance than the baseline method and to outperform most existing related methods.
Objective
Traditional deep learning models perform effectively in many application scenarios, but they rely on large numbers of training samples, which are often difficult to collect in practice. Moreover, such models can identify only the classes already present in the training phase (seen classes); handling classes never observed during training (unseen classes) remains a challenge. Zero-shot learning (ZSL) provides a solution to this challenge: it aims to classify unseen classes for which no training samples are available during the training phase. However, the real world is more complex, and in practice both seen and unseen classes occur at test time. Generalized zero-shot learning (GZSL) was therefore proposed as a more realistic and universal setting, in which the test set is sampled from both seen and unseen classes. Existing GZSL methods can be divided into two categories: embedding-based and generation-based. The former learns a projection or embedding function that associates the visual features of the seen classes with the corresponding semantics, while the latter learns a generative model to synthesize visual features for the unseen classes. In previous studies, the visual features extracted by pretrained deep models (e.g., ResNet101) are not extracted specifically for the GZSL task, so not all of their dimensions are semantically related to the predefined attributes. This biases the model toward the seen classes. In addition, most methods ignore the useful feature-related information contained in the semantics during classification, which considerably affects the final result. In this paper, we propose a new GZSL method, the visual-semantic dual-disentangling framework for generalized zero-shot learning (VSD-GZSL), to disentangle the relevant visual features and semantic information.
Method
A conditional variational auto-encoder (VAE) is combined with a disentanglement network and trained end to end. The proposed disentanglement network has an encoder-decoder structure. The visual features and semantics of the seen classes are first used to train the conditional VAE and the disentanglement network. Once the network has converged, the trained generative network synthesizes visual features for the unseen classes. The real features of the seen classes and the generated features of the unseen classes are fed into a visual-feature disentanglement network, which separates them into semantic-consistent and semantic-irrelevant components. The semantics are likewise fed into a semantic disentanglement network, which separates them into feature-relevant and feature-irrelevant semantic information. The components produced by the two disentanglement networks are passed to decoders and reconstructed back into their corresponding spaces, with a reconstruction loss preventing information loss during the disentanglement stage. A total correlation penalty module measures the independence between the latent variables produced by each disentanglement network. A relation network maximizes the compatibility score between the components disentangled by the visual disentanglement network and the corresponding semantics, thereby learning the semantic consistency of the visual features. The feature-relevant semantic information produced by the semantic disentanglement network is fed into the visual disentanglement decoder for cross-modal reconstruction, which measures the feature relevance of the semantics. Finally, the semantic-consistent features and the feature-relevant semantics produced by the two disentanglement networks are used jointly to learn a generalized zero-shot classifier.
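To make the three auxiliary objectives above concrete, the following sketch spells out one plausible form for each: a relation network that scores feature-semantics compatibility, an MSE cross-modal reconstruction through the visual decoder, and a FactorVAE-style discriminator estimate of the total correlation. These estimators are common choices in the disentanglement literature and are assumptions made here for illustration, not necessarily the paper's exact formulations; all tensors are as produced by the disentanglement modules sketched earlier.

```python
# Illustrative sketch of the auxiliary objectives; estimator choices
# (sigmoid relation network, MSE cross-modal term, FactorVAE-style
# total-correlation discriminator) are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationNet(nn.Module):
    """Compatibility score in (0, 1) between a visual component and semantics."""
    def __init__(self, feat_dim, sem_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + sem_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, z, a):
        return torch.sigmoid(self.net(torch.cat([z, a], dim=-1)))

def tc_discriminator_loss(disc, z1, z2):
    """Train `disc` to separate joint samples (z1, z2) from pairs whose z2 is
    shuffled across the batch (samples from the product of marginals); its
    logit then estimates the total-correlation density ratio."""
    joint = torch.cat([z1, z2], dim=-1)
    shuffled = torch.cat([z1, z2[torch.randperm(z2.size(0))]], dim=-1)
    logits_j, logits_s = disc(joint), disc(shuffled)
    return (F.binary_cross_entropy_with_logits(logits_j, torch.ones_like(logits_j))
            + F.binary_cross_entropy_with_logits(logits_s, torch.zeros_like(logits_s)))

def disentangle_losses(relation_net, tc_disc, visual_decoder,
                       z_sc, z_si, s_fr, a, x):
    # Relation network: matched (feature, semantics) pairs should score 1.
    rel = F.binary_cross_entropy(relation_net(z_sc, a),
                                 torch.ones(z_sc.size(0), 1))
    # Cross-modal reconstruction: the feature-relevant semantics must
    # reconstruct the visual features through the visual decoder
    # (dimension compatibility between s_fr and the decoder is assumed).
    cross = F.mse_loss(visual_decoder(s_fr), x)
    # Total-correlation penalty: mean logit of the (separately trained)
    # discriminator on joint samples penalizes dependence of z_sc and z_si.
    tc = tc_disc(torch.cat([z_sc, z_si], dim=-1)).mean()
    return rel + cross + tc
```

In this reading, the discriminator is updated with `tc_discriminator_loss` while the disentanglement networks are updated with `disentangle_losses`, alternating in the usual adversarial fashion.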
Result
The proposed method was validated in experiments on four public GZSL datasets (AwA2, CUB, SUN, and FLO). It achieved better results than the baseline: on AwA2, the unseen-class accuracy improved by 3.8%, the seen-class accuracy by 0.2%, and the harmonic mean by 1.6%; on CUB, the unseen-class accuracy improved by 3.8%, the seen-class accuracy by 2.4%, and the harmonic mean by 3.2%; on SUN, the unseen-class accuracy improved by 10.1%, the seen-class accuracy by 4.1%, and the harmonic mean by 6.2%; on FLO, the seen-class accuracy improved by 9.1% and the harmonic mean by 1.5%. The proposed method was also compared with 10 recently proposed GZSL methods. Compared with f-CLSWGAN, VSD-GZSL improved the harmonic mean by 10%, 8.4%, 8.1%, and 5.7% on the four datasets. Compared with cycle-consistent adversarial networks for zero-shot learning (CANZSL), it improved the harmonic mean by 12.2%, 5.6%, 7.5%, and 4.8%. Compared with LisGAN, which leverages the invariant side of generative zero-shot learning, it improved the harmonic mean by 8.1%, 6.5%, 7.3%, and 3%. Compared with the cross- and distribution-aligned VAE (CADA-VAE), it improved the harmonic mean by 6.5%, 5.7%, 6.9%, and 10%. Compared with f-VAEGAN-D2, it improved the harmonic mean by 6.9%, 4.5%, 6.2%, and 6.7%. Compared with Cycle-CLSWGAN, it improved the harmonic mean by 5.1%, 8.1%, and 6.2% on CUB, SUN, and FLO, respectively. Compared with FREE (feature refinement), it improved the harmonic mean by 3.3%, 0.4%, and 5.8% on AwA2, CUB, and SUN, respectively. These results demonstrate the effectiveness of the proposed method.
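For reference, the harmonic mean H reported throughout is the standard GZSL metric: it combines the seen-class accuracy and the unseen-class accuracy and stays low unless both are high, so it penalizes models biased toward the seen classes:

$$ H = \frac{2 \times acc_s \times acc_u}{acc_s + acc_u} $$

For example, a model with $acc_s = 80\%$ and $acc_u = 20\%$ obtains only $H = 32\%$, whereas a balanced model with both accuracies at 50% obtains $H = 50\%$.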
Conclusion
The proposed VSD-GZSL method demonstrates its superiority over traditional models. It disentangles the semantically consistent components of the visual features and the feature-relevant information in the semantics, and a final classifier is then learned from these two mutually consistent disentangled representations. Compared with several related methods, VSD-GZSL achieves a remarkable performance improvement on multiple datasets.
zero-shot learning (ZSL); generalized zero-shot learning (GZSL); disentanglement representation; variational auto-encoders (VAE); cross-modal reconstruction; total correlation (TC)
Arjovsky M, Chintala S and Bottou L. 2017. Wasserstein generative adversarial networks//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: JMLR.org: 214-223
Chao W L, Changpinyo S, Gong B Q and Sha F. 2016. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 52-68 [DOI: 10.1007/978-3-319-46475-6_4]
Chen S M, Wang W J, Xia B H, Peng Q M, You X E, Zheng F and Shao L. 2021a. FREE: feature refinement for generalized zero-shot learning//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 122-131 [DOI: 10.1109/ICCV48922.2021.00019]
Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I and Abbeel P. 2016. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 2180-2188
Chen Z, Li J J, Luo Y D, Huang Z and Yang Y. 2020. CANZSL: cycle-consistent adversarial networks for zero-shot learning from natural language//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass, USA: IEEE: 863-872 [DOI: 10.1109/WACV45572.2020.9093610]
Chen Z, Luo Y D, Qiu R H, Wang S, Huang Z, Li J J and Zhang Z. 2021b. Semantics disentangling for generalized zero-shot learning//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 8692-8700 [DOI: 10.1109/ICCV48922.2021.00859]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [DOI: 10.1109/CVPR.2009.5206848]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2022-05-09]. https://arxiv.org/pdf/2010.11929.pdf
Felix R, Kumar B G V, Reid I and Carneiro G. 2018. Multi-modal cycle-consistent generalized zero-shot learning//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 21-37 [DOI: 10.1007/978-3-030-01231-1_2]
Feng Y G, Huang X W, Yang P B, Yu J and Sang J T. 2022. Non-generative generalized zero-shot learning via task-correlated disentanglement and controllable samples synthesis//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 9346-9355 [DOI: 10.1109/CVPR52688.2022.00913]
Frome A, Corrado G S, Shlens J, Bengio S, Dean J, Ranzato M and Mikolov T. 2013. DeViSE: a deep visual-semantic embedding model//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc.: 2121-2129
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozairy S, Courville A and Bengio Y. 2014. Generative adversarial nets [EB/OL]. [2022-05-09]. https://arxiv.org/pdf/1406.2661.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Higgins I, Matthey L, Pal A, Burgess C P, Glorot X, Botvinick M M, Mohamed S and Lerchner A. 2017. beta-VAE: learning basic visual concepts with a constrained variational framework//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview.net
Ji Z, Wang H R, Yu Y L and Pang Y W. 2019. A decadal survey of zero-shot image classification. SCIENTIA SINICA Informationis, 49(10): 1299-1320 [DOI: 10.1360/N112018-00312]
Jiang H J, Wang R P, Shan S G and Chen X L. 2019. Transferable contrastive network for generalized zero-shot learning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9764-9773 [DOI: 10.1109/ICCV.2019.00986]
Keshari R, Singh R and Vatsa M. 2020. Generalized zero-shot learning via over-complete distribution//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13297-13305 [DOI: 10.1109/CVPR42600.2020.01331]
Kim H and Mnih A. 2018. Disentangling by factorising//Proceedings of the 35th International Conference on Machine Learning. Stockholmsmässan, Stockholm, Sweden: PMLR: 2649-2658
Kingma D P and Ba J. 2015. Adam: a method for stochastic optimization//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: [s.n.]
Kingma D P and Welling M. 2014. Auto-encoding variational Bayes//Proceedings of the 2nd International Conference on Learning Representations. Banff, Canada: [s.n.]
Lampert C H, Nickisch H and Harmeling S. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3): 453-465 [DOI: 10.1109/TPAMI.2013.140]
Larochelle H, Erhan D and Bengio Y. 2008. Zero-data learning of new tasks//Proceedings of the 23rd National Conference on Artificial Intelligence. Chicago, USA: AAAI Press: 646-651
Li J J, Jing M M, Lu K, Ding Z M, Zhu L and Huang Z. 2019. Leveraging the invariant side of generative zero-shot learning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7394-7403 [DOI: 10.1109/CVPR.2019.00758]
Li X Y, Xu Z, Wei K and Deng C. 2021. Generalized zero-shot learning via disentangled representation//Proceedings of the 35th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 1966-1974 [DOI: 10.1609/aaai.v35i3.16292]
Liu Y, Guo J S, Cai D and He X F. 2019. Attribute attention for semantic disambiguation in zero-shot learning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6697-6706 [DOI: 10.1109/ICCV.2019.00680]
Lyu L L, Huang Y, Gao J Y, Yang X S and Xu C S. 2021. Multimodal-based zero-shot human action recognition. Journal of Image and Graphics, 26(7): 1658-1667 [DOI: 10.11834/jig.200503]
Ma Y B, Xu X, Shen F M and Shen H T. 2020. Similarity preserving feature generating networks for zero-shot learning. Neurocomputing, 406: 333-342 [DOI: 10.1016/j.neucom.2019.08.111]
Mishra A, Krishna Reddy S, Mittal A and Murthy H A. 2018. A generative model for zero shot learning using conditional variational autoencoders//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, USA: IEEE: 2269-22698 [DOI: 10.1109/CVPRW.2018.00294]
Narayan S, Gupta A, Khan F S, Snoek C G M and Shao L. 2020. Latent embedding feedback and discriminative features for zero-shot classification//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 479-495 [DOI: 10.1007/978-3-030-58542-6_29]
Nilsback M E and Zisserman A. 2008. Automated flower classification over a large number of classes//Proceedings of the 6th Indian Conference on Computer Vision, Graphics and Image Processing. Bhubaneswar, India: IEEE: 722-729 [DOI: 10.1109/ICVGIP.2008.47]
Patterson G and Hays J. 2012. SUN attribute database: discovering, annotating, and recognizing scene attributes//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 2751-2758 [DOI: 10.1109/CVPR.2012.6247998]
Schönfeld E, Ebrahimi S, Sinha S, Darrell T and Akata Z. 2019. Generalized zero- and few-shot learning via aligned variational autoencoders//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8239-8247 [DOI: 10.1109/CVPR.2019.00844]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2022-05-09]. https://arxiv.org/pdf/1409.1556.pdf
Sohn K, Yan X C and Lee H. 2015. Learning structured output representation using deep conditional generative models//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 3483-3491
Sung F, Yang Y X, Zhang L, Xiang T, Torr P H S and Hospedales T M. 2018. Learning to compare: relation network for few-shot learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1199-1208 [DOI: 10.1109/CVPR.2018.00131]
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Tong B, Wang C, Klinkigt M, Kobayashi Y and Nonaka Y. 2019. Hierarchical disentanglement of discriminative latent features for zero-shot learning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 11459-11468 [DOI: 10.1109/CVPR.2019.01173]
Verma V K, Arora G, Mishra A and Rai P. 2018. Generalized zero-shot learning via synthesized examples//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4281-4289 [DOI: 10.1109/CVPR.2018.00450]
Wah C, Branson S, Welinder P, Perona P and Belongie S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. CNS-TR-2011-001. California Institute of Technology
Wang W L, Xu H T, Wang G Y, Wang W Q and Carin L. 2021a. Zero-shot recognition via optimal transport//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 3471-3480 [DOI: 10.1109/WACV48630.2021.00351]
Wang Y Q, Yao Q M, Kwok J T and Ni L M. 2021b. Generalizing from a few examples: a survey on few-shot learning. ACM Computing Surveys, 53(3): #63 [DOI: 10.1145/3386252]
Xian Y Q, Akata Z, Sharma G, Nguyen Q, Hein M and Schiele B. 2016. Latent embeddings for zero-shot classification//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 69-77 [DOI: 10.1109/CVPR.2016.15]
Xian Y Q, Lorenz T, Schiele B and Akata Z. 2018. Feature generating networks for zero-shot learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5542-5551 [DOI: 10.1109/CVPR.2018.00581]
Xian Y Q, Sharma S, Schiele B and Akata Z. 2019. F-VAEGAN-D2: a feature generating framework for any-shot learning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 10267-10276 [DOI: 10.1109/CVPR.2019.01052]
Ye Z H, Lyu F, Li L Y, Fu Q M, Ren J C and Hu F Y. 2019. SR-GAN: semantic rectifying generative adversarial network for zero-shot learning//Proceedings of 2019 IEEE International Conference on Multimedia and Expo. Shanghai, China: IEEE: 85-90 [DOI: 10.1109/ICME.2019.00023]
Zhao P, Wang C Y, Zhang S Y and Liu Z Y. 2021. A zero-shot image classification method based on subspace learning with the fusion of reconstruction. Chinese Journal of Computers, 44(2): 409-421 [DOI: 10.11897/SP.J.1016.2021.00409]
Zhu Y Z, Elhoseiny M, Liu B C, Peng X and Elgammal A. 2018. A generative adversarial approach for zero-shot learning from noisy texts//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1004-1013 [DOI: 10.1109/CVPR.2018.00111]