
Xu Yuanyuan, Kan Meina, Shan Shiguang, Chen Xilin (Institute of Computing Technology, Chinese Academy of Sciences)

Abstract
Objective Domain adaptation aims to exploit labeled source-domain information to improve task performance on an unlabeled target domain. Recently, the contrastive language-image pre-training model CLIP has shown strong generalization ability, and some methods have introduced it into domain adaptation to improve generalization on the target domain. However, existing CLIP-based domain adaptation methods typically adjust only the features of the textual modality while keeping the visual-modality features unchanged, which limits the performance gain on the target domain. To address this, this paper proposes DDAP (dual-modality domain-agnostic prompts), a domain adaptation method for image classification guided by dual-modality domain-agnostic prompts. Method DDAP introduces dual-modality prompt learning: it fine-tunes both text and image features through textual and visual prompt learning, handling domain discrepancy collaboratively. On the one hand, DDAP learns more discriminative text and image features so that the model performs better on the current downstream classification task; on the other hand, DDAP eliminates the discrepancy between the source and target domains to learn domain-invariant text and image features, improving performance on the target domain. These two goals are achieved by adding a domain-agnostic textual prompt module and a domain-agnostic visual prompt module, and fine-tuning CLIP with a classification loss and an alignment loss. For the classification loss, DDAP classifies samples using source-domain labels and target-domain pseudo-labels; for the alignment loss, DDAP aligns the image-feature distributions of the source and target domains with the MMD (maximum mean discrepancy) loss, thereby removing the domain discrepancy of image features. Result The method applies to both single-source and multi-source domain adaptation. For single-source domain adaptation, experiments on the Office-Home, VisDA-2017, and Office-31 datasets yield average classification accuracies of 87.1%, 89.6%, and 91.6%, respectively, reaching the current state-of-the-art performance. For multi-source domain adaptation, the method achieves an average classification accuracy of 88.6% on Office-Home. Ablation studies on Office-Home further verify the effectiveness of the domain-agnostic textual prompt module and the domain-agnostic visual prompt module. Conclusion This paper proposes DDAP, a domain adaptation method for image classification guided by dual-modality domain-agnostic prompts, which fine-tunes the CLIP pre-trained model via domain-agnostic textual and visual prompt modules so that it learns domain-invariant and discriminative features across the source and target domains, effectively improving performance on the target domain.
Dual-modality domain-agnostic prompts guided cross-domain image classification

Xu Yuanyuan, Kan Meina, Shan Shiguang, Chen Xilin(Institute of Computing Technology, Chinese Academy of Sciences)

Objective Domain adaptation aims to utilize information from a labeled source domain to assist the task in an unlabeled target domain. Recently, CLIP (contrastive language-image pre-training) has demonstrated impressive generalization capability in downstream classification tasks, and some methods have incorporated CLIP into domain adaptation to enhance the model's generalization ability in the target domain. However, current CLIP-based domain adaptation methods typically adjust only the features of the textual modality, leaving the visual-modality features unchanged. These existing methods overlook the importance of enhancing the discriminative capability of image features during classification and neglect the synergistic role of the visual modality in eliminating domain discrepancy. To address this, this paper proposes a domain adaptation method for image classification guided by dual-modality domain-agnostic prompts (DDAP). Method DDAP introduces dual-modality prompt learning, simultaneously fine-tuning textual and visual features and collaboratively addressing domain discrepancy. The key modules of DDAP are the domain-agnostic textual prompt module and the domain-agnostic visual prompt module. The former employs textual prompt learning to fine-tune the text encoder, fostering domain-agnostic and discriminative text features across domains; DDAP adopts task-level textual prompt learning, sharing the textual prompt module across all domains and categories. Similarly, the domain-agnostic visual prompt module uses visual prompt learning to enhance the image encoder, cultivating domain-agnostic and discriminative image features; task-level visual prompt learning ensures the visual prompt module is shared across all domains and samples. To learn the added dual-modality domain-agnostic prompts, a classification loss and an alignment loss are used to fine-tune the model.
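The task-level prompt design described above can be sketched as two learnable parameter tensors that are prepended to CLIP's class-name token embeddings and image patch embeddings, respectively. The sketch below is a minimal illustration under assumed embedding widths; the class and method names (`DualModalityPrompts`, `prepend_text`, `prepend_visual`) are hypothetical and not the authors' code:

```python
import torch
import torch.nn as nn


class DualModalityPrompts(nn.Module):
    """Hypothetical sketch of task-level dual-modality prompts.

    n_txt / n_vis are the textual / visual prompt lengths; d_txt / d_vis
    are assumed token-embedding widths of CLIP's text and image encoders.
    """

    def __init__(self, n_txt=16, n_vis=16, d_txt=512, d_vis=768):
        super().__init__()
        # One shared textual prompt for all domains and all categories.
        self.text_prompt = nn.Parameter(torch.randn(n_txt, d_txt) * 0.02)
        # One shared visual prompt for all domains and all samples.
        self.visual_prompt = nn.Parameter(torch.randn(n_vis, d_vis) * 0.02)

    def prepend_text(self, class_token_emb):
        # class_token_emb: (n_cls, L, d_txt) embeddings of class-name tokens.
        n_cls = class_token_emb.shape[0]
        p = self.text_prompt.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([p, class_token_emb], dim=1)

    def prepend_visual(self, patch_emb):
        # patch_emb: (B, N, d_vis) patch embeddings of a batch of images.
        b = patch_emb.shape[0]
        p = self.visual_prompt.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([p, patch_emb], dim=1)
```

Because the prompts are shared across domains, categories, and samples, the only trainable parameters in this sketch are the two prompt tensors; the CLIP encoder weights themselves would stay frozen.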
On the one hand, since CLIP's original pre-training task is matching paired images and text, the model needs to learn text and image features that are more discriminative for the current downstream classification task. DDAP therefore uses a classification loss to train the added dual-modality domain-agnostic prompt modules, enhancing the discriminative power of the features. For the source domain, the classification loss directly uses the existing labels; for the target domain, it uses the collected pseudo-labels. On the other hand, because of significant visual differences between the images of the two domains, the extracted image representations contain domain-specific components. To let the target domain fully exploit the beneficial information from the source domain, DDAP employs the MMD (maximum mean discrepancy) loss to align the image-feature distributions of the source and target domains, learning domain-invariant image features. When aligning the distributions, to enhance the discriminative capability of the aligned features and reduce incorrect category matching between the source and target domains, this work aligns the fusion of image features and classification probabilities. Result The experiments cover three datasets: Office-Home, VisDA-2017, and Office-31. During training, all weights of the CLIP pre-trained model remain fixed; only the weights of the newly added domain-agnostic textual and visual prompt modules are updated. To assess DDAP against existing methods, single-source domain adaptation experiments are conducted on the three datasets, yielding average classification accuracies of 87.1%, 89.6%, and 91.6%, respectively, marking the current state-of-the-art performance. Additionally, DDAP extends to multi-source domain adaptation, where it achieves an average classification accuracy of 88.6% on the Office-Home dataset.
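The alignment loss above can be illustrated with a standard multi-bandwidth RBF-kernel estimator of the squared MMD. This is a sketch of the generic (biased) estimator, not the paper's exact implementation; in DDAP the two inputs would be the fusions of image features and classification probabilities from a source batch and a target batch, and the kernel bandwidths here are assumed:

```python
import torch


def gaussian_mmd(x, y, sigmas=(1.0, 2.0, 4.0)):
    """Biased estimate of squared MMD between batches x and y.

    x, y: (n, d) and (m, d) feature batches. Uses a sum of RBF kernels
    with the assumed bandwidths in `sigmas`. Returns a scalar tensor that
    is (near) zero when the two batches follow the same distribution.
    """

    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2  # pairwise squared Euclidean distances
        return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

    return (kernel(x, x).mean()
            + kernel(y, y).mean()
            - 2.0 * kernel(x, y).mean())
```

Minimizing this quantity over the prompt parameters pulls the source and target feature distributions together while the classification loss keeps them discriminative.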
Ablation studies on the Office-Home dataset further confirm the contribution of the domain-agnostic textual prompt module and the domain-agnostic visual prompt module. Notably, the full version of DDAP surpasses variants with only a single-modality prompt module added and improves over the CLIP pre-trained model by 5%, underscoring the effectiveness of dual-modality domain-agnostic prompts in collectively mitigating domain discrepancy. Moreover, experiments explore the sensitivity of hyperparameters. In the proposed DDAP method, the primary hyperparameters are the weight of the alignment loss and the lengths of the prompt vectors. The results show that when the weight of the alignment loss is near its optimal value, the target-domain performance remains stable; likewise, variations in the lengths of the prompt vectors do not significantly affect DDAP's performance. For a more intuitive grasp of DDAP, this study also employs t-SNE (t-distributed stochastic neighbor embedding) to visualize the image features of different models, and the visualization demonstrates the superiority of DDAP in addressing domain adaptation problems. Conclusion This paper introduces DDAP, a domain adaptation method for image classification tasks guided by dual-modality domain-agnostic prompts. DDAP utilizes domain-agnostic textual and visual prompts to collaboratively eliminate the domain discrepancy between the source and target domains, learning domain-invariant and discriminative image and text features that enhance model performance in the target domain. DDAP can be applied to both single-source and multi-source domain adaptation. The proposed method has been experimentally validated across multiple datasets, achieving state-of-the-art results and demonstrating the significance of handling domain discrepancy collaboratively from a dual-modality perspective.