Jiang Ding, Ye Mang (School of Computer Science, Wuhan University, Wuhan 430072, China)
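Objective Text-to-image person re-identification is a sub-task of image-text cross-modal retrieval. Most existing methods add multiple local features on top of global feature matching to perform cross-modal matching. These local feature matching methods are overly complicated and substantially slow down retrieval, so a simpler and more effective method is needed to improve the cross-modal alignment ability of text-to-image person re-identification models. To this end, building on contrastive language-image pretraining (CLIP), a model pre-trained on large-scale datasets of generic image-text pairs, this paper proposes a text-to-image person re-identification method that combines temperature-scaled projection matching with CLIP. Method Leveraging the cross-modal image-text alignment ability of the pre-trained CLIP model, our model uses only global features for fine-grained image-text semantic feature alignment. In addition, we propose a temperature-scaled cross-modal projection matching (TCMPM) loss function for image-text cross-modal feature matching. Result Experiments on two datasets in this field compare our method with the latest text-to-image person re-identification methods. On the CUHK-PEDES (CUHK person description) and ICFG-PEDES (identity-centric and fine-grained person description) datasets, the Rank-1 value of our method is 5.92% and 1.21% higher, respectively, than that of existing well-performing local matching models. Conclusion The proposed dual-stream Transformer-based text-to-image person re-identification method can directly transfer the cross-modal matching knowledge of CLIP, without freezing model parameters during training or attaching other small models to assist training. Combined with the proposed TCMPM loss function, our method substantially outperforms existing local feature methods in retrieval performance while using global feature matching alone.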
Transformer network for cross-modal text-to-image person re-identification
Jiang Ding, Ye Mang(School of Computer Science, Wuhan University, Wuhan 430072, China)
Objective Text-to-image person re-identification is a sub-task of image-text retrieval that aims to retrieve the images of a target person that correspond to a given text description. The main challenge of the task is the significant feature gap between vision and language, which also restricts fine-grained matching between the semantic information of the two modalities. Recent methods often combine multiple local features with global features for cross-modal matching. These local-level matching methods are complicated and slow down retrieval. Insufficient training data is a further challenge for text-to-image person re-identification. To alleviate it, conventional methods typically initialize their backbone models with weights pre-trained on large-scale single-modal datasets. However, this initialization cannot provide knowledge of fine-grained image-text cross-modal matching and semantic alignment. Therefore, a simpler and more effective method is required to improve the cross-modal alignment of text-to-image person re-identification models. Method We develop a Transformer network that combines a temperature-scaled projection matching method with contrastive language-image pre-training (CLIP) for text-to-image person re-identification. CLIP is a general multimodal foundation model pre-trained on large-scale image-text datasets. A vision Transformer is used as the visual backbone network to preserve fine-grained information, which avoids the limitations of convolutional neural networks (CNNs) in modeling long-range relationships and the information loss caused by downsampling. To exploit the cross-modal image-text alignment capability of the pre-trained CLIP model, our model performs fine-grained image-text semantic feature alignment using global features only. 
In addition, a temperature-scaled cross-modal projection matching (TCMPM) loss function is developed for image-text cross-modal feature matching. The TCMPM loss minimizes the Kullback-Leibler (KL) divergence between the temperature-scaled projection distributions and the normalized true matching distributions in a mini-batch. Result Extensive experiments are carried out on two popular public datasets, CUHK person description (CUHK-PEDES) and identity-centric and fine-grained person description (ICFG-PEDES), to compare the proposed method with the latest text-to-image person re-identification methods and validate its effectiveness. Rank-K (K = 1, 5, 10) is adopted as the retrieval evaluation metric. On the CUHK-PEDES dataset, the Rank-1 value is improved by 5.92% compared with the best-performing existing local-level matching method and by 7.09% compared with existing global-level matching methods. On the ICFG-PEDES dataset, the Rank-1 value is improved by 1.21% compared with existing local-level matching models. Ablation studies are also carried out on the CUHK-PEDES and ICFG-PEDES datasets. Compared with the original cross-modal projection matching (CMPM) loss, the TCMPM loss improves the Rank-1 value by 9.54% on the CUHK-PEDES dataset and by 4.67% on the ICFG-PEDES dataset. Compared with the InfoNCE loss, a commonly used loss in cross-modal contrastive learning, the TCMPM loss improves the Rank-1 value by 3.38% on the CUHK-PEDES dataset and by 0.42% on the ICFG-PEDES dataset. Conclusion An end-to-end dual Transformer network is developed to learn representations of person images and descriptive texts for text-to-image person re-identification. We demonstrate that a global-level matching method has the potential to outperform current state-of-the-art local-level matching methods. 
The Transformer network addresses the inability of CNNs to model long-range relationships and the loss of detailed information caused by downsampling. In addition, the proposed method benefits from the powerful cross-modal alignment capability of CLIP and, together with the proposed TCMPM loss, learns more discriminative image-text features.
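The TCMPM loss described in the abstract can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation. The abstract only states that the loss minimizes the KL divergence between temperature-scaled projection distributions and normalized true matching distributions in a mini-batch; everything else here is an assumption made for illustration: the function names (`tcmpm_loss`, `_one_direction`), the temperature value `tau=0.02`, the smoothing constant `eps`, the L2 normalization of both modalities, the direction of the KL divergence (taken from the projection distribution to the true matching distribution, as in the original CMPM formulation), and the summing of the image-to-text and text-to-image terms.

```python
import math


def _l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def _one_direction(queries, gallery, ids, tau, eps):
    """Mean KL(p || q) over the batch for one retrieval direction."""
    n = len(queries)
    loss = 0.0
    for i in range(n):
        # Temperature-scaled projection distribution p over the mini-batch.
        logits = [
            sum(a * b for a, b in zip(queries[i], gallery[j])) / tau
            for j in range(n)
        ]
        p = _softmax(logits)
        # Normalized true matching distribution q: uniform over same-identity items.
        matches = [1.0 if ids[i] == ids[j] else 0.0 for j in range(n)]
        total = sum(matches)
        q = [m / total for m in matches]
        # eps keeps the logarithm finite where q is zero (assumed, as in CMPM).
        loss += sum(
            pj * math.log(pj / (qj + eps)) for pj, qj in zip(p, q) if pj > 0.0
        )
    return loss / n


def tcmpm_loss(image_feats, text_feats, ids, tau=0.02, eps=1e-8):
    """Sketch of a bidirectional temperature-scaled projection matching loss.

    image_feats, text_feats: lists of global feature vectors, paired by index.
    ids: identity label of each pair. tau and eps are illustrative values only.
    """
    img = [_l2_normalize(v) for v in image_feats]
    txt = [_l2_normalize(v) for v in text_feats]
    return _one_direction(img, txt, ids, tau, eps) + _one_direction(
        txt, img, ids, tau, eps
    )
```

Under these assumptions, a batch in which every text feature is closest to its own image feature gives a loss near zero, while swapping the pairings makes the loss large. A smaller `tau` sharpens the projection distribution, so partially similar non-matching pairs are penalized less once the matching pair has the highest similarity.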