Transformer network for cross-modal text-to-image person re-identification
2023, Vol. 28, No. 5, pp. 1384-1395
Print publication date: 2023-05-16
DOI: 10.11834/jig.220620
Jiang Ding, Ye Mang. 2023. Transformer network for cross-modal text-to-image person re-identification. Journal of Image and Graphics, 28(05): 1384-1395
Objective
Text-to-image person re-identification is a sub-task of cross-modal image-text retrieval. Most existing methods perform cross-modal matching by adding multiple local features on top of global feature matching. These local-feature matching methods are overly complicated and significantly slow down retrieval, so a simpler and more effective way is needed to improve the cross-modal alignment ability of text-to-image person re-identification models. To this end, building on contrastive language-image pre-training (CLIP), a model pre-trained on large-scale generic image-text pair datasets, this paper proposes a text-to-image person re-identification method that combines temperature-scaled projection matching with CLIP.
Method
Leveraging the cross-modal image-text alignment ability of the pre-trained CLIP model, our model performs fine-grained image-text semantic feature alignment using global features only. In addition, we propose a temperature-scaled cross-modal projection matching (TCMPM) loss function for cross-modal image-text feature matching.
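As a sketch of this loss (assuming it follows the original cross-modal projection matching (CMPM) formulation with an added temperature factor τ; the paper's exact form may differ), the image-to-text direction can be written as

```latex
p_{i,j} = \frac{\exp\left(\boldsymbol{x}_i^{\top}\bar{\boldsymbol{z}}_j / \tau\right)}
               {\sum_{k=1}^{N}\exp\left(\boldsymbol{x}_i^{\top}\bar{\boldsymbol{z}}_k / \tau\right)}, \qquad
q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{N} y_{i,k}}, \qquad
\mathcal{L}_{\mathrm{i2t}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}
    p_{i,j}\,\log\frac{p_{i,j}}{q_{i,j} + \varepsilon}
```

where x_i is a global image feature, z̄_j an L2-normalized global text feature, y_{i,j} ∈ {0, 1} indicates whether image i and text j share the same identity, N is the mini-batch size, τ the temperature, and ε a small constant; a symmetric text-to-image term is defined analogously.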
Result
Experiments compare the proposed method with the latest text-to-image person re-identification methods on two datasets in this field. On the CUHK-PEDES (CUHK person description) and ICFG-PEDES (identity-centric and fine-grained person description) datasets, the Rank-1 value of our method improves on the best-performing existing local-level matching models by 5.92% and 1.21%, respectively.
Conclusion
The proposed dual-stream Transformer-based text-to-image person re-identification method directly transfers the cross-modal matching knowledge of CLIP, without freezing model parameters during training or attaching auxiliary small models for training. Combined with the proposed TCMPM loss, our method substantially outperforms existing local-feature methods in retrieval performance while using global feature matching only.
Objective
Text-to-image person re-identification is a sub-task of image-text retrieval, which aims to retrieve the target person images corresponding to a given text description. The main challenge of the task is the significant feature gap between vision and language, and the modality gap likewise restricts fine-grained matching between the semantic information of the two modalities. Recent methods often adopt a mixture of multiple local features and a global feature for cross-modal matching, but these local-level matching methods are complicated and slow down retrieval considerably. Insufficient training data remains another challenge for text-to-image person re-identification. To alleviate it, conventional methods typically initialize their backbone models with weights pre-trained on single-modal large-scale datasets. However, such initialization provides no knowledge of fine-grained cross-modal image-text matching or semantic alignment. Therefore, a simpler and more effective method is required to improve the cross-modal alignment of text-to-image person re-identification models.
Method
We develop a Transformer network with a temperature-scaled projection matching method and contrastive language-image pre-training (CLIP) for text-to-image person re-identification. CLIP is a general multimodal foundation model pre-trained on large-scale image-text datasets. The vision Transformer (ViT) is used as the visual backbone to preserve fine-grained information, which avoids the convolutional neural network (CNN) limitations of restricted long-range relationship modeling and detail loss through down-sampling. To exploit the cross-modal image-text alignment capability of the pre-trained CLIP model, our model focuses on fine-grained image-text semantic feature alignment using global features only. In addition, a temperature-scaled cross-modal projection matching (TCMPM) loss function is developed for image-text cross-modal feature matching. The TCMPM loss minimizes the Kullback-Leibler (KL) divergence between the temperature-scaled projection distributions and the normalized true matching distributions within a mini-batch.
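A minimal PyTorch sketch of such a loss, based only on the description above (the function name, the fixed temperature value, and the symmetric two-direction form are assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def tcmpm_loss(image_feats, text_feats, pids, temperature=0.02, eps=1e-8):
    """Sketch of a temperature-scaled cross-modal projection matching loss:
    KL divergence between the temperature-scaled projection distribution and
    the normalized true matching distribution within a mini-batch."""
    # Project image features onto the directions of the L2-normalized text features.
    text_norm = F.normalize(text_feats, dim=-1)
    proj = image_feats @ text_norm.t()                 # (B, B) projection logits

    # Temperature-scaled projection distribution over the batch.
    p = F.softmax(proj / temperature, dim=1)

    # Normalized true matching distribution: uniform over same-identity pairs.
    labels = (pids.unsqueeze(0) == pids.unsqueeze(1)).float()
    q = labels / labels.sum(dim=1, keepdim=True)

    # KL(p || q); eps avoids log(0) for non-matching pairs.
    i2t = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()

    # Symmetric text-to-image direction.
    image_norm = F.normalize(image_feats, dim=-1)
    p_t = F.softmax((text_feats @ image_norm.t()) / temperature, dim=1)
    t2i = (p_t * (torch.log(p_t + eps) - torch.log(q + eps))).sum(dim=1).mean()

    return i2t + t2i
```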
Result
Extensive experiments are carried out in comparison with the latest text-to-image person re-identification methods on two popular public datasets, CUHK person description (CUHK-PEDES) and identity-centric and fine-grained person description (ICFG-PEDES), to validate the effectiveness of the proposed method. Rank-K accuracies (K = 1, 5, 10) are adopted as the retrieval evaluation metrics. On the CUHK-PEDES dataset, the Rank-1 value is improved by 5.92% compared with the best-performing existing local-level matching method, and by 7.09% compared with the best existing global-level matching method. On the ICFG-PEDES dataset, the Rank-1 value is improved by 1.21% over the best local-level matching model. Ablation studies are also carried out on the CUHK-PEDES and ICFG-PEDES datasets. Compared with the original CMPM loss, the TCMPM loss improves the Rank-1 value by 9.54% on CUHK-PEDES and by 4.67% on ICFG-PEDES. Compared with the InfoNCE loss, which is commonly used in cross-modal contrastive learning, the TCMPM loss improves the Rank-1 value by 3.38% on CUHK-PEDES and by 0.42% on ICFG-PEDES.
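For reference, Rank-K can be computed from a query-gallery similarity matrix as in the following sketch (function and variable names are illustrative):

```python
import torch

def rank_k_accuracy(scores, query_pids, gallery_pids, ks=(1, 5, 10)):
    """Rank-K retrieval accuracy: the fraction of queries whose top-K ranked
    gallery items contain at least one image of the correct identity.

    scores: (num_queries, num_gallery) similarity matrix."""
    order = scores.argsort(dim=1, descending=True)             # ranked gallery indices
    matches = gallery_pids[order] == query_pids.unsqueeze(1)   # (Q, G) hit matrix
    return {k: matches[:, :k].any(dim=1).float().mean().item() for k in ks}
```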
Conclusion
An end-to-end dual-stream Transformer network is developed to learn representations of person images and descriptive texts for text-to-image person re-identification. We demonstrate that a global-level matching method has the potential to outperform current state-of-the-art local-level matching methods. The Transformer network resolves the CNN problems of failing to model long-range relationships and of losing detailed information through down-sampling. In addition, our method benefits from the powerful cross-modal alignment capability of CLIP; together with the designed TCMPM loss, the model learns more discriminative image-text features.
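To illustrate the retrieval setting, the following is an inference-time sketch of global-feature text-to-image ranking with an off-the-shelf CLIP dual encoder (using OpenAI's clip package; the model variant, prompt, and file names are placeholders, and the paper fine-tunes the network end-to-end with TCMPM rather than using frozen CLIP):

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Dual-stream encoders: a ViT image encoder and a Transformer text encoder.
model, preprocess = clip.load("ViT-B/16", device=device)

# Hypothetical gallery images and a text query.
paths = ["person_001.jpg", "person_002.jpg", "person_003.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
tokens = clip.tokenize(["a woman in a red coat carrying a black backpack"]).to(device)

with torch.no_grad():
    image_feats = model.encode_image(images)   # global image features
    text_feat = model.encode_text(tokens)      # global text feature

# Cosine-similarity ranking using global features only.
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
scores = (text_feat @ image_feats.t()).squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"Rank-{rank}: {paths[idx]} (score={scores[idx]:.3f})")
```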
cross-modal retrieval; person re-identification; person search; Transformer; text-image retrieval