Dual-grained feature fusion network for cross-modality pedestrian re-identification
Vol. 28, Issue 5, Pages 1422-1433 (2023)
Published: 16 May 2023
DOI: 10.11834/jig.220387
Ma Xiaofeng, Cheng Wengang. 2023. Dual-grained feature fusion network for cross-modality pedestrian re-identification. Journal of Image and Graphics, 28(05): 1422-1433
Objective
Visible-infrared cross-modality pedestrian re-identification aims to match visible and infrared images that share the same pedestrian identity. Existing methods mainly rely on modality-shared feature learning or modality transformation to narrow the gap between modalities: the former usually attends only to global or local feature representations, while the latter suffers from unreliable generated modalities. In fact, the contour possesses a degree of cross-modality invariance and is also a relatively reliable cue for identifying pedestrians. To exploit contour information effectively and reduce the inter-modality discrepancy, this paper takes the contour as an auxiliary modality and proposes a contour-guided dual-grained feature fusion network for cross-modality pedestrian re-identification.
Method
At the global grain, pedestrian images are fused into their contour images to strengthen the global feature representation of the contour, yielding augmented contour features. At the local grain, the augmented contour features are fused with part-based local features, combining global and local information into the final fused image representation.
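For concreteness, the two fusion grains can be sketched in PyTorch as follows. The abstract does not specify the fusion operators, so element-wise addition for G-Fusion and feature concatenation for L-Fusion are assumptions for illustration, not the paper's exact design:

    import torch
    import torch.nn as nn

    class DualGrainedFusion(nn.Module):
        """Illustrative sketch of the two fusion grains; the fusion
        operators (addition, concatenation) are assumptions."""

        def __init__(self):
            super().__init__()
            self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling

        def g_fusion(self, img_map, contour_map):
            # Global grain: fuse the pedestrian-image feature map into
            # the contour feature map, then pool to one augmented
            # contour feature vector per image.
            fused = img_map + contour_map        # assumed operator
            return self.gap(fused).flatten(1)    # shape (B, C)

        def l_fusion(self, aug_contour, part_feats):
            # Local grain: combine the augmented contour feature with
            # each part-based local feature (part_feats: list of (B, C)).
            return [torch.cat([aug_contour, p], dim=1) for p in part_feats]

In such a sketch, the fused part-level features would then feed the identity objectives; the paper's actual branch layout is described in the Method section below.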
Result
The model is evaluated on two public visible-infrared cross-modality pedestrian re-identification datasets and outperforms several representative methods. On the SYSU-MM01 (Sun Yat-sen University multiple modality 01) dataset, it achieves a rank-1 accuracy of 62.42% and a mean average precision (mAP) of 58.14%. On the RegDB (Dongguk body-based person recognition database) dataset, it reaches 84.42% rank-1 accuracy and 77.82% mAP.
Conclusion
This paper introduces contour information into cross-modality pedestrian re-identification and proposes a contour-guided dual-grained feature fusion network that fuses features at both the global and local grains to learn discriminative features. Its performance surpasses several representative recent methods, validating the effectiveness of the contour cue and of the way it is exploited.
Objective
Visible-infrared cross-modality pedestrian re-identification (VI-ReID) matches visible and infrared images of the same identity. As a key technique for intelligent surveillance, it remains challenging because of the cross-modality discrepancy. Beyond the intra-class variations addressed in RGB image-based pedestrian re-identification (RGB-ReID), the crucial challenge in VI-ReID is to bridge the modality gap between the RGB and infrared (IR) images of the same identity. Current methods mainly follow modality-shared feature learning or modality transformation approaches. Modality-shared feature learning methods map RGB and IR inputs into a common embedding space for cross-modality feature alignment; a two-stream convolutional neural network (two-stream CNN) architecture is widely adopted, and various discriminative constraints have been developed on top of it. However, because each filter attends to a small local region, convolutions struggle to model long-range spatial concepts, and quantitative studies show that CNNs are strongly biased toward textures rather than shapes. Moreover, existing VI-ReID methods of this line focus only on global or local feature representations. The other line, modality transformation, generates cross-modality pedestrian images or transforms images into an intermediate modality, commonly using generative adversarial networks (GANs) or encoder-decoder structures. However, owing to distorted IR-to-RGB translation and additional noise, image generation is unreliable and GAN models are difficult to converge. RGB images consist of three color channels, while IR images contain only a single channel reflecting the thermal radiation emitted from the human body and its surroundings. Given the missing colors and textures in IR images, we revisit the VI-ReID problem and identify the contour as a relatively effective feature. Furthermore, the contour is a modality-shared cue, as it stays consistent across IR and RGB images and is more accurate and reliable than a generated intermediate modality. We therefore integrate the contour into VI-ReID, taking it as an auxiliary modality to narrow the modality gap, and we further introduce part-based local features into our model to collaborate with the global ones.
Method
A contour-guided dual-grained feature fusion network (CGDGFN) is developed for VI-ReID. It performs two types of fusion. The first fuses an image with its contour at image level and is called global-grained fusion (G-Fusion); it outputs augmented contour features. The second fuses the augmented contour features with local features at a mixed image-and-part level; since local features are involved, it is called local-grained fusion (L-Fusion) for simplicity. CGDGFN consists of four branches: 1) RGB images, 2) IR images, 3) RGB contours, and 4) IR contours. First, a pair of RGB and IR images is fed into the RGB branch and the IR branch, and a contour detector generates their contour images. The contour images of the two modalities are then fed into the RGB-contour branch and the IR-contour branch. ResNet50 serves as the backbone of each branch. The first convolutional layer of each branch has independent parameters to capture modality-specific information, while the remaining blocks share weights to learn modality-invariant features. In addition, the average pooling layers of the RGB and IR branches are modified for part-based feature extraction. G-Fusion fuses each image into its corresponding contour image; after G-Fusion, the augmented contour features are produced by the global average pooling layers of the RGB-contour and IR-contour branches. Meanwhile, the RGB and IR branches output the corresponding local features. Each of the RGB and IR local features is an array of feature vectors whose length is determined by the partition setting; two local feature extraction schemes are supported: 1) uniform partition and 2) soft partition. L-Fusion then fuses the augmented contour features with the corresponding local ones. Our method is implemented in the PyTorch framework. We adopt ResNet50 pre-trained on ImageNet as the backbone and set the stride of the last convolutional layer to 1 to obtain feature maps of higher spatial resolution. The batch size is set to 64: for each batch, we randomly select 4 identities, each with 8 visible and 8 infrared images. Input images are resized to 288 × 144 pixels, and random cropping and random horizontal flipping are used for data augmentation. The stochastic gradient descent (SGD) optimizer with momentum 0.9 is used. We first train the model for 60 epochs, with the initial learning rate set to 0.01 and a warmup strategy applied to improve performance. For the soft partition, we fine-tune the model for an additional 20 epochs.
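The training recipe above translates into a short PyTorch sketch. The input size, augmentations, optimizer, momentum, learning rate, and epoch count follow the text; the padding amount, normalization statistics, warmup length, and the stand-in module are assumptions for illustration only:

    import torch
    from torchvision import transforms

    # Augmentation as described: resize to 288 x 144, random cropping and
    # random horizontal flipping. The padding and normalization statistics
    # are assumptions (the text does not state them).
    train_transform = transforms.Compose([
        transforms.Resize((288, 144)),
        transforms.Pad(10),
        transforms.RandomCrop((288, 144)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # SGD with momentum 0.9 and initial learning rate 0.01; a linear
    # warmup is applied (the 10-epoch warmup length is an assumption).
    model = torch.nn.Linear(2048, 512)  # stand-in for the CGDGFN network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    warmup_epochs = 10
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda e: min(1.0, (e + 1) / warmup_epochs))

    for epoch in range(60):  # first stage; soft partition adds 20 more epochs
        # ... one pass over batches of 4 identities, each contributing
        # 8 visible and 8 infrared images (batch size 64) ...
        scheduler.step()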
Result
The proposed CGDGFN is compared with state-of-the-art (SOTA) VI-ReID approaches, covering global-feature, local-feature, and image-generation methods, on two datasets: Sun Yat-sen University multiple modality 01 (SYSU-MM01) and the Dongguk body-based person recognition database (RegDB). The standard cumulative matching characteristics (CMC) and mean average precision (mAP) are employed to evaluate performance. Our method obtains a 62.42% rank-1 identification rate and a 58.14% mAP score on SYSU-MM01, while rank-1 and mAP on RegDB reach 84.42% and 77.82%, respectively, surpassing popular SOTA approaches on both datasets.
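For reference, the two reported metrics can be computed from a query-gallery distance matrix with a minimal NumPy sketch. It follows the generic single-gallery protocol and omits the camera-aware filtering of the official SYSU-MM01 evaluation, so it is illustrative rather than the exact benchmark code:

    import numpy as np

    def rank1_and_map(dist, q_ids, g_ids):
        """dist: (num_query, num_gallery) distances; q_ids/g_ids: integer
        identity labels. Returns (rank-1 accuracy, mAP)."""
        rank1, aps = 0.0, []
        for i in range(dist.shape[0]):
            order = np.argsort(dist[i])          # gallery by ascending distance
            matches = g_ids[order] == q_ids[i]   # True where identities agree
            rank1 += float(matches[0])           # top-1 hit for this query
            hits = np.flatnonzero(matches)
            if hits.size:                        # average precision per query
                aps.append(np.mean([(k + 1) / (p + 1)
                                    for k, p in enumerate(hits)]))
        return rank1 / dist.shape[0], float(np.mean(aps))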
Conclusion
We introduce the contour cue into VI-ReID. To leverage contour information, we take the contour as an auxiliary modality and develop a contour-guided dual-grained feature fusion network (CGDGFN). Global-grained fusion (G-Fusion) enhances the original contour representation and produces augmented contour features, while local-grained fusion (L-Fusion) fuses the part-based local features with the augmented contour features to yield a more powerful image representation.
pedestrian re-identification; cross-modality; feature fusion; contour; feature learning
Chen J X, Yang Q Z, Meng J K, Zheng W S and Lai J H. 2019. Contour-guided person re-identification//Proceedings of the 2nd Chinese Conference on Pattern Recognition and Computer Vision. Xi'an, China: Springer: 296-307 [DOI: 10.1007/978-3-030-31726-3_25]
Chen Y H S, Wan L, Li Z H, Jing Q Y and Sun Z Y. 2021. Neural feature search for RGB-infrared person re-identification//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 587-597 [DOI: 10.1109/CVPR46437.2021.00065]
Choi S, Lee S, Kim Y, Kim T and Kim C. 2020. Hi-CMD: hierarchical cross-modality disentanglement for visible-infrared person re-identification//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10254-10263 [DOI: 10.1109/CVPR42600.2020.01027]
Dai P Y, Ji R R, Wang H B, Wu Q and Huang Y Y. 2018. Cross-modality person re-identification with generative adversarial training//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: AAAI Press: 677-683 [DOI: 10.24963/ijcai.2018/94]
Gao Y J, Liang T F, Jin Y, Gu X Y, Liu W, Li Y D and Lang C Y. 2021. MSO: multi-feature space joint optimization network for RGB-infrared person re-identification//Proceedings of the 29th ACM International Conference on Multimedia. Virtual, China: Association for Computing Machinery: 5257-5265 [DOI: 10.1145/3474085.3475643]
Geirhos R, Rubisch P, Michaelis C, Bethge M, Wichmann F A and Brendel W. 2022. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness [EB/OL]. [2022-04-02]. https://arxiv.org/pdf/1811.12231.pdf
Han K, Wang Y H, Chen H T, Chen X H, Guo J Y, Liu Z H, Tang Y H, Xiao A, Xu C J, Xu Y X, Yang Z H, Zhang Y M and Tao D C. 2022. A survey on vision transformer [EB/OL]. [2022-04-02]. https://arxiv.org/pdf/2012.12556.pdf
Hao Y, Wang N N, Li J and Gao X B. 2019a. HSME: hypersphere manifold embedding for visible thermal person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1): 8385-8392 [DOI: 10.1609/aaai.v33i01.33018385]
Hao Y, Wang N N, Gao X B, Li J and Wang X Y. 2019b. Dual-alignment feature embedding for cross-modality person re-identification//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: Association for Computing Machinery: 57-65 [DOI: 10.1145/3343031.3351006]
Li D G, Wei X, Hong X P and Gong Y H. 2020. Infrared-visible cross-modal person re-identification with an X modality. Proceedings of the AAAI Conference on Artificial Intelligence, 34(4): 4610-4617 [DOI: 10.1609/aaai.v34i04.5891]
Liu H J, Tan X H and Zhou X C. 2021. Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification. IEEE Transactions on Multimedia, 23: 4414-4425 [DOI: 10.1109/TMM.2020.3042080]
Liu Y, Cheng M M, Hu X W, Wang K and Bai X. 2017. Richer convolutional features for edge detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5872-5881 [DOI: 10.1109/CVPR.2017.622]
Nguyen D T, Hong H G, Kim K W and Park K R. 2017. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors, 17(3): #605 [DOI: 10.3390/s17030605]
Park H, Lee S, Lee J and Ham B. 2021. Learning by aligning: visible-infrared person re-identification using cross-modal correspondences//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 12026-12035 [DOI: 10.1109/ICCV48922.2021.01183]
Shi W D, Zhang Y Z, Liu S W, Zhu S D and Bao J N. 2020. Person re-identification based on deformation and occlusion mechanisms. Journal of Image and Graphics, 25(12): 2530-2540 [DOI: 10.11834/jig.200016]
Sun Y F, Zheng L, Yang Y, Tian Q and Wang S J. 2018. Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline)//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 501-518 [DOI: 10.1007/978-3-030-01225-0_30]
Wang G A, Zhang T Z, Cheng J, Liu S, Yang Y and Hou Z G. 2019a. RGB-infrared cross-modality person re-identification via joint pixel and feature alignment//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3622-3631 [DOI: 10.1109/ICCV.2019.00372]
Wang G A, Zhang T Z, Yang Y, Cheng J, Chang J L, Liang X and Hou Z G. 2020. Cross-modality paired-images generation for RGB-infrared person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 12144-12151 [DOI: 10.1609/aaai.v34i07.6894]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Wang Z X, Wang Z, Zheng Y Q, Chuang Y Y and Satoh S. 2019b. Learning to reduce dual-level discrepancy for infrared-visible person re-identification//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 618-626 [DOI: 10.1109/CVPR.2019.00071]
Wei Z Y, Yang X, Wang N N and Gao X B. 2021. Syncretic modality collaborative learning for visible infrared person re-identification//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 225-234 [DOI: 10.1109/ICCV48922.2021.00029]
Wu A C, Zheng W S, Yu H X, Gong S G and Lai J H. 2017. RGB-infrared cross-modality person re-identification//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5390-5399 [DOI: 10.1109/ICCV.2017.575]
Wu B C, Xu C F, Dai X L, Wan A, Zhang P Z, Yan Z C, Tomizuka M, Gonzalez J, Keutzer K and Vajda P. 2021. Visual transformers: where do transformers really belong in vision models?//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 579-589 [DOI: 10.1109/ICCV48922.2021.00064]
Ye M, Lan X Y, Leng Q M and Shen J B. 2020. Cross-modality person re-identification via modality-aware collaborative ensemble learning. IEEE Transactions on Image Processing, 29: 9387-9399 [DOI: 10.1109/TIP.2020.2998275]
Ye M, Lan X Y, Li J W and Yuen P C. 2018a. Hierarchical discriminative learning for visible thermal person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1): 7501-7508 [DOI: 10.1609/aaai.v32i1.12293]
Ye M, Shen J B, Lin G J, Xiang T, Shao L and Hoi S C H. 2022. Deep learning for person re-identification: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6): 2872-2893 [DOI: 10.1109/TPAMI.2021.3054775]
Ye M, Wang Z, Lan X Y and Yuen P C. 2018b. Visible thermal person re-identification via dual-constrained top-ranking//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: AAAI Press: 1092-1099 [DOI: 10.24963/ijcai.2018/152]
Zhang Y K, Yan Y, Lu Y and Wang H Z. 2021. Towards a unified middle modality learning for visible-infrared person re-identification//Proceedings of the 29th ACM International Conference on Multimedia. Virtual, China: Association for Computing Machinery: 788-796 [DOI: 10.1145/3474085.3475250]
Zhao C R, Qi D, Dou S G, Tu Y P, Sun T L, Bai S, Jiang X Y, Bai X and Miao D Q. 2021. Key technology for intelligent video surveillance: a review of person re-identification. Scientia Sinica Informationis, 51(12): 1979-2015 [DOI: 10.1360/SSI-2021-0211]
Zhu Y X, Yang Z, Wang L, Zhao S, Hu X and Tao D P. 2020. Hetero-center loss for cross-modality person re-identification. Neurocomputing, 386: 97-109 [DOI: 10.1016/j.neucom.2019.12.100]