Cross-modal barefoot footprint retrieval with a non-local attention dual-branch network

Bao Wenxia; Mao Lili; Wang Nian; Tang Jun; Yang Xianjun; Zhang Yan

Published: 2022-07-13
DOI: 10.11834/jig.200806
2022 | Volume 27 | Number 7

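Bao Wenxia¹, Mao Lili¹, Wang Nian¹, Tang Jun¹, Yang Xianjun², Zhang Yan¹ (1. College of Electronic Information Engineering, Anhui University, Hefei 230601, China; 2. Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China)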

Abstract

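Objective To address problems in current footprint retrieval such as the diversity of acquisition devices and the difficulty of extracting effective footprint features, this paper takes barefoot footprint images as the research object and proposes a cross-modal barefoot footprint retrieval algorithm based on a non-local attention dual-branch network. Method The network consists of a feature extraction module, a feature embedding module, and a dual-constraint loss module. The feature extraction module adopts a dual-branch structure in which each branch uses ResNet50 as the backbone network to extract effective features from optical and pressure barefoot images, respectively. The feature embedding module learns a shared multi-modal space through parameter sharing and introduces a non-local attention mechanism to capture long-range dependencies quickly, obtain a larger receptive field, and focus on the overall pressure distribution of the footprint image, which enhances the useful features of each modality while highlighting the features common to both modalities. To increase inter-class feature differences and reduce intra-class feature differences among barefoot footprint images, the whole network is constrained by cross-entropy loss (L_CE) and triplet loss (L_TRI), so that cross-modal shared features are learned better and the gap between modalities is reduced. Result Optical and pressure barefoot images collected from 138 people are used as the experimental dataset, and the proposed algorithm is compared with the fine-grained cross-modal retrieval method FGC (fine-grained cross-model) and the cross-modal person re-identification method HC (hetero-center). In the optical-to-pressure retrieval mode, the proposed algorithm achieves mAP (mean average precision) and rank1 values of 83.63% and 98.29%, respectively; in the pressure-to-optical retrieval mode, the mAP and rank1 values are 84.27% and 94.71%. The mean mAP and mean rank1 over the two retrieval modes are 83.95% and 96.5%, which are 40.01% and 36.50% higher than FGC and 26.07% and 19.32% higher than HC, respectively. Comparative analyses of the non-local attention mechanism, the loss function, and the pooling method used after the feature embedding module confirm the effectiveness of the proposed algorithm. Conclusion The proposed cross-modal barefoot footprint retrieval algorithm achieves high accuracy and provides a research basis for applications such as on-site footprint comparison and identification.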
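The dual-constraint loss named in the abstract can be illustrated with a minimal sketch in plain Python. It assumes the cross-entropy and triplet terms are simply added; the margin of 0.3, the weight of 1.0, the Euclidean distance, and every function name are illustrative assumptions, not values or identifiers given by the authors.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumption made for this sketch.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def dual_constraint_loss(logits, label, anchor, positive, negative,
                         margin=0.3, weight=1.0):
    """Sum of the identity (cross-entropy) term and the triplet term.

    The unweighted sum (weight=1.0) is an assumption of this sketch.
    """
    return cross_entropy(logits, label) + weight * triplet_loss(
        anchor, positive, negative, margin)


# Hypothetical cross-modal triplet: the anchor is an optical embedding, the
# positive is a pressure embedding of the same person, and the negative is a
# pressure embedding of a different person.
loss = dual_constraint_loss(
    logits=[2.0, 0.0], label=0,
    anchor=[0.0, 0.0], positive=[0.1, 0.0], negative=[1.0, 0.0])
print(loss)
```

In this toy triplet the negative is already farther from the anchor than the positive by more than the margin, so the triplet term is zero and only the cross-entropy term contributes. Drawing the positive and negative from the other modality is one way such a loss can pull same-person optical and pressure embeddings together in the shared space.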

Keywords

image retrieval; cross-modal footprint retrieval; non-local attention mechanism; dual-branch network; barefoot footprint image

Non-local attention dual-branch network based cross-modal barefoot footprint retrieval

Bao Wenxia¹, Mao Lili¹, Wang Nian¹, Tang Jun¹, Yang Xianjun², Zhang Yan¹ (1. College of Electronic Information Engineering, Anhui University, Hefei 230601, China; 2. Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China)

Abstract

Objective Footprints are among the types of physical evidence most frequently left at and recovered from crime scenes, and footprint retrieval and comparison play an important role in criminal investigation. A footprint is determined by the foot shape and bone structure of the person who left it, so its features are both specific and stable. Footprints also reflect a person's physiological and behavioral characteristics and are related to biological attributes such as height, body shape, gender, age, and walking habits. Medical research indicates that each person's footprint pressure information is unique. Improving the rate at which footprints are discovered, extracted, and used in criminal investigation remains a challenge. Footprint image retrieval is therefore of great significance, because it provides a theoretical basis and technical support for footprint comparison and identification. Footprint images come in different modalities because the scenes and tools used to extract them vary. The global information in a cross-modal barefoot image is unique to the person, which makes retrieval possible: given a query image in one modality, retrieval returns the corresponding images in the other modality. Traditional cross-modal retrieval methods are mainly subspace methods and objective model methods, and they have difficulty obtaining discriminative features. Deep learning based retrieval methods construct a common multi-modal space with a convolutional neural network (CNN). High-level semantic features of the images are captured through iterative optimization of the network parameters, which reduces the heterogeneity between modalities. Method A cross-modal barefoot footprint retrieval algorithm based on a non-local attention two-branch network is proposed to address the large intra-class distances and small inter-class distances typical of fine-grained images. The collected barefoot footprint images comprise an optical modality and a pressure modality.
A median filter is applied to all images to remove noise, and data augmentation is used to expand the footprint images of each modality. In the feature extraction module, a pre-trained ResNet50 serves as the backbone network to extract the modality-specific features. In the feature embedding module, parameter sharing is realized by concatenating feature vectors, and a shared multi-modal space is constructed. All residual blocks in Layer2 and Layer3 of ResNet50 use a non-local attention mechanism to capture long-range dependencies quickly, obtain a larger receptive field, and highlight common features. In addition, cross-entropy loss and triplet loss are used to learn the shared multi-modal space better, reducing intra-class differences and increasing inter-class differences in the features. The experimental platform is equipped with two NVIDIA 2070TI graphics cards, and the network is built in PyTorch. The barefoot footprint images are 224×224 pixels. The stochastic gradient descent (SGD) optimizer is used for training, with 81 iterations and an initial learning rate of 0.01. The trained network is evaluated on the validation set to obtain the mean average precision (mAP) and rank values, and the model with the highest rank1 value is saved as the optimal model. The saved model is then evaluated on the test set, and the final experimental results are recorded. Result A cross-modal retrieval dataset is collected from 138 people. Comparative experiments verify the effect on retrieval performance of the non-local attention mechanism, of different loss functions, and of different pooling methods after the feature embedding module. The proposed algorithm is also compared with the fine-grained cross-model (FGC) method for fine-grained cross-modal retrieval and the hetero-center (HC) method for RGB-infrared cross-modal person re-identification.
The training, validation, and test sets contain 82, 28, and 28 people, with 16 400, 5 600, and 5 600 images, respectively. The ratio of query images to retrieval images in the validation and test sets is 1:2. The evaluation metrics are the mean mAP (mAP_Avg) and the mean rank1 (rank1_Avg) over the two retrieval modes. The proposed algorithm achieves high precision, with mAP_Avg and rank1_Avg of 83.95% and 96.5%, respectively. These values are 40.01% and 36.50% higher than those of FGC, and 26.07% and 19.32% higher than those of HC. Conclusion A cross-modal barefoot footprint retrieval algorithm based on a non-local attention dual-branch network is proposed, integrating a non-local attention mechanism with a double-constraint loss. The algorithm considers the uniqueness and correlation of intra-modal and inter-modal features and further improves the performance of cross-modal barefoot footprint retrieval, providing a theoretical basis and technical support for footprint comparison and identification.
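The evaluation metrics above can be sketched in plain Python as follows. This is a minimal illustration that assumes the common retrieval definitions of average precision and rank-1 accuracy; the paper's exact evaluation protocol is not stated in the abstract, and the function names and toy identity labels are hypothetical. The last lines average the per-direction values reported for this paper (83.63% and 84.27% mAP, 98.29% and 94.71% rank1) to show how mAP_Avg and rank1_Avg arise.

```python
def average_precision(ranked_ids, query_id):
    """Average precision of one ranked gallery list for one query identity."""
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    return precision_sum / hits if hits else 0.0


def rank1(ranked_ids, query_id):
    """1.0 if the top-ranked gallery image has the query identity, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] == query_id else 0.0


def evaluate(rankings):
    """Return (mAP, rank-1 accuracy) over (query_id, ranked_gallery_ids) pairs."""
    n = len(rankings)
    mean_ap = sum(average_precision(r, q) for q, r in rankings) / n
    mean_rank1 = sum(rank1(r, q) for q, r in rankings) / n
    return mean_ap, mean_rank1


# Toy example with hypothetical identities "A" and "B".
toy_rankings = [("A", ["A", "B", "A"]), ("B", ["A", "B"])]
toy_map, toy_rank1 = evaluate(toy_rankings)

# Averaging the two retrieval directions reported for the proposed algorithm:
# optical-to-pressure and pressure-to-optical.
map_avg = (83.63 + 84.27) / 2
rank1_avg = (98.29 + 94.71) / 2
print(round(map_avg, 2), round(rank1_avg, 2))
```

Averaging the two directions gives a single figure per metric, which keeps the comparison with FGC and HC from depending on which modality is used as the query.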

Keywords

image retrieval; cross-modal footprint retrieval; non-local attention mechanism; two-branch network; barefoot footprint image
