Few-shot image classification based on multi-scale measurement of attention feature representation

Wang Xuesong, Lv Lixiang, Cheng Yuhu, Wang Haoyu (School of Information and Control Engineering, China University of Mining and Technology)

Abstract
Objective Few-shot image classification aims to train a machine learning model that can classify target images effectively when only limited target training samples are available. The main challenge is the lack of a sufficient dataset, i.e., only a small amount of labeled data is available for training the model. Numerous advanced models have been proposed to tackle this challenge, and a common and efficient strategy is to use a deep network as the feature extractor. A deep network can automatically extract valuable features from an input image: through multi-layer convolution and pooling operations, it produces feature vectors that can be used to determine the category of the image. As the model trains, the feature extractor gradually learns to extract information relevant to the image's category, which then serves as the feature vector. By leveraging the power of deep learning, such models can achieve high accuracy even when trained on limited labeled data. However, in the process of compressing features into a vector, valuable information is easily lost, including information strongly associated with the specific category, so crucial cues that could greatly enhance classification accuracy may be discarded. To improve classification accuracy, the extracted feature vectors should retain as much category-specific information as possible. To achieve a more extensive and comprehensive image representation, this paper introduces a novel rich representation feature extractor (RireFeat) based on the base classes.

Methods RireFeat aims to enhance the exchange and flow of information within the feature extractor, thereby facilitating the extraction of class-related features. The method also attends to the multi-layer feature vectors both before and after each stage of the extractor, ensuring that information useful for classification is not lost during feature extraction. RireFeat employs a pyramid-like design that divides the feature extractor into multiple levels: each level receives the image encoding from the level above, applies several convolution and pooling operations, and passes the result on to the next level. This hierarchical structure facilitates the transfer and fusion of information between different levels and maximizes the utilization of image information within the feature extractor, which deepens the category relevance of the feature vectors and improves classification accuracy. Furthermore, RireFeat generalizes well and readily adapts to novel image classification tasks. Specifically, this paper starts from the feature extraction process: as the image information traverses the multi-level hierarchy, category-related local features are extracted while category-irrelevant information is ignored. A minimal sketch of this pyramid with attention-gated cross-level information flow is given below.
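The abstract describes the cross-level channels only at the architectural level, so the following PyTorch sketch is one plausible reading rather than the authors' implementation; the module names (CrossLevelAttention, PyramidExtractor) and the concrete layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelAttention(nn.Module):
    """Hypothetical attention gate that re-injects a lower level's feature
    map into the next level, so information dropped by convolution and
    pooling can reappear in the new representation."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.proj = nn.Conv2d(low_ch, high_ch, kernel_size=1)  # match channels
        self.gate = nn.Sequential(                             # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_ch, high_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # Resize the lower-level map to the higher level's spatial size.
        low = self.proj(F.interpolate(low, size=high.shape[-2:]))
        return high + self.gate(high) * low  # attention-gated residual fusion

class PyramidExtractor(nn.Module):
    """Toy three-level pyramid: each level convolves and pools, and an
    attention channel carries level i's output into level i+1's output."""
    def __init__(self):
        super().__init__()
        self.levels = nn.ModuleList([
            nn.Sequential(nn.Conv2d(ci, co, 3, padding=1),
                          nn.BatchNorm2d(co), nn.ReLU(), nn.MaxPool2d(2))
            for ci, co in [(3, 64), (64, 128), (128, 256)]
        ])
        self.flows = nn.ModuleList([CrossLevelAttention(64, 128),
                                    CrossLevelAttention(128, 256)])

    def forward(self, x):
        prev = None
        for i, level in enumerate(self.levels):
            x = level(x)
            if prev is not None:               # fuse with the previous level
                x = self.flows[i - 1](prev, x)
            prev = x
        return x                               # (B, 256, 10, 10) for 84x84 input
```

A global pooling layer or a SetFeat-style head would then turn this fused map into the vector (or vector set) consumed by the metric stage.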
However, this process may also remove some category-specific information. To address this issue, RireFeat integrates a small shaping module across hierarchy levels, so that image information can still flow and merge with information from other levels after crossing a level. This design lets the network pay more attention to how features change before and after each level, facilitating the effective extraction of local features while disregarding category-irrelevant information, and consequently it significantly enhances classification accuracy. At the same time, this paper introduces the idea of contrastive learning into few-shot image classification and combines it with deep Brownian distance covariance to build a contrastive loss function that measures image features at multiple scales. This loss brings embeddings of the same distribution closer while pushing embeddings of different distributions farther apart, thereby improving classification accuracy; a sketch of such a loss is given below. In the experiments, the SetFeat method is used to extract a feature set for each image. As in other few-shot learning methods, the whole network is first pre-trained and then fine-tuned in the meta-training stage, where classification is performed by computing the distance between the query (test) samples and the support (training) samples, as in the episode sketch that follows the loss sketch.
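The abstract names only the ingredients (contrastive learning plus deep Brownian distance covariance), so the sketch below is a simplified, single-scale reading of how such a loss could be assembled, in the spirit of DeepBDC; the function names, the temperature tau, and the omission of the multi-scale weighting are all assumptions.

```python
import torch

def bdc_matrix(feat):
    """Simplified Brownian distance covariance (BDC) matrix. feat has shape
    (N, d, m): d channels observed at m spatial positions. The pairwise
    channel distance matrix is double-centered, as in DeepBDC."""
    dist = torch.cdist(feat, feat)                  # (N, d, d) channel distances
    row = dist.mean(dim=2, keepdim=True)            # row means
    col = dist.mean(dim=1, keepdim=True)            # column means
    grand = dist.mean(dim=(1, 2), keepdim=True)     # grand mean
    return dist - row - col + grand                 # double centering

def contrastive_bdc_loss(feat, labels, tau=0.1):
    """Supervised-contrastive-style loss over BDC similarities: same-class
    representations are pulled together, different-class ones pushed apart."""
    B = bdc_matrix(feat)
    B = B / B.flatten(1).norm(dim=1).clamp(min=1e-8).view(-1, 1, 1)
    sim = torch.einsum('ixy,jxy->ij', B, B) / tau   # Frobenius similarity
    sim.fill_diagonal_(float('-inf'))               # exclude self-pairs
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos.fill_diagonal_(False)                       # same-class mask, no self
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss[pos.any(1)].mean()
```

Minimizing this term pulls the BDC matrices of same-class images together while separating those of different classes, matching the pull-close, push-apart behavior described above.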
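For the meta-training stage, the query-to-support distance computation could look as follows. This sketch collapses each class's support set into a single prototype for brevity, whereas the method described above matches feature sets per image (SetFeat-style), so treat it as a minimal stand-in rather than the paper's procedure.

```python
import torch

def episode_logits(support, support_labels, query, n_way):
    """Score each query by its negative distance to class prototypes built
    from the support set (a simplification of set-to-set matching)."""
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(n_way)])   # (n_way, d)
    return -torch.cdist(query, protos)              # (n_query, n_way)

# Example: a 5-way 5-shot episode with 15 queries per class (dummy data).
d = 256
support = torch.randn(5 * 5, d)
support_labels = torch.arange(5).repeat_interleave(5)
query = torch.randn(5 * 15, d)
predictions = episode_logits(support, support_labels, query, n_way=5).argmax(dim=1)
```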
Results To verify the validity of the proposed feature extraction structure, 1-shot and 5-shot classification experiments are carried out on the standard few-shot datasets MiniImageNet, TieredImageNet, and CUB. Experimental results show that, on MiniImageNet, RireFeat achieves 0.64% and 1.10% higher accuracy than SetFeat in the 1-shot and 5-shot settings with a convolution-based backbone, and 1.51% and 1.46% higher with a ResNet12-based backbone. On CUB, RireFeat provides gains of 0.03% and 0.61% over SetFeat at 1-shot and 5-shot with the convolution-based backbone, and improvements of 0.66% and 0.75% with the ResNet12-based backbone. On TieredImageNet, the convolution-based backbone achieves improvements of 0.21% and 0.38% over SetFeat under the 1-shot and 5-shot settings.

Conclusion To obtain a rich, comprehensive, and accurate feature representation for few-shot image classification, this paper proposes the rich representation feature extractor (RireFeat). Different from traditional feature extractors and feature extraction schemes, RireFeat increases the flow of information inside the feature extraction network by attending to how features change before and after network stages, effectively reintegrating the category information lost during feature extraction back into the feature representation. In addition, contrastive learning combined with deep Brownian distance covariance is introduced into few-shot image classification to learn more category-related representations for each image; this captures more nuanced differences between images of different categories and improves classification performance. A feature-vector set is also extracted for each image to provide strong support for the subsequent classification task. The proposed method achieves high classification accuracy on the MiniImageNet, TieredImageNet, and CUB datasets. Moreover, the universality of the proposed method is verified with currently popular deep learning backbones, such as convolutional and residual backbones, highlighting its applicability to state-of-the-art models.
Keywords
