Self-knowledge distillation learning for fine-grained image classification

Zhang Rui1, Chen Yao1, Wang Jiabao1, Li Yang1, Zhang Xu2 (1. Army Engineering University of PLA; 2. Army Engineering University of PLA, Jiangsu Vocational Institute of Commerce)

Abstract
Objective Without the guidance of a teacher model, self-knowledge distillation allows a model to improve its performance by learning knowledge from itself. However, when applied to fine-grained image classification, such methods fail to effectively extract the features of discriminative image regions, which leads to unsatisfactory distillation results. To address this problem, we propose a self-knowledge distillation learning method for fine-grained image classification that fuses efficient channel attention. Method First, we introduce the efficient channel attention (ECA) module, design an ECA residual block, and construct the lightweight ECA-ResNet18 (residual network) backbone to better extract multi-scale features from the discriminative regions of an image. Second, we build an ECA-weighted bidirectional feature pyramid module, ECA-BiFPN (bidirectional feature pyramid network), to fuse features of different scales into more robust cross-scale features. Finally, we propose a multi-level feature knowledge distillation loss that lets the cross-scale features distill knowledge into the multi-scale features. Result On the three public datasets Caltech-UCSD Birds 200, Stanford Cars, and FGVC-Aircraft, the proposed method achieves classification accuracies of 76.04%, 91.11%, and 87.64%, which are 2.63%, 1.56%, and 3.66% higher, respectively, than the best of fifteen existing self-knowledge distillation methods. Conclusion The proposed method can efficiently extract the features of discriminative image regions and achieves better fine-grained image classification accuracy; its lightweight network model is suitable for edge-computing applications on embedded devices.
Self-knowledge distillation for fine-grained image classification

Zhang Rui1, Chen Yao1, Wang Jiabao1, Li Yang1, Zhang Xu2 (1. Army Engineering University of PLA; 2. Army Engineering University of PLA, Jiangsu Vocational Institute of Commerce)

Abstract
Objective Fine-grained image classification aims to divide a super-category into multiple sub-categories. The task is more challenging than general image classification because of subtle inter-class differences and large intra-class variations. The attention mechanism enables a model to focus on key areas of the input image and to attend to its discriminative regional features, which makes it theoretically well suited to fine-grained image classification; attention-based classification models also offer higher interpretability. To let models better focus on discriminative image regions, attention-based methods have been applied to fine-grained image classification. Although current attention-based fine-grained classification models achieve high accuracy, they do not adequately consider the number of parameters and the computational cost. As a result, they are difficult to deploy on low-resource devices, which greatly limits their practical application. Knowledge distillation transfers knowledge from a high-accuracy but parameter-heavy and computationally expensive teacher model to a small student model with few parameters and low computational cost, improving the small model's performance while reducing the cost of model learning. To further reduce this learning cost, researchers have proposed self-knowledge distillation, which, unlike traditional knowledge distillation, lets a model improve its performance by using its own knowledge rather than the guidance of a teacher network. However, such methods fall short on fine-grained image classification tasks because they do not effectively extract discriminative region features from images, which leads to unsatisfactory distillation results. To tackle this issue, we propose a self-knowledge distillation learning method for fine-grained image classification that fuses efficient channel attention (ECASKD).

Method The proposed method embeds an efficient channel attention mechanism into the self-knowledge distillation framework to effectively extract discriminative regional features of images. The framework comprises a self-knowledge distillation network, composed of a lightweight backbone and a self-teacher subnetwork, and a joint loss that combines a classification loss, a knowledge distillation loss, and a multi-level feature knowledge distillation loss. First, we introduce the efficient channel attention (ECA) module, design an ECA residual block, and construct the lightweight ECA-ResNet18 (residual network) backbone to better extract multi-scale features from discriminative regions of the input image. Compared with the residual block of the original ResNet18, the ECA residual block inserts an ECA module after each batch normalization operation, and two ECA residual blocks form one stage of the ECA-ResNet18 backbone. This strengthens the network's focus on discriminative regions of the image and facilitates the extraction of multi-scale features. Compared with the plain ResNet18 commonly used in self-knowledge distillation methods, the proposed backbone, built on the ECA residual block, significantly enhances the model's ability to extract multi-scale features while remaining lightweight and computationally efficient.
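To make the backbone design concrete, the following PyTorch sketch is a minimal illustration, not the authors' released code: layer widths, the exact ECA kernel rule, and the stage layout are assumptions. It shows an ECA module and a ResNet18-style basic block with ECA inserted after each batch normalization, with two such blocks forming one stage, as described above.

```python
# Minimal sketch (assumption: implementation details may differ from the paper).
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: a 1D conv over pooled channel descriptors, no dimensionality reduction."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Kernel size adapted to the channel count, following the original ECA-Net rule.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                     # global average pooling -> (N, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)   # 1D conv across channels
        w = torch.sigmoid(w)                       # channel weights in (0, 1)
        return x * w[:, :, None, None]             # re-weight the feature maps

class ECAResidualBlock(nn.Module):
    """ResNet18 basic block with an ECA module after each batch normalization."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.eca1 = ECA(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.eca2 = ECA(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.eca1(self.bn1(self.conv1(x)))
        out = self.relu(out)
        out = self.eca2(self.bn2(self.conv2(out)))
        return self.relu(out + identity)

# Two ECA residual blocks form one stage of the ECA-ResNet18 backbone, e.g.:
stage3 = nn.Sequential(ECAResidualBlock(128, 256, stride=2), ECAResidualBlock(256, 256))
```

Because ECA uses a single 1D convolution over pooled channel descriptors, the attention adds only a handful of parameters per block, which is consistent with the lightweight design goal stated above.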
Second, considering that features of different scales output by the backbone differ in importance, we design an efficient channel attention bidirectional feature pyramid network (ECA-BiFPN) block that weights channels during feature fusion so as to differentiate the contribution of features from different channels to the fine-grained classification task. Finally, we propose a multi-level feature knowledge distillation loss that strengthens the backbone's learning from the self-teacher subnetwork and its focus on discriminative regions.

Result The proposed method achieves classification accuracies of 76.04%, 91.11%, and 87.64% on three public datasets: Caltech-UCSD Birds 200 (CUB), Stanford Cars (CAR), and FGVC-Aircraft (AIR). To evaluate ECASKD comprehensively and objectively, it was compared with 15 other methods, covering data augmentation-based, auxiliary network-based, and attention-based self-knowledge distillation. Among the data augmentation-based methods, ECASKD improves over the state-of-the-art Self-Knowledge Distillation from image Mixture by 3.89%, 1.94%, and 4.69% on CUB, CAR, and AIR, respectively. Among the auxiliary network-based methods, ECASKD improves over the state-of-the-art combination of Distillation with Reverse Guidance and Distillation with Shape-wise Regularization by 6.17%, 4.93%, and 7.81%, respectively. Compared with methods that jointly use an auxiliary network and data augmentation, ECASKD improves on the best of them by 2.63%, 1.56%, and 3.66%, respectively; even without data augmentation, ECASKD thus achieves better fine-grained classification performance than these joint methods. Compared with the attention-based self-knowledge distillation method Self Attention Distillation, ECASKD improves by about 23.28%, 8.17%, and 14.02% on CUB, CAR, and AIR, respectively. In summary, ECASKD outperforms all three types of self-knowledge distillation methods and obtains better fine-grained image classification performance. Furthermore, we compare the proposed method with four mainstream models in terms of the number of parameters (Params), floating-point operations (FLOPs), and top-1 classification accuracy. Compared with ResNet18, the ECA-ResNet18 backbone used in the proposed method significantly improves classification accuracy at a cost of only 0.4M additional Params and 0.2G additional FLOPs. Compared with the larger ResNet50, the proposed method uses less than half the parameters and computation, yet its accuracy on the CAR dataset is only 0.6% lower. Compared with the much larger ViT-Base (Vision Transformer) and Swin-Transformer-B, the proposed method uses about one-eighth of their parameters and computation, and its accuracies on the CAR and AIR datasets are 3.7% and 5.3% lower than those of the better of the two, Swin-Transformer-B. These results demonstrate that ECASKD substantially improves classification accuracy with only a small increase in model complexity.

Conclusion The proposed self-knowledge distillation method for fine-grained image classification achieves good performance with 11.9M Params and 2.0G FLOPs, and its lightweight network model is suitable for edge-computing applications on embedded devices.
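As a rough illustration of the remaining two components, the sketch below shows one ECA-weighted BiFPN fusion node and a joint loss combining classification, logit distillation, and multi-level feature distillation. The fusion topology, the feature-distillation term (MSE here), and the loss weights and temperature are illustrative assumptions; the abstract does not specify them.

```python
# Minimal PyTorch sketch under stated assumptions (fusion topology, loss form,
# and hyper-parameters alpha, beta, T are illustrative, not the paper's values).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECA(nn.Module):
    """Same ECA module as in the previous sketch (repeated so this snippet runs standalone)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = torch.sigmoid(self.conv(x.mean(dim=(2, 3)).unsqueeze(1)).squeeze(1))
        return x * w[:, :, None, None]

class ECABiFPNNode(nn.Module):
    """One ECA-BiFPN fusion node: resize, sum two pyramid levels, then re-weight channels with ECA."""
    def __init__(self, channels):
        super().__init__()
        self.eca = ECA(channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, feat, feat_other):
        # Bring the neighbouring level to this level's spatial size before fusion.
        feat_other = F.interpolate(feat_other, size=feat.shape[-2:], mode="nearest")
        return self.conv(self.eca(feat + feat_other))

def joint_loss(logits_s, logits_t, feats_s, feats_t, labels, T=4.0, alpha=1.0, beta=1.0):
    """Classification loss + logit distillation + multi-level feature distillation.

    logits_s / feats_s come from the ECA-ResNet18 backbone; logits_t / feats_t come
    from the self-teacher subnetwork built on the ECA-BiFPN cross-scale features.
    """
    ce = F.cross_entropy(logits_s, labels)
    kd = F.kl_div(F.log_softmax(logits_s / T, dim=1),
                  F.softmax(logits_t.detach() / T, dim=1),
                  reduction="batchmean") * T * T
    # Cross-scale (self-teacher) features supervise the backbone's multi-scale features.
    feat_kd = sum(F.mse_loss(fs, ft.detach()) for fs, ft in zip(feats_s, feats_t))
    return ce + alpha * kd + beta * feat_kd
```

In this sketch the self-teacher outputs are detached, so the distillation terms only update the backbone; this is one common way to realize the one-directional teacher-to-student flow described above.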
