Knowledge distillation with multi-level feature fusion and dual-teacher collaboration

Wang Shuo (School of Computer Science and Engineering, Tianjin University of Technology)

Abstract
Objective Knowledge distillation aims to transfer the knowledge of a high-performing teacher model with a large number of parameters to a lightweight student model without degrading the original model's performance. In image classification, most previous distillation methods focus on extracting global information and neglect the importance of local information. Moreover, these methods are mostly built around a single-teacher architecture and overlook the potential of a student learning from multiple teachers simultaneously. Therefore, this paper proposes a dual-teacher collaborative knowledge distillation framework that fuses global and local features. Method First, a randomly initialized teacher (the scratch teacher) is trained synchronously with the student on global information, and its intermediate global outputs are used to progressively guide the student toward the teacher's final predictions along an optimal path. At the same time, a pre-trained teacher (the expert teacher) is introduced to process local information. The expert teacher separates its local feature outputs into source-class knowledge and other-class knowledge and transfers them to the student separately, providing more comprehensive supervision. Result Experiments are conducted on the CIFAR-100 and Tiny-ImageNet datasets and compared with other distillation methods. On CIFAR-100, compared with the recent NKD, the average classification accuracy improves by 0.63% and 1.00% for teacher-student pairs with the same and different architectures, respectively. On Tiny-ImageNet, with ResNet34 as the teacher and MobileNetV1 as the student, the classification accuracy is 1.09% higher than SRRL and 1.06% higher than NKD. Ablation studies and visualization analyses on CIFAR-100 further verify the effectiveness of the proposed method. Conclusion The proposed dual-teacher collaborative knowledge distillation framework fuses global and local features and separates the model's output responses into source-class knowledge and other-class knowledge, which are transferred to the student separately, enabling the student model to achieve higher image classification accuracy.
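As a rough sketch of the training scheme just described: the global view is the original image, the local view is a random crop (the English abstract below specifies 40%-70% of the original area), the scratch teacher is updated together with the student on the global view, and the frozen pre-trained expert teacher supervises the student on the local view. All module names, loss weights, crop size, and the use of torchvision's RandomResizedCrop are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Local view: a random crop covering 40%-70% of the original image area
# (the global view is the full image). The 32x32 output size targets CIFAR-100
# and is a placeholder, as is the choice of RandomResizedCrop itself.
local_crop = transforms.RandomResizedCrop(32, scale=(0.4, 0.7))

def kd_kl(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL distillation term (placeholder hyper-parameters)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

def train_step(student, scratch_teacher, expert_teacher, images, labels,
               opt_student, opt_scratch, lambda_g=1.0, lambda_l=1.0):
    """One illustrative update: the expert teacher is frozen, while the scratch
    teacher is trained from scratch alongside the student on the global view."""
    global_view = images
    local_view = torch.stack([local_crop(img) for img in images])

    # Scratch teacher and student both process the global view.
    t_global = scratch_teacher(global_view)
    s_global = student(global_view)
    # The frozen expert teacher processes the local view.
    with torch.no_grad():
        t_local = expert_teacher(local_view)
    s_local = student(local_view)

    loss_student = (
        F.cross_entropy(s_global, labels)
        # Follow the scratch teacher's evolving (intermediate) predictions.
        + lambda_g * kd_kl(s_global, t_global.detach())
        # Local-view supervision from the expert teacher; in the proposed method
        # this term is further split into source-class and other-class knowledge.
        + lambda_l * kd_kl(s_local, t_local)
    )
    loss_scratch = F.cross_entropy(t_global, labels)

    opt_student.zero_grad()
    loss_student.backward()
    opt_student.step()

    opt_scratch.zero_grad()
    loss_scratch.backward()
    opt_scratch.step()
    return loss_student.item(), loss_scratch.item()
```

Note that the scratch teacher's logits are detached in the student's loss, so the student follows the scratch teacher's current outputs while the scratch teacher itself is trained only on the classification objective; this detail is an assumption of the sketch.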
Keywords
Knowledge distillation with multi-level feature fusion and dual-teacher collaboration

Wang Shuo (School of Computer Science and Engineering, Tianjin University of Technology)

Abstract
Objective Knowledge distillation aims to transfer the knowledge of a high-performing teacher model with a large number of parameters to a lightweight student model, improving the student's performance without affecting that of the original model. Previous research on knowledge distillation has mostly focused on distillation from a single teacher to a single student, neglecting the potential for the student to learn from multiple teachers simultaneously. Multi-teacher distillation allows the student model to synthesize the knowledge of several teacher models and thereby improve its expressive ability, yet this setting has received comparatively little attention. In addition, most existing knowledge distillation methods focus only on the global information of the image and ignore the importance of spatially local information. In image classification, local information refers to the features and details of specific regions in the image, including textures, shapes, and boundaries, which play a crucial role in distinguishing images of different categories. The teacher network can distinguish local regions based on these details and make accurate predictions for similar appearances in different categories, whereas the student network may fail to do so. In response to the above issues, this paper proposes a knowledge distillation method based on global-local dual-teacher collaboration, which integrates global and local information and effectively improves the classification accuracy of the student model. Method The original input image is first represented as a global and a local image view. The original image (global view) is randomly cropped, with the cropped region constrained to 40%-70% of the original image area, to obtain the local input (local view). A randomly initialized teacher (the scratch teacher) is then trained synchronously with the student on the global information, and its intermediate global outputs are used to gradually guide the student toward the teacher's final predictions along an optimal path. Meanwhile, a pre-trained teacher (the expert teacher) is introduced to process the local information. The proposed method thus uses a dual-teacher distillation architecture to jointly train the student network while integrating global and local features. On the one hand, the scratch teacher is trained from scratch together with the student on the global information. With the scratch teacher, the student no longer imitates only the final, smooth outputs of a pre-trained model (the expert teacher); instead, the scratch teacher's intermediate outputs progressively guide the student, forcing it to approach the final, more accurate output logits along an optimal path. During training, the student model therefore obtains not only the discrepancy between the target and the scratch teacher's outputs but also a feasible path toward the final goal, provided by a more complex model with stronger learning ability. On the other hand, the expert teacher processes the local information and separates its local feature outputs into source-class knowledge and other-class knowledge. Through this kind of collaborative teaching, the student model reaches an optimum at which its performance is close to that of the teacher model.
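The separation of the expert teacher's outputs into source-class (target) and other-class knowledge can be illustrated with a short sketch. The abstract does not give the exact formulation, so the decomposition below follows the common target / non-target split used in decoupled logit distillation; the temperature and the two loss weights are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def decoupled_kd_loss(student_logits, teacher_logits, target, T=4.0, alpha=1.0, beta=1.0):
    """Illustrative split of the distillation signal into source-class (target)
    and other-class knowledge, transferred to the student with separate KL terms."""
    num_classes = student_logits.size(1)
    target_mask = F.one_hot(target, num_classes).bool()

    # Source-class knowledge: how much probability mass falls on the ground-truth
    # class, expressed as a binary (target vs. rest) distribution.
    s_prob = F.softmax(student_logits / T, dim=1)
    t_prob = F.softmax(teacher_logits / T, dim=1)
    s_t = s_prob[target_mask]                      # shape: (batch,)
    t_t = t_prob[target_mask]
    s_bin = torch.stack([s_t, 1.0 - s_t], dim=1).clamp_min(1e-8)
    t_bin = torch.stack([t_t, 1.0 - t_t], dim=1)
    source_loss = F.kl_div(s_bin.log(), t_bin, reduction="batchmean")

    # Other-class knowledge: the distribution over the remaining classes only,
    # obtained by dropping the ground-truth logit before the softmax.
    s_other = student_logits[~target_mask].view(-1, num_classes - 1)
    t_other = teacher_logits[~target_mask].view(-1, num_classes - 1)
    other_loss = F.kl_div(
        F.log_softmax(s_other / T, dim=1),
        F.softmax(t_other / T, dim=1),
        reduction="batchmean",
    )

    # The two kinds of knowledge are transferred to the student separately.
    return alpha * source_loss + beta * other_loss
```

In this sketch the weights alpha and beta control how strongly the source-class and other-class terms contribute; the paper's actual weighting and normalization may differ.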
Result We compare the proposed method with other knowledge distillation methods in the field of image classification. The experiments are conducted on CIFAR-100 and Tiny-ImageNet, with image classification accuracy as the evaluation metric. On CIFAR-100, compared with the best-performing feature-based distillation baseline SemCKD, the average accuracy of the proposed method increases by 0.62% when the teacher and student share the same architecture and by 0.89% when their architectures differ. Compared with the state-of-the-art response-based distillation method NKD, the average classification accuracy increases by 0.63% and 1.00% in the homogeneous and heterogeneous teacher-student settings, respectively. On Tiny-ImageNet, with ResNet34 as the teacher and ResNet18 as the student, the proposed method reaches a test accuracy of 68.86%, which is 0.74% higher than NKD and better than all listed competitors. Under the other teacher-student architecture combinations, the proposed method likewise achieves the highest classification accuracy. In addition, ablation experiments and visual analyses on CIFAR-100 clearly demonstrate the effectiveness of the proposed method. Conclusion This paper proposes a dual-teacher collaborative knowledge distillation framework that integrates global and local information and separates the output responses into source-class knowledge and other-class knowledge, which are transferred to the student separately. Experimental results show that the proposed method outperforms several state-of-the-art knowledge distillation methods and significantly improves the performance of the student model in image classification.
Keywords
