Spatial-spectral model distillation network for hyperspectral scene classification

Xue Jie; Huang Hong; Pu Chunyu; Yang Yinming; Li Yuan; Liu Yingxu

Published: 2024-01-08
DOI: 10.11834/jig.230699


Xue Jie, Huang Hong, Pu Chunyu, Yang Yinming, Li Yuan, Liu Yingxu (Chongqing University)

Abstract

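Objective Existing scene classification methods mainly target high-spatial-resolution images, which carry very limited spectral information, and existing convolutional neural network (CNN)-based methods fail to capture long-range contextual information because convolution is a local operation. To address these problems, this paper proposes a Spatial-Spectral Model Distillation Network for Hyperspectral Scene Classification (SSMD).
Method A complex teacher model guides a small student model to perform hyperspectral image scene classification. The teacher model is a ViT method based on spatial-spectral attention (SSViT). Its spatial-spectral attention mechanism probes the spectral information of different categories and classifies ground objects at a fine level by looking for differences between their spectra. The global features among samples obtained by the teacher model SSViT guide the student model VGG16 to capture the long-range dependencies of complex scenes, and knowledge distillation is introduced so that the teacher and student models cooperate to extract discriminative ground-object features for scene classification.
Result The method was compared with 10 classification methods (5 traditional CNN classification methods and 5 recent scene classification methods) on 3 datasets. Taking both time cost and classification accuracy into account, the proposed method leads by varying margins on the different datasets. On the OHID-SC, OHS-SC and HSRS-SC datasets, its classification accuracy is 15.1%, 2.9% and 0.74% higher, respectively, than that of the second-best model. Comparative experiments on the OHID-SC dataset also confirm that the proposed algorithm effectively improves hyperspectral scene classification accuracy.
Conclusion The proposed SSMD network not only makes effective use of the target spectral information in hyperspectral data but also explores the relationship between global and local features, combining the strengths of traditional models and deep learning models to produce more accurate classification results.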

Keywords

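hyperspectral scene classification; convolutional neural network; Transformer; spatial-spectral joint self-attention mechanism; knowledge distillation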

Spatial-spectral model distillation network for hyperspectral scene classification

Xue Jie, Huang Hong, Pu Chunyu, Yang Yinming, Li Yuan, Liu Yingxu (Chongqing University)

Abstract

Objective In recent years, advances in remote sensing technology have made abundant remote sensing images and large datasets available. Scene classification, one of the hotspots of remote sensing research, assigns a fixed semantic label to each scene image so that images with similar scene features can be distinguished and grouped. The methods proposed so far fall into handcrafted feature-based methods and deep learning-based methods. Handcrafted feature-based methods are limited in describing scene semantics because they place high demands on the feature descriptors. Deep learning-based methods, by contrast, have shown powerful feature extraction capabilities and are widely applied to remote sensing scene classification. However, current scene classification methods mainly focus on high spatial resolution remote sensing images, which are mostly three-channel images with limited spectral information. This limitation often leads to confusion and misclassification between categories that look alike in geometric structure, texture and color. Integrating spectral information to improve the accuracy of scene classification has therefore become an important research direction. Existing methods still have shortcomings. Convolutional operations are translation invariant and sensitive to local information, which makes it difficult for them to capture long-range contextual information. Transformer methods can extract long-range dependency information but have limited capability in learning local information. Moreover, combining CNN and Transformer methods incurs high computational complexity, which makes it hard to balance inference efficiency and classification accuracy. To address these issues, this paper proposes a hyperspectral scene classification method called the Spatial-Spectral Model Distillation Network (SSMD).
Method In this study, we use spectral information to improve the accuracy of scene classification and overcome the limitations of existing methods. First, to fully exploit the spectral information of hyperspectral images, we propose SSViT, which integrates spectral information into the Transformer architecture through a spatial-spectral joint self-attention mechanism. By exploring the intrinsic relationships between pixels and between spectra, SSViT extracts richer features. Within the spatial-spectral joint mechanism, SSViT uses the spectral information of different categories to identify the differences between them, enabling fine-grained classification of land cover and improving the accuracy of scene classification. Second, to further enhance classification performance, we introduce knowledge distillation. In the teacher-student framework, SSViT is used as the teacher model, and a pre-trained VGG16 is used as the student model to capture contextual information of complex scenes. The teacher model extracts spectral information and global features among samples, while the student model focuses on capturing local features. The student model learns and mimics the prior knowledge of the teacher model, which improves the student model's discriminative ability. Joint training of the teacher and student models enables comprehensive extraction of land cover features, thus improving the accuracy of scene classification. Specifically, the image is divided into 64 image patches in the spatial dimension and 32 spectral bands in the spectral dimension. Each patch and each band is regarded as a token. Each patch and each band is flattened into a row vector and mapped to a specific dimension through a Linear layer. Learnable vectors are concatenated with the embedded samples and used for the teacher model's final image classification prediction.
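The tokenization step described above (64 spatial patches, 32 band tokens, flattening, then a linear mapping) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the 64x64x32 cube size, the 8x8 patch grid, the embedding dimension of 16 and the random projection weights are all assumptions made for the example, since the abstract does not state them.

```python
import numpy as np

def spatial_tokens(image, grid=8):
    """Split an (H, W, B) cube into grid*grid patches, each flattened to a row vector."""
    h, w, b = image.shape
    ph, pw = h // grid, w // grid  # patch height and width
    # (H, W, B) -> (grid, ph, grid, pw, B) -> (grid, grid, ph, pw, B)
    patches = image[:grid * ph, :grid * pw].reshape(grid, ph, grid, pw, b)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid * grid, ph * pw * b)

def spectral_tokens(image):
    """Treat every band as one token: (H, W, B) -> (B, H*W)."""
    h, w, b = image.shape
    return image.reshape(h * w, b).T

def linear_embed(tokens, weight, bias):
    """Stand-in for the Linear layer: map each flattened token to the embedding size."""
    return tokens @ weight + bias

rng = np.random.default_rng(0)
cube = rng.random((64, 64, 32))  # hypothetical hyperspectral image, 32 bands
dim = 16                         # hypothetical embedding dimension

spa = spatial_tokens(cube)       # 64 tokens of length 8*8*32 = 2048
spe = spectral_tokens(cube)      # 32 tokens of length 64*64 = 4096

# Random weights stand in for parameters that would normally be learned.
spa_emb = linear_embed(spa, rng.normal(size=(spa.shape[1], dim)) * 0.02, np.zeros(dim))
spe_emb = linear_embed(spe, rng.normal(size=(spe.shape[1], dim)) * 0.02, np.zeros(dim))

print(spa.shape, spe.shape, spa_emb.shape, spe_emb.shape)
```

The two token sets are kept separate here because the abstract does not say how the spatial and spectral sequences are combined inside the attention block.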
A position vector is generated and directly concatenated with the tokens described above to form the input to the Transformer. The multi-head attention mechanism outputs encoded representations that contain information from different subspaces and model global contextual information, thereby improving the model's representation capacity and learning effectiveness. Finally, features are integrated through a multi-layer perceptron and a classification layer to produce the classification. The process of knowledge distillation consists of two stages. The first stage optimizes the teacher and student models by minimizing the loss function with distillation coefficients. In the second stage, the student model is further adjusted using the loss function: supervision from the well-performing complex model (the teacher model) is used to train the simpler model (the student model), aiming for higher accuracy and better classification performance. This training mode provides the student model with more informative content, allowing it to learn the generalization ability of the teacher model directly.
Result We compared our model with 10 models, comprising 5 traditional CNN classification methods and 5 recent scene classification methods, on 3 public datasets, namely OHID-SC, OHS-SC and HSRS-SC. The quantitative evaluation metrics were overall accuracy (OA), standard deviation (STD) and the confusion matrix, and confusion matrices on the three datasets are provided to display the classification results of the proposed algorithm clearly. The experimental results show that our model outperforms all other methods on the OHID-SC, OHS-SC and HSRS-SC datasets, improving classification accuracy by 15.1%, 2.9% and 0.74%, respectively, over the second-best model. Comparative experiments on the OHID-SC dataset further show that the proposed algorithm effectively improves the classification accuracy of hyperspectral scenes.
Conclusion The proposed SSMD network not only makes effective use of the target spectral information in hyperspectral data but also explores the relationship between global and local features, combining the advantages of traditional models and deep learning models to make the classification results more accurate.
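The distillation stage described in the Method paragraph can be illustrated with a short sketch. The abstract does not give the loss formula, so the code below assumes the widely used temperature-softened formulation (cross-entropy on the true label plus a KL term between softened teacher and student outputs) and should not be read as the paper's exact loss. The names `alpha` and `temperature` are hypothetical placeholders for the "distillation coefficients", and the logits are made-up numbers.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Hard-label loss: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    """Assumed loss: alpha * hard-label term + (1 - alpha) * T^2 * soft-label term."""
    hard = cross_entropy(student_logits, label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft

# Made-up logits for a 3-class scene problem; class 0 is the true label.
teacher = [2.0, 0.5, -1.0]
student_close = [1.8, 0.6, -0.9]   # student output that resembles the teacher
student_far = [-1.0, 0.5, 2.0]     # student output that disagrees with the teacher

loss_close = distillation_loss(student_close, teacher, label=0)
loss_far = distillation_loss(student_far, teacher, label=0)
print(loss_close < loss_far)
```

Scaling the soft term by the squared temperature is a common convention that keeps its gradient magnitude comparable to the hard-label term. Whether SSMD uses it is not stated in the abstract.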

Keywords

hyperspectral scene classification; convolutional neural network; Transformer; spatial-spectral joint self-attention mechanism; knowledge distillation
