用于多光谱和高光谱图像融合的联合自注意力Transformer
Joint self-attention Transformer for multispectral and hyperspectral image fusion
- 2023年28卷第12期 页码:3922-3934
纸质出版日期: 2023-12-16
DOI: 10.11834/jig.220954
李妙宇, 付莹. 2023. 用于多光谱和高光谱图像融合的联合自注意力Transformer. 中国图象图形学报, 28(12):3922-3934
Li Miaoyu, Fu Ying. 2023. Joint self-attention Transformer for multispectral and hyperspectral image fusion. Journal of Image and Graphics, 28(12):3922-3934
目的
将高光谱图像和多光谱图像进行融合,可以获得具有高空间分辨率和高光谱分辨率的光谱图像,提升光谱图像的质量。现有的基于深度学习的融合方法虽然表现良好,但缺乏对多源图像特征中光谱和空间长距离依赖关系的联合探索。为有效利用图像的光谱相关性和空间相似性,提出一种联合自注意力的Transformer网络来实现多光谱和高光谱图像融合超分辨。
方法
首先利用联合自注意力模块,通过光谱注意力机制提取高光谱图像的光谱相关性特征,通过空间注意力机制提取多光谱图像的空间相似性特征,将获得的联合相似性特征用于指导高光谱图像和多光谱图像的融合;随后,将得到的融合特征输入到基于滑动窗口的残差Transformer深度网络中,探索融合特征的长距离依赖信息,学习深度先验融合知识;最后,特征通过卷积层映射为高空间分辨率的高光谱图像。
结果
在CAVE和Harvard光谱数据集上分别进行了不同采样倍率下的实验,实验结果表明,与对比方法相比,本文方法从定量指标和视觉效果上,都取得了更好的效果。本文方法相较于性能第二的方法EDBIN(enhanced deep blind iterative network),在CAVE数据集上峰值信噪比提高了0.5 dB,在Harvard数据集上峰值信噪比提高了0.6 dB。
结论
本文方法能够更好地融合光谱信息和空间信息,显著提升高光谱融合超分图像的质量。
Objective
Hyperspectral images (HSIs) contain rich spectral information and have advantages over multispectral images (MSIs) in accurately distinguishing different types of materials. Therefore, HSIs have been widely used in many computer vision tasks, including vegetation detection, face recognition, and feature segmentation. However, due to limitations in hardware equipment and the acquisition environment, an inevitable trade-off arises between spatial resolution and spectral resolution. Thus, HSIs captured in real scenes often have low spatial resolution, which negatively affects the performance of subsequent vision tasks. By fusing the low-resolution HSI (LR-HSI) with a high-resolution MSI (HR-MSI) of the same scene using an HSI super-resolution algorithm, the spatial resolution of HSIs can be effectively improved. Existing HSI fusion algorithms can be roughly classified into traditional-model-based and deep-learning-based methods. Traditional-model-based fusion methods employ various handcrafted shallow priors (e.g., matrix/tensor factorization, total variation, and low rank) to exploit the intrinsic statistics of observed spectral images. However, these methods lack generalization ability to complex real scenarios and consume much time in iteratively optimizing the designed priors. Meanwhile, deep-learning-based fusion methods can automatically learn prior knowledge from large-scale datasets. Although these methods often achieve better fusion results than traditional-model-based fusion methods, they do not jointly explore the inner self-similarity of multi-source spectral images, where the LR-HSI shows high correlation in the spectral dimension and the HR-MSI shows spatial similarities in texture and edges. In addition, the weights of these convolution-based networks are learned during training but fixed during testing, hence limiting the potential adaptability of the networks. To effectively exploit the inner spatial and spectral similarity of spectral images, we propose an MSI and HSI fusion network with a joint self-attention fusion module and a Transformer.
Method
Given that the LR-HSI has reliable information in the spectral dimension, the critical task of an HSI fusion method is to fill in the missing texture details in the spatial dimension without losing discriminable spectral information. Given the LR-HSI and its matching HR-MSI, our proposed method fuses these two spectral images to obtain the desired HR-HSI in three steps. First, the similarity information of the LR-HSI and HR-MSI is extracted by the joint self-attention module. Specifically, the spectral similarity features of the LR-HSI are extracted by the channel attention module, and the spatial similarity features of the HR-MSI are extracted by the spatial attention module. The obtained similarity features are then used to guide the fusion process. Second, to achieve a deep representation and explore the long-range dependencies of the fusion features, the preliminary fusion features are fed into the deep Transformer network, which comprises a shifted-window attention module, LayerNorm, and a multilayer perceptron. Convolution layers and skip connections are also included in the proposed Transformer fusion network to further enhance model flexibility. Third, the fusion features from the Transformer are mapped to the desired high-resolution HSI. The overall network is implemented in the PyTorch framework and trained in an end-to-end manner. To generate training data, the training images are cropped to the size of 96 × 96 × 31, resulting in approximately 8 000 training patches; these patches are smoothed by a Gaussian blur kernel and spatially down-sampled to obtain the LR-HSIs. The MSIs are generated by applying the spectral response function of a Nikon D700 camera.
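The first step above, in which spectral attention from the LR-HSI and spatial attention from the HR-MSI jointly guide the fusion, can be sketched roughly as follows. This is an illustrative PyTorch sketch under stated assumptions, not the authors' exact architecture: the layer sizes, the squeeze-and-excitation-style channel attention, and the single-convolution spatial attention are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class JointSelfAttentionFusion(nn.Module):
    """Illustrative sketch of joint self-attention guided fusion:
    channel (spectral) attention is computed from the upsampled LR-HSI,
    spatial attention from the HR-MSI, and both reweight the
    concatenated features before a fusion convolution."""

    def __init__(self, hsi_bands=31, msi_bands=3, feats=64):
        super().__init__()
        c = hsi_bands + msi_bands
        # spectral attention from globally pooled HSI statistics (assumed form)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hsi_bands, hsi_bands // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hsi_bands // 4, c, 1), nn.Sigmoid(),
        )
        # spatial attention from MSI texture/edge statistics (assumed form)
        self.spatial_att = nn.Sequential(
            nn.Conv2d(msi_bands, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(c, feats, 3, padding=1)

    def forward(self, hsi_up, msi):
        # hsi_up: LR-HSI upsampled to MSI resolution, shape (B, 31, H, W)
        # msi:    HR-MSI, shape (B, 3, H, W)
        x = torch.cat([hsi_up, msi], dim=1)   # (B, 34, H, W)
        x = x * self.channel_att(hsi_up)      # spectral guidance from HSI
        x = x * self.spatial_att(msi)         # spatial guidance from MSI
        return self.fuse(x)                   # preliminary fusion features

feat = JointSelfAttentionFusion()(torch.rand(1, 31, 96, 96),
                                  torch.rand(1, 3, 96, 96))
print(tuple(feat.shape))  # (1, 64, 96, 96)
```

In the paper, these preliminary fusion features would then enter the shifted-window Transformer stage rather than being mapped directly to the output.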
Result
We compare our method with seven state-of-the-art fusion methods, including one traditional-model-based method and six deep-learning-based methods. The peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), erreur relative globale adimensionnelle de synthèse (ERGAS), and spectral angle mapper (SAM) are utilized as quantitative metrics in evaluating the performance of these fusion methods. To verify the effectiveness of the proposed model, we perform experiments on two widely used HSI datasets, namely, the CAVE and Harvard datasets. For the CAVE dataset, the first 20 images are selected for training, and the last 12 images are used for testing. Similarly, for the Harvard dataset, the first 30 images are selected for training, and the last 20 images are used for testing. Experimental results under different scale factors show that the proposed method achieves better fusion results in terms of quantitative metrics and visual effects compared to the other state-of-the-art methods. Under a scale factor of 8, the PSNR, SAM, and ERGAS of the proposed method are improved by 0.5 dB, 0.13, and 0.2, respectively, compared to EDBIN, which is the second-best-performing method on the CAVE dataset. Under a scale factor of 16, the PSNR of the proposed method is improved by at least 0.4 dB compared to the other methods on the Harvard dataset. The visual results show that our proposed method outperforms the other methods in recovering both fine-grained spatial textures and spectral details. The ablation study also proves that the employed Transformer fusion network significantly improves the fusion process.
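Two of the quantitative metrics above have compact standard definitions: PSNR measures pixel-wise fidelity over the whole cube, and SAM measures the average angle between reference and estimated spectra at each pixel. A minimal NumPy sketch of these standard formulas (the averaging conventions and `data_range` are assumptions; evaluation protocols vary across papers):

```python
import numpy as np

def psnr(ref, est, data_range=1.0):
    """Peak signal-to-noise ratio in dB over the whole HSI cube."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def sam(ref, est, eps=1e-8):
    """Spectral angle mapper in degrees, averaged over all pixels.
    ref, est: (H, W, B) hyperspectral cubes."""
    r = ref.reshape(-1, ref.shape[-1])
    e = est.reshape(-1, est.shape[-1])
    cos = np.sum(r * e, axis=1) / (
        np.linalg.norm(r, axis=1) * np.linalg.norm(e, axis=1) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()

x = np.random.rand(8, 8, 31)
print(round(psnr(x, x + 0.01), 1))  # 40.0 (uniform +0.01 error)
print(sam(x, x) < 0.1)              # angle is ~0 for identical cubes
```

Higher PSNR and lower SAM indicate better reconstruction, which is the direction of the improvements reported for the proposed method.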
Conclusion
In this paper, we propose a Transformer-based MSI and HSI fusion network with a joint self-attention fusion module, which can effectively utilize the spectral similarity of LR-HSI and the spatial similarity of HR-MSI to guide the fusion process through a 2D attention mechanism. The preliminary fusion results pass through the residual Transformer network to obtain a deep feature representation and to reconstruct the desired HR-HSI. Qualitative and quantitative experiments show that the proposed method has better spectral fidelity and spatial resolution compared to the state-of-the-art HSI fusion methods.
超分辨率；高光谱图像；多光谱图像；联合自注意力；Transformer；融合算法
super-resolution; hyperspectral images; multispectral images; joint self-attention; Transformer; fusion method
Adão T, Hruška J, Pádua L, Bessa J, Peres E, Morais R and Sousa J J. 2017. Hyperspectral imaging: a review on UAV-based sensors, data processing and applications for agriculture and forestry. Remote Sensing, 9(11): #1110 [DOI: 10.3390/rs9111110]
Akhtar N, Shafait F and Mian A. 2015. Bayesian sparse representation for hyperspectral image super resolution//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3631-3640 [DOI: 10.1109/CVPR.2015.7298986]
Chen D J, Hsieh H Y and Liu T L. 2021. Adaptive image Transformer for one-shot object detection//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 12242-12251 [DOI: 10.1109/CVPR46437.2021.01207]
Dong W S, Fu F Z, Shi G M, Cao X, Wu J J, Li G Y and Li X. 2016. Hyperspectral image super-resolution via non-negative structured sparse representation. IEEE Transactions on Image Processing, 25(5): 2337-2352 [DOI: 10.1109/TIP.2016.2542360]
Dong W S, Zhou C, Wu F F, Wu J J, Shi G M and Li X. 2021. Model-guided deep hyperspectral image super-resolution. IEEE Transactions on Image Processing, 30: 5754-5768 [DOI: 10.1109/TIP.2021.3078058]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2023-09-10]. http://arxiv.org/pdf/2010.11929.pdf
Han X H, Shi B X and Zheng Y Q. 2018. Self-similarity constrained sparse representation for hyperspectral image super-resolution. IEEE Transactions on Image Processing, 27(11): 5625-5637 [DOI: 10.1109/TIP.2018.2855418]
Hu J F, Huang T Z, Deng L J, Jiang T X, Vivone G and Chanussot J. 2021. Hyperspectral image super-resolution via deep spatiospectral attention convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems, 33(12): 7251-7265 [DOI: 10.1109/TNNLS.2021.3084682]
Kawakami R, Matsushita Y, Wright J, Ben-Ezra M, Tai Y W and Ikeuchi K. 2011. High-resolution hyperspectral imaging via matrix factorization//Proceedings of the CVPR 2011. Colorado Springs, USA: IEEE: 2329-2336 [DOI: 10.1109/CVPR.2011.5995457]
Kingma D P and Ba J. 2017. Adam: a method for stochastic optimization [EB/OL]. [2022-05-25]. http://arxiv.org/pdf/1412.6980.pdf
Li K, Dai D X and Van Gool L. 2022. Hyperspectral image super-resolution with RGB image super-resolution as an auxiliary task//Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 4039-4048 [DOI: 10.1109/WACV51458.2022.00409]
Li W, Du Q and Zhang B. 2015. Combined sparse and collaborative representation for hyperspectral target detection. Pattern Recognition, 48(12): 3904-3916 [DOI: 10.1016/j.patcog.2015.05.024]
Li X S, Zhang Y Q, Ge Z X, Cao G, Shi H and Fu P. 2021. Adaptive nonnegative sparse representation for hyperspectral image super-resolution. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14: 4267-4283 [DOI: 10.1109/JSTARS.2021.3072044]
Liang J, Zhou J, Bai X and Qian Y T. 2013. Salient object detection in hyperspectral imagery//Proceedings of 2013 IEEE International Conference on Image Processing. Melbourne, Australia: IEEE: 2393-2397 [DOI: 10.1109/ICIP.2013.6738493]
Liang J Y, Cao J Z, Sun G L, Zhang K, Van Gool L and Timofte R. 2021. SwinIR: image restoration using Swin Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 1833-1844 [DOI: 10.1109/ICCVW54120.2021.00210]
Liu J J, Wu Z B, Xiao L, Sun J and Yan H. 2020. A truncated matrix decomposition for hyperspectral image super-resolution. IEEE Transactions on Image Processing, 29: 8028-8042 [DOI: 10.1109/TIP.2020.3009830]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Peng Y D, Li W S, Luo X B and Du J. 2021. Hyperspectral image superresolution using global gradient sparse and nonlocal low-rank tensor decomposition with hyper-Laplacian prior. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14: 5453-5469 [DOI: 10.1109/JSTARS.2021.3076170]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Simões M, Bioucas-Dias J, Almeida L B and Chanussot J. 2015. A convex formulation for hyperspectral image superresolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing, 53(6): 3373-3388 [DOI: 10.1109/TGRS.2014.2375320]
Uzair M, Mahmood A and Mian A. 2015. Hyperspectral face recognition with spatiospectral information fusion and PLS regression. IEEE Transactions on Image Processing, 24(3): 1127-1137 [DOI: 10.1109/TIP.2015.2393057]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang W, Fu X Y, Zeng W H, Sun L Y, Zhan R H, Huang Y and Ding X H. 2021a. Enhanced deep blind hyperspectral image fusion. IEEE Transactions on Neural Networks and Learning Systems, 34(3): 1513-1523 [DOI: 10.1109/TNNLS.2021.3105543]
Wang W H, Xie E Z, Li X, Fan D P, Song K T, Liang D, Lu T, Luo P and Shao L. 2021b. Pyramid vision Transformer: a versatile backbone for dense prediction without convolutions//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 548-558 [DOI: 10.1109/ICCV48922.2021.00061]
Wang Y, Chen X, Han Z and He S Y. 2017. Hyperspectral image super-resolution via nonlocal low-rank tensor approximation and total variation regularization. Remote Sensing, 9(12): #1286 [DOI: 10.3390/rs9121286]
Wang Z D, Cun X, Bao J M, Zhou W G, Liu J Z and Li H Q. 2022. Uformer: a general U-shaped Transformer for image restoration//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 17662-17672 [DOI: 10.1109/CVPR52688.2022.01716]
Xie Q, Zhou M H, Zhao Q, Xu Z B and Meng D Y. 2022. MHF-Net: an interpretable deep network for multispectral and hyperspectral image fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3): 1457-1473 [DOI: 10.1109/TPAMI.2020.3015691]
Yao J, Hong D F, Chanussot J, Meng D Y, Zhu X X and Xu Z B. 2020. Cross-attention in coupled unmixing nets for unsupervised hyperspectral super-resolution//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 208-224 [DOI: 10.1007/978-3-030-58526-6_13]
Zamir S W, Arora A, Khan S, Hayat M, Khan F S and Yang M H. 2022. Restormer: efficient Transformer for high-resolution image restoration//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 5718-5729 [DOI: 10.1109/CVPR52688.2022.00564]
Zhang K, Wang M, Yang S Y and Jiao L C. 2018a. Spatial-spectral-graph-regularized low-rank tensor decomposition for multispectral and hyperspectral image fusion. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(4): 1030-1040 [DOI: 10.1109/JSTARS.2017.2785411]
Zhang L, Nie J T, Wei W, Zhang Y N, Liao S C and Shao L. 2020. Unsupervised adaptation learning for hyperspectral imagery super-resolution//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 3070-3079 [DOI: 10.1109/CVPR42600.2020.00314]
Zhang L, Wei W, Bai C C, Gao Y F and Zhang Y N. 2018b. Exploiting clustering manifold structure for hyperspectral imagery super-resolution. IEEE Transactions on Image Processing, 27(12): 5969-5982 [DOI: 10.1109/TIP.2018.2862629]
Zhang L, Nie J T, Wei W, Li Y and Zhang Y N. 2021. Deep blind hyperspectral image super-resolution. IEEE Transactions on Neural Networks and Learning Systems, 32(6): 2388-2400 [DOI: 10.1109/TNNLS.2020.3005234]
Zhao C Q, Wang H H, Zhao J J, Ji L W, Wang Q D, Li H Z and Zhao Z J. 2022. Cerebral stroke detection algorithm for visual Transformer and multi-feature fusion. Journal of Image and Graphics, 27(3): 923-934
赵琛琦, 王华虎, 赵涓涓, 冀伦文, 王麒达, 李慧芝, 赵紫娟. 2022. 视觉Transformer与多特征融合的脑卒中检测算法. 中国图象图形学报, 27(3): 923-934 [DOI: 10.11834/jig.210745]