A Transformer fusion network for single blurred image super-resolution
2022, Vol. 27, No. 5: 1616-1631
Received: 2021-09-16
Revised: 2021-10-25
Accepted: 2021-11-01
Published in print: 2022-05-16
DOI: 10.11834/jig.210847
Objective
Deep learning methods, represented by convolutional neural networks, have achieved remarkable results in single image super-resolution. However, most of these methods assume that the low-resolution image is free of blur. In real scenes, low-resolution images are usually accompanied by blur caused by camera shake, object motion, and other factors. To address the super-resolution of blurred images, we propose a novel Transformer fusion network.
Method
First, a deblurring module and a texture feature extraction module extract clear edge-contour features and detailed texture features, respectively. Then, a multi-head self-attention mechanism computes the response of any local region of the feature map to the global information, enabling the Transformer fusion module to fuse the edge features and the texture features at the global semantic level. Finally, a reconstruction module restores the fused features to a high-resolution image.
Result
We compare our method with nine recent methods on two public datasets. For 2×, 4×, and 8× super-resolution on the GOPRO dataset, the peak signal-to-noise ratio (PSNR) is 0.12 dB, 0.18 dB, and 0.07 dB higher, respectively, than that of the second-best model, GFN (gated fusion network); for 2×, 4×, and 8× super-resolution on the Kohler dataset, the PSNR is 0.17 dB, 0.28 dB, and 0.16 dB higher, respectively, than that of GFN. We also conduct comparative experiments on the GOPRO dataset to verify the effectiveness of the Transformer fusion network. The results show that the proposed network clearly improves the super-resolution reconstruction of blurred images.
Conclusion
The proposed Transformer fusion network for blurred image super-resolution has a strong ability to capture long-range dependencies and global information. By using multi-head self-attention layers to compute the response of any local region of the feature map to the global information, it deeply fuses the deblurring features and the detailed texture features at the global semantic level, thereby improving the super-resolution reconstruction of blurred images.
Objective
Single image super-resolution, which enhances the spatial resolution and quality of an image, is an essential task in computer vision. Deep learning based methods have substantially advanced single image super-resolution, but most of them regard the low-resolution input as a clear image without blur effects. However, low-resolution images captured in real scenes are often degraded by blur artifacts caused by factors such as camera shake and object motion, and these artifacts can be amplified during super-resolution reconstruction. Hence, our research focuses on super-resolving single images degraded by motion blur.
Method
Our Transformer fusion network (TFN) handles super-resolution reconstruction of low-resolution blurred images. TFN adopts a dual-branch strategy that removes blur while super-resolving the input. First, a deblurring module (DM) extracts deblurring features such as clear edge structures. The DM follows an encoder-decoder architecture: the encoder uses three convolutional layers to decrease the spatial resolution of the feature maps while increasing their channels, and the decoder uses two de-convolutional layers to increase the spatial resolution while decreasing the channels. Supervised by an L1 deblurring loss, the DM generates clear feature maps through this down-sampling and up-sampling process. However, the DM tends to lose some detailed information of the input image, because fine details are removed together with the blur artifacts. A sketch of such a module is given below.
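The following is a minimal PyTorch sketch of the DM described above; the channel widths, kernel sizes, and strides are our assumptions, since the abstract fixes only the layer counts.

```python
import torch.nn as nn

class DeblurringModule(nn.Module):
    """Sketch of the DM branch: three convolutions down-sample the
    features, two transposed convolutions restore the resolution.
    All widths and strides are assumptions."""

    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(ch, 128, 3, stride=2, padding=1),   # 1/2 resolution, more channels
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1),  # 1/4 resolution
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # back to 1/2
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, ch, 4, stride=2, padding=1),   # full resolution
        )

    def forward(self, x):
        # During training, this output would be supervised by an L1
        # deblurring loss derived from the sharp ground truth.
        return self.decoder(self.encoder(x))
```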
Then, we design an additional texture feature extraction module (TFEM) to extract detailed texture features. The TFEM is composed of six residual blocks, which alleviate gradient problems and speed up convergence. Unlike the DM, the TFEM has no down-sampling or up-sampling, so it preserves more detailed texture than the DM, although these features still contain some blur artifacts. A sketch of this branch follows.
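A corresponding sketch of the TFEM under the same assumptions; only the count of six residual blocks comes from the paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block (He et al., 2016); the identity shortcut
    eases gradient flow and speeds up convergence."""

    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class TextureFeatureExtractionModule(nn.Module):
    """TFEM sketch: six residual blocks with no resolution change,
    so spatial detail is preserved end to end."""

    def __init__(self, ch=64):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(6)])

    def forward(self, x):
        return self.blocks(x)
```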
To take advantage of both the clear deblurring features extracted by the DM and the detailed features extracted by the TFEM, we use a Transformer fusion module (TFM) to fuse them. The TFM is built around a customized multi-head attention layer. Because the Transformer encoder expects one-dimensional token sequences, the TFM flattens the feature maps before attention and unflattens them afterwards. Thanks to the long-range and global dependency capturing ability of multi-head attention, the TFM fuses the deblurring features and the detailed texture features effectively at the global semantic level, as sketched below.
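A possible realization of the TFM follows. Treating the DM features as queries and the TFEM features as keys and values is our assumption; the paper states only that a customized multi-head attention layer with flatten/unflatten operations performs the fusion.

```python
import torch.nn as nn

class TransformerFusionModule(nn.Module):
    """TFM sketch built on multi-head attention. The query/key-value
    assignment across the two branches is an assumption."""

    def __init__(self, ch=64, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=ch, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, deblur_feat, texture_feat):
        b, c, h, w = deblur_feat.shape
        # Flatten B x C x H x W maps into B x (H*W) x C token sequences,
        # since the attention layer expects one-dimensional token lists.
        q = deblur_feat.flatten(2).transpose(1, 2)
        kv = texture_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)   # each token attends globally
        fused = self.norm(fused + q)      # residual connection + LayerNorm
        # Unflatten the tokens back into a B x C x H x W feature map.
        return fused.transpose(1, 2).reshape(b, c, h, w)
```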
Finally, we use a reconstruction module (RM) to perform super-resolution reconstruction on the fused features and generate the super-resolved image. A hypothetical end-to-end composition of the four modules is sketched after this paragraph.
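Continuing the sketches above, a hypothetical forward pass could compose the modules as follows. The shallow feature head and the sub-pixel (PixelShuffle) reconstruction are assumptions (the latter suggested by the Shi et al., 2016 citation in the references); larger scales would stack further upsampling stages.

```python
import torch
import torch.nn as nn

class TFN(nn.Module):
    """End-to-end sketch combining the module sketches above."""

    def __init__(self, ch=64, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)   # shallow features
        self.dm = DeblurringModule(ch)
        self.tfem = TextureFeatureExtractionModule(ch)
        self.tfm = TransformerFusionModule(ch)
        self.rm = nn.Sequential(                     # reconstruction module
            nn.Conv2d(ch, ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                  # sub-pixel upsampling
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, lr_blurred):
        feat = self.head(lr_blurred)
        fused = self.tfm(self.dm(feat), self.tfem(feat))
        return self.rm(fused)

# A 64x64 blurred low-resolution patch becomes a 128x128 output at scale 2.
sr = TFN()(torch.randn(1, 3, 64, 64))
print(sr.shape)  # torch.Size([1, 3, 128, 128])
```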
Result
The extensive experiments demonstrate that our method generates sharper super-resolved images from low-resolution blurred inputs. We compare the proposed TFN with several algorithms, including dedicated single image super-resolution methods, joint image deblurring and super-resolution approaches, and combinations of super-resolution algorithms with non-uniform deblurring algorithms. Specifically, the single image super-resolution methods include the residual channel attention network (RCAN) and the holistic attention network (HAN); the image deblurring methods include the scale-recurrent network (SRN) and the deblurring generative adversarial network (DBGAN); and the joint deblurring and super-resolution approaches include the gated fusion network (GFN). To further evaluate the proposed TFN, we conduct experiments on two test datasets: the GOPRO test dataset and the Kohler dataset. On the GOPRO test dataset, the peak signal-to-noise ratio (PSNR) of our TFN's super-resolved results is 0.12 dB, 0.18 dB, and 0.07 dB higher than the very recent GFN for the 2×, 4×, and 8× scales, respectively. On the Kohler dataset, the PSNR of our TFN is 0.17 dB, 0.28 dB, and 0.16 dB higher than GFN for the 2×, 4×, and 8× scales, respectively. In the ablation study, the PSNR of the model with only the DM is 1.04 dB higher than that of the model with only the TFEM; the model with both DM and TFEM is 1.84 dB and 0.80 dB higher than the models with only the TFEM and only the DM, respectively; and the full TFN with TFEM, DM, and TFM is 2.28 dB, 1.24 dB, and 0.44 dB higher than the models with only the TFEM, only the DM, and with TFEM and DM but no TFM, respectively. In summary, the ablation experiments on the GOPRO dataset illustrate that the TFM promotes global semantic-level fusion of the deblurring features and the detailed texture features, which greatly improves the network's super-resolution reconstruction of low-resolution blurred images. The experimental results on the GOPRO test dataset and the Kohler dataset show that our network improves both qualitative visual results and quantitative metrics.
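For reference, the PSNR metric used in all of these comparisons follows the standard definition sketched below; this is not code from the paper.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between a super-resolved image
    `sr` and its ground truth `hr`, both scaled to [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()
```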
Conclusion
We propose a Transformer fusion network for blurred image super-resolution. The network super-resolves a blurred image while removing blur artifacts by fusing the deblurring features extracted by the DM and the texture features extracted by the TFEM through a Transformer fusion module. In the Transformer fusion module, a multi-head self-attention layer calculates the response of local information of the feature map to the global information, which effectively fuses the deblurring features and the detailed texture features at the global semantic level and improves the super-resolution reconstruction of blurred images. Extensive ablation and comparative experiments demonstrate the superiority of our TFN in both qualitative visual results and quantitative metrics.
Ba J L, Kiros J R and Hinton G E. 2016. Layer normalization [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1607.06450.pdf
Badrinarayanan V, Kendall A and Cipolla R. 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Chen H T, Wang Y H, Guo T Y, Xu C, Deng Y P, Liu Z H, Ma S W, Xu C J, Xu C and Gao W. 2021. Pre-trained image processing transformer [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2012.00364.pdf
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: ACL: 1724-1734 [DOI: 10.3115/v1/D14-1179]
Clevert D A, Unterthiner T and Hochreiter S. 2016. Fast and accurate deep network learning by exponential linear units (ELUs) [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1511.07289.pdf
Dong C, Loy C C, He K M and Tang X O. 2014. Learning a deep convolutional network for image super-resolution//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 184-199 [DOI: 10.1007/978-3-319-10593-2_13]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2010.11929.pdf
Freeman W T, Pasztor E C and Carmichael O T. 2000. Learning low-level vision. International Journal of Computer Vision, 40(1): 25-47 [DOI: 10.1109/ICCV.1999.790414]
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1026-1034 [DOI: 10.1109/ICCV.2015.123]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Kim S Y, Oh J and Kim M. 2020. JSI-GAN: GAN-based joint super-resolution and inverse tone-mapping with pixel-wise task-specific filters for UHD HDR video//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 11287-11295 [DOI: 10.1609/aaai.v34i07.6789]
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1412.6980.pdf
Köhler R, Hirsch M, Mohler B, Schölkopf B and Harmeling S. 2012. Recording and playback of camera shake: benchmarking blind deconvolution with a real-world database//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 27-40 [DOI: 10.1007/978-3-642-33786-4_3]
Kupyn O, Budzan V, Mykhailych M, Mishkin D and Matas J. 2018. DeblurGAN: blind motion deblurring using conditional adversarial networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8183-8192 [DOI: 10.1109/CVPR.2018.00854]
Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z H and Shi W Z. 2017. Photo-realistic single image super-resolution using a generative adversarial network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 105-114 [DOI: 10.1109/CVPR.2017.19]
Liang J Y, Sun G L, Zhang K, Van Gool L and Timofte R. 2021. Mutual affine network for spatially variant kernel estimation in blind image super-resolution [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2108.05302.pdf
Lim B, Son S, Kim H, Nah S and Lee K M. 2017. Enhanced deep residual networks for single image super-resolution//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, USA: IEEE: 1132-1140 [DOI: 10.1109/CVPRW.2017.151]
Maas A L, Hannun A Y and Ng A Y. 2013. Rectifier nonlinearities improve neural network acoustic models//Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA: JMLR: 2-3
Nah S, Kim T H and Lee K M. 2017. Deep multi-scale convolutional neural network for dynamic scene deblurring//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 257-265 [DOI: 10.1109/CVPR.2017.35]
Niu B, Wen W L, Ren W Q, Zhang X D, Yang L P, Wang S Z, Zhang K H, Cao X C and Shen H F. 2020. Single image super-resolution via a holistic attention network//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 191-207 [DOI: 10.1007/978-3-030-58610-2_12]
Park H and Lee K M. 2017. Joint estimation of camera pose, depth, deblurring, and super-resolution from a blurred image sequence//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 4623-4631 [DOI: 10.1109/ICCV.2017.494]
Parmar N, Vaswani A, Uszkoreit J, Kaiser Ł, Shazeer N, Ku A and Tran D. 2018. Image transformer [EB/OL]. [2021-08-30]. http://proceedings.mlr.press/v80/parmar18a/parmar18a.pdf
Shi W Z, Caballero J, Huszár F, Totz J, Aitken A P, Bishop R, Rueckert D and Wang Z H. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 1874-1883 [DOI: 10.1109/CVPR.2016.207]
Tao X, Gao H Y, Shen X Y, Wang J and Jia J Y. 2018. Scale-recurrent network for deep image deblurring//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8174-8182 [DOI: 10.1109/CVPR.2018.00853]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1706.03762.pdf
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Wang X T, Yu K, Wu S X, Gu J J, Liu Y H, Dong C, Qiao Y and Loy C C. 2019. ESRGAN: enhanced super-resolution generative adversarial networks//Proceedings of 2019 European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 63-79 [DOI: 10.1007/978-3-030-11021-5_5]
Xu L, Ren J S J, Liu C and Jia J Y. 2014. Deep convolutional neural network for image deconvolution [EB/OL]. [2021-08-30]. https://proceedings.neurips.cc/paper/2014/file/1c1d4df596d01da60385f0bb17a4a9e0-Paper.pdf
Xu X Y, Sun D Q, Pan J S, Zhang Y J, Pfister H and Yang M H. 2017. Learning to super-resolve blurry face and text images//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 251-260 [DOI: 10.1109/ICCV.2017.36]
Yang F Z, Yang H, Fu J L, Lu H T and Guo B N. 2020. Learning texture transformer network for image super-resolution//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 5790-5799 [DOI: 10.1109/CVPR42600.2020.00583]
Zhang H G, Dai Y C, Li H D and Koniusz P. 2019. Deep stacked hierarchical multi-patch network for image deblurring//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5971-5979 [DOI: 10.1109/CVPR.2019.00613]
Zhang K, Liang J Y, van Gool L and Timofte R. 2021. Designing a practical degradation model for deep blind image super-resolution [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2103.14006.pdf
Zhang K H, Luo W H, Zhong Y R, Ma L, Stenger B, Liu W and Li H D. 2020a. Deblurring by realistic blurring//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 2734-2743 [DOI: 10.1109/CVPR42600.2020.00281]
Zhang N, Wang Y C, Zhang X and Xu D D. 2020. A review of single image super-resolution based on deep learning. Acta Automatica Sinica, 46(12): 2479-2499 [DOI: 10.16383/j.aas.c190031]
Zhang X Y, Dong H, Hu Z, Lai W S, Wang F and Yang M H. 2020b. Gated fusion network for degraded image super resolution. International Journal of Computer Vision, 128(6): 1699-1721[DOI: 10.1007/s11263-019-01285-y]
Zhang X Y, Wang F, Dong H and Guo Y. 2018b. A deep encoder-decoder networks for joint deblurring and super-resolution//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE: 1448-1452 [DOI: 10.1109/ICASSP.2018.8462601]
Zhang Y L, Li K P, Li K, Wang L C, Zhong B N and Fu Y. 2018c. Image super-resolution using very deep residual channel attention networks//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 294-310 [DOI: 10.1007/978-3-030-01234-2_18]