A Transformer fusion network for single blurred image super-resolution
2022, Vol. 27, No. 5: 1616-1631
Received: 2021-09-16
Revised: 2021-10-25
Accepted: 2021-11-01
Published in print: 2022-05-16
DOI: 10.11834/jig.210847
Objective
Deep learning methods, represented by convolutional neural networks, have achieved remarkable results in single image super-resolution. However, most of these methods assume that the low-resolution image is free of blur. In real scenes, low-resolution images are usually accompanied by blur caused by camera shake, object motion, and other factors. To address the super-resolution of blurred images, we propose a novel Transformer fusion network.
Method
First, a deblurring module and a texture feature extraction module extract clear edge-contour features and detailed texture features, respectively. Then, a multi-head self-attention mechanism computes the response of any local region of the feature map to the global information, enabling the Transformer fusion module to fuse the edge features and the texture features at the global semantic level. Finally, a reconstruction module restores the fused features to a high-resolution image.
Result
We compare our method with nine recent methods on two public datasets. For 2×, 4×, and 8× super-resolution on the GOPRO dataset, the peak signal-to-noise ratio (PSNR) is 0.12 dB, 0.18 dB, and 0.07 dB higher, respectively, than that of the second-best model, GFN (gated fusion network); for 2×, 4×, and 8× super-resolution on the Kohler dataset, the PSNR is 0.17 dB, 0.28 dB, and 0.16 dB higher, respectively, than that of GFN. We also conduct comparative experiments on the GOPRO dataset to verify the effectiveness of the Transformer fusion network. The results show that the proposed network clearly improves the super-resolution reconstruction of blurred images.
Conclusion
The proposed Transformer fusion network for blurred image super-resolution has a strong ability to capture long-range dependencies and global information. By using multi-head self-attention layers to compute the response of any local region of the feature map to the global information, it deeply fuses the deblurring features and the detailed texture features at the global semantic level, thereby improving the super-resolution reconstruction of blurred images.
Objective
Single image super-resolution, which enhances the spatial resolution and quality of an image, is an essential task in computer vision. Deep learning based methods have substantially advanced single image super-resolution, but most of them regard the low-resolution input as a clear image without blur effects. However, low-resolution images captured in real scenes are often degraded by blur artifacts caused by factors such as camera shake and object motion, and these artifacts can be amplified during super-resolution reconstruction. Hence, our research focuses on super-resolving single images degraded by motion blur.
Method
Our Transformer fusion network (TFN) handles super-resolution reconstruction of low-resolution blurred images. TFN adopts a dual-branch strategy that removes blur while super-resolving the input. First, a deblurring module (DM) extracts deblurring features such as clear edge structures. The DM follows an encoder-decoder architecture: the encoder uses three convolutional layers to decrease the spatial resolution of the feature maps while increasing their channels, and the decoder uses two de-convolutional layers to increase the spatial resolution while decreasing the channels. Supervised by an L1 deblurring loss, the DM generates clear feature maps through this down-sampling and up-sampling process. However, the DM tends to lose some detailed information of the input image, because fine details are removed together with the blur artifacts. A sketch of such a module is given below.
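The following is a minimal PyTorch sketch of the DM described above; the channel widths, kernel sizes, and strides are our assumptions, since the abstract fixes only the layer counts.

```python
import torch.nn as nn

class DeblurringModule(nn.Module):
    """Sketch of the DM branch: three convolutions down-sample the
    features, two transposed convolutions restore the resolution.
    All widths and strides are assumptions."""

    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(ch, 128, 3, stride=2, padding=1),   # 1/2 resolution, more channels
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1),  # 1/4 resolution
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # back to 1/2
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, ch, 4, stride=2, padding=1),   # full resolution
        )

    def forward(self, x):
        # During training, this output would be supervised by an L1
        # deblurring loss derived from the sharp ground truth.
        return self.decoder(self.encoder(x))
```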
Then, we design an additional texture feature extraction module (TFEM) to extract detailed texture features. The TFEM is composed of six residual blocks, which alleviate gradient problems and speed up convergence. Unlike the DM, the TFEM has no down-sampling or up-sampling, so it preserves more detailed texture than the DM, although these features still contain some blur artifacts. A sketch of this branch follows.
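A corresponding sketch of the TFEM under the same assumptions; only the count of six residual blocks comes from the paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block (He et al., 2016); the identity shortcut
    eases gradient flow and speeds up convergence."""

    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class TextureFeatureExtractionModule(nn.Module):
    """TFEM sketch: six residual blocks with no resolution change,
    so spatial detail is preserved end to end."""

    def __init__(self, ch=64):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(6)])

    def forward(self, x):
        return self.blocks(x)
```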
To take advantage of both the clear deblurring features extracted by the DM and the detailed features extracted by the TFEM, we use a Transformer fusion module (TFM) to fuse them. The TFM is built around a customized multi-head attention layer. Because the Transformer encoder expects one-dimensional token sequences, the TFM flattens the feature maps before attention and unflattens them afterwards. Thanks to the long-range and global dependency capturing ability of multi-head attention, the TFM fuses the deblurring features and the detailed texture features effectively at the global semantic level, as sketched below.
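A possible realization of the TFM follows. Treating the DM features as queries and the TFEM features as keys and values is our assumption; the paper states only that a customized multi-head attention layer with flatten/unflatten operations performs the fusion.

```python
import torch.nn as nn

class TransformerFusionModule(nn.Module):
    """TFM sketch built on multi-head attention. The query/key-value
    assignment across the two branches is an assumption."""

    def __init__(self, ch=64, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=ch, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, deblur_feat, texture_feat):
        b, c, h, w = deblur_feat.shape
        # Flatten B x C x H x W maps into B x (H*W) x C token sequences,
        # since the attention layer expects one-dimensional token lists.
        q = deblur_feat.flatten(2).transpose(1, 2)
        kv = texture_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)   # each token attends globally
        fused = self.norm(fused + q)      # residual connection + LayerNorm
        # Unflatten the tokens back into a B x C x H x W feature map.
        return fused.transpose(1, 2).reshape(b, c, h, w)
```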
Finally, we use a reconstruction module (RM) to perform super-resolution reconstruction on the fused features and generate the super-resolved image. A hypothetical end-to-end composition of the four modules is sketched after this paragraph.
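Continuing the sketches above, a hypothetical forward pass could compose the modules as follows. The shallow feature head and the sub-pixel (PixelShuffle) reconstruction are assumptions (the latter suggested by the Shi et al., 2016 citation in the references); larger scales would stack further upsampling stages.

```python
import torch
import torch.nn as nn

class TFN(nn.Module):
    """End-to-end sketch combining the module sketches above."""

    def __init__(self, ch=64, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)   # shallow features
        self.dm = DeblurringModule(ch)
        self.tfem = TextureFeatureExtractionModule(ch)
        self.tfm = TransformerFusionModule(ch)
        self.rm = nn.Sequential(                     # reconstruction module
            nn.Conv2d(ch, ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                  # sub-pixel upsampling
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, lr_blurred):
        feat = self.head(lr_blurred)
        fused = self.tfm(self.dm(feat), self.tfem(feat))
        return self.rm(fused)

# A 64x64 blurred low-resolution patch becomes a 128x128 output at scale 2.
sr = TFN()(torch.randn(1, 3, 64, 64))
print(sr.shape)  # torch.Size([1, 3, 128, 128])
```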
Result
The extensive experiments demonstrate that our method generates sharper super-resolved images from low-resolution blurred inputs. We compare the proposed TFN with several algorithms, including dedicated single image super-resolution methods, joint image deblurring and super-resolution approaches, and combinations of super-resolution algorithms with non-uniform deblurring algorithms. Specifically, the single image super-resolution methods include the residual channel attention network (RCAN) and the holistic attention network (HAN); the image deblurring methods include the scale-recurrent network (SRN) and the deblurring generative adversarial network (DBGAN); and the joint deblurring and super-resolution approaches include the gated fusion network (GFN). To further evaluate the proposed TFN, we conduct experiments on two test datasets: the GOPRO test dataset and the Kohler dataset. On the GOPRO test dataset, the peak signal-to-noise ratio (PSNR) of our TFN's super-resolved results is 0.12 dB, 0.18 dB, and 0.07 dB higher than the very recent GFN for the 2×, 4×, and 8× scales, respectively. On the Kohler dataset, the PSNR of our TFN is 0.17 dB, 0.28 dB, and 0.16 dB higher than GFN for the 2×, 4×, and 8× scales, respectively. In the ablation study, the PSNR of the model with only the DM is 1.04 dB higher than that of the model with only the TFEM; the model with both DM and TFEM is 1.84 dB and 0.80 dB higher than the models with only the TFEM and only the DM, respectively; and the full TFN with TFEM, DM, and TFM is 2.28 dB, 1.24 dB, and 0.44 dB higher than the models with only the TFEM, only the DM, and with TFEM and DM but no TFM, respectively. In summary, the ablation experiments on the GOPRO dataset illustrate that the TFM promotes global semantic-level fusion of the deblurring features and the detailed texture features, which greatly improves the network's super-resolution reconstruction of low-resolution blurred images. The experimental results on the GOPRO test dataset and the Kohler dataset show that our network improves both qualitative visual results and quantitative metrics.
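For reference, the PSNR metric used in all of these comparisons follows the standard definition sketched below; this is not code from the paper.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between a super-resolved image
    `sr` and its ground truth `hr`, both scaled to [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()
```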
Conclusion
We propose a Transformer fusion network for blurred image super-resolution. The network super-resolves a blurred image while removing blur artifacts by fusing the deblurring features extracted by the DM and the texture features extracted by the TFEM through a Transformer fusion module. In the Transformer fusion module, a multi-head self-attention layer calculates the response of local information of the feature map to the global information, which effectively fuses the deblurring features and the detailed texture features at the global semantic level and improves the super-resolution reconstruction of blurred images. Extensive ablation and comparative experiments demonstrate the superiority of our TFN in both qualitative visual results and quantitative metrics.
Ba J L, Kiros J R and Hinton G E. 2016. Layer normalization [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1607.06450.pdf
Badrinarayanan V, Kendall A and Cipolla R. 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Chen H T, Wang Y H, Guo T Y, Xu C, Deng Y P, Liu Z H, Ma S W, Xu C J, Xu C and Gao W. 2021. Pre-trained image processing transformer [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2012.00364.pdf
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: ACL: 1724-1734 [DOI: 10.3115/v1/D14-1179]
Clevert D A, Unterthiner T and Hochreiter S. 2016. Fast and accurate deep network learning by exponential linear units (ELUs) [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1511.07289.pdf
Dong C, Loy C C, He K M and Tang X O. 2014. Learning a deep convolutional network for image super-resolution//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 184-199 [DOI: 10.1007/978-3-319-10593-2_13]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2010.11929.pdf
Freeman W T, Pasztor E C and Carmichael O T. 2000. Learning low-level vision. International Journal of Computer Vision, 40(1): 25-47 [DOI: 10.1109/ICCV.1999.790414]
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1026-1034 [DOI: 10.1109/ICCV.2015.123]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Kim S Y, Oh J and Kim M. 2020. JSI-GAN: GAN-based joint super-resolution and inverse tone-mapping with pixel-wise task-specific filters for UHD HDR video//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 11287-11295 [DOI: 10.1609/aaai.v34i07.6789]
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1412.6980.pdf
Köhler R, Hirsch M, Mohler B, Schölkopf B and Harmeling S. 2012. Recording and playback of camera shake: benchmarking blind deconvolution with a real-world database//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 27-40 [DOI: 10.1007/978-3-642-33786-4_3]
Kupyn O, Budzan V, Mykhailych M, Mishkin D and Matas J. 2018. DeblurGAN: blind motion deblurring using conditional adversarial networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8183-8192 [DOI: 10.1109/CVPR.2018.00854]
Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z H and Shi W Z. 2017. Photo-realistic single image super-resolution using a generative adversarial network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 105-114 [DOI: 10.1109/CVPR.2017.19]
Liang J Y, Sun G L, Zhang K, Van Gool L and Timofte R. 2021. Mutual affine network for spatially variant kernel estimation in blind image super-resolution [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2108.05302.pdf
Lim B, Son S, Kim H, Nah S and Lee K M. 2017. Enhanced deep residual networks for single image super-resolution//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, USA: IEEE: 1132-1140 [DOI: 10.1109/CVPRW.2017.151]
Maas A L, Hannun A Y and Ng A Y. 2013. Rectifier nonlinearities improve neural network acoustic models//Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA: JMLR: 2-3
Nah S, Kim T H and Lee K M. 2017. Deep multi-scale convolutional neural network for dynamic scene deblurring//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 257-265 [DOI: 10.1109/CVPR.2017.35]
Niu B, Wen W L, Ren W Q, Zhang X D, Yang L P, Wang S Z, Zhang K H, Cao X C and Shen H F. 2020. Single image super-resolution via a holistic attention network//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 191-207 [DOI: 10.1007/978-3-030-58610-2_12]
Park H and Lee K M. 2017. Joint estimation of camera pose, depth, deblurring, and super-resolution from a blurred image sequence//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 4623-4631 [DOI: 10.1109/ICCV.2017.494]
Parmar N, Vaswani A, Uszkoreit J, Kaiser Ł, Shazeer N, Ku A and Tran D. 2018. Image transformer [EB/OL]. [2021-08-30]. http://proceedings.mlr.press/v80/parmar18a/parmar18a.pdf
Shi W Z, Caballero J, Huszár F, Totz J, Aitken A P, Bishop R, Rueckert D and Wang Z H. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 1874-1883 [DOI: 10.1109/CVPR.2016.207]
Tao X, Gao H Y, Shen X Y, Wang J and Jia J Y. 2018. Scale-recurrent network for deep image deblurring//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8174-8182 [DOI: 10.1109/CVPR.2018.00853]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/1706.03762.pdf
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Wang X T, Yu K, Wu S X, Gu J J, Liu Y H, Dong C, Qiao Y and Loy C C. 2019. ESRGAN: enhanced super-resolution generative adversarial networks//Proceedings of 2019 European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 63-79 [DOI: 10.1007/978-3-030-11021-5_5]
Xu L, Ren J S J, Liu C and Jia J Y. 2014. Deep convolutional neural network for image deconvolution [EB/OL]. [2021-08-30]. https://proceedings.neurips.cc/paper/2014/file/1c1d4df596d01da60385f0bb17a4a9e0-Paper.pdf
Xu X Y, Sun D Q, Pan J S, Zhang Y J, Pfister H and Yang M H. 2017. Learning to super-resolve blurry face and text images//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 251-260 [DOI: 10.1109/ICCV.2017.36]
Yang F Z, Yang H, Fu J L, Lu H T and Guo B N. 2020. Learning texture transformer network for image super-resolution//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 5790-5799 [DOI: 10.1109/CVPR42600.2020.00583]
Zhang H G, Dai Y C, Li H D and Koniusz P. 2019. Deep stacked hierarchical multi-patch network for image deblurring//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5971-5979 [DOI: 10.1109/CVPR.2019.00613]
Zhang K, Liang J Y, van Gool L and Timofte R. 2021. Designing a practical degradation model for deep blind image super-resolution [EB/OL]. [2021-08-30]. https://arxiv.org/pdf/2103.14006.pdf
Zhang K H, Luo W H, Zhong Y R, Ma L, Stenger B, Liu W and Li H D. 2020a. Deblurring by realistic blurring//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 2734-2743 [DOI: 10.1109/CVPR42600.2020.00281]
Zhang N, Wang Y C, Zhang X and Xu D D. 2020. A review of single image super-resolution based on deep learning. Acta Automatica Sinica, 46(12): 2479-2499 [DOI: 10.16383/j.aas.c190031]
Zhang X Y, Dong H, Hu Z, Lai W S, Wang F and Yang M H. 2020b. Gated fusion network for degraded image super resolution. International Journal of Computer Vision, 128(6): 1699-1721[DOI: 10.1007/s11263-019-01285-y]
Zhang X Y, Wang F, Dong H and Guo Y. 2018b. A deep encoder-decoder networks for joint deblurring and super-resolution//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE: 1448-1452 [DOI: 10.1109/ICASSP.2018.8462601]
Zhang Y L, Li K P, Li K, Wang L C, Zhong B N and Fu Y. 2018c. Image super-resolution using very deep residual channel attention networks//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 294-310 [DOI: 10.1007/978-3-030-01234-2_18]