

Abstract
Objective End-to-end deep-learning methods for single-image deblurring have achieved excellent results. However, the building blocks of most networks focus only on extracting local features and show limitations in modeling long-range pixel dependencies. To address this problem, this paper proposes a method that introduces both local and non-local features into the network. Method Existing high-performing building blocks are used to extract local features. A large-window Transformer block is divided into smaller non-overlapping image patches, and only one maximum-value point is sampled from each patch to participate in the self-attention computation, so that non-local features are extracted without consuming excessive computational resources. Finally, the two modules are combined so that local and non-local information are coupled within a block, effectively capturing richer feature information. Result Experiments show that, compared with modules that can extract only local information, the proposed module improves the peak signal-to-noise ratio (PSNR) by at least 1.3 dB. In addition, two image restoration networks coupling local and non-local features are designed and applied to single-image motion deblurring and defocus deblurring, respectively. Compared with Uformer (a general U-shaped Transformer for image restoration), the average PSNR on the motion-deblurring test sets GOPRO and HIDE is improved by 0.29 dB and 0.25 dB, respectively, with fewer floating-point operations. On the defocus-deblurring test set DPD, the average PSNR is improved by 0.42 dB. Conclusion The proposed method successfully introduces non-local information within a block, enabling the model to capture local and non-local features simultaneously, obtain richer feature representations, and improve deblurring performance. The restored images also have sharper edges and are closer to the ground truth.
Non-Local feature representation embedded blurred image restoration

Xia Hua, Shu Ting, Shi Yu, Li Mingxin, Hong Hanyu(Wuhan Institute of Technology)

Objective Image deblurring is a classic problem in low-level computer vision that aims to restore a sharp image from a blurry one. In recent years, convolutional neural networks (CNNs) have significantly boosted the advancement of computer vision, and various CNN-based deblurring methods have been developed with remarkable results. Although the convolution operation is powerful at capturing local information, CNNs are limited in modeling long-range dependencies. By employing self-attention mechanisms, vision transformers have shown a strong ability to model long-range pixel relationships. However, most transformer models designed for computer vision tasks involving high-resolution images use a local-window self-attention mechanism, which contradicts the goal of employing transformer structures to capture true long-range pixel dependencies. Reviewing deblurring models capable of processing high-resolution images, we find that most CNN-based and vision-transformer-based approaches can extract only spatially local features. Some studies obtain information with a larger receptive field by directly increasing the window size, but this approach not only incurs excessive computational overhead but also lacks flexibility during feature extraction. To solve these problems, this paper proposes a method that incorporates both local and non-local information into the network. Method We employ local feature representation (LFR) modules and non-local feature representation (NLFR) modules to extract enriched information. For the extraction of local information, most existing building blocks already have this capability, so we treat them directly as LFR modules. In addition, we design a generic NLFR module for extracting non-local information that can be easily combined with an LFR module.
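As a concrete illustration of the non-local sampling idea behind the NLFR module, the following minimal NumPy sketch partitions a feature map into non-overlapping patches, samples one salient point per patch, and runs plain scaled dot-product self-attention only on those sampled tokens. This is an illustrative assumption rather than the paper's implementation: the function names, the use of the channel L2 norm as the "maximum-value point" criterion, and the single-head attention are all stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def max_point_attention(feat, patch=4):
    """Self-attention over one salient point sampled from each
    non-overlapping patch of the feature map (illustrative sketch).

    feat: (H, W, C) feature map; H and W divisible by `patch`.
    Returns an (N, C) array of attended non-local tokens, where
    N = (H // patch) * (W // patch).
    """
    H, W, C = feat.shape
    gh, gw = H // patch, W // patch
    # Partition into non-overlapping patches: (gh*gw, patch*patch, C).
    blocks = feat.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    blocks = blocks.reshape(gh * gw, patch * patch, C)
    # Sample one salient pixel per patch (here: max channel L2 norm,
    # standing in for the maximum-value point of the paper).
    idx = np.linalg.norm(blocks, axis=-1).argmax(axis=1)
    tokens = blocks[np.arange(gh * gw), idx]           # (N, C)
    # Plain scaled dot-product self-attention on the sampled tokens:
    # N is much smaller than H*W, hence the reduced complexity.
    scores = tokens @ tokens.T / np.sqrt(C)
    return softmax(scores, axis=-1) @ tokens
```

Because attention runs over only one token per patch instead of every pixel in the window, the quadratic attention cost shrinks by a factor of roughly `patch**4`, which is the source of the low computational overhead claimed for the module.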
The NLFR module consists of a non-local feature extraction (NLFE) block and an inter-block transmission (IBT) mechanism. The NLFE block applies a non-local self-attention mechanism that avoids interference from local information and texture details, captures purer non-local information, and significantly reduces computational complexity. To reduce the accumulation of local information in the NLFE block as network depth increases, we introduce an IBT mechanism between successive NLFE blocks, which provides a direct data flow for the transfer of non-local information. This design has two advantages: 1) the NLFR module ignores local texture details when extracting features, so that local and non-local information do not interfere with each other; 2) instead of computing the self-similarity of all pixels within the receptive field, the NLFR module adaptively samples salient pixels, significantly reducing computational complexity. We select LeFF and ResBlock as the LFR modules to be combined with the NLFR module, and design two models, NLCNet_L and NLCNet_R, to handle motion blur removal and defocus blur removal, respectively, with a single-stage UNet as the model architecture. Result We verify the gain from each component of the NLFR module: the network consisting of the NLFR module combined with the LFR module obtains a PSNR gain of 0.89 dB over using only the LFR module as the building block. Adding the IBT mechanism on top of this further improves PSNR by 0.09 dB. For a fair comparison, we build a baseline model using only ResBlock as the building block, with computational overhead and parameter count similar to the proposed network. The results demonstrate that the NLFR-combined ResBlock is more effective for constructing a deblurring network than using ResBlock directly as the building block.
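The IBT mechanism described above can be pictured as a stream threaded directly through successive NLFE stages. The sketch below is a toy illustration under assumed names and an assumed update rule, not the paper's implementation; its only purpose is to show the direct data flow that lets non-local information bypass local processing between blocks.

```python
import numpy as np

def nlfe_stage(x, nl_stream, scale=0.1):
    """Toy NLFE stage with an inter-block transmission (IBT) input.

    `nl_stream` carries non-local features handed over directly from
    the previous stage, so they skip the local processing inside each
    block and do not have to be re-extracted at every depth.
    """
    nl_out = nl_stream + scale * x     # refresh the non-local stream
    return x + nl_out, nl_out          # (stage output, stream onward)

def run_stack(x, depth=3):
    # Chain successive NLFE stages; the second output of each stage is
    # fed straight into the next one -- the direct data flow that the
    # IBT mechanism provides for non-local information.
    nl = np.zeros_like(x)
    for _ in range(depth):
        x, nl = nlfe_stage(x, nl)
    return x
```

The design point being illustrated: without the second return value, each stage would have to recover non-local context from features already mixed with local detail, which is the accumulation effect the IBT mechanism is meant to counter.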
In scalability experiments, we show that combining NLFR modules with existing building blocks, including convolutional residual blocks and a transformer block, significantly improves deblurring performance. In particular, the two networks built with the NLFR-combined LeFF block and ResBlock as building blocks achieve excellent results in single-image motion deblurring and dual-pixel defocus deblurring compared with other methods. Following the common training protocol, NLCNet_L was trained on the GoPro dataset for 3000 epochs and tested on the GoPro test set. Our method achieves the best results on the GoPro test set with the lowest computational complexity; compared with the previous method Uformer, it improves PSNR by 0.29 dB. For dual-pixel defocus deblurring experiments, we trained NLCNet_R on the DPD dataset for 200 epochs. In the combined-scene category, our method achieves excellent performance on all four metrics; compared with Uformer, it improves PSNR in indoor and outdoor scenes by 1.37 dB and 0.94 dB, respectively. Conclusion We propose a generic NLFR module for extracting genuinely non-local information from images, which can be coupled with local information within a block to improve the expressive ability of the model. Through careful design, networks composed of NLFR modules achieve excellent performance with low computational cost, and the restored images, especially their edge contours, are visually clearer and more complete.