Dynamic association learning of self-attention and convolution in image restoration

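Jiang Kui1, Jia Xuemei2, Huang Wenxin3, Wang Wenbing4, Wang Zheng2, Jiang Junjun1 (1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150000; 2. School of Computer Science, Wuhan University, Wuhan 430072; 3. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062; 4. Hangzhou Lingban Technology Ltd., Hangzhou 310000)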

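Abstract
Objective Convolutional neural networks (CNNs) and self-attention (SA) have achieved great success in multimedia applications. However, few studies have effectively coordinated these two architectures in image restoration tasks. Considering the respective strengths and weaknesses of the two architectures, this paper proposes an association learning approach that combines their advantages and suppresses their shortcomings to achieve high-quality and efficient image restoration.

Method This paper combines the strengths of the CNN and SA architectures. In particular, it exploits the local perception and translation invariance of CNNs and the global aggregation ability of SA for representing specific local context and global structure. In addition, the degradation distribution of an image reveals the location and degree of degradation in image space. Inspired by this, the paper introduces a degradation prior into background restoration and, on this basis, proposes a dynamic association learning method for image restoration. Its core is a new multi-input attention module that associates the removal of degradation perturbations with background restoration. By incorporating depthwise separable convolution, the method exploits the advantages of both CNNs and SA to achieve efficient and high-quality image restoration.

Result Ablation experiments on the Test1200 dataset verify the effectiveness of each part of the algorithm. The results show that fusing CNNs and SA effectively improves the representation ability of the model, and that association learning between degradation removal and background restoration effectively improves the overall restoration quality. The method is compared with more than 10 other methods on synthetic and real data for three image restoration tasks and achieves significant improvement. On the image deraining task, the proposed ELF (image deraining meets association learning and Transformer) method improves the peak signal-to-noise ratio (PSNR) by 0.9 dB over MPRNet (multi-stage progressive image restoration network) on the synthetic Test1200 dataset. On the underwater image enhancement task, ELF exceeds Ucolor by 4.15 dB on the R90 dataset. On the low-light image enhancement task, ELF achieves a 1.09 dB improvement over the LLFlow (flow-based low-light image enhancement) algorithm.

Conclusion The proposed method has advantages in both quality and efficiency, and it outperforms representative methods on common tasks such as image deraining, low-light image enhancement, and underwater image restoration.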
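The gains above are all reported in PSNR, so it may help to see how a decibel difference relates to pixel error. The following is a minimal sketch, not the authors' evaluation code: it assumes 8-bit images with a peak value of 255 represented as flat lists of pixel values, and the names `psnr` and `psnr_gain_to_mse_ratio` are illustrative. Published deraining results are often computed on a specific color channel or with a particular toolbox, which this sketch does not reproduce.

```python
import math


def psnr(reference, restored, peak=255.0):
    """PSNR in dB between two equally sized images given as flat pixel sequences.

    Assumption: pixel values share the same range, with `peak` as the maximum.
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    if not reference:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def psnr_gain_to_mse_ratio(gain_db):
    """Factor by which the MSE shrinks for a PSNR gain of `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition, a PSNR gain of g dB at a fixed peak value divides the mean squared error by 10^(g/10). For example, `psnr_gain_to_mse_ratio(0.9)` is roughly 1.23, so the 0.9 dB gain reported on Test1200 corresponds to a mean squared error about 1.23 times smaller.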
Keywords
Dynamic association learning of self-attention and convolution in image restoration

Jiang Kui1, Jia Xuemei2, Huang Wenxin3, Wang Wenbing4, Wang Zheng2, Jiang Junjun1(1.School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150000, China;2.School of Computer Science, Wuhan University, Wuhan 430072, China;3.School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China;4.Hangzhou Lingban Technology Ltd., Hangzhou 310000, China)

Abstract
Objective Convolutional neural networks (CNNs) and self-attention (SA) have achieved great success in multimedia applications. However, few studies have effectively coordinated these two architectures in image restoration. Owing to the intrinsic characteristics of local connectivity and translation equivariance, CNNs have at least two shortcomings: 1) a limited receptive field and 2) static sliding-window weights at inference, which cannot adapt to diverse content. The former prevents the network from capturing long-range pixel dependencies, while the latter sacrifices adaptability to the input content. As a result, CNNs are far from meeting the requirements of modeling the global rain distribution, and they generate results with obvious rain residue. Meanwhile, because SA is computed globally, its computational complexity grows quadratically with the spatial resolution, making it infeasible to apply to high-resolution images. In view of the advantages and disadvantages of these two architectures, this study proposes an association learning method that comprehensively utilizes the advantages of both and suppresses their respective shortcomings to achieve high-quality and efficient image restoration.

Method This study combines the advantages of the CNN and SA architectures, particularly by fully utilizing the local perception and translation invariance of CNNs and the global aggregation ability of SA in specific local context and global structural representations. Beyond predicting the rain distribution, we take inspiration from the observation that the rain distribution reflects the location and degree of degradation. Therefore, we propose to refine background textures with the predicted degradation prior in an association learning manner. We accomplish image deraining by associating rain streak removal and background recovery, in which an image deraining network and a background recovery network are specifically designed for these two subtasks. The key part of association learning is a novel multi-input attention module (MAM). It generates the degradation prior and produces the degradation mask according to the predicted rain distribution. Benefiting from the global correlation calculation of SA, MAM can extract informative complementary components from the rainy input (query) with a degradation mask (key) and then help realize accurate texture restoration. SA tends to aggregate feature maps according to attention importance, whereas convolution diversifies them to focus on local textures. Unlike Restormer, which is built from pure Transformer blocks, our design runs SA and CNNs in parallel, and we propose a hybrid fusion network. The network involves one residual Transformer branch (RTB) and one encoder-decoder branch (EDB). The former takes a few learnable tokens (feature channels) as input and stacks multihead attention and feed-forward networks to encode global features of the image. The latter, conversely, leverages a multiscale encoder-decoder to represent contextual knowledge. We propose a lightweight hybrid fusion block to aggregate the outcomes of RTB and EDB and yield a final solution to the subtask. In this way, we construct our final model as a two-stage Transformer-based method, namely ELF, for single image deraining.

Result An ablation experiment is conducted on the Test1200 dataset to validate the effectiveness of the various parts of the algorithm. The experimental results show that the fusion of CNN and SA can effectively improve the representation ability of the model. At the same time, association learning between the removal of degradation perturbations and background restoration can effectively improve the overall restoration quality. The proposed method is compared with more than 10 other methods on the synthetic and real data of three image restoration tasks and achieves significant improvement. In the image deraining task, the ELF method improves the peak signal-to-noise ratio (PSNR) by 0.9 dB compared with the multi-stage progressive image restoration network (MPRNet) on the synthetic dataset Test1200. In the underwater image enhancement task, ELF exceeds Ucolor by 4.15 dB on the R90 dataset. In the low-light image enhancement task, ELF achieves a 1.09 dB improvement compared with the LLFlow algorithm.

Conclusion We rethink image deraining as a composite task of rain streak removal, texture recovery, and their association learning, and we propose the ELF model for image deraining. Accordingly, ELF adopts a two-stage architecture and an association learning module to account for the two goals of rain streak removal and texture reconstruction while facilitating the learning capability. The joint optimization promotes compatibility while maintaining model compactness. Extensive results on image deraining and joint detection tasks demonstrate the superiority of the ELF model over state-of-the-art techniques. The proposed method is efficient and effective, and it is superior to representative methods in common tasks such as image deraining, low-light image enhancement, and underwater image enhancement.
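The abstract describes MAM as using the rainy input as the query and the degradation mask as the key, and notes that the cost of SA grows quadratically with spatial resolution. The sketch below illustrates both points with generic scaled dot-product attention in plain Python. It is not the authors' MAM: the function names are hypothetical, the feature vectors are toy lists, and the choice of values (one value vector per key position) is an assumption, since the abstract does not say where the values come from.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention (illustrative, not the paper's MAM).

    queries: N_q vectors of dimension d (assumed: rainy-input features)
    keys:    N_k vectors of dimension d (assumed: degradation-mask features)
    values:  N_k vectors, one per key position (assumption, see lead-in)
    Returns (outputs, weights), where weights is the N_q x N_k attention matrix.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        # One score per (query, key) pair: this is the N_q x N_k cost.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    value_dim = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(value_dim)]
        for row in weights
    ]
    return outputs, weights


def attention_matrix_entries(height, width):
    """Number of pairwise scores when every pixel attends to every pixel."""
    n = height * width
    return n * n
```

Doubling both the height and the width quadruples the number of pixels and multiplies the number of attention scores by 16, as `attention_matrix_entries(128, 128) == 16 * attention_matrix_entries(64, 64)` shows. This is the kind of growth that motivates pairing SA with convolution rather than applying global attention to full-resolution images.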
Keywords
