Lan Zhi1, Yan Caiping1, Li Hong2, Zheng Yadan1 (1. Hangzhou Normal University, Hangzhou 311121, China; 2. Hangzhou Qiyuan Vision Technology Co., Ltd., Hangzhou 311121, China)
Objective Image inpainting refers to filling the missing or corrupted parts of an image with plausible content. Despite the great progress made by generative adversarial networks (GANs), most existing methods still produce distorted structures and blurry textures when the missing region is large. A primary cause is the locality of the convolution operation, which does not consider global or long-range structural information and merely enlarges the local receptive field. Method To overcome this problem, a novel image inpainting network, the hybrid dual attention generative adversarial network (HDA-GAN), is proposed to capture global structural information and local detailed textures simultaneously. Specifically, HDA-GAN integrates two types of modules, cascaded channel-attention propagation and cascaded self-attention propagation, into different layers of the network. For the cascaded channel-attention propagation module, several multi-scale channel-attention blocks are cascaded in the higher layers of the network to learn features from low-level details to high-level semantics. For the cascaded self-attention propagation module, several patch-based self-attention blocks are cascaded in the middle and lower layers to capture long-range dependencies while preserving more details. Each cascaded module stacks several identical attention blocks into different layers, enhancing the propagation of local textures to the global structure. Result Extensive experiments are conducted on the Paris Street View and CelebA-HQ (CelebA-high quality) datasets using the objective metrics of mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). In the quantitative comparison on the Paris Street View dataset, relative to the Edge-LBAM (edge-guided learnable bidirectional attention maps) method, HDA-GAN improves PSNR by 1.28 dB, 1.13 dB, 0.93 dB, and 0.80 dB and SSIM by 5.2%, 8.2%, 10.6%, and 13.1% across different mask ratios. Likewise, on the CelebA-HQ dataset, relative to the AOT-GAN (aggregated contextual transformations generative adversarial network) method, MAE decreases by 2.2%, 5.4%, 11.1%, 18.5%, and 28.1% and PSNR increases by 0.93 dB, 0.68 dB, 0.73 dB, 0.84 dB, and 0.74 dB across different mask ratios. Visual comparisons also clearly show that the restoration quality surpasses that of these methods. Conclusion The proposed image inpainting method makes full use of the strengths of deep learning models in feature learning and image generation, restoring the missing or corrupted parts of an image more accurately.
HDA-GAN：hybrid dual attention generative adversarial network for image inpainting
Objective Image inpainting has been extensively studied as a fundamental topic in image processing over the past two decades. It attempts to fill in the missing or corrupted parts of an image with satisfactory and plausible content. Traditional techniques can succeed in certain straightforward situations but, given their inability to generate semantically coherent content, fall short when the missing region is large or complex. Image inpainting methods based on deep learning and adversarial learning have produced increasingly promising results in recent years. However, most of these methods still produce distorted structures and blurry textures when the missing region is large. One primary cause is that they do not consider global or long-range structural information owing to the locality of vanilla convolution operations, even when dilated convolution is used to enlarge the local receptive field. Method To overcome this issue, this study proposes a novel image inpainting network called the hybrid dual attention generative adversarial network (HDA-GAN), which captures both global structural information and local detailed textures. Specifically, HDA-GAN integrates two types of cascaded attention propagation modules, namely, cascaded channel-attention propagation and cascaded self-attention propagation, into different convolutional layers of the generator network. For the cascaded channel-attention propagation module, several multi-scale channel-attention blocks are cascaded into shallow layers to learn features from low-level details to high-level semantics. The multi-scale channel-attention block adopts a split-attention-merge strategy and residual-gated operations to aggregate multiple channel-attention correlations, enhancing high-level semantics while preserving low-level details. For the cascaded self-attention propagation module, several positional-separated self-attention blocks are stacked into middle and deep layers.
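The paper does not spell out the multi-scale channel-attention block in this abstract, but its core ingredients (per-channel reweighting plus a residual gate) can be illustrated with a minimal single-scale sketch in NumPy. The function name, the squeeze-excitation-style bottleneck MLP, and the fixed `gate` coefficient are illustrative assumptions, not the authors' exact design:

```python
import numpy as np

def channel_attention(x, w1, b1, w2, b2, gate=0.5):
    """Single-scale channel attention with a residual gate (illustrative sketch).

    x: feature map of shape (C, H, W).
    w1, b1, w2, b2: weights/biases of a two-layer bottleneck MLP.
    gate: residual-gate coefficient blending attended and identity paths.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    s = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid, giving per-channel weights
    h = np.maximum(0.0, w1 @ s + b1)
    a = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))      # each weight in (0, 1)
    # Reweight channels, then blend with the input (residual-gated operation)
    attended = x * a[:, None, None]
    return gate * attended + (1.0 - gate) * x
```

With `gate=0.0` the block reduces to the identity, which is the usual safety property of a residual-gated design: attention can be attenuated without destroying the low-level details carried by the skip path.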
The positional-separated self-attention blocks adopt the same split-attention-merge strategy and residual-gated operations as the multi-scale channel-attention blocks, with some modifications. This design allows them to preserve details while learning long-range semantic interactions, and it further reduces computational complexity compared with the original self-attention. Result Extensive experiments on the Paris Street View and CelebA-HQ datasets demonstrate that HDA-GAN produces superior inpainting results, both quantitatively and qualitatively, compared with several state-of-the-art algorithms. The Paris Street View dataset includes 15 000 street images of Paris, split into 14 900 training images and 100 test images, while the CelebA-HQ dataset contains 30 000 high-quality human face images, whose high-frequency hair and skin details are useful for evaluating fine-grained texture synthesis. Following the standard configuration, 28 000 of these images are used for training and 2 000 for testing. Free-form masks, which closely match real-world damage and are therefore used by many inpainting techniques, are employed in both training and testing under the standard settings. All images are resized to 512×512 pixels or 256×256 pixels for training and testing, depending on the dataset. The mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) are used to evaluate the performance of different methods in filling holes with different hole-to-image area ratios. On the Paris Street View dataset, the PSNR of the proposed method increases by 1.28 dB, 1.13 dB, 0.93 dB, and 0.80 dB, and its SSIM increases by 5.2%, 8.2%, 10.6%, and 13.1%, compared with the Edge-LBAM method as the hole-to-image area ratio increases.
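The exact positional-separated formulation is not given in this abstract; one common way such blocks cut the cost of self-attention is to restrict attention to local windows. The NumPy sketch below, with identity query/key/value projections for brevity, shows only that window restriction and the resulting complexity saving, and should not be read as the authors' actual block:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def patch_self_attention(x, patch=4):
    """Self-attention restricted to non-overlapping windows (illustrative sketch).

    x: (H, W, C) feature map; H and W must be divisible by `patch`.
    Attention is computed independently inside each patch x patch window,
    so the cost is O(H * W * patch^2 * C) rather than the O((H * W)^2 * C)
    of global self-attention.
    """
    h, w, c = x.shape
    out = np.empty_like(x)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            win = x[i:i + patch, j:j + patch].reshape(-1, c)  # (patch^2, C)
            # Identity projections for brevity: q = k = v = win
            scores = softmax(win @ win.T / np.sqrt(c))
            out[i:i + patch, j:j + patch] = (scores @ win).reshape(patch, patch, c)
    return out
```

Because the softmax rows sum to one, a constant feature map passes through unchanged, which makes the sketch easy to sanity-check.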
Meanwhile, on the CelebA-HQ dataset, the MSE of the proposed method decreases by 2.2%, 5.4%, 11.1%, 18.5%, and 28.1%, and its PSNR increases by 0.93 dB, 0.68 dB, 0.73 dB, 0.84 dB, and 0.74 dB, compared with the AOT-GAN method as the hole-to-image area ratio increases. These results show that the proposed method quantitatively and qualitatively outperforms the other algorithms. Conclusion This study proposes a novel hybrid dual attention generative adversarial network for image inpainting, HDA-GAN, which generates plausible and satisfactory content for a corrupted image by fusing two carefully designed attention propagation modules. Applying the cascaded attention propagation modules in the skip-connection layers significantly improves the global structure and local texture captured by the generator, which is crucial for inpainting, particularly when filling complex missing regions or large holes. In future work, the cascaded attention propagation modules will be applied to other vision tasks such as image denoising, image translation, and single-image super-resolution.
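The MSE and PSNR metrics used throughout the evaluation have standard definitions, sketched below in NumPy (the `max_val=255.0` peak assumes 8-bit images; SSIM is omitted because it involves windowed statistics beyond a short sketch):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of matching shape."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means the images are closer.

    max_val is the peak intensity (255 for 8-bit images, 1.0 for floats).
    """
    err = mse(a, b)
    if err == 0:
        return float("inf")      # identical images
    return 10.0 * np.log10(max_val ** 2 / err)
```

A gain of about 1 dB in PSNR, as reported against Edge-LBAM on Paris Street View, corresponds to roughly a 20% reduction in MSE, since PSNR is logarithmic in the error.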