基于Transformer方法的任意风格迁移策略
Transformer-based multi-style information transfer in image processing
2023年，第28卷第11期，页码：3536-3549
纸质出版日期: 2023-11-16
DOI: 10.11834/jig.211237
孙梅婷, 代龙泉, 唐金辉. 2023. 基于Transformer方法的任意风格迁移策略. 中国图象图形学报, 28(11):3536-3549
Sun Meiting, Dai Longquan, Tang Jinhui. 2023. Transformer-based multi-style information transfer in image processing. Journal of Image and Graphics, 28(11):3536-3549
目的
任意风格迁移是图像处理任务的重要分支,卷积神经网络作为其常用的网络架构,能够协助内容和风格信息的提取与分离,但是受限于卷积操作感受野,只能捕获图像局部关联先验知识;而自然语言处理领域的Transformer网络能够突破距离限制,捕获长距离依赖关系,更好地建模全局信息,但是因为需要学习所有元素间的关联性,其表达能力的提高也带来了计算成本的增加。鉴于风格迁移过程与句子翻译过程的相似性,提出了一种混合网络模型,综合利用卷积神经网络和Transformer网络的优点并抑制其不足。
方法
首先使用卷积神经网络提取图像高级特征,同时降低图像尺寸。随后将提取的特征送入Transformer中,求取内容特征与风格特征间的关联性,并将内容特征替换为风格特征的加权和,实现风格转换。最后使用卷积神经网络将处理好的特征映射回图像域,生成艺术化图像。
结果
与5种先进的任意风格迁移方法进行定性和定量比较。在定性方面,进行用户调查,比较各方法生成图像的风格化效果,结果表明本文网络生成的风格化图像渲染效果更受用户喜爱;在定量方面,比较各方法的风格化处理速度,结果表明本文网络风格化速率排名第3,属于可接受范围内。此外,本文与现有的基于Transformer的任意风格迁移方法进行比较,突出二者间差异;对判别网络进行消融实验,表明判别网络的引入能够有效提升图像的光滑度和整洁度;最后,将本文网络应用于多种风格迁移任务,表明本文网络具有灵活性。
结论
本文提出的混合网络模型,综合了卷积神经网络和Transformer网络的优点,同时引入了判别网络,使生成的风格化图像更加真实和生动。
Objective
Arbitrary style transfer renders a content image with the visual style of a reference style image while preserving the original content structure, and it is an important branch of image processing. Conventional style transfer methods rely on low-level image statistics and offer limited rendering quality. Deep convolutional neural networks (CNNs) have therefore been adopted in this domain: they extract and separate content features and style features so that the two can be re-balanced and re-integrated. However, constrained by the receptive field of the convolution operation, a CNN captures only local associations. The Transformer network proposed in natural language processing (NLP) breaks this distance limitation, captures long-range dependencies, and models global information well, but its expressiveness comes at a cost: correlations must be learned between all input elements, which increases the computational burden, and the lack of image priors slows its convergence. Given the similarity between the image style transfer process and the sentence translation process, we develop a hybrid network that combines the strengths of CNNs and the Transformer while suppressing their weaknesses.
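To make the contrast between local convolution and global attention concrete, the minimal PyTorch sketch below implements scaled dot-product attention, the core Transformer operation: every position of a flattened feature map is correlated with every other position, which captures long-range dependencies but also produces the quadratic score matrix responsible for the extra computational cost. The tensor shapes are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Global attention: every query position is compared with every key
    position, so dependencies are captured regardless of spatial distance.
    The (n_q x n_k) score matrix is also the source of the quadratic cost."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # pairwise correlations
    weights = F.softmax(scores, dim=-1)                      # normalized attention map
    return weights @ v                                       # weighted sum of values

# A 32 x 32 feature map flattened into 1024 tokens of dimension 256 (assumed sizes).
tokens = torch.randn(1, 32 * 32, 256)
out = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention
print(out.shape)  # torch.Size([1, 1024, 256])
```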
Method
The proposed network consists of four parts: an encoding network, a style transformation network, a decoding network, and a discriminative network. The encoding network uses convolutional layers to extract high-level image features while reducing the image size. These high-level features are sensitive to semantic information, such as the specific objects that make up the image content, whereas low-level information such as lines and textures better reflects stylistic characteristics; residual connections are therefore added to the encoding network to enrich the feature representation. The style transformation network is built on the Transformer architecture and consists of three sub-parts: a content encoder, a style encoder, and a decoder. The content encoder and the style encoder inject global information into the content features and the style features, respectively. The decoder then measures the correlation between content and style features and replaces the original content features with a weighted sum of the style features, producing stylized features. The decoding network, symmetric to the encoding network, up-samples the stylized features back to the original image size with interpolation and generates the final stylized image. The discriminative network distinguishes the generated style images from natural ones.
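The following is a heavily simplified PyTorch sketch of such a hybrid pipeline under stated assumptions: the layer widths, the single cross-attention layer standing in for the content encoder, style encoder, and decoder of the style transformation network, and the omission of residual connections and the discriminative network are all simplifications for illustration, not the paper's implementation. The key step is the cross-attention call, in which content features provide the queries and style features provide the keys and values, so each content position is rewritten as a weighted sum of style features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Convolutional encoder: extracts high-level features while reducing the
    spatial size (here by a factor of 4); the channel count is an assumption."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class StyleTransform(nn.Module):
    """Cross-attention style transformation: content features act as queries,
    style features as keys/values, so each content position is replaced by a
    weighted sum of style features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, content, style):
        b, c, h, w = content.shape
        q = content.flatten(2).transpose(1, 2)   # (B, H*W, C) content queries
        kv = style.flatten(2).transpose(1, 2)    # (B, H'*W', C) style keys/values
        out, _ = self.attn(q, kv, kv)            # weighted sum of style features
        return out.transpose(1, 2).reshape(b, c, h, w)

class ConvDecoder(nn.Module):
    """Convolutional decoder, symmetric to the encoder: interpolation-based
    up-sampling maps the stylized features back to the image domain."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 3, 3, padding=1)
    def forward(self, x):
        x = F.relu(self.conv1(F.interpolate(x, scale_factor=2, mode='nearest')))
        return self.conv2(F.interpolate(x, scale_factor=2, mode='nearest'))

# One stylization pass on random data (image sizes are assumptions).
content, style = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
enc, transform, dec = ConvEncoder(), StyleTransform(), ConvDecoder()
stylized = dec(transform(enc(content), enc(style)))
print(stylized.shape)  # torch.Size([1, 3, 256, 256])
```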
Result
Qualitative and quantitative comparisons are carried out against several state-of-the-art arbitrary style transfer methods. The qualitative analysis consists of two parts: a visual comparison and a user study. The visual comparison shows that the proposed network generates smoother and cleaner stylized images, and the user study shows that its results are preferred by users. In the quantitative comparison of stylization speed, the proposed network ranks third, which is within an acceptable range, and its speed remains stable as the image size grows from 256 × 256 pixels to 512 × 512 pixels. The proposed network is also compared with an existing Transformer-based arbitrary style transfer method to highlight the differences between the two. An ablation experiment verifies the effectiveness of the discriminative network: introducing it strengthens the feature extraction ability of the network and yields more realistic images. To demonstrate the flexibility of the proposed network, the trained model is further applied to other style transfer tasks, including content-style trade-off, style interpolation, and region painting.
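Content-style trade-off and style interpolation are commonly realized at test time by blending features before decoding; the sketch below shows one such generic scheme and is an assumption about the mechanism rather than the paper's exact procedure.

```python
import torch

def content_style_tradeoff(content_feat, stylized_feat, alpha=0.7):
    """Blend stylized features with the original content features:
    alpha = 1 gives full stylization, alpha = 0 returns the content."""
    return alpha * stylized_feat + (1.0 - alpha) * content_feat

def style_interpolation(stylized_feats, weights):
    """Convex combination of features stylized with different style images."""
    weights = torch.tensor(weights) / sum(weights)
    return sum(w * f for w, f in zip(weights, stylized_feats))

# Example with random feature maps (shapes are assumptions).
c = torch.randn(1, 256, 64, 64)
s1, s2 = torch.randn_like(c), torch.randn_like(c)
blended = content_style_tradeoff(c, s1, alpha=0.5)   # half-strength stylization
mixed = style_interpolation([s1, s2], [0.3, 0.7])    # mix of two styles
print(blended.shape, mixed.shape)
```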
Conclusion
The proposed hybrid network combines the strengths of CNNs and the Transformer and introduces a discriminative network, which makes the generated stylized images more realistic and vivid. Experimental results show that the network achieves a competitive stylization speed and preserves both the content structure and the stylistic features, particularly for smaller image sizes (e.g., 256 × 256 pixels).
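The role of the discriminative network can be illustrated with a generic adversarial objective: a small patch-style discriminator is trained to separate natural style images from generated ones, while the stylization network is trained to fool it, which pushes the outputs toward more realistic textures. The architecture and the non-saturating loss below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """A small patch-style discriminator that outputs a map of real/fake logits.
    Its architecture here is an assumption, not the paper's exact design."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, padding=1),
        )
    def forward(self, x):
        return self.net(x)

def adversarial_losses(disc, real_style, generated):
    """Generic non-saturating GAN objective: the discriminator learns to tell
    natural style images from generated ones, the generator learns to fool it."""
    d_real = disc(real_style)
    d_fake = disc(generated.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    g_fake = disc(generated)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss

disc = PatchDiscriminator()
real = torch.randn(2, 3, 256, 256)
fake = torch.randn(2, 3, 256, 256, requires_grad=True)  # stands in for generator output
d_loss, g_loss = adversarial_losses(disc, real, fake)
print(d_loss.item(), g_loss.item())
```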
计算机视觉；图像处理；任意风格迁移；注意力机制；Transformer
computer vision; image processing; multi-style image information transfer; attention mechanism; Transformer