Critical review of human face reenactment methods
Vol. 27, Issue 9, Pages: 2629-2651 (2022)
Received: 20 January 2022; Revised: 25 May 2022; Accepted: 2 June 2022; Published: 16 September 2022
DOI: 10.11834/jig.211243

With the development of image generation research in computer vision, face reenactment has attracted wide attention. This technique aims to synthesize a new talking-face image or video from the identity of a source face image and the mouth shape, expression, pose and other information provided by the driving signal. Face reenactment has a broad range of applications, such as virtual anchor generation, online teaching, game avatar customization, lip synchronization in dubbed videos and video-conference compression. Although the technique has a short history, a large body of research has emerged. However, there are currently almost no surveys, domestic or international, that focus on face reenactment; overviews of face reenactment appear only as DeepFake material within surveys on DeepFake detection. In view of this, this paper organizes and summarizes the development of the face reenactment field. Starting from face reenactment models, it describes the open problems, the classification of models and the representation of driving facial features, lists and introduces the datasets commonly used to train face reenactment models and the metrics used to evaluate them, and summarizes, analyzes and compares recent research. Finally, it concludes with the evolution trends, current challenges, future directions, potential harms and corresponding countermeasures of face reenactment.
Image and video data have been growing dramatically, and an increasing share of them is generated by artificial intelligence (AI). Face reenactment has developed as one branch of this facial image and video generation. Given a source face and driving motion information, face reenactment aims to generate a reenacted face image, or a corresponding video, that follows the expression, mouth shape, eye gaze and head pose of the driving information while preserving the identity of the source face. Face reenactment methods can generate a wide variety of feature- and motion-conditioned face videos, are applicable with few constraints, and have become a research focus in the field of face generation. However, almost no reviews are devoted specifically to face reenactment. In view of this, we present a critical review of the development of face reenactment beyond DeepFake-detection contexts. The review is organized around nine perspectives: 1) the universal pipeline of face reenactment models; 2) facial information representation; 3) key challenges and barriers; 4) the classification of related methods; 5) an introduction to the individual face reenactment methods; 6) evaluation metrics; 7) commonly used datasets; 8) practical applications; and 9) conclusions and future prospects.
In the universal pipeline, identity and background information are extracted from the source face, motion features are extracted from the driving information, and the two are combined to generate the reenacted face. Latent codes, 3D morphable face model (3DMM) coefficients, facial landmarks and facial action units all serve as motion features. Several challenges recur throughout the related research. The identity mismatch problem refers to the inability of a face reenactment model to preserve the identity of the source face. Temporal or background inconsistency means that the generated videos exhibit inter-frame jitter or obvious artifacts between the facial contour and the background. Identity constraints originate from model design and training procedure, so that some models can only reenact the specific persons seen in their training data.
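To make this pipeline concrete, the sketch below shows the common two-encoder structure in PyTorch: a source encoder for identity and background features, a motion encoder for the driving representation, and a generator that fuses the two. All module names and layer sizes are illustrative assumptions, not any specific published architecture.

```python
# Minimal sketch of the generic face reenactment pipeline (assumed structure):
# source encoder -> identity/background features, motion encoder -> motion code,
# generator -> reenacted face combining both.
import torch
import torch.nn as nn

class SourceEncoder(nn.Module):
    """Extracts identity and background features from the source face."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, dim, 4, 2, 1), nn.ReLU(),
        )
    def forward(self, src_img):
        return self.net(src_img)

class MotionEncoder(nn.Module):
    """Maps a driving representation (here: 68 flattened x/y landmarks) to a motion code."""
    def __init__(self, in_dim=136, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, driving_feat):
        return self.net(driving_feat)

class Generator(nn.Module):
    """Fuses identity features with the motion code and decodes the reenacted face."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Conv2d(dim * 2, dim, 3, 1, 1)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, 64, 3, 1, 1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, 1, 1), nn.Tanh(),
        )
    def forward(self, id_feat, motion_code):
        b, c, h, w = id_feat.shape
        motion_map = motion_code[:, :, None, None].expand(b, c, h, w)
        return self.up(self.fuse(torch.cat([id_feat, motion_map], dim=1)))

src = torch.randn(1, 3, 256, 256)         # source face image
drv = torch.randn(1, 136)                 # driving landmarks (flattened)
reenacted = Generator()(SourceEncoder()(src), MotionEncoder()(drv))  # (1, 3, 256, 256)
```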
Face reenactment methods fall into image-driven and cross-modality-driven methods according to the modality of the driving information. Image-driven methods can be further divided into four categories according to how the driving information is represented: facial landmarks, 3DMM coefficients, motion field prediction and feature decoupling. The landmark-based and 3DMM-based categories can be subdivided further according to whether the model is identity-restricted, i.e., whether it can reenact subjects unseen during training. Each category is presented together with its representative model pipeline and follow-up improvements. For cross-modality-driven methods, text-driven and audio-driven methods are introduced. These are ill-posed problems because the facial motion corresponding to a given audio or text clip admits multiple plausible solutions; for instance, different head poses or facial motions of the same identity can produce essentially the same audio. Cross-modality face reenactment is therefore challenging and attracts growing attention, and it is also reviewed comprehensively. Text-driven methods have progressed through three stages in terms of driving content: methods that additionally require audio, restricted text-driven methods, and arbitrary text-driven methods. Audio-driven methods can be divided into two categories depending on whether additional driving information is required, such as eye-blink labels or head-pose videos that provide auxiliary cues during generation.
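For orientation, the 3DMM-based category relies on the classical morphable-model formulation, in which a 3D face shape is written as a linear combination of identity and expression bases; the notation below is the commonly used form and is included only for illustration:

$$
\mathbf{S} = \bar{\mathbf{S}} + A_{\mathrm{id}}\,\boldsymbol{\alpha} + A_{\mathrm{exp}}\,\boldsymbol{\beta},
$$

where $\bar{\mathbf{S}}$ is the mean shape, $A_{\mathrm{id}}$ and $A_{\mathrm{exp}}$ are identity and expression bases, and $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$ are the corresponding coefficients. Methods in this category typically keep the source identity coefficients $\boldsymbol{\alpha}$ while transferring the driving expression coefficients $\boldsymbol{\beta}$ together with the head pose.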
In addition, comparative experiments are conducted to evaluate the performance of the various methods, considering both image quality and facial motion accuracy. For image quality, the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), cumulative probability of blur detection (CPBD), Fréchet inception distance (FID) and other traditional image generation metrics are adopted.
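For reference, two of these metrics can be stated compactly, where MAX is the peak pixel value, MSE is the mean squared error between generated and ground-truth frames, and $\mu$, $\Sigma$ denote the mean and covariance of Inception features of real ($r$) and generated ($g$) images:

$$
\mathrm{PSNR} = 10\,\log_{10}\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}, \qquad
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2} + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right).
$$

Higher PSNR, SSIM and CPBD indicate better quality, whereas a lower FID is better.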
To judge facial motion accuracy, landmark differences, action unit detection analysis and pose differences are used; in most cases the landmarks, the presence of action units and the Euler angles are all predicted by corresponding pre-trained models. For audio-driven methods, the degree of lip synchronization is additionally estimated with the aid of a pretrained evaluation model. Apart from these objective evaluations, subjective metrics such as user studies are applied as well.
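A typical way to score motion accuracy, sketched below with a hypothetical `detect_landmarks` helper standing in for any pretrained landmark detector, is to average the resolution-normalized distance between landmarks detected on the generated frame and on the driving frame:

```python
# Hedged sketch of a landmark-distance metric for facial motion accuracy.
# `detect_landmarks` is a placeholder for any pretrained 68-point detector;
# it is assumed to return an (N, 2) array of (x, y) coordinates per frame.
import numpy as np

def landmark_distance(generated_frame, driving_frame, detect_landmarks, image_size=256):
    lm_gen = np.asarray(detect_landmarks(generated_frame), dtype=np.float64)
    lm_drv = np.asarray(detect_landmarks(driving_frame), dtype=np.float64)
    # Mean Euclidean distance over landmarks, normalized by image size so that
    # scores are comparable across resolutions (lower is better).
    return np.linalg.norm(lm_gen - lm_drv, axis=1).mean() / image_size

# Usage: average landmark_distance over all frames of a generated video to obtain
# a sequence-level motion-accuracy score.
```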
Furthermore, the datasets commonly used in face reenactment are described. Each contains face images or videos covering various expressions, view angles and illumination conditions, or the corresponding talking audio; the videos are usually collected from interviews, news broadcasts or actor recordings. To reflect different levels of difficulty, image and video datasets are grouped into indoor and outdoor scenarios: indoor scenarios typically feature plain white or grey walls, whereas outdoor scenarios involve real moving backgrounds or live news studios.
In the concluding part, practical applications and potential threats are critically discussed. Face reenactment can benefit the entertainment industry through movie dubbing, video production, game character avatars and the animation of old photographs; it can also be used for video-conference compression, online customer service, virtual anchors and 3D digital humans. However, it must be cautioned that face reenactment misused by lawbreakers can serve defamation, the spread of false information and the creation of harmful DeepFake media, which damages social stability and causes panic on social media. It is therefore important to give more consideration to the ethical issues of face reenactment.
Furthermore, the development status of each category and the corresponding future directions are presented. Overall, model optimization and robustness across generation scenarios are the two main concerns. Optimization focuses on alleviating data dependence, feature disentanglement, real-time inference and improved evaluation metrics. Robustness refers to generating high-quality reenacted faces under conditions such as facial occlusion, outdoor scenes, large head poses and complicated illumination. In summary, this critical review covers the universal pipeline of face reenactment models, the main challenges, the classification and detailed explanation of each category of methods, the evaluation metrics and commonly used datasets, and an analysis of current research with future prospects, and it is intended to serve as an introduction to and guide for face reenactment research.