Critical review of human face reenactment methods
2022, Vol. 27, No. 9: 2629-2651
Print publication date: 2022-09-16
Accepted: 2022-06-02
DOI: 10.11834/jig.211243
Jin Liu, Peng Chen, Xi Wang, Xiaomeng Fu, Jiao Dai, Jizhong Han. Critical review of human face reenactment methods[J]. Journal of Image and Graphics, 2022,27(9):2629-2651.
With the development of image generation research in computer vision, face reenactment has attracted widespread attention. This technique aims to synthesize a new talking-head image or video from the identity of a source face image and the mouth shape, expression, pose, and other cues provided by driving information. Face reenactment has a wide range of applications, such as virtual anchor generation, online teaching, game avatar customization, lip synchronization for dubbed videos, and video conference compression. Although the technique has developed over a relatively short period, a large body of research has emerged. However, there are currently almost no surveys, domestic or international, that focus specifically on face reenactment; overviews of face reenactment research appear only as DeepFake content within surveys of DeepFake detection. In view of this, this paper organizes and summarizes the development of the face reenactment field. Starting from the face reenactment model, it explains the open problems of face reenactment, the classification of models, and the representation of driving facial features; it lists and introduces the datasets commonly used to train face reenactment models and the metrics used to evaluate them; it summarizes, analyzes, and compares recent research on face reenactment; and finally it summarizes and looks ahead to the evolution trends, current challenges, future research directions, potential harms, and countermeasures of face reenactment.
The amount of artificial intelligence (AI)-generated image and video content has been increasing dramatically, and face reenactment has developed as part of this trend toward generated facial images and videos. Given source face information and driving motion information, face reenactment aims to generate a reenacted face image or video that follows the expression, mouth shape, eye gaze, and pose of the driving information while preserving the identity of the source face. Face reenactment methods can generate face videos driven by a wide variety of features and motions, are applicable under relatively loose constraints, and have become a research focus in the field of face generation. However, almost no surveys have been written specifically on face reenactment. In view of this, we carry out a critical review of the development of face reenactment beyond DeepFake-detection contexts. Our review is organized around nine perspectives: 1) the general pipeline of face reenactment models; 2) facial information representation; 3) key challenges and barriers; 4) the classification of related methods; 5) an introduction to representative face reenactment methods; 6) evaluation metrics; 7) commonly used datasets; 8) practical applications; and 9) conclusions and future prospects. In a typical pipeline, identity and background information are extracted from the source face while motion features are extracted from the driving information, and the two are combined to generate the reenacted face. In general, latent codes, 3D morphable model (3DMM) coefficients, facial landmarks, and facial action units all serve as motion features. In addition, several challenges and problems recur throughout the related research. The identity mismatch problem refers to the inability of a face reenactment model to preserve the identity of the source face. Temporal or background inconsistency means that the generated face videos suffer from cross-frame jitter or obvious artifacts between the facial contour and the background. Identity constraints originate from the model design and training procedure, so that some models can reenact only the specific persons seen in the training data.
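To make the generic pipeline above concrete, the following is a minimal sketch, assuming a simple PyTorch encoder-decoder; the module shapes and the use of 68 2D landmarks as the driving representation are illustrative assumptions, not the architecture of any particular paper surveyed here.

```python
import torch
import torch.nn as nn

class FaceReenactor(nn.Module):
    """Illustrative identity/motion fusion pipeline; not the model of any specific paper."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Identity and background features are extracted from the source face image.
        self.identity_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Motion features are assumed here to come from 68 2D facial landmarks
        # (136 values); 3DMM coefficients or action units could be used instead.
        self.motion_encoder = nn.Sequential(
            nn.Linear(136, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # The generator fuses both feature streams into the reenacted face.
        self.generator = nn.Sequential(
            nn.ConvTranspose2d(feat_dim * 2, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, source_img, driving_motion):
        id_feat = self.identity_encoder(source_img)               # B x C x H/4 x W/4
        mo_feat = self.motion_encoder(driving_motion)             # B x C
        mo_feat = mo_feat[:, :, None, None].expand_as(id_feat)    # broadcast over space
        return self.generator(torch.cat([id_feat, mo_feat], dim=1))

# A 128x128 source face driven by one frame of 68 2D landmarks (random data here).
model = FaceReenactor()
reenacted = model(torch.randn(1, 3, 128, 128), torch.randn(1, 136))
print(reenacted.shape)  # torch.Size([1, 3, 128, 128])
```

Real systems replace these placeholder encoders with much deeper networks and adversarial training, but separating an identity stream from a motion stream is the common structural idea described above.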
Regarding the categorization of face reenactment methods, image-driven and cross-modality-driven methods are distinguished according to the modality of the driving information. Image-driven methods can be divided into four categories based on how the driving information is represented: facial landmarks, 3DMM coefficients, predicted motion fields, and decoupled features. The landmark-based and 3DMM-based methods are further split into identity-restricted and identity-unrestricted subclasses according to whether the model can reenact subjects unseen during training. Each category, its representative model pipeline, and subsequent improvements are illustrated in detail. As for cross-modality-driven methods, text-driven and audio-driven approaches are introduced. These are ill-posed problems, because the facial motion corresponding to a given audio or text input may have multiple valid solutions; for instance, different facial poses or motions of the same identity can produce essentially the same audio. Cross-modality face reenactment has nevertheless attracted increasing attention and is also introduced comprehensively. Text-driven methods have developed through three progressive stages with respect to the driving content: methods requiring additional audio, restricted text-driven methods, and arbitrary text-driven methods. Audio-driven methods can be further divided into two categories depending on whether additional driving information is required, where the additional driving information, such as eye-blink labels or head-pose videos, offers auxiliary signals in the generation procedure.
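As one concrete example of a driving representation, facial landmarks can be obtained from each driving frame with an off-the-shelf detector. The sketch below assumes dlib and its separately downloaded 68-point shape predictor; the file names are placeholders, and this is just one possible way to produce the landmark input discussed above.

```python
import dlib
import numpy as np

# Placeholder paths; the 68-point predictor file must be downloaded separately from dlib.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image):
    """Return a (68, 2) array of landmark coordinates for the first detected face, or None."""
    faces = detector(image, 1)          # upsample once to find smaller faces
    if len(faces) == 0:
        return None
    shape = predictor(image, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)

# Landmarks from each driving frame can serve as the motion representation of a
# landmark-based reenactment model, e.g., flattened into a 136-dimensional vector.
frame = dlib.load_rgb_image("driving_frame.png")  # placeholder file name
landmarks = extract_landmarks(frame)
```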
Moreover, comparative experiments are conducted to evaluate the performance of the various methods, taking both image quality and facial motion accuracy into consideration. For image quality, the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), cumulative probability of blur detection (CPBD), Fréchet inception distance (FID), and other traditional image generation metrics are adopted together. To judge facial motion accuracy, landmark differences, action unit detection analysis, and pose differences are utilized; in most cases the landmarks, the presence of action units, and the Euler angles are all predicted by corresponding pre-trained models. For audio-driven methods, the degree of lip synchronization is also estimated with the aid of a pretrained evaluation model.
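As a small illustration of the image-quality side of this evaluation, the sketch below computes PSNR and SSIM between a ground-truth frame and a reenacted frame using scikit-image (0.19 or later); the file names are placeholders, and this is only one straightforward way to implement the metrics named above.

```python
import numpy as np
from skimage.io import imread
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder file names for a ground-truth frame and the corresponding reenacted frame.
real = imread("real_frame.png").astype(np.float64) / 255.0
fake = imread("reenacted_frame.png").astype(np.float64) / 255.0

# Higher PSNR (in dB) and SSIM (up to 1.0) indicate closer agreement with the ground truth.
psnr = peak_signal_noise_ratio(real, fake, data_range=1.0)
ssim = structural_similarity(real, fake, data_range=1.0, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```

FID and CPBD rely on additional components (a pretrained Inception network and edge-based blur analysis, respectively) and are therefore omitted from this sketch.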
Beyond these objective evaluations, subjective metrics such as user studies are applied as well. Furthermore, the datasets commonly used in face reenactment are described; each contains face images or videos with various expressions, viewing angles, and illumination conditions, or the corresponding talking audio. The videos are usually collected from interviews, news broadcasts, or actor recordings. To reflect different levels of difficulty, the image and video datasets cover both indoor and outdoor scenarios: the indoor scenario typically features plain white or grey walls, while the outdoor scenario involves actual moving scenes or news studios. In the concluding part, practical applications and potential threats are critically discussed. Face reenactment can contribute to the entertainment industry, for example in movie dubbing, video production, game character avatars, and old photo colorization, and it can also be utilized for video conference compression, online customer service, virtual streamers, and 3D digital humans. However, it must be cautioned that misuse by lawbreakers, such as defamation, spreading false information, or creating harmful DeepFake media content, can damage social stability and cause panic on social media. It is therefore important to give greater consideration to the ethical issues surrounding face reenactment. Furthermore, the development status of each category and the corresponding future directions are presented. Overall, model optimization and robustness across generation scenarios are the two main concerns. Optimization focuses on alleviating data dependence, feature disentanglement, real-time inference, and improved evaluation metrics, while improving robustness means generating high-quality reenacted faces under conditions such as face occlusion, outdoor scenes, large head poses, and complicated illumination. In a word, our critical review covers the general pipeline of face reenactment models, the main challenges, the classification and detailed explanation of each category of methods, the evaluation metrics and commonly used datasets, and an analysis of current research together with future prospects, providing an introduction to and guidance for face reenactment research.
Keywords: artificial intelligence (AI); computer vision; deep learning; generative adversarial network (GAN); DeepFake; face reenactment