Critical review of human face reenactment methods
Vol. 27, Issue 9, Pages: 2629-2651 (2022)
Received: 20 January 2022; Revised: 25 May 2022; Accepted: 2 June 2022; Published: 16 September 2022
DOI: 10.11834/jig.211243

With the development of image generation research in computer vision, face reenactment has attracted wide attention. This technique aims to synthesize a new talking-face image or video from the identity of a source face image and the mouth shape, expression, pose and other information provided by the driving signal. Face reenactment has a broad range of applications, such as virtual anchor generation, online teaching, game avatar customization, lip synchronization in dubbed videos and video-conference compression. Although the technique has a short history, a large body of research has emerged. However, there are currently almost no surveys, domestic or international, that focus on face reenactment; overviews of face reenactment appear only as DeepFake material within surveys on DeepFake detection. In view of this, this paper organizes and summarizes the development of the face reenactment field. Starting from face reenactment models, it describes the open problems, the classification of models and the representation of driving facial features, lists and introduces the datasets commonly used to train face reenactment models and the metrics used to evaluate them, and summarizes, analyzes and compares recent research. Finally, it concludes with the evolution trends, current challenges, future directions, potential harms and corresponding countermeasures of face reenactment.
Image and video data have been growing dramatically, and an increasing share of them is generated by artificial intelligence (AI). Face reenactment has developed as one branch of this facial image and video generation. Given a source face and driving motion information, face reenactment aims to generate a reenacted face image, or a corresponding video, that follows the expression, mouth shape, eye gaze and head pose of the driving information while preserving the identity of the source face. Face reenactment methods can generate a wide variety of feature- and motion-conditioned face videos, are applicable with few constraints, and have become a research focus in the field of face generation. However, almost no reviews are devoted specifically to face reenactment. In view of this, we present a critical review of the development of face reenactment beyond DeepFake-detection contexts. The review is organized around nine perspectives: 1) the universal pipeline of face reenactment models; 2) facial information representation; 3) key challenges and barriers; 4) the classification of related methods; 5) an introduction to the individual face reenactment methods; 6) evaluation metrics; 7) commonly used datasets; 8) practical applications; and 9) conclusions and future prospects.
In the universal pipeline, identity and background information are extracted from the source face, motion features are extracted from the driving information, and the two are combined to generate the reenacted face. Latent codes, 3D morphable face model (3DMM) coefficients, facial landmarks and facial action units all serve as motion features. Several challenges recur throughout the related research. The identity mismatch problem refers to the inability of a face reenactment model to preserve the identity of the source face. Temporal or background inconsistency means that the generated videos exhibit inter-frame jitter or obvious artifacts between the facial contour and the background. Identity constraints originate from model design and training procedure, so that some models can only reenact the specific persons seen in their training data.
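To make this pipeline concrete, the sketch below shows the common two-encoder structure in PyTorch: a source encoder for identity and background features, a motion encoder for the driving representation, and a generator that fuses the two. All module names and layer sizes are illustrative assumptions, not any specific published architecture.

```python
# Minimal sketch of the generic face reenactment pipeline (assumed structure):
# source encoder -> identity/background features, motion encoder -> motion code,
# generator -> reenacted face combining both.
import torch
import torch.nn as nn

class SourceEncoder(nn.Module):
    """Extracts identity and background features from the source face."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, dim, 4, 2, 1), nn.ReLU(),
        )
    def forward(self, src_img):
        return self.net(src_img)

class MotionEncoder(nn.Module):
    """Maps a driving representation (here: 68 flattened x/y landmarks) to a motion code."""
    def __init__(self, in_dim=136, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, driving_feat):
        return self.net(driving_feat)

class Generator(nn.Module):
    """Fuses identity features with the motion code and decodes the reenacted face."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Conv2d(dim * 2, dim, 3, 1, 1)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, 64, 3, 1, 1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, 1, 1), nn.Tanh(),
        )
    def forward(self, id_feat, motion_code):
        b, c, h, w = id_feat.shape
        motion_map = motion_code[:, :, None, None].expand(b, c, h, w)
        return self.up(self.fuse(torch.cat([id_feat, motion_map], dim=1)))

src = torch.randn(1, 3, 256, 256)         # source face image
drv = torch.randn(1, 136)                 # driving landmarks (flattened)
reenacted = Generator()(SourceEncoder()(src), MotionEncoder()(drv))  # (1, 3, 256, 256)
```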
Face reenactment methods fall into image-driven and cross-modality-driven methods according to the modality of the driving information. Image-driven methods can be further divided into four categories according to how the driving information is represented: facial landmarks, 3DMM coefficients, motion field prediction and feature decoupling. The landmark-based and 3DMM-based categories can be subdivided further according to whether the model is identity-restricted, i.e., whether it can reenact subjects unseen during training. Each category is presented together with its representative model pipeline and follow-up improvements. For cross-modality-driven methods, text-driven and audio-driven methods are introduced. These are ill-posed problems because the facial motion corresponding to a given audio or text clip admits multiple plausible solutions; for instance, different head poses or facial motions of the same identity can produce essentially the same audio. Cross-modality face reenactment is therefore challenging and attracts growing attention, and it is also reviewed comprehensively. Text-driven methods have progressed through three stages in terms of driving content: methods that additionally require audio, restricted text-driven methods, and arbitrary text-driven methods. Audio-driven methods can be divided into two categories depending on whether additional driving information is required, such as eye-blink labels or head-pose videos that provide auxiliary cues during generation.
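For orientation, the 3DMM-based category relies on the classical morphable-model formulation, in which a 3D face shape is written as a linear combination of identity and expression bases; the notation below is the commonly used form and is included only for illustration:

$$
\mathbf{S} = \bar{\mathbf{S}} + A_{\mathrm{id}}\,\boldsymbol{\alpha} + A_{\mathrm{exp}}\,\boldsymbol{\beta},
$$

where $\bar{\mathbf{S}}$ is the mean shape, $A_{\mathrm{id}}$ and $A_{\mathrm{exp}}$ are identity and expression bases, and $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$ are the corresponding coefficients. Methods in this category typically keep the source identity coefficients $\boldsymbol{\alpha}$ while transferring the driving expression coefficients $\boldsymbol{\beta}$ together with the head pose.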
In addition, comparative experiments are conducted to evaluate the performance of the various methods, considering both image quality and facial motion accuracy. For image quality, the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), cumulative probability of blur detection (CPBD), Fréchet inception distance (FID) and other traditional image generation metrics are adopted.
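For reference, two of these metrics can be stated compactly, where MAX is the peak pixel value, MSE is the mean squared error between generated and ground-truth frames, and $\mu$, $\Sigma$ denote the mean and covariance of Inception features of real ($r$) and generated ($g$) images:

$$
\mathrm{PSNR} = 10\,\log_{10}\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}, \qquad
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2} + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right).
$$

Higher PSNR, SSIM and CPBD indicate better quality, whereas a lower FID is better.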
To judge facial motion accuracy, landmark differences, action unit detection analysis and pose differences are used; in most cases the landmarks, the presence of action units and the Euler angles are all predicted by corresponding pre-trained models. For audio-driven methods, the degree of lip synchronization is additionally estimated with the aid of a pretrained evaluation model. Apart from these objective evaluations, subjective metrics such as user studies are applied as well.
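A typical way to score motion accuracy, sketched below with a hypothetical `detect_landmarks` helper standing in for any pretrained landmark detector, is to average the resolution-normalized distance between landmarks detected on the generated frame and on the driving frame:

```python
# Hedged sketch of a landmark-distance metric for facial motion accuracy.
# `detect_landmarks` is a placeholder for any pretrained 68-point detector;
# it is assumed to return an (N, 2) array of (x, y) coordinates per frame.
import numpy as np

def landmark_distance(generated_frame, driving_frame, detect_landmarks, image_size=256):
    lm_gen = np.asarray(detect_landmarks(generated_frame), dtype=np.float64)
    lm_drv = np.asarray(detect_landmarks(driving_frame), dtype=np.float64)
    # Mean Euclidean distance over landmarks, normalized by image size so that
    # scores are comparable across resolutions (lower is better).
    return np.linalg.norm(lm_gen - lm_drv, axis=1).mean() / image_size

# Usage: average landmark_distance over all frames of a generated video to obtain
# a sequence-level motion-accuracy score.
```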
Furthermore, the datasets commonly used in face reenactment are described. Each contains face images or videos covering various expressions, view angles and illumination conditions, or the corresponding talking audio; the videos are usually collected from interviews, news broadcasts or actor recordings. To reflect different levels of difficulty, image and video datasets are grouped into indoor and outdoor scenarios: indoor scenarios typically feature plain white or grey walls, whereas outdoor scenarios involve real moving backgrounds or live news studios.
In the concluding part, practical applications and potential threats are critically discussed. Face reenactment can benefit the entertainment industry through movie dubbing, video production, game character avatars and the animation of old photographs; it can also be used for video-conference compression, online customer service, virtual anchors and 3D digital humans. However, it must be cautioned that face reenactment misused by lawbreakers can serve defamation, the spread of false information and the creation of harmful DeepFake media, which damages social stability and causes panic on social media. It is therefore important to give more consideration to the ethical issues of face reenactment.
Furthermore, the development status of each category and the corresponding future directions are presented. Overall, model optimization and robustness across generation scenarios are the two main concerns. Optimization focuses on alleviating data dependence, feature disentanglement, real-time inference and improved evaluation metrics. Robustness refers to generating high-quality reenacted faces under conditions such as facial occlusion, outdoor scenes, large head poses and complicated illumination. In summary, this critical review covers the universal pipeline of face reenactment models, the main challenges, the classification and detailed explanation of each category of methods, the evaluation metrics and commonly used datasets, and an analysis of current research with future prospects, and it is intended to serve as an introduction to and guide for face reenactment research.