Multi-Modal Digital Human Modeling, Synthesis, and Driving: A Survey

Gao Xuan, Liu Dongyu, Zhang Juyong (University of Science and Technology of China)

Abstract
A multimodal digital human is a realistic and natural virtual human that possesses multimodal cognitive and interaction capabilities as well as human-like reasoning and behavioral logic. In recent years, with the cross-fertilization and rapid development of computer vision, natural language processing, and related fields, the relevant technologies have advanced significantly. This article discusses three topics of particular importance in graphics and vision: multimodal head animation, multimodal body animation, and multimodal digital human portrait creation, introducing the methodologies and representative works of each. Under multimodal head animation, it covers work on speech-driven and expression-driven head models. Under multimodal body animation, it covers RNN-based, Transformer-based, and denoising-diffusion-based body animation generation. Under multimodal digital human portrait creation, it covers portrait creation guided by visual-language similarity, portrait creation guided by multimodal denoising diffusion models, and three-dimensional multimodal generative models of virtual humans. The article introduces and categorizes representative works in these directions, summarizes existing methods, and looks ahead to possible future research directions.
Keywords
Multi-Modal Digital Human Modeling, Synthesis, and Driving: A Survey

Gao Xuan, Liu Dongyu, Zhang Juyong (University of Science and Technology of China)

Abstract
A multimodal digital human is a digital avatar with multimodal cognitive and interaction capabilities and human-like reasoning and behavioral logic. In recent years, related technologies have advanced significantly, driven by the cross-fertilization and rapid development of computer vision, natural language processing, and neighboring fields. This article discusses three major themes at the intersection of computer graphics and computer vision: multimodal head animation, multimodal body animation, and multimodal portrait creation, and introduces the methodologies and representative works in each area. Under multimodal head animation, it presents research on speech-driven and expression-driven head models. Under multimodal body animation, it examines RNN-based, Transformer-based, and DDPM-based approaches to body animation. For multimodal portrait creation, it covers portrait creation guided by visual-linguistic similarity, portrait creation guided by multimodal denoising diffusion models, and 3D multimodal generative models of digital portraits. The article classifies and reviews representative works in these research directions, summarizes existing methods, and points out potential future research directions.

The extended discussion delves into these key directions: multimodal head animation, multimodal body animation, and the construction of multimodal digital human representations. In multimodal head animation, we examine two major tasks: expression-driven and speech-driven animation. Expression-driven head animation relies on explicit or implicit parameterized models, using mesh surfaces or neural radiance fields to improve rendering quality. Explicit models build on three-dimensional morphable models (3DMM) and linear formulations, but face challenges such as limited expressive capacity, non-differentiable rendering, and difficulty in modeling personalized features. In contrast, implicit models, especially those based on neural radiance fields (NeRF), demonstrate superior expressive capacity and realism. For speech-driven head animation, we review both two-dimensional and three-dimensional methods, with particular attention to the advantages of neural radiance fields in enhancing realism. Two-dimensional speech-driven talking-head video generation relies on techniques such as generative adversarial networks and image transfer, but depends on three-dimensional priors and structural cues. NeRF-based methods such as AD-NeRF and SSP-NeRF achieve end-to-end training with differentiable volume rendering and significantly improve rendering realism, although slow training and inference remain open challenges.

The multimodal body animation section focuses on speech-driven body animation, music-driven dance, and text-driven body animation. We emphasize the importance of learning speech semantics and melody, and discuss the use of RNNs, Transformers, and denoising diffusion models in this field. Transformers have gradually replaced RNNs as the mainstream architecture, gaining a clear advantage in sequence modeling through attention mechanisms. We also highlight body animation generation based on denoising diffusion models, covering works such as FLAME, MDM, and MotionDiffuse, as well as multimodal denoising networks conditioned on music and text.
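As an illustration of the diffusion-based body animation pipeline described above (a minimal sketch under stated assumptions, not the implementation of any surveyed method; MotionDenoiser, pose_dim, and training_step are hypothetical names), the snippet below shows a small Transformer denoiser and one DDPM training step that predicts the clean motion sequence from a noisy one under a text or music conditioning embedding, in the spirit of MDM and MotionDiffuse:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MotionDenoiser(nn.Module):
        """Hypothetical Transformer denoiser for motion diffusion (illustrative only)."""
        def __init__(self, pose_dim=66, cond_dim=512, d_model=256, n_layers=4, n_steps=1000):
            super().__init__()
            self.in_proj = nn.Linear(pose_dim, d_model)
            self.cond_proj = nn.Linear(cond_dim, d_model)
            self.t_embed = nn.Embedding(n_steps, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.out_proj = nn.Linear(d_model, pose_dim)

        def forward(self, x_t, t, cond):
            # Prepend one token carrying the condition (text/music embedding) and timestep.
            tok = (self.cond_proj(cond) + self.t_embed(t)).unsqueeze(1)
            h = self.encoder(torch.cat([tok, self.in_proj(x_t)], dim=1))
            return self.out_proj(h[:, 1:])          # predicted clean motion x0

    def training_step(model, x0, cond, alphas_cumprod):
        """One DDPM training step: corrupt x0 at a random timestep, predict it back."""
        b = x0.shape[0]
        t = torch.randint(0, alphas_cumprod.shape[0], (b,))
        a_bar = alphas_cumprod[t].view(b, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward diffusion q(x_t | x0)
        return F.mse_loss(model(x_t, t, cond), x0)               # simple x0-prediction loss

At inference time, such a model would be applied iteratively, starting from Gaussian noise and progressively denoising toward a motion sequence consistent with the conditioning signal.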
For the construction of multimodal digital human representations, the article emphasizes avatar construction guided by visual-language similarity and by denoising diffusion models. It also addresses the need for large-scale, diverse datasets to support more powerful and general-purpose generative models.

In summary, across the three key aspects of multimodal digital humans (head animation, body animation, and representation construction), the findings are as follows. Explicit head models are simple, editable, and computationally efficient, but they lack expressive capacity and face rendering challenges, especially in modeling facial personalization and non-facial regions. Implicit models, especially NeRF-based ones, demonstrate stronger modeling capability and more realistic rendering. For speech-driven animation, NeRF-based head animation overcomes the limitations of both two-dimensional talking-head generation and three-dimensional digital head animation, producing more natural and realistic talking-head videos. For body animation, Transformers have gradually replaced RNNs, while denoising diffusion models show potential for addressing the mapping challenges of multimodal body animation. Digital human representation construction remains difficult: guidance by visual-language similarity and by denoising diffusion models yields promising results, but directly constructing three-dimensional multimodal virtual humans is still hard because sufficiently large three-dimensional virtual human datasets are lacking. The article analyzes these issues comprehensively, clarifying directions and challenges for future research.

Finally, the article anticipates future developments in multimodal digital humans. Key directions include improving the accuracy of 3D modeling and real-time rendering, integrating speech-driven animation with facial expression synthesis, building larger and more diverse datasets, exploring multimodal information fusion and cross-modal learning, and addressing ethical and social impacts. Implicit representation methods, such as neural volume rendering, are crucial for improved 3D modeling, while constructing larger datasets remains a major challenge for developing robust and general generative models. Multimodal information fusion and cross-modal learning allow models to learn from diverse data sources and to exhibit a wider range of behaviors and expressions. Attention to ethical and social impacts, including digital identity and privacy, is equally important. These research directions will guide the field toward a more comprehensive, realistic, and general future, profoundly influencing interaction in virtual spaces.
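To make the visual-language similarity guidance above concrete (again a hedged sketch rather than the procedure of any specific surveyed paper; clip_guidance_loss and its arguments are illustrative), the loss below pulls differentiably rendered avatar views toward a text prompt in a CLIP-style joint embedding space, assuming an encoder that exposes encode_image and encode_text as OpenAI's CLIP does:

    import torch
    import torch.nn.functional as F

    def clip_guidance_loss(clip_model, rendered_views, text_tokens):
        """Cosine-similarity loss in a CLIP-style joint image-text embedding space.

        clip_model is assumed to expose encode_image / encode_text (as OpenAI's CLIP does);
        rendered_views are differentiably rendered avatar images of shape (B, 3, H, W),
        already normalized the way the chosen CLIP model expects.
        """
        views = F.interpolate(rendered_views, size=224, mode="bilinear", align_corners=False)
        img_feat = F.normalize(clip_model.encode_image(views), dim=-1)
        txt_feat = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
        # Gradients flow back through the differentiable renderer, pulling the avatar's
        # geometry and appearance parameters toward the text prompt.
        return 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()

Because the renderer is differentiable, minimizing this loss updates the avatar's underlying parameters rather than the image pixels directly, which is what makes text-guided avatar construction possible.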
Keywords
