Multimodal information guided 3D digital human motion generation techniques

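赵宝全1, 付一愉1, 苏卓2, 王若梅2, 吕辰雷3, 罗笑南4 (1. School of Artificial Intelligence, Sun Yat-sen University; 2. School of Computer Science, Sun Yat-sen University; 3. College of Computer Science and Software Engineering, Shenzhen University; 4. School of Computer and Information Security, Guilin University of Electronic Science and Technology)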

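Abstract
3D digital human motion generation based on multimodal information aims to generate human motion under specific input conditions from data such as text, audio, images, and video. The technology has important application value and broad economic and social benefits in film, animation, game production, the metaverse, and related fields, and has been one of the research hotspots in computer graphics and computer vision in recent years. However, it faces many challenges, including the difficulty of representing and fusing cross-modal information, the lack of high-quality datasets, the poor quality of generated motion (such as jitter, penetration, and foot sliding), and low generation efficiency. Although researchers have proposed a variety of solutions to these challenges in recent years, how to achieve efficient, high-quality 3D digital human motion generation according to the characteristics of different modalities remains an open problem. Taking the model architecture used for motion generation as the classification criterion, this paper divides existing mainstream methods into those based on generative adversarial networks (GAN), autoencoders (AE), variational autoencoders (VAE), and diffusion models, and then summarizes them into a general framework for digital human motion generation. The paper also introduces the parametric human body models, datasets, and evaluation metrics commonly used in this field. For a number of representative works, comparative experiments are conducted on commonly used datasets to evaluate their performance. Finally, drawing on existing datasets, algorithms, and representative studies, the paper summarizes the solved and unsolved problems and challenges in the field and discusses potential research directions, including improving datasets, optimizing motion quality and diversity, fusing cross-modal information, and improving generation efficiency.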
Keywords
A survey on multi-modal information guided 3D human motion generation

(School of Computer and Information Security, Guilin University of Electronic Science and Technology)

Abstract
3D digital human motion generation guided by multimodal information aims to generate human motion under specific input conditions from data such as text, audio, images, and video. This technology has a wide spectrum of applications and extensive economic and social benefits in film, animation, game production, the metaverse, and related fields, and has been one of the research hotspots in computer graphics and computer vision in recent years. However, the task faces many challenges, including the difficulty of representing and fusing multimodal information, the lack of high-quality datasets, the poor quality of generated motion (such as jitter, penetration, and foot sliding), and low generation efficiency. Although researchers have proposed various solutions to address these challenges in recent years, how to achieve efficient and high-quality 3D digital human motion generation according to the characteristics of different modalities remains an open problem. In this paper, we present a comprehensive review of 3D digital human motion generation and elaborate on recent advances in this area from the perspectives of parametrized 3D human models, human motion representation, motion generation techniques, motion analysis and editing, existing human motion datasets and evaluation metrics, and beyond. Parametrized human models facilitate digital human modeling and motion generation by providing parameters associated with body shapes and postures and have served as a key pillar of current digital human research and applications. This survey begins with an introduction to widely used parametrized 3D human body models, including SCAPE, SMPL-X, SMPL, and SMPL-H, and compares them in detail regarding model representations and the parameters used to control body shapes, poses, and facial expressions. Human motion representation is one of the core issues in digital human motion generation.
The paper highlights the musculoskeletal model and classic skinning algorithms, including linear blend skinning and dual quaternion skinning, and their use in physics-based and data-driven methods to control human movements. We have also carried out an extensive study of existing multimodal information guided human motion generation approaches and categorized them into four major branches, i.e., generative adversarial network-based methods, autoencoder-based methods, variational autoencoder-based methods, and diffusion model-based methods. Other work, such as generative motion matching, is also discussed and compared with data-driven methods. The survey summarizes existing human motion generation schemes from the perspectives of both methods and model architectures and presents a unified framework for digital human motion generation. That is, a motion encoder extracts motion features from the original motion sequence; these features are then fused with the conditional features extracted by the conditional encoder into latent variables, or mapped to the latent space, from which generative adversarial networks, autoencoders, variational autoencoders, or diffusion models can learn to generate qualified human movements through a motion decoder. In addition, this paper surveys current work in digital human motion analysis and editing, including motion clustering, motion prediction, motion in-betweening, and motion in-filling. A high-quality dataset is essential to data-driven human motion generation and evaluation. We have collected publicly available human motion databases and classified them into different types according to two criteria. From the perspective of data type, existing databases can be classified into motion capture datasets and video reconstruction datasets.
Motion capture datasets rely on devices such as motion capture systems, cameras, and inertial measurement units to obtain real human movement data (i.e., ground truth), while video reconstruction datasets reconstruct a 3D human body model by estimating body joints from motion videos and fitting them to a parametric human body model. From the perspective of task type, commonly used databases can be classified into text-motion datasets, action-motion datasets, and audio-motion datasets. These are usually new datasets obtained by processing motion capture datasets and video reconstruction datasets according to specific tasks. This paper also gives a comprehensive overview of the evaluation metrics for 3D human motion generation, including motion quality, motion diversity and multimodality, consistency between inputs and outputs, and inference efficiency. Apart from these objective evaluation metrics, user studies have also been employed to assess the quality of generated human motion and are therefore discussed in this paper. To compare the performance of different digital human motion generation methods on public datasets, we have selected a collection of the most representative work and carried out extensive experiments for comprehensive evaluation. Finally, we summarize the well-addressed and underexplored problems and challenges in this field and discuss several potential further research directions regarding datasets, the quality and diversity of generated motions, cross-modal information fusion, and generation efficiency. Specifically, existing datasets generally fall short of expectations concerning motion diversity, the descriptions associated with motions, data distribution, and the length of motion sequences. A large-scale 3D human motion database is expected to be developed in future work to boost the efficacy and robustness of motion generation models.
Besides, the quality of generated human motions, especially those with complex movement patterns, is still far from satisfactory. Integrating physical constraints and postprocessing into human motion generation frameworks is a promising way to tackle these issues. In addition, although existing human motion generation methods can generate various motion sequences from multimodal information such as text, audio, music, actions, and keyframes, work on cross-modal human motion generation (e.g., generating a motion from both a text description and a piece of background music) has scarcely been reported in the literature. Such a task is worth investigating in further studies to unlock new opportunities in this area. In terms of the diversity of generated content, some researchers have explored harvesting rich, diverse, and stylized motions using variational autoencoders, diffusion models, and contrastive language-image pre-training neural networks. However, current studies mainly focus on the motion generation of a single human represented with an SMPL-like naked parameterized 3D model, while the generation and interaction of multiple dressed humans have huge untapped application potential but have not yet received sufficient attention. Last but not least, how to boost motion generation efficiency and achieve a good balance between quality and inference overhead is a non-negligible issue. Possible solutions include lightweight parameterized human models, information-intensive training datasets, and improved or more advanced generative frameworks.
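
The unified framework summarized in the abstract (a motion encoder, a conditional encoder, fusion into a latent variable, and a motion decoder) can be illustrated with a minimal Python sketch. It follows only the variational autoencoder branch and uses untrained random linear maps in place of real networks, so the output is structurally valid but meaningless as motion. All names and dimensions (`make_linear`, `generate_motion`, `FEAT_DIM`, `LATENT_DIM`) are hypothetical choices for illustration and are not specified by the paper.

```python
import math
import random

# Hypothetical sizes chosen only for this illustration.
FEAT_DIM = 5    # size of the motion and condition feature vectors
LATENT_DIM = 3  # size of the latent variable z


def make_linear(rng, n_in, n_out):
    """Return an untrained random linear map; a stand-in for a real network."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    def apply(x):
        assert len(x) == n_in
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


def generate_motion(motion, condition, seed=0):
    """Run one VAE-style pass: encode, fuse, sample a latent, decode.

    motion:    list of frames, each a list of pose parameters
    condition: feature vector standing in for encoded text/audio/etc.
    """
    rng = random.Random(seed)
    pose_dim = len(motion[0])
    flat = [v for frame in motion for v in frame]

    # Placeholder "networks" (assumption: real systems use trained models).
    motion_encoder = make_linear(rng, len(flat), FEAT_DIM)
    condition_encoder = make_linear(rng, len(condition), FEAT_DIM)
    to_mu = make_linear(rng, 2 * FEAT_DIM, LATENT_DIM)
    to_logvar = make_linear(rng, 2 * FEAT_DIM, LATENT_DIM)
    motion_decoder = make_linear(rng, LATENT_DIM + FEAT_DIM, len(flat))

    # 1. Extract motion features and conditional features.
    motion_feat = motion_encoder(flat)
    cond_feat = condition_encoder(condition)

    # 2. Fuse both feature vectors (here: concatenation) into latent statistics.
    fused = motion_feat + cond_feat
    mu, logvar = to_mu(fused), to_logvar(fused)

    # 3. Sample the latent variable with the reparameterization trick.
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

    # 4. Decode the latent variable, conditioned on the condition features.
    out = motion_decoder(z + cond_feat)
    return [out[i:i + pose_dim] for i in range(0, len(out), pose_dim)]


if __name__ == "__main__":
    toy_motion = [[0.1 * (t + j) for j in range(6)] for t in range(8)]
    toy_condition = [1.0, 0.0, 0.5, -0.5]
    generated = generate_motion(toy_motion, toy_condition)
    print(len(generated), "frames of", len(generated[0]), "pose parameters")
```

Concatenation is used for fusion only because it is the simplest option. The GAN, autoencoder, and diffusion branches named in the abstract would replace steps 2 and 3 with their own generative mechanism while keeping the encoder and decoder roles.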
Keywords
