

Abstract
Objective Existing methods for generating dynamic three-dimensional (3D) digital human models suffer from problems such as unchangeable body shape and fixed, single motions. To address these problems, a method is proposed that fuses a variational auto-encoder (VAE) network, a contrastive language-image pretraining (CLIP) network, and a gate recurrent unit (GRU) network to generate moving 3D human models. The method can generate 3D human models whose body shape and actions match a given textual description. Method First, the VAE encoder network generates latent codes, and the CLIP network is used for zero-shot generation of a human model whose body shape matches the text, which solves the problem that unreasonable skinned multi-person linear (SMPL) model parameters produce human models that do not conform to normal body-shape characteristics. Second, the VAE and GRU networks generate variable-length 3D human pose sequences that match the text, which addresses the limitation that existing motion-generation methods can only produce pre-specified pose sequences and cannot generate pose sequences of different durations. Finally, the body-shape features and motion features are combined to obtain a moving 3D human model. Result Human model generation experiments were conducted on the HumanML3D dataset and compared with three other methods. Relative to the best existing method, the Top-1, Top-2, and Top-3 R-precision improved by 0.031, 0.034, and 0.028, respectively; the Fréchet inception distance (FID) improved by 0.094; and the diversity improved by 0.065. Ablation experiments verified the effectiveness of the model and showed that the proposed method improves the quality of the generated human models. Conclusion The proposed method can generate moving 3D human models from textual descriptions, and the body shapes and actions of the generated models better conform to the input text.
Incorporating variational auto-encoder networks for text-driven generation of 3D motion human body

Li Jian1, Yang Jun1, Wang Liyan2, Wang Yonggui1 (1. School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, China; 2. School of Arts and Sciences, Shaanxi University of Science and Technology, Xi'an 710021, China)

Objective Artificial intelligence generated content (AIGC) technology can reduce the workload of three-dimensional (3D) modeling when applied to generate virtual 3D scene models from natural language. For static 3D objects, methods have emerged that generate high-precision 3D models matching a given textual description. By contrast, for dynamic digital human models, which are also in high demand in numerous scenarios, only two-dimensional (2D) human images or sequences of human poses can currently be generated from a given textual description; dynamic 3D human models cannot be generated from natural language in the same way. Moreover, existing methods suffer from problems such as immutable body shape and motion when generating dynamic digital human models. To address these problems, a method is proposed that fuses a variational auto-encoder (VAE), contrastive language-image pretraining (CLIP), and a gate recurrent unit (GRU) to generate dynamic 3D human models whose shapes and motions correspond to the textual description. Method A method based on the VAE network is proposed in this paper to generate dynamic 3D human models that correspond to the body shape and action information described in the text. Notably, a variety of pose sequences with variable duration can be generated with the proposed method. First, the body-shape information is obtained through the body-shape generation module based on the VAE network and the CLIP model, which generates, in a zero-shot manner, a skinned multi-person linear (SMPL) parametric human model that matches the textual description. Specifically, the VAE network encodes the body shape of the SMPL model, the CLIP model matches the textual descriptions against body shapes, and the 3D human model with the highest matching score is selected.
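The zero-shot selection step described above can be sketched as follows. This is a minimal illustration with stand-in functions: `decode_shape` and `clip_score` are hypothetical placeholders for the trained VAE decoder and for CLIP scoring of a rendered SMPL mesh against the text, which the paper does not specify in code form; only the candidate-ranking logic is shown.

```python
# Sketch of zero-shot body-shape selection: sample VAE latents, decode each
# to SMPL shape parameters, and keep the candidate that CLIP scores highest
# against the text. All helpers below are illustrative stand-ins.
import random

def decode_shape(latent):
    """Stand-in for the VAE decoder: maps a latent code to 10 SMPL shape
    parameters (betas). The real decoder is a trained neural network."""
    rng = random.Random(latent)
    return [rng.uniform(-2.0, 2.0) for _ in range(10)]

def clip_score(betas, text):
    """Stand-in for CLIP matching: the real pipeline renders the SMPL mesh
    for `betas` and scores the image against `text` with CLIP's image and
    text encoders. Here a fake score keeps the sketch runnable."""
    target = 2.0 if "tall" in text else -2.0
    # Higher score when the first beta is near the fake target value.
    return -abs(betas[0] - target)

def select_body_shape(text, n_candidates=50):
    """Rank candidate shapes and return the best-matching SMPL betas."""
    best_betas, best_score = None, float("-inf")
    for latent in range(n_candidates):
        betas = decode_shape(latent)
        score = clip_score(betas, text)
        if score > best_score:
            best_betas, best_score = betas, score
    return best_betas

betas = select_body_shape("a tall person")
print(len(betas))  # 10 SMPL shape parameters
```

In the actual method the ranking would be over shapes decoded from the learned VAE latent space, so every candidate already lies on the manifold of plausible body shapes, which is what prevents ill-formed SMPL parameters.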
Second, variable-length 3D human pose sequences that match the textual description are generated through the body-action generation module based on the VAE and GRU networks. Specifically, the VAE auto-encoder encodes the dynamic human poses, and the action-length sampling network predicts a duration that matches the textual description of the action. The GRU and VAE networks encode the input text, and the decoder generates diverse dynamic 3D human pose sequences. Finally, a dynamic 3D human model corresponding to the described body shape and action is generated by fusing the body-shape and action information generated above. The performance of the method is evaluated on the HumanML3D dataset, which comprises 14 616 motions and 44 970 linguistic annotations. Before training, some motions in the dataset are mirrored and some words in the motion descriptions are replaced (e.g., "left" is changed to "right") to expand the dataset. In the experiments, the HumanML3D dataset is divided into training, testing, and validation sets in the ratios of 80%, 15%, and 5%, respectively. The experiments are conducted in an Ubuntu 18.04 environment on a Tesla V100 GPU with 16 GB of video memory. The motion auto-encoder is trained with the adaptive moment estimation (Adam) optimizer for 300 epochs with a learning rate of 0.000 1 and a batch size of 128. The motion generator is trained with the Adam optimizer for 320 epochs with a learning rate of 0.000 2 and a batch size of 32, and the motion-length network for 200 epochs with a learning rate of 0.000 1 and a batch size of 64. Result Dynamic 3D human model generation experiments were conducted on the HumanML3D dataset. Compared with the best of three other state-of-the-art methods, the proposed method improves the Top-1, Top-2, and Top-3 R-precision by 0.031, 0.034, and 0.028, respectively, the Fréchet inception distance (FID) by 0.094, and the diversity by 0.065. The qualitative evaluation was divided into three parts: body-shape feature generation, action feature generation, and dynamic 3D human model generation including body-shape features. The body-shape generation part was tested with different textual descriptions (e.g., tall, short, fat, thin). For the action generation part, the same textual descriptions were fed to the proposed method and to other methods for comparison. By combining the body-shape features with the action features, the generation of dynamic 3D human models with distinct body-shape characteristics is demonstrated. In addition, ablation experiments, including comparisons of different methods with different loss functions, further demonstrate the effectiveness of the method; the final experimental results show that the proposed method improves the quality of the generated models. Conclusion This paper presents a method for generating dynamic 3D human models that conform to textual descriptions by fusing body-shape and action information. The body-shape generation module generates SMPL parametric human models whose body shape conforms to the textual description, while the action generation module generates variable-length 3D human pose sequences that match the textual description. Experimental results show that the proposed method can effectively generate dynamic 3D human models that conform to textual descriptions, with diverse body shapes and motions, and that its performance on the HumanML3D dataset surpasses other state-of-the-art algorithms.
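The variable-length generation idea, in which a length-sampling network first predicts a duration and an autoregressive decoder then emits that many pose frames, can be sketched as below. Every function here is a hypothetical stand-in (the real components are trained GRU/VAE networks, and the paper does not publish this code); only the control flow of duration sampling followed by step-by-step decoding is illustrated.

```python
# Sketch of variable-length pose-sequence generation: predict a duration
# from the text, sample a latent from the VAE prior, then decode one pose
# frame per GRU step. All helpers below are illustrative stand-ins.
import math
import random

POSE_DIM = 72  # SMPL pose: 24 joints x 3 axis-angle parameters per joint

def sample_motion_length(text, rng):
    """Stand-in for the action-length sampling network: the real model
    predicts a distribution over durations conditioned on the text."""
    base = 60 if "walk" in text else 30  # frames (e.g., at 20 fps)
    return base + rng.randrange(0, 20)   # sampled, so lengths vary

def gru_step(hidden, z, t):
    """Stand-in for one GRU decoder step: mixes the hidden state with the
    latent code z and the time step t, then emits one pose frame."""
    hidden = [math.tanh(0.5 * h + z + 0.01 * t) for h in hidden]
    return hidden, list(hidden)

def generate_pose_sequence(text, seed=0):
    rng = random.Random(seed)
    length = sample_motion_length(text, rng)
    z = rng.uniform(-1.0, 1.0)           # latent sampled from the VAE prior
    hidden = [0.0] * POSE_DIM
    frames = []
    for t in range(length):
        hidden, pose = gru_step(hidden, z, t)
        frames.append(pose)
    return frames

frames = generate_pose_sequence("a person walks forward")
print(len(frames), len(frames[0]))  # variable frame count, 72 params per frame
```

Because the duration is sampled rather than fixed, repeated runs with different seeds (or different texts) yield pose sequences of different lengths, which is the behavior the action generation module is designed to provide.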