Text driven 3D human motion diffusion generation based on local generation and global fusion
Pages: 1-14 (2026)
Received: 01 December 2025
Revised: 28 April 2026
Accepted: 07 May 2026
Online First: 07 May 2026
DOI: 10.11834/jig.250606

Objective
Generating 3D human motion from text prompts is a frontier research direction in multimodal generation. Although considerable progress has been made, existing methods remain limited in semantic alignment accuracy, local motion control, and global coordination, making integrated generation from text to high-fidelity 3D assets difficult. To address these problems, this paper proposes a cascaded diffusion generation framework based on local generation and global fusion.
Method
First, a large language model automatically decouples the input text into independent semantic descriptions for six body parts, covering the head, limbs, and torso. Second, six parallel, gradient-isolated local diffusion encoders generate motion features for each part independently. Third, a global fusion network merges the local features into a biomechanically plausible full-body pose, which is decoded into an SMPL (skinned multi-person linear model) parametric mesh. Finally, the SMPL mesh is converted into a 3D Gaussian representation, a 2D diffusion model is introduced as a visual prior, and its appearance details are optimized via score distillation sampling, achieving integrated generation from text to a real-time renderable 3D human.
Result
Comparative experiments are conducted on the HumanML3D (3D human motion-language dataset) and KIT-ML (KIT motion-language dataset) datasets, and the results of the proposed method and the baselines are evaluated with two metrics: FID (Fréchet inception distance) and CLIP-S (CLIP similarity). Compared with the baselines, the proposed method improves both generation quality and motion accuracy, and ablation studies verify the effectiveness of the design.
Conclusion
The proposed method effectively improves the detail expressiveness, diversity, and text-semantic consistency of generated human motions, providing an efficient and scalable technical solution for 3D human motion generation.
Objective
Text-driven 3D human motion generation has emerged as a frontier research direction in multimodal content creation, holding great promise for applications in virtual reality, film production, and the metaverse. Despite significant progress, existing methods still face fundamental challenges in three aspects: precise semantic alignment between natural language descriptions and generated motions, fine-grained control over individual body parts, and global coordination that respects biomechanical constraints. Consequently, current solutions often suffer from semantic leakage, unnatural postures, and limited expressiveness. Moreover, most approaches either focus solely on motion synthesis without producing complete 3D assets, or generate static avatars without dynamic pose control. To address these limitations, we propose a novel cascaded diffusion framework that follows a “local-to-global, structure-to-appearance” generation pipeline, enabling end-to-end synthesis from raw text to high-fidelity, real-time renderable 3D human models with precise motion control.
Method
Our framework consists of four key stages, each designed to address a specific aspect of the text-to-3D human generation problem. First, a semantic decoupling module leverages a large language model (GPT-4) to automatically parse the input text into independent action descriptions for six anatomical body parts: head, left arm, right arm, torso, left leg, and right leg. This decomposition converts a global motion description into a set of part-specific textual instructions, explicitly separating semantics across different body regions. For body parts not mentioned in the original text, the parser assigns a “do nothing” instruction, preventing unintended movements. This step is crucial because it transforms a loosely coupled global description into a structured, machine-readable format that guides subsequent generation. Second, we construct a local motion generation module composed of six parallel diffusion-based encoders, each conditioned on its corresponding part description. These encoders operate with gradient isolation, meaning that the training and inference processes for different body parts do not share gradients. This design fundamentally prevents semantic leakage—a common issue in prior work where an action described for one body part inadvertently affects others. Each encoder adopts a transformer-based denoising network. Starting from pure Gaussian noise, the network iteratively refines a latent code guided by the corresponding part text embedding produced by a pre-trained TMR encoder. The resulting latent representation captures the fine-grained motion characteristics of that specific body part, such as the trajectory, speed, and joint angles. Importantly, because the six encoders are independent, they can be trained in parallel on part-specific motion data extracted from full-body motion capture datasets. Third, a global motion fusion module integrates the six independent part latents into a coherent full-body pose. 
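The semantic decoupling stage above can be made concrete with a sketch of its output format. The real system queries GPT-4; here a toy keyword matcher stands in so the six-part instruction structure is visible. The function name `decouple_text` and the keyword table are illustrative assumptions, not the paper's actual prompt or parser.

```python
# Sketch of the part-level instruction format produced by semantic
# decoupling. A toy keyword table replaces the GPT-4 call; both the
# table and the phrasing of the instructions are hypothetical.

BODY_PARTS = ["head", "left arm", "right arm", "torso", "left leg", "right leg"]

# Hypothetical mapping from action words to the body parts they affect.
KEYWORDS = {
    "wave": ["right arm"],
    "kick": ["right leg", "torso"],   # torso leans back to keep balance
    "nod": ["head"],
    "walk": ["left leg", "right leg", "left arm", "right arm", "torso"],
}

def decouple_text(prompt: str) -> dict:
    """Return one instruction per body part; unmentioned parts get an
    explicit 'do nothing' so their local encoders generate no motion."""
    parts = {p: "do nothing" for p in BODY_PARTS}
    for word, affected in KEYWORDS.items():
        if word in prompt.lower():
            for p in affected:
                parts[p] = f"{word} ({prompt.strip()})"
    return parts

instructions = decouple_text("A person kicks forward")
print(instructions["left arm"])   # unmentioned -> "do nothing"
```

Because each of the six instructions conditions its own gradient-isolated encoder, the explicit "do nothing" entries are what prevent an action described for one part from leaking into the others.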
Simple concatenation of part latents would ignore the biomechanical dependencies between body regions. To address this, we employ a lightweight feed-forward network with GELU (Gaussian error linear unit) activation, augmented by the global semantic feature of the complete text. This network learns to enforce biomechanical constraints such as torso leaning backward during a forward kick, natural arm-leg coordination during walking, and maintaining overall balance. The fused latent is then decoded into SMPL parameters, producing a parametric human mesh that respects human skeletal kinematics. Fourth, for appearance enhancement and efficient rendering, we convert the SMPL mesh into a set of 3D Gaussians—a modern explicit representation that supports real-time differentiable rasterization. Each Gaussian is defined by its position, covariance matrix, opacity, and spherical harmonic coefficients for color. To enrich geometric and textural details beyond the smooth SMPL mesh, we adopt a state-of-the-art 2D diffusion model (Flux) as a powerful visual prior. Through SDS (score distillation sampling), gradients from the 2D diffusion model are backpropagated to iteratively optimize the attributes of the 3D Gaussians while keeping their positions fixed to preserve the generated motion. This optimization runs for 4000 iterations, refining details such as skin texture, clothing wrinkles, and lighting effects. The final output is a fully textured 3D human model that can be rendered in real time without any post-processing.
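The fusion step can be sketched in miniature: six part latents and the global text feature are concatenated and passed through one linear layer with GELU activation. All dimensions and weights below are toy values, and the paper's subsequent decoding to SMPL parameters is omitted; this only illustrates the shape of the computation, not the trained network.

```python
import math, random

# Minimal sketch of the global fusion step: six part latents plus the
# global text feature are concatenated and fed through a feed-forward
# layer with GELU activation. Sizes and weights are illustrative.

def gelu(x: float) -> float:
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def fuse(part_latents, global_feat, weights, bias):
    """Concatenate [6 x d_part] latents with the global text feature
    and apply one linear + GELU layer (stand-in for the fusion MLP)."""
    x = [v for latent in part_latents for v in latent] + list(global_feat)
    out = []
    for row, b in zip(weights, bias):
        s = sum(wi * xi for wi, xi in zip(row, x)) + b
        out.append(gelu(s))
    return out

random.seed(0)
d_part, d_text, d_out = 4, 8, 16
n_in = 6 * d_part + d_text
parts = [[random.gauss(0, 1) for _ in range(d_part)] for _ in range(6)]
text = [random.gauss(0, 1) for _ in range(d_text)]
W = [[random.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
     for _ in range(d_out)]
b = [0.0] * d_out
fused = fuse(parts, text, W, b)
print(len(fused))  # 16
```

Conditioning the fusion layer on the global text feature, rather than on the part latents alone, is what lets the network learn cross-part constraints such as the torso lean accompanying a kick.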
Result
We conduct extensive experiments on two standard benchmarks, HumanML3D and KIT-ML, and compare our method against representative baselines including MotionDiffuse, MDM, MLD, DreamFusion, GaussianDreamer, and others. For quantitative evaluation, we employ multiple metrics. FID (Fréchet inception distance) measures the realism and diversity of generated motion sequences. CLIP-S (CLIP similarity) evaluates semantic alignment between rendered multi-view images and input text. Additionally, we introduce Part-FID (part-level Fréchet inception distance), which computes FID separately for each of the six body parts using dedicated feature extractors, providing a fine-grained assessment of local motion quality. Experimental results demonstrate that our method achieves an FID of 0.429, outperforming MotionDiffuse (0.687) and MDM (0.747). In terms of CLIP-S, our method attains 29.41 (ViT-L/14) and 44.39 (ViT-bigG-14), surpassing GaussianDreamer (27.23 and 41.88) and other text-to-3D baselines. The proposed Part-FID yields an average score of 1.26, which is 18.7% better than MotionDiffuse, with the most significant improvement observed on the torso, validating the effectiveness of our global fusion module in enforcing biomechanical coordination. Ablation studies further confirm the contribution of each component: removing gradient isolation increases semantic leakage. Efficiency analysis shows that our method takes approximately 20 minutes for end-to-end generation, and the final 3D Gaussian representation enables real-time rendering at 24 frames per second, which is two orders of magnitude faster than NeRF-based renderers.
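The FID and Part-FID scores above compare Gaussian statistics of feature embeddings. As a minimal sketch, the Fréchet distance between two Gaussians with diagonal covariance is shown below (the full metric uses the matrix square root of the covariance product; the diagonal case keeps the sketch in pure Python, and the feature values are made up for illustration).

```python
import math

# Fréchet distance between two diagonal-covariance Gaussians:
#   ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
# For diagonal S the trace term reduces to a per-dimension sum.

def frechet_distance(mu1, var1, mu2, var2):
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Identical distributions give distance 0; Part-FID applies the same
# formula per body part, with part-specific feature statistics.
d0 = frechet_distance([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
d1 = frechet_distance([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [4.0, 1.0])
print(d0, d1)  # 0.0 and 1 + (1 + 4 - 2*2) = 2.0
```

Lower is better for both FID variants, which is why the 0.429 full-body score and the 1.26 average Part-FID indicate improvements over the baselines.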
Conclusion
We present a comprehensive framework for text-driven 3D human motion generation that uniquely combines local motion generation with global fusion, supported by efficient 3D Gaussian splatting and a powerful 2D diffusion prior. The method achieves superior performance in motion realism, part-level control accuracy, semantic alignment, and rendering efficiency. It provides an end-to-end solution from natural language to high-quality, real-time renderable 3D human assets, opening new possibilities for interactive virtual human applications. Future work will focus on extending the framework to generate long-sequence motions with temporal consistency and incorporating multimodal control signals.
Ahuja C and Morency L P. 2019. Language2Pose: natural language grounded pose forecasting//2019 International Conference on 3D Vision (3DV). IEEE: 719-728 [DOI: 10.1109/3DV.2019.00084]
Athanasiou N, Petrovich M, Black M J and Varol G. 2022. TEACH: temporal action composition for 3D humans//2022 International Conference on 3D Vision (3DV). IEEE Computer Society: 414-423 [DOI: 10.1109/3DV57658.2022.00053]
Barron J T, Mildenhall B, Tancik M, Hedman P, Martin-Brualla R and Srinivasan P P. 2021. Mip-NeRF: a multiscale representation for anti-aliasing neural radiance fields//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 5835-5844 [DOI: 10.1109/ICCV48922.2021.00580]
Black Forest Labs, Batifol S, Blattmann A, Boesel F, Consul S, Diagne C, et al. 2025. FLUX.1 Kontext: flow matching for in-context image generation and editing in latent space [EB/OL]. [2025-06-24]. https://arxiv.org/pdf/2506.15742
Cao Y, Cao Y, Han K, Shan Y and Wong K K. 2024. DreamAvatar: text-and-shape guided 3D human avatar generation via diffusion models//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 958-968
Chen R, Chen Y, Jiao N and Jia K. 2023. Fantasia3D: disentangling geometry and appearance for high-quality text-to-3D content creation//2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: 22189-22199 [DOI: 10.1109/ICCV51070.2023.02033]
Chen X, Jiang B, Liu W, Huang Z, Fu B, Chen T, et al. 2023a. Executing your commands via motion diffusion in latent space//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 18000-18010 [DOI: 10.1109/CVPR52729.2023.01726]
Dabral R, Mughal M H, Golyanik V and Theobalt C. 2023. MoFusion: a framework for denoising-diffusion-based motion synthesis//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 9760-9770 [DOI: 10.1109/CVPR52729.2023.00941]
Ghosh A, Cheema M, Oguz C, Theobalt C and Slusallek P. 2021. Synthesis of compositional animations from textual descriptions//Proceedings of the IEEE/CVF International Conference on Computer Vision: 1396-1406
Gupta A, Xiong W, Nie Y, Jones I and Oğuz B. 2023. 3DGen: triplane latent diffusion for textured mesh generation [EB/OL]. [2023-05-09]. https://arxiv.org/pdf/2303.05371
Guo C, Zou S, Zuo X, Wang S, Ji W, Li X, et al. 2022. Generating diverse and natural 3D human motions from text//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 5152-5161
Guo Y, Liu Y, Shao R, Laforte C, Voleti V, Luo G, et al. 2023. threestudio: a unified framework for 3D content generation [EB/OL]. https://github.com/threestudio-project/threestudio
Yang H, Chen R, An S, Wei H and Zhang H. 2023. The growth of image-related three dimensional reconstruction techniques in deep learning-driven era: a critical summary. Journal of Image and Graphics, 28(08): 2396-2409 [DOI: 10.11834/jig.220376]
Huang Y, Wang J, Zeng A, Cao H, Qi X, Shi Y, et al. 2023. DreamWaltz: make a scene with complex 3D animatable avatars. Advances in Neural Information Processing Systems, 36: 4566-4584
Hu S, Hong F, Hu T, Pan L, Mei H, Xiao W, Yang L and Liu Z. 2023. HumanLiff: layer-wise 3D human generation with diffusion model [EB/OL]. [2023-08-18]. https://arxiv.org/pdf/2308.09712
Gao J, Shen T, Wang Z, Chen W, Yin K, Li D, et al. 2022. GET3D: a generative model of high quality 3D textured shapes learned from images. Advances in Neural Information Processing Systems, 35: 31841-31854 [DOI: 10.48550/arXiv.2209.11163]
Jun H and Nichol A. 2023. Shap-E: generating conditional 3D implicit functions [EB/OL]. [2023-05-03]. https://arxiv.org/pdf/2305.02747
Kerbl B, Kopanas G, Leimkühler T and Drettakis G. 2023. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4): 1-14 [DOI: 10.1145/3592433]
Li J, Tan H, Zhang K, Xu Z, Luan F, Xu Y, Hong Y, et al. 2023. Instant3D: fast text-to-3D with sparse-view generation and large reconstruction model//The Twelfth International Conference on Learning Representations
Li W, Chen R, Chen X and Tan P. 2023. SweetDreamer: aligning geometric priors in 2D diffusion for consistent text-to-3D [EB/OL]. [2023-10-20]. https://arxiv.org/pdf/2310.02596
Loper M, Mahmood N, Romero J, Pons-Moll G and Black M J. 2023. SMPL: a skinned multi-person linear model//Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (1st ed.). Association for Computing Machinery, New York, NY, USA, Article 88: 851-866 [DOI: 10.1145/3596711.3596800]
Mildenhall B, Srinivasan P P, Tancik M, Barron J T, Ramamoorthi R and Ng R. 2022. NeRF: representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1): 99-106 [DOI: 10.1145/3503250]
Feng M, Shen J, Wu Z, Peng W, Zhong H, Guo Y, et al. 2025. Advancements in 3D vision understanding using multimodal large language models. Journal of Image and Graphics, 30(6): 1744-1791 [DOI: 10.11834/jig.240588]
Müller T, Evans A, Schied C and Keller A. 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4), Article 102: 1-15 [DOI: 10.1145/3528223.3530127]
Nichol A, Jun H, Dhariwal P, Mishkin P and Chen M. 2022. Point-E: a system for generating 3D point clouds from complex prompts [EB/OL]. [2022-12-16]. https://arxiv.org/pdf/2212.08751
Petrovich M, Black M J and Varol G. 2022. TEMOS: generating diverse human motions from textual descriptions//Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXII. Springer-Verlag, Berlin, Heidelberg: 480-497 [DOI: 10.1007/978-3-031-20047-2_28]
Petrovich M, Black M J and Varol G. 2023. TMR: text-to-motion retrieval using contrastive 3D human motion synthesis//International Conference on Computer Vision. Paris, France [DOI: 10.1109/ICCV51070.2023.00870]
Plappert M, Mandery C and Asfour T. 2016. The KIT Motion-Language Dataset. Big Data, 4(4): 236-252 [DOI: 10.1089/big.2016.0028]
Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, Müller J, et al. 2023. SDXL: improving latent diffusion models for high-resolution image synthesis//The Twelfth International Conference on Learning Representations
Poole B, Jain A, Barron J T and Mildenhall B. 2022. DreamFusion: text-to-3D using 2D diffusion [EB/OL]. [2022-09-29]. https://arxiv.org/pdf/2209.14988
Raab S, Leibovitch I, Tevet G, Arar M, Bermano A H and Cohen-Or D. 2023. Single motion diffusion [EB/OL]. [2023-06-13]. https://doi.org/10.48550/arXiv.2302.05905
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. 2021. Learning transferable visual models from natural language supervision//International Conference on Machine Learning. PMLR: 8748-8763
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2022. High-resolution image synthesis with latent diffusion models//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE: 10674-10685 [DOI: 10.1109/CVPR52688.2022.01042]
Sun H, Zheng R, Huang G, Ma C, Huang H and Hu R. 2024. LGTM: local-to-global text-driven human motion diffusion model//ACM SIGGRAPH 2024 Conference Papers (SIGGRAPH '24). Association for Computing Machinery, New York, NY, USA, Article 66: 1-9 [DOI: 10.1145/3641519.3657422]
Tevet G, Raab S, Gordon B, Shafir Y, Cohen-Or D and Bermano A H. 2022. Human motion diffusion model [EB/OL]. [2022-09-29]. https://arxiv.org/pdf/2209.14916
Tevet G, Gordon B, Hertz A, Bermano A H and Cohen-Or D. 2022a. MotionCLIP: exposing human motion generation to CLIP space//European Conference on Computer Vision. Cham: Springer Nature Switzerland: 358-374
Wang H, Du X, Li J, Yeh R A and Shakhnarovich G. 2022. Score Jacobian chaining: lifting pretrained 2D diffusion models for 3D generation//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 12619-12629 [DOI: 10.1109/CVPR52729.2023.01214]
Wang Z, Lu C, Wang Y, Bao F, Li C, Su H, et al. 2023. ProlificDreamer: high-fidelity and diverse text-to-3D generation with variational score distillation [EB/OL]. [2023-05-25]. https://arxiv.org/pdf/2305.16213
Yi T, Fang J, Wang J, Wu G, Xie L, Zhang X, Liu W, et al. 2024. GaussianDreamer: fast generation from text to 3D Gaussians by bridging 2D and 3D diffusion models//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 6796-6807 [DOI: 10.1109/CVPR52733.2024.00649]
Zhang H, Chen B, Yang H, Qu L, Wang X, Chen L, et al. 2024. AvatarVerse: high-quality & stable 3D avatar creation from text and pose//Proceedings of the AAAI Conference on Artificial Intelligence, 38(7): 7124-7132 [DOI: 10.1609/aaai.v38i7.28540]
Zhang M, Cai Z, Pan L, Hong F, Guo X, Yang L, et al. 2024. MotionDiffuse: text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6): 4115-4128 [DOI: 10.1109/TPAMI.2024.3355414]