A survey on multimodal information-guided 3D human motion generation
Vol. 29, Issue 9, Pages 2541-2565 (2024)
Published: 16 September 2024
DOI: 10.11834/jig.230626
Zhao Baoquan, Fu Yiyu, Su Zhuo, Wang Ruomei, Lyu Chenlei, Luo Xiaonan. 2024. A survey on multimodal information-guided 3D human motion generation. Journal of Image and Graphics, 29(09):2541-2565
Three-dimensional (3D) digital human motion generation guided by multimodal information aims to generate human motion under specific input conditions from data such as text, audio, images, and video. This technology has a wide spectrum of applications and extensive economic and social benefits in film, animation, game production, the metaverse, and related fields, and it has become a research hotspot in computer graphics and computer vision. However, the task faces major challenges, including the difficulty of representing and fusing multimodal information, the lack of high-quality datasets, the poor quality of generated motion (such as jitter, penetration, and foot sliding), and low generation efficiency. Although various solutions have been proposed to address these challenges, how to achieve efficient, high-quality 3D digital human motion generation tailored to the characteristics of each modality remains an open problem.

This paper comprehensively reviews 3D digital human motion generation and elaborates on recent advances from the perspectives of parametrized 3D human models, human motion representation, motion generation techniques, motion analysis and editing, existing human motion datasets, and evaluation metrics. Parametrized human models facilitate digital human modeling and motion generation by providing parameters associated with body shape and posture, and they serve as key pillars of current digital human research and applications. The survey begins with an introduction to widely used parametrized 3D human body models, including shape completion and animation of people (SCAPE), the skinned multi-person linear model (SMPL), SMPL-X, and SMPL-H, and compares them in detail in terms of model representations and the parameters used to control body shape, pose, and facial expression.

Human motion representation is a core issue in digital human motion generation. This work highlights the musculoskeletal model and classic skinning algorithms, including linear blend skinning and dual quaternion skinning, together with their application in physics-based and data-driven methods for controlling human movement.

We also extensively review existing approaches to multimodal information-guided human motion generation and categorize them into four major branches: methods based on generative adversarial networks, autoencoders, variational autoencoders, and diffusion models. Other lines of work, such as generative motion matching, are also discussed and compared with these data-driven methods. From the perspectives of methods and model architectures, the survey distills existing schemes into a unified framework for digital human motion generation: a motion encoder extracts motion features from an original motion sequence and fuses them with the conditional features extracted by a condition encoder into latent variables, or maps them to a latent space, after which a generative adversarial network, autoencoder, variational autoencoder, or diffusion model produces qualified human movements through a motion decoder.

In addition, this paper surveys current work on digital human motion analysis and editing, including motion clustering, motion prediction, motion in-betweening, and motion in-filling.
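To make the unified framework described above concrete, the following is a minimal, illustrative sketch of the motion-encoder/condition-encoder/motion-decoder pattern, instantiated here as a conditional variational autoencoder in PyTorch. It is not the architecture of any specific surveyed method; the class name ConditionalMotionVAE and all dimensions (e.g., a 263-dimensional pose feature, a 512-dimensional condition embedding) are placeholder assumptions.

import torch
import torch.nn as nn

class ConditionalMotionVAE(nn.Module):
    # Illustrative sketch of the generic encoder-fusion-decoder pipeline.
    def __init__(self, motion_dim=263, cond_dim=512, latent_dim=256):
        super().__init__()
        # Motion encoder: extracts features from the original motion sequence.
        self.motion_enc = nn.GRU(motion_dim, latent_dim, batch_first=True)
        # Projection of the condition encoder's output (e.g., a text embedding).
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        self.to_mu = nn.Linear(2 * latent_dim, latent_dim)
        self.to_logvar = nn.Linear(2 * latent_dim, latent_dim)
        # Motion decoder: maps the latent back to a pose sequence.
        self.decoder = nn.GRU(latent_dim, motion_dim, batch_first=True)

    def forward(self, motion, cond):
        # motion: (B, T, motion_dim); cond: (B, cond_dim)
        _, h = self.motion_enc(motion)                         # h: (1, B, latent_dim)
        fused = torch.cat([h[-1], self.cond_proj(cond)], dim=-1)
        mu, logvar = self.to_mu(fused), self.to_logvar(fused)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        z_seq = z.unsqueeze(1).repeat(1, motion.shape[1], 1)   # broadcast latent over time
        recon, _ = self.decoder(z_seq)
        return recon, mu, logvar

Training such a model would combine a reconstruction loss with a Kullback-Leibler term; at inference time, z is sampled from the prior and decoded under the given condition. GAN-, autoencoder-, and diffusion-based variants replace the latent-modeling component while keeping the same encoder and decoder scaffolding.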
Data-driven human motion generation and evaluation require high-quality datasets. We collected publicly available human motion databases and classified them according to two criteria. From the perspective of data type, existing databases fall into motion capture datasets and video reconstruction datasets. Motion capture datasets rely on devices such as motion capture systems, cameras, and inertial measurement units to record real human movement (i.e., ground truth), whereas video reconstruction datasets are built by estimating body joints from motion videos and fitting them to a parametric human body model. From the perspective of task type, commonly used databases fall into text-, action-, and audio-motion datasets; such datasets are usually obtained by processing motion capture and video reconstruction data for specific tasks.

A comprehensive briefing on the evaluation metrics of 3D human motion generation is also provided, covering motion quality, motion diversity and multimodality, consistency between inputs and outputs, and inference efficiency. Apart from these objective metrics, user studies employed to assess the quality of generated motion are discussed as well. To compare the performance of various generation methods on public datasets, we selected a collection of the most representative works and carried out extensive experiments for comprehensive evaluation.
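As an illustration of two of the objective metrics summarized above, the following hedged NumPy/SciPy sketch computes a Fréchet-distance motion-quality score (the FID-style metric common in this literature) and a pairwise-distance diversity score. It assumes motion features have already been extracted by a pretrained feature extractor; the function names are ours, not taken from any cited work.

import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    # FID-style metric: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    # Inputs are (N, D) arrays of motion features from a pretrained extractor.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g).real  # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def diversity(feats, n_pairs=300, seed=0):
    # Mean Euclidean distance between randomly chosen pairs of generated motions.
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(feats), n_pairs)
    j = rng.integers(0, len(feats), n_pairs)
    return float(np.linalg.norm(feats[i] - feats[j], axis=1).mean())

A lower Fréchet distance indicates generated motions whose feature statistics better match real data, and a higher diversity score indicates richer variation; multimodality is typically computed like diversity but within motions generated from a single condition.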
Finally, the well-addressed and underexplored issues in this field are summarized, and several potential research directions regarding datasets, the quality and diversity of generated motions, cross-modal information fusion, and generation efficiency are discussed. Specifically, existing datasets generally fall short in motion diversity, the descriptions associated with motions, data distribution, and the length of motion sequences; future work should consider building a large-scale 3D human motion database to boost the efficacy and robustness of motion generation models. In addition, the quality of generated motions, especially those with complex movement patterns, remains unsatisfactory; physical constraints and postprocessing show promise when integrated into motion generation frameworks to tackle these issues. Moreover, although existing methods can generate diverse motion sequences from multimodal information such as text, audio, music, actions, and keyframes, work on cross-modal human motion generation (e.g., generating a motion from a text description together with a piece of background music) is scarcely reported; investigating this task is worthwhile and may unlock new opportunities in the area. In terms of the diversity of generated content, some researchers have explored harvesting rich, diverse, and stylized motions using variational autoencoders, diffusion models, and contrastive language-image pretraining (CLIP) networks. However, current studies mainly focus on motion generation for a single human represented by an SMPL-like unclothed parameterized 3D model, whereas the generation of, and interaction among, multiple dressed humans has huge untapped application potential and has not yet received sufficient attention. Another nonnegligible issue is how to boost motion generation efficiency and strike a good balance between quality and inference overhead. Possible solutions include lightweight parameterized human models, information-dense training datasets, and improved or more advanced generative frameworks.
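As background for the classic skinning algorithms discussed earlier in this abstract, the following is a minimal NumPy sketch of linear blend skinning, which poses each rest-pose vertex with a weight-blended combination of joint transforms; the array names and shapes are illustrative assumptions.

import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    # vertices:   (V, 3) rest-pose vertex positions
    # weights:    (V, J) skinning weights; each row sums to 1
    # transforms: (J, 4, 4) homogeneous world transforms of the J joints
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    # Per-vertex blended transform: sum_j w_vj * T_j  ->  (V, 4, 4)
    blended = np.einsum('vj,jab->vab', weights, transforms)
    # Apply each vertex's blended transform and drop the homogeneous coordinate.
    posed = np.einsum('vab,vb->va', blended, homo)
    return posed[:, :3]

Because the blend is linear in the transform matrices, LBS can lose volume near strongly bent joints (the well-known "candy-wrapper" artifact); dual quaternion skinning instead blends rigid transforms as dual quaternions to mitigate this.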
3D avatar; motion generation; multimodal information; parametric human model; generative adversarial network (GAN); autoencoder (AE); variational autoencoder (VAE); diffusion model
Aberman K, Li P Z, Lischinski D, Sorkine-Hornung O, Cohen-Or D and Chen B Q. 2020. Skeleton-aware networks for deep motion retargeting. ACM Transactions on Graphics, 39(4): #62 [DOI: 10.1145/3386569.3392462]
Ahn H, Ha T, Choi Y, Yoo H and Oh S. 2018. Text2Action: generative adversarial synthesis from language to action//Proceedings of 2018 IEEE International Conference on Robotics and Automation. Brisbane, Australia: IEEE: 5915-5920 [DOI: 10.1109/icra.2018.8460608]
Ahuja C and Morency L P. 2019. Language2Pose: natural language grounded pose forecasting//Proceedings of 2019 International Conference on 3D Vision. Quebec City, Canada: IEEE: 719-728 [DOI: 10.1109/3dv.2019.00084]
Anguelov D, Srinivasan P, Koller D, Thrun S, Rodgers J and Davis J. 2005. SCAPE: shape completion and animation of people. ACM Transactions on Graphics, 24(3): 408-416 [DOI: 10.1145/1073204.1073207]
Ao T L, Gao Q Z, Lou Y K, Chen B Q and Liu L B. 2022. Rhythmic gesticulator: rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transactions on Graphics, 41(6): #209 [DOI: 10.1145/3550454.3555435]
Ao T L, Zhang Z Y and Liu L B. 2023. GestureDiffuCLIP: gesture diffusion model with CLIP latents. ACM Transactions on Graphics, 42(4): #42 [DOI: 10.1145/3592097]
Barsoum E, Kender J and Liu Z C. 2018. HP-GAN: probabilistic 3D human motion prediction via GAN//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, USA: IEEE: 149901-149909 [DOI: 10.1109/CVPRW.2018.00191]
Bengio Y, Louradour J, Collobert R and Weston J. 2009. Curriculum learning//Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Canada: ACM: 41-48 [DOI: 10.1145/1553374.1553380]
Bergamin K, Clavet S, Holden D and Forbes J R. 2019. DReCon: data-driven responsive control of physics-based characters. ACM Transactions on Graphics, 38(6): #206 [DOI: 10.1145/3355089.3356536]
Bińkowski M, Sutherland D J, Arbel M and Gretton A. 2018. Demystifying MMD GANs//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: OpenReview.net
Bogo F, Kanazawa A, Lassner C, Gehler P, Romero J and Black M J. 2016. Keep it SMPL: automatic estimation of 3D human pose and shape from a single image//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 561-578 [DOI: 10.1007/978-3-319-46454-1_34]
Cao Z, Hidalgo G, Simon T, Wei S E and Sheikh Y. 2021. OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1): 172-186 [DOI: 10.1109/TPAMI.2019.2929257]
Cervantes P, Sekikawa Y, Sato I and Shinoda K. 2022. Implicit neural representations for variable length human motion generation//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 356-372 [DOI: 10.1007/978-3-031-19790-1_22]
Chen J Y, Yan M, Zhang J Z, Xu Y Z, Li X L, Weng Y J, Yi L, Song S R and Wang H. 2023a. Tracking and reconstructing hand object interactions from point cloud sequences in the wild//Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI: 304-312 [DOI: 10.1609/aaai.v37i1.25103]
Chen X, Jiang B, Liu W, Huang Z L, Fu B, Chen T and Yu G. 2023b. Executing your commands via motion diffusion in latent space//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 18000-18010 [DOI: 10.1109/cvpr52729.2023.01726]
Chen X L, Fang H, Lin T Y, Vedantam R, Gupta S, Dollár P and Zitnick C L. 2015. Microsoft COCO captions: data collection and evaluation server [EB/OL]. [2023-04-16]. https://arxiv.org/pdf/1504.00325.pdf
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734 [DOI: 10.3115/v1/d14-1179]
Dabral R, Mughal M H, Golyanik V and Theobalt C. 2023. MoFusion: a framework for denoising-diffusion-based motion synthesis//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 9760-9770 [DOI: 10.1109/cvpr52729.2023.00941]
Dagioglou M, Tsitos A C, Smarnakis A and Karkaletsis V. 2021. Smoothing of human movements recorded by a single RGB-D camera for robot demonstrations//Proceedings of the 14th PErvasive Technologies Related to Assistive Environments Conference. Corfu, Greece: ACM: 496-501 [DOI: 10.1145/3453892.3461627]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [DOI: 10.1109/cvpr.2009.5206848]
Duan Y L, Shi T Y, Zou Z X, Lin Y N, Qian Z H, Zhang B H and Yuan Y. 2021. Single-shot motion completion with Transformer [EB/OL]. [2023-07-21]. https://arxiv.org/pdf/2103.00776.pdf
Ghorbani S, Mahdaviani K, Thaler A, Kording K, Cook D J, Blohm G and Troje N F. 2021. MoVi: a large multi-purpose human motion and video dataset. PLoS One, 16(6): #e0253157 [DOI: 10.1371/journal.pone.0253157]
Ghosh A, Cheema N, Oguz C, Theobalt C and Slusallek P. 2021. Synthesis of compositional animations from textual descriptions//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 1376-1386 [DOI: 10.1109/ICCV48922.2021.00143]
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139-144 [DOI: 10.1145/3422622]
Guo C, Zou S H, Zuo X X, Wang S, Ji W, Li X Y and Cheng L. 2022a. Generating diverse and natural 3D human motions from text//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 5142-5151 [DOI: 10.1109/cvpr52688.2022.00509]
Guo C, Zuo X X, Wang S and Cheng L. 2022b. TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 580-597 [DOI: 10.1007/978-3-031-19833-5_34]
Guo C, Zuo X X, Wang S, Zou S H, Sun Q Y, Deng A N, Gong M L and Cheng L. 2020. Action2Motion: conditioned generation of 3D human motions//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 2021-2029 [DOI: 10.1145/3394171.3413635]
Harvey F G, Yurick M, Nowrouzezahrai D and Pal C. 2020. Robust motion in-betweening. ACM Transactions on Graphics, 39(4): #60 [DOI: 10.1145/3386569.3392480]
Hertz A, Mokady R, Tenenbaum J, Aberman K, Pritch Y and Cohen-Or D. 2023. Prompt-to-prompt image editing with cross-attention control//Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net: 1-36
Heusel M, Ramsauer H, Unterthiner T, Nessler B and Hochreiter S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, USA: Curran Associates Inc.: 6629-6640 [DOI: 10.5555/3295222.3295408]
Ho J, Jain A and Abbeel P. 2020. Denoising diffusion probabilistic models//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: #574 [DOI: 10.5555/3495724.3496298]
Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735]
Ionescu C, Papava D, Olaru V and Sminchisescu C. 2014. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7): 1325-1339 [DOI: 10.1109/tpami.2013.248]
Ji Y L, Xu F X, Yang Y, Shen F M, Shen H T and Zheng W S. 2018. A large-scale RGB-D database for arbitrary-view human action recognition//Proceedings of the 26th ACM International Conference on Multimedia. Seoul, Korea (South): ACM: 1510-1518 [DOI: 10.1145/3240508.3240675]
Jiang B, Chen X, Liu W, Yu J Y, Yu G and Chen T. 2023. MotionGPT: human motion as a foreign language//Proceedings of the 37th Conference on Neural Information Processing Systems. New Orleans, USA: NeurIPS
Kaufmann M, Aksan E, Song J, Pece F, Ziegler R and Hilliges O. 2020. Convolutional autoencoders for human motion infilling//Proceedings of 2020 International Conference on 3D Vision. Fukuoka, Japan: IEEE: 918-927 [DOI: 10.1109/3DV50981.2020.00102]
Kim J, Kim J and Choi S. 2023. FLAME: free-form language-based motion synthesis and editing//Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI: 8255-8263 [DOI: 10.1609/aaai.v37i7.25996]
Kingma D P and Welling M. 2014. Auto-encoding variational Bayes//Proceedings of the 2nd International Conference on Learning Representations. Banff, Canada: ICLR
Kocabas M, Athanasiou N and Black M J. 2020. VIBE: video inference for human body pose and shape estimation//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 5252-5262 [DOI: 10.1109/cvpr42600.2020.00530]
Kwiatkowski A, Alvarado E, Kalogeiton V, Liu C K, Pettré J, van de Panne M and Cani M P. 2022. A survey on reinforcement learning methods in character animation. Computer Graphics Forum, 41(2): 613-639 [DOI: 10.1111/cgf.14504]
Lee S, Lee S, Lee Y and Lee J. 2021. Learning a family of motor skills from a single motion clip. ACM Transactions on Graphics, 40(4): #93 [DOI: 10.1145/3450626.3459774]
Lee S, Park M, Lee K and Lee J. 2019. Scalable muscle-actuated human simulation and control. ACM Transactions on Graphics, 38(4): #73 [DOI: 10.1145/3306346.3322972]
Li P Z, Aberman K, Zhang Z H, Hanocka R and Sorkine-Hornung O. 2022. GANimator: neural motion synthesis from a single sequence. ACM Transactions on Graphics, 41(4): #138 [DOI: 10.1145/3528223.3530157]
Li W Y, Chen X L, Li P Z, Sorkine-Hornung O and Chen B Q. 2023. Example-based motion synthesis via generative motion matching. ACM Transactions on Graphics, 42(4): #94 [DOI: 10.1145/3592395]
Lin A S, Wu L M, Corona R, Tai K, Huang Q X and Mooney R J. 2018. Generating animated videos of human activities from natural language descriptions//Proceedings of the 32nd Conference on Neural Information Processing Systems. Montréal, Canada: NeurIPS
Lin J, Zeng A L, Wang H Q, Zhang L and Li Y. 2023. One-stage 3D whole-body mesh recovery with component aware Transformer//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 21159-21168 [DOI: 10.1109/cvpr52729.2023.02027]
Lin X and Amer M R. 2018. Human motion modeling using DVGANs [EB/OL]. [2023-07-12]. https://arxiv.org/pdf/1804.10652.pdf
Liu H F, Chen J J, Li L, Bao B K, Li Z C, Liu J Y and Nie L Q. 2023. Cross-modal representation learning and generation. Journal of Image and Graphics, 28(6): 1608-1629 [DOI: 10.11834/jig.230035]
Loper M, Mahmood N, Romero J, Pons-Moll G and Black M J. 2015. SMPL: a skinned multi-person linear model. ACM Transactions on Graphics, 34(6): #248 [DOI: 10.1145/2816795.2818013]
Lyu K D, Chen H P, Liu Z G, Zhang B Q and Wang R L. 2022. 3D human motion prediction: a survey. Neurocomputing, 489: 345-365 [DOI: 10.1016/j.neucom.2022.02.045]
Mahmood N, Ghorbani N, Troje N F, Pons-Moll G and Black M J. 2019. AMASS: archive of motion capture as surface shapes//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 5441-5450 [DOI: 10.1109/ICCV.2019.00554]
Memar A M and Yan H. 2022. Noise reduction in human motion-captured signals for computer animation based on B-spline filtering. Sensors, 22(12): #4629 [DOI: 10.3390/s22124629]
Mourot L, Hoyet L, Le Clerc F, Schnitzler F and Hellier P. 2022. A survey on deep learning for skeleton-based human animation. Computer Graphics Forum, 41(1): 122-157 [DOI: 10.1111/cgf.14426]
Osman A A A, Bolkart T and Black M J. 2020. STAR: sparse trained articulated human body regressor//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 598-613 [DOI: 10.1007/978-3-030-58539-6_36]
Pavlakos G, Choutas V, Ghorbani N, Bolkart T, Osman A A A, Tzionas D and Black M J. 2019. Expressive body capture: 3D hands, face, and body from a single image//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 10967-10977 [DOI: 10.1109/CVPR.2019.01123]
Peng X B, Abbeel P, Levine S and van de Panne M. 2018. DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4): #143 [DOI: 10.1145/3197517.3201311]
Petrovich M, Black M J and Varol G. 2021. Action-conditioned 3D human motion synthesis with Transformer VAE//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10965-10975 [DOI: 10.1109/iccv48922.2021.01080]
Petrovich M, Black M J and Varol G. 2022. TEMOS: generating diverse human motions from textual descriptions//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 480-497 [DOI: 10.1007/978-3-031-20047-2_28]
Plappert M, Mandery C and Asfour T. 2016. The KIT motion-language dataset. Big Data, 4(4): 236-252 [DOI: 10.1089/big.2016.0028]
Punnakkal A R, Chandrasekaran A, Athanasiou N, Quirós-Ramirez A and Black M J. 2021. BABEL: bodies, action and behavior with English labels//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 722-731 [DOI: 10.1109/cvpr46437.2021.00078]
Raab S, Leibovitch I, Li P Z, Aberman K, Sorkine-Hornung O and Cohen-Or D. 2023. MoDi: unconditional motion synthesis from diverse data//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 13873-13883 [DOI: 10.1109/CVPR52729.2023.01333]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. [s.l.]: PMLR: 8748-8763
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2022. High-resolution image synthesis with latent diffusion models//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 10674-10685 [DOI: 10.1109/CVPR52688.2022.01042]
Romero J, Tzionas D and Black M J. 2017. Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6): #245 [DOI: 10.1145/3130800.3130883]
Saharia C, Chan W, Saxena S, Li L L, Whang J, Denton E, Ghasemipour S K S, Ayan B K, Mahdavi S S, Lopes R G, Salimans T, Ho J, Fleet D J and Norouzi M. 2022. Photorealistic text-to-image diffusion models with deep language understanding//Proceedings of the 36th Conference on Neural Information Processing Systems. New Orleans, USA: NeurIPS
Sigal L, Balan A O and Black M J. 2010. HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1/2): 4-27 [DOI: 10.1007/s11263-009-0273-6]
Sohl-Dickstein J, Weiss E A, Maheswaranathan N and Ganguli S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR.org: 2256-2265 [DOI: 10.5555/3045118.3045358]
Song J M, Meng C L and Ermon S. 2021. Denoising diffusion implicit models//Proceedings of the 9th International Conference on Learning Representations. [s.l.]: OpenReview.net
Sutskever I, Vinyals O and Le Q V. 2014. Sequence to sequence learning with neural networks//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 3104-3112 [DOI: 10.5555/2969033.2969173]
Taheri O, Choutas V, Black M J and Tzionas D. 2022. GOAL: generating 4D whole-body motion for hand-object grasping//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13253-13263 [DOI: 10.1109/CVPR52688.2022.01291]
Tang X J, Wang H, Hu B, Gong X, Yi R F, Kou Q L and Jin X G. 2022. Real-time controllable motion transition for characters. ACM Transactions on Graphics, 41(4): #137 [DOI: 10.1145/3528223.3530090]
Tevet G, Gordon B, Hertz A, Bermano A H and Cohen-Or D. 2022. MotionCLIP: exposing human motion generation to CLIP space//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 358-374 [DOI: 10.1007/978-3-031-20047-2_21]
Tevet G, Raab S, Gordon B, Shafir Y, Cohen-Or D and Bermano A H. 2023. Human motion diffusion model//Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net
van den Oord A, Vinyals O and Kavukcuoglu K. 2017. Neural discrete representation learning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, USA: Curran Associates Inc.: 6309-6318 [DOI: 10.5555/3295222.3295378]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010 [DOI: 10.5555/3295222.3295349]
von Marcard T, Henschel R, Black M J, Rosenhahn B and Pons-Moll G. 2018. Recovering accurate 3D human pose in the wild using IMUs and a moving camera//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 614-631 [DOI: 10.1007/978-3-030-01249-6_37]
Wang M, Qiu F, Liu W T, Qian C, Zhou X W and Ma L Z. 2020. Monocular human pose and shape reconstruction using part differentiable rendering. Computer Graphics Forum, 39(7): 351-362 [DOI: 10.1111/cgf.14150]
Wang Z M, Wang J, Ge N and Lu J H. 2023. HiMoReNet: a hierarchical model for human motion refinement. IEEE Signal Processing Letters, 30: 868-872 [DOI: 10.1109/LSP.2023.3295756]
Yi H W, Liang H L, Liu Y F, Cao Q, Wen Y D, Bolkart T, Tao D C and Black M J. 2023. Generating holistic 3D human motion from speech//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 469-480 [DOI: 10.1109/cvpr52729.2023.00053]
Yi X Y, Zhou Y X, Habermann M, Shimada S, Golyanik V, Theobalt C and Xu F. 2022. Physical inertial poser (PIP): physics-aware real-time human motion tracking from sparse inertial sensors//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13157-13168 [DOI: 10.1109/CVPR52688.2022.01282]
Yuan Y, Song J M, Iqbal U, Vahdat A and Kautz J. 2023. PhysDiff: physics-guided human motion diffusion model//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 15964-15975
Zhang H, Wang J M and Liu H. 2021. Video-based reconstruction of smooth 3D human body motion//Proceedings of the 4th Chinese Conference on Pattern Recognition and Computer Vision. Beijing, China: Springer: 42-53 [DOI: 10.1007/978-3-030-88007-1_4]
Zhang J R, Zhang Y S, Cun X D, Zhang Y, Zhao H W, Lu H T, Shen X and Ying S. 2023a. Generating human motion from textual descriptions with discrete representations//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 14730-14740 [DOI: 10.1109/CVPR52729.2023.01415]
Zhang M Y, Cai Z A, Pan L, Hong F Z, Guo X Y, Yang L and Liu Z W. 2022. MotionDiffuse: text-driven human motion generation with diffusion model [EB/OL]. [2023-03-21]. https://arxiv.org/pdf/2208.15001.pdf
Zhang M Y, Guo X Y, Pan L, Cai Z A, Hong F Z, Li H R, Yang L and Liu Z W. 2023b. ReMoDiffuse: retrieval-augmented motion diffusion model//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 364-373 [DOI: 10.1109/ICCV51070.2023.00040]
Zheng J T, Zheng Q Y, Fang L X, Liu Y and Yi L. 2023. CAMS: CAnonicalized manipulation spaces for category-level functional hand-object manipulation synthesis//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 585-594 [DOI: 10.1109/CVPR52729.2023.00064]
Zhuang W L, Wang C Y, Chai J X, Wang Y G, Shao M and Xia S Y. 2022. Music2Dance: DanceNet for music-driven dance generation. ACM Transactions on Multimedia Computing, Communications, and Applications, 18(2): #65 [DOI: 10.1145/3485664]
Zou Y L, Yang J M, Ceylan D, Zhang J M, Perazzi F and Huang J B. 2020. Reducing footskate in human motion reconstruction with ground contact constraints//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass, USA: IEEE: 448-457 [DOI: 10.1109/WACV45572.2020.9093329]