Research Progress in Human-like Indoor Scene Interaction

(1. Institute for Interdisciplinary Information Sciences, Tsinghua University, Shanghai Artificial Intelligence Laboratory, Shanghai Qi Zhi Institute; 2. College of Computer Science and Software Engineering, Shenzhen University; 3. School of Intelligence Science and Technology, Peking University; 4. Institute for AI Industry Research, Tsinghua University)

Abstract
Human intelligence evolves through interaction with the environment; enabling intelligent agents to interact autonomously with their environment is therefore key to advancing the evolution of intelligence. Autonomous interaction with the environment is a research topic spanning computer graphics, computer vision, robotics, and other disciplines. It has attracted extensive attention in recent years, and the academic community has carried out a series of studies on this topic from different perspectives and technical dimensions. Focusing on human-like interaction in indoor scenes, this survey comprehensively reviews research progress on the three fundamental components involved when digital humans and robots learn to complete specific interaction tasks in indoor environments: simulation interaction platforms, scene interaction data, and interaction generation algorithms. Specifically, for building simulation interaction environments, we first review the relevant simulation techniques and their research progress, and introduce representative human-like interaction simulation platforms. For constructing scene interaction data, we survey the state of research at home and abroad from three aspects: perception datasets for scene interaction, motion datasets for scene interaction, and efficient scaling of interaction data. For human-like interaction perception and generation, we first introduce work on interaction-oriented scene affordance analysis and then, taking interaction generation as the main thread, review work on digital human-scene interaction generation and robot-scene interaction generation. Finally, based on the review and discussion of related work, we summarize the remaining challenges in this field from four aspects, namely interaction simulation, interaction data, interaction perception, and interaction generation, and outline future development trends.
Keywords
Research Progress in Human-like Indoor Scene Interaction

Du Tao3,4,5, Hu Ruizhen1, Liu Libin2, Yi Li3,4,5, Zhao Hao6 (1. College of Computer Science and Software Engineering, Shenzhen University; 2. School of Intelligence Science and Technology, Peking University; 3. Institute for Interdisciplinary Information Sciences, Tsinghua University; 4. Shanghai Artificial Intelligence Laboratory; 5. Shanghai Qi Zhi Institute; 6. Institute for AI Industry Research, Tsinghua University)

Abstract
Human intelligence evolves through interactions with the environment, which makes autonomous interaction between intelligent agents and their environment a key factor in advancing intelligence. Autonomous interaction with the environment is a research topic that spans multiple disciplines such as computer graphics, computer vision, and robotics, and it has attracted significant attention and exploration in recent years. In this article, we focus on human-like interaction in indoor environments and comprehensively review research progress on three fundamental components for digital humans and robots: simulation interaction platforms, scene interaction data, and interaction generation algorithms.

Regarding simulation interaction platforms, we comprehensively review representative simulation methods for virtual humans, objects, and human-object interaction. Specifically, we cover critical algorithms for articulated rigid-body simulation, deformable-body and cloth simulation, fluid simulation, contact and collision handling, and multi-body, multi-physics coupling. In addition, we introduce several popular simulation platforms that are readily available to practitioners in the graphics, robotics, and machine learning communities. We classify these platforms into two main categories, simulators focusing on single-physics systems and those supporting multi-physics systems, review typical platforms in each category, and discuss their advantages for human-like indoor-scene interaction. We also briefly discuss several emerging trends in the physical simulation community that point to promising future directions: developing a full-featured simulator for multi-physics, multi-body systems, equipping modern simulation platforms with differentiability, and combining physics principles with insights from learning techniques.

Regarding scene interaction data, we provide an in-depth review of the latest developments and trends in datasets that support the understanding and generation of human-scene interactions. We focus on the need for agents to perceive scenes with interaction in mind, assimilate interaction-related information, and recognize human interaction patterns in order to improve simulation and motion generation. Our review spans three areas: perception datasets for human-scene interaction, motion datasets for interaction, and methods for scaling data efficiently. Perception datasets facilitate a deeper understanding of 3D scenes, highlighting geometry, structure, functionality, and motion; they offer resources for interaction affordances, grasping poses, interactive components, and object positioning. Motion datasets, essential for crafting interactions, support interaction motion analysis, including motion segmentation, tracking, dynamic reconstruction, action recognition, and prediction. The fidelity and breadth of these datasets are vital for creating lifelike interactions. We also discuss scaling challenges, noting the limitations of manual annotation and specialized hardware, and review current solutions such as cost-effective capture systems, dataset integration, and data augmentation, which enable large-scale interaction data and models for advancing human-scene interaction research.

For robot-scene interaction, we emphasize the importance of affordance, i.e., the potential action possibilities that objects or environments offer to an agent. We discuss approaches for detecting and analyzing affordances at different granularities, as well as affordance modeling techniques that combine multi-source and multi-modal data. For digital human-scene interaction, we provide a detailed introduction to methods for simulating and generating human motion, focusing in particular on recent techniques based on deep learning and generative models. Building on this foundation, we review ways to represent a scene and recent successful approaches that achieve high-quality human-scene interaction simulation. Finally, we discuss the remaining challenges and future development trends in this field.
Keywords
