Research progress in human-like indoor scene interaction

Du Tao1,2,3, Hu Ruizhen4, Liu Libin5, Yi Li1,2,3, Zhao Hao6 (1. Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China; 2. Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China; 3. Shanghai Qi Zhi Institute, Shanghai 200232, China; 4. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518061, China; 5. School of Intelligence Science and Technology, Peking University, Beijing 100871, China; 6. Institute for AI Industry Research, Tsinghua University, Beijing 100084, China)

Abstract

Human intelligence evolves through interaction with the environment, which makes autonomous interaction between intelligent agents and their environment a key factor in advancing intelligence. Autonomous interaction with the environment is a research topic that spans multiple disciplines, such as computer graphics, computer vision, and robotics, and it has attracted significant attention and exploration in recent years. In this study, we focus on human-like interaction in indoor environments and comprehensively review the research progress in three fundamental components for digital humans and robots: simulation interaction platforms, scene interaction data, and interaction generation algorithms. Regarding simulation interaction platforms, we comprehensively review representative simulation methods for virtual humans, objects, and human-object interaction. Specifically, we cover critical algorithms for articulated rigid-body simulation, deformable-body and cloth simulation, fluid simulation, contact and collision handling, and multi-body multi-physics coupling. In addition, we introduce several popular simulation platforms that are readily available to practitioners in the graphics, robotics, and machine learning communities. We classify these platforms into two main categories: simulators focusing on single-physics systems and those supporting multi-physics systems. We review typical simulation platforms in both categories and discuss their advantages for human-like indoor-scene interaction. Finally, we briefly discuss several emerging trends in the physical simulation community that inspire promising future directions: developing a full-featured simulator for multi-physics, multi-body physical systems; equipping modern simulation platforms with differentiability; and combining physics principles with insights from learning techniques.
Regarding scene interaction data, we provide an in-depth review of the latest developments and trends in datasets that support the understanding and generation of human-scene interactions. We focus on the need for agents to perceive scenes with an emphasis on interaction, assimilate interactive information, and recognize human interaction patterns to improve simulation and motion generation. Our review spans three areas: perception datasets for human-scene interaction, datasets for interaction motion, and methods for scaling data efficiently. Perception datasets facilitate a deeper understanding of 3D scenes, highlighting geometry, structure, functionality, and motion; they offer resources for interaction affordances, grasping poses, interactive components, and object positioning. Motion datasets, which are essential for crafting interactions, support interaction movement analysis, including motion segmentation, tracking, dynamic reconstruction, action recognition, and prediction. The fidelity and breadth of these datasets are vital for creating lifelike interactions. We also discuss scaling challenges, including the limitations of manual annotation and specialized hardware, and explore current solutions, such as cost-effective capture systems, dataset integration, and data augmentation, that enable the construction of large-scale interaction data for advancing human-scene interaction research. Regarding robot-scene interaction, this study emphasizes the importance of affordance, that is, the potential action possibilities that objects or environments offer to users. It discusses approaches for detecting and analyzing affordances at different granularities, as well as affordance modeling techniques that combine multi-source and multimodal data. Regarding digital human-scene interaction, this study provides a detailed introduction to the simulation and generation of human motion, with particular focus on recent techniques based on deep learning and generative models.
Building on this foundation, the study reviews ways to represent a scene and recent successful approaches that achieve high-quality human-scene interaction simulation. Finally, we summarize the challenges that remain in four aspects, namely, interaction simulation, interaction data, interaction perception, and interaction generation, and discuss future development trends in this field.