

Abstract
Research Progress in Human-like Indoor Scene Interaction

Du Tao, Hu Ruizhen1, Liu Libin2, Yi Li3,4,5, Zhao Hao6 (1. College of Computer Science and Software Engineering, Shenzhen University; 2. School of Intelligence Science and Technology, Peking University; 3. Institute for Interdisciplinary Information Sciences, Tsinghua University; 4. Shanghai Artificial Intelligence Laboratory; 5. Shanghai Qi Zhi Institute; 6. Institute for AI Industry Research, Tsinghua University)

Human intelligence evolves through interactions with the environment, making autonomous interaction between intelligent agents and the environment a key factor in advancing intelligence. Autonomous interaction with the environment is a research topic that spans multiple disciplines, including computer graphics, computer vision, and robotics, and it has attracted significant attention in recent years. In this article, we focus on human-like interaction in indoor environments and review the research progress in its fundamental components: simulation interaction platforms, scene interaction data, and interaction generation algorithms for digital humans and robots. Regarding simulation interaction platforms, we review representative simulation methods for virtual humans, objects, and human-object interaction. Specifically, we cover critical algorithms for articulated rigid-body simulation, deformable-body and cloth simulation, fluid simulation, contact and collision, and multi-body multi-physics coupling. In addition, we introduce several popular simulation platforms that are readily available to practitioners in the graphics, robotics, and machine learning communities. We classify these platforms into two main categories: simulators focusing on single-physics systems and those supporting multi-physics systems. We review typical simulation platforms in both categories and discuss their advantages for human-like indoor-scene interaction. Finally, we briefly discuss several emerging trends in the physical simulation community that suggest promising future directions: developing a full-featured simulator for multi-physics multi-body systems, equipping modern simulation platforms with differentiability, and combining physics principles with insights from learning techniques.
Regarding scene interaction data, we provide an in-depth review of the latest developments and trends in datasets that support the understanding and generation of human-scene interactions. We emphasize the need for agents to perceive scenes from an interaction-centric perspective, assimilate interactive information, and recognize human interaction patterns in order to improve simulation and motion generation. Our review spans three areas: perception datasets for human-scene interaction, datasets for interaction motion, and methods for scaling data efficiently. Perception datasets facilitate a deeper understanding of 3D scenes, highlighting geometry, structure, functionality, and motion. They offer resources for interaction affordances, grasping poses, interactive components, and object positioning. Motion datasets, essential for generating interactions, support interaction motion analysis, including motion segmentation, tracking, dynamic reconstruction, action recognition, and prediction. The fidelity and breadth of these datasets are vital for creating lifelike interactions. We also discuss scaling challenges, noting the limitations of manual annotation and specialized hardware, and explore current solutions such as cost-effective capture systems, dataset integration, and data augmentation, which enable the training of large-scale interaction models for advancing human-scene interaction research.
Regarding robot-scene interaction, we emphasize the importance of affordance, i.e., the action possibilities that objects or environments offer to users. We discuss approaches for detecting and analyzing affordance at different granularities, as well as affordance modeling techniques that combine multi-source and multi-modal data.
Regarding digital human-scene interaction, we provide a detailed introduction to methods for simulating and generating human motion, focusing in particular on recent techniques based on deep learning and generative models. Building on this foundation, we review ways to represent a scene and recent successful approaches that achieve high-quality human-scene interaction simulation. Finally, we discuss the challenges and future development trends in this field.