Scene point cloud understanding and reconstruction technologies in 3D space

Gong Jingyu1, Lou Yujing1, Liu Fengqi1, Zhang Zhiwei1, Chen Haoming2, Zhang Zhizhong2, Tan Xin2, Xie Yuan2, Ma Lizhuang1,2 (1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China)

Abstract
3D scene understanding and reconstruction enable computers to reproduce real scenes with high fidelity and to reason about the real world in 3D, so that machines become intelligent enough to participate in real-world production and construction and to support human decision-making and daily life through scene simulation. The core techniques include scene point cloud feature extraction, registration and fusion of scanned point clouds, scene understanding and semantic segmentation, and completion and fine-grained reconstruction of scanned objects. When processing real scans, the scanning device, viewing angle, distance, and scene complexity all affect the data, placing high demands on the accuracy and robustness of these techniques and making them highly challenging. Among them, feature extraction and registration of raw scans aim to match features across multiple scanned regions of the same scene and fuse them into a complete scene point cloud, forming the foundation of understanding and reconstruction. Scene understanding and semantic segmentation aim to perceive the scene model as a whole and partition it, according to semantic features, into point clouds of functional objects or even their parts, and constitute the core of the pipeline. Subsequent fine-grained object completion focuses on recovering the structure of scanned objects and completing their missing regions, and is the key technique for fine-grained reconstruction of scene objects. Around these techniques, this paper analyzes the application domains and research directions of point-cloud-based scene understanding and reconstruction, summarizes recent advances and research results at home and abroad, and discusses promising future research directions and technical developments.
3D scene understanding and reconstruction are essential for machine vision and intelligence. They aim to reconstruct complete models of real scenes from multiple scene scans and to understand the semantic meaning of each functional component in the scene. These techniques are indispensable for real-world digitalization and simulation and can be widely applied in domains such as robotics, navigation systems, and virtual tourism. Three key challenges must be resolved: 1) recognizing the same area across multiple real scans and fusing all scans into an integrated scene point cloud; 2) making sense of the whole scene and recognizing the semantics of its functional components; and 3) completing regions of the original point cloud that are missing due to occlusion during scanning. To fuse multiple real scene scans into an integrated point cloud, point cloud features must be extracted that are invariant to scanning position and rotation. Intrinsic geometric quantities, such as point distances and the singular values of neighborhood covariance matrices, are therefore often used in rotation-invariant feature design. A contrastive learning scheme is usually adopted so that features learned from the same area are close to each other while features extracted from different areas are pushed apart. To improve generalization, data augmentation of the scanned point clouds can also be applied during feature learning. With the learned features, the pose of the scanning device can be estimated and the transformation matrix between point cloud pairs computed. Once the transformation relationships are determined, the raw scans can be fused into a single point cloud.
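The two ingredients above — rotation-invariant neighborhood descriptors and pairwise transformation estimation — can be sketched in a few lines of NumPy. This is an illustrative toy, not any specific surveyed method; the function names are our own, and the alignment step uses the classical least-squares (Kabsch) solution:

```python
import numpy as np

def rotation_invariant_feature(patch):
    """Singular values of the neighborhood covariance matrix:
    unchanged under any rigid rotation/translation of the patch."""
    centered = patch - patch.mean(axis=0)
    cov = centered.T @ centered / len(patch)
    return np.linalg.svd(cov, compute_uv=False)

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch algorithm) from matched
    point pairs: dst ~ R @ src + t, used to align a point cloud pair."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t

rng = np.random.default_rng(0)
patch = rng.normal(size=(64, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
if np.linalg.det(q) < 0:                      # make it a proper rotation
    q[:, 0] = -q[:, 0]
t_true = np.array([0.5, -1.0, 2.0])
moved = patch @ q.T + t_true

R, t = estimate_rigid_transform(patch, moved)
assert np.allclose(R, q, atol=1e-8) and np.allclose(t, t_true, atol=1e-8)
# The descriptor is identical before and after the rigid motion.
assert np.allclose(rotation_invariant_feature(patch),
                   rotation_invariant_feature(moved), atol=1e-8)
```

In practice such hand-crafted invariants are complemented or replaced by learned features, but the fusion step still reduces to estimating one rigid transform per matched scan pair.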
To further understand a whole scene from raw point clouds and segment it into functional parts according to multiple semantics, an effective and efficient network with appropriate 3D convolution operations is required to parse the entire point-based scene hierarchically, together with learning schemes adapted to various situations. The definition and formulation of the basic convolution operation in 3D space is recognized as the core of pattern recognition for 3D scene point clouds. It is closely tied to how the convolution kernel is approximated in 3D space, where feature extraction is realized through appropriate point cloud grouping and down/up-sampling. The discrete approximation of continuous 3D convolution should be able to recognize diverse geometric patterns while keeping as few parameters as possible. Network design built on these elementary 3D convolution operations is likewise a fundamental ingredient of strong scene parsing. Furthermore, point-level semantic segmentation of scanned scenes can be coupled with related tasks such as boundary detection, instance segmentation, and scene coloring, where network parameters are supervised through additional auxiliary regularization. Semi-supervised and weakly supervised methods are needed to overcome the lack of annotation for real data. The segmentation results and semantic hints can in turn strengthen fine-grained completion of object point clouds extracted from scanned scenes: segmented objects can be handled separately, and semantics can provide structural and geometric priors when occlusion-induced missing regions are completed. For learning object point cloud completion, it is crucial to learn a compact latent code space that represents all complete shapes and to design a versatile decoder that reconstructs both the structure and the fine-grained geometric details of the object point cloud.
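As a minimal sketch of such a discrete 3D convolution (assuming a simple k-nearest-neighbor grouping, a shared linear kernel over relative coordinates plus features, and max-pooling — names and shapes are illustrative, not taken from any particular network):

```python
import numpy as np

def knn_group(points, centers, k):
    """Indices of the k nearest neighbors of each center point."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def point_conv(points, feats, centers, k, W):
    """Discrete approximation of 3D convolution on an irregular point
    set: group neighbors, lift [relative coords, features] with one
    shared linear kernel W, apply ReLU, then max-pool per neighborhood."""
    idx = knn_group(points, centers, k)
    out = []
    for c, nbrs in zip(centers, idx):
        rel = points[nbrs] - c                          # (k, 3) relative coords
        x = np.concatenate([rel, feats[nbrs]], axis=1)  # (k, 3 + f)
        out.append(np.maximum(x @ W, 0.0).max(axis=0))  # pool over neighbors
    return np.stack(out)

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))                   # scene points
feats = rng.normal(size=(128, 4))                 # per-point features
centers = pts[rng.choice(128, size=16, replace=False)]  # down-sampled centers
W = rng.normal(size=(3 + 4, 32)) * 0.1            # shared kernel, few parameters
out = point_conv(pts, feats, centers, k=16, W=W)
assert out.shape == (16, 32)
```

Stacking such layers while shrinking the center set realizes the hierarchical down-sampling the text describes; up-sampling layers then propagate features back to every point for point-level segmentation.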
The learned latent code space should cover as many complete shapes as possible, which requires large-scale synthetic model datasets for training to ensure generalization. The encoder should recognize the structure of the original point cloud and extract the specific geometric patterns that preserve this information in the latent code, while the decoder recovers the overall skeleton of the scanned object and completes all details according to existing local geometric hints. For completing real scanned objects, the latent code space must be further optimized to integrate synthetic models and real scanned point clouds; a cross-domain learning scheme transfers completion knowledge to real object scans while preserving the details of the real scanned object in the completed result. We analyze the current state of scene understanding and reconstruction, including point cloud fusion, 3D convolution operations, whole-scene segmentation, and fine-grained object completion, survey frontier technologies, and predict promising future research trends. Future research should pay more attention to open environments, which pose further challenges in computing efficiency, handling out-of-domain knowledge, and more complex situations involving human-scene interaction. 3D scene understanding and reconstruction will help machines understand the real world in a more natural way, facilitating application domains such as robotics and navigation. It can also enable plausible simulation of the real world based on the reconstruction and parsing of real scenes, making it a useful tool for decision-making.
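The encoder-decoder structure for completion can be sketched as follows — an untrained, shapes-only toy assuming a PointNet-style max-pooled encoder and a folding-style decoder that deforms a 2D grid; all names and dimensions here are illustrative:

```python
import numpy as np

def encode(partial, W1, W2):
    """Order-invariant encoder: shared per-point MLP, then max-pool
    the partial scan into a single compact latent code."""
    h = np.maximum(partial @ W1, 0.0)   # (n, hidden)
    return (h @ W2).max(axis=0)         # (latent,)

def decode(z, grid, W):
    """Folding-style decoder: tile the latent code over a 2D grid and
    map each (code, grid point) pair to a point of the complete shape."""
    x = np.concatenate([np.tile(z, (len(grid), 1)), grid], axis=1)
    return x @ W                        # (m, 3)

rng = np.random.default_rng(0)
partial = rng.normal(size=(200, 3))     # occluded object scan
W1 = rng.normal(size=(3, 64)) * 0.1
W2 = rng.normal(size=(64, 16)) * 0.1
u = np.linspace(0.0, 1.0, 16)
grid = np.stack(np.meshgrid(u, u), axis=-1).reshape(-1, 2)   # (256, 2)
Wd = rng.normal(size=(16 + 2, 3)) * 0.1

z = encode(partial, W1, W2)
complete = decode(z, grid, Wd)
assert z.shape == (16,) and complete.shape == (256, 3)
# Max-pooling makes the code independent of point ordering.
assert np.allclose(z, encode(partial[rng.permutation(200)], W1, W2))
```

Training such a pair on large synthetic datasets shapes the latent space; the cross-domain schemes mentioned above then adapt it so real scans map into the same space without losing their observed details.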
Keywords
