Point Cloud Understanding and Reconstruction for 3D Scenes (Annual Report)
Gong Jingyu, Lou Yujing, Liu Fengqi, Zhang Zhiwei, Chen Haoming, Zhang Zhizhong, Tan Xin, Xie Yuan, Ma Lizhuang (Shanghai Jiao Tong University; East China Normal University)
Abstract
3D scene understanding and reconstruction technology enables computers to reproduce real scenes with high fidelity and guides machines to comprehend the entire real world through three-dimensional spatial reasoning, so that machines become intelligent enough to participate in real-world production and construction and can serve human decision-making and daily life through scene simulation. This technology mainly comprises scene point cloud feature extraction, scan registration and fusion, scene understanding and semantic segmentation, and scanned-object point cloud completion and fine-grained reconstruction. When processing real scanned scenes, factors such as the scanning device, viewing angle, distance, and scene complexity place high demands on the accuracy and stability of these techniques, making them highly challenging. Among them, feature extraction from raw scans together with registration and fusion aims to match features across multiple scanned regions of the same scene and fuse them into a complete scene point cloud, forming the foundation of understanding and reconstruction. Scene point cloud understanding and semantic segmentation aim to perceive the scene model as a whole and partition it, according to semantic features, into point clouds of functional objects or even parts, constituting the core of the pipeline. Subsequent fine-grained object completion studies the recovery of scanned-object structure and the completion of missing regions, and is the key technique for fine-grained reconstruction of scene objects. Centering on the above techniques, this report analyzes in detail the application areas and research directions of point-cloud-based scene understanding and reconstruction, summarizes frontier progress and research results at home and abroad, and offers an outlook on future research directions and technical development.
Keywords
Scene point cloud understanding and reconstruction technologies in 3D space
Gong Jingyu, Lou Yujing, Liu Fengqi, Zhang Zhiwei, Chen Haoming, Zhang Zhizhong, Tan Xin, Xie Yuan, Ma Lizhuang (Shanghai Jiao Tong University; East China Normal University)
Abstract
Understanding and reconstructing 3D models of real scenes is essential for machine vision and intelligence. It aims to reconstruct complete models of real scenes from multiple scans and to understand the semantic meaning of each functional component in the scene. This technique is indispensable for real-world digitalization and simulation, and can be widely applied in robotics, navigation systems, virtual tourism, etc. Its main challenge comprises three intertwined questions: (1) how to recognize the same area across multiple real scans and fuse all the scans into an integrated scene point cloud; (2) how to make sense of the whole scene and recognize the semantics of its different functional parts; (3) how to complete the regions missing from the original point cloud due to occlusion during scanning. To fuse multiple real scene scans into an integrated point cloud, it is important to extract point cloud features that are invariant to scanning position and rotation. Thus, intrinsic geometric quantities that are invariant to rotation, such as point-pair distances and the singular values of the neighborhood covariance matrix, are usually considered during feature design. A contrastive learning scheme is usually adopted so that features learned from the same area are pulled close to each other while features extracted from different areas are pushed apart. Data augmentation of the scanned point clouds can also be applied during feature learning to improve the generalization ability of the learned features. Based on these learned features, the pose of the scanning device can be estimated to compute the transformation matrix between point cloud pairs. Once the transformation is found, the raw scans can be fused into a single point cloud.
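The rotation-invariant descriptor mentioned above can be illustrated with a minimal sketch: the singular values of a point's neighborhood covariance matrix do not change when the cloud is rotated, since an orthogonal transform R maps the covariance C to R C Rᵀ. The function below is a simplified illustration, not any specific method from the report; the function name and the choice of k are our own.

```python
import numpy as np

def neighborhood_covariance_feature(points, center, k=16):
    """Rotation-invariant local descriptor: sorted singular values of the
    covariance matrix of the k nearest neighbors of `center`.

    Rotating the whole cloud maps the covariance C to R C R^T, which
    leaves the singular values unchanged, so the feature is invariant
    to scanning pose.
    """
    # select the k nearest neighbors of the query point
    dists = np.linalg.norm(points - center, axis=1)
    neighbors = points[np.argsort(dists)[:k]]
    # covariance of the centered neighborhood
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / k
    # singular values, returned in descending order
    return np.sort(np.linalg.svd(cov, compute_uv=False))[::-1]
```

In a full pipeline, such hand-crafted invariants are typically fed into a learned feature extractor before contrastive training and pose estimation.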
To understand the whole scene from the raw point cloud and further segment it into functional parts according to semantics under different situations, it is necessary to design effective and efficient networks, built on appropriate 3D convolution operations, that parse the entire scene hierarchically from points, and to introduce learning schemes adapted to each situation. The core of pattern recognition for 3D scene point clouds is the definition and formulation of the basic convolution operation in 3D. This depends heavily on how the convolution kernel is approximated in continuous 3D space and how feature aggregation and extraction are executed with appropriate point grouping and down-/up-sampling. The discrete approximation of the continuous 3D convolution should be able to recognize diverse geometric patterns while keeping as few parameters as possible. Network design based on these elementary 3D convolution operations is also fundamental to strong scene parsing. Meanwhile, point-level semantic segmentation of scanned scenes can be aided by closely related tasks such as boundary detection, instance segmentation, and scene coloring, in which the network parameters receive additional auxiliary supervision. In more extreme situations where real data are limited, semi-supervised and weakly supervised methods are required to overcome the lack of annotation. The segmentation results and semantic hints can further support fine-grained completion of object point clouds extracted from the scanned scene: segmented objects can be handled separately, and semantics provide structural and geometric priors when completing the regions missing due to occlusion. For learning object point cloud completion, it is crucial to learn a compact latent code space that represents all complete shapes and to design a versatile decoder that reconstructs both the structure and the fine-grained geometric details of the object point cloud.
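The grouping-and-aggregation pattern behind point-based convolutions can be sketched as follows: for each query center, gather neighbors within a radius, lift their relative coordinates and features with a shared linear layer, and aggregate with a symmetric max pooling. This is a toy illustration in the spirit of PointNet-style operators, with hypothetical names and shapes of our own choosing, not a specific operator from the report.

```python
import numpy as np

def point_conv(points, feats, centers, radius, weight, bias):
    """Minimal point-based 'convolution' sketch: ball-query grouping,
    a shared linear layer with ReLU as a one-layer MLP, and channel-wise
    max pooling as the symmetric (order-invariant) aggregation."""
    out = []
    for c in centers:
        # ball query: neighbors within `radius` of the center
        mask = np.linalg.norm(points - c, axis=1) < radius
        if not mask.any():
            out.append(np.zeros(weight.shape[1]))
            continue
        rel = points[mask] - c                            # relative coordinates
        group = np.concatenate([rel, feats[mask]], axis=1)
        lifted = np.maximum(group @ weight + bias, 0.0)   # shared MLP + ReLU
        out.append(lifted.max(axis=0))                    # max aggregation
    return np.stack(out)
```

Stacking such layers with progressively subsampled centers yields the hierarchical down-sampling path of a scene parsing network; up-sampling layers then interpolate features back to every point for per-point labels.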
The learned latent space should cover as many complete shapes as possible, which requires large-scale synthetic model datasets for training to ensure generalization. The encoder should recognize the structure of the original point cloud and capture its specific geometric patterns so that this information is preserved in the latent code, while the decoder must recover the overall skeleton of the scanned object and complete all details according to the existing local geometric hints. For completing real scanned objects, it is also indispensable to unify the latent code spaces of synthetic models and real scans. This requires a cross-domain learning scheme that transfers completion knowledge to real object scans while preserving the details of the real scan in the completed version. We summarize the prevailing technologies and the current state of scene understanding and reconstruction, including point cloud fusion, 3D convolution operations, whole-scene segmentation, and fine-grained object completion; we analyze frontier technologies and predict promising research trends. Future research should pay more attention to open spaces, which pose further challenges for computing efficiency and out-of-domain generalization, and to more complex situations involving human-scene interaction. 3D scene understanding and reconstruction will help machines understand the real world in a more natural way, enabling applications such as robotics and navigation to better serve human beings. It also has the potential to support plausible simulation of the real world based on the reconstruction and parsing of real scenes, making it a useful tool for decision-making.
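The encoder-decoder structure described above can be reduced to a toy skeleton: an order-invariant encoder maps a partial point cloud to a global latent code, and a decoder maps that code to a fixed-size complete point set. All weights below are random placeholders and the class name is hypothetical; a real completion model would be trained on large-scale synthetic shapes and aligned with real scans via cross-domain learning.

```python
import numpy as np

class PointCompletionSketch:
    """Toy encoder-decoder for point cloud completion.

    Encoder: shared per-point linear lift + ReLU, then max pooling,
    giving a latent code that is invariant to point ordering.
    Decoder: a linear map from the latent code to a fixed set of
    output points representing the completed shape.
    """
    def __init__(self, latent_dim=32, n_out=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w_enc = rng.normal(scale=0.1, size=(3, latent_dim))
        self.w_dec = rng.normal(scale=0.1, size=(latent_dim, n_out * 3))
        self.n_out = n_out

    def encode(self, partial):
        # per-point lift, then order-invariant max pooling over points
        return np.maximum(partial @ self.w_enc, 0.0).max(axis=0)

    def decode(self, z):
        # map the global latent code to a complete point set
        return (z @ self.w_dec).reshape(self.n_out, 3)

    def complete(self, partial):
        return self.decode(self.encode(partial))
```

In practice the latent code is regularized so that synthetic complete shapes and real partial scans share one code space, and the decoder is trained with reconstruction losses such as Chamfer distance.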
Keywords