Scene point cloud understanding and reconstruction technologies in 3D space
Gong Jingyu, Lou Yujing, Liu Fengqi, Zhang Zhiwei, Chen Haoming, Zhang Zhizhong, Tan Xin, Xie Yuan, Ma Lizhuang (Shanghai Jiao Tong University)
Understanding and reconstructing 3D models of real scenes is essential for machine vision and intelligence. The goal is to reconstruct complete models of real scenes from multiple scans and to understand the semantic meaning of each functional component in the scene. This technology is indispensable for digitalizing and simulating the real world, with wide applications in robotics, navigation systems, virtual tourism, and beyond. Its main challenge comprises three intertwined questions: (1) how to recognize the same area in multiple real scans and fuse all the scans into an integrated scene point cloud; (2) how to make sense of the whole scene and recognize the semantics of its different functional parts; and (3) how to complete the regions missing from the original point cloud because of occlusion during scanning. To fuse multiple real scene scans into an integrated point cloud, it is important to extract point cloud features that are invariant to scanning position and rotation. Intrinsic geometric quantities such as point-to-point distances and the singular values of the neighborhood covariance matrix, which are invariant to rotation, are therefore usually considered during feature design. A contrastive learning scheme is typically adopted so that features learned from the same area are pulled close together while features extracted from different areas are pushed apart. Data augmentation of the scanned point clouds can also be applied during feature learning to improve the generalization ability of the learned features. Based on these learned features, the pose of the scanning device can then be estimated to compute the transformation matrix between point cloud pairs. Once the transformation relationship is found, point cloud fusion can be performed on the raw scans.
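As an illustration of the final step of this fusion pipeline, once correspondences between two scans have been matched via the learned features, the rigid transformation between them can be recovered in closed form with the classical Kabsch/Procrustes algorithm. The sketch below is a minimal NumPy version under the assumption of noise-free, one-to-one correspondences; the function name is ours, not from any particular library.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Kabsch/Procrustes: least-squares rotation R and translation t
    such that R @ src[i] + t ~= dst[i] for matched (N, 3) arrays."""
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    # Cross-covariance of the centered correspondences
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) solution
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

In practice this closed-form step is wrapped in a robust estimator such as RANSAC, since learned correspondences inevitably contain outliers.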
To understand the whole scene from the raw point cloud and to segment it into functional parts according to semantics under different situations, it is necessary to design effective and efficient networks with appropriate 3D convolution operations that parse the entire scene hierarchically from points, and to introduce specific learning schemes that adapt to various situations. The core of pattern recognition for 3D scene point clouds is the definition and formulation of the basic convolution operation in 3D. This depends heavily on how the convolution kernel is approximated in continuous 3D space, and on how feature aggregation and extraction are executed with appropriate point cloud grouping and down-/up-sampling. A discrete approximation of the continuous 3D convolution should be able to recognize diverse geometric patterns while keeping as few parameters as possible. Network design built on these elementary 3D convolution operations is likewise fundamental to strong scene parsing. Meanwhile, point-level semantic segmentation of scanned scenes can be aided by closely related tasks such as boundary detection, instance segmentation, and scene coloring, in which the network parameters are supervised by additional auxiliary regularization. In more extreme situations where real data are limited, semi-supervised and weakly supervised methods are required to overcome the lack of annotation. The segmentation results and semantic hints can further support fine-grained completion of object point clouds extracted from scanned scenes: segmented objects can be handled separately, and semantics provide structural and geometric priors when completing the regions missing due to occlusion. For learning object point cloud completion, it is crucial to learn a compact latent code space that represents all complete shapes and to design a versatile decoder that reconstructs both the structure and the fine-grained geometric details of the object point cloud.
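The grouping and down-sampling mentioned above are commonly realized, as in PointNet++-style set-abstraction layers, by farthest point sampling (to pick well-spread centers) followed by a ball query (to gather each center's neighborhood for local feature aggregation). The following is a minimal, brute-force NumPy sketch of these two primitives; function names and the O(n·k) / O(n·m) loops are illustrative, not an optimized implementation.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily pick k indices so that sampled points are well spread:
    each step takes the point farthest from all points chosen so far."""
    chosen = [0]  # start from an arbitrary seed point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))
        chosen.append(idx)
        # Keep, for every point, its distance to the nearest chosen point
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def ball_query(points, centers, radius):
    """Group the indices of all points within `radius` of each center,
    forming the local neighborhoods a point convolution aggregates over."""
    groups = []
    for c in centers:
        d = np.linalg.norm(points - c, axis=1)
        groups.append(np.where(d <= radius)[0])
    return groups
```

A hierarchical network then alternates such sampling/grouping with shared per-point feature transforms, coarsening the scene level by level before up-sampling back to per-point semantic labels.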
The learned latent code space should cover as many complete shapes as possible, which requires large-scale synthetic model datasets for training to ensure generalization ability. The encoder should be designed to recognize the structure of the original point cloud and capture its specific geometric patterns so that this information is preserved in the latent code, while the decoder needs to recover the overall skeleton of the scanned object and complete all details according to the existing local geometric hints. For completing real scanned objects, it is further indispensable to unify the latent code spaces of synthetic models and real scanned point clouds. This requires a cross-domain learning scheme that transfers completion knowledge to real object scans while preserving the details of the real scanned object in the completed result. We summarize the prevailing technologies and the current state of scene understanding and reconstruction, including point cloud fusion, 3D convolution operations, whole-scene segmentation, and fine-grained object completion. We analyze the frontier technologies and predict promising future research trends. Future research should pay more attention to more open environments, which pose further challenges in computational efficiency and in handling out-of-domain knowledge, and to more complex situations involving human-scene interaction. 3D scene understanding and reconstruction technology will help machines understand the real world in a more natural way, which in turn can make applications such as robots and navigation systems better serve human beings. It also makes plausible simulation of the real world possible, based on the reconstruction and parsing of real scenes, making it a useful tool for various decision-making tasks.
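When such an encoder-decoder completion network is trained, the decoder's output point set must be compared against the ground-truth complete shape; a common choice of reconstruction loss for unordered point sets is the symmetric Chamfer distance. Below is a minimal brute-force NumPy sketch (O(N·M) pairwise distances); we present it only as one widely used option, since the abstract above does not commit to a specific loss.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance from a to b plus from b to a."""
    # Full pairwise distance matrix between the two sets
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because the loss only matches nearest neighbors, it is permutation-invariant, which is exactly what an unordered point cloud decoder requires; production systems typically replace the dense distance matrix with a KD-tree or GPU kernel.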