Geometric attribute-guided 3D semantic instance reconstruction

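Wan Junhui1, Liu Xinpu2, Chen Lili3, Ao Sheng1, Zhang Peng1, Guo Yulan2 (1. School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen 518107; 2. College of Electronic Science and Technology, National University of Defense Technology, Changsha 410005; 3. Artificial Intelligence Research Center, National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100071)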

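Abstract
Objective Semantic instance reconstruction is an important problem for robots that must understand the real world. Despite considerable recent progress, reconstruction performance remains sensitive to occlusion and noise. In particular, existing methods neglect both the prior geometric attributes of objects and their key detail information, so the reconstructed mesh models are coarse and of low accuracy. To address this problem, a geometric attribute-guided semantic instance reconstruction algorithm is proposed. Method First, an object detector predicts the bounding box parameters, and box sampling is applied to each target instance to obtain its incomplete local point cloud from the scene. Then, the feature embedding layer and Transformer layers of the encoder extract rich and critical detailed geometric information to produce the corresponding local features, while the prior semantic information of the object helps the algorithm approach the target shape more quickly. Finally, a feature transformer is designed to align the global features of the object, which are fused with the local features and fed to the shape generation module for mesh reconstruction. Result On the real-world ScanNet v2 dataset, the proposed algorithm was comprehensively compared with the latest existing methods, and the experimental results demonstrate its effectiveness. Compared with the second-ranked RfD-Net, the instance reconstruction metric improved by 8%. Detailed ablation experiments were also conducted to verify the effectiveness of each module. Conclusion The proposed geometric attribute-guided semantic instance reconstruction algorithm makes better use of the geometric attribute information of objects, yielding finer and more accurate reconstructions.
Keywords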
Geometric attribute-guided 3D semantic instance reconstruction

Wan Junhui1, Liu Xinpu2, Chen Lili3, Ao Sheng1, Zhang Peng1, Guo Yulan2 (1. School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen 518107, China; 2. College of Electronic Science and Technology, National University of Defense Technology, Changsha 410005, China; 3. Artificial Intelligence Research Center, National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100071, China)

Abstract
Objective The objective of 3D vision is to capture the geometric and optical features of the real world from multiple perspectives and convert this information into digital form, enabling computers to understand and process it. 3D vision is an important aspect of computer graphics. Nonetheless, sensors can provide only partial observations of the world because of viewpoint occlusion, sparse sensing, and measurement noise, resulting in an incomplete representation of a scene. Semantic instance reconstruction has been proposed to solve this problem. It converts 2D/3D data obtained from multiple sensors into a semantic representation of the scene, including a model of each object instance in the scene. Machine learning and computer vision techniques have been applied to achieve high-precision reconstruction, and point cloud-based methods have demonstrated prominent advantages. However, existing methods disregard the prior geometric and semantic information of objects, and their simple max-pooling operation loses key structural information of objects, resulting in poor instance reconstruction performance. Method In this study, a geometric attribute-guided semantic instance reconstruction network (GANet), which consists of a 3D object detector, a spatial Transformer, and a mesh generator, is proposed. The spatial Transformer is designed to exploit the geometric and semantic information of instances. After the 3D bounding boxes of the instances in the scene are obtained, box sampling, guided by the scale information of each instance, extracts the actual local point cloud of each target instance from the scene, and semantic information is then embedded for foreground point segmentation. Compared with ball sampling, box sampling reduces noise and retains more useful information. Then, the encoder’s feature embedding and Transformer layers extract rich and crucial detailed geometric information of objects from coarse to fine to obtain the corresponding local features.
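The contrast between box sampling and ball sampling described above can be illustrated with a minimal NumPy sketch. It is not GANet's implementation: the function names `box_sample` and `ball_sample`, and the box parameterization (center, full extents, yaw about the z axis), are assumptions made for illustration only.

```python
import numpy as np


def box_sample(points, center, size, yaw=0.0):
    """Return the scene points that fall inside an oriented 3D box.

    Hypothetical parameterization: `center` is the (3,) box center, `size`
    holds the full extents along the box's own x, y, z axes, and `yaw` is
    the rotation of the box about the z axis in radians.
    """
    points = np.asarray(points, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation matrix mapping box-frame coordinates to scene coordinates.
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Row-vector form of local = R^T (p - center).
    local = (points - np.asarray(center, dtype=float)) @ rot
    half = np.asarray(size, dtype=float) / 2.0
    inside = np.all(np.abs(local) <= half, axis=1)
    return points[inside]


def ball_sample(points, center, radius):
    """Return the scene points within `radius` of `center`."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
    return points[dist <= radius]


# Toy scene: an elongated object (2.0 x 0.5 x 1.0) centered at the origin.
scene = np.array([
    [0.0, 0.0, 0.0],   # on the object
    [0.9, 0.0, 0.0],   # on the object, near its long end
    [0.0, 0.9, 0.0],   # clutter beside the object
    [3.0, 0.0, 0.0],   # far away
])
size = np.array([2.0, 0.5, 1.0])
radius = np.linalg.norm(size) / 2.0  # smallest ball that encloses the box

in_box = box_sample(scene, center=np.zeros(3), size=size)
in_ball = ball_sample(scene, center=np.zeros(3), radius=radius)
print(len(in_box), len(in_ball))
```

In this toy scene the ball must be large enough to cover the long axis of the object, so it also admits the clutter point beside it, whereas the box follows the object's extents and excludes that point.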
The feature embedding layer also uses the prior semantic information of objects to help the algorithm approximate the target shape quickly. The attention module in the Transformer integrates the correlation information between points. The algorithm also uses the object’s global features provided by the detector. Because the scene space and the canonical space are inconsistent, a purpose-designed feature-space transformer is used to align the object’s global features. Finally, the fused features are sent to the mesh generator for mesh reconstruction. The loss function of GANet consists of two parts: the detection loss and the shape loss. The detection loss is the weighted sum of the instance confidence, semantic classification, and bounding box estimation losses. The shape loss consists of three parts: the Kullback-Leibler divergence between the predicted distribution and the standard normal distribution, the foreground point segmentation loss, and the occupancy point estimation loss. The occupancy point estimation loss is the cross-entropy between the predicted and ground-truth occupancy values of the spatial candidate points. Result GANet was compared with the latest methods on the ScanNet v2 dataset. Computer-aided design (CAD) models provided by Scan2CAD, covering 8 categories, were used as ground truth for training. The mean average precision of semantic instance reconstruction increased by 8% compared with the second-ranked method, i.e., RfD-Net. The average precision for bathtub, trash bin, sofa, chair, and cabinet is higher than that of RfD-Net. The visualization results show that GANet reconstructs object models that are more consistent with the scene. Ablation experiments were also conducted on the same dataset.
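The loss structure described in the Method part above can be written compactly as follows. This is only a sketch under stated assumptions: the abstract does not give the weights, so the unweighted sums in the first and third lines, the weights \lambda, the binary cross-entropy form averaged over M candidate points, and all symbol names are assumptions introduced here for illustration. Here q denotes the predicted distribution, o_i the ground-truth occupancy of candidate point i, and \hat{o}_i its predicted occupancy.

```latex
\begin{align}
\mathcal{L} &= \mathcal{L}_{\mathrm{det}} + \mathcal{L}_{\mathrm{shape}}, \\
\mathcal{L}_{\mathrm{det}} &= \lambda_{\mathrm{conf}}\,\mathcal{L}_{\mathrm{conf}}
  + \lambda_{\mathrm{sem}}\,\mathcal{L}_{\mathrm{sem}}
  + \lambda_{\mathrm{box}}\,\mathcal{L}_{\mathrm{box}}, \\
\mathcal{L}_{\mathrm{shape}} &= D_{\mathrm{KL}}\!\left(q \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right)
  + \mathcal{L}_{\mathrm{seg}} + \mathcal{L}_{\mathrm{occ}}, \\
\mathcal{L}_{\mathrm{occ}} &= -\frac{1}{M}\sum_{i=1}^{M}
  \left[\, o_i \log \hat{o}_i + (1 - o_i)\log\!\left(1 - \hat{o}_i\right) \right].
\end{align}
```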
The complete network outperformed four ablated variants of GANet, which respectively replaced box sampling with ball sampling, replaced the Transformer with PointNet, removed the semantic embedding of point cloud features, and removed the feature transformation. The experimental results indicate that box sampling obtains more effective local point cloud information, the Transformer-based point cloud encoder enables the network to use more critical local structural information of the foreground point cloud during reconstruction, and semantic embedding provides prior information for instance reconstruction. Feature space transformation aligns the global prior information of an object, further improving the reconstruction quality. Conclusion In this study, we proposed a geometric attribute-guided network. This network accounts for the complexity of scene objects and makes better use of the geometric and attribute information of objects. The experimental results show that our network outperforms several state-of-the-art approaches. Current 3D-based semantic instance reconstruction algorithms have achieved good results, but acquiring and annotating 3D data remains relatively expensive. Future research can focus on how to use 2D data to assist semantic instance reconstruction.
Keywords
