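Wan Junhui1, Liu Xinpu2, Chen Lili3, Ao Sheng1, Zhang Peng1, Guo Yulan2 (1. Sun Yat-sen University; 2. National University of Defense Technology; 3. National Innovation Institute of Defense Technology, Academy of Military Sciences)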
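Objective Semantic instance reconstruction is an important problem for robots that need to understand the real world. Despite considerable recent progress, reconstruction performance remains vulnerable to occlusion and noise. In particular, existing methods ignore both the prior geometric attributes of objects and their key detail information, so the reconstructed meshes are coarse and inaccurate. Method To address this problem, this paper proposes a geometric attribute guided semantic instance reconstruction algorithm. The algorithm first obtains bounding box parameters from an object detector and applies box sampling to each target instance, which yields the corresponding incomplete local point cloud in the scene. The feature embedding layer and Transformer layer on the encoder side then extract rich and critical detailed geometric information about the object to produce local features, while the object's prior semantic information helps the algorithm approach the target shape more quickly. Finally, a feature transformer is designed to align the object's global features, which are fused with the local features and fed to the shape generation module for mesh reconstruction. Result On the real-world dataset ScanNet V2, the proposed algorithm is compared comprehensively with the latest existing methods, and the experimental results demonstrate its effectiveness. Compared with the second-best method, RfD-Net, the instance reconstruction metric improves by 8%. In addition, detailed comparative experiments are carried out to verify the effectiveness of each module in the algorithm. Conclusion The proposed geometric attribute guided semantic instance reconstruction algorithm makes better use of the geometric and attribute information of objects, making the reconstruction results finer and more accurate.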
Geometric attribute guided 3D semantic instance reconstruction
(1. National University of Defense Technology; 2. Sun Yat-sen University)
Objective 3D vision aims to capture the geometric and optical features of the real world from multiple perspectives and convert this information into digital form so that computers can understand and process it. It is an important aspect of computer graphics. However, because of viewpoint occlusion, sparse sensing, and measurement noise, sensors provide only partial observations of the world, which yields an incomplete representation of the scene. Semantic instance reconstruction has been proposed to solve this problem. It converts 2D/3D data obtained from multiple sensors into a semantic representation of the scene, which includes a model of each object instance. Machine learning and computer vision techniques are applied to achieve high-precision reconstruction, and point cloud-based methods have shown prominent advantages. However, existing methods ignore the prior geometric and semantic information of objects, and the simple max-pooling operation they subsequently apply discards key structural information, resulting in poor instance reconstruction performance. Method In this study, a geometric attribute guided semantic instance reconstruction network (GANet) is proposed, which consists of a 3D object detector, a spatial transformer, and a mesh generator. The spatial transformer is the main design contribution and exploits the geometric and semantic information of instances. Once the 3D bounding boxes of the instances in the scene are obtained, box sampling uses the scale of each box to extract the actual local point cloud of the corresponding target instance, and semantic information is then embedded for foreground point segmentation. Compared with sphere sampling, box sampling reduces noise and retains more useful information. Then, from coarse to fine, the encoder's feature embedding layer and Transformer layer extract rich and crucial detailed geometric information about objects to obtain the corresponding local features.
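As a minimal illustration of the box sampling step described above, the sketch below keeps only the scene points that lie inside a detected oriented box and contrasts this with sphere sampling. The box parameterization (center, size, heading about the z axis), the function names, and the `margin` padding are assumptions made for illustration; the abstract does not specify them, and this is not GANet's actual implementation.

```python
import numpy as np


def box_sample(points, center, size, heading, margin=0.0):
    """Keep the scene points that fall inside an oriented 3D box.

    points  : (N, 3) scene point cloud
    center  : (3,) box center (assumed parameterization)
    size    : (3,) full box extents along the box's own x, y, z axes
    heading : rotation of the box about the z axis, in radians
    margin  : hypothetical padding added to every half-extent
    """
    points = np.asarray(points, dtype=float)
    c, s = np.cos(heading), np.sin(heading)
    # Rows are the box axes expressed in world coordinates, so multiplying
    # by this matrix maps world offsets into the box's local frame.
    world_to_box = np.array([[c, s, 0.0],
                             [-s, c, 0.0],
                             [0.0, 0.0, 1.0]])
    local = (points - np.asarray(center, dtype=float)) @ world_to_box.T
    half = np.asarray(size, dtype=float) / 2.0 + margin
    mask = np.all(np.abs(local) <= half, axis=1)
    return points[mask], mask


def sphere_sample(points, center, radius):
    """Baseline: keep the points within `radius` of the box center."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
    mask = dist <= radius
    return points[mask], mask
```

For an elongated object, a sphere large enough to cover the box also captures neighboring points beside it, whereas the box mask follows the instance scale. This is one way to read the claim that box sampling reduces noise.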
The feature embedding layer also uses the prior semantic information of objects to help the algorithm approximate the target shape quickly. The attention module in the Transformer integrates the correlation information between points. The algorithm also uses the object's global features provided by the detector. Because the scene space and the canonical space are inconsistent, a dedicated feature space transformer aligns the object's global features. Finally, the fused features are sent to the mesh generator for mesh reconstruction. The loss function of GANet consists mainly of two parts: detection loss and shape loss. The detection loss is the weighted sum of the instance confidence loss, semantic classification loss, and bounding box estimation loss. The shape loss consists of three parts: the KL divergence between the predicted distribution and the standard normal distribution, the foreground point segmentation loss, and the occupancy point estimation loss. The occupancy point estimation loss is the cross-entropy between the predicted occupancy values of spatial candidate points and the ground-truth occupancy values. Result GANet was compared with the latest methods on the ScanNet V2 dataset. CAD models provided by Scan2CAD, covering 8 categories, served as ground truth for training. Compared with the second-ranked method, RfD-Net, the mean average precision of reconstruction increased by 8%. The average precisions for the bathtub, trash bin, sofa, chair, and cabinet categories exceed those of RfD-Net. According to the visualization results, GANet reconstructs object models that are more consistent with the scene. Ablation experiments were also conducted on the same dataset.
The complete network outperformed the four ablated variants: one that replaced box sampling with sphere sampling, one that replaced the Transformer with PointNet, one that removed the semantic embedding of point cloud features, and one that removed the feature transformation. The experimental results indicate that box sampling obtains more effective local point cloud information, that the Transformer-based point cloud encoder lets the model use more critical local structural information of the foreground point cloud during reconstruction, and that semantic embedding provides prior information for instance reconstruction. The feature space transformation aligns the global prior information of the object, further improving the reconstruction quality. Conclusion In this study, we proposed a geometric attribute guided model. Our model takes into account the complexity of scene objects and makes better use of the geometric and attribute information of objects. The experimental results show that our model outperforms several state-of-the-art approaches. Current 3D-based semantic instance reconstruction algorithms have achieved good results, but acquiring and annotating 3D data is still relatively expensive. Future research could focus on how to use 2D data to assist semantic instance reconstruction.
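The three-part shape loss described in the Method part can be sketched as follows. This is only an illustration under stated assumptions: the predicted distribution is taken to be a diagonal Gaussian parameterized by `mu` and `log_var`, the foreground segmentation loss is taken to be a binary cross-entropy, and the weights `w_kl`, `w_seg`, and `w_occ` are hypothetical. The abstract specifies none of these details.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims.

    The diagonal Gaussian parameterization is an assumption.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))


def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    loss = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    return float(np.mean(loss))


def shape_loss(mu, log_var, seg_prob, seg_label, occ_prob, occ_label,
               w_kl=1.0, w_seg=1.0, w_occ=1.0):
    """Weighted sum of the three shape-loss terms (weights are hypothetical)."""
    kl = kl_to_standard_normal(mu, log_var)
    # Foreground point segmentation term (binary cross-entropy assumed).
    seg = binary_cross_entropy(seg_prob, seg_label)
    # Occupancy term: cross-entropy between predicted and true occupancy
    # of the spatial candidate points.
    occ = binary_cross_entropy(occ_prob, occ_label)
    return w_kl * kl + w_seg * seg + w_occ * occ
```

With `mu` and `log_var` both zero, the KL term vanishes because the predicted distribution already equals the standard normal, so only the segmentation and occupancy terms contribute.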