
Sun Guodong, Jia Junjie, Li Mingjing, Zhang Yang (Hubei University of Technology)

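Abstract
Objective Grasp pose detection for objects in cluttered scenes is a basic skill of intelligent robots. Although six-degree-of-freedom grasp learning has progressed recently, previous methods ignore differences in object size during sampling and learning, which leads to poor grasping performance on small objects. Method This paper proposes an object mask-assisted sampling method that samples the same number of points on every object to balance the grasp distribution, which resolves the uneven distribution of sampling points. In addition, a multi-scale learning strategy is adopted during learning: multi-scale cylindrical grouping is applied to the partial point clouds of objects to strengthen the local geometric representation, which addresses the difficulty of learning grasp operation parameters caused by differences in object scale. An end-to-end grasping network is designed with the proposed sampling and learning methods embedded, and it effectively improves object grasp detection performance. Result Evaluated on the large-scale benchmark dataset GraspNet-1Billion, the proposed method achieves the best performance, with the grasping metrics on small objects improved by 7% on average; extensive real-robot experiments also show that the method generalizes well to unknown objects. Conclusion Focusing on grasping small objects, this paper proposes a mask-assisted sampling method embedded in the proposed end-to-end learning network and introduces a multi-scale grouping learning strategy to improve the local geometric representation of objects. The approach effectively improves grasp quality on small objects, and its grasp evaluation results on all objects surpass those of previous methods.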
Research on small object grasping detection in cluttered scenes

Sun Guodong, Jia Junjie, Li Mingjing, Zhang Yang (Hubei University of Technology)

Objective Object grasp pose detection in cluttered scenes is an essential skill for intelligent robots. Despite recent advances in six-degree-of-freedom (6-DoF) grasp learning, learning grasp configurations for small objects remains very challenging. First, because raw point clouds are very large, the points in a scene must be downsampled to reduce the computational complexity of the network and improve detection efficiency; previous sampling methods, however, place fewer points on small objects, which makes their grasp poses difficult to learn. Second, the consumer-grade depth cameras currently on the market are very noisy, and the quality of the point clouds captured on small objects in particular cannot be guaranteed. As a result, the network may predict unclear objectness for points on small objects and mistake some feasible grasp points for background points, which further reduces the number of sampling points on small objects and leads to weak grasping performance on them. Method A potential problem in previous grasp detection methods is that they do not account for the biased distribution of sampling points caused by differences in object scale within a scene, which leaves fewer sampling points on small objects. In this paper, we propose an object mask-assisted sampling method that samples the same number of points on every object to balance the grasp distribution, which addresses the uneven distribution of sampling points. At inference time, no prior point-level masks of the scene are available, so we introduce an unseen object instance segmentation network to distinguish the objects in the scene and thereby enable mask-assisted sampling.
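The idea of drawing the same number of points from every object mask can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_assisted_sample`, the label layout (0 for background), uniform random selection, and sampling with replacement for objects that have too few points are all assumptions made here for illustration.

```python
import random
from collections import defaultdict


def mask_assisted_sample(labels, points_per_object, background_label=0, seed=None):
    """Return point indices, with the same number drawn from every object.

    labels: per-point instance ids (hypothetical layout; background_label
    marks non-object points). Objects with fewer points than requested are
    sampled with replacement (an assumption) so each contributes equally.
    """
    rng = random.Random(seed)

    # Group point indices by the object instance they belong to.
    by_object = defaultdict(list)
    for index, label in enumerate(labels):
        if label != background_label:
            by_object[label].append(index)

    sampled = []
    for label in sorted(by_object):
        indices = by_object[label]
        if len(indices) >= points_per_object:
            # Enough points: draw without replacement.
            sampled.extend(rng.sample(indices, points_per_object))
        else:
            # Too few points (typical for small objects): draw with replacement.
            sampled.extend(rng.choices(indices, k=points_per_object))
    return sampled


# Toy scene: object 1 is large (8 points), object 2 is small (2 points).
labels = [0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0]
picked = mask_assisted_sample(labels, points_per_object=4, seed=0)
counts = {obj: sum(1 for i in picked if labels[i] == obj) for obj in (1, 2)}
print(counts)  # {1: 4, 2: 4}
```

In this toy scene the small object contributes as many sampled points as the large one, whereas uniform sampling over the whole scene would favor the large object roughly four to one.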
In addition, we adopt a multi-scale learning strategy: multi-scale cylindrical grouping is applied to the partial point clouds of objects to improve the local geometric representation, which eases the difficulty of learning grasp operation parameters caused by differences in object scale. Specifically, we set up three cylinders with different radii to sample the point cloud near each graspable point, corresponding to the features of large, medium-sized, and small objects. We then concatenate the features of the three scales and process the concatenated features with a self-attention layer to strengthen attention on the local region and improve the local geometric representation of the object. Similar to GraspNet, we design an end-to-end grasping network that consists of three main parts: graspable point prediction, approach direction prediction, and gripper operation prediction. Graspable points are the high-scoring points in the scene that are suitable for grasping, and they provide an initial filtering of the large amount of point cloud data in the scene. The proposed sampling and learning methods are then applied to further predict the approach direction and gripper operation of the grasp poses on each object. With the proposed sampling and learning methods embedded, the end-to-end grasping network effectively improves object grasp detection. Result The proposed method achieves state-of-the-art performance on the large-scale benchmark dataset GraspNet-1Billion, where the grasping metrics on small objects improve by 7% on average, and extensive real-robot experiments show that the approach also generalizes well to unseen objects.
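The multi-scale cylindrical grouping described in the Method part above can be sketched geometrically as follows. This is only an illustration under stated assumptions: the function names, the three radii, the cylinder height, and the choice to start each cylinder at the graspable point and extend it along the approach direction are hypothetical, and the feature extraction and self-attention steps are not shown.

```python
import math


def cylinder_group(points, center, axis, radius, height):
    """Indices of points inside a cylinder that starts at `center`
    and extends `height` along `axis` (hypothetical convention)."""
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    inside = []
    for index, point in enumerate(points):
        offset = [p - c for p, c in zip(point, center)]
        # Axial coordinate: projection of the offset onto the cylinder axis.
        along = sum(o * u for o, u in zip(offset, unit))
        # Squared radial distance from the axis (Pythagoras).
        radial_sq = sum(o * o for o in offset) - along * along
        if 0.0 <= along <= height and radial_sq <= radius * radius:
            inside.append(index)
    return inside


def multi_scale_groups(points, center, axis, radii, height):
    """Group the same neighborhood with one cylinder per radius."""
    return {r: cylinder_group(points, center, axis, r, height) for r in radii}


# Toy point cloud in metres; all values are illustrative assumptions.
points = [
    (0.01, 0.0, 0.01),  # close to the axis
    (0.03, 0.0, 0.02),  # mid distance
    (0.06, 0.0, 0.01),  # far from the axis
    (0.00, 0.0, 0.20),  # beyond the cylinder height
]
groups = multi_scale_groups(
    points,
    center=(0.0, 0.0, 0.0),      # graspable point
    axis=(0.0, 0.0, 1.0),        # approach direction
    radii=(0.02, 0.04, 0.08),    # small, medium, large scale
    height=0.05,
)
print(groups)  # {0.02: [0], 0.04: [0, 1], 0.08: [0, 1, 2]}
```

Each radius yields a differently sized neighborhood around the same graspable point; in a learned pipeline, features computed per group would then be concatenated across the three scales.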
To show the improvement on small objects more intuitively, we also take GSNet, the most representative previous method, as the baseline and visualize the grasp detection results of the baseline and the proposed method in four cluttered scenes. The visualizations show that the previous method tends to predict grasps on the large objects in a scene and fails to produce reasonable grasp poses on some small objects, whereas the proposed method accurately predicts grasp poses on small objects. Conclusion Focusing on small-object grasping, this paper proposes a mask-assisted sampling method embedded in an end-to-end learning network and introduces a multi-scale grouping learning strategy to improve the local geometric representation of objects. The method effectively improves grasp quality on small objects and outperforms previous methods in the grasp evaluation over all objects. However, the proposed method has certain limitations. For instance, when noisy, low-quality depth maps are used as input, existing unseen object instance segmentation methods may produce incorrect object masks, which causes mask-assisted sampling to fail. In future work, we plan to investigate more robust unseen object instance segmentation methods that can correct erroneous segmentation results under low-quality depth inputs. This will yield more accurate object instance masks and enhance grasp detection in cluttered scenes.