Large-scale point cloud scene segmentation combining spatial depthwise convolution and residuals

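Liu Sheng, Huang Shengyue, Cheng Haohao, Shen Jiayu, Chen Shengyong (College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023)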

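Abstract
Objective Semantic segmentation is an essential visual task for scene understanding in point cloud scenes. Because images are structured while point clouds are unstructured, convolution on point clouds is usually harder than convolution on images and consumes more computation and memory. As a result, large-scale scenes often have to be segmented block by block, which is inefficient and fails to capture enough scene information. To address this problem, this paper designs a computation-efficient and memory-efficient network structure for end-to-end semantic segmentation of large-scale scenes. Method A spatial depthwise residual (SDR) block is designed by combining spatial depthwise convolution with a residual structure; it is efficient in both computation and memory and learns geometric features from point clouds effectively. In addition, a dilated feature aggregation (DFA) module is designed that enlarges the receptive field while adding only a small amount of computation. Combining the SDR block and the DFA module, this paper builds SDRNet (spatial depthwise residual network), an encoder-decoder deep network for semantic segmentation of large-scale point cloud scenes. Because the distribution of the input data to the spatial convolution kernel is unfavorable for training, hierarchical normalization is proposed to reduce the difficulty of parameter learning. In particular, to address rotation invariance for sparse LiDAR point clouds, a special SDR block is proposed that eliminates the influence of rotation of the LiDAR data around the Z-axis and significantly improves performance on LiDAR point clouds. Result The method is tested on the S3DIS (Stanford large-scale 3D indoor space) and SemanticKITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) datasets, and the relationship between the number of points and the frame rate is analyzed. The method achieves a mean intersection over union (mIoU) of 71.7% on S3DIS and 59.1% mIoU in the online single-scan evaluation on SemanticKITTI. Conclusion The experimental results show that the proposed SDRNet can perform semantic segmentation directly on large-scale scenes. Results on S3DIS and SemanticKITTI demonstrate good accuracy, and the analysis of the number of points versus frame rate indicates that SDRNet maintains high accuracy together with a fast inference rate.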
Keywords
A deep residual network with spatial depthwise convolution for large-scale point cloud semantic segmentation

Liu Sheng, Huang Shengyue, Cheng Haohao, Shen Jiayu, Chen Shengyong(College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)

Abstract
Objective Point cloud semantic segmentation is an essential visual task for scene understanding as vision moves from two dimensions to three. Deep learning methods for point clouds fall into three groups: point-based, projection-based, and voxel-based. Projection-based methods obtain a two-dimensional image from the point cloud by spherical projection, segment it with a two-dimensional convolutional neural network, and restore the labels to the original point cloud through post-processing. However, these methods are usually applicable only to LiDAR point clouds. Voxel-based methods often consume a lot of memory because of the voxel representation. Both kinds of method convert the unstructured point cloud into a structured form and process it with a two-dimensional or three-dimensional convolutional neural network, which loses geometric details. Point-based methods often consume more memory because they store additional neighborhood information. Some existing methods therefore divide the entire point cloud into blocks for processing, but this destroys the geometric structure of the scene and leads to incomplete capture of scene information. In addition, in large-scale scenes some point-based methods suffer from insufficient receptive fields, because excessive memory consumption forces shallow network structures. This paper presents a computation-efficient and memory-efficient network structure for end-to-end semantic segmentation of large-scale scenes. Method The spatial depthwise residual (SDR) block is designed by combining spatial depthwise convolution with a residual structure to learn geometric features from the point cloud effectively.
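As a rough illustration of what combining a spatial depthwise convolution with a residual connection could look like, the following is a minimal sketch and not the authors' SDR block. The abstract does not specify the kernel, so the linear spatial kernel (`kernel_a`, `kernel_b`), the function name `sdr_block_sketch`, and the precomputed `neighbors` lists are all assumptions made for illustration. "Depthwise" here means each feature channel is filtered by its own spatial weight and channels are not mixed.

```python
def sdr_block_sketch(points, feats, neighbors, kernel_a, kernel_b):
    """Depthwise point convolution plus residual (illustrative only).

    points:    list of (x, y, z) coordinates
    feats:     list of per-point feature lists, each of length C
    neighbors: neighbors[i] lists the neighbor indices of point i
    kernel_a:  C weight vectors of length 3 (hypothetical linear kernel)
    kernel_b:  C biases (hypothetical)
    """
    out = []
    for i, (px, py, pz) in enumerate(points):
        acc = [0.0] * len(feats[i])
        for j in neighbors[i]:
            qx, qy, qz = points[j]
            d = (qx - px, qy - py, qz - pz)  # offset of neighbor j from point i
            for c in range(len(acc)):
                a = kernel_a[c]
                # Spatial weight for channel c depends only on the offset.
                w = a[0] * d[0] + a[1] * d[1] + a[2] * d[2] + kernel_b[c]
                # Depthwise: channel c aggregates only channel c of neighbors.
                acc[c] += w * feats[j][c]
        # Residual connection: add the input features back.
        out.append([f + v for f, v in zip(feats[i], acc)])
    return out
```

Because no channel mixing happens inside the loop, the cost grows linearly with the number of channels, which is the usual reason depthwise designs are cheaper than full convolutions.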
The receptive field is one of the key factors in semantic segmentation. To enlarge it, a dilated feature aggregation (DFA) module is designed, which has a larger receptive field than the SDR block at a lower computational cost. The core idea of this module is to reduce computation and memory consumption through downsampling. Combining the SDR block and the DFA module, SDRNet, a deeper encoder-decoder network, is constructed for semantic segmentation of large-scale scenes. The distribution of the input data affects network training. An analysis of the input data of the convolution kernel shows that its distribution is not conducive to learning, so hierarchical normalization (HN) is proposed to reduce the learning difficulty of the convolution kernel. A special SDR block is used to provide rotation invariance for sparse LiDAR point clouds. Before convolution, each point and its neighborhood are first rotated to a fixed angle, which eliminates the influence of rotation of the LiDAR data around the Z-axis: rotating the point cloud around the Z-axis does not change the prediction. This special SDR block significantly improves the performance of the network on LiDAR point clouds. Result Experiments are conducted on the Stanford large-scale 3D indoor space (S3DIS) dataset and the SemanticKITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset. Different hyperparameters are set for the two tasks to suit their application scenarios: because S3DIS focuses on accuracy, a larger model is built for higher accuracy, whereas SemanticKITTI demands more speed, so a lighter configuration is chosen. The model is compared with several state-of-the-art models on S3DIS using 6-fold cross validation.
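The step of rotating a point and its neighborhood to a fixed angle before convolution can be sketched as follows. This is one plausible reading, not the paper's exact procedure: it assumes the fixed angle is zero azimuth (the center point is rotated onto the XZ half-plane with positive X) and that the convolution consumes neighbor offsets relative to the center. The names `canonical_offsets` and `rotate_z` are hypothetical.

```python
import math


def rotate_z(p, angle):
    """Rotate point p = (x, y, z) about the Z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)


def canonical_offsets(center, neighbors):
    """Neighbor offsets after rotating the patch so the center has azimuth 0.

    Assumption: the sensor sits at the origin, so the azimuth of the center
    point is well defined whenever the center is off the Z-axis.
    """
    cx, cy, cz = center
    theta = math.atan2(cy, cx)  # azimuth of the center point
    offsets = []
    for nx, ny, nz in neighbors:
        d = (nx - cx, ny - cy, nz - cz)
        offsets.append(rotate_z(d, -theta))  # undo the center's azimuth
    return offsets
```

Under these assumptions, rotating the whole scan about the Z-axis shifts every center's azimuth by the same amount, and that shift is cancelled by the canonical rotation, so the offsets fed to the kernel stay the same.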
Mean intersection over union (mIoU), mean accuracy (mAcc), and overall accuracy (OA) are evaluated on S3DIS, where the method achieves 88.9% OA, 82.4% mAcc, and 71.7% mIoU, performing well on all three metrics. On SemanticKITTI, the online single-scan evaluation yields 59.1% mIoU. Compared with point-based and projection-based methods, the method achieves better mIoU and better accuracy on several classes. In an autonomous driving scenario such as SemanticKITTI, the inference speed of the model is a crucial factor, so the inference speed of SDRNet is tested with different numbers of points. With 50 K points, the network processes point clouds at 11 frames per second (fps) on a machine with an NVIDIA RTX GPU and an i7-8700K CPU. Ablation studies are also conducted to further explain the contribution of each part of the model. Conclusion Experiments on the S3DIS and SemanticKITTI datasets show that the method can directly perform semantic segmentation on large-scale point cloud scenes, extracting scene information effectively and achieving high accuracy. The ablation study on S3DIS Area 5 demonstrates that both DFA and HN improve performance. The ablation study on the SemanticKITTI validation set shows that eliminating the influence of rotation with the special SDR block effectively improves performance. The analysis of the relationship between the number of points and the frame rate shows that the method combines high accuracy with relatively fast inference.
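For readers unfamiliar with the three metrics reported above, the sketch below computes OA, mAcc, and mIoU from a confusion matrix using their standard definitions. The function name `segmentation_metrics` is hypothetical, and the handling of classes that never occur (they are skipped when averaging) is an assumption; benchmark toolkits may follow a different convention.

```python
def segmentation_metrics(conf):
    """Compute OA, mAcc, and mIoU.

    conf[i][j] is the number of points whose true class is i and whose
    predicted class is j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))
    accs, ious = [], []
    for i in range(n):
        tp = conf[i][i]
        gt = sum(conf[i])                         # points truly in class i
        pred = sum(conf[r][i] for r in range(n))  # points predicted as class i
        union = gt + pred - tp
        if gt > 0:        # assumption: skip classes absent from ground truth
            accs.append(tp / gt)
        if union > 0:     # assumption: skip classes absent from both
            ious.append(tp / union)
    return {
        "OA": correct / total,
        "mAcc": sum(accs) / len(accs),
        "mIoU": sum(ious) / len(ious),
    }
```

For example, with the two-class matrix `[[8, 2], [1, 9]]`, OA is 17/20 and the per-class IoU values are 8/11 and 9/12, whose average is the mIoU.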
Keywords
