Published online: 2021-11-16 | DOI: 10.11834/jig.200550 | 2021, Volume 26, Number 11 | Image Understanding and Computer Vision

1. College of Information Technology, Shanghai Ocean University, Shanghai 201306, China;
2. Polar Research Institute of China, Shanghai 200136, China

Received: 2020-09-15; revised: 2020-12-24; preprint: 2020-12-31. Supported by: National Key Research and Development Program of China (2016YFC1400304); National Natural Science Foundation of China (61972240); Local University Capacity Building Program of the Shanghai Science and Technology Commission (20050501900). About the authors: Song Wei, born in 1977, female, professor; research interests: machine vision, image/video processing, and ocean big data analysis. E-mail: wsong@shou.edu.cn. Cai Wanyuan, male, M.S. candidate; research interests: computer vision and 3D point clouds. E-mail: 865645064@qq.com. He Shengqi, male, Ph.D., engineer; research interests: 3D modeling and marine information engineering. E-mail: sqhe@shou.edu.cn. Li Wenjun, corresponding author, male, Ph.D., engineer; research interests: remote sensing and data analysis. E-mail: liwenjun@pric.org.cn. *Corresponding author: Li Wenjun, liwenjun@pric.org.cn. CLC number: TP391; Document code: A; Article number: 1006-8961(2021)11-2691-12

Dynamic graph convolution with spatial attention for point cloud classification and segmentation
Song Wei1, Cai Wanyuan1, He Shengqi1, Li Wenjun2
1. College of Information Technology, Shanghai Ocean University, Shanghai 201306, China;
2. Polar Research Institute of China, Shanghai 200136, China
Supported by: National Key Research and Development Program of China (2016YFC1400304); National Natural Science Foundation of China (61972240); Local University Capacity Building Program of the Shanghai Science and Technology Commission (20050501900)

# Abstract

**Objective** With the rapid development of 3D acquisition technologies, point clouds have wide applications in many areas, such as medicine, autonomous driving, and robotics. As a dominant technique in artificial intelligence (AI), deep learning has been successfully used to solve various 2D vision problems and has shown great potential in solving 3D vision problems. However, applying regular-grid convolutional neural networks (CNNs) to the non-Euclidean space of point cloud data and capturing the hidden shapes of irregular points remain challenging. In recent years, deep learning-based methods have been more effective in point cloud classification and segmentation than traditional methods. They can be divided into three groups: pointwise methods, convolution-based methods, and graph-convolution-based methods. All of them involve two important processes: feature extraction and feature aggregation. Most methods focus on the design of feature extraction and pay less attention to feature aggregation. At present, most deep learning-based point cloud classification and segmentation methods use max pooling for feature aggregation. However, keeping only the maximum of the local neighborhood features discards the information carried by the other neighbors.

**Method** This paper proposes a deep learning method for point cloud classification and segmentation: dynamic graph convolution with spatial attention (DGCSA). The key of the network is to learn the relationship between the neighbor points and the center point, which avoids the information loss caused by max-pooling-based feature aggregation. The network is composed of a dynamic graph convolution module and a spatial attention (SA) module. The dynamic graph convolution module mainly performs a K-nearest neighbor (KNN) search and multilayer perception. For each point, it first uses the KNN algorithm to search for its neighbor points and then extracts the features of the neighbor points and center point with convolutional layers. The K nearest neighbors of each point vary across network layers, so the graph structure is updated dynamically layer by layer. After feature extraction, a point-based SA module automatically learns local features that are more representative than the maximum feature. The key of the SA module is to use the attention mechanism to weight the K neighbors of the center point. It consists of four units: 1) an attention activation unit, 2) an attention score unit, 3) a weighted feature unit, and 4) a multilayer perceptron unit. First, the attention activation of each latent feature is learned through a fully connected layer. Second, the attention score of the corresponding feature is calculated by applying the softmax function to the attention activations; the learned attention scores can be regarded as a mask that automatically selects useful latent features. Third, the attention scores are multiplied element-wise with the local neighborhood features to generate a set of weighted features. Finally, the weighted features are summed to obtain the representative local feature, followed by another fully connected layer that controls the output dimension of the SA module. The SA module has strong learning ability, thereby improving the classification and segmentation accuracy of the model. DGCSA implements high-performance classification and segmentation of point clouds by stacking several dynamic graph convolution modules and SA modules. Moreover, feature fusion is used to fuse the output features of different spatial attention layers, which effectively captures the global and local characteristics of point cloud data and achieves better classification and segmentation results.

**Result** To evaluate the performance of the proposed DGCSA model, experiments are carried out on classification, instance segmentation, and semantic scene segmentation on the ModelNet40, ShapeNetPart, and Stanford Large-Scale 3D Indoor Spaces (S3DIS) datasets, respectively. The results show that the overall accuracy (OA) of our method reaches 93.4%, 0.8% higher than the baseline dynamic graph CNN (DGCNN). The mean intersection over union (mIoU) of instance segmentation reaches 85.3%, 0.2% higher than DGCNN; for indoor scene segmentation, the mIoU of the six-fold cross-validation reaches 59.1%, 3.0% higher than DGCNN. Overall, the classification accuracy of our method on ModelNet40 surpasses that of most existing point cloud classification methods, such as PointNet, PointNet++, and PointCNN, and the accuracy of DGCSA in instance segmentation and indoor scene segmentation matches that of current state-of-the-art point cloud segmentation networks. Furthermore, the validity of the SA module is verified by an ablation study in which the max pooling operations in PointNet and linked dynamic graph CNN (LDGCNN) are replaced by the SA module; the classification results on ModelNet40 show that the SA module raises the accuracy of PointNet and LDGCNN by more than 0.5%.

**Conclusion** DGCSA can effectively aggregate the local features of point cloud data and achieve better classification and segmentation results. Through the design of the SA module, the network solves the problem of partial information loss when aggregating local neighborhood information: the SA module considers the contributions of all neighbors, selectively strengthens features that carry useful information, and suppresses useless ones. Combining the spatial attention module with the dynamic graph convolution module improves the accuracy of classification, instance segmentation, and indoor scene segmentation. In addition, the spatial attention module can be integrated into other point cloud classification models and substantially improve their performance. Our future work will improve the accuracy of DGCSA on segmentation tasks with class-imbalanced datasets.

# Key words

point cloud; dynamic graph convolution; spatial attention (SA); classification; segmentation

# 2.1.1 Graph construction: K-nearest neighbors

 $\boldsymbol{G}=(\boldsymbol{V}, \boldsymbol{E})$ (1)

 $\boldsymbol{V}=\left\{\boldsymbol{p}_{i} \mid i=1,2, \cdots, N\right\}$ (2)

 $\boldsymbol{E}=\left\{\boldsymbol{e}_{i}=\left(\boldsymbol{e}_{i_{1}}, \boldsymbol{e}_{i_{2}}, \cdots, \boldsymbol{e}_{i_{j}}\right) \mid i=1,2, \cdots, N\right\}$ (3)
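Eqs. (1)-(3) define a directed graph whose edge set connects each point to its K nearest neighbors. As a minimal numpy sketch of that construction (the `knn_graph` helper and the brute-force distance computation are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def knn_graph(points, k):
    """Build a directed K-nearest-neighbor graph over an (N, 3) point set.

    Returns an (N, k) index array: row i lists the k nearest neighbors of
    point i (excluding the point itself), i.e., the edges e_i of Eq. (3).
    """
    # Pairwise squared Euclidean distances, shape (N, N)
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)          # exclude self-loops
    return np.argsort(dist2, axis=1)[:, :k]  # indices of the k closest points

pts = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]])
nbrs = knn_graph(pts, k=2)
```

In the DGCSA/DGCNN setting this search is recomputed in feature space at every layer, which is what makes the graph structure dynamic.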

# 2.1.2 Feature extraction with MLP

 $\boldsymbol{e}_{i_{j}}=h_{\boldsymbol{\varTheta}}\left(\boldsymbol{p}_{i}, \boldsymbol{p}_{i}-\boldsymbol{p}_{i_{j}}\right)$ (4)

 $\boldsymbol{l}_{i}=\mathrm{ReLU}\left(\mathrm{BN}\left(\sum\limits_{j:(i, j) \in \boldsymbol{E}} \boldsymbol{e}_{i_{j}}\right)\right)$ (5)

 $\boldsymbol{L}_{i}=\max \left(\boldsymbol{l}_{i}\right)$ (6)

 $\boldsymbol{L}_{i}=\mathrm{SA}\left(\boldsymbol{l}_{i}\right)$ (7)
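Eq. (4) builds each edge feature from the center point together with its offset to a neighbor, before the shared MLP $h_{\boldsymbol{\varTheta}}$ is applied. A numpy sketch of that pre-MLP step (the `edge_features` helper is illustrative; the MLP itself and the aggregations of Eqs. (5)-(7) are omitted):

```python
import numpy as np

def edge_features(points, nbr_idx):
    """Edge inputs (p_i, p_i - p_ij) of Eq. (4), before the MLP h_Theta.

    points:  (N, 3) coordinates; nbr_idx: (N, K) neighbor indices.
    Returns (N, K, 6): each edge concatenates the center point with the
    offset to its neighbor, so the MLP sees both global position and
    local geometric structure.
    """
    N, K = nbr_idx.shape
    center = np.repeat(points[:, None, :], K, axis=1)  # (N, K, 3)
    offset = center - points[nbr_idx]                  # p_i - p_ij, (N, K, 3)
    return np.concatenate([center, offset], axis=-1)

pts = np.array([[0.0, 0, 0], [1, 0, 0], [0, 2, 0]])
idx = np.array([[1, 2], [0, 2], [0, 1]])
ef = edge_features(pts, idx)
```

After the per-edge MLP, the max aggregation of Eq. (6) would correspond to `features.max(axis=1)`, which Eq. (7) replaces with the SA module.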

# 2.2 Spatial attention module

1) Attention activation. As shown in Fig. 4, given a set of local features ${\boldsymbol{l}_i} = \left\{ {\boldsymbol{l}_i^1, \cdots ,\boldsymbol{l}_i^k, \cdots ,\boldsymbol{l}_i^K} \right\},{\boldsymbol{l}_i} \in {{\boldsymbol{\rm{R}}}^{K \times D}}$, where $K$ denotes the $K$ vertices of the K-nearest-neighbor graph and $D$ denotes the dimension of the local features, the local features ${\boldsymbol{l}_i}$ are fed into an MLP function $g\left( \cdot \right)$, i.e., a fully connected layer, which outputs a set of learned attention activations $\boldsymbol{C}=\left\{\boldsymbol{c}_{1}, \boldsymbol{c}_{2}, \cdots, \boldsymbol{c}_{K}\right\} \in \boldsymbol{\rm{R}}^{K \times D}$, computed as

 $\boldsymbol{C}=g\left(\boldsymbol{l}_{i}, \boldsymbol{W}\right)$ (8)

2) Attention score. The softmax function is applied to normalize the attention activations, producing a set of attention scores $\boldsymbol{s}=\left\{\boldsymbol{s}_{1}, \boldsymbol{s}_{2}, \cdots, \boldsymbol{s}_{K}\right\} \in \boldsymbol{\rm{R}}^{K \times D}$. The attention score of the $k$-th feature vector is

 $\boldsymbol{s}_{k}=\frac{\exp \left(\boldsymbol{c}_{k}\right)}{\sum\limits_{j=1}^{K} \exp \left(\boldsymbol{c}_{j}\right)}$ (9)

3) Weighted features. The attention scores $\boldsymbol{s}$ are multiplied element-wise with the local neighborhood features ${\boldsymbol{l}_i}$ to generate weighted neighborhood features, which are then summed over the $K$ vertices to obtain the aggregated local feature $\boldsymbol{L}_{i} \in \boldsymbol{\rm{R}}^{1 \times D}$, computed as

 $\boldsymbol{L}_{i}=\sum\limits_{k=1}^{K}\left(\boldsymbol{l}_{i}^{k} * \boldsymbol{s}_{k}\right)$ (10)

4) Multilayer perceptron module. Finally, ${\boldsymbol{L}_i}$ is fed into an MLP that controls the dimension of the local feature vector. This MLP consists of a fully connected (FC) layer, a batch normalization (BN) layer, and a ReLU activation layer, and gives the attention pooling module greater flexibility in reducing the feature dimension.
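Steps 1)-3) reduce to a softmax-weighted sum over the $K$ neighbors. A minimal numpy sketch, assuming a single linear map `W` in place of the MLP $g(\cdot)$ and omitting the final FC/BN/ReLU stage of step 4):

```python
import numpy as np

def spatial_attention_pool(l_i, W):
    """Attention aggregation of Eqs. (8)-(10), sketched for one center point.

    l_i: (K, D) local neighborhood features; W: (D, D) stand-in weights
    for the MLP g(.). Returns the aggregated (D,) feature L_i.
    """
    C = l_i @ W                             # Eq. (8): attention activations
    s = np.exp(C) / np.exp(C).sum(axis=0)   # Eq. (9): softmax over K neighbors
    return (l_i * s).sum(axis=0)            # Eq. (10): weighted sum

rng = np.random.default_rng(0)
l_i = rng.normal(size=(4, 8))
L_i = spatial_attention_pool(l_i, np.eye(8))
```

Because the scores in each feature dimension sum to one, the output is a convex combination of all $K$ neighbor features, unlike max pooling, which keeps a single neighbor per dimension.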

# 2.3 Spatial transformer network

The spatial transformer network (STN) proposed in PointNet (Qi et al., 2017a) learns a spatial rotation matrix that aligns the coordinates of the input point cloud. The STN can rotate the input point cloud directly to a better pose, which facilitates subsequent classification and segmentation. As shown in Fig. 5, the input point cloud passes through several MLPs and a max-pooling layer to predict a 3×3 rotation matrix $\boldsymbol{R}$, and coordinate alignment is achieved by multiplying the input point cloud by $\boldsymbol{R}$.
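The alignment itself is a single matrix product; a minimal numpy sketch that omits the network predicting $\boldsymbol{R}$ (the shared MLPs plus max pooling) and uses fixed matrices in its place:

```python
import numpy as np

def apply_stn(points, R):
    """Align an (N, 3) point cloud with a 3x3 transform, as in the
    PointNet STN. Predicting R from the cloud is omitted here;
    alignment itself is one matrix multiply per point.
    """
    return points @ R

pts = np.array([[1.0, 0, 0], [0, 1, 0]])
aligned = apply_stn(pts, np.eye(3))  # identity: cloud unchanged
```

With a learned $\boldsymbol{R}$ (e.g., a rotation about the z-axis), the same call rotates the whole cloud toward a canonical pose before feature extraction.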

# 3 Experimental validation

Table 1 Experimental configuration

| OS | GPU | Acceleration libraries | Framework | Language |
| --- | --- | --- | --- | --- |
| Linux CentOS 7 | RTX 2080Ti | CUDA 10.1 + cuDNN 7.5 | PyTorch 1.3 | Python 3.5.2 |

Table 2 Experimental parameters setting

| Dataset | Number of points | K neighbors | Optimizer | Learning rate | Batch size | Epochs |
| --- | --- | --- | --- | --- | --- | --- |
| ModelNet40 | 1 024 | 20 | SGD | 0.001 | 32 | 250 |
| ShapeNetPart | 2 048 | 40 | SGD | 0.001 | 14 | 200 |
| S3DIS | 4 096 | 20 | SGD | 0.001 | 14 | 100 |

 $I o U=\frac{T P}{T P+F P+F N}$ (11)
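Eq. (11) can be evaluated per class directly from predicted and ground-truth label arrays; a small numpy sketch (the `iou_per_class` helper is illustrative, not the paper's evaluation code):

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Per-class IoU of Eq. (11): TP / (TP + FP + FN)."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # true positives
        fp = np.sum((pred == c) & (gt != c))   # false positives
        fn = np.sum((pred != c) & (gt == c))   # false negatives
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float('nan'))
    return ious

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
# class 0: TP=1, FP=1, FN=0 -> 0.5; class 1: TP=2, FP=0, FN=1 -> 2/3
```

The mIoU reported in Tables 4-6 is the mean of these per-class (or per-instance) values.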

# 3.1.2 Experimental results

Table 3 Comparison of classification results on ModelNet40 dataset

| Model | OA | mAcc |
| --- | --- | --- |
| PointNet (Qi et al., 2017a) | 89.2 | 86.2 |
| PointNet++ (Qi et al., 2017b) | 91.9 | - |
| PointWeb (Zhao et al., 2019) | 92.3 | 89.4 |
| PointConv (Wu et al., 2019) | 92.5 | - |
| PointCNN (Li et al., 2018) | 92.2 | 88.1 |
| KPConv rigid (Thomas et al., 2019) | 92.9 | - |
| KPConv deform (Thomas et al., 2019) | 92.7 | - |
| SpiderCNN (Xu et al., 2018) | 92.4 | - |
| ECC (Simonovsky and Komodakis, 2017) | 87.4 | 83.2 |
| RGCNN (Te et al., 2018) | 90.5 | 87.2 |
| LDGCNN (Zhang et al., 2019) | 92.9 | 90.3 |
| DGCNN (Wang et al., 2019) | 92.6 | 90.1 |
| DGCSA (ours) | **93.4** | **90.6** |

Note: values are percentages; bold indicates the best value in each column.

# 3.2.2 Experimental results

Table 4 IoU results of instance segmentation on the ShapeNetPart dataset

| Model | mIoU | aero (2 690) | bag (76) | cap (55) | car (898) | chair (3 758) | earphone (69) | guitar (787) | knife (392) | lamp (1 547) | laptop (451) | motor (202) | mug (184) | pistol (283) | rocket (66) | skateboard (152) | table (5 271) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PointNet | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | **91.5** | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6 |
| PointNet++ | 85.1 | 82.4 | 79.0 | **87.7** | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | **71.6** | 91.4 | 81.3 | 58.7 | **76.4** | 82.6 |
| DGCNN | 85.1 | 84.0 | **83.7** | 84.4 | **77.8** | 90.6 | 74.4 | 91.0 | 88.1 | 83.4 | 95.8 | 67.8 | 93.3 | **82.3** | **59.2** | 76.0 | 81.9 |
| DGCSA (ours) | **85.3** | **84.2** | 73.3 | 82.3 | 77.7 | **91.0** | **75.3** | 91.2 | **88.6** | **85.3** | **95.9** | 58.9 | **94.3** | 81.8 | 56.9 | 75.4 | **82.7** |

Note: values are percentages; bold indicates the best value in each column; the number after each category is its sample count.

# 3.3.2 Experimental results

Table 5 Results of semantic segmentation of 3D indoor scenes on S3DIS (test on Area 5)

| Model | mIoU | ceiling | floor | wall | beam | column | window | door | chair | table | bookcase | sofa | board | clutter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PointNet | 41.09 | 88.80 | 97.33 | 69.80 | 0.05 | 3.92 | 46.26 | 10.76 | 52.61 | 58.93 | **40.28** | 5.85 | 26.38 | 33.22 |
| PointNet++ | 50.04 | 90.79 | 96.45 | 74.12 | 0.02 | 5.77 | 43.59 | 25.39 | 69.22 | **76.94** | 21.45 | 55.61 | **49.34** | 41.88 |
| DGCNN (baseline) | 47.08 | 92.42 | 97.46 | 76.03 | **0.37** | 12.00 | **51.59** | 27.01 | 64.85 | 68.58 | 7.67 | 43.76 | 29.44 | 40.83 |
| DGCSA (ours) | **50.10** | **93.21** | **97.70** | **77.04** | 0.29 | **15.13** | 50.70 | **27.90** | **69.74** | 69.00 | 13.90 | **56.38** | 44.29 | **45.00** |

Note: values are percentages; bold indicates the best value in each column.

Table 6 Results of semantic segmentation of 3D indoor scenes on S3DIS (6-fold cross validation)

| Model | mIoU | OA |
| --- | --- | --- |
| PointNet | 47.6 | 78.6 |
| PointNet++ | 54.5 | 81.0 |
| DGCNN (baseline) | 56.1 | 84.1 |
| DGCSA (ours) | **59.1** | **85.1** |

Note: values are percentages; bold indicates the best value in each column.

# 3.4 Ablation study

Table 7 Ablation study (classification on ModelNet40)

| Model | OA |
| --- | --- |
| PointNet | 89.2 |
| PointNet + SA module | 89.7 (↑0.5) |
| LDGCNN | 92.7 |
| LDGCNN + SA module | 93.3 (↑0.6) |

Note: values are percentages; "↑" denotes the gain over the corresponding unmodified model.

# References

• Armeni I, Sener O, Zamir A R, Jiang H L, Brilakis I, Fischer M and Savarese S. 2016. 3D semantic parsing of large-scale indoor spaces//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1534-1543[DOI: 10.1109/CVPR.2016.170]
• Besl P J, Jain R C. 1988. Segmentation through variable-order surface fitting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(2): 167-192 [DOI:10.1109/34.3881]
• Bruna J, Zaremba W, Szlam A and LeCun Y. 2013. Spectral networks and locally connected networks on graphs[EB/OL]. [2020-08-15]. https://arxiv.org/pdf/1312.6203.pdf
• Chen X Z, Ma H M, Wan J, Li B and Xia T. 2017. Multi-view 3D object detection network for autonomous driving//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6526-6534[DOI: 10.1109/CVPR.2017.691]
• Defferrard M, Bresson X and Vandergheynst P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS: 3844-3852[DOI: 10.5555/3157382.3157527]
• Filin S, Pfeifer N. 2006. Segmentation of airborne laser scanning data using a slope adaptive neighborhood. ISPRS Journal of Photogrammetry and Remote Sensing, 60(2): 71-80 [DOI:10.1016/j.isprsjprs.2005.10.005]
• Fischler M A, Bolles R C. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6): 381-395 [DOI:10.1145/358669.358692]
• Guo Y L, Wang H Y, Hu Q Y, Liu H, Liu L, Bennamoun M. 2020. Deep learning for 3D point clouds: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence [DOI:10.1109/TPAMI.2020.3005434]
• Kipf T N and Welling M. 2016. Semi-supervised classification with graph convolutional networks[EB/OL]. [2020-08-15]. https://arxiv.org/pdf/1609.02907.pdf
• LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature, 521(7553): 436-444 [DOI:10.1038/nature14539]
• Li Y Y, Bu R, Sun M C, Wu W, Di X H and Chen B Q. 2018. PointCNN: convolution on Χ-transformed points//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: NeurIPS: 828-838
• Qi C R, Su H, Mo K C and Guibas L J. 2017a. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 77-85[DOI: 10.1109/CVPR.2017.16]
• Qi C R, Yi L, Su H and Guibas L J. 2017b. PointNet++: deep hierarchical feature learning on point sets in a metric space//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: NIPS: 5105-5114[DOI: 10.5555/3295222.3295263]
• Simonovsky M and Komodakis N. 2017. Dynamic edge-conditioned filters in convolutional neural networks on graphs//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 29-38[DOI: 10.1109/CVPR.2017.11]
• Su H, Maji S, Kalogerakis E and Learned-Miller E. 2015. Multi-view convolutional neural networks for 3D shape recognition//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 945-953[DOI: 10.1109/ICCV.2015.114]
• Te G S, Hu W, Zheng A M and Guo Z M. 2018. RGCNN: regularized graph CNN for point cloud segmentation//Proceedings of the 26th ACM International Conference on Multimedia. Seoul, Korea (South): ACM: 746-754[DOI: 10.1145/3240508.3240621]
• Thomas H, Qi C R, Deschaud J E, Marcotegui B, Goulette F and Guibas L. 2019. KPConv: flexible and deformable convolution for point clouds//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6410-6419[DOI: 10.1109/ICCV.2019.00651]
• Wang C, Samari B and Siddiqi K. 2018. Local spectral graph convolution for point set feature learning//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 56-71[DOI: 10.1007/978-3-030-01225-0_4]
• Wang Y, Sun Y B, Liu Z W, Sarma S E, Bronstein M M, Solomon J M. 2019. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics, 38(5): 146 [DOI:10.1145/3326362]
• Wu W X, Qi Z G and Li F X. 2019. PointConv: deep convolutional networks on 3D point clouds//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9613-9622[DOI: 10.1109/cvpr.2019.00985]
• Wu Z R, Song S R, Khosla A, Yu F, Zhang L G, Tang X O and Xiao J X. 2015. 3D shapenets: a deep representation for volumetric shapes//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1912-1920[DOI: 10.1109/CVPR.2015.7298801]
• Xu Y F, Fan T Q, Xu M Y, Zeng L and Qiao Y. 2018. SpiderCNN: deep learning on point sets with parameterized convolutional filters//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 90-105[DOI: 10.1007/978-3-030-01237-3_6]
• Yi L, Kim V G, Ceylan D, Shen I C, Yan M Y, Su H, Lu C W, Huang Q X, Sheffer A, Guibas L. 2016. A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics, 35(6): 210 [DOI:10.1145/2980179.2980238]
• Zhang K G, Hao M, Wang J, de Silva C W and Fu C L. 2019. Linked dynamic graph CNN: learning on point cloud via linking hierarchical features[EB/OL]. [2020-08-15]. https://arxiv.org/pdf/1904.10014.pdf
• Zhang Y X and Rabbat M. 2018. A graph-CNN for 3D point cloud classification//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE: 6279-6283[DOI: 10.1109/ICASSP.2018.8462291]
• Zhao H S, Jiang L, Fu C W and Jia J Y. 2019. PointWeb: enhancing local neighborhood features for point cloud processing//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5560-5568[DOI: 10.1109/CVPR.2019.00571]
• Zhou Y and Tuzel O. 2018. VoxelNet: end-to-end learning for point cloud based 3D object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4490-4499[DOI:10.1109/CVPR.2018.00472]