The 3D point cloud based semantic information-relevant map construction method for unrecognized scenario
2023, Vol. 28, No. 8, Pages 2432-2446
Print publication date: 2023-08-16
DOI: 10.11834/jig.220382
Ma Miao, Liu Peimin, Pan Haipeng. 2023. The 3D point cloud based semantic information-relevant map construction method for unrecognized scenario. Journal of Image and Graphics, 28(08): 2432-2446
Objective
When performing simultaneous localization and mapping (SLAM), a robot needs to make effective use of the scene information of unknown, complex environments. To address the problems that existing SLAM algorithms understand scene details insufficiently and that detail information is missing from the constructed maps, this paper develops a map construction method for unknown environments that combines SLAM point cloud localization with a semantic segmentation network, achieving high-precision 3D map reconstruction.
Method
First, the real-time color information of the scene is used to estimate the camera pose, and a deep learning network that fuses spatially multi-scale sparse and dense features, HieSemNet (hierarchical semantic network), is constructed to semantically segment the unknown scene and obtain its real-time 2D semantic information. Then, the depth information and the camera poses are used to estimate the spatial point cloud, and the 2D semantic segmentation results are fused with the 3D point cloud so that each segmentation result corresponds to its spatial position in the point cloud, producing a high-precision point cloud map with semantic information and realizing 3D map reconstruction.
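As a concrete illustration of the back-projection and label-fusion step described above, the following minimal sketch (an assumption for illustration, not the authors' implementation) projects each valid depth pixel of an RGB-D frame into the world frame with the estimated pose and attaches the class label predicted for the same pixel:

```python
# Minimal sketch (assumption, not the authors' code): back-project an RGB-D frame
# into a labeled world-frame point cloud using the estimated camera pose.
import numpy as np

def depth_to_semantic_cloud(depth, labels, K, T_wc, depth_scale=1000.0):
    """depth: HxW depth image, labels: HxW class ids,
    K: 3x3 intrinsics, T_wc: 4x4 camera-to-world pose."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32) / depth_scale            # metres
    valid = z > 0                                         # drop pixels with missing depth
    x = (u - cx) * z / fx                                 # pinhole back-projection
    y = (v - cy) * z / fy
    pts_cam = np.stack([x[valid], y[valid], z[valid],
                        np.ones(valid.sum())], axis=1)    # Nx4 homogeneous points
    pts_world = (T_wc @ pts_cam.T).T[:, :3]               # transform into the world frame
    return pts_world, labels[valid]                       # 3D points with class ids
```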
Result
To verify the effectiveness of the proposed method, validation experiments are conducted on the constructed HieSemNet network and on the semantic SLAM system. The experimental results show that the proposed network achieves good accuracy in both mean pixel accuracy (MPA) and mean intersection over union (MIoU): compared with the other networks, MPA is improved by 17.47%, 11.67%, 4.86%, 2.90%, and 0.44%, and MIoU is improved by 13.94%, 1.10%, 6.28%, 2.28%, and 0.62%, respectively. The proposed SLAM algorithm captures more mapping information, and the maps it constructs have better precision and accuracy.
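For reference, the two reported metrics, MPA and MIoU, can be computed from a per-class confusion matrix as sketched below; this is the generic formulation of the metrics, not the paper's own evaluation code.

```python
# Generic sketch of the two reported metrics, computed from a confusion matrix
# whose entry [i, j] counts pixels of true class i predicted as class j.
import numpy as np

def mpa_and_miou(conf):
    tp = np.diag(conf).astype(np.float64)
    per_class_acc = tp / np.maximum(conf.sum(axis=1), 1)   # per-class pixel accuracy
    mpa = per_class_acc.mean()                              # mean pixel accuracy
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp        # |pred ∪ true| per class
    miou = (tp / np.maximum(union, 1)).mean()               # mean intersection over union
    return mpa, miou
```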
Conclusion
The proposed method fully considers the segmentation of objects of different sizes, and the proposed HieSemNet network effectively improves the accuracy of scene semantic segmentation. Moreover, compared with existing state-of-the-art semantic SLAM systems, the proposed method clearly improves the precision and accuracy of map construction and obtains higher-quality maps.
Objective
With the continuous development of computer technology and artificial intelligence, intelligent robots are being deployed in ever more contexts. Simultaneous localization and mapping (SLAM) is an effective technique through which a robot perceives scene information: starting from an unknown position in an unknown environment, the robot localizes itself from observed map features while it moves and constructs a complete map of the scene from its own poses and trajectory. The environment maps built by traditional SLAM lack semantic information, however, so the robot cannot truly recognize the scene. To perceive increasingly complex scenes, some researchers have introduced deep learning methods into SLAM systems to recognize objects in the scene, but insufficient scene understanding and incomplete map building remain challenging problems. SLAM tasks require robots to explore unknown environments and to make effective use of the scene information of complex environments. Existing SLAM algorithms understand scene details insufficiently and lose detail information during map building, while existing semantic segmentation algorithms segment multi-scale objects poorly, run slowly, and produce indistinct segmentation results. Our main research objectives are therefore to improve the ability of the semantic segmentation algorithm to recognize multi-scale objects and to improve the accuracy and precision of map construction with semantic SLAM technology. A map construction method for unknown environments is developed that combines SLAM point cloud localization technology with a semantic segmentation network; it identifies objects of different sizes in the scene effectively and realizes high-precision 3D map reconstruction.
Method
We design a deep learning semantic segmentation network that fuses spatially multi-scale sparse and dense features, called the hierarchical semantic network (HieSemNet). A spatial pyramid module with dilated convolutions of different dilation rates is adopted to capture global contextual information, so that features can be extracted with a multi-scale structure. The network consists of two branches: a feature extraction base network and the spatial pyramid module. Semantic labels at the different scales of the two branches are used separately to supervise training and compute the loss function, and the final feature map is generated by a weighted fusion of the feature maps of the two branches. The segmentation network is then integrated into the SLAM system, and map construction is completed by three modules: tracking, local mapping, and loop closing. The tracking module extracts ORB (oriented FAST and rotated BRIEF) features from the image sequences acquired by an RGB-D camera, determines keyframes from the ORB feature point pairs between frames, and estimates the camera pose. The local mapping module further filters the inserted keyframes and then computes and filters the map points associated with them. The loop closing module optimizes and updates the generated map. The algorithm proceeds as follows. First, the real-time color information of the scene captured by the RGB-D camera is used for camera pose estimation and trajectory calculation, and HieSemNet, the deep learning network fusing spatial multi-scale sparse and dense features, segments the unknown scene to obtain real-time 2D semantic information. Second, the depth information and camera poses are used to estimate the spatial point cloud and to build an octree representing its spatial relations. Finally, the 2D semantic segmentation results are fused with the 3D point cloud so that each segmentation result corresponds to its spatial position in the octree, yielding a high-precision point cloud map with semantic information and realizing 3D map reconstruction.
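The two-branch structure with a dilated-convolution pyramid and weighted fusion described above can be sketched as follows; the backbone, channel widths, dilation rates, and fusion weight here are illustrative assumptions rather than the published HieSemNet configuration.

```python
# Illustrative sketch only: a backbone branch plus a dilated-convolution pyramid
# branch, fused by a weighted sum of their logits. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBranch(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # parallel dilated convolutions capture context at several scales
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([F.relu(b(x)) for b in self.branches], dim=1))

class TwoBranchSegNet(nn.Module):
    def __init__(self, num_classes, feat_ch=64, alpha=0.5):
        super().__init__()
        self.backbone = nn.Sequential(                       # stand-in for the base network
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU())
        self.pyramid = PyramidBranch(feat_ch, feat_ch)
        self.head_dense = nn.Conv2d(feat_ch, num_classes, 1)    # backbone-branch logits
        self.head_sparse = nn.Conv2d(feat_ch, num_classes, 1)   # pyramid-branch logits
        self.alpha = alpha                                       # fusion weight

    def forward(self, x):
        feat = self.backbone(x)
        # weighted fusion of the two branches; each branch can also be supervised
        # with its own loss at its own scale during training
        logits = self.alpha * self.head_dense(feat) + \
                 (1 - self.alpha) * self.head_sparse(self.pyramid(feat))
        return F.interpolate(logits, size=x.shape[-2:], mode='bilinear',
                             align_corners=False)
```

For the ADE20K benchmark used in the experiments, such a network would be instantiated with 150 classes, e.g. `TwoBranchSegNet(num_classes=150)`.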
Result
To verify the effectiveness of the proposed method, validation experiments are conducted on the constructed HieSemNet and on the semantic SLAM system. HieSemNet is compared with related state-of-the-art networks, FCN (fully convolutional network), SegNet (segmentation network), PSPNet (pyramid scene parsing network), DeepLabv3, and SETR (segmentation transformer), in terms of segmentation accuracy on the classical semantic segmentation dataset ADE20K. The experimental results show that the proposed network performs well on both mean pixel accuracy and mean intersection over union. Because HieSemNet obtains a large receptive field through dilated convolution without losing too much detail information, it segments both large and small objects more accurately. Compared with the above networks, the mean pixel accuracy is improved by 17.47%, 11.67%, 4.86%, 2.90%, and 0.44%, and the mean intersection over union is improved by 13.94%, 1.10%, 6.28%, 2.28%, and 0.62%, respectively. The proposed SLAM algorithm is tested on office and warehouse scenes from the TUM RGB-D dataset and in a natural environment, and the map building process, trajectory accuracy, and absolute trajectory error are reported for the three scenes. The comparative results show that the constructed maps capture more mapping information with fewer blank or erroneous regions, the contours and positions of objects in the maps are more accurate, and small, cluttered objects cause fewer adverse effects, so the maps reflect the actual scene more faithfully.
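The absolute trajectory error reported above is commonly summarized as the root mean square of the translational differences between time-associated, aligned estimated and ground-truth poses; the sketch below assumes the association and alignment have already been done and is a simplified illustration rather than the benchmark's official evaluation tool.

```python
# Simplified sketch (assumption): RMSE of the absolute trajectory error for two
# already-associated and aligned trajectories given as Nx3 position arrays.
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    diff = est_xyz - gt_xyz                          # per-timestamp translation error
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```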
Conclusion
The proposed method fully accounts for the segmentation of objects of different sizes, and the proposed HieSemNet network effectively improves scene semantic segmentation accuracy. In addition, compared with existing state-of-the-art semantic SLAM systems, the proposed method clearly improves the precision and accuracy of map construction and obtains higher-quality maps.
simultaneous localization and mapping (SLAM); semantic segmentation; semantic three-dimensional map; spatial multi-scale features
Badrinarayanan V, Kendall A and Cipolla R. 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Buslaev A, Iglovikov V I, Khvedchenya E, Parinov A, Druzhinin M and Kalinin A A. 2020. Albumentations: fast and flexible image augmentations. Information, 11(2): #125 [DOI: 10.3390/info11020125]
Campos C, Elvira R, Rodríguez J J G, Montiel J M M and Tardós J D. 2021. ORB-SLAM3: an accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6): 1874-1890 [DOI: 10.1109/TRO.2021.3075644]
Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L. 2018a. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI: 10.1109/TPAMI.2017.2699184]
Chen L C, Zhu Y K, Papandreou G, Schroff F and Adam H. 2018b. Encoder-decoder with atrous separable convolution for semantic image segmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 833-851 [DOI: 10.1007/978-3-030-01234-2_49]
Cui M Y, Zhong S P, Liu S Y, Li B Y, Wu C H and Huang K. 2021. Cooperative LiDAR SLAM for multi-vehicles based on edge computing. Journal of Image and Graphics, 26(1): 218-228 [DOI: 10.11834/jig.200441]
Davison A J, Reid I D, Molton N D and Stasse O. 2007. MonoSLAM: real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6): 1052-1067 [DOI: 10.1109/TPAMI.2007.1049]
Du J and Cai G R. 2021. Point cloud semantic segmentation method based on multi-feature fusion and residual optimization. Journal of Image and Graphics, 26(5): 1105-1116 [DOI: 10.11834/jig.200374]
Engel J, Schöps T and Cremers D. 2014. LSD-SLAM: large-scale direct monocular SLAM//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 834-849 [DOI: 10.1007/978-3-319-10605-2_54]
Li Y, Ushiku Y and Harada T. 2019. Pose graph optimization for unsupervised monocular visual odometry//Proceedings of 2019 International Conference on Robotics and Automation (ICRA). Montreal, Canada: IEEE: 5439-5445 [DOI: 10.1109/ICRA.2019.8793706]
Liu Z and Zhang F. 2021. BALM: bundle adjustment for lidar mapping. IEEE Robotics and Automation Letters, 6(2): 3184-3191 [DOI: 10.1109/LRA.2021.3062815]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Mur-Artal R and Tardós J D. 2017. ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5): 1255-1262 [DOI: 10.1109/TRO.2017.2705103]
Sünderhauf N, Pham T T, Latif Y, Milford M and Reid I. 2017. Meaningful maps with object-oriented semantic mapping//Proceedings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vancouver, Canada: IEEE: 5079-5085 [DOI: 10.1109/IROS.2017.8206392]
Sudre C H, Li W Q, Vercauteren T, Ourselin S and Cardoso M J. 2017. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations//Proceedings of the 3rd International Workshop on Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Québec City, Canada: Springer: 240-248 [DOI: 10.1007/978-3-319-67558-9_28]
Taketomi T, Uchiyama H and Ikeda S. 2017. Visual SLAM algorithms: a survey from 2010 to 2016. IPSJ Transactions on Computer Vision and Applications, 9(1): #16 [DOI: 10.1186/s41074-017-0027-2]
Tateno K, Tombari F, Laina I and Navab N. 2017. CNN-SLAM: real-time dense monocular SLAM with learned depth prediction//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6565-6574 [DOI: 10.1109/CVPR.2017.695]
Theckedath D and Sedamkar R R. 2020. Detecting affect states using VGG16, ResNet50 and SE-ResNet50 networks. SN Computer Science, 1(2): #79 [DOI: 10.1007/s42979-020-0114-9]
Wang J K, Zuo X X, Zhao X R, Lyu J J and Liu Y. 2022. Review of multi-source fusion SLAM: current status and challenges. Journal of Image and Graphics, 27(2): 368-389 [DOI: 10.11834/jig.210547]
Xuan Z and David F. 2020. Real-time voxel based 3D semantic mapping with a hand held RGB-D camera [EB/OL]. [2022-03-21]. https://www.ybliu.com/2020/07/3D-semantic-mapping-RGBD.html
Zhao H S, Shi J P, Qi X J, Wang X G and Jia J Y. 2017. Pyramid scene parsing network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6230-6239 [DOI: 10.1109/CVPR.2017.660]
Zheng S X, Lu J C, Zhao H S, Zhu X T, Luo Z K, Wang Y B, Fu Y W, Feng J F, Xiang T, Torr P H S and Zhang L. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 6877-6886 [DOI: 10.1109/CVPR46437.2021.00681]
Zhou B L, Zhao H, Puig X, Fidler S, Barriuso A and Torralba A. 2017. Scene parsing through ADE20K dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5122-5130 [DOI: 10.1109/CVPR.2017.544]