Monocular visual SLAM for dynamic scenes based on semantic probability prediction

Pan Xiaokun; Liu Haomin; Fang Ming; Wang Zheng; Zhang Yong; Zhang Guofeng

Published: 2023-07-19
DOI: 10.11834/jig.210632
2023 | Volume 28 | Number 7

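Pan Xiaokun¹, Liu Haomin², Fang Ming¹, Wang Zheng³, Zhang Yong³, Zhang Guofeng¹ (1. State Key Laboratory of CAD and CG, Zhejiang University, Hangzhou 310058, China; 2. SenseTime Research, Beijing 100080, China; 3. China Information Consulting and Designing Institute Co., Ltd., Beijing 100044, China)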

Abstract

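Objective Visual-based simultaneous localization and mapping (vSLAM) is a key technology in computer vision and robotics. It processes and analyzes input images to perceive the surrounding 3D environment and to localize itself. Most existing SLAM systems rely on the static-world assumption, and dynamic objects in real environments seriously affect the stable operation of a visual SLAM system. At the same time, whether part of a scene is static or moving is often closely related to its semantics, so semantic information can be used to improve the stability of visual SLAM in dynamic environments. To this end, a new monocular visual SLAM algorithm for dynamic scenes based on semantic probability prediction is proposed. Method The method combines semantic segmentation results with a robust estimation algorithm. It performs data association and state detection on the segmentation, represents the static/moving state of each observation as a probability, removes the interference of observations on dynamic objects with camera pose estimation, and uses the motion probability to remove invalid map points in time, so that the system still runs stably in complex dynamic scenes. Results On the complex dynamic scene dataset built in this work, the proposed method significantly outperforms existing monocular visual SLAM methods in both tracking accuracy and tracking completeness, and it also achieves better results on multiple highly dynamic sequences of the TUM-RGBD dataset. In addition, mapping quality and augmented reality (AR) effects in dynamic scenes are compared qualitatively, and the results show that the proposed method is clearly better than the compared methods. Conclusion By combining semantic segmentation information with a robust estimation algorithm, the method performs data association and motion state detection on the segmented regions, represents the motion state of 2D observations as probabilities, and removes invalid map points in time. This clearly improves the accuracy of camera pose estimation and the quality of mapping, and effectively improves the robustness of monocular visual SLAM in highly dynamic environments.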
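The idea of representing each observation's static/moving state as a probability can be illustrated with a toy weighted estimate. The sketch below is not the paper's pose estimator: it assumes a pure 2D image translation in place of a full camera pose, and the function name `weighted_translation` and the probability values are hypothetical. It only shows how weighting each observation by its static probability limits the influence of observations on moving objects.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def weighted_translation(prev_pts: List[Point],
                         curr_pts: List[Point],
                         static_prob: List[float]) -> Tuple[float, float]:
    """Weighted least-squares estimate of a 2D translation.

    Each correspondence is weighted by the (assumed) probability that the
    observed point is static, so likely-moving points contribute little.
    """
    total_w = sum(static_prob)
    if total_w <= 0.0:
        raise ValueError("all observations have zero static probability")
    tx = sum(w * (c[0] - p[0])
             for p, c, w in zip(prev_pts, curr_pts, static_prob)) / total_w
    ty = sum(w * (c[1] - p[1])
             for p, c, w in zip(prev_pts, curr_pts, static_prob)) / total_w
    return tx, ty


# Hypothetical data: four static points shifted by (1.0, 0.5) through camera
# motion, and two points on a moving object shifted by (5.0, -3.0).
prev_pts = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0),
            (4.0, 4.0), (6.0, 6.0)]
curr_pts = [(1.0, 0.5), (11.0, 0.5), (1.0, 10.5), (11.0, 10.5),
            (9.0, 1.0), (11.0, 3.0)]

uniform = weighted_translation(prev_pts, curr_pts, [1.0] * 6)
weighted = weighted_translation(prev_pts, curr_pts,
                                [0.95, 0.95, 0.95, 0.95, 0.05, 0.05])
```

With uniform weights the two moving points pull the estimate far from the true shift, while the probability-weighted estimate stays close to (1.0, 0.5). A soft weight, unlike a hard mask, also tolerates uncertain labels near object boundaries.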

Keywords

visual SLAM (vSLAM); semantic segmentation; dynamic scenes; robust estimation; probability prediction

Dynamic 3D scenario-oriented monocular SLAM based on semantic probability prediction

Pan Xiaokun¹, Liu Haomin², Fang Ming¹, Wang Zheng³, Zhang Yong³, Zhang Guofeng¹ (1. State Key Laboratory of CAD and CG, Zhejiang University, Hangzhou 310058, China; 2. SenseTime Research, Beijing 100080, China; 3. China Information Consulting and Designing Institute Co., Ltd., Beijing 100044, China)

Abstract

Objective Visual-based simultaneous localization and mapping (vSLAM) is essential in computer vision and robotics. It analyzes input images to recover the 3D structure of the scene and the pose of the camera. Most existing vSLAM systems rely on a static-world assumption, and dynamic objects in real environments seriously affect their stability, which limits their application in dynamic environments. Geometry-based methods alleviate the negative effect of dynamic objects by checking geometric constraints from 3D vision, such as the epipolar constraint and the re-projection error. Recent deep-learning-based semantic segmentation provides SLAM systems with more effective information, because the static and dynamic parts of a scene are often closely related to their semantics. In principle, lifting image information from the pixel level to the semantic level allows a vSLAM system to run stably in dynamic environments. However, some semantic-based SLAM schemes directly remove semantic objects from the scene based on the segmentation results, without considering their motion states. In common real scenes this may remove areas that could provide stable visual features, and the resulting lack of observations severely affects the stability of the SLAM system. A more feasible approach is to analyze the motion state of each semantic cluster and then apply suitable strategies to alleviate the influence of moving objects on the visual localization module of vSLAM. The challenge of updating the map in dynamic scenes must be resolved as well.

Method A new monocular vSLAM algorithm based on semantic probability prediction is developed, which combines semantic segmentation with a robust estimation algorithm. To simplify network training and generalization, a learning-based semantic segmentation module is adopted. First, the segmentation of each input image is associated with the segmentation results of existing frames, so that each cluster can be tracked in 2D image space and the segmentation stays temporally consistent. Next, a robust estimation method and geometric constraints, together with this temporal consistency information, are used to detect the motion state of the clusters. Determining whether each cluster is actually dynamic avoids simply discarding every possibly moving object in the scene, which would reduce the number of observations and further weaken the robustness of the system. Finally, the cluster state is used to express the static/dynamic state of each 2D observation as a probability. This absorbs the uncertainty at the edges of dynamic objects in image space and reduces the interference of observations on dynamic objects with camera pose estimation. Besides eliminating moving observations, maintaining a spatially consistent sparse map also benefits the stability of monocular vSLAM in dynamic scenes. Dynamic objects introduce invalid map points into the map, making it inconsistent with the real 3D structure of the scene, which puts long-duration operation of the SLAM system at risk. This problem is handled from the perspective of probability: a map point is considered invalid once the motion probabilities of its previous 2D observations, compared against a certain threshold, indicate that it can no longer be trusted as static. The system removes invalid map points in time and can run stably in highly dynamic scenes for a long time.

Result The proposed method is evaluated both quantitatively and qualitatively. Compared with existing monocular vSLAM systems, it achieves better absolute trajectory error (ATE) on the highly dynamic TUM-RGBD sequences. This metric reflects effectiveness in real-time visual localization applications such as AR. Because the motion range of the dynamic objects (people) in these sequences is limited and the scenes are relatively similar, we also recorded our own challenging dynamic dataset with a VICON motion capture system. In these scenes, which contain complex objects and multi-view motion, our method has a clear advantage in ATE and in tracking completeness. Additionally, we qualitatively compare mapping quality and an augmented reality (AR) application in highly dynamic scenes. The sparse map built by our method stays consistent with the real environment, changing in step with the dynamic objects moving in the scene, which benefits map-based downstream applications such as robot path planning. In the AR application, a virtual cube stays stably fixed in the 3D scene and does not drift when dynamic objects disturb the scene.

Conclusion By combining semantic segmentation information with a robust estimation algorithm, the proposed method carries out data association and motion state detection on the segmented areas, represents the motion state of 2D observations in the form of probabilities, and eliminates invalid map points in time. This significantly improves the accuracy of camera pose estimation, the quality of mapping, and the robustness of monocular vSLAM in highly dynamic environments.
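The probabilistic bookkeeping described in the Method paragraph can be sketched as follows. The abstract does not state the update rule or the thresholds, so everything here is an assumption for illustration: a log-odds (recursive Bayesian) update of a per-cluster motion probability, a hypothetical detector reliability `p_hit`, and a hypothetical cutoff `min_static_prob` below which a map point is removed.

```python
import math
from dataclasses import dataclass
from typing import Dict


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class Cluster:
    """A tracked segmentation cluster with a running motion probability."""
    log_odds: float = 0.0  # log-odds of "moving"; 0.0 means probability 0.5

    def update(self, detected_moving: bool, p_hit: float = 0.7) -> None:
        # Assumed update rule: add or subtract a fixed evidence step per
        # frame, then clamp so that old evidence can still be overturned.
        step = logit(p_hit)
        self.log_odds += step if detected_moving else -step
        self.log_odds = max(-6.0, min(6.0, self.log_odds))

    @property
    def motion_prob(self) -> float:
        return sigmoid(self.log_odds)


def cull_map_points(point_to_cluster: Dict[int, int],
                    clusters: Dict[int, Cluster],
                    min_static_prob: float = 0.3) -> Dict[int, int]:
    """Keep only map points whose latest observation lies in a cluster
    that is still sufficiently likely to be static."""
    return {pid: cid for pid, cid in point_to_cluster.items()
            if 1.0 - clusters[cid].motion_prob >= min_static_prob}


# Hypothetical example: cluster 0 is a desk, cluster 1 is a walking person.
clusters = {0: Cluster(), 1: Cluster()}
for _ in range(3):
    clusters[0].update(detected_moving=False)
    clusters[1].update(detected_moving=True)

kept = cull_map_points({10: 0, 11: 0, 20: 1}, clusters)
```

After three consistent detections the person cluster's static probability falls below the cutoff, so map point 20 is dropped while points 10 and 11 on the desk are kept. Accumulating evidence over frames rather than trusting a single detection is one simple way to use the temporal consistency the abstract mentions.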

Keywords

visual-based simultaneous localization and mapping (vSLAM); semantic segmentation; dynamic environment; robust estimation; probability prediction
