A Survey of Lightweight Visual Localization Technology

Ye Hanqiao1,2, Liu Yangdong2, Shen Shuhan1,2 (1. School of Artificial Intelligence, University of Chinese Academy of Sciences; 2. Institute of Automation, Chinese Academy of Sciences)

Abstract
Visual localization aims to recover the camera pose of a query image with respect to a known 3D scene. With its low cost, high accuracy, and ease of integration, visual localization is one of the key technologies enabling intelligent interaction between computing devices and the real world, and it has attracted wide attention from application domains such as mixed reality and autonomous driving. As a long-standing fundamental task in computer vision, visual localization has achieved remarkable research progress. However, existing methods generally suffer from excessive computational overhead and storage consumption, which hinders efficient deployment on mobile devices as well as the updating and maintenance of scene models, thereby largely limiting the practical application of visual localization technology. To address this problem, a growing body of research has begun to focus on the lightweight development of visual localization. Lightweight visual localization aims to investigate more efficient scene representations and the corresponding localization methods, and it is gradually becoming an important research direction in the field. This survey first reviews early visual localization frameworks, and then categorizes existing lightweight visual localization methods from the perspective of scene representation. For each category, the survey analyzes and summarizes its characteristic advantages, application scenarios, and technical difficulties, and introduces representative works. Furthermore, this survey compares representative lightweight visual localization methods on commonly used indoor and outdoor datasets, with evaluation metrics covering three dimensions: offline mapping time, storage footprint of the scene map, and localization accuracy. Existing lightweight visual localization techniques still face many difficulties and challenges; there remains considerable room for improvement in the representational capability of scene models and in the generalization and robustness of localization methods. Finally, this survey analyzes and provides an outlook on future development trends of lightweight visual localization.
Keywords
Review on lightweight visual-based localization technology

Ye Hanqiao1,2, Liu Yangdong1, Shen Shuhan1,2 (1. Institute of Automation, Chinese Academy of Sciences; 2. School of Artificial Intelligence, University of Chinese Academy of Sciences)

Abstract
Visual-based localization determines the camera position and orientation of an image observation with respect to a pre-built 3D representation of the environment. It is an essential technology that enables intelligent interaction between computing devices and the real world. Compared with alternative positioning systems, the ability to accurately estimate the 6DOF camera pose, along with the flexibility and frugality of its deployment, makes visual-based localization a cornerstone of many applications, ranging from autonomous vehicles to Augmented and Mixed Reality. As a long-standing problem in computer vision, visual localization has made substantial progress over the past decades. A primary branch of prior art relies on a pre-constructed 3D map obtained with Structure-from-Motion (SfM) techniques. Such 3D maps, a.k.a. SfM point clouds, store 3D points and per-point visual features. To estimate the camera pose, these methods typically establish correspondences between 2D keypoints detected in the query image and 3D points of the SfM point cloud through descriptor matching. The 6DOF camera pose of the query image is then recovered from these 2D-3D matches by leveraging geometric principles rooted in photogrammetry. Despite delivering sound and reliable performance, such a scheme often consumes several gigabytes of storage for a single scene, which results in prohibitive computational overhead and memory footprint for large-scale applications as well as resource-constrained platforms. Furthermore, it suffers from other drawbacks such as costly map maintenance and privacy vulnerability. These issues pose a significant bottleneck in real-world applications and have thus prompted researchers to shift their focus toward leaner solutions.
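As a minimal illustration of the geometric step described above, the sketch below shows how a candidate pose can be verified against 2D-3D matches by counting reprojection inliers, as done inside RANSAC-style PnP solvers. All data and names here are illustrative (toy intrinsics, identity pose, simplified pinhole model), not the implementation of any specific system:

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (N,3) into the image with a pinhole camera model."""
    x_cam = X @ R.T + t            # world frame -> camera frame
    x_img = x_cam @ K.T            # camera frame -> homogeneous pixel coords
    return x_img[:, :2] / x_img[:, 2:3]

def count_inliers(K, R, t, pts2d, pts3d, thresh=3.0):
    """Count 2D-3D matches whose reprojection error is below thresh pixels."""
    err = np.linalg.norm(project(K, R, t, pts3d) - pts2d, axis=1)
    return int((err < thresh).sum())

# toy scene: identity pose, simple intrinsics
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
pts3d = np.array([[0., 0., 5.], [1., 0., 5.], [0., 1., 5.], [1., 1., 4.]])
pts2d = project(K, R, t, pts3d)
pts2d[3] += 50.0                   # simulate one wrong (outlier) match
n = count_inliers(K, R, t, pts2d, pts3d)
print(n)  # -> 3
```

In a full pipeline, a minimal solver (e.g. P3P) proposes candidate poses from sampled matches, and the pose with the most inliers under this test is kept and refined.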
Lightweight visual-based localization seeks to improve both the scene representation and the associated localization methods, making the resulting framework computationally tractable and memory-efficient without incurring a notable loss of performance. As background, this review first introduces several flagship frameworks for the visual-based localization task as preliminaries. These frameworks can be broadly classified into three categories: image-retrieval-based methods, structure-based methods, and hierarchical methods. The 3D scene representations adopted in these conventional frameworks, such as reference image databases and SfM point clouds, generally exhibit a high degree of redundancy, which causes excessive memory usage and makes scene features less distinctive for descriptor matching. Next, this review provides a guided tour of recent advances that improve the compactness of 3D scene representations and the efficiency of the corresponding visual localization methods. From the perspective of scene representation, existing research efforts in lightweight visual localization can be classified into six categories. Within each category, this review analyzes its characteristics, application scenarios, and technical limitations, while also surveying representative works. First, several methods enhance memory efficiency by compressing the SfM point cloud. These methods reduce the size of SfM point clouds through a combination of techniques including feature quantization, keypoint subset sampling, and feature-free matching. Extreme compression rates, such as 1% and below, can be achieved with barely noticeable accuracy degradation. Employing line maps as the scene representation has also become a focus of research in lightweight visual localization.
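The keypoint subset sampling mentioned above is often cast as a coverage problem over the point-image visibility relation: keep as few 3D points as possible while every database image still observes at least K selected points. The sketch below is a hedged toy version of such a greedy K-cover heuristic on illustrative data; it is not the algorithm of any specific paper:

```python
import numpy as np

def k_cover(visibility, k):
    """Greedy K-cover point selection for SfM map compression (toy sketch).
    visibility: boolean matrix (num_points, num_images); entry [p, i] is True
    if 3D point p is observed in database image i. Returns indices of a point
    subset such that every image sees at least k selected points, if possible."""
    num_points, num_images = visibility.shape
    need = np.full(num_images, k)              # remaining coverage per image
    selected, remaining = [], list(range(num_points))
    while need.max() > 0 and remaining:
        # gain of a point = number of still-under-covered images it appears in
        gains = {p: int(visibility[p, need > 0].sum()) for p in remaining}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break                              # no remaining point helps
        selected.append(best)
        remaining.remove(best)
        need = np.maximum(need - visibility[best].astype(int), 0)
    return selected

# toy visibility: 6 points observed across 3 database images
vis = np.array([
    [1, 1, 1],   # point 0 is seen everywhere
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
], dtype=bool)
sel = k_cover(vis, k=2)
print(sorted(sel), vis[sel].sum(axis=0))  # each image now sees >= 2 kept points
```

The greedy rule (always pick the point covering the most still-deficient images) is the standard heuristic for set-cover-type objectives; real systems additionally weight points by track length or descriptor distinctiveness.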
In human-made scenes characterized by salient structural features, substituting line maps for point clouds offers two major merits: 1) the abundance and rich geometric properties of line segments make line maps a concise option for depicting the environment; 2) line features are more robust in weakly textured areas and under temporally varying lighting conditions. However, the lack of a unified line descriptor and the difficulty of establishing 2D-3D correspondences between 3D line segments and image observations remain the main challenges. In the field of autonomous driving, high-definition maps constructed from vectorized semantic features have unlocked a new wave of cost-effective and lightweight visual localization solutions for self-driving vehicles. Recent trends involve the use of data-driven techniques to learn to localize. This end-to-end philosophy has given rise to two families of regression-based methods. Scene coordinate regression (SCR) methods eschew explicit feature extraction and matching; instead, they establish a direct mapping from observations to scene coordinates through regression. While a grounding in geometry remains essential for camera pose estimation in SCR methods, pose regression methods employ deep neural networks to map image observations directly to camera poses without any explicit geometric reasoning. Absolute pose regression (APR) techniques are akin to image retrieval approaches, with limited accuracy and generalization capability, while relative pose regression (RPR) techniques typically serve as a post-processing step following a coarse localization stage. Neural radiance fields (NeRF) and related volumetric approaches have emerged as a novel form of neural implicit scene representation.
While visual localization based solely on a learned volumetric implicit map is still in an exploratory phase, the progress made over the past year or two has already yielded impressive scene representation capability and localization precision. Furthermore, this study quantitatively evaluates several representative lightweight visual localization methods on well-known indoor and outdoor datasets. The evaluation metrics cover offline mapping time, storage demand, and localization accuracy. It is concluded that SCR methods generally stand out among existing work, boasting remarkably compact scene maps and high localization success rates. Existing lightweight visual localization methods have dramatically pushed the performance boundary. However, challenges remain in scalability and robustness when scaling to larger scenes and handling considerable visual disparity between query and mapping images. Therefore, substantial efforts are still needed to further improve the compactness of scene representations and the robustness of localization methods. Finally, this review provides an outlook on developing trends in the hope of facilitating future research.
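The localization accuracy metric mentioned above is conventionally reported as a pair of pose errors: the Euclidean translation error and the angular rotation error between the estimated and ground-truth camera poses. A small self-contained sketch of this standard computation (toy poses for illustration):

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Standard localization metrics: translation error (in map units) and
    rotation error (in degrees) between estimated and ground-truth poses."""
    t_err = np.linalg.norm(t_est - t_gt)
    # rotation error = angle of the relative rotation R_est^T @ R_gt,
    # recovered from its trace: trace(R) = 1 + 2*cos(theta)
    cos_theta = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos_theta))
    return t_err, r_err

# ground truth: identity pose; estimate: 10 degrees of yaw, 0.05 units offset
a = np.radians(10.0)
R_est = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
t_err, r_err = pose_errors(R_est, np.array([0.05, 0.0, 0.0]),
                           np.eye(3), np.zeros(3))
print(round(t_err, 3), round(r_err, 1))  # -> 0.05 10.0
```

Benchmarks such as 7-Scenes or Cambridge Landmarks then report the fraction of queries localized within thresholds like (5 cm, 5 deg), which is the success rate referred to above.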
Keywords
