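Si Shubin1, Zhao Dawei2, Xu Wanying3, Zhang Yonggang1, Dai Bin2 (1. College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001; 2. National Innovation Institute of Defense Technology, Beijing 100071; 3. College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073)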
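Visual-inertial navigation and positioning is a passive navigation method that uses visual and inertial sensors to localize the carrier and perceive the surrounding environment. It can estimate the carrier's six-degree-of-freedom pose in environments where the global positioning system (GPS) is denied. Visual sensors and low-accuracy inertial sensors are small and inexpensive, and because the two are complementary in navigation and positioning tasks, the visual-inertial navigation system (VINS) has attracted great attention. It plays an important role in mobile virtual reality (VR), augmented reality (AR), and the autonomous navigation of unmanned systems, and it has both theoretical research value and practical application demand. This paper introduces the visual-inertial navigation system and summarizes research progress in its key technologies, including initialization, visual front-end processing, state estimation, map construction and maintenance, and information fusion. It reviews hot topics such as visual-inertial navigation and positioning algorithms for non-ideal environments and learning-based algorithms, summarizes the methods and standard datasets used for algorithm evaluation, describes the main problems the technology faces in practical applications, and discusses future development trends in the field in light of these problems.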
Review on visual-inertial navigation and positioning technology
Si Shubin1, Zhao Dawei2, Xu Wanying3, Zhang Yonggang1, Dai Bin2 (1. College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China; 2. National Innovation Institute of Defense Technology, Beijing 100071, China; 3. College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China)
Visual-inertial navigation and positioning is a passive navigation method that estimates ego-motion and perceives the surrounding environment. In particular, it can estimate the six-degree-of-freedom (DOF) pose of the carrier in GPS-denied environments, such as indoor and underwater environments, and can even play a role in space exploration. From a biological point of view, visual-inertial navigation is also a bionic navigation method, because humans and animals navigate and localize themselves through visual and motion perception. Visual-inertial integrated navigation has three significant advantages. First, the sensors are small and inexpensive. Second, unlike active navigation, a visual-inertial navigation system (VINS) does not rely on external auxiliary devices; it can navigate and localize independently, without exchanging information with the external environment. Finally, visual and inertial sensors have highly complementary characteristics. Visual navigation has a low output frequency and accumulates no error when the carrier is stationary, but it is susceptible to changes in the external environment and cannot cope with fast motion. Inertial navigation, by contrast, has a high output frequency, is robust to changes in the external environment, and can accurately capture rapid motion of the carrier, but it accumulates error over time. VINS plays an important role in mobile virtual reality, augmented reality, and the autonomous navigation of unmanned systems, and it has both theoretical research value and practical application demand. In recent years, visual-inertial navigation technology has developed rapidly, and many excellent works have emerged that have advanced its theory.
At present, the structure of the algorithm is relatively fixed, and state-of-the-art VINS reach centimeter-level positioning accuracy in some small-scale structured scenes. However, VINS faces many problems in complex practical scenes. On the one hand, real-time performance is difficult to guarantee because visual image processing and back-end optimization impose a large computational burden, and the scale of the map puts pressure on memory consumption. On the other hand, performance is poor in low-texture, changing-illumination, large-scale, and dynamic scenes. These complex environments directly affect the output of the visual front-end, are often difficult to handle with traditional geometric methods, and challenge the stability of VINS, which makes them the major obstacle to its large-scale application at present. Given the strength of deep learning in image processing, some researchers have attempted to use it to replace traditional image processing, or even to abandon the traditional VINS framework and estimate poses directly with an end-to-end framework. Learning-based methods can exploit the rich semantic information in images and have advantages in complex environments such as dynamic scenes. The purpose of this article is to help readers interested in VINS quickly understand the current state of research in this field and promising future research directions. VINS is first introduced, and then the research progress in key technologies of the system, such as initialization, visual front-end processing, state estimation, map construction and maintenance, and information fusion, is summarized. In addition, some hot issues, such as visual-inertial navigation algorithms in non-ideal environments and learning-based localization algorithms, are reviewed.
The standard datasets used for algorithm evaluation are summarized, and future development trends in this field are discussed.
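The complementary behavior described in the abstract (high-rate inertial integration accumulates error, while low-rate visual measurements do not drift when the carrier is stationary) can be illustrated with a toy one-dimensional simulation. This is a minimal sketch under stated assumptions, not a method from the reviewed literature: the function name `simulate`, the sensor rates, the accelerometer bias, and the correction gain are all hypothetical values chosen for illustration, and the visual measurement is idealized as noise-free.

```python
def simulate(duration_s=10.0, imu_rate_hz=200, cam_rate_hz=20,
             accel_bias=0.05, gain=0.2):
    """Toy 1-D example: a stationary carrier with a biased accelerometer.

    Returns (imu_only_error, fused_error), the absolute position errors in
    meters at the end of the run. All default values are illustrative
    assumptions, not figures taken from the paper.
    """
    dt = 1.0 / imu_rate_hz
    steps_per_frame = imu_rate_hz // cam_rate_hz
    frame_dt = steps_per_frame * dt
    true_pos = 0.0  # the carrier does not move

    pos_imu, vel_imu = 0.0, 0.0      # inertial-only dead reckoning
    pos_fused, vel_fused = 0.0, 0.0  # inertial prediction + visual correction

    n_steps = int(duration_s * imu_rate_hz)
    for k in range(1, n_steps + 1):
        # True acceleration is zero, so the measurement is pure bias.
        accel_meas = accel_bias

        # High-rate inertial integration: the bias is integrated twice,
        # so the position error grows roughly quadratically with time.
        vel_imu += accel_meas * dt
        pos_imu += vel_imu * dt

        vel_fused += accel_meas * dt
        pos_fused += vel_fused * dt

        # Low-rate visual update: an idealized, drift-free position fix
        # pulls the fused estimate back (a simple fixed-gain correction).
        if k % steps_per_frame == 0:
            residual = true_pos - pos_fused
            pos_fused += gain * residual
            vel_fused += gain * residual / frame_dt

    return abs(pos_imu - true_pos), abs(pos_fused - true_pos)


if __name__ == "__main__":
    imu_err, fused_err = simulate()
    print(f"inertial-only error: {imu_err:.3f} m")
    print(f"fused error:         {fused_err:.5f} m")
```

With these assumed values, the inertial-only estimate drifts by meters within ten seconds, while the fused estimate stays bounded. Practical VINS replace the fixed-gain correction with filtering or optimization-based state estimation, which is the subject of the state estimation and information fusion topics surveyed in the paper.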