Pure camera-based bird’s-eye-view perception in vehicle side and infrastructure side: a review

Zhou Songran; Lu Yehao; Li Xuewei; Fu Benzun; Wang Jingdong; Li Xi

Published: 2024-05-20
DOI: 10.11834/jig.230387
2024 | Volume 29 | Number 5

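Zhou Songran¹, Lu Yehao², Li Xuewei², Fu Benzun¹, Wang Jingdong³, Li Xi² (1. Polytechnic Institute, Zhejiang University, Hangzhou 310015, China; 2. College of Computer Science and Technology, Zhejiang University, Hangzhou 310007, China; 3. Baidu, Beijing 100085, China)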

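Abstract

Pure camera-based bird’s-eye-view (BEV) perception is a frontier direction and research hotspot in autonomous driving, both in China and internationally. It aims to generate, from 2D camera images, a top-down feature representation of the surrounding road environment in 3D space. The field has advanced rapidly in single-vehicle intelligence and has seen extensive real-world deployment. However, because the mounting height of vehicle-side cameras is limited, single-vehicle systems inevitably face practical problems such as unstable long-range perception and driving blind spots, so single-vehicle intelligence still carries safety risks. Roadside cameras mounted on elevated infrastructure such as traffic-light poles can effectively extend the perception range of intelligent vehicles and fill in blind spots. Vehicle-infrastructure cooperation is therefore gradually becoming the development trend in autonomous driving. Accordingly, this paper divides pure camera-based BEV perception into three directions according to camera deployment side and camera view: vehicle-side single-view perception, vehicle-side surround-view perception, and infrastructure-side fixed-view perception. For each direction, it traces the technical development starting from the general processing pipeline and reviews three modules: mainstream datasets, BEV mapping models, and task inference outputs. The paper also introduces the basic principles of camera imaging systems, analyzes existing methods quantitatively through statistics on backbone network usage, GPU (graphics processing unit) type usage, and model performance, and analyzes them qualitatively through visual comparison. Finally, it identifies the open problems of current pure camera-based BEV perception from two angles: technical challenges, such as diverse scenes and widely varying object scales, and deployment challenges, such as poor transferability across camera geometric parameters and limited computing resources. It then gives an outlook on the field in four directions: vehicle-infrastructure cooperation, vehicle-vehicle cooperation, virtual reality interaction, and unified multi-task foundation models. By summarizing existing research and future trends in pure camera-based BEV perception, this paper aims to give researchers in related fields a comprehensive reference and directions to explore.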

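Keywords

autonomous driving perception; pure camera-based BEV perception; infrastructure-side fixed-view perception; vehicle-side moving-view perception; multi-view image fusion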

Pure camera-based bird’s-eye-view perception in vehicle side and infrastructure side: a review

Zhou Songran¹, Lu Yehao², Li Xuewei², Fu Benzun¹, Wang Jingdong³, Li Xi² (1. Polytechnic Institute, Zhejiang University, Hangzhou 310015, China; 2. College of Computer Science and Technology, Zhejiang University, Hangzhou 310007, China; 3. Baidu, Beijing 100085, China)

Abstract

As a key technology for 3D perception in autonomous driving, pure camera-based bird’s-eye-view (BEV) perception aims to generate a top-down representation of the surrounding traffic environment using only 2D images captured by cameras. In recent years, it has gained considerable attention in the computer vision research community. BEV has great potential because it can represent image features from multiple camera viewpoints in a unified space and provide explicit position and size information for target objects. While most BEV methods focus on perception with ego-vehicle sensors, researchers have in recent years gradually recognized the importance of intelligent roadside cameras for extending perception beyond the vehicle’s visual range. However, this new and growing research field has not been reviewed recently. This paper presents a comprehensive review of pure camera-based BEV perception technology, organized by camera deployment side and camera view into three categories: 1) vehicle-side single-view perception, 2) vehicle-side surround-view perception, and 3) infrastructure-side fixed-view perception. The typical processing flow, which contains three primary parts (dataset input, BEV model, and task inference output), is also introduced. In the task inference output section, four typical tasks in 3D perception for autonomous driving (i.e., 3D object detection, 3D lane detection, BEV map segmentation, and high-definition map generation) are described in detail. To support convenient retrieval, this study summarizes the supported tasks and official links of various datasets and provides open-source code links for representative BEV models in table format. The performance of various BEV models on public datasets is also analyzed and compared.
To the best of our knowledge, three types of challenging problems remain to be resolved in BEV perception: 1) Scene uncertainty: in open-road scenarios, many scenes never appear in the training dataset. These can include adverse conditions such as dark nights, strong winds, heavy rain, and thick fog. A model’s reliability must not degrade in these unusual circumstances, yet the majority of BEV models suffer considerable performance degradation when exposed to varying road scenarios. 2) Scale uncertainty: autonomous driving perception tasks involve many targets of extreme scale. For example, in a roadside scenario, placing a camera on a traffic signal or streetlight pole at least 3 m above the ground helps detect more distant targets. However, because distant targets appear at extremely small scale, existing BEV models have serious issues with false and missed detections. 3) Camera parameter sensitivity: most existing BEV models depend on precisely calibrated intrinsic and extrinsic camera parameters during training and evaluation. Their performance drops drastically if noisy extrinsic parameters are used or unseen intrinsic parameters are supplied. A comprehensive outlook on the development of pure camera-based BEV perception is also given: 1) Vehicle-to-infrastructure (V2I) cooperation: V2I cooperation refers to integrating information from the vehicle side and the infrastructure side to accomplish the visual perception tasks of autonomous driving under communication bandwidth constraints. A well-designed vehicle-infrastructure integrated perception algorithm can bring remarkable benefits, such as supplementing blind spots, expanding the field of view, and improving perception accuracy. 2) Vehicle-to-vehicle (V2V) cooperation: V2V cooperation means that connected autonomous vehicles (CAVs) can share collected data with each other under communication bandwidth constraints.
CAVs can collaborate to compensate for data shortages and expand the field of view for vehicles in need, thereby augmenting perception capabilities, boosting detection accuracy, and improving driving safety. 3) Multitask learning: multitask learning aims to optimize multiple tasks simultaneously, improving the efficiency and performance of algorithms and reducing model complexity. In BEV models, the generated BEV features suit many downstream tasks, such as 3D object detection and BEV map segmentation. Sharing models can substantially increase the parameter sharing rate, save computing costs, reduce training time, and improve model generalization. This review aims to provide a comprehensive guide and reference for researchers in related fields by thoroughly summarizing and analyzing existing research and future trends in pure camera-based BEV perception.

Keywords

autonomous driving perception; pure camera-based BEV perception; infrastructure-side perception; vehicle-side perception; multi-view image fusion