
贾明达, 杨金明, 孟维亮, 郭建伟, 张吉光, 张晓鹏(中国科学院自动化研究所多模态人工智能系统全国重点实验室)

Survey on the fusion of point clouds and images for environmental object detection

Jia Mingda, Yang Jinming, Meng Weiliang, Guo Jianwei, Zhang Jiguang, Zhang Xiaopeng (State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences)

Abstract

In the field of digital simulation and, in particular, autonomous driving, object detection is a crucial component: it perceives objects in the surrounding environment and provides essential information for the decision-making and planning of intelligent systems. Traditional object detection methods typically involve feature extraction, object classification, and position regression on images. However, these methods are limited by hand-crafted features and classifier performance, which restricts their effectiveness in complex scenes and for objects with large appearance variations. With the advent of deep learning, object detection methods based on deep neural networks have been widely adopted; the CNN (convolutional neural network) in particular has become one of the most prominent approaches. Through stacked convolution and pooling operations, CNNs automatically extract meaningful feature representations from image data. In addition to images, LiDAR (light detection and ranging) data plays a crucial role in object detection tasks, particularly in 3D object detection. LiDAR represents an object as a set of unordered, discrete points sampled from its surface, and accurately detecting the point cloud clusters that correspond to objects and estimating their poses from such unordered points is a challenging task. Nevertheless, LiDAR offers high-precision obstacle detection and distance measurement, supporting the perception of surrounding roads, vehicles, and pedestrians. In real-world autonomous driving and related environmental perception scenarios, relying on a single modality presents numerous challenges. For instance, while image data provides rich, high-resolution visual information such as color, texture, and shape, it is highly sensitive to lighting conditions.
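To make the convolution-and-pooling pipeline mentioned above concrete, the following is a minimal, framework-free sketch (not any particular paper's implementation; the toy image and the Prewitt-style edge kernel are chosen purely for illustration) of the two core operations a CNN layer applies to an image:

```python
# Illustrative sketch of the two core CNN operations the text mentions:
# 2-D convolution (cross-correlation) followed by max pooling. Pure Python,
# no framework; all inputs below are toy data, not from any dataset.

def conv2d(image, kernel):
    """Valid (no-padding) 2-D cross-correlation of an image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(fmap[i + u][j + v]
                for u in range(size) for v in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Toy 6x6 image whose right half is bright; a vertical-edge kernel responds
# strongly at the dark-to-bright boundary, and pooling summarizes the response.
image = [[0, 0, 0, 9, 9, 9]] * 6
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]            # Prewitt-style vertical edge filter
features = conv2d(image, kernel)  # 4x4 response map, peaks at the edge
pooled = max_pool2d(features)     # 2x2 pooled summary
```

In a trained CNN the kernel weights are learned rather than hand-set, and many such kernels are stacked into layers; this sketch only illustrates the arithmetic of a single filter.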
Additionally, because of inherent limitations of the camera viewpoint, models may struggle to handle occlusions caused by objects obstructing the view. LiDAR, by contrast, performs well under challenging lighting and accurately localizes objects in space even in diverse and harsh weather, but it has its own limitations: the low resolution of LiDAR data yields sparse point clouds for distant targets, and extracting semantic information from point clouds is more difficult than from images. Hence, a growing number of researchers are focusing on multimodal environmental object detection. A robust multimodal perception algorithm offers richer feature information, better adaptability to diverse environments, and higher detection accuracy, enabling the perception system to deliver reliable results across varied environmental conditions. Of course, multimodal object detection algorithms also face pressing challenges. One is the difficulty of data annotation: labeling point cloud and image data is complex and time-consuming, particularly for large-scale datasets, and the sparsity and noise of point clouds make them especially hard to label accurately. Addressing these issues is crucial for further progress in multimodal object detection. Moreover, point clouds and images, as two distinct sensing modalities, differ significantly in data structure and feature representation; a current research focus is how to effectively integrate the information from the two modalities and extract accurate, comprehensive features that can be exploited effectively. Furthermore, processing large-scale point cloud data is equally challenging.
Point cloud data typically comprises a large number of three-dimensional coordinates, placing greater demands on computing resources and algorithmic efficiency than pure image data. To help researchers gain a deeper and more efficient understanding of object detection algorithms that integrate images and point clouds, this paper summarizes and distills existing approaches. It classifies object detection algorithms according to their input modality: point clouds, images, and multimodal fusion of the two. We analyze the strengths and weaknesses of the various methods and discuss potential solutions. Moreover, we provide a comprehensive review of the development of object detection algorithms that fuse point clouds and images, covering data collection, data representation, and model design. Finally, we offer a perspective on future directions for environmental object detection, with the goal of enhancing the overall capabilities of autonomous systems.
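A common primitive underlying many of the fusion methods surveyed here is projecting LiDAR points into the camera image so that each 3-D point can be associated with image features (as in PointPainting-style sequential fusion). The following hedged sketch shows this step with a simple pinhole camera model; the intrinsic parameters are made up for illustration and real systems additionally need an extrinsic LiDAR-to-camera transform and distortion handling:

```python
# Illustrative pinhole projection of LiDAR points into the image plane,
# the core operation behind projection-based point cloud / image fusion.
# Camera frame convention: x right, y down, z forward (optical axis).
# fx, fy, cx, cy are hypothetical intrinsics, not from any real sensor.

def project_points(points, fx, fy, cx, cy):
    """Project 3-D points (already in the camera frame) to pixel
    coordinates; points behind the camera have no valid projection
    and are skipped."""
    pixels = []
    for x, y, z in points:
        if z <= 0:                  # behind the image plane
            continue
        u = fx * x / z + cx         # horizontal pixel coordinate
        v = fy * y / z + cy         # vertical pixel coordinate
        pixels.append((u, v))
    return pixels

# Toy example: two points in front of the camera, one behind it.
points = [(1.0, 0.0, 10.0), (0.0, -1.0, 5.0), (0.0, 0.0, -2.0)]
pixels = project_points(points, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Once each point has pixel coordinates, the image feature (or semantic score) sampled at that pixel can be appended to the point's attributes before it enters a point cloud detector, which is the essence of sequential "painting" fusion.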