
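Li Yang1,2, Wu Xiaoqun1,2(1.School of Computer Science and Engineering, Beijing Technology and Business University, Beijing 100048;2.Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing 100048)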

Abstract
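RGB-D images contain rich multilevel features, such as low-level line and planar features and high-level semantic features. Multilevel features extracted from RGB-D images can serve as prior knowledge to improve the output quality of various tasks such as indoor scene reconstruction and SLAM (simultaneous localization and mapping), and their extraction is one of the active research topics in computer graphics. Traditional multilevel feature extraction algorithms generally rely on the rich color and texture information in RGB images and the geometric information in depth images. Such algorithms depend on the quality of the input RGB-D images, yet high-quality RGB-D images are difficult to obtain because of environmental and human factors during acquisition. With the rapid development of deep learning, deep learning-based multilevel feature extraction algorithms have broken through this limitation, and a number of high-quality research results have emerged. This paper reviews multilevel feature extraction algorithms for RGB-D images. First, it summarizes the RGB-D datasets commonly used for multilevel feature extraction and the metrics used to evaluate the quality of the related algorithms. Then, it reviews algorithms for line, planar, and semantic features in turn, according to the level at which each feature resides. In addition, it compares the advantages and disadvantages of the algorithms and analyzes them quantitatively using common quality evaluation criteria. Finally, it discusses the open problems that current multilevel feature extraction algorithms must address and outlines future development trends.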
Survey of multilevel feature extraction methods for RGB-D images

Li Yang1,2, Wu Xiaoqun1,2(1.School of Computer Science and Engineering, Beijing Technology and Business University, Beijing 100048, China;2.Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing 100048, China)

RGB-D images contain rich multilevel features, such as low-level line and planar features and high-level semantic features. These different levels of features provide valuable information for various computer vision tasks: by leveraging them, algorithms can extract meaningful information from RGB-D images and improve performance on tasks including object detection, tracking, and indoor scene reconstruction. Line features in a single RGB-D image are commonly described with terms such as feature lines and contour lines. They carry crucial information about spatial relationships and boundaries in the input image, aiding the understanding and interpretation of the input data. Planar features, described with terms such as plane and surface, refer to flat or nearly flat regions in the RGB-D image. Objects are described with instance and semantic labels. Instance labels are unique identifiers assigned to individual occurrences of objects in an image, whereas semantic labels denote the broad class or category to which an object belongs; semantic labels therefore provide a high-level understanding of the image by grouping objects into meaningful categories that indicate the general type of each object. Traditional line feature extraction methods often combine the color and texture information of the RGB image with the geometric information of the depth image to extract feature and contour lines. Planar feature extraction typically relies on clustering to group points with similar properties into sets, from which planar features are then extracted. Semantic feature extraction aims to assign a specific semantic category to each pixel of the RGB-D input, and most methods for this task are based on deep learning.
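The clustering step behind planar feature extraction can be illustrated with a minimal sketch. It assumes every point already has a unit surface normal (for example, one estimated from the depth image) and uses one simple, hypothetical grouping criterion: a point joins a seed's cluster if its normal lies within an angle threshold of the seed normal and its distance to the seed's plane is below a distance threshold. The function name `cluster_planes` and the default thresholds are illustrative assumptions, not a method taken from the surveyed literature.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cluster_planes(points: List[Vec], normals: List[Vec],
                   angle_deg: float = 10.0, dist_thresh: float = 0.02,
                   min_size: int = 50) -> List[int]:
    """Greedy plane clustering (hypothetical criterion, for illustration only).

    Returns one label per point; -1 marks points that belong to no plane.
    Normals are assumed to be unit length.
    """
    n = len(points)
    labels = [-1] * n
    cos_thresh = math.cos(math.radians(angle_deg))
    next_label = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        nrm = normals[seed]
        offset = -dot(nrm, points[seed])  # seed plane: nrm . x + offset = 0
        members = [
            i for i in range(n)
            if labels[i] == -1
            and dot(normals[i], nrm) >= cos_thresh               # similar orientation
            and abs(dot(points[i], nrm) + offset) <= dist_thresh  # close to seed plane
        ]
        if len(members) >= min_size:  # discard clusters too small to be planes
            for i in members:
                labels[i] = next_label
            next_label += 1
    return labels


# Toy scene: a 10 x 10 patch of floor (z = 0) and a 10 x 10 patch of wall (x = 0).
floor = [(0.5 + i / 10, j / 10, 0.0) for i in range(10) for j in range(10)]
wall = [(0.0, j / 10, 0.5 + i / 10) for i in range(10) for j in range(10)]
labels = cluster_planes(floor + wall, [(0.0, 0.0, 1.0)] * 100 + [(1.0, 0.0, 0.0)] * 100)
print(sorted(set(labels)))  # [0, 1]
```

Because each cluster is defined by a single seed, a noisy seed normal can bias the whole cluster; a fuller pipeline might refit the plane parameters of each cluster, which this sketch omits.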
The multilevel features extracted from RGB-D images can serve as prior knowledge to improve output quality in tasks such as indoor scene reconstruction, scene understanding, and object recognition, and multilevel feature extraction for RGB-D images is one of the popular topics in computer graphics. With the development and popularization of commercial depth cameras, acquiring RGB-D data has become increasingly convenient. However, the quality of captured RGB-D data is often compromised by environmental and human factors during acquisition. This leads to problems such as noise and missing depth, which in turn degrade the quality of multilevel feature extraction results to some extent. These problems are detrimental to traditional methods, whereas deep learning approaches have overcome them to a certain extent, and with the rapid development of deep learning technology, numerous high-quality results have emerged for deep learning-based multilevel feature extraction. This paper summarizes the RGB-D datasets commonly used for multilevel feature extraction, such as NYU v2 and SUN RGB-D. These datasets cover diverse scenes and provide RGB images paired with corresponding depth images. NYU v2, for example, includes 1 449 densely labeled RGB-D images from 464 distinct indoor scenes across 26 scene classes. After introducing the datasets, this paper summarizes the criteria commonly used to evaluate the quality of line, planar, and semantic features and explains in detail how each criterion is computed. Line feature extraction methods are then reviewed in two groups, traditional and deep learning-based, with detailed explanations of the principles, advantages, and limitations of each method.
Furthermore, the extraction results of several methods are compared quantitatively. Planar feature extraction methods are likewise reviewed from two perspectives, traditional and deep learning-based: relevant research papers are gathered, the quality of the methods is compared, and the advantages and limitations of each method are explained in detail. Deep learning-based semantic feature extraction methods are reviewed from two aspects, fully supervised and semi-supervised learning, and the relevant research papers are summarized. To compare semantic feature extraction methods, this paper uses pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (mIoU) to measure extraction quality. The quantitative comparisons show that semantic feature extraction methods designed for RGB-D data achieve higher extraction quality than methods that use only RGB data. This indicates that incorporating depth information facilitates accurate and robust extraction of semantic features, leading to better performance in tasks such as scene understanding and object recognition. Data annotation remains a major challenge for deep learning-based feature extraction methods, because annotating large-scale datasets requires considerable time and human resources. Researchers have therefore been actively seeking ways to reduce the annotation workload or to make the most of existing annotated data.
Consequently, unsupervised, semi-supervised, and transfer learning are widely investigated as ways to leverage unlabeled or sparsely labeled data for feature extraction. Finally, this paper discusses the open problems that current multilevel feature extraction algorithms must address, to provide guidance for future development.
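The three semantic metrics used in the comparison (PA, MPA, and mIoU) are usually computed from a confusion matrix C over k classes, where C[i][j] counts pixels whose ground-truth class is i and whose predicted class is j. PA is the sum of the diagonal divided by the total pixel count; MPA averages the per-class accuracy C[i][i] / (row sum of i); mIoU averages C[i][i] / (row sum of i + column sum of i - C[i][i]). The following is a minimal sketch of these standard definitions; the function names are hypothetical, and skipping classes with a zero denominator is only one of several conventions in use.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(gt: Sequence[int], pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Rows index ground-truth classes, columns index predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def segmentation_scores(cm: List[List[int]]) -> Tuple[float, float, float]:
    """Return (PA, MPA, mIoU) computed from a k x k confusion matrix."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    pa = correct / total  # fraction of all pixels labeled correctly

    class_acc, class_iou = [], []
    for i in range(k):
        gt_i = sum(cm[i])                         # pixels whose true class is i
        pred_i = sum(cm[r][i] for r in range(k))  # pixels predicted as class i
        if gt_i > 0:
            class_acc.append(cm[i][i] / gt_i)     # per-class pixel accuracy
        union = gt_i + pred_i - cm[i][i]
        if union > 0:
            class_iou.append(cm[i][i] / union)    # per-class IoU
    mpa = sum(class_acc) / len(class_acc)
    miou = sum(class_iou) / len(class_iou)
    return pa, mpa, miou


# Flattened label maps of a toy 4-pixel image with two classes.
pa, mpa, miou = segmentation_scores(confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], 2))
print(pa, mpa, round(miou, 4))  # 0.75 0.75 0.5833
```

In practice the confusion matrix is accumulated over every pixel of every test image before the scores are computed, so that large images and frequent classes are weighted consistently.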