Sub-step super-pixel merging and multi-modal fusion for object detection in RGB-D images

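Zhao Xuan1,2, Guo Wei2, Liu Jing2 (1. School of Science, Hebei University of Technology, Tianjin 300401; 2. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024)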

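Abstract
Objective Affected by factors such as illumination change, shooting angle, number of objects, and object size, multi-object detection in indoor scenes tends to suffer from low accuracy and poor real-time performance. To address these problems, this paper proposes an object recognition and detection method based on sub-step super-pixel merging and multi-modal information fusion, using paired color and depth images of objects. Method In the object proposal stage, following the theory that the human eye, when observing a salient object, first attends to its color and then judges its spatial depth, the image is first segmented into super-pixels. The resulting pixel blocks are then merged step by step with multi-threshold, scale-adaptive super-pixel merging that combines color and depth information, yielding object proposal regions that are consistent in color and space. In the object recognition stage, to express the different kinds of object information fully, multi-kernel learning is used to fuse the extracted multi-modal features (color, texture, contour, and depth), and the fused feature kernel is fed into a multi-class support vector machine for learning and classification. Result Experiments on the standard University of Washington RGB-D dataset and on a real-scene dataset compare the proposed method with current mainstream algorithms. The overall detection accuracy of the proposed method is 4.7% higher than that of current mainstream algorithms, and the running time is greatly improved. The sub-step super-pixel merging method outperforms current mainstream object proposal methods in object localization, and at the same recall rate it needs about 1/4 as many sampling windows as the other algorithms. In the recognition stage, multi-information fusion outperforms both single features and simple fusion of color and depth features. Conclusion The results show that, in multi-feature object detection, the proposed method can effectively use the color and depth information of objects for localization and recognition, and that it plays an important role in improving detection accuracy and efficiency.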
Keywords
Object detection adopting sub-step merging of super-pixel and multi-modal fusion in RGB-D

Zhao Xuan1,2, Guo Wei2, Liu Jing2 (1. School of Science, Hebei University of Technology, Tianjin 300401, China; 2. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024, China)

Abstract
Objective With the development of artificial intelligence, a growing number of researchers in computer vision study object detection, and they are no longer content with research on RGB images alone. Object detection methods that use image depth have attracted attention. However, the accuracy and real-time performance of indoor multi-class object detection are susceptible to illumination change, shooting angle, the number of objects, and object size. To improve detection accuracy, several studies have employed deep learning methods. Although deep learning can effectively extract the underlying characteristics of objects at different levels, the large sample sizes and long training times it requires prevent the immediate and wide application of these methods. To improve detection efficiency, many researchers have tried to find all candidate regions that may contain objects from the edge information of objects, thus reducing the number of detection windows, and several have used deep learning to preselect these regions. To address these problems, this study proposes a two-stage method based on RGB-D images. The first stage is object proposal by sub-step merging of super-pixels, and the second is object classification by multi-modal data fusion. Method In the object proposal stage, the method first segments the image into super-pixels and then merges them step by step with self-adaptive multi-threshold scales on the basis of color and depth information, following the theory that the eye observes the color of an object first and its depth second. Specifically, the image is segmented with simple linear iterative clustering (SLIC), and the super-pixels are merged in two steps by computing region similarity with respect to color and depth information.
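The two-step merging described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each super-pixel is already summarized by a mean color and a mean depth, that the first step joins adjacent regions with similar color, and that the second step joins the resulting regions with similar depth. The function names, the single fixed thresholds `t_color` and `t_depth`, and the toy data are all hypothetical; the multi-threshold, scale-adaptive part of the method is not modeled.

```python
from math import dist


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mean_of(vectors):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(c) / n for c in zip(*vectors))


def merge_pass(labels, values, adjacency, threshold):
    """One merging step: join adjacent regions whose mean values are close.

    labels    -- current region label of each super-pixel
    values    -- per-super-pixel feature tuple (mean color or mean depth)
    adjacency -- (i, j) index pairs of touching super-pixels
    """
    # Mean feature of every region as it stands at the start of this pass.
    members = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(values[idx])
    means = {lab: mean_of(vals) for lab, vals in members.items()}

    parent = {lab: lab for lab in members}
    for i, j in adjacency:
        a, b = labels[i], labels[j]
        if a != b and dist(means[a], means[b]) < threshold:
            ra, rb = find(parent, a), find(parent, b)
            parent[max(ra, rb)] = min(ra, rb)
    return [find(parent, lab) for lab in labels]


def two_step_merge(colors, depths, adjacency, t_color, t_depth):
    """Assumed order: merge by color first, then by depth."""
    labels = list(range(len(colors)))
    labels = merge_pass(labels, colors, adjacency, t_color)
    labels = merge_pass(labels, depths, adjacency, t_depth)
    return labels


# Toy data (hypothetical): four super-pixels in a row, 0-1-2-3.
colors = [(50, 0, 0), (52, 1, 0), (80, 10, 5), (20, -5, 30)]
depths = [(1.0,), (1.05,), (1.02,), (3.0,)]
adjacency = [(0, 1), (1, 2), (2, 3)]
regions = two_step_merge(colors, depths, adjacency, t_color=5.0, t_depth=0.1)
```

In this toy case, super-pixels 0 and 1 are joined by color, super-pixel 2 joins them in the depth step, and super-pixel 3 stays separate because it lies much farther away.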
In this way, detection windows with similar color and depth information are extracted, and the number of windows is then reduced by filtering them by area and applying non-maximum suppression to overlapping detections. At the end of the process, the number of detection windows is far smaller than with a sliding-window scan, and each region may contain an object or part of an object. In the object recognition stage, the proposed method uses multi-kernel learning to fuse the multi-modal features (color, texture, contour, and depth) extracted from the RGB-D images. In general, objects are hard to identify from a single feature because of their diversity. For example, distinguishing a real apple from one painted in a picture is difficult. Compared with a single feature or a simple fusion of two features, multi-modal data fusion can cover more object characteristics in RGB-D images. Finally, the fused feature kernel is input to the SVM classifier, which completes the object detection procedure. Result The study compares the proposed method with current mainstream algorithms under different threshold segmentation interval parameters and different Gaussian kernel parameters for multi-kernel learning. The proposed method has a certain advantage in object detection. In comparative experiments on the standard RGB-D dataset from the University of Washington and on real-scene data captured with a Kinect sensor, its detection rate is 4.7% higher than that of the state-of-the-art method. The sub-step merging of super-pixels is superior to current mainstream object proposal methods in object localization, and at the same recall rate it needs only about one quarter as many sampling windows as the other algorithms.
Moreover, a comparison of recognition accuracy for individual features and fused features shows that multi-feature fusion is considerably more accurate than any individual feature. The fused approach also achieves outstanding overall detection accuracy on object categories in different poses. Conclusion Experimental results show that the proposed method can make full use of color and depth information in object localization and classification and is important for achieving high accuracy and better real-time performance. The sub-step merging of super-pixels can also be used in object detection based on deep learning.
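The kernel fusion used in the recognition stage can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each modality gets its own Gaussian kernel, and the fused kernel is a weighted sum with fixed, equal weights. In multi-kernel learning the weights would be learned, which is not shown here. The function names, `gamma` values, and toy features are hypothetical.

```python
from math import exp


def gaussian_kernel(X, gamma):
    """Gram matrix with K[i][j] = exp(-gamma * ||X[i] - X[j]||^2)."""
    return [
        [exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in X]
        for x in X
    ]


def fuse_kernels(kernels, weights):
    """Weighted sum of Gram matrices; weights must be non-negative and sum to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Toy features (hypothetical) for three samples in four modalities.
# Samples 0 and 1 are similar; sample 2 is different.
features = {
    "color":   [(0.10, 0.20), (0.10, 0.25), (0.90, 0.80)],
    "texture": [(0.30,), (0.35,), (0.95,)],
    "contour": [(0.50, 0.50), (0.55, 0.50), (0.05, 0.90)],
    "depth":   [(1.0,), (1.1,), (3.0,)],
}
gammas = {"color": 2.0, "texture": 2.0, "contour": 2.0, "depth": 0.5}

kernels = [gaussian_kernel(features[m], gammas[m]) for m in features]
fused = fuse_kernels(kernels, weights=[0.25, 0.25, 0.25, 0.25])
```

A fused Gram matrix of this kind could then be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.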
Keywords
