RGB-D semantic segmentation: selective use of depth information

Zhao Jingyang, Yu Changqian, Sang Nong (Key Laboratory of Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China)

Abstract
Objective In indoor scene semantic segmentation, depth information improves segmentation accuracy to a certain extent, but how to use it more effectively remains an open problem. Most current methods introduce all of the depth information; however, combining all depth information with the visual features may interfere with the model, because objects that the network can already distinguish from visual features alone may be misjudged once depth information is added. In addition, the fixed geometric structure of the convolution kernel limits the modeling ability of convolutional neural networks. Deformable convolution (DC) alleviates this problem to some extent, but the visual features from which DC learns its position offsets carry relatively little spatial depth information, which limits further improvement. To address these problems, this paper proposes a depth guided feature extraction (DFE) module. Method The DFE module consists of a depth guided feature selection (DFS) module and a depth embedded deformable convolution (DDC) module. DFS filters out the key depth information and adaptively adjusts the proportion of depth information introduced into the visual features, embedding depth information only when the network model needs it. With the extra depth information, DDC strengthens the feature extraction ability of deformable convolution and can extract features that better match the shape of objects. Result To verify the effectiveness of the method, a series of ablation experiments is conducted on the NYUv2 (New York University Depth Dataset V2) dataset and the method is compared with the current best methods, using mean intersection over union (mIoU) and pixel accuracy (PA) as the evaluation metrics. On NYUv2, our method achieves an mIoU of 51.9% and a PA of 77.6%, a competitive segmentation result. Conclusion The proposed DFE module adaptively adjusts the degree to which depth information is embedded into visual features, making more reasonable use of depth information, and improves the feature extraction ability of deformable convolution with the help of depth information. Moreover, the module can be conveniently embedded into popular feature extraction networks to improve their modeling ability.
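To make the DFS mechanism described above concrete, the following is a minimal PyTorch sketch. It follows the stated recipe: concatenate the visual and depth features, apply channel attention, derive a per-pixel weight matrix for the depth features via a 1×1 convolution and a sigmoid, and add the weighted depth features to the visual features. The channel sizes, the squeeze-and-excitation style of the attention, and all names are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class DepthGuidedFeatureSelection(nn.Module):
    """Minimal sketch of the DFS idea: gate how much depth
    information is added to the visual features. Layer sizes and
    the attention design are assumptions, not the paper's code."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention over the concatenated features
        # (squeeze-and-excitation style, an assumption).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),
        )
        # 1x1 convolution + sigmoid produces the weight matrix
        # for the depth features, as the abstract describes.
        self.depth_gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, visual, depth):
        fused = torch.cat([visual, depth], dim=1)  # concatenate RGB/depth features
        fused = fused * self.attn(fused)           # select the influential channels
        weight = self.depth_gate(fused)            # learned weights for the depth features
        return visual + weight * depth             # embed only the selected depth information

# Usage: 256-channel visual and depth feature maps of the same size.
vis = torch.randn(2, 256, 30, 40)
dep = torch.randn(2, 256, 30, 40)
out = DepthGuidedFeatureSelection(256)(vis, dep)
print(out.shape)  # torch.Size([2, 256, 30, 40])

Because the gate is learned, the network can drive the weights toward zero where visual features alone suffice, which is the selective behavior the abstract describes.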
Keywords
RGB-D semantic segmentation: depth information selection

Zhao Jingyang, Yu Changqian, Sang Nong(Key Laboratory of Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China)

Abstract
Objective Semantic segmentation is essential to computer vision applications. It assigns each pixel in an image to its corresponding category, which makes it a pixel-level multi-class classification task, and it is of great significance in fields such as autonomous driving, virtual reality, and medical image processing. The emergence of the convolutional neural network (CNN) has promoted the rapid development of neural networks across computer vision tasks, and fully convolutional networks have completely changed the landscape of semantic segmentation. With the advent of depth cameras, it has become more convenient to obtain the depth images corresponding to color images. A depth image is single-channel, and each value represents the distance from the corresponding pixel to the camera plane. Depth images thus contain spatial distance information that color images largely lack. In the semantic segmentation task, it is difficult for a network to distinguish adjacent objects with similar appearance from a color image alone, a difficulty that depth images can relieve to some extent. RGB-D semantic segmentation has therefore attracted increasing attention recently. The ways of embedding depth information into visual features can be roughly divided into three categories: one-stream, two-stream, and multi-task. One-stream methods do not use depth images as an additional input for feature extraction; a single backbone network extracts features from color images, and the inherent spatial information of depth images assists visual feature extraction to improve semantic segmentation. Two-stream methods use the depth image as an additional input and mainly involve two backbone networks, one extracting features from color images and the other from depth images; in the encoding or decoding stage, the extracted visual features are fused with the depth features to exploit the depth information. Multi-task methods process semantic segmentation, depth estimation, and surface normal estimation at the same time with only one shared backbone network; in the process of extracting features from color images, the interaction among the tasks improves the performance of each task. Previous studies have also tried to use depth information effectively, yet embedding all depth information into visual features may interfere with the network. The inherent color and texture information can sometimes clearly distinguish two or more categories in a color image, in which case adding depth information is somewhat gilding the lily. For example, objects at similar depths but with different appearances can be distinguished by visual features alone, and adding depth information may confuse the network and even lead to wrong judgments. Moreover, the fixed structure of the convolution kernel limits the feature extraction ability of CNNs. Deformable convolution (DC) alleviates this problem: it learns offsets for the sampling points according to the input and extracts more effective features that follow the shape of the object, thereby improving the modeling ability of the network. However, learning the offsets from visual features alone is insufficient, because the spatial information in color images is very limited. Method We develop a depth guided feature extraction (DFE) module, which consists of a depth guided feature selection (DFS) module and a depth embedded deformable convolution (DDC) module.
First, to avoid the interference caused by introducing all depth information into the network, the proposed DFS module concatenates the input depth features and visual features, and then selects the influential features from the fused features through channel attention. Next, a weight matrix for the depth features is obtained through a 1×1 convolution followed by a sigmoid function. Multiplying the depth features by the corresponding weight matrix yields the depth information to be embedded in the visual features, which is then added to the visual features. Since the weight matrix of the depth features is learned, the network can adjust the amount of depth information adaptively rather than accepting all of it: when depth information is needed for classification, its proportion is increased; otherwise, it is decreased. Second, to fully exploit the feature extraction capability of deformable convolution, the DDC module is proposed. The depth-embedded visual features are taken as input to learn the offsets of the sampling points, and the added depth features compensate for the lack of geometric information in visual features. Result To verify the effectiveness of the method, a series of ablation experiments is carried out on the New York University Depth Dataset V2 (NYUv2) and the method is compared with current state-of-the-art methods. Mean intersection over union (mIoU) and pixel accuracy (PA) are used as the evaluation metrics. Our method achieves an mIoU of 51.9% and a PA of 77.6% on NYUv2. Visualization results of the semantic segmentation are also presented to demonstrate the effectiveness of the method. Conclusion We propose the depth guided feature extraction (DFE) module, which includes the depth guided feature selection (DFS) module and the depth embedded deformable convolution (DDC) module. DFS can adaptively determine the proportion of depth information according to the input visual and depth features. DDC enhances the feature extraction capability of deformable convolution by embedding depth information and can extract more effective features that follow the shape of objects. In addition, the designed module can be embedded into current feature extraction networks, allowing depth information to improve their modeling ability effectively.
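As a companion to the DFS sketch above, the following minimal PyTorch sketch illustrates the DDC idea: the offsets of the deformable convolution are predicted from depth-embedded visual features (e.g., the DFS output) rather than from visual features alone. It uses torchvision's DeformConv2d; whether the main convolution consumes the same depth-embedded features, as well as every layer size and name, are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DepthEmbeddedDeformConv(nn.Module):
    """Minimal sketch of the DDC idea: learn the sampling-point
    offsets from depth-embedded features, whose extra geometric
    cues compensate for the limited spatial information of
    visual features alone."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Offsets: 2 values (x, y) per kernel sampling location.
        self.offset_pred = nn.Conv2d(
            in_channels, 2 * kernel_size * kernel_size,
            kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(
            in_channels, out_channels, kernel_size, padding=pad)

    def forward(self, depth_embedded_features):
        # Depth-embedded features (e.g., the DFS output) supply
        # the geometric information the offsets need.
        offset = self.offset_pred(depth_embedded_features)
        return self.deform_conv(depth_embedded_features, offset)

# Usage: apply DDC to a feature map such as the DFS output.
feat = torch.randn(2, 256, 30, 40)
out = DepthEmbeddedDeformConv(256, 256)(feat)
print(out.shape)  # torch.Size([2, 256, 30, 40])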
Keywords
