RGB-D semantic segmentation: depth information selection
2022, Vol. 27, No. 8, pp. 2473-2486
Received: 2021-01-29; Revised: 2021-06-28; Accepted: 2021-07-03; Published in print: 2022-08-16
DOI: 10.11834/jig.210061
Objective
In indoor scene semantic segmentation, depth information improves segmentation accuracy to some extent, but how to use it effectively remains an open problem. Most current methods inject all of the available depth information; however, combining all depth information with the visual features can interfere with the model, because objects that the network can already distinguish from visual features alone may be misclassified once depth information is introduced. In addition, the fixed geometric structure of the convolution kernel limits the modeling capacity of convolutional neural networks. Deformable convolution (DC) alleviates this problem to some extent, but the visual features from which DC predicts its position offsets carry relatively little spatial depth information, which limits further improvement. To address these problems, this paper proposes a depth guided feature extraction (DFE) module.
Method
The depth guided feature extraction module consists of a depth guided feature selection (DFS) module and a depth embedded deformable convolution (DDC) module. DFS selects the critical depth information and adaptively adjusts the proportion of depth information injected into the visual features, embedding depth information only when the network needs it. With the extra depth information, DDC strengthens the feature extraction ability of deformable convolution and can extract features that better match the shape of objects.
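To make the selection mechanism concrete, the following is a minimal PyTorch sketch of the DFS idea as described above; the module name, channel sizes, and the SE-style channel attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DepthGuidedFeatureSelection(nn.Module):
    """Sketch of the DFS idea: gate how much depth information
    is added to the visual features, instead of adding it all."""
    def __init__(self, channels):
        super().__init__()
        # SE-style channel attention over the concatenated features
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # 1x1 convolution + sigmoid producing the depth weight matrix
        self.depth_gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, visual, depth):
        fused = torch.cat([visual, depth], dim=1)   # concatenate modalities
        fused = fused * self.channel_attn(fused)    # reweight channels
        weight = self.depth_gate(fused)             # per-pixel gate in [0, 1]
        return visual + weight * depth              # inject selected depth
```

Because the gate is learned per pixel and per channel, the network can drive it toward zero wherever visual features alone suffice.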
Result
To verify the effectiveness of the method, a series of ablation experiments is conducted on the NYUv2 (New York University Depth Dataset V2) dataset and the results are compared with state-of-the-art methods, using mean intersection over union (mIoU) and pixel accuracy (PA) as metrics. On NYUv2, the proposed method achieves an mIoU of 51.9% and a PA of 77.6%, a competitive segmentation result.
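For reference, both metrics follow their standard definitions and can be computed from a class confusion matrix as in the sketch below; this is not the authors' evaluation code.

```python
import numpy as np

def miou_and_pixel_acc(conf):
    """Compute mIoU and PA from a KxK confusion matrix, where
    conf[i, j] counts pixels of class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    # per-class IoU = TP / (TP + FP + FN)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    miou = np.nanmean(iou)          # average over classes
    pa = tp.sum() / conf.sum()      # fraction of correctly labeled pixels
    return miou, pa
```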
Conclusion
The proposed depth guided feature extraction module adaptively adjusts the degree to which depth information is embedded into visual features, uses depth information more sensibly, and improves the feature extraction ability of deformable convolution with the help of depth information. Moreover, the module can be conveniently embedded into popular feature extraction networks to improve their modeling capacity.
Objective
Semantic segmentation is essential to computer vision applications. It assigns each pixel in an image to its corresponding category, making it a pixel-level multi-classification task, and it is of great significance in fields such as autonomous driving, virtual reality, and medical image processing. The emergence of convolutional neural networks (CNNs) has driven rapid progress across computer vision tasks, and fully convolutional networks changed the landscape of semantic segmentation. With the advent of depth cameras, it has become convenient to obtain depth images corresponding to color images. A depth image is single-channel, and each value represents the distance from the corresponding point to the camera plane. Depth images therefore contain spatial distance information that color images largely lack. In semantic segmentation, it is difficult for a network to distinguish adjacent objects with similar appearance in a plain color image, but this difficulty can be alleviated to some extent by using depth images, so RGB-D semantic segmentation has attracted much recent attention. The ways of embedding depth information into visual features can be roughly divided into three categories: one-stream, two-stream, and multi-task. One-stream methods do not use the depth image as an additional input for feature extraction; a single backbone extracts features from the color image, and during feature extraction the spatial information inherent in the depth image is used to assist visual feature extraction and improve segmentation. Two-stream methods use the depth image as an additional input and typically involve two backbone networks, one extracting features from color images and the other from depth images; in the encoding or decoding stage, the extracted visual features are fused with the depth features. Multi-task methods process semantic segmentation, depth estimation, and surface normal estimation at the same time with a single shared backbone; in extracting features from the color image, the interaction among tasks improves the performance of each. However, previous studies have struggled to use depth information effectively, and embedding all of it into the visual features may interfere with the network. Color and texture information can sometimes clearly distinguish two or more categories in a color image, in which case adding depth information is superfluous. For example, objects at similar depths but with different appearances can be distinguished by visual features alone, while adding depth information may confuse the network and even cause wrong judgments. Moreover, the fixed geometric structure of the convolution kernel limits the feature extraction ability of CNNs. Deformable convolution (DC) addresses this by learning offsets for the sampling points from its input and extracting features that better match the shape of objects, thereby improving the modeling ability of the network. However, learning the offsets from visual features alone is insufficient, because the spatial information in color images is very limited.
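As a concrete illustration of the mechanism discussed here, the snippet below shows standard deformable convolution with torchvision, where the sampling offsets are predicted from the visual features alone; shapes and channel counts are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

# Offsets are predicted from the input feature map itself: two values
# (x and y displacement) per output position for each of the 3x3 taps.
feats = torch.randn(1, 64, 32, 32)            # visual features only
offset_pred = nn.Conv2d(64, 2 * 3 * 3, kernel_size=3, padding=1)
weight = torch.randn(128, 64, 3, 3)           # deformable conv kernel

offset = offset_pred(feats)                   # learned sampling displacements
out = deform_conv2d(feats, offset, weight, padding=1)
print(out.shape)                              # torch.Size([1, 128, 32, 32])
```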
Method
We develop a depth guided feature extraction (DFE) module, which consists of a depth guided feature selection (DFS) module and a depth embedded deformable convolution (DDC) module. To avoid the interference caused by injecting all depth information into the network, DFS first concatenates the input depth features and visual features, then selects the influential features from the fused features through a channel attention mechanism. Next, a weight matrix for the depth features is obtained through a 1×1 convolution followed by a sigmoid function. Multiplying the depth features by this weight matrix yields the depth information to be embedded, which is then added to the visual features. Because the weight matrix is learned, the network can adjust the amount of depth information it accepts rather than taking all of it: the proportion of depth information increases when depth is needed for classification and decreases otherwise. To exploit the feature extraction capability of deformable convolution more fully, the DDC module takes the depth-embedded visual features as input to learn the offsets of the sampling points; the added depth features make up for the lack of geometric information in the visual features.
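A minimal sketch of the DDC idea follows, paired with the DFS sketch given earlier: in contrast to the vanilla deformable convolution shown above, the offsets are now predicted from depth-embedded visual features rather than from visual features alone. The module and argument names are assumptions for illustration, not the authors' implementation.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DepthEmbeddedDeformableConv(nn.Module):
    """Sketch of the DDC idea: predict sampling offsets from
    depth-embedded visual features (e.g., the output of DFS)."""
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # offsets: 2 values (x, y) per kernel position
        self.offset_pred = nn.Conv2d(
            in_channels, 2 * kernel_size * kernel_size,
            kernel_size=kernel_size, padding=padding)
        self.deform = DeformConv2d(
            in_channels, out_channels, kernel_size, padding=padding)

    def forward(self, depth_embedded):
        offset = self.offset_pred(depth_embedded)  # depth-aware offsets
        return self.deform(depth_embedded, offset)
```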
Result
To verify the effectiveness of the method, a series of ablation experiments is carried out on the New York University Depth Dataset V2 (NYUv2) and the results are compared with current methods, using mean intersection over union (mIoU) and pixel accuracy (PA) as metrics. Our method achieves an mIoU of 51.9% and a PA of 77.6% on NYUv2. Visualization results of the semantic segmentation further demonstrate the effectiveness of the method.
Conclusion
We propose the depth guided feature extraction (DFE) module, which consists of the depth guided feature selection (DFS) module and the depth embedded deformable convolution (DDC) module. DFS adaptively determines the proportion of depth information according to the input visual and depth features. DDC enhances the feature extraction capability of deformable convolution by embedding depth information and can extract more effective features according to the shape of objects. In addition, the designed module can be embedded into current feature extraction networks, using depth information to improve their modeling ability effectively.
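Putting the two sketches together, a hypothetical drop-in use inside a feature extraction network might look as follows; the tensor shapes are illustrative, and the classes are the sketches defined earlier, not the authors' released code.

```python
import torch

# Visual and depth features from two branches at the same resolution.
visual = torch.randn(1, 256, 60, 80)
depth = torch.randn(1, 256, 60, 80)

dfs = DepthGuidedFeatureSelection(channels=256)
ddc = DepthEmbeddedDeformableConv(in_channels=256, out_channels=256)

out = ddc(dfs(visual, depth))   # depth-selected, then deformably convolved
print(out.shape)                # torch.Size([1, 256, 60, 80])
```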