Progress in multi-modal image semantic segmentation based on deep learning
2023, Vol. 28, No. 11, Pages: 3320-3341
Print publication date: 2023-11-16
DOI: 10.11834/jig.220451
Zhao Shenlu, Zhang Qiang. 2023. Progress in multi-modal image semantic segmentation based on deep learning. Journal of Image and Graphics, 28(11):3320-3341
Image semantic segmentation aims to decompose a visual scene into entities of different semantic categories, predicting a category label for every pixel in an image. Multi-modal image semantic segmentation jointly exploits the complementary characteristics of images from different modalities (i.e., images acquired by sensors based on different imaging mechanisms) to achieve comprehensive and accurate learning and reasoning about complex scenes. Although many leading-edge results on deep-learning-based multi-modal image semantic segmentation have emerged, systematic and comprehensive surveys are scarce. This paper first summarizes and analyzes the current mainstream deep-learning-based algorithms for visible-thermal (red-green-blue-thermal, RGB-T) and visible-depth (red-green-blue-depth, RGB-D) image semantic segmentation. According to their different emphases, deep-learning-based RGB-T semantic segmentation algorithms are divided into image-feature-enhancement-based methods, multi-modal-image-feature-fusion-based methods, and multi-level-image-feature-interaction-based methods; according to how depth information is exploited, deep-learning-based RGB-D semantic segmentation algorithms are divided into depth-information-extraction-based methods and depth-information-guidance-based methods. Then, the commonly used objective evaluation criteria and datasets for multi-modal image semantic segmentation are introduced, and the above algorithms are compared on the common datasets. For RGB-T semantic segmentation, on the MFNet (multi-spectral fusion network) dataset, GMNet (graded-feature multilabel-learning network) and MFFENet (multiscale feature fusion and enhancement network) achieve the best mean intersection-over-union per class (mIoU) (57.3%) and mean accuracy per class (mAcc) (74.3%), respectively. On the PST900 (PENN subterranean thermal 900) dataset, GMNet still achieves the best mIoU (84.12%), while EGFNet achieves the best mAcc (94.02%). For RGB-D semantic segmentation, on the NYUD v2 (New York University depth dataset v2) dataset, GLPNet (global-local propagation network) achieves the best performance, with an mIoU of 54.6% and an mAcc of 66.6%. On the SUN-RGBD (scene understanding-RGB-D) dataset, Zig-Zag achieves the best mIoU (51.8%) and GLPNet the best mAcc (63.3%). Finally, this paper points out possible future research directions for multi-modal image semantic segmentation.
Unlike some low-level vision tasks, such as image deraining, dehazing, and deblurring, semantic segmentation aims to decompose a visual scene into different semantic category entities and achieve category prediction for each pixel in an image, which plays an indispensable role in many scene understanding systems. Most existing semantic segmentation models use visible red-green-blue (RGB) images to perceive scene contents. However, visible cameras have poor robustness to changing illumination and cannot penetrate smoke, fog, haze, rain, or snow. Limited by their imaging mechanism, visible cameras can hardly capture sufficient and effective scene information under poor lighting and severe weather conditions. Furthermore, they cannot provide the spatial structures and 3D layouts of scenes, which prevents them from handling complex scenes with similar target appearances or multiple changing scene areas. In recent years, with the continuous development of sensor technologies, thermal infrared and depth cameras have been widely used in military and civil fields. Compared with visible cameras, depth cameras can acquire the physical distances between objects in a scene and the optical center of the sensor, while thermal infrared cameras capture the thermal radiation of objects whose temperature exceeds absolute zero (-273 ℃) under various lighting and weather conditions, thus providing rich contour and semantic cues. However, depth and thermal infrared images usually lack colors, textures, and other details. Since unimodal images can hardly provide complete information about complex scenes, multi-modal image semantic segmentation aims to combine the complementary characteristics of images from different modalities (i.e., the images acquired by sensors based on different imaging mechanisms) to achieve comprehensive and accurate predictions. At present, there are many leading-edge works on multi-modal image semantic segmentation based on deep learning, but comprehensive reviews are scarce. In this paper, we provide a systematic review of the recent advances in multi-modal image semantic segmentation, including red-green-blue-thermal (RGB-T) and red-green-blue-depth (RGB-D) semantic segmentation. First, we summarize and analyze the current mainstream deep-learning-based algorithms for RGB-T and RGB-D semantic segmentation. Specifically, according to the different emphases of these algorithms, RGB-T semantic segmentation models based on deep learning are divided into three categories, namely, image-feature-enhancement-based methods, multi-modal-image-feature-fusion-based methods, and multi-level-image-feature-interaction-based methods. In image-feature-enhancement-based methods, unimodal or multi-modal fused image features are directly or indirectly enhanced by employing attention mechanisms and embedding auxiliary information. These models aim to mitigate the influence of interference information and mine highly discriminative information from unimodal or multi-modal fused image features, thus effectively improving segmentation accuracy. Multi-modal-image-feature-fusion-based methods mainly focus on how to effectively exploit the complementary characteristics between RGB and thermal infrared features so as to take full advantage of multi-modal images. Multi-modal image feature fusion has no counterpart in unimodal semantic segmentation, and most existing RGB-T semantic segmentation methods are therefore dedicated to designing fusion modules for integrating multi-modal image features. Receptive fields of different scales capture information about objects of different sizes in a scene. With this in mind, interactions among multi-level image features can help capture rich multi-scale contextual information, which significantly boosts the performance of semantic segmentation models, especially in scenes containing multi-scale objects. Multi-level-image-feature-interaction-based methods have been widely used in unimodal image semantic segmentation, such as non-local networks and DeepLab. Similarly, some works have adopted these methods in RGB-T semantic segmentation and achieved satisfactory results. Alternatively, according to how depth information is exploited, RGB-D semantic segmentation models based on deep learning are divided into depth-information-extraction-based methods and depth-information-guidance-based methods, with the former being further subdivided into multi-modal-image-feature-fusion-based methods and contextual-information-mining-based methods. Similar to RGB-T semantic segmentation methods, depth-information-extraction-based methods regard the depth and RGB images as two separate inputs and capture the discriminative information within RGB and depth features by extracting and fusing unimodal image features. In depth-information-guidance-based methods, depth information is instead embedded into the feature extraction of RGB images. By doing so, these methods can make full use of the 3D information provided by depth images and reduce model sizes to some extent.
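To make the RGB-T taxonomy concrete, the following is a minimal PyTorch-style sketch, not any surveyed model's actual module, combining the first two categories: each modality's features are enhanced with a squeeze-and-excitation-style channel attention, then fused element-wise. All module and tensor names here are illustrative assumptions.

```python
# Minimal sketch: attention-based feature enhancement + multi-modal fusion.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style reweighting of feature channels."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # global context: B x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                             # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mlp(x)                        # suppress uninformative channels

class FusionBlock(nn.Module):
    """Enhance each modality's features, then fuse them element-wise."""
    def __init__(self, channels: int):
        super().__init__()
        self.enhance_rgb = ChannelAttention(channels)
        self.enhance_thermal = ChannelAttention(channels)
        self.post = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_thermal):
        fused = self.enhance_rgb(f_rgb) + self.enhance_thermal(f_thermal)
        return self.post(fused)

# Toy usage: fuse stage-wise features from two encoder streams.
f_rgb = torch.randn(2, 64, 120, 160)        # RGB encoder features
f_t = torch.randn(2, 64, 120, 160)          # thermal encoder features
print(FusionBlock(64)(f_rgb, f_t).shape)    # torch.Size([2, 64, 120, 160])
```

For the RGB-D side, the sketch below illustrates the depth-information-guidance idea in the spirit of depth-aware convolution (Wang and Neumann, 2018): instead of fusing a separate depth stream, the depth map reweights a convolution's neighbourhood aggregation so that pixels at similar depths, which likely belong to the same surface, contribute more. This is a simplified illustration under assumed shapes and an assumed similarity weighting, not the paper's exact operator.

```python
# Minimal sketch of depth-guided convolution (assumed simplification).
import torch
import torch.nn.functional as F

def depth_guided_conv(x, depth, weight, alpha: float = 1.0):
    """x: B x C x H x W features, depth: B x 1 x H x W, weight: O x C x 3 x 3.
    alpha is an assumed sensitivity hyper-parameter."""
    b, c, h, w = x.shape
    o = weight.shape[0]
    cols = F.unfold(x, kernel_size=3, padding=1)              # B x (C*9) x (H*W)
    d_cols = F.unfold(depth, kernel_size=3, padding=1)        # B x 9 x (H*W)
    # Similarity between each neighbour's depth and the centre pixel's depth.
    sim = torch.exp(-alpha * (d_cols - depth.reshape(b, 1, -1)).abs())
    cols = cols.reshape(b, c, 9, -1) * sim.unsqueeze(1)       # depth-similar neighbours weigh more
    out = weight.reshape(o, -1) @ cols.reshape(b, c * 9, -1)  # B x O x (H*W)
    return out.reshape(b, o, h, w)

# Toy usage with assumed shapes.
x = torch.randn(2, 16, 60, 80)
depth = torch.rand(2, 1, 60, 80)
weight = torch.randn(32, 16, 3, 3)
print(depth_guided_conv(x, depth, weight).shape)  # torch.Size([2, 32, 60, 80])
```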
Second, we introduce some widely used evaluation criteria and public datasets for RGB-D/RGB-T semantic segmentation and compare and analyze the performance of various models. Specifically, for RGB-T semantic segmentation, graded-feature multilabel-learning network (GMNet) and multiscale feature fusion and enhancement network (MFFENet) achieve the best performance in terms of mean intersection over union per class (mIoU) (57.3%) and mean accuracy per class (mAcc) (74.3%), respectively, on the MFNet dataset. Meanwhile, on the PST900 dataset, GMNet still achieves the best performance in mIoU (84.12%), but the edge-aware guidance fusion network (EGFNet) achieves the best performance in mAcc (94.02%). For RGB-D semantic segmentation, the global-local propagation network (GLPNet) achieves the best performance in mIoU (54.6%) and mAcc (66.6%) on the NYUD v2 dataset. On the SUN-RGBD dataset, GLPNet achieves the best performance in mAcc (63.3%), but Zig-Zag achieves the best result in mIoU (51.8%). We also point out some future development directions for multi-modal image semantic segmentation.
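The two headline metrics above are straightforward to compute from a confusion matrix accumulated over a test set. The sketch below shows one common formulation (mIoU averages per-class intersection-over-union; mAcc averages per-class pixel recall); the function names are our own, and the 9-class toy label space loosely mirrors MFNet's.

```python
# Minimal sketch of mIoU / mAcc computation from a confusion matrix.
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape; rows index ground truth."""
    mask = (gt >= 0) & (gt < num_classes)          # ignore unlabeled pixels
    idx = num_classes * gt[mask] + pred[mask]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_macc(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                       # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp                       # pixels of class c that were missed
    iou = tp / np.maximum(tp + fp + fn, 1)         # per-class IoU (guard against empty classes)
    acc = tp / np.maximum(tp + fn, 1)              # per-class accuracy (recall)
    return iou.mean(), acc.mean()

# Toy usage on random predictions over 9 classes.
rng = np.random.default_rng(0)
gt = rng.integers(0, 9, size=(480, 640))
pred = rng.integers(0, 9, size=(480, 640))
print(miou_macc(confusion_matrix(pred, gt, 9)))
```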
Keywords: multi-modal image; semantic segmentation; feature enhancement; feature fusion; feature interaction; depth information extraction; depth information guidance
Badrinarayanan V, Kendall A and Cipolla R. 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Cao J M, Leng H C, Lischinski D, Cohen-Or D, Tu C H and Li Y Y. 2021. ShapeConv: shape-aware convolutional layer for indoor RGB-D semantic segmentation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 7068-7077 [DOI: 10.1109/ICCV48922.2021.00700]
Chen L C, Papandreou G, Schroff F and Adam H. 2017. Rethinking atrous convolution for semantic image segmentation [EB/OL]. [2022-04-21]. https://arxiv.org/pdf/1706.05587.pdf
Chen S H, Zhu X X, Liu W, He X J and Liu J. 2021. Global-local propagation network for RGB-D semantic segmentation [EB/OL]. [2022-04-21]. https://arxiv.org/pdf/2101.10801.pdf
Chen X K, Lin K Y, Wang J B, Wu W, Qian C, Li H S and Zeng G. 2020. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 561-577 [DOI: 10.1007/978-3-030-58621-8_33]
Deng F Q, Feng H, Liang M J, Wang H M, Yang Y, Gao Y, Chen J F, Hu J J, Guo X Y and Lam T L. 2021. FEANet: feature-enhanced attention network for RGB-thermal real-time semantic segmentation//Proceedings of 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. Prague, Czech Republic: IEEE: 4467-4473 [DOI: 10.1109/IROS51168.2021.9636084]
Guo Z F, Li X, Xu Q M and Sun Z L. 2021. Robust semantic segmentation based on RGB-thermal in variable lighting scenes. Measurement, 186: #110176 [DOI: 10.1016/j.measurement.2021.110176]
Ha Q S, Watanabe K, Karasawa T, Ushiku Y and Harada T. 2017. MFNet: towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes//Proceedings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vancouver, Canada: IEEE: 5108-5115 [DOI: 10.1109/IROS.2017.8206396]
Hazirbas C, Ma L N, Domokos C and Cremers D. 2017. FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture//Proceedings of the 13th Asian Conference on Computer Vision. Taipei, China: Springer: 213-228 [DOI: 10.1007/978-3-319-54181-5_14]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu X X, Yang K L, Fei L and Wang K W. 2019. ACNet: attention based network to exploit complementary features for RGBD semantic segmentation//Proceedings of 2019 IEEE International Conference on Image Processing. Taipei, China: IEEE: 1440-1444 [DOI: 10.1109/ICIP.2019.8803025]
Hu Y S, Chen Z Z and Lin W Y. 2018. RGB-D semantic segmentation: a review//Proceedings of 2018 IEEE International Conference on Multimedia and Expo Workshops. San Diego, USA: IEEE: 1-6 [DOI: 10.1109/ICMEW.2018.8551554]
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2261-2269 [DOI: 10.1109/CVPR.2017.243]
Hung S W, Lo S Y and Hang H M. 2019. Incorporating luminance, depth and color information by a fusion-based network for semantic segmentation//Proceedings of 2019 IEEE International Conference on Image Processing. Taipei, China: IEEE: 2374-2378 [DOI: 10.1109/ICIP.2019.8803360]
Lan X, Gu X J and Gu X S. 2022. MMNet: multi-modal multi-stage network for RGB-T image semantic segmentation. Applied Intelligence, 52(5): 5817-5829 [DOI: 10.1007/s10489-021-02687-7]
Lee S, Park S J and Hong K S. 2017. RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4990-4999 [DOI: 10.1109/ICCV.2017.533]
Li Y B, Zhang J G, Cheng Y H, Huang K Q and Tan T N. 2017. Semantics-guided multi-level RGB-D feature fusion for indoor semantic segmentation//Proceedings of 2017 IEEE International Conference on Image Processing. Beijing, China: IEEE: 1262-1266 [DOI: 10.1109/ICIP.2017.8296484]
Lin D, Chen G Y, Cohen-Or D, Heng P A and Huang H. 2017a. Cascaded feature network for semantic segmentation of RGB-D images//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1320-1328 [DOI: 10.1109/ICCV.2017.147]
Lin D and Huang H. 2020. Zig-Zag network for semantic segmentation of RGB-D images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2642-2655 [DOI: 10.1109/TPAMI.2019.2923513]
Lin D, Zhang R M, Ji Y F, Li P and Huang H. 2020. SCN: switchable context network for semantic segmentation of RGB-D images. IEEE Transactions on Cybernetics, 50(3): 1120-1131 [DOI: 10.1109/TCYB.2018.2885062]
Lin G S, Milan A, Shen C H and Reid I. 2017b. RefineNet: multi-path refinement networks for high-resolution semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5168-5177 [DOI: 10.1109/CVPR.2017.549]
Liu H, Wu W S, Wang X D and Qian Y L. 2018. RGB-D joint modelling with scene geometric information for indoor semantic segmentation. Multimedia Tools and Applications, 77(17): 22475-22488 [DOI: 10.1007/s11042-018-6056-8]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Noori A Y. 2021. A survey of RGB-D image semantic segmentation by deep learning//Proceedings of the 7th International Conference on Advanced Computing and Communication Systems. Coimbatore, India: IEEE: 1953-1957 [DOI: 10.1109/ICACCS51430.2021.9441924]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Seichter D, Köhler M, Lewandowski B, Wengefeld T and Gross H M. 2021. Efficient RGB-D semantic segmentation for indoor scene analysis//Proceedings of 2021 IEEE International Conference on Robotics and Automation. Xi’an, China: IEEE: 13525-13531 [DOI: 10.1109/ICRA48506.2021.9561675]
Shivakumar S S, Rodrigues N, Zhou A, Miller I D, Kumar V and Taylor C J. 2020. PST900: RGB-thermal calibration, dataset and segmentation network//Proceedings of 2020 IEEE International Conference on Robotics and Automation. Paris, France: IEEE: 9441-9447 [DOI: 10.1109/ICRA40945.2020.9196831]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2022-04-21]. https://arxiv.org/pdf/1409.1556.pdf
Sun L, Yang K L, Hu X X, Hu W J and Wang K W. 2020. Real-time fusion network for RGB-D semantic segmentation incorporating unexpected obstacle detection for road-driving images. IEEE Robotics and Automation Letters, 5(4): 5558-5565 [DOI: 10.1109/LRA.2020.3007457]
Sun Y X, Zuo W X and Liu M. 2019. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robotics and Automation Letters, 4(3): 2576-2583 [DOI: 10.1109/LRA.2019.2904733]
Sun Y X, Zuo W X, Yun P, Wang H L and Liu M. 2021. FuseSeg: semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Transactions on Automation Science and Engineering, 18(3): 1000-1011 [DOI: 10.1109/TASE.2020.2993143]
Vertens J, Zürn J and Burgard W. 2020. HeatNet: bridging the day-night domain gap in semantic segmentation with thermal images//Proceedings of 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems. Las Vegas, USA: IEEE: 8461-8468 [DOI: 10.1109/IROS45743.2020.9341192]
Wang J H, Wang Z H, Tao D C, See S and Wang G. 2016. Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 664-679 [DOI: 10.1007/978-3-319-46454-1_40]
Wang W Y and Neumann U. 2018. Depth-aware CNN for RGB-D segmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 144-161 [DOI: 10.1007/978-3-030-01252-6_9]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Xu J T, Lu K G and Wang H. 2021. Attention fusion network for multi-spectral semantic segmentation. Pattern Recognition Letters, 146: 179-184 [DOI: 10.1016/j.patrec.2021.03.015]
Yue Y C, Zhou W J, Lei J S and Yu L. 2021. Two-stage cascaded decoder for semantic segmentation of RGB-D images. IEEE Signal Processing Letters, 28: 1115-1119 [DOI: 10.1109/LSP.2021.3084855]
Zhang Y F, Sidibé D, Morel O and Mériaudeau F. 2021a. Deep multimodal fusion for semantic image segmentation: a survey. Image and Vision Computing, 105: #104042 [DOI: 10.1016/j.imavis.2020.104042]
Zhang K, Feng X H, Guo Y R, Su Y K, Zhao K, Zhao Z B, Ma Z Y and Ding Q L. 2021. Overview of deep convolutional neural networks for image classification. Journal of Image and Graphics, 26(10): 2305-2325 [DOI: 10.11834/jig.200302]
Zhang Q, Zhao S L, Luo Y J, Zhang D W, Huang N C and Han J G. 2021b. ABMDRNet: adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 2633-2642 [DOI: 10.1109/CVPR46437.2021.00266]
Zhang G D, Xue J H, Xie P W, Yang S F and Wang G J. 2021c. Non-local aggregation for RGB-D semantic segmentation. IEEE Signal Processing Letters, 28: 658-662 [DOI: 10.1109/LSP.2021.3066071]
Zheng Z J, Xie D H, Chen C L and Zhu Z Q. 2020. Multi-resolution cascaded network with depth-similar residual module for real-time semantic segmentation on RGB-D image//Proceedings of 2020 IEEE International Conference on Networking, Sensing and Control. Nanjing, China: IEEE: 1-6 [DOI: 10.1109/ICNSC48988.2020.9238079]
Zhou H, Qi L, Wan Z L, Huang H and Yang X. 2020. RGB-D co-attention network for semantic segmentation//Proceedings of the 15th Asian Conference on Computer Vision. Kyoto, Japan: Springer: 519-536 [DOI: 10.1007/978-3-030-69525-5_31]
Zhou W J, Dong S H, Xu C E and Qian Y G. 2021a. Edge-aware guidance fusion network for RGB thermal scene parsing [EB/OL]. [2022-04-21]. https://arxiv.org/pdf/2112.05144.pdf
Zhou W J, Lin X Y, Lei J S, Yu L and Hwang J N. 2022. MFFENet: multiscale feature fusion and enhancement network for RGB-thermal urban road scene parsing. IEEE Transactions on Multimedia, 24: 2526-2538 [DOI: 10.1109/TMM.2021.3086618]
Zhou W J, Liu J F, Lei J S, Yu L and Hwang J N. 2021b. GMNet: graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation. IEEE Transactions on Image Processing, 30: 7790-7802 [DOI: 10.1109/TIP.2021.3109518]
Zhou W J, Yuan J Z, Lei J S and Luo T. 2021c. TSNet: three-stream self-attention network for RGB-D indoor semantic segmentation. IEEE Intelligent Systems, 36(4): 73-78 [DOI: 10.1109/MIS.2020.2999462]