赵什陆, 张强 (School of Mechano-Electronic Engineering, Xidian University, Xi'an 710071, China)
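Image semantic segmentation aims to decompose a visual scene into entities of different semantic categories and to predict the category of every pixel in an image. By jointly exploiting the complementary characteristics of images from different modalities (i.e., images acquired by sensors based on different imaging mechanisms), multi-modal image semantic segmentation can learn and reason about complex scene information comprehensively and accurately. Many leading-edge deep-learning-based results on multi-modal image semantic segmentation are now available, but a systematic and comprehensive survey and analysis is lacking. This paper first summarizes and analyzes the current mainstream deep-learning-based semantic segmentation algorithms for visible-thermal infrared (red-green-blue-thermal, RGB-T) images and visible-depth (red-green-blue-depth, RGB-D) images. According to their different emphases, deep-learning-based RGB-T semantic segmentation algorithms are divided into image-feature-enhancement-based methods, multi-modal-image-feature-fusion-based methods, and multi-level-image-feature-interaction-based methods; according to how they use depth information, deep-learning-based RGB-D semantic segmentation algorithms are divided into depth-information-extraction-based methods and depth-information-guidance-based methods. Next, the objective evaluation metrics and datasets commonly used for multi-modal image semantic segmentation are introduced, and the above algorithms are compared on these datasets. For RGB-T semantic segmentation, on the MFNet (multi-spectral fusion network) dataset, GMNet (graded-feature multilabel-learning network) and MFFENet (multiscale feature fusion and enhancement network) achieve the best mean intersection-over-union per class (mIoU) (57.3%) and mean accuracy per class (mAcc) (74.3%), respectively. On the PST900 (PENN subterranean thermal 900) dataset, GMNet again achieves the best mIoU (84.12%), while EGFNet achieves the best mAcc (94.02%). For RGB-D semantic segmentation, on the NYUD v2 (New York University depth dataset v2) dataset, GLPNet (global-local propagation network) reaches an mIoU of 54.6% and an mAcc of 66.6%, the best performance. On the SUN-RGBD (scene understanding-RGB-D) dataset, Zig-Zag achieves the best mIoU (51.8%) and GLPNet the best mAcc (63.3%). Finally, possible future directions for multi-modal image semantic segmentation are pointed out.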
Progress in multi-modal image semantic segmentation based on deep learning
Unlike low-level vision tasks such as image deraining, dehazing, and deblurring, semantic segmentation aims to decompose a visual scene into different semantic category entities and to predict the category of each pixel in an image, which makes it indispensable in many scene understanding systems. Most existing semantic segmentation models use visible red-green-blue (RGB) images to perceive scene contents. However, visible cameras are not robust to changing illumination and cannot penetrate smoke, fog, haze, rain, or snow. Limited by their imaging mechanism, they can hardly capture sufficient and effective scene information under poor lighting and adverse weather conditions. Furthermore, they cannot provide the spatial structures and 3D layouts of scenes, which prevents them from handling complex scenes with similar target appearances or multiple changing scene areas. In recent years, with the continuous development of sensor technologies, thermal infrared and depth cameras have been widely used in military and civil fields. Compared with visible cameras, depth cameras can acquire the physical distances between objects in the scene and the optical centers of the sensors, while thermal infrared cameras capture the thermal radiation of objects whose temperatures exceed absolute zero (-273 ℃) under various lighting and weather conditions, thus providing rich contour and semantic cues. However, depth and thermal infrared images usually lack colors, textures, and other details. Because unimodal images can hardly provide complete information about complex scenes, multi-modal image semantic segmentation combines the complementary characteristics of images from different modalities (i.e., images acquired by sensors based on different imaging mechanisms) to achieve comprehensive and accurate predictions.
At present, there are many leading-edge works on multi-modal image semantic segmentation based on deep learning, but comprehensive reviews are scarce. In this paper, we provide a systematic review of recent advances in multi-modal image semantic segmentation, including red-green-blue-thermal (RGB-T) and red-green-blue-depth (RGB-D) semantic segmentation. First, we summarize and analyze the current mainstream deep-learning-based algorithms for RGB-T and RGB-D semantic segmentation. Specifically, according to their different emphases, RGB-T semantic segmentation models based on deep learning are divided into three categories, namely, image-feature-enhancement-based methods, multi-modal-image-feature-fusion-based methods, and multi-level-image-feature-interaction-based methods. In image-feature-enhancement-based methods, unimodal or multi-modal fused image features are directly or indirectly enhanced by employing attention mechanisms and embedding auxiliary information. These models aim to mitigate the influence of interfering information and to mine highly discriminative information from unimodal or multi-modal fused image features, thus improving semantic segmentation accuracy. Multi-modal-image-feature-fusion-based methods mainly focus on how to effectively exploit the complementary characteristics of RGB and thermal infrared features so as to make full use of the advantages of multi-modal images. Feature fusion across modalities is specific to multi-modal image semantic segmentation and has no counterpart in unimodal image semantic segmentation. Therefore, most existing RGB-T semantic segmentation methods are dedicated to designing fusion modules for integrating multi-modal image features. Receptive fields of different scales can capture information about objects of different sizes in a scene.
With this in mind, the interactions among multi-level image features can help capture rich multi-scale contextual information that can significantly boost the performance of semantic segmentation models, especially in scenes containing multi-scale objects. Multi-level-image-feature-interaction-based methods, such as non-local networks and DeepLab, have been widely used in unimodal image semantic segmentation. Similarly, some works have adopted these methods in RGB-T semantic segmentation and achieved satisfying results. In turn, according to how they exploit depth information, RGB-D semantic segmentation models based on deep learning are divided into depth-information-extraction-based methods and depth-information-guidance-based methods, with the former being further subdivided into multi-modal-image-feature-fusion-based methods and contextual-information-mining-based methods. Similar to RGB-T semantic segmentation methods, depth-information-extraction-based methods treat the depth and RGB images as two separate inputs and capture the discriminative information within RGB and depth features by extracting and fusing unimodal image features. In depth-information-guidance-based methods, depth information is embedded into the feature extraction of RGB images. By doing so, depth-information-guidance-based methods can make full use of the 3D information provided by depth images and reduce their model sizes to some extent. Second, we introduce some widely used evaluation criteria and public datasets for RGB-D/RGB-T semantic segmentation and compare and analyze the performance of various models. Specifically, for RGB-T semantic segmentation, the graded-feature multilabel-learning network (GMNet) and the multiscale feature fusion and enhancement network (MFFENet) achieve the best performance in terms of mean intersection over union per class (mIoU) (57.3%) and mean accuracy per class (mAcc) (74.3%), respectively, on the MFNet dataset.
Meanwhile, on the PST900 dataset, GMNet still achieves the best performance in mIoU (84.12%), but EGFNet achieves the best performance in mAcc (94.02%). For RGB-D semantic segmentation, GLPNet achieves the best performance in mIoU (54.6%) and mAcc (66.6%) on the NYUD v2 dataset. On the SUN-RGBD dataset, GLPNet achieves the best performance in mAcc (63.3%), but Zig-Zag achieves the best results in mIoU (51.8%). We also point out some future development directions for multi-modal image semantic segmentation.
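
For readers unfamiliar with the two criteria reported above, the following is a minimal sketch of how mIoU and mAcc are commonly computed from a pixel-level confusion matrix. It is an illustration under stated assumptions, not the evaluation code of any surveyed model: the function names `confusion_matrix` and `miou_macc` are hypothetical, labels are assumed to be flattened integer class indices, and classes that never occur are assumed to be skipped when averaging (some benchmarks instead average over all classes).

```python
from typing import List, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index the ground-truth class, columns the prediction."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def miou_macc(cm: List[List[int]]) -> Tuple[float, float]:
    """Return (mIoU, mAcc) averaged over classes.

    Assumption: a class is skipped for IoU when it appears in neither the
    ground truth nor the prediction, and skipped for accuracy when it has
    no ground-truth pixels.
    """
    n = len(cm)
    ious, accs = [], []
    for c in range(n):
        tp = cm[c][c]                            # correctly labeled pixels of class c
        gt = sum(cm[c])                          # pixels whose true class is c
        pred = sum(cm[r][c] for r in range(n))   # pixels predicted as class c
        union = gt + pred - tp
        if union > 0:
            ious.append(tp / union)              # per-class intersection over union
        if gt > 0:
            accs.append(tp / gt)                 # per-class accuracy (recall)
    miou = sum(ious) / len(ious) if ious else 0.0
    macc = sum(accs) / len(accs) if accs else 0.0
    return miou, macc


if __name__ == "__main__":
    # Toy example: six pixels, three classes.
    truth = [0, 0, 1, 1, 2, 2]
    preds = [0, 1, 1, 1, 2, 0]
    miou, macc = miou_macc(confusion_matrix(truth, preds, 3))
    print(f"mIoU = {miou:.3f}, mAcc = {macc:.3f}")
```

In this toy example the per-class IoU values are 1/3, 2/3, and 1/2, and the per-class accuracies are 1/2, 1, and 1/2, so mIoU is lower than mAcc. The same ordering appears in the results quoted above, because IoU also penalizes false positives whereas per-class accuracy does not.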