Objective Semantic segmentation is a challenging task in computer vision whose core is to assign a semantic class label to every pixel in an image. In semantic segmentation, however, a lack of rich multi-scale information and of sufficient spatial information both seriously degrade segmentation results. To further improve segmentation, starting from the extraction of rich multi-scale information and sufficient spatial information, this paper proposes a semantic segmentation model based on an encoder-decoder structure. Method First, a ResNet-101 network with atrous convolutions serves as the backbone to extract feature maps. At the end of the backbone, a multi-scale information fusion module is attached, composed of Kronecker convolution blocks in a parallel structure and a global channel attention module; it extracts discriminative feature maps rich in multi-scale information in the deep stage of the network. A spatial information capturing module is then introduced in the shallow stage to extract rich spatial information. The spatially rich feature maps captured by the spatial information capturing module and the discriminative, multi-scale-rich feature maps extracted by the multi-scale information fusion module are fused into a new, information-rich set of feature maps, which is refined by a multi-kernel convolution block; the final segmentation result is obtained by a data-dependent upsampling (DUpsampling) operation. Result Extensive experiments on two public datasets (Cityscapes and PASCAL VOC 2012) verify the effectiveness of each designed module and of the whole model. The new model is also compared with ten recent methods: on Cityscapes, mIoU improves by 0.52% over RefineNet, by 3.72% over DeepLabv2-CRF, and by 4.42% over LRR; on PASCAL VOC 2012, mIoU improves by 6.23% over Piecewise, by 7.43% over DPN, and by 8.33% over GCRF. Conclusion The proposed semantic segmentation model extracts richer multi-scale and spatial information, yielding more accurate segmentation results.
Semantic segmentation based on encoder-decoder structure
Han Huihui, Li Weitao, Wang Jianping, Jiao Dian, Sun Baishun
Objective Semantic segmentation, a challenging task in computer vision, aims to assign a semantic class label to every pixel in an image. It is widely applied in fields such as autonomous driving, obstacle detection, medical image analysis, 3D geometry, environment modeling, indoor-environment reconstruction, and 3D semantic segmentation. Despite the many achievements in semantic segmentation, two challenges remain: 1) the lack of rich multi-scale information; 2) the loss of spatial information. Starting from capturing rich multi-scale information and extracting affluent spatial information, a new semantic segmentation model that greatly improves segmentation results is proposed. Method The new model is built on an encoder-decoder structure, which effectively promotes the fusion of high-level semantic information and low-level spatial information. The whole architecture is elaborated as follows. First, in the encoder, the ResNet-101 network is taken as the backbone to capture feature maps. In this network, the last two blocks use atrous convolutions with different dilation rates, which reduce the loss of spatial resolution to some extent. Then, a multi-scale information fusion module is designed in the encoder to capture feature maps with rich multi-scale and discriminative information in the deep stage of the network. In this module, applying the expansion-and-stacking principle, Kronecker convolutions are arranged in a parallel structure to expand the receptive field and extract multi-scale information. A global attention module is also applied to selectively highlight discriminative information in the feature maps captured by the Kronecker convolutions. Next, a spatial information capturing module is introduced as the decoder in the shallow stage of the network.
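The abstract gives no implementation details for the multi-scale information fusion module. Assuming the standard formulations of Kronecker-product kernel expansion and squeeze-style global channel attention (the function names and the toy shapes below are illustrative, not from the paper), its two ingredients can be sketched in NumPy as:

```python
import numpy as np

# Kronecker convolution (sketch): a small base kernel is expanded via a
# Kronecker product with a sparse pattern, enlarging the receptive field
# while still responding inside the holes a plain dilated kernel would skip.
# r1 controls the expansion factor, r2 the filled sub-block per cell.
def kronecker_expand(kernel, r1=3, r2=1):
    pattern = np.zeros((r1, r1))
    pattern[:r2, :r2] = 1.0          # keep the top-left r2 x r2 sub-block
    return np.kron(kernel, pattern)  # (k*r1, k*r1) expanded kernel

# Global channel attention (sketch): squeeze each channel to one scalar by
# global average pooling, gate it with a sigmoid, and reweight the maps so
# that discriminative channels are selectively highlighted.
def channel_attention(feats):                      # feats: (C, H, W)
    squeeze = feats.mean(axis=(1, 2))              # (C,) per-channel descriptor
    gate = 1.0 / (1.0 + np.exp(-squeeze))          # sigmoid gating weights
    return feats * gate[:, None, None]             # reweighted feature maps

base = np.ones((3, 3))
expanded = kronecker_expand(base, r1=3, r2=1)
print(expanded.shape)   # (9, 9): a 3x3 kernel now covers a 9x9 neighborhood
```

In the model, several such expanded kernels with different factors run in parallel, which is how the branches see different scales before the attention reweighting.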
The spatial information capturing module supplements affluent spatial information, making up for the spatial-resolution loss caused by the repeated combination of max-pooling and striding at consecutive layers in ResNet-101. Moreover, it effectively enhances the relationships between widely separated spatial regions. Note that the feature maps with rich multi-scale and discriminative information captured by the multi-scale information fusion module in the deep stage and the feature maps with affluent spatial information captured by the spatial information capturing module are fused into a new set of feature maps full of effective information. After that, a multi-kernel convolution block is utilized to refine these feature maps; it contains two parallel convolutions with kernel sizes of 3×3 and 5×5, respectively. The refined feature maps are then fed to a data-dependent upsampling (DUpsampling) operator, which produces the final prediction. DUpsampling replaces bilinear-interpolation upsampling because it not only exploits the redundancy in the segmentation label space but also recovers the pixel-wise prediction effectively. Arbitrary low-level feature maps can thus be safely downsampled to the lowest feature-map resolution and fused to produce the final prediction. Result To prove the effectiveness of the proposals, extensive experiments have been conducted on two public datasets: the PASCAL VOC 2012 dataset and the Cityscapes dataset.
We first conduct several ablation studies on the PASCAL VOC 2012 dataset to evaluate the effectiveness of each module, then perform contrast experiments on the PASCAL VOC 2012 and Cityscapes datasets against existing approaches, such as the FCN, FRRN, DeepLabv2, CRF-RNN, DeconvNet, GCRF, DeepLabv2-CRF, Piecewise, Dilation10, DPN, LRR, and RefineNet models, to verify the effectiveness of the whole architecture. On the Cityscapes dataset, our model improves mIoU by 0.52% over the RefineNet model, by 3.72% over the DeepLabv2-CRF model, and by 4.42% over the LRR model. On the PASCAL VOC 2012 dataset, it improves mIoU by 6.23% over the Piecewise model, by 7.43% over the DPN model, and by 8.33% over the GCRF model. Several visualization results on the Cityscapes and PASCAL VOC 2012 datasets are also provided to demonstrate the superiority of the proposals. Conclusion The experimental results show that our model outperforms several state-of-the-art semantic segmentation approaches and improves segmentation results significantly.
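The DUpsampling operator described in the Method section also admits a compact sketch. Assuming the published formulation (a learned linear map that projects each low-resolution feature vector to an r×r patch of class scores, then rearranges the patches pixel-shuffle style), a minimal NumPy version with illustrative shapes would be:

```python
import numpy as np

# DUpsampling (sketch): instead of bilinear interpolation, a learned matrix W
# maps each low-resolution feature vector to an r x r block of class scores;
# the blocks are then interleaved back to full resolution. W is trained to
# invert a linear compression of the segmentation label space.
def dupsample(feats, W, ratio, num_classes):
    # feats: (H, W, C); W: (C, ratio*ratio*num_classes)
    H, Wd, C = feats.shape
    patches = feats.reshape(-1, C) @ W                   # (H*W, r*r*classes)
    patches = patches.reshape(H, Wd, ratio, ratio, num_classes)
    out = patches.transpose(0, 2, 1, 3, 4)               # interleave rows/cols
    return out.reshape(H * ratio, Wd * ratio, num_classes)

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 16, 64))                # low-res feature maps
W = rng.standard_normal((64, 4 * 4 * 21))                # e.g. 21 VOC classes
print(dupsample(feats, W, 4, 21).shape)   # (64, 64, 21)
```

Because the upsampling is a fixed linear reconstruction, low-level features can be downsampled to the lowest feature-map resolution before fusion, as the Method section notes, without losing the ability to recover the full-resolution prediction.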