Semantic segmentation with an encoder-decoder structure

Han Huihui1, Li Weitao1,2, Wang Jianping1, Jiao Dian1, Sun Baishun1 (1. School of Electrical Engineering and Automation, Hefei University of Technology, Hefei 230009, China; 2. State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110004, China)

Abstract
Objective Semantic segmentation is a challenging task in computer vision whose core is to assign a semantic class label to every pixel in an image. However, a lack of rich multiscale information and of sufficient spatial information severely degrades segmentation results. To further improve segmentation performance, starting from the extraction of rich multiscale information and sufficient spatial information, this paper proposes a semantic segmentation model based on an encoder-decoder structure. Method The ResNet-101 network serves as the backbone for extracting feature maps, and a multiscale information fusion module is attached to the end of the backbone to extract discriminative feature maps rich in multiscale information in the deep stage of the network. In addition, a spatial information capturing module is introduced in the shallow stage of the network to extract rich spatial information. The feature maps with rich spatial information captured by the spatial information capturing module and the discriminative, multiscale-rich feature maps extracted by the multiscale information fusion module are fused into a new, information-rich feature map set; after refinement by a multikernel convolution block, a data-dependent upsampling (DUpsampling) operation finally produces the segmentation result. Result Extensive experiments on two public datasets (Cityscapes and PASCAL VOC 2012) verify the effectiveness of each designed module and of the entire model. The new model is compared with ten state-of-the-art methods. On the Cityscapes dataset, it improves the mean intersection over union (mIoU) by 0.52%, 3.72%, and 4.42% over the RefineNet, DeepLabv2-CRF, and LRR (Laplacian reconstruction and refinement) models, respectively; on the PASCAL VOC 2012 dataset, it improves mIoU by 6.23%, 7.43%, and 8.33% over the Piecewise, DPN (deep parsing network), and GCRF (Gaussian conditional random field network) models, respectively. Conclusion The proposed semantic segmentation model extracts richer multiscale and spatial information, yielding more accurate segmentation results. It can be applied in fields such as medical image analysis, autonomous driving, and unmanned aerial vehicles.
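To make the multiscale information fusion idea above concrete, the following is a minimal sketch, assuming PyTorch; all names are hypothetical. Parallel convolution branches with growing dilation rates stand in for the paper's Kronecker convolutions, and a squeeze-and-excitation-style channel gate stands in for its global attention module. It illustrates the technique, not the authors' implementation.

# Illustrative sketch only: standard dilated convolutions stand in for the
# paper's Kronecker convolutions; a channel gate stands in for its global
# attention module. Module and parameter names are hypothetical.
import torch
import torch.nn as nn

class MultiScaleFusionSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 2, 4, 8)):
        super().__init__()
        # Parallel branches with growing dilation enlarge the receptive
        # field to gather multiscale context.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates
        ])
        # Global attention stand-in: pool to a channel descriptor, then
        # reweight the fused features to highlight discriminative channels.
        fused = out_ch * len(rates)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(fused // 4, fused, 1), nn.Sigmoid())
        self.project = nn.Conv2d(fused, out_ch, 1, bias=False)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        feats = feats * self.attn(feats)   # channel-wise reweighting
        return self.project(feats)

# Usage on dummy final-stage backbone features (2048 channels in ResNet-101):
x = torch.randn(1, 2048, 64, 64)
y = MultiScaleFusionSketch(2048, 256)(x)   # -> (1, 256, 64, 64)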
Semantic segmentation with an encoder-decoder structure

Han Huihui1, Li Weitao1,2, Wang Jianping1, Jiao Dian1, Sun Baishun1 (1. School of Electrical Engineering and Automation, Hefei University of Technology, Hefei 230009, China; 2. State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110004, China)

Abstract
Objective Semantic segmentation, a challenging task in computer vision, aims to assign a semantic class label to every pixel in an image. It is widely applied in many fields, such as autonomous driving, obstacle detection, medical image analysis, 3D geometry, environment modeling, indoor-environment reconstruction, and 3D semantic segmentation. Despite the many achievements in semantic segmentation, two challenges remain: 1) the lack of rich multiscale information and 2) the loss of spatial information. Starting from capturing rich multiscale information and extracting affluent spatial information, a new semantic segmentation model is proposed that greatly improves segmentation results. Method The new model is built on an encoder-decoder structure, which effectively promotes the fusion of high-level semantic information and low-level spatial information. The architecture is elaborated as follows. First, in the encoder, the ResNet-101 network is used as the backbone to capture feature maps. Its last two blocks use atrous convolutions with rates of 2 and 4, which reduces the loss of spatial resolution. A multiscale information fusion module is designed in the encoder to capture feature maps with rich multiscale and discriminative information in the deep stage of the network. In this module, following the expansion and stacking principle, Kronecker convolutions are arranged in a parallel structure to expand the receptive field and extract multiscale information, and a global attention module selectively highlights discriminative information in the feature maps captured by the Kronecker convolutions. Subsequently, a spatial information capturing module is introduced as the decoder in the shallow stage of the network. This module supplements affluent spatial information, compensating for the spatial resolution loss caused by the repeated combination of max-pooling and striding at consecutive layers of ResNet-101; it also effectively strengthens the relationships between widely separated spatial regions. The feature maps with rich multiscale and discriminative information captured by the multiscale information fusion module in the deep stage and the feature maps with affluent spatial information captured by the spatial information capturing module are fused into a new feature map set that is full of effective information. Afterward, a multikernel convolution block refines these feature maps; in this block, two convolutions with kernel sizes of 3×3 and 5×5 run in parallel. The refined feature maps are fed to a data-dependent upsampling (DUpsampling) operator to obtain the final prediction. Bilinear upsampling is replaced with DUpsampling because the latter can not only exploit the redundancy in the segmentation label space but also effectively recover the pixel-wise prediction; arbitrary low-level feature maps can thus be safely downsampled to the lowest feature-map resolution and then fused to produce the final prediction.
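To illustrate the decoder head just described, here is a minimal sketch, assuming PyTorch; names are hypothetical, and DUpsampling is reduced to its functional form (a learned per-pixel linear projection followed by a depth-to-space rearrangement), not the authors' label-space-trained implementation.

# A minimal sketch, assuming PyTorch. The multikernel block follows the
# description above (parallel 3x3 and 5x5 convolutions); the DUpsampling
# step is approximated by a learned 1x1 projection plus depth-to-space
# reshape. Class and variable names are hypothetical.
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv5 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, padding=2, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        # Parallel kernels refine the fused features at two scales.
        return self.conv3(x) + self.conv5(x)

class DUpsamplingSketch(nn.Module):
    """Learned upsampling: project each low-resolution feature vector to an
    r x r patch of class scores, then rearrange to full resolution."""
    def __init__(self, in_ch: int, num_classes: int, ratio: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, num_classes * ratio * ratio, 1)
        self.shuffle = nn.PixelShuffle(ratio)

    def forward(self, x):
        return self.shuffle(self.proj(x))

# Usage on dummy 1/16-resolution features with 21 classes (PASCAL VOC):
feats = torch.randn(1, 256, 32, 32)
logits = DUpsamplingSketch(256, 21, 16)(MultiKernelBlock(256, 256)(feats))
print(logits.shape)   # torch.Size([1, 21, 512, 512])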
Result To prove the effectiveness of the proposals, extensive experiments are conducted on two public datasets: PASCAL VOC 2012 and Cityscapes. We first conduct several ablation studies on the PASCAL VOC 2012 dataset to evaluate the effectiveness of each module and then perform contrast experiments on the PASCAL VOC 2012 and Cityscapes datasets against existing approaches, such as the FCN (fully convolutional network), FRRN (full-resolution residual networks), DeepLabv2, CRF-RNN (conditional random fields as recurrent neural networks), DeconvNet, GCRF (Gaussian conditional random field network), DeepLabv2-CRF, Piecewise, Dilation10, DPN (deep parsing network), LRR (Laplacian reconstruction and refinement), and RefineNet models, to verify the effectiveness of the entire architecture. On the Cityscapes dataset, our model improves the mean intersection over union (mIoU) by 0.52%, 3.72%, and 4.42% over the RefineNet, DeepLabv2-CRF, and LRR models, respectively. On the PASCAL VOC 2012 dataset, it improves mIoU by 6.23%, 7.43%, and 8.33% over the Piecewise, DPN, and GCRF models, respectively. Visualization results on the Cityscapes and PASCAL VOC 2012 datasets further demonstrate the superiority of the proposals. Conclusion Experimental results show that our model outperforms several state-of-the-art approaches and dramatically improves semantic segmentation results. The model has great application value in many fields, such as medical image analysis, autonomous driving, and unmanned aerial vehicles.
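For reference, the mIoU figures above are the mean of per-class intersection over union computed from a confusion matrix; the following is a small self-contained sketch (not the authors' evaluation code).

# Self-contained mIoU sketch in NumPy; function name is hypothetical.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
             ignore_index: int = 255) -> float:
    """Mean IoU over the classes present in the ground truth."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - inter
    ious = inter[union > 0] / union[union > 0]   # skip absent classes
    return float(ious.mean())

# Example: two 4x4 label maps with 3 classes.
gt = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 2, 2], [2, 2, 2, 2]])
pr = np.array([[0, 0, 1, 0], [0, 0, 1, 1], [2, 2, 2, 2], [2, 1, 2, 2]])
print(round(mean_iou(pr, gt, 3), 3))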