Published: 2019-12-16
DOI: 10.11834/jig.190087
2019 | Volume 24 | Number 12

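Image Analysis and Recognition

Semantic segmentation combining context features with CNN multi-layer feature fusion

Luo Huilan, Zhang Yun

School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China

Received: 2019-04-01; Revised: 2019-06-22; Preprint: 2019-06-29

Funding: National Natural Science Foundation of China (61862031, 61462035); Natural Science Foundation of Jiangxi Province (20171BAB202014); "Science and Technology Innovation Talent Plan" Project of Ganzhou, Jiangxi Province

First author: Luo Huilan, born in 1974, female, professor. Her main research interests are machine learning and pattern recognition. E-mail: luohuilan@sina.com;
Zhang Yun, female, master's student. Her main research interest is semantic segmentation. E-mail: 1040344705@qq.com.

CLC number: TP391.41

Document code: A

Article ID: 1006-8961(2019)12-2200-10

Abstract

Objective Region-based semantic segmentation methods tend to lose detail information, which leads to coarse and inaccurate segmentation results. To address this problem, a semantic segmentation method that combines context features with the fusion of multi-layer convolutional neural network (CNN) features is proposed. Method First, selective search generates candidate regions of different scales from the image and yields the region feature masks. Second, a CNN extracts the features of each region, and high-level and low-level features are fused in parallel. Because the feature maps extracted by different layers differ in size, the RefineNet model is used to fuse feature maps of different resolutions. Finally, the region feature masks and the fused feature map are fed into a free-form region-of-interest pooling layer, and a softmax classification layer produces the pixel-level class labels of the image. Result Using context features and CNN multi-layer feature fusion as the basic framework gives good performance. The experiments cover CNN multi-layer feature fusion, the combination of background information with fused features, and the influence of the dropout value on the results. On the Siftflow dataset, pixel accuracy reaches 82.3% and average accuracy reaches 63.1%. Compared with the current region-based end-to-end semantic segmentation model, pixel accuracy increases by 10.6 percentage points and average accuracy by 0.6 percentage points. Conclusion The proposed algorithm combines the foreground information and the context information of each region and makes full use of the region's context. Dropout is used to reduce the number of network parameters and avoid overfitting, and RefineNet fuses the multi-layer CNN features so that the multi-layer detail information of the image is used effectively for segmentation. This strengthens the model's ability to discriminate small objects within regions and gives good segmentation results on images with occlusion and complex backgrounds.

Key words

semantic segmentation; convolutional neural network; feature fusion; selective search; RefineNet model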

Semantic segmentation method with combined context features with CNN multi-layer features

Luo Huilan, Zhang Yun

School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China

Supported by: National Natural Science Foundation of China (61862031, 61462035); Natural Science Foundation of Jiangxi Province, China (20171BAB202014)

Abstract

Objective Semantic segmentation plays an increasingly important role in visual analysis. It combines image classification, object detection, and image segmentation, and it classifies every pixel in an image. Semantic segmentation divides an image into regions with semantic meaning and identifies the semantic category of each region. It thereby realizes semantic inference from low to high levels and yields a segmented image with pixel-by-pixel semantic annotation. Semantic segmentation methods based on candidate regions extract free-form regions from the image, describe their features, classify the regions, and convert the region-based predictions into pixel-level predictions. Although candidate-region-based models have contributed to the development of semantic segmentation, they need to generate many candidate regions, which requires a large amount of time and memory. In addition, the quality of the candidate regions extracted by different algorithms and the lack of spatial information in the candidate regions, especially the loss of information on small objects, directly affect the final segmentation. To address the coarse results and low accuracy of region-based semantic segmentation methods caused by the loss of detail information, this study proposes a semantic segmentation method that fuses context features and multi-layer convolutional neural network (CNN) features.
Method First, candidate regions of different scales are generated from an image by selective search. Selective search uses graph-based image segmentation to generate sub-regions, iteratively merges them according to their similarity (color, texture, size, and spatial overlap), and outputs all regions that may contain an object. Each candidate region consists of three parts: a square bounding box, a foreground mask, and the foreground size. The foreground mask is a binary mask that marks the foreground of the candidate region. Multiplying the square region features on each channel with the corresponding foreground mask yields the foreground features of the region. Second, a CNN extracts the features of each region, and the high- and low-level features are fused in parallel. Parallel fusion combines the features of the same data according to a certain rule, and the features must have the same dimension before they are combined. Because the feature maps extracted from different layers differ in size, the features obtained by each convolutional layer are reduced with linear discriminant analysis (LDA). LDA selects a projection hyperplane in the multi-dimensional space such that projections of the same category are as close as possible and projections of different categories are as far apart as possible. The output dimension of LDA depends only on the number of categories and not on the dimension of the data. The image dataset used in this work contains 33 categories, so LDA reduces the feature dimension to 32, which decreases the number of network parameters. Moreover, as a supervised algorithm, LDA makes good use of prior class knowledge. Experimental results show that dimension reduction may lose some feature information but does not harm the segmentation result. After dimension reduction, the distance between different categories increases and the distance within a category decreases, which makes classification easier. The RefineNet model is then used to fuse feature maps with different resolutions; five feature map resolutions are fused in this work. RefineNet consists of three main components: adaptive convolution, multi-resolution fusion, and chained residual pooling. The multi-resolution fusion component adapts the input feature maps with a convolution layer, upsamples them, and adds them pixel by pixel. It counters the information loss caused by downsampling and lets the image features extracted by every layer contribute to the final segmentation network. Finally, the region feature masks and the fused feature map are fed into the free-form region-of-interest pooling layer, and the pixel-level class labels of the image are obtained through the softmax classification layer.
Result Combining context features with CNN multi-layer feature fusion gives good segmentation performance. The experiments cover CNN multi-layer feature fusion, the combination of background information and fused features, and the influence of the dropout value on the results. On the Siftflow dataset, the trained model achieves a pixel accuracy of 82.3% and an average accuracy of 63.1%. Compared with the current region-based end-to-end semantic segmentation model, pixel accuracy increases by 10.6 percentage points and average accuracy by 0.6 percentage points.
Conclusion A semantic segmentation algorithm that combines context features with CNN multi-layer features is proposed. The method combines the foreground and context information of each region to make full use of the region's context. Dropout is employed to reduce the number of network parameters and avoid overfitting, and RefineNet fuses the multi-layer features of the CNN. By effectively using the multi-layer detail information of the image, the model becomes better at discriminating small objects within regions, and segmentation improves for images with occlusion and complex backgrounds. The experimental results show that the proposed method has better segmentation performance and higher robustness than several state-of-the-art methods.

Key words

semantic segmentation; convolutional neural network (CNN); feature fusion; selective search; RefineNet model

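0 Introduction

Semantic segmentation is a key part of image understanding. Its goal is to assign a class label to every pixel in an image, and it forms the basis of high-level semantic image understanding. In recent years, deep learning methods, especially convolutional neural networks such as AlexNet proposed by Krizhevsky et al. (2012), VGG proposed by Simonyan and Zisserman (2014), GoogLeNet proposed by Szegedy et al. (2014), and ResNet proposed by He et al. (2015), have achieved remarkable results in visual recognition tasks. Semantic segmentation provides pixel-level semantic understanding of an image and parses a scene through the category, location, and shape of objects. Methods based on deep network models have achieved a series of breakthroughs in semantic segmentation.

Semantic segmentation methods can be roughly divided into two categories: methods based on candidate image regions and end-to-end fully convolutional methods.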

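Methods based on candidate regions first extract free-form regions from the image, describe these regions with features, and then train a region classifier. At test time, the region-based predictions are mapped to pixels, and each pixel is labeled according to the highest-scoring region that contains it. Among traditional methods, Carreira and Sminchisescu (2012) generated candidate regions with the constrained parametric min-cuts (CPMC) algorithm; a new pooling technique, second-order pooling, was then used to encode the local descriptors inside each region, and the geometric properties of Riemannian manifolds were exploited to describe the features of arbitrary regions, which gave good segmentation results. More recently, region features have been learned automatically by convolutional neural networks. Girshick et al. (2014) proposed the R-CNN (region convolutional neural network) model, which performs semantic segmentation on object detection results: the selective search algorithm of Uijlings et al. (2013) extracts a large number of candidate regions from the image, a CNN extracts the features of each candidate region, and a linear support vector machine classifies each region to produce the final semantically segmented image. In R-CNN, the features of the overlapping parts of candidate regions are extracted repeatedly, which makes the computation expensive. Building on R-CNN (Girshick et al., 2014), Girshick (2015) proposed Fast R-CNN, which feeds the candidate regions into the CNN and uses a region-of-interest pooling layer to map inputs of different sizes to a fixed-size feature vector. Features are extracted directly from the last convolutional layer of the CNN and classified with softmax, so the features of the same region need not be extracted several times, which increases speed. To reduce the time cost of generating candidate regions, Ren et al. (2015), building on Girshick (2015), proposed a model that combines Fast R-CNN with a region proposal network and generates candidate regions directly with sliding windows. This effectively improves speed and accuracy, but it cannot accurately focus on the region of interest inside a candidate region. Caesar et al. (2016) proposed a region-based end-to-end semantic segmentation method that uses a free-form region-of-interest (ROI) pooling layer to obtain the foreground features of candidate regions while combining the advantages of context information and free-form region representations, which greatly improves segmentation speed and accuracy.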

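Although candidate-region-based models have advanced semantic segmentation considerably, they need to generate a large number of candidate regions, and generating the candidate region set costs a great deal of time and memory. Jiang et al. (2017) pointed out that the quality of the candidate region sets extracted by different algorithms varies widely and directly affects the final segmentation. For this reason, recent work learns a direct pixel-to-pixel mapping: Xiao et al. (2019) used a fully convolutional semantic segmentation method, and Cao et al. (2019) used an end-to-end fully convolutional network. Farabet et al. (2013) proposed an early approach based on fully convolutional neural networks that trains a relatively shallow end-to-end network. To obtain better classification performance, more recent work uses deeper pretrained networks, and most of it fuses multi-layer convolutional features. Shelhamer et al. (2014) used the 8-layer AlexNet of Krizhevsky et al. (2012) as the base network of a fully convolutional segmentation model and defined a skip architecture that combines semantic information from deep layers with appearance information from shallow layers to produce accurate segmentation. However, this method produces blurred object boundaries, so the final semantic prediction is rather coarse. On this basis, Ronneberger et al. (2015) proposed U-net, a symmetric segmentation model that consists of a contracting path and a symmetric expanding path; the contracting path captures context information, and the expanding path precisely localizes segmentation boundaries. U-net is trained on image patches, so the amount of training data is far larger than the number of training images, which allows the network to remain robust with few samples. However, the patches centered on neighboring pixels overlap heavily, which makes much of the computation redundant, and a trade-off must be made between localization accuracy and semantic accuracy. Badrinarayanan et al. (2015) therefore proposed SegNet, which uses max pooling indices to preserve the contour information of the image. This reduces the number of parameters, yields more accurate boundary contours, is computationally efficient, and improves the resolution and the localization of segmentation boundaries. Likewise, to make full use of the multi-layer feature information of an image, Lin et al. (2016) proposed the RefineNet model, which combines residual connections with identity mappings and uses long-range residual connections to recover the pixel information lost during downsampling, producing high-resolution predictions. To fuse as much context information as possible, Ning et al. (2017) proposed the DeepLabv3 segmentation method, which adds global average pooling to atrous spatial pyramid pooling (ASPP) and batch normalization after the parallel atrous convolutions; it effectively captures global context information and improves segmentation. Borrowing from Ning et al. (2017), Li et al. (2018) proposed the pyramid attention network (PAN), which combines an attention mechanism with a spatial pyramid to extract precise pixel features and uses a global attention upsampling module on each decoder layer to guide low-level features in selecting spatial information, thereby obtaining better segmentation results.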

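In semantic segmentation, the information in every layer of a deep convolutional network affects the final result: high-level information identifies the category of image regions, whereas low-level information captures contours and fine boundary details. Inspired by Caesar et al. (2016) and Lin et al. (2016), this paper proposes a new semantic segmentation method that fuses multi-layer CNN features. The VGG16 network of Simonyan and Zisserman (2014) extracts image features, and the features extracted by each layer are then fused in parallel, which effectively improves segmentation. The main contributions are threefold: 1) When free-form object regions are extracted, the region foreground and the square region that contains the region's background are trained separately with a method that combines free-form foreground features and context features, so that context information is captured better. 2) Dropout is added to the fully connected classification layers to reduce the number of parameters in these layers; the feature maps of the region foreground and of the square region are concatenated and classified by the fully connected layers to obtain the predicted region category. 3) Because the features extracted by each layer of a convolutional neural network have different emphases and all contribute to the final segmentation, the feature maps of different resolutions extracted by the first five layers of VGG16 are fused with RefineNet.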

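1 Proposed method

This paper proposes a semantic segmentation model that combines context features with the fusion of multi-layer CNN features; the framework is shown in Fig. 1. The network has four parts. The first part uses selective search (Uijlings et al., 2013) to extract candidate regions: the graph-based image segmentation method of Felzenszwalb and Huttenlocher (2004) generates many sub-regions, which are merged iteratively according to their similarity (color, texture, size, and spatial overlap), and all regions where an object may exist are output. The second part uses a convolutional neural network to extract image features, with the first five layers of VGG16 as the base network; when regions are classified, free-form foreground features are combined with context features to better capture the semantic information of the region foreground. The third part uses RefineNet to fuse the feature maps extracted by different layers so that the image features extracted by every layer contribute to the final segmentation network. The fourth part performs region-to-pixel segmentation. It takes the candidate region masks from the first part and the fused feature map from the third part and produces multi-layer fused features that combine foreground and background information. These features are fed into the fully connected layers to obtain the class label of each region. Finally, a region-to-pixel layer maps the scores to pixels, and softmax produces the final segmented image.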
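The region-to-pixel step can be illustrated with a minimal Python sketch. It assumes that each pixel takes, for every class, the maximum score among the regions that contain it (in line with labeling a pixel by its highest-scoring region) and then applies softmax. The function names, the set-based masks, and the toy interface are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_to_pixel(region_scores, region_masks, height, width):
    """Map per-region class scores to a per-pixel label map.

    region_scores: one list of class scores per region.
    region_masks:  one set of (row, col) pixels per region (free-form mask).
    Assumption: a pixel takes the per-class maximum over all regions that
    contain it; pixels covered by no region keep the label None.
    """
    num_classes = len(region_scores[0])
    labels = [[None] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            best = [-math.inf] * num_classes
            covered = False
            for scores, mask in zip(region_scores, region_masks):
                if (r, c) in mask:
                    covered = True
                    best = [max(b, s) for b, s in zip(best, scores)]
            if covered:
                probs = softmax(best)
                labels[r][c] = max(range(num_classes), key=probs.__getitem__)
    return labels
```

Overlapping regions are resolved per class rather than per region, so a pixel inside two regions can inherit its winning class from either of them.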

Fig. 1 The framework of the proposed method

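1.1 Multi-layer feature fusion

Features from different convolutional layers express image information at different granularities and from different aspects. Fusing them lets their strengths complement each other, gives a more complete description of the image, and thus improves segmentation. There are two main ways to fuse features: serial fusion and parallel fusion. Serial fusion concatenates different features of the same data one after another, which projects vectors of different dimensions into a higher-dimensional space in a linear or nonlinear form, but it cannot flexibly adjust the weight of each feature in classification. Parallel fusion combines the features of the same data according to a certain rule, and the features must have the same dimension before they are combined. This paper fuses the hidden-layer features of the CNN convolutional layers and uses parallel fusion to combine the different feature maps. Because the number, size, and stride of the convolution kernels differ between layers, the feature dimensions extracted by the convolutional layers also differ. For this reason, the RefineNet method of Lin et al. (2016) is used to fuse the high-level and low-level features of the network. RefineNet has three main components: adaptive convolution, multi-resolution fusion, and chained residual pooling. This paper mainly uses the multi-resolution fusion component, which adapts every input feature map with a convolutional layer to the smallest feature map dimension, upsamples it, and then adds the maps pixel by pixel. Its main purpose is to fuse multiple resolutions and thereby counter the information loss caused by downsampling. The fusion process is shown in Fig. 2. The input image passes through different convolutional layers and downsampling operations, which yields feature maps of different resolutions. The features from each convolutional layer are first reduced in dimension with linear discriminant analysis (LDA), which selects a projection hyperplane in the multi-dimensional space such that projections of samples from the same class are as close as possible and projections of samples from different classes are as far apart as possible. The output dimension of LDA does not depend on the dimension of the data; it depends only on the number of classes. The image dataset used in this paper contains 33 classes, so LDA reduces the feature dimension to 32. On the one hand, this reduces the number of network parameters; on the other hand, LDA, as a supervised algorithm, makes good use of prior class knowledge. Dimension reduction may discard some feature information, but it did not harm the segmentation results in this paper. The reduced features increase the distance between different classes and decrease the distance within a class, which makes it easier to separate the classes. The reduced feature maps are then fused with RefineNet to obtain the fused feature map.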
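Why 33 classes lead to 32 dimensions can be made explicit with the standard multi-class LDA formulation. The notation below ($ C $ classes, class sizes $ {N_c} $, class means $ {\mu _c} $, global mean $ \mu $, between-class and within-class scatter matrices $ {S_B} $ and $ {S_W} $, projection matrix $ W $) is introduced here for illustration and does not appear in the original text.

```latex
S_B = \sum_{c=1}^{C} N_c \, (\mu_c - \mu)(\mu_c - \mu)^{\mathrm{T}},
\qquad
W^{*} = \arg\max_{W} \frac{\left| W^{\mathrm{T}} S_B W \right|}{\left| W^{\mathrm{T}} S_W W \right|}

\sum_{c=1}^{C} N_c \, (\mu_c - \mu) = 0
\;\Longrightarrow\;
\operatorname{rank}(S_B) \le C - 1

C = 33 \;\Longrightarrow\; \text{at most } 33 - 1 = 32 \text{ discriminant directions}
```

The $ C $ vectors $ {\mu _c} - \mu $ are linearly dependent because their weighted sum is zero, so the between-class scatter has rank at most $ C - 1 $ regardless of the dimension of the input features.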

Fig. 2 The illustration of the multi-layer feature map fusion

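1.2 RefineNet fusion model

To fuse the feature maps extracted by different layers, this paper uses the RefineNet network. The first five layers of VGG16 extract the feature maps, which are divided into five blocks by resolution for fusion. The fusion process is shown in Fig. 3. The leftmost column shows VGG16 divided into five blocks according to feature map resolution. The stride is set to 2, so each higher-level feature map is half the size of the one below it, which yields deeper semantic features. The outputs of the five convolutional layers are fed into RefineNet modules for fusion. The layer-5 feature map passes through RefineNet-1, which has only one input and serves to adjust the weights of the layer-5 feature map of the pretrained VGG16. The output of RefineNet-1 and the layer-4 feature map are fed into RefineNet-2, which uses the higher-resolution features to refine the low-resolution output of RefineNet-1. Similarly, the output of RefineNet-2 and the layer-3 feature map are fed into RefineNet-3, the output of RefineNet-3 and the layer-2 feature map into RefineNet-4, and the output of RefineNet-4 and the layer-1 feature map into RefineNet-5. After this layer-by-layer fusion, a refined feature map that fuses all five layers is obtained.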
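The coarse-to-fine cascade described above can be sketched in a few lines of Python. This is a deliberately simplified illustration: each feature map is a single-channel list of rows, upsampling is nearest-neighbour, and the adaptive convolution and chained residual pooling of RefineNet are omitted. The helper names `upsample2x`, `add_maps`, and `cascade_fuse` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D map given as a list of rows."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise (pixel-level) sum of two maps of identical size."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same size")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cascade_fuse(maps_fine_to_coarse):
    """Fuse maps ordered from layer 1 (finest) to layer 5 (coarsest).

    Starts from the coarsest map and, at each step, upsamples the running
    result by 2 and adds the next finer map, mirroring the chain
    RefineNet-1 -> RefineNet-2 -> ... (assumes a factor of 2 between levels).
    """
    fused = maps_fine_to_coarse[-1]
    for finer in reversed(maps_fine_to_coarse[:-1]):
        fused = add_maps(upsample2x(fused), finer)
    return fused
```

With five inputs whose sizes halve from one level to the next, the result has the resolution of the layer-1 map and contains a contribution from every level.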

Fig. 3 The illustration of the multi-layer feature fusion using RefineNets

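1.3 Combining free-form foreground features and context features

When region features are extracted, including the context of a region can improve segmentation accuracy. To make better use of region context, both the free-form region features and the features of the square region, which contains the free-form region together with some background, are fed into the same fully connected layers for classification. Each candidate region extracted by the selective search algorithm of Uijlings et al. (2013) consists of three parts: a square bounding box, a foreground mask, and the foreground size. The foreground mask is a binary mask laid over the candidate region that marks the region foreground. Multiplying the square region features with the corresponding foreground mask on every channel gives the foreground features of the region. The accuracy of the candidate regions strongly affects the subsequent segmentation, which is a potential weakness of candidate-region-based algorithms. Combining the foreground and background features of a candidate region mitigates, to some extent, the effect of poor candidate regions on the segmentation result. The foreground feature map of the region is concatenated with the feature map of the square region that contains the background, and the concatenated feature map is classified by the fully connected layers to obtain the region category. The combination is illustrated in Fig. 4. The fully connected layers FC6, FC7, and FC8 act as the classifier of the convolutional neural network: they map the learned distributed features to the sample label space to obtain the image category, where A₁, A₂, ..., A_n denote the different categories. However, the parameters of fully connected layers are redundant and prone to overfitting. To address this, dropout is applied in the fully connected layers, following Srivastava et al. (2014). Dropout reduces redundancy by reducing the number of intermediate features, which increases the orthogonality between the features of each layer and improves the generalization of the model. A random subset of neurons is dropped, shown as dashed neurons in Fig. 4, so that rare anomalous data have less chance of being learned, which improves learning efficiency and performance.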
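The masking and concatenation steps can be sketched as follows. The sketch assumes that the features of the square region are stored as a list of channels, each an H x W list of rows, and that the mask is an H x W grid of 0/1 values; the function names are illustrative and not taken from the paper.

```python
def foreground_features(box_features, mask):
    """Multiply every channel of the square-region features by the binary mask.

    box_features: list of channels, each a list of rows (H x W).
    mask:         H x W grid with 1 for foreground and 0 for background.
    """
    return [
        [[value * m for value, m in zip(frow, mrow)]
         for frow, mrow in zip(channel, mask)]
        for channel in box_features
    ]


def combine_foreground_and_context(box_features, mask):
    """Concatenate foreground channels with the unmasked square-region channels.

    The result has twice as many channels as the input: the first half keeps
    only the free-form foreground, the second half keeps the context as well.
    """
    return foreground_features(box_features, mask) + box_features
```

Keeping the unmasked channels alongside the masked ones means that a poor mask removes information only from the foreground half of the descriptor.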

Fig. 4 The illustration of the combination of the free-form foreground feature and the context feature

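2 Experimental results

2.1 Dataset

To verify the effectiveness of the proposed method, experiments are carried out on the Siftflow dataset, which is commonly used for semantic segmentation. Siftflow contains 2 688 images covering 33 classes: awning, balcony, bird, boat, bridge, building, bus, car, cow, crosswalk, desert, door, fence, field, grass, moon, mountain, person, plant, pole, river, road, rock, sand, sea, sidewalk, sign, sky, staircase, streetlight, sun, tree, and window. The 2 688 images are split into 2 488 training images and 200 test images. The Siftflow dataset is available at http://www.cs.unc.edu/~jtighe/Papers/ECCV10/.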

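2.2 Experimental settings and performance metrics

The experiments were run on an Intel(R) Pentium(R) 2020M CPU @ 2.40 GHz with 32 GB of memory and a 4 GB graphics card, under Windows 7 with MATLAB 2016a as the simulation software. The network was trained for 10 epochs with a momentum of 0.9 and a dropout value of 0.5. The dropout setting follows Srivastava et al. (2014); cross-validation showed that dropout = 0.5 yields the largest number of randomly generated network structures and gives the best results. The minimum segment size is 100, the learning rate is 0.001, and the batch size is 10. Pixel accuracy and average accuracy are used to evaluate the results.

Pixel accuracy (PA) is the simplest pixel-level metric in semantic segmentation: the ratio of correctly classified pixels to the total number of pixels in the image. Assume there are $ n + 1$ classes, including the background; $ {p_{ii}}$ denotes the number of correctly classified pixels of class $ i$, and $ {p_{ij}}$ denotes the number of pixels that belong to class $ i$ but are assigned to class $ j$. Pixel accuracy is then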

$ {f_{{\rm{PA}}}} = \frac{{\sum\limits_{i = 0}^n {{p_{ii}}} }}{{\sum\limits_{i = 0}^n {\sum\limits_{j = 0}^n {{p_{ij}}} } }} $

(1)

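Average accuracy, that is, mean pixel accuracy (MPA), is the mean of the per-class pixel accuracies over all object classes in the image: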

$ {f_{{\rm{MPA}}}} = \frac{1}{{n + 1}}\sum\limits_{i = 0}^n {\frac{{{p_{ii}}}}{{\sum\limits_{j = 0}^n {{p_{ij}}} }}} $

(2)
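Equations (1) and (2) can be evaluated directly from a confusion matrix. The sketch below is a minimal Python illustration in which `conf[i][j]` plays the role of $ {p_{ij}}$; the function names are illustrative, and the code assumes that every class has at least one ground-truth pixel, since Eq. (2) is undefined otherwise.

```python
def pixel_accuracy(conf):
    """Eq. (1): correctly classified pixels divided by all pixels."""
    correct = sum(conf[i][i] for i in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total


def mean_pixel_accuracy(conf):
    """Eq. (2): mean over classes of per-class pixel accuracy.

    Row i holds the pixels whose true class is i, so sum(conf[i]) is the
    number of ground-truth pixels of class i (assumed to be non-zero).
    """
    per_class = [conf[i][i] / sum(conf[i]) for i in range(len(conf))]
    return sum(per_class) / len(per_class)
```

For a two-class matrix `[[90, 10], [5, 5]]`, pixel accuracy is 95/110 while mean pixel accuracy is (0.9 + 0.5)/2, which illustrates why a method can score well on one metric and poorly on the other when classes are imbalanced.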

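2.3 Analysis of experimental results

2.3.1 Segmentation performance comparison on the Siftflow dataset

Table 1 compares the proposed method with other semantic segmentation methods on the Siftflow dataset. The proposed method performs well. In average accuracy, it exceeds the multi-scale convolutional method with semantic labels of Eigen and Fergus (2015), the region-based method of Caesar et al. (2015), the region-based end-to-end method of Caesar et al. (2016), the global-perspective method of Hu et al. (2017), and the contour-aware network with adaptive depth of Jiang et al. (2018) by 7.4, 7.5, 0.6, 6.4, and 0.8 percentage points, respectively. In pixel accuracy, it exceeds the deep recursive neural network for pixel-level semantic scene labeling of Sharma et al. (2014), the multi-scale convolutional network of Farabet et al. (2013), which is trained from raw pixels to extract dense feature vectors, the method of Hu et al. (2017), and the network of Jiang et al. (2018) by 2.7, 3.8, 8.8, and 2.5 percentage points, respectively. However, it is 2.9 and 4.5 percentage points lower than the fully convolutional method of Shelhamer et al. (2014) and the method of Eigen and Fergus (2015), respectively. These results show that the proposed model, which fuses multi-layer information and context information, achieves good balanced performance and a higher average accuracy.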

Table 1 Comparisons of our method with other methods on Siftflow dataset

Method	Pixel accuracy/%	Average accuracy/%
Sharma et al. (2014)	79.6	48.0
Yang et al. (2014)	79.8	48.7
George (2015)	81.7	50.1
Farabet et al. (2013)	78.5	50.8
Shelhamer et al. (2014)	85.2	51.7
Sharma et al. (2015)	80.9	52.8
Caesar et al. (2015)	-	55.6
Eigen and Fergus (2015)	86.8	55.7
Caesar et al. (2016)	71.7	62.5
Hu et al. (2017)	73.5	56.7
Jiang et al. (2018)	79.8	62.3
Ours	82.3	63.1
Note: the best results are 86.8% pixel accuracy (Eigen and Fergus, 2015) and 63.1% average accuracy (ours); "-" indicates that the data is not available.
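The differences quoted in the text can be rechecked directly from Table 1. The short Python sketch below recomputes them in percentage points; the dictionary layout and the function name `gain_over` are illustrative choices, and only the rows that the text compares against are included.

```python
# (pixel accuracy, average accuracy) in percent, copied from Table 1.
TABLE1 = {
    "Sharma et al. (2014)": (79.6, 48.0),
    "Farabet et al. (2013)": (78.5, 50.8),
    "Shelhamer et al. (2014)": (85.2, 51.7),
    "Caesar et al. (2015)": (None, 55.6),  # pixel accuracy not reported
    "Eigen and Fergus (2015)": (86.8, 55.7),
    "Caesar et al. (2016)": (71.7, 62.5),
    "Hu et al. (2017)": (73.5, 56.7),
    "Jiang et al. (2018)": (79.8, 62.3),
}
OURS = (82.3, 63.1)


def gain_over(method):
    """Return (pixel, average) gains of the proposed method in percentage points."""
    pixel, average = TABLE1[method]
    pixel_gain = None if pixel is None else round(OURS[0] - pixel, 1)
    return pixel_gain, round(OURS[1] - average, 1)
```

A negative value means that the compared method is ahead on that metric, which is the case for pixel accuracy against Shelhamer et al. (2014) and Eigen and Fergus (2015).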

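Fig. 5 shows segmentation results of the proposed method on some Siftflow test images. The results are very close to the ground truth, and small objects are recognized accurately. For example, the tree in row 1, the plant in row 2, and the windows in rows 3 and 4 are not annotated in the ground truth, yet the proposed method identifies them. The building and sky in row 4 are almost identical to the ground truth, which shows that the proposed method segments well.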

Fig. 5 Image segmentation effects of our method on Siftflow dataset ((a) original image; (b) ground truth labeling; (c) ours)

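2.3.2 Effect of background information and fused features on segmentation

To analyze how combining background features and fusing multi-layer features affect segmentation, three configurations are compared: the method of Caesar et al. (2016), which neither combines background features nor fuses features; a variant that uses only the features of the last convolutional layer but combines free-form foreground features with background features; and the proposed method, which both fuses features and combines foreground information with context (background) information. The results are given in Table 2. In pixel accuracy, combining free-form foreground and background features is 1.8 percentage points higher than the method of Caesar et al. (2016), which does not use background information, and equal to the method of Hu et al. (2017). Combining foreground information, background information, and fused features is 10.6, 8.8, and 2.5 percentage points higher than the methods of Caesar et al. (2016), Hu et al. (2017), and Jiang et al. (2018), respectively. In average accuracy, combining free-form foreground and background features is 2.2 percentage points lower than Caesar et al. (2016) and 3.6 percentage points higher than Hu et al. (2017). Combining foreground information, background information, and fused features is 0.6, 6.4, and 0.8 percentage points higher than the methods of Caesar et al. (2016), Hu et al. (2017), and Jiang et al. (2018), respectively. The proposed fusion of five convolutional layers, together with the combination of foreground and background information, therefore yields a clear performance gain.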

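Table 2 Comparisons of different methods on Siftflow dataset

Method	Pixel accuracy/%	Average accuracy/%
Caesar et al. (2016)	71.7	62.5
Hu et al. (2017)	73.5	56.7
Jiang et al. (2018)	79.8	62.3
Foreground + background information	73.5	60.3
Foreground + background information + fused features	82.3	63.1
Note: the best results are 82.3% pixel accuracy and 63.1% average accuracy, both obtained with foreground information, background information, and fused features.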

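2.3.3 Effect of the dropout value on segmentation

Dropout can alleviate overfitting in deep neural networks to some extent. During training, neurons are randomly dropped from the network, which greatly reduces the number of parameters active in each training pass, increases the orthogonality between the features of each layer, and improves generalization. In this paper, dropout means that, during region classification in the segmentation training network, neurons are temporarily removed from the network with a given probability, which is equivalent to training different networks on mini-batches of data. Table 3 lists the pixel accuracy and average accuracy obtained with different dropout values. The model performs best when dropout is set to 0.5; values that are too small or too large give unsatisfactory results.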
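The mechanism can be illustrated with a minimal Python sketch of dropout applied to one layer's activations. Two points are assumptions rather than statements from the paper: the dropout value is interpreted as the probability of dropping a neuron, and the surviving activations are rescaled by 1/(1 - p) ("inverted dropout"), a common implementation choice that keeps the expected activation unchanged. The function name and the explicit `rng` argument are illustrative.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training (inverted dropout).

    activations: list of floats produced by a fully connected layer.
    drop_prob:   probability of dropping each neuron (assumed meaning of
                 the "dropout value"), with 0 <= drop_prob < 1.
    rng:         a random.Random instance, passed in for reproducibility.
    At test time (training=False) the activations pass through unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Each training pass draws a new random mask, so successive mini-batches are processed by different sub-networks that share the same weights.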

Table 3 Segmentation results of our method with different dropout values on Siftflow dataset

Dropout value	Pixel accuracy/%	Average accuracy/%
0.3	72.5	56.2
0.4	74.1	60.4
0.5	82.3	63.1
0.6	73.6	58.3
0.7	68.4	53.5
0.8	63.8	47.3
Note: the best results are obtained with a dropout value of 0.5.

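3 Conclusion

This paper proposes a semantic segmentation method that combines context features with the fusion of multi-layer CNN features. The method uses a region-based end-to-end model. When region features are extracted, free-form foreground features are combined with context features, and RefineNet fuses feature maps of different resolutions during multi-layer feature fusion. During region classification, dropout increases the orthogonality between the features of each layer and improves generalization. Tests on the Siftflow dataset show that the method effectively improves segmentation. Compared with region-based semantic segmentation methods, the proposed algorithm has the following advantages: 1) Combining free-form foreground features with context features captures region context more accurately and thus enables accurate segmentation. 2) When the fully connected layers classify regions, dropout randomly deactivates a subset of neurons, which greatly reduces the number of network parameters and avoids overfitting. 3) The information in every layer affects the final segmentation; RefineNet fuses the feature information extracted by different layers and makes full use of each layer.

The experimental results show that the proposed algorithm segments well, with a pixel accuracy of 82.3% and an average accuracy of 63.1%. Future work will study feature representations that better fuse spatial and semantic information.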

References

Badrinarayanan V, Kendall A, Cipolla R. 2015. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI:10.1109/TPAMI.2016.2644615]

Caesar H, Uijlings J and Ferrari V. 2015. Joint calibration for semantic segmentation//Proceedings of British Machine Vision Conference. Swansea, UK: BMVA Press [DOI: 10.5244/C.29.29]

Cao F M, Tian H J, Fu J, Liu J. 2019. Feature map slice for semantic segmentation. Journal of Image and Graphics, 24(3): 464-473 (曹峰梅, 田海杰, 付君, 刘静. 2019. 结合特征图切分的图像语义分割. 中国图象图形学报, 24(3): 464-473) [DOI:10.11834/jig.180402]

Caesar H, Uijlings J and Ferrari V. 2016. Region-based semantic segmentation with end-to-end training//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 381-397 [DOI: 10.1007/978-3-319-46448-0_23]

Carreira J, Sminchisescu C. 2012. CPMC: automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7): 1312-1328 [DOI:10.1109/TPAMI.2011.231]

Carreira J, Rui C, Batista J and Sminchisescu C. 2012. Semantic segmentation with second-order pooling//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer, 430-443 [DOI: 10.1007/978-3-642-33786-4_32]

Eigen D and Fergus R. 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2380-7504 [DOI: 10.1109/ICCV.2015.304]

Farabet C, Couprie C, Najman L, Lecun Y. 2013. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1915-1929 [DOI:10.1109/TPAMI.2012.231]

Felzenszwalb P F, Huttenlocher D P. 2004. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2): 167-181 [DOI:10.1023/b:visi.0000022288.19776.77]

George M. 2015. Image parsing with a wide range of classes and scene-level context//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 3622-3630 [DOI: 10.1109/CVPR.2015.7298985]

Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 1440-1448 [DOI: 10.1109/ICCV.2015.169]

Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 580-587 [DOI: 10.1109/CVPR.2014.81]

Hu H X, Deng Z W, Zhou G T, Sha F and Mori G. 2017. LabelBank: revisiting global perspectives for semantic segmentation[EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1703.09891.pdf

He K M, Zhang X Y, Ren S Q and Sun J. 2015. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 770-778 [DOI: 10.1109/CVPR.2016.90]

Jiang F, Gu Q, Hao H Z, Li N, Guo Y W, Chen D X. 2017. Survey on content-based image segmentation methods. Journal of Software, 28(1): 160-183 (姜枫, 顾庆, 郝慧珍, 李娜, 郭延文, 陈道蓄. 2017. 基于内容的图像分割方法综述. 软件学报, 28(1): 160-183) [DOI:10.13328/j.cnki.jos.005136]

Jiang Z Y, Yuan Y, Wang Q. 2018. Contour-aware network for semantic segmentation via adaptive depth. Neurocomputing, 284: 27-35 [DOI:10.1016/j.neucom.2018.01.022]

Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 1097-1105

Li H C, Xiong P F, An J and Wang L. 2018. Pyramid attention network for semantic segmentation[EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1805.10180.pdf

Lin G S, Milan A, Shen C H and Reid I. 2016. RefineNet: multi-path refinement networks for high-resolution semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 5168-5177 [DOI: 10.1109/CVPR.2017.549]

Ning Q Q, Zhu J K, Chen C. 2017. Very fast semantic image segmentation using hierarchical dilation and feature refining. Cognitive Computation, 10(1): 62-72 [DOI:10.1007/s12559-017-9530-0]

Ren S, He K, Girshick R, Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI:10.1109/TPAMI.2016.2577031]

Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 234-241 [DOI: 10.1007/978-3-319-24574-4_28]

Sharma A, Tuzel O and Jacobs D W. 2015. Deep hierarchical parsing for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 530-538 [DOI: 10.1109/CVPR.2015.7298651]

Sharma A, Tuzel O, Liu M Y. 2014. Recursive context propagation network for semantic scene labeling. Advances in Neural Information Processing Systems, 3: 2447-2455

Shelhamer E, Long J, Darrell T. 2014. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4): 640-651 [DOI:10.1109/TPAMI.2016.2572683]

Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2019-03-16]. https://arxiv.org/pdf/1409.1556.pdf

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929-1958

Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2014. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 1-9 [DOI: 10.1109/CVPR.2015.7298594]

Uijlings J R, van de Sande K E A, Gevers T, Smeulders A W. 2013. Selective search for object recognition. International Journal of Computer Vision, 104(2): 154-171 [DOI:10.1007/s11263-013-0620-5]

Xiao F, Rui T, Ren T W, Wang D. 2019. Full convolutional network for semantic segmentation and object detection. Journal of Image and Graphics, 24(3): 474-482 (肖锋, 芮挺, 任桐炜, 王东. 2019. 全卷积语义分割与物体检测网络. 中国图象图形学报, 24(3): 474-482) [DOI:10.11834/jig.180406]

Yang J M, Price B, Cohen S and Yang M H. 2014. Context driven scene parsing with attention to rare classes//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 3294-3301 [DOI: 10.1109/CVPR.2014.415]