
Published: 2019-06-16
DOI: 10.11834/jig.180568
2019 | Volume 24 | Number 6




Image Analysis and Recognition











Multi-image object semantic segmentation by fusing segmentation priors
Liao Xuan1,3, Miao Jun1,2, Chu Jun1, Zhang Guimei1
1. Key Laboratory of Nondestructive Testing, Nanchang Hangkong University, Nanchang 330063, China;
2. Key Laboratory of Lunar and Deep Space Exploration, National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China;
3. School of Aeronautical Manufacturing Engineering, Nanchang Hangkong University, Nanchang 330063, China
Supported by: National Natural Science Foundation of China (61661036, 61663031, 61462065)

Abstract

Objective Object segmentation from multiple images involves locating the positions and extents of common target objects in a scene, as presented in a sequential image set or in multi-view images. This process underpins various computer vision tasks, such as object detection and tracking, scene understanding, and 3D reconstruction. Early approaches treated object segmentation as histogram matching of color values and were applied only to pairs of images containing the same or similar objects. Object co-segmentation methods were introduced later. Most of these methods take the MRF model as the basic framework and establish a cost function consisting of the energy within each image and the energy between images, using features computed from the gray or color values of pixels. The cost function is minimized to obtain a consistent segmentation. However, when the foreground and background colors in these images are similar, co-segmentation cannot easily produce object segmentations with consistent regions. In recent years, with the development of deep learning, methods based on various deep learning models have been proposed. Some of them, such as the fully convolutional network, adopt convolutional neural networks to extract high-level semantic features and achieve end-to-end, pixel-level image classification. These algorithms obtain better precision than traditional methods. Compared with traditional methods, deep learning methods can learn appropriate features automatically for individual classes without manual selection and adjustment of features. Accurately segmenting even a single image requires combining spatial information at multiple levels. Multi-image segmentation therefore not only demands the fine-grained accuracy in local regions required for single-image segmentation but also requires balancing local and global information among multiple images. When ambiguous regions around the foreground and background are involved or when sufficient prior information about the objects is not given, most deep learning methods tend to produce erroneous and inconsistent segmentations of sequential image sets or multi-view images. Method In this study, we propose a multi-image segmentation method based on deep feature extraction. The method builds on the PSPNet-50 neural network model, in which a residual network is used to extract features in the first 50 layers of the network. The extracted features are fed into a pyramid pooling module composed of pooling layers with differently sized pooling filters, and the features of the different levels are then fused. After a convolutional layer and an up-convolutional layer, the initial end-to-end outputs are obtained. To make the model learn the detail features of multi-view images of complex scenes more thoroughly, we concatenate the output features of the first and fifth parts of the network. The PSPNet-50 network model is thus improved by integrating the high-resolution details of the shallow layers, which also reduces the effect on segmentation edge details of the spatial information lost as the network deepens. In the training phase, the improved network model is first pre-trained on the ADE20k dataset, so that the model, after training on a large amount of data, achieves strong robustness and generalization. Afterward, one or two prior segmentations of the object are obtained with an interactive segmentation approach.
This small amount of prior segmentation information is fused into the new model. The network is then re-trained to resolve the ambiguous segmentation between foreground and background and the inconsistent segmentation among multiple images. We analyze the relationship between the number of re-training iterations and the segmentation accuracy over a large number of experiments to determine the optimal number of iterations. Finally, by constructing a fully connected conditional random field, the recognition ability of the deep convolutional neural network and the accurate localization ability of the fully connected conditional random field are coupled; the object region is located effectively, and the object edges are detected clearly. Result We evaluate our method on multi-image sets from various public datasets showing outdoor buildings and indoor objects and compare our results with those of other deep learning methods, namely, fully convolutional networks (FCN) and the pyramid scene parsing network (PSPNet). Experiments on the "Valbonne" and "Box" multi-view sets show that our algorithm can accurately segment the object region for re-trained classes while effectively avoiding ambiguous region segmentation for untrained object classes. To evaluate our algorithm quantitatively, we compute the commonly used accuracy metrics, average pixel accuracy (PA) and intersection over union (IOU), and evaluate the segmentation accuracy of the object. Results show that our algorithm attains satisfactory scores not only on complex scene image sets with similar foreground and background contexts but also on simple image sets with obvious differences between the foreground and background. For example, on the "Valbonne" set, the PA and IOU values of our result are 0.9683 and 0.9469, respectively, whereas the values of FCN are 0.7027 and 0.6942 and those of PSPNet are 0.8509 and 0.8240; our method thus scores more than 20% higher than FCN and roughly 10% higher than PSPNet. On the "Box" set, our method achieves a PA of 0.9946 and an IOU of 0.9577, whereas FCN and PSPNet cannot find the real region of the object because the "Box" class is not contained in their training classes. Similar improvements are found on the other datasets; the average PA and IOU of our method exceed 0.95. Conclusion Experimental results demonstrate that our algorithm is robust in various scenes and achieves consistent segmentation across multi-view images. A small amount of prior information helps to predict the pixel-level object region accurately and makes the model distinguish object regions from the background effectively. The proposed approach consistently outperforms competing methods for object classes both contained in and absent from the training data.

Key words

multi-image; object segmentation; deep learning; convolutional neural networks (CNN); segmentation prior; conditional random field (CRF)

0 Introduction

Object segmentation from multiple images refers to selectively locating the position and extent of the common objects of interest in multiple images of the same scene taken from different viewpoints or of a continuous scene. It is an important foundation for applications such as object detection and tracking [1], scene understanding [2], and 3D reconstruction [3].

Early algorithms [4-7] segmented only a pair of images containing the same or similar objects. Rother et al. [4] first proposed a co-segmentation algorithm that treats segmentation as a color histogram matching problem. Many later co-segmentation algorithms [8-12] were extended to multiple images. Most of these algorithms take the MRF (Markov random field) model as the basic framework, use color and other features to build a cost function consisting of intra-image and inter-image energy terms, and finally minimize the cost function to obtain a consistent segmentation of the objects across images. However, when the foreground and background colors within an image are similar, co-segmentation algorithms can hardly achieve consistent object segmentation. Some algorithms [13-15] combine image color and texture with 3D information such as scene depth maps to link similar features across images and obtain better segmentation consistency, but accurate 3D information such as depth maps and camera parameters is usually not easy to obtain. Traditional supervised segmentation algorithms [16-17] generally take a local region centered on each pixel, use the features of the image patch (low-level features such as color and texture) as training samples for a classifier, and assign the classification result as the label of that pixel. The image patches used by such methods contain limited context, so the segmentation accuracy is limited. When the amount of labeled data is small, good segmentation is hard to obtain even when global information is incorporated [12]. Moreover, traditional methods usually rely on hand-crafted features, and adapting them to classes not seen during training often requires expert experience and time to adjust the features.

In recent years, image segmentation methods based on deep convolutional neural networks have developed considerably. Methods such as the fully convolutional network (FCN) [18] use convolutional neural networks to extract high-level semantic features of images and achieve end-to-end pixel-level classification, yielding better results than traditional methods. SegNet [19] addresses the low resolution of the segmentation maps produced by FCN by adding unpooling layers and adopting a symmetric encoder-decoder structure to obtain better results. PSPNet (pyramid scene parsing network) [20] targets the erroneous segmentations caused by FCN-like networks not introducing enough context or global information at different receptive fields, and proposes a pyramid scene pooling network that fuses global scene information at different levels. Reference [21] observes that although pooling enlarges the receptive field, it loses fine image details, and proposes a convolutional network based on dilated (atrous) convolution kernels. Reference [22] further optimizes [21] with a spatial pyramid pooling structure covering multiple receptive fields. Image segmentation requires spatial information at multiple scales. Multi-image segmentation in particular not only needs the fine-grained, local information required for accurate single-image segmentation but also requires balancing local and global information across images. Current deep learning algorithms often produce local segmentation errors when the foreground and background are ambiguous or when prior information about them is insufficient, which leads to erroneous and inconsistent object segmentation in sequence or multi-view images. Compared with traditional segmentation algorithms, however, deep learning algorithms can automatically learn suitable feature representations for the problem at hand without manual feature selection and tuning, are less limited by the inadequacy of low-level features, learn features intrinsic to the images, and, after pre-training, generalize better, which makes it feasible to optimize the model with only a small amount of labeled data.

This paper improves the PSPNet-50 [20] network model and applies deep features to multi-image segmentation, using the high-level semantic features learned by the deep network to improve the accuracy and consistency of multi-image segmentation. First, the PSPNet-50 model is improved by fusing the high-resolution detail features of the shallow layers, reducing the effect on segmentation edge details of the spatial information lost as the network deepens. Then, a small number of segmentation priors are fused into the new model, and the network is re-trained to resolve foreground/background ambiguity and to achieve consistent segmentation across multi-view images. Finally, a fully connected conditional random field model is constructed to further refine the edges of the segmented objects. Tests on several public datasets show that the proposed network model can further learn foreground and background features in complex scenes, reduces the ambiguous foreground/background segmentations that generic models produce for lack of priors, and achieves higher accuracy.

1 Improved PSPNet-50 model

The first 50 layers of PSPNet-50 use a 50-layer residual network [23] as the feature extractor. The extracted features pass through a pyramid pooling module composed of pooling layers with different kernel sizes, the pooled features of different levels are fused, and a convolutional layer and an upsampling layer then produce the end-to-end output. To make the model learn the detail features of multi-view images of complex scenes better, we first improve the network structure of PSPNet-50.

The improved network structure is shown in Fig. 1. We first upsample the features output by the fifth part of PSPNet-50 and then concatenate them with the features output by the first part. The concatenated output features are $ {\mathit{\boldsymbol{x}}_{{\rm{concat}}}} = {\mathit{\boldsymbol{x}}_{{\rm{c1}}}} \cup {\mathit{\boldsymbol{x}}_{{\rm{c5}}}}$, where $ {\mathit{\boldsymbol{x}}_{{\rm{c1}}}}$ denotes the output features of the first part and ${\mathit{\boldsymbol{x}}_{{\rm{c5}}}}$ the output features of the fifth part of the network. Fusing the high-resolution detail features of the shallow layers with the high-level semantic features makes full use of the shallow features and compensates for the spatial information lost as the resolution of the high-level feature maps decreases with network depth. Three convolution operations are then applied to the fused features to integrate them further into the final feature output. In the first convolution, ${\mathit{\boldsymbol{x}}_{{\rm{concat}}}}$ is mapped to 256 feature maps through the weights ${\mathit{\boldsymbol{W}}^1}$, where ${\mathit{\boldsymbol{W}}^1} = \left[ {\mathit{\boldsymbol{W}}_1^1;\mathit{\boldsymbol{W}}_2^1; \cdots ;\mathit{\boldsymbol{W}}_{256}^1} \right], \forall \mathit{\boldsymbol{W}}_i^1 \in {{\bf{R}}^{640 \times 3 \times 3}}$. Here, $\mathit{\boldsymbol{W}}_i^1$ denotes the parameters of each filter in the first convolution; its dimension is 640×3×3, where 640 is the number of filter channels and 3×3 is the filter size, and the sampling stride is 1. The 256 output feature maps $ \left\{ {\mathit{\boldsymbol{x}}_i^1} \right\}_{i = 1}^{256}$ are obtained by convolving ${\mathit{\boldsymbol{x}}_{{\rm{concat}}}}$ and applying the activation function. Each $\mathit{\boldsymbol{x}}_i^1$ is computed as

$ \mathit{\boldsymbol{x}}_i^1 = f\left( {\mathit{\boldsymbol{W}}_i^1*{\mathit{\boldsymbol{x}}_{{\rm{concat}}}}} \right) $ (1)

Fig. 1 Improved network structure

where "*" denotes 3D convolution. The rectified linear unit (ReLU) is used as the activation function

$ f(z)=\left\{\begin{array}{ll}{z} & {z>0} \\ {0} & {z \leqslant 0}\end{array}\right. $ (2)

The remaining two convolutions are similar to the first: each takes the output of the previous layer as input and applies convolution and the activation function, with filter sizes of 3×3 and 1×1, respectively.
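
To make the fusion step concrete, the following PyTorch-style sketch shows one way the concatenation and the three convolutions described above could be implemented. The module name FusionHead, the channel count of the second convolution, and the bilinear upsampling mode are illustrative assumptions; only the 640-channel concatenated input and the 256 filters of the first convolution come from the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Sketch of the shallow/deep feature fusion head of the improved PSPNet-50."""
    def __init__(self, c1_channels, c5_channels, num_classes=2):
        super().__init__()
        # Three convolutions applied to the concatenated features x_concat.
        # W^1: 640x3x3 filters when c1_channels + c5_channels = 640, stride 1.
        self.conv1 = nn.Conv2d(c1_channels + c5_channels, 256, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # assumed 256 output channels
        self.conv3 = nn.Conv2d(256, num_classes, kernel_size=1)      # 1x1 conv to per-class scores

    def forward(self, x_c1, x_c5):
        # Upsample the deep (part-5) features to the resolution of the shallow (part-1) features.
        x_c5 = F.interpolate(x_c5, size=x_c1.shape[2:], mode='bilinear', align_corners=False)
        x_concat = torch.cat([x_c1, x_c5], dim=1)     # x_concat = x_c1 ∪ x_c5
        x = F.relu(self.conv1(x_concat))              # Eqs. (1)-(2)
        x = F.relu(self.conv2(x))
        return self.conv3(x)                          # per-pixel class scores (before softmax)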

2 Training the network model with fused segmentation priors

The improved network model is first pre-trained on the ADE20k scene parsing dataset [24]; after pre-training on a large amount of data, the model has strong robustness and generalization ability. Then, for the multi-image set to be segmented, the algorithm in [25] is used to obtain an accurate segmentation of one or two images in the set as the prior, and their foreground and background are each labeled as one class in the training data. The pre-trained network model is then fine-tuned with the segmentation priors to obtain the object segmentation model for the multiple images. Because the pyramid pooling structure of the model is good at extracting multi-scale context, the model quickly learns the high-dimensional features of the foreground and background and produces effective segmentation predictions.

2.1 Forward and backward propagation

During forward propagation, the average of the output errors over all pixels of the sample image is computed as the training error, and the network weights are updated by minimizing this training error/loss. The training loss is computed as

$ Loss = - \frac{1}{n}\sum\limits_i {\ln } [p(x = k)], i = 0, 1, \cdots , n - 1 $ (3)

$ p(x = k) = \frac{{\exp \left( {{z_k}} \right)}}{{\sum\limits_j {\exp } \left( {{z_j}} \right)}}, j = 0, 1, \cdots , K - 1 $ (4)

where $Loss$ is the network training loss, $ p(x = k)$ is the probability that pixel $x$ belongs to class $k$, $n$ is the number of pixels in the current training image, ${z_j}$ is the score of the $j$-th class, and $K$ is the total number of classes. There are two classes in this paper, so $K = 2$.
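
The loss of Eqs. (3) and (4) can be written compactly; the NumPy sketch below is a minimal illustration, assuming the network scores have already been flattened to an n × K array (the helper name pixel_loss is hypothetical).

import numpy as np

def pixel_loss(scores, labels):
    """scores: (n, K) per-pixel class scores z; labels: (n,) ground-truth class indices k."""
    z = scores - scores.max(axis=1, keepdims=True)        # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # Eq. (4): softmax probability p(x = k)
    n = labels.shape[0]
    # Eq. (3): average negative log-probability of the correct class over all n pixels
    return -np.mean(np.log(p[np.arange(n), labels]))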

In the backward propagation stage, where the network weights are updated, stochastic gradient descent (SGD) [26] is used; the weights are updated by a linear combination of the negative gradient $\nabla L\left( {{\mathit{\boldsymbol{W}}_t}} \right)$ and the previous weight update, computed as

${\mathit{\boldsymbol{V}}_{t + 1}} = \mu {\mathit{\boldsymbol{V}}_t} - \alpha \nabla L\left( {{\mathit{\boldsymbol{W}}_t}} \right) $ (5)

$ {\mathit{\boldsymbol{W}}_{t + 1}} = {\mathit{\boldsymbol{W}}_t} + {\mathit{\boldsymbol{V}}_{t + 1}} $ (6)

where ${\mathit{\boldsymbol{W}}_t}$ is the weight matrix at the $t$-th iteration, ${\mathit{\boldsymbol{V}}_t}$ is the weight update at the $t$-th iteration, $\alpha$ is the base learning rate applied to the negative gradient, and $\mu$ is the weight of the update ${\mathit{\boldsymbol{V}}_t}$, which weights the influence of the previous gradient direction on the current descent direction. In this paper, $\mu = 0.9$.
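
A minimal sketch of the update rule in Eqs. (5) and (6), assuming the weights, update term, and gradient are plain NumPy arrays; the function name sgd_momentum_step is hypothetical.

import numpy as np

def sgd_momentum_step(W, V, grad, lr=1e-4, mu=0.9):
    """One update of Eqs. (5)-(6): V_{t+1} = mu*V_t - lr*grad(W_t), W_{t+1} = W_t + V_{t+1}."""
    V_new = mu * V - lr * grad
    return W + V_new, V_new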

During the iterations, the base learning rate is adjusted to speed up model convergence:

$ LR = base\_lr\cdot{\left( {1 - \frac{{iter}}{{max\_iter}}} \right)^{power}} $ (7)

where $LR$ is the actual learning rate, $base\_lr$ is the base learning rate, ${max\_iter}$ is the maximum number of iterations, ${iter}$ is the current iteration, and ${power}$ is the learning rate parameter. In this paper, $base\_lr = 0.0001$ and $power = 0.9$.
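
The schedule of Eq. (7) is straightforward to verify numerically; the sketch below assumes the base_lr and power values above and the maximum of 90 iterations chosen in Section 2.2.

def poly_lr(iteration, base_lr=1e-4, max_iter=90, power=0.9):
    """Eq. (7): polynomially decayed learning rate used during fine-tuning."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Example values: poly_lr(0) = 1e-4, poly_lr(45) ≈ 5.4e-5, poly_lr(89) ≈ 1.7e-6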

2.2 Determining the maximum number of iterations

A suitable number of iterations improves training efficiency and, to some extent, prevents underfitting and overfitting. Experiments were run on several multi-image sets. Fig. 2 shows the relationship between the number of training iterations and the average segmentation accuracy of the model at test time. As seen in Fig. 2, the accuracy remains essentially unchanged once the number of iterations reaches 40; to ensure high accuracy, the maximum number of iterations is finally set to 90.

Fig. 2 Relationship between the number of iterations and accuracy

3 Fully connected CRF model for multi-view images

The fully connected conditional random field used here is a probabilistic graphical model for pixel-wise image classification [27]. The model uses the following energy function

$ E(\mathit{\boldsymbol{x}}) = \sum\limits_i {{\theta _i}} \left( {{x_i}} \right) + \sum\limits_{ij} {{\theta _{ij}}} \left( {{x_i}, {x_j}} \right) $ (8)

where $ E(\mathit{\boldsymbol{x}})$ is the total energy/cost over all pixels and $ {{x_i}}$ is the label/class assigned to pixel $i$.

The unary potential is $ \theta_{i}\left(x_{i}\right)=-\ln p\left(x_{i}\right)$, where $ p\left( {{x_i}} \right)$ is the label probability of pixel $i$. In this paper, the network outputs the probability of each pixel belonging to each class, so the network output is used as the input of the unary potential.

The pairwise potential $ \theta_{i j}\left(x_{i}, x_{j}\right)$ is

$ {\theta _{ij}}\left( {{x_i}, {x_j}} \right) = \mu \left( {{x_i}, {x_j}} \right)\sum\limits_{m = 1}^k {{\omega _m}} \cdot {k^m}\left( {{\mathit{\boldsymbol{f}}_i}, {\mathit{\boldsymbol{f}}_j}} \right) $ (9)

where $\mu \left( {{x_i}, {x_j}} \right) = \begin{cases} 1 & {x_i} \ne {x_j} \\ 0 & {x_i} = {x_j} \end{cases}$, ${\omega _m}$ is a weight, and ${k^m}\left( {{\mathit{\boldsymbol{f}}_i}, {\mathit{\boldsymbol{f}}_j}} \right)$ is a Gaussian kernel given by

$ \begin{array}{l} {k^m}\left( {{\mathit{\boldsymbol{f}}_i}, {\mathit{\boldsymbol{f}}_j}} \right) = {\omega _1}\exp \left( { - \frac{{\parallel {p_i} - {p_j}{\parallel ^2}}}{{2\sigma _\alpha ^2}} - \frac{{\parallel {I_i} - {I_j}{\parallel ^2}}}{{2\sigma _\beta ^2}}} \right) + \\ {\omega _2}\exp \left( { - \frac{{\parallel {p_i} - {p_j}{\parallel ^2}}}{{2\sigma _\gamma ^2}}} \right) \end{array} $ (10)

where $ p$ denotes the pixel position, $I$ is the pixel color value, $ \omega_{1}$ and $ \omega_{2}$ are weights, and the parameters ${\sigma _\alpha }$, ${\sigma _\beta }$, and ${\sigma _\gamma }$ control the scales of the Gaussian kernels.

The unary potential ${\theta _i}\left( {{x_i}} \right)$ is the cost of assigning pixel $i$ to a particular class. The pairwise potential ${\theta _{ij}}\left( {{x_i}, {x_j}} \right)$ expresses that the closer two pixels are or the more similar their colors, the lower the cost of assigning them to the same class; it describes the relationship between each pixel and all other pixels. The energy function is minimized iteratively to obtain the final segmentation of the multiple images.
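
As a rough illustration of this post-processing step, the sketch below assumes the pydensecrf Python wrapper of the efficient inference method of [27]; the kernel and compatibility parameters (sxy, srgb, compat) and the number of inference steps are illustrative choices, not necessarily the values used in this paper.

import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_with_dense_crf(image, probs, n_iters=5):
    """image: (H, W, 3) uint8 input; probs: (K, H, W) per-class softmax output of the network."""
    H, W = image.shape[:2]
    K = probs.shape[0]
    d = dcrf.DenseCRF2D(W, H, K)
    d.setUnaryEnergy(unary_from_softmax(probs))          # theta_i(x_i) = -ln p(x_i), Eq. (8)
    # Smoothness kernel: nearby pixels tend to share a label (second term of Eq. (10)).
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: nearby pixels with similar color share a label (first term of Eq. (10)).
    d.addPairwiseBilateral(sxy=80, srgb=13, rgbim=np.ascontiguousarray(image), compat=10)
    Q = d.inference(n_iters)                             # approximate minimization of Eq. (8)
    return np.argmax(np.array(Q), axis=0).reshape(H, W)  # per-pixel labels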

4 Experimental results and analysis

Image sets from the Multi-view datasets [28], the Multi-View Stereo Datasets [29], and the Cornell Multi-View Datasets [30] were used for testing, and this section reports results on some representative sets, which include both outdoor buildings and common indoor objects. To verify the effectiveness of the algorithm, it is compared with the co-segmentation algorithm of [12] and the deep learning algorithms FCN [18] and PSPNet [20]. For ease of comparison, the object regions were segmented interactively with the algorithm of [25] and used as the ground truth.

Fig. 3 shows the results on "Valbonne" from the Multi-view dataset [28]. This set contains 15 images, and the church is the segmentation target. Fig. 3(c) shows the result of [12]; because the foreground and background colors are similar, this algorithm makes many segmentation errors and the segmented object is far from complete. Figs. 3(d) and (e) show the results of FCN [18] and PSPNet [20], whose training data already include a "building" class; the deep features learned from large amounts of data help with such cases, but deep learning segmentation algorithms still make errors when the foreground and background are ambiguous or the priors are insufficient. For example, in the yellow box in Fig. 3(a), the models lack priors and are insensitive to the foreground and background of these images, so part of the ground is mistakenly segmented as the target; in the red box in Fig. 3(a), the house on the right and the intended target are both buildings, which causes ambiguity, and although the house is not the target to be segmented, deep semantic segmentation algorithms mistakenly segment it as the same object. The leftmost image in Fig. 3(f) is our segmentation prior, and the rest are our results. By fusing the segmentation of one image as a prior, our algorithm strengthens the model's ability to classify foreground and background, segments the target completely in the remaining images, and preserves edge details.

Fig. 3 Partial segmentation results on the "Valbonne" dataset
((a) original images; (b) ground truth; (c) results of reference [12]; (d) results of FCN; (e) results of PSPNet; (f) segmentation prior and our results)

In Fig. 4, the segmentation target is a cardboard box. The method of [12] makes many errors, while FCN [18] and PSPNet [20] are limited by their segmentation classes: their training data contain no "Box" class, so they cannot segment the target with the correct semantics and yield only invalid results. Our algorithm segments the target well. Although the pre-trained model has never been trained on the "Box" class, the improved PSPNet-50 network is first pre-trained on the ADE20k scene parsing dataset, after which it already has strong robustness and generalization ability. Even so, for object classes it has not been pre-trained on, this model still makes segmentation errors in regions where the background and foreground textures or colors are very similar, and the problem cannot be avoided even when the segmentation targets in the multi-view images are similar. Therefore, after a small amount of prior information is added, the parameters of the pre-trained model are optimized for the scene of the prior target, and good results are obtained even for classes that were not pre-trained.

Fig. 4 Partial segmentation results on the "Box" dataset
((a) original images; (b) ground truth; (c) results of reference [12]; (d) results of FCN; (e) results of PSPNet; (f) segmentation prior and our results)

To analyze the results quantitatively, two commonly used segmentation accuracy metrics, pixel accuracy (PA) and intersection over union (IOU), are used to evaluate the segmentation accuracy of the target.

PA is the ratio of the number of correctly classified pixels to the total number of predicted pixels

$ {f_{{\rm{PA}}}} = \frac{{TP}}{{TP + FP}} $ (11)

where ${TP}$ is the number of correctly classified pixels and ${FP}$ is the number of misclassified pixels.

IOU is a standard evaluation metric for segmentation; it is the ratio of the intersection to the union of two sets. Here it is computed between the ground-truth segmentation and the segmentation predicted by the system

$ {f_{{\rm{IOU}}}} = \frac{{GT \cap R}}{{GT \cup R}} $ (12)

where ${GT}$ denotes the pixels of the ground-truth region and $R$ denotes the pixels of the segmentation region predicted by the system.
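
A minimal sketch of both metrics for the two-class (foreground/background) case used here; the boolean-mask input format and the function name pa_and_iou are assumptions for illustration.

import numpy as np

def pa_and_iou(pred, gt):
    """pred, gt: boolean (H, W) foreground masks. Returns (PA, IOU) of Eqs. (11)-(12)."""
    pa = np.mean(pred == gt)                          # correctly classified pixels / all pixels, Eq. (11)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union > 0 else 1.0         # |GT ∩ R| / |GT ∪ R|, Eq. (12)
    return pa, iou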

Fig. 5 shows some results of our algorithm on the test image sets, and Table 1 compares the segmentation accuracy of these results with the algorithm of [12], FCN [18], and PSPNet [20]. Whether on simpler sets with a clear difference between foreground and background, such as "Couch", or on more complex sets with similar foreground and background, such as "Valbonne", our algorithm obtains higher PA and IOU values than the other algorithms. Over all image sets, our algorithm is on average 0.1085 higher in PA and 0.1377 higher in IOU than the traditional co-segmentation algorithm [12], 0.2334 higher in PA and 0.2576 higher in IOU than FCN [18], and 0.1502 higher in PA and 0.1711 higher in IOU than PSPNet-50 [20]. Its average PA reaches 0.9858 and its average IOU is 0.9589. The experiments show that our algorithm is robust in various scenes and segments consistently within the same scene; fusing a small amount of prior information lets the model distinguish the target from the background more effectively, predicting target pixels accurately while misclassifying few pixels.

Fig. 5 Partial segmentation results of our algorithm on other datasets
((a) Coffee shack; (b) Building #1; (c) Building #2; (d) Building #3; (e) Clock tower; (f) Library; (g) Bear; (h) Couch)

Table 1 Comparison of accuracy of different segmentation methods

Image set       No. of images   Ref. [12]          FCN-8s             PSPNet-50          Ours
                                PA      IOU        PA      IOU        PA      IOU        PA      IOU
Valbonne        13              0.8322  0.7292     0.7027  0.6942     0.8509  0.8240     0.9683  0.9469
Box             14              0.8804  0.6858     -       -          -       -          0.9946  0.9577
Coffee shack    8               0.8668  0.8490     0.5835  0.5766     0.7163  0.7053     0.9824  0.9641
Building #1     8               0.9638  0.9408     0.9693  0.9583     0.9738  0.9689     0.9917  0.9830
Building #2     6               0.8886  0.7867     0.7735  0.7492     0.8434  0.8054     0.9882  0.9387
Building #3     6               0.9230  0.8829     0.9655  0.8733     0.9303  0.9170     0.9810  0.9576
Clock tower     10              0.8202  0.7963     0.6713  0.5572     0.9492  0.9002     0.9751  0.9546
Library         9               0.8568  0.8283     0.3783  0.3224     0.4532  0.4487     0.9908  0.9522
Bear            9               0.7652  0.7163     -       -          -       -          0.9842  0.9495
Couch           9               0.9353  0.9127     0.9308  0.8730     0.9816  0.7652     0.9854  0.9735
Average                         0.8773  0.8212     0.7524  0.7013     0.8356  0.7878     0.9858  0.9589

Note: the best result in each row is that of our method; "-" indicates that the FCN and PSPNet models produced no valid segmentation results on the "Box" and "Bear" image sets, so no accuracy values are given.

5 Conclusion

This paper proposes a multi-view image segmentation algorithm based on deep learning. First, foreground/background segmentation priors are obtained for one or two of the multi-view images; then the pre-trained improved network is fine-tuned with these priors; the fine-tuned model then makes pixel-wise predictions on the multi-view images; finally, a fully connected conditional random field model is built from the feature output of the deep model and the information of the original images, and minimizing its energy function yields automatic foreground segmentation of the remaining images. Experimental results show that the algorithm not only achieves high segmentation accuracy but also generalizes better. It improves considerably over traditional algorithms. Traditional methods mostly rely on low-level features such as color for modeling; because of the ambiguity between foreground and background regions, it is often difficult to distinguish them directly from color information. Although [12] uses global-local information, it still models with color features, so its performance is not guaranteed. Deep convolutional features have been shown to outperform traditional features on visual tasks such as image classification, object detection, and semantic segmentation, and deep learning algorithms can segment various kinds of objects better. However, in scenes where the target and background are easily confused, directly applying existing network models to multiple images still produces ambiguity: not only is the target segmented, but regions of the same class as the target are also segmented as the target. We therefore improved the PSPNet-50 model to capture detail information better and, by adding segmentation priors, eliminated the ambiguous segmentations of the original model, so that its segmentation results surpass existing deep learning models and traditional machine learning algorithms. Nevertheless, the algorithm still needs further improvement; for example, it requires fairly accurate segmentation priors, which increases the workload for some complex images. Future work will proceed in two directions: 1) re-training the network model with even fewer interactive segmentation priors to achieve high-accuracy multi-image object segmentation; and 2) designing a deep network model with multiple inputs and outputs that learns the semantic correlation among multiple images, ultimately achieving high-accuracy automatic co-segmentation of the same or similar objects across images.

References

  • [1] Li F X, Kim T, Humayun A, et al. Video segmentation by tracking many figure-ground segments[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013: 2192-2199.[Doi: 10.1109/ICCV.2013.273]
  • [2] Zhang G M, Chen B B, Xu K, et al. New CV model combining fractional differential and image local information[J]. Journal of Image and Graphics, 2018, 23(8): 1131–1143. [张桂梅, 陈兵兵, 徐可, 等. 结合分数阶微分和图像局部信息的CV模型[J]. 中国图象图形学报, 2018, 23(8): 1131–1143. ] [DOI:10.11834/jig.170580]
  • [3] Martinović A, Knopp J, Riemenschneider H, et al. 3D all the way: semantic segmentation of urban scenes from start to end in 3D[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 4456-4465.[Doi: 10.1109/CVPR.2015.7299075]
  • [4] Rother C, Minka T, Blake A, et al. Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs[C]//Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, NY, USA: IEEE, 2006: 993-1000.[Doi: 10.1109/CVPR.2006.91]
  • [5] Mukherjee L, Singh V, Dyer C R. Half-integrality based algorithms for cosegmentation of images[C]//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009: 2028-2035.[Doi: 10.1109/CVPR.2009.5206652]
  • [6] Hochbaum D S, Singh V. An efficient algorithm for co-segmentation[C]//Proceedings of 2009 IEEE International Conference on Computer Vision. Kyoto, Japan: IEEE, 2009: 269-276.[Doi: 10.1109/ICCV.2009.5459261]
  • [7] Vicente S, Kolmogorov V, Rother C. Cosegmentation revisited: models and optimization[C]//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Crete, Greece: Springer, 2010: 465-479.[Doi: 10.1007/978-3-642-15552-9_34]
  • [8] Joulin A, Bach F, Ponce J. Discriminative clustering for image co-segmentation[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 1943-1950.[Doi: 10.1109/CVPR.2010.5539868]
  • [9] Vicente S, Rother C, Kolmogorov V. Object cosegmentation[C]//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA: IEEE, 2011: 2217-2224.[Doi: 10.1109/CVPR.2011.5995530]
  • [10] Rubio J C, Serrat J, López A, et al. Unsupervised co-segmentation through region matching[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012: 749-756.[Doi: 10.1109/CVPR.2012.6247745]
  • [11] Collins M D, Xu J, Grady L, et al. Random walks based multi-image segmentation: quasiconvexity results and GPU-based solutions[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012: 1656-1663.[Doi: 10.1109/CVPR.2012.6247859]
  • [12] Dong X P, Shen J B, Shao L, et al. Interactive cosegmentation using global and local energy optimization[J]. IEEE Transactions on Image Processing, 2015, 24(11): 3966–3977. [DOI:10.1109/TIP.2015.2456636]
  • [13] Zhu Y F, Zhang Y J. Transductive co-segmentation of multi-view images[J]. Journal of Electronics & Information Technology, 2011, 33(4): 763–768. [朱云峰, 章毓晋. 直推式多视图协同分割[J]. 电子与信息学报, 2011, 33(4): 763–768. ] [DOI:10.3724/SP.J.1146.2010.00839]
  • [14] Djelouah A, Franco J S, Boyer E, et al. Multi-view object segmentation in space and time[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013: 2640-2647.[Doi: 10.1109/ICCV.2013.328]
  • [15] Nguyen T N A, Cai J F, Zheng J M, et al. Interactive object segmentation from multi-view images[J]. Journal of Visual Communication and Image Representation, 2013, 24(4): 477–485. [DOI:10.1016/j.jvcir.2013.02.012]
  • [16] Shotton J, Johnson M, Cipolla R. Semantic texton forests for image categorization and segmentation[C]//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE, 2008: 1-8.[Doi: 10.1109/CVPR.2008.4587503]
  • [17] Shotton J, Fitzgibbon A, Cook M, et al. Real-time human pose recognition in parts from single depth images[C]//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA: IEEE, 2011: 1297-1304.[Doi: 10.1109/CVPR.2011.5995316]
  • [18] Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640–651. [DOI:10.1109/TPAMI.2016.2572683]
  • [19] Badrinarayanan V, Handa A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling[J]. eprint arXiv: 1505.07293, 2015.
  • [20] Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA: IEEE, 2017: 6230-6239.[Doi: 10.1109/CVPR.2017.660]
  • [21] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[J]. eprint arXiv: 1412.7062, 2014.
  • [22] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab:semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848. [DOI:10.1109/TPAMI.2017.2699184]
  • [23] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.[Doi: 10.1109/CVPR.2016.90]
  • [24] Zhou B L, Zhao H, Puig X, et al. Semantic understanding of scenes through the ADE20K dataset[J]. arXiv: 1608.05442, 2016.
  • [25] Gulshan V, Rother C, Criminisi A, et al. Geodesic star convexity for interactive image segmentation[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 3129-3136.[Doi: 10.1109/CVPR.2010.5540073]
  • [26] Lecun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. [DOI:10.1109/5.726791]
  • [27] Krähenbühl P, Koltun V. Efficient inference in fully connected CRFs with Gaussian edge potentials[C]//Proceedings of the International Conference on Neural Information Processing Systems. Granada, Spain: Curran Associates Inc, 2011: 109-117.
  • [28] Visual Geometry Group. Multi-view and Oxford Colleges building reconstruction[EB/OL].[2018-07-05]. http://www.robots.ox.ac.uk/vgg/data/data-mview.html
  • [29] Kim H, Xiao H, Max N. Piecewise planar scene reconstruction and optimization for multi-view stereo[C]//Proceedings of the 11th Asian Conference on Computer Vision. Daejeon, Korea: Springer, 2012: 191-204.[Doi: 10.1007/978-3-642-37447-0_15]
  • [30] Kowdle A, Sinha S N, Szeliski R. Multiple view object cosegmentation using appearance and stereo cues[C]//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer, 2012: 789-803.[Doi: 10.1007/978-3-642-33715-4_57]