Published: 2019-01-16
DOI: 10.11834/jig.180364
2019 | Volume 24 | Number 1

Image Understanding and Computer Vision

Indoor scene segmentation based on fully convolutional neural networks

Huang Long¹, Yang Yuan¹, Wang Qingjun², Guo Fei¹, Gao Yong¹

1. School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048, China;

2. CRRC Xi'an Yongdian Electric Co., Ltd., Xi'an 710018, China

Received: 2018-06-05; Revised: 2018-08-10

Supported by: National Natural Science Foundation of China (61102017)

First author: Huang Long, born in 1995, male, master's student. His research interests include integrated circuits, deep-learning-based image processing, and face recognition. E-mail: hl13700200646@163.com;
Wang Qingjun, male, Ph.D. His research interest is circuit system design. E-mail: qingjun.wang@qq.com;
Guo Fei, male, Ph.D. candidate. His research interests include image processing and integrated circuit design. E-mail: guofei601@126.com;
Gao Yong, male, professor and doctoral supervisor. His research interests include power electronic devices and power integration, and novel semiconductor devices. E-mail: gaoy@xaut.edu.cn.

CLC number: TP391.7

Document code: A

Article ID: 1006-8961(2019)01-0064-09

Abstract

Objective Vision is one of the most important channels through which humans obtain information. A visual prosthesis stimulates the optic nerve through electrodes implanted in a blind person's body so that the person perceives phosphenes. Because the perceived image has low resolution and poor linearity, the person senses only the rough outline of an object and recognizes objects poorly. To address the characteristics of indoor scenes in visual prosthesis applications, this paper proposes a fast convolutional neural network method that segments indoor scene images before electrode stimulation and displays the approximate position and outline of each object to assist recognition. Method To meet the real-time requirement of visual prosthesis image processing, a fast fully convolutional network (FFCN) was built by modifying the AlexNet classification network. AlexNet reduced the top-5 error rate on the ImageNet dataset to 16.4%, compared with 26.2% for the runner-up. It uses convolution layers to extract deep feature information, adds overlapping pooling layers to reduce the number of parameters to be learned, and adopts the ReLU activation function to address the vanishing gradients that the sigmoid function causes in deeper networks. Compared with other networks, it is lightweight and fast to train. The FFCN consists of five convolution layers and one deconvolution layer, and uses inter-layer scale fusion to avoid the loss of image feature information caused by consecutive convolutions. To verify the effectiveness of the network, a dataset of basic items that blind people encounter in an indoor environment (the XAUT dataset) was created. It contains 664 images in nine categories, such as beds, seats, lamps, televisions, cupboards, cups, and people. The category of each item was marked with a gray level on the original image, and a color table was attached to map the grayscale image to a pseudo-color image that serves as the semantic label. The FFCN was trained on the XAUT dataset under the Caffe (convolutional architecture for fast feature embedding) framework to obtain an indoor scene segmentation model suited to visual prostheses for the blind. For comparison, the conventional multi-scale fusion models FCN-8s, FCN-16s, FCN-32s, and FCN-8s at-once were structurally fine-tuned and trained on the same dataset to obtain the corresponding indoor scene segmentation models. Results The comparative experiments were run on an Amax server with Ubuntu 16.04. Training took 13 h, a model snapshot was saved every 4 000 iterations, and the model was tested at 4 000, 12 000, 36 000, and 80 000 iterations. The pixel accuracy of every network exceeded 85%, and the mean intersection over union (MIU) exceeded 60%. The FCN-8s at-once network had the highest MIU (70.14%), but its segmentation speed was only about one-fifth that of the FFCN. With the other indicators differing only slightly, the FFCN reached an average segmentation speed of 40 frame/s. Conclusion The FFCN uses multi-layer convolution to extract image information effectively and avoids the influence of low-level information such as brightness, color, and texture. Scale fusion limits the loss of image feature information during convolution and pooling. Compared with the other FCN networks, the FFCN is faster, which helps improve the real-time performance of image preprocessing.

Key words

indoor scene; visual prosthesis; semantic segmentation; convolutional neural network; deep learning

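0 Introduction

Vision is an important means by which humans obtain information and underpins quality of life. According to incomplete statistics, the number of blind people worldwide is approaching 50 million, of whom about 6 million are in China. A visual prosthesis^[1] helps blind people regain phosphene perception through electrodes implanted in the body, giving them a rough light-based sense of their surroundings. A visual prosthesis system consists of an external part and an implanted part. The external image acquisition and processing system applies edge extraction, adaptive binarization, and image correction to the captured image, then reduces the pixel count and transmits the encoded image contour to the implanted micro-current stimulator, which stimulates the user through electrodes to produce phosphenes. Because the number of implanted electrodes is limited, users recognize objects poorly. This paper therefore proposes an image-segmentation-based method in the external image processing stage to improve recognition^[2].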

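Image segmentation^[3-4] is an effective way to extract regions of interest from an image using features such as brightness, gray level, and texture. Many segmentation methods have been proposed. The NCut algorithm proposed by Shi et al.^[5] in 2000 solves a cost function based on image contour and texture features to partition the image into sub-regions. Boykov et al.^[6] proposed the Graph Cuts algorithm in 2006, which uses the relationship between vertices and edges in graph theory, that is, pixel gray-level information and region boundary information, to segment the image. Fukunaga et al.^[7] proposed the pixel-clustering-based Meanshift method, which clusters pixels into superpixels using pixel information such as spatial distance, color, brightness, and texture, and converges to form segmented regions. Achanta et al.^[8] proposed the SLIC (simple linear iterative clustering) segmentation algorithm in 2010, which applies the Euclidean distance of the K-Means^[9] algorithm to superpixel clustering; the number of clusters K must be initialized manually. Among Chinese researchers, Liu et al.^[10] and Li et al.^[11] proposed weakly-supervised dual clustering (WSDC) for image semantic segmentation, which uses spectral clustering to obtain superpixels from a set of over-segmented images and treats the linear transformation between features and labels as a discriminative clustering in order to learn features that discriminate between categories. These methods mainly rely on low-level image information. In complex scenes, low-level features are strongly affected by the environment; pixel-clustering methods, for example, often segment poorly when objects have complex structure or internal variation.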

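To address the shortcomings of traditional segmentation methods, this paper proposes a deep-learning-based image segmentation method. In the visual prosthesis project, the improved FFCN convolutional neural network extracts high-level image features and classifies every pixel, that is, determines the category to which it belongs, thereby achieving pixel-level segmentation. Visual perception processing applied after the objects are segmented lets blind users obtain the approximate region and category of each object. The proposed segmentation network is suited to segmenting targets of interest in complex scenes, and it avoids the poor adaptability across scenes and the difficulty with multiple targets that affect traditional segmentation methods.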

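1 Algorithm theory

1.1 CNN (convolutional neural network) and FCN (fully convolutional network)

The origins of the convolutional neural network (CNN)^[12-13] date to the 1960s, when scientists studying cells in the cat visual cortex proposed the concept of the receptive field: each visual neuron processes only a small region of the visual image. In the 1980s, Japanese researchers proposed the neocognitron, the earliest model of the convolutional neural network, which includes operations such as convolution, activation functions, and max pooling. In 2012, Hinton et al.^[14] applied a convolutional neural network to ImageNet, the largest database in image recognition, and reduced the error rate to 17%, which spurred the rapid development of neural networks in computer vision.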

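CNNs are mainly used for image classification, but recognizing specific objects within an image remains difficult. The milestone in image semantic segmentation was the fully convolutional network proposed by Long et al.^[15] in 2014. A CNN places fully connected layers after the convolutional layers to obtain a 1D feature vector, which a Softmax layer then classifies. The fully convolutional network (FCN) instead converts the fully connected layers of the CNN into convolutional layers, so it accepts input images of any size, and it upsamples the final feature map by deconvolution to the size of the input image. The FCN produces a prediction for every pixel while preserving the spatial information of the input image, and finally classifies pixel by pixel on the upsampled feature map. Compared with earlier approaches that used convolutional networks for image segmentation, the FCN has two notable advantages: 1) the input image size is unrestricted, so training and test images need not be preprocessed to a common size; 2) it is efficient, because the end-to-end structure avoids the repeated storage and the restricted receptive field caused by processing pixel patches.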
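The per-pixel classification step described above can be sketched in NumPy. This is only an illustrative sketch under assumptions: the score-map shape (9 classes, 4×6 pixels) and the function names are hypothetical, and random scores stand in for the upsampled network output.

```python
import numpy as np

def softmax(scores):
    """Softmax over the class axis of a (num_classes, H, W) score map."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def pixelwise_labels(scores):
    """Assign each pixel the class with the highest score."""
    # Softmax is monotonic, so argmax over scores equals argmax over probabilities.
    return np.argmax(scores, axis=0).astype(np.uint8)

# Hypothetical upsampled score map: 9 classes on a 4x6 image.
scores = np.random.default_rng(0).normal(size=(9, 4, 6))
probs = softmax(scores)
labels = pixelwise_labels(scores)
```

The label map has the same height and width as the score map, which is what lets the network keep the spatial layout of the input.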

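1.2 FFCN (fast FCN) network

To meet the real-time requirement of visual prosthesis image processing, the proposed FFCN is built by modifying the AlexNet classification network. AlexNet reduced the top-5 error rate on the ImageNet dataset to 16.4%, markedly lower than the 26.2% of the runner-up. AlexNet uses convolutional layers to extract deep feature information, adds overlapping pooling layers to reduce the number of parameters to be learned, and adopts the ReLU activation function to address the vanishing gradients of the sigmoid function in deeper networks; compared with other networks it is lightweight and fast to train. In deep multi-layer convolutional networks such as VGGNet and GoogLeNet, the feature information of the target region that is available at deconvolution decreases as network depth increases. The multi-layer scale fusion of FCN-VGG16 improves recognition accuracy in the target region, but fusing many layers considerably slows segmentation. The FFCN instead fuses features by adding the outputs of two layers, which achieves good segmentation while reducing segmentation time. The FFCN structure is shown in Fig. 1. The first five layers are convolutional layers and the last layer is a deconvolution layer; Conv1 and Conv2 are each followed by a pooling layer and a local response normalization layer. The output of Conv3 is added to the output of Conv4, and this scale fusion avoids the loss of image feature information caused by consecutive convolutions. Finally, the deconvolution layer (deConv1) makes the final output a 2D image rather than a 1D vector.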

Fig. 1 FFCN network architecture
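The additive fusion and the final upsampling can be illustrated with a small NumPy sketch. It is not the authors' Caffe definition: the channel count, feature-map size, and deconvolution kernel, stride, and padding below are hypothetical values, chosen only so that an 8×8 map at stride 32 returns to 256×256. The size relation used is the standard one for transposed convolution.

```python
import numpy as np

def fuse_by_sum(feat_a, feat_b):
    """Element-wise addition of two feature maps of identical shape."""
    if feat_a.shape != feat_b.shape:
        raise ValueError("feature maps must have the same shape to be added")
    return feat_a + feat_b

def deconv_output_size(in_size, kernel, stride, pad=0):
    """Spatial output size of a transposed convolution (deconvolution)."""
    return stride * (in_size - 1) + kernel - 2 * pad

# Hypothetical Conv3 and Conv4 outputs: 384 channels on an 8x8 grid.
conv3 = np.ones((384, 8, 8), dtype=np.float32)
conv4 = np.full((384, 8, 8), 2.0, dtype=np.float32)
fused = fuse_by_sum(conv3, conv4)

# Hypothetical deconvolution settings that map 8x8 back to 256x256.
upsampled_size = deconv_output_size(8, kernel=64, stride=32, pad=16)
```

Addition requires the two layers to produce maps of the same shape, which is why it adds almost no cost compared with the multi-branch fusion of the FCN-VGG16 variants.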

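2 XAUT dataset

To verify the effectiveness of the network structure, an indoor environment dataset (XAUT) was created, as shown in Fig. 3. The XAUT dataset collects basic items that blind people encounter indoors. Images are fixed at 256×256 pixels and cover nine categories (bed, seat, lamp, television, cupboard, cup, person, camera, and background) in 664 images, of which 600 are used for training and 64 for validation.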

Fig. 3 XAUT dataset

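The dataset format is shown in Fig. 4. Fig. 4(a) is the original image, Fig. 4(b) is the label made with Labelme, Fig. 4(c) is the pseudo-color image (or index image) formed from the grayscale label, and Fig. 4(d) shows the area that the pseudo-color image occupies in the original image; the legend in the lower right corner indicates the object that each color represents.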

Fig. 4 Dataset composition

((a) image; (b) label; (c) ground truth; (d) visual label)

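A custom RGB color table, shown in Fig. 5, is attached to the grayscale labels. The FFCN labels have nine gray values in total, and the color table shows directly which item each gray value denotes.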

Fig. 5 RGB color map
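Mapping a grayscale label to a pseudo-color image through a color table amounts to an indexed lookup. The sketch below assumes gray values 0 to 8 and an arbitrary palette; the paper does not list the actual gray values or colors, so both the class order and the RGB triples here are hypothetical.

```python
import numpy as np

# Hypothetical color table: row index = gray value, row = RGB color.
# Assumed order: background, bed, seat, lamp, television, cupboard, cup, person, camera.
PALETTE = np.array([
    [0, 0, 0],
    [128, 0, 0],
    [0, 128, 0],
    [128, 128, 0],
    [0, 0, 128],
    [128, 0, 128],
    [0, 128, 128],
    [128, 128, 128],
    [64, 0, 0],
], dtype=np.uint8)

def colorize(label_map):
    """Turn an (H, W) map of gray values into an (H, W, 3) pseudo-color image."""
    if label_map.max() >= len(PALETTE):
        raise ValueError("label value outside the color table")
    return PALETTE[label_map]  # integer-array indexing does the lookup

labels = np.array([[0, 1], [2, 8]], dtype=np.uint8)
rgb = colorize(labels)
```

Because the lookup is reversible as long as the colors are distinct, the grayscale label can stay the training target while the pseudo-color image serves only for visual inspection.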

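3 Experimental scheme

A verification scheme was designed for the proposed FFCN model, and the existing FCN-VGG16 family of models was selected for comparison. As shown in Table 1, VGG16^[16] is a classification network with 13 convolutional layers and 3 fully connected layers. Depending on the upsampling and multi-scale fusion used, the FCN-VGG16 structure yields FCN-8s, FCN-16s, FCN-32s, FCN-8s at-once, and other variants.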

Table 1 Comparison of structural parameters of different models

	FCN-VGG16	FFCN
Convolutional layers	13	5
Pooling layers	5	3
Maximum stride	32	32
Receptive field of output unit	404	355

The scheme is validated with the procedure shown in Fig. 6, which has two stages:

Fig. 6 Experimental scheme

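1) Training stage. The FFCN model and the FCN-VGG16 family of networks are each trained on the XAUT dataset for comparison, and the training logs and resulting models are saved.

2) Test stage. The trained models are run on the test samples, and the outputs of the different models are compared.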

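4 Experimental results and analysis

4.1 Experimental platform

The experiments were run on an Amax server with Ubuntu 16.04 and 8-core Tesla K80 GPU hardware. The deep learning framework is Caffe, which has the following advantages:

1) It was the first mainstream industrial-grade deep learning tool and can be applied in machine vision, robotics, speech recognition, neuroscience, astronomy, and other fields.

2) It specializes in image processing; it provides a complete toolkit with a pure C++/CUDA architecture; it supports command-line, Python, and MATLAB interfaces; and it switches seamlessly between CPU and GPU.

3) Its low-level components are well encapsulated, and its modular structure makes training and testing models straightforward.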

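4.2 Evaluation criteria for image segmentation

Because there are many kinds of image segmentation algorithms, their performance must be evaluated rigorously to measure the effect and contribution of a segmentation system. Computing quantitative performance indicators on the segmentation results is objective and repeatable. The common criteria are pixel accuracy (PA), mean class accuracy (MA), mean intersection over union (MIU), and frequency-weighted intersection over union (FIU). MIU is the most widely used because it is concise and representative^[17-18]. The criteria are computed as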

$ {f_{{\rm{PA}}}} = \sum\limits_i {{q_{ii}}} /\sum\limits_i {{w_i}} $

(3)

$ {f_{{\rm{MA}}}} = \left( {1/{q_{mn}}} \right)\sum\limits_i {{q_{ii}}/{w_i}} $

(4)

$ {f_{{\rm{MIU}}}} = \left( {1/{q_{mn}}} \right)\sum\limits_i {{q_{ii}}/\left( {{w_i} + \sum\limits_j {{q_{ji}}} - {q_{ii}}} \right)} $

(5)

$ {f_{{\rm{FIU}}}} = {\left( {\sum\limits_k {{w_k}} } \right)^{ - 1}}\sum\limits_i {{w_i}{q_{ii}}/\left( {{w_i} + \sum\limits_j {{q_{ji}}} - {q_{ii}}} \right)} $

(6)

where $ {q_{mn}}$ is the number of classes, $ {q_{ij}}$ is the number of pixels of class $ i$ that are identified as class $j $, ${w_i} = \sum\limits_j {{q_{ij}}} $ is the total number of pixels of class $i $, $ k$ indexes the classes, ${f_{{\rm{PA}}}} $ is the pixel accuracy, $ {f_{{\rm{MA}}}}$ is the mean class accuracy, $ {f_{{\rm{MIU}}}}$ is the mean intersection over union, and ${f_{{\rm{FIU}}}}$ is the frequency-weighted intersection over union.
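As a worked illustration, the four criteria can be computed from a confusion matrix whose entry (i, j) counts pixels of class i identified as class j. This is a minimal NumPy sketch using the standard definitions of these metrics; the function name and the 2×2 example matrix are made up for illustration, and classes with no ground-truth pixels are skipped when averaging, which is an assumption of this sketch.

```python
import numpy as np

def segmentation_metrics(conf):
    """Return (PA, MA, MIU, FIU) from a square confusion matrix.

    conf[i, j] = number of pixels of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=np.float64)
    correct = np.diag(conf)            # q_ii
    true_total = conf.sum(axis=1)      # w_i, pixels that belong to class i
    pred_total = conf.sum(axis=0)      # pixels predicted as class i
    union = true_total + pred_total - correct
    present = true_total > 0           # assumption: ignore absent classes
    iu = correct[present] / union[present]
    pa = correct.sum() / true_total.sum()
    ma = np.mean(correct[present] / true_total[present])
    miu = iu.mean()
    fiu = (true_total[present] * iu).sum() / true_total.sum()
    return pa, ma, miu, fiu

# Made-up two-class example: 10 pixels, 8 classified correctly.
pa, ma, miu, fiu = segmentation_metrics([[3, 1], [1, 5]])
```

For this example the per-class intersection over union is 3/5 and 5/7, so the mean is about 0.657, while the pixel accuracy is 0.8.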

The $ {f_{{\rm{MIU}}}}$ formula can be rewritten in set form as

$ {f_{{\rm{MIU}}}} = \frac{{|\mathit{\boldsymbol{R}} \cap \mathit{\boldsymbol{R}}\prime |}}{{|\mathit{\boldsymbol{R}} \cup {\rm{ }}\mathit{\boldsymbol{R}}\prime |}} $

(7)

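where $ \mathit{\boldsymbol{R}}$ is the ground truth and ${\mathit{\boldsymbol{R}}\prime } $ is the prediction, that is, the segmented image; the ratio is computed for each class and then averaged over the classes.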

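4.3 Analysis of experimental data

4.3.1 FFCN segmentation results

To study how well the FFCN segments the XAUT dataset, a test image was selected at random (its content includes a complete bed, a partly occluded bed, and a seat) and segmented with the FFCN model at different training stages, as shown in Fig. 7. Training took 13 h, and a model snapshot was saved every 4 000 iterations. The FFCN was trained for 100 000 iterations, and the snapshots at 4 000, 12 000, 36 000, and 80 000 iterations were tested. In Fig. 7(a), the model identifies only the approximate position of the complete bed; the partly occluded bed and the seat are not recognized. In Fig. 7(b), the partly occluded bed and the seat are recognized, but the object positions and outlines are still unclear; the training parameters continue to be updated through forward inference and backpropagation. In Fig. 7(c)(d), the object shapes and outlines are very close to the visualized ground truth, which indicates that the FFCN weights have converged well.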

Fig. 7 The effect of different iterations on FFCN

((a) iteration 4 000; (b) iteration 12 000; (c) iteration 36 000; (d) iteration 80 000; (e) ground truth)

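The proposed FFCN shows the position of each segmented indoor object, with the output identified by grayscale category labels, and is used for image preprocessing in the visual prosthesis project. In post-processing, the size of each item class is obtained as the ratio of its pixel count $i $ to the total number of pixels in the image, and its position is given by the mean coordinates of its pixels, taken as the center coordinates $ [{X_{{\rm{mid}}}}, {Y_{{\rm{mid}}}}]$. The result is presented to the blind user as corresponding symbols on a 32×32 low-resolution image.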
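The size and position statistics described above can be sketched as follows. The function names and the way the center is scaled onto the 32×32 grid are assumptions made for illustration; the paper does not specify how the symbols are placed.

```python
import numpy as np

def class_size_and_center(label_map, class_id):
    """Return (area ratio, (X_mid, Y_mid)) for one class in an (H, W) label map."""
    mask = label_map == class_id
    count = int(mask.sum())
    if count == 0:
        return 0.0, None  # class not present in this image
    ys, xs = np.nonzero(mask)
    return count / label_map.size, (float(xs.mean()), float(ys.mean()))

def to_grid(center, image_shape, grid=32):
    """Map pixel coordinates to a cell of a grid x grid display (assumed scaling)."""
    x, y = center
    height, width = image_shape
    col = min(int(x * grid / width), grid - 1)
    row = min(int(y * grid / height), grid - 1)
    return col, row

# Synthetic 256x256 label map with one 64x64 object of class 1.
labels = np.zeros((256, 256), dtype=np.uint8)
labels[64:128, 32:96] = 1
ratio, center = class_size_and_center(labels, 1)
cell = to_grid(center, labels.shape)
```

Here the object covers 4 096 of 65 536 pixels, so the ratio is 0.0625, and its center (63.5, 95.5) falls in grid cell (7, 11).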

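4.3.2 Segmentation results of different networks

Fig. 8 shows the segmentation results for four test images, which cover four categories (bed, cupboard, seat, and lamp) and five items in total. The results in Fig. 8(e)(f) are the best. Comparing Fig. 8(b) and (c) shows that convolutional layers can abstract high-level image features, but too many convolutional layers lose feature information: in this experiment the features extracted by the five convolutional layers of the FFCN are clearly better than those of the FCN-32s structure. The results of FCN-32s, FCN-16s, and FCN-8s show that fusing features across layers at different scales compensates for the features lost during convolution. FCN-16s fuses the Pool4 layer with the FCN-32s result, and FCN-8s fuses the Pool3 layer with the FCN-16s result; both fusions clearly improve the final segmentation output.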

Fig. 8 The test results of different models

((a) original images; (b) FCN-32s; (c) FFCN; (d) FCN-16s; (e) FCN-8s; (f) FCN-8s at-once; (g) ground truth)

FCN-8s at-once fine-tunes the FCN-8s structure by adding Scale layers after Pool4 and Pool3 to normalize the data, which also improves the test results somewhat.

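Table 2 and Table 3 list the performance indicators and segmentation times of the algorithms. Table 2 shows that the pixel accuracy of every network exceeds 85% and the MIU exceeds 60%. FCN-8s at-once has the highest MIU at 70.14%, but Table 3 shows that its segmentation speed is only about one-fifth that of the FFCN. Table 3 also shows that the FFCN segments faster on average than the FCN-VGG16 family, reaching 40 frame/s. This is because the FFCN has few convolutional layers, its three pooling operations lose little feature information, and its feature fusion by inter-layer addition segments an image quickly, which meets the real-time requirement of visual prosthesis image preprocessing.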

Table 2 Comparison of training results of FFCN and FCN-VGG16 series models

/%
	FCN-32s	FFCN	FCN-16s	FCN-8s	FCN-8s at-once
${f_{{\rm{PA}}}} $	87.45	88.57	90.37	90.58	90.56
$ {f_{{\rm{MA}}}}$	75.74	76.52	76.53	77.15	77.28
${f_{{\rm{MIU}}}} $	63.87	65.95	69.58	69.97	70.14
${f_{{\rm{FIU}}}} $	78.01	78.35	82.48	82.89	82.87

Table 3 Comparison of the average segmentation time of FFCN and FCN-VGG16 series models

Method	Average segmentation time/ms
FCN-32s	127
FFCN	25
FCN-16s	128
FCN-8s	126
FCN-8s at-once	127
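The frame rate and speed ratio quoted in the text follow directly from the average times in Table 3. The short sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Average segmentation time per image in milliseconds, copied from Table 3.
times_ms = {
    "FCN-32s": 127,
    "FFCN": 25,
    "FCN-16s": 128,
    "FCN-8s": 126,
    "FCN-8s at-once": 127,
}

# Frames per second = 1000 ms divided by the time per frame.
fps = {name: 1000.0 / t for name, t in times_ms.items()}

# How many times faster FFCN is than FCN-8s at-once.
speedup = times_ms["FCN-8s at-once"] / times_ms["FFCN"]
```

FFCN at 25 ms per image corresponds to 40 frame/s, and FCN-8s at-once at 127 ms is about 5.08 times slower, which matches the "one-fifth" statement.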

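5 Conclusion

This paper proposes the FFCN, a fast-segmentation convolutional neural network, for indoor scene segmentation in a visual prosthesis project, and compares the segmentation of different models on the XAUT dataset. The method avoids relying on low-level pixel information such as color, brightness, and texture; it abstracts image features through convolutional and pooling layers, and its scale feature fusion reduces the information lost through overly deep convolution and pooling. The experimental results show that, when convolutional neural networks are used to segment images of different indoor scenes, the FFCN has the shortest average segmentation time on the test set, 25 ms per image. It can serve as the image processing step that precedes recognition in a visual prosthesis for the blind, and it is an effective and feasible real-time method.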

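Convolutional neural networks are developing rapidly in computer vision, and the MIU of the proposed FFCN still needs improvement. Recent work combines the FCN with the graph-based CRF (conditional random field)^[19], and the DeepLab^[20] architecture uses atrous convolution to counter the loss of image information caused by pooling layers, which improves segmentation more effectively. For scene segmentation, how to make a model learn massive amounts of scene information and how to avoid information loss in the deep layers of a convolutional neural network will be the focus of future research.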

References

[1] Yang Y, Quan N N, Bu J J, et al. A soft decoding algorithm and hardware implementation for the visual prosthesis based on high order soft demodulation[J]. BioMedical Engineering Online, 2016, 15: 110. [DOI:10.1186/s12938-016-0229-3]

[2] Guo F, Yang Y, Gao Y. Optimization of Visual Information Presentation for Visual Prosthesis[J]. International Journal of Biomedical Imaging, 2018, 2018: #3198342. [DOI:10.1155/2018/3198342]

[3] Jiang F, Gu Q, Hao H Z, et al. Survey on content-based image segmentation methods[J]. Journal of Software, 2017, 28(1): 160–183. [姜枫, 顾庆, 郝慧珍, 等. 基于内容的图像分割方法综述[J]. 软件学报, 2017, 28(1): 160–183. ] [DOI:10.13328/j.cnki.jos.005136]

[4] Jung H, Choi M K, Soon K, et al. End-to-end pedestrian collision warning system based on a convolutional neural network with semantic segmentation[C]//Proceedings of 2018 IEEE International Conference on Consumer Electronics. Las Vegas, USA: IEEE, 2018: 1-3.[DOI: 10.1109/ICCE.2018.8326129]

[5] Shi J B, Malik J. Normalized cuts and image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888–905. [DOI:10.1109/34.868688]

[6] Boykov Y, Funka-Lea G. Graph cuts and efficient N-D image segmentation[J]. International Journal of Computer Vision, 2006, 70(2): 109–131. [DOI:10.1007/s11263-006-7934-5]

[7] Fukunaga K, Hostetler L D. The estimation of the gradient of a density function, with applications in pattern recognition[J]. IEEE Transactions on Information Theory, 1975, 21(1): 32–40. [DOI:10.1109/tit.1975.1055330]

[8] Achanta R, Shaji A, Smith K, et al. SLIC superpixels compared to state-of-the-art superpixel methods[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11): 2274–2282. [DOI:10.1109/tpami.2012.120]

[9] Li Z Q, Chen J S. Superpixel segmentation using Linear Spectral Clustering[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 1356-1363.[DOI: 10.1109/cvpr.2015.7298741]

[10] Liu Y, Liu J, Li Z C, et al. Weakly-supervised dual clustering for image semantic segmentation[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 2075-2082.[DOI: 10.1109/CVPR.2013.270]

[11] Li Z C, Tang J H. Weakly supervised deep matrix factorization for social image understanding[J]. IEEE Transactions on Image Processing, 2017, 26(1): 276–288. [DOI:10.1109/TIP.2016.2624140]

[12] Girshick R. Fast R-CNN[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1440-1448.[DOI: 10.1109/iccv.2015.169]

[13] Kido S, Hirano Y, Hashimoto N. Detection and classification of lung abnormalities by use of convolutional neural network (CNN) and regions with CNN features (R-CNN)[C]//Proceedings of 2018 International Workshop on Advanced Image Technology. Chiang Mai, Thailand: IEEE, 2018: 1-4.[DOI: 10.1109/IWAIT.2018.8369798]

[14] Yuan Y D, Chao M, Lo Y C. Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance[J]. IEEE Transactions on Medical Imaging, 2017, 36(9): 1876–1886. [DOI:10.1109/TMI.2017.2695227]

[15] Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640–651. [DOI:10.1109/TPAMI.2016.2572683]

[16] Qassim H, Verma A, Feinzimer D. Compressed residual-VGG16 CNN model for big data places image recognition[C]//Proceedings of 2018 IEEE 8th Annual Computing and Communication Workshop and Conference. Las Vegas, NV, USA: IEEE, 2018: 169-175.[DOI: 10.1109/CCWC.2018.8301729]

[17] Cheng P M, Malhi H S. Transfer learning with convolutional neural networks for classification of abdominal ultrasound images[J]. Journal of Digital Imaging, 2017, 30(2): 234–243. [DOI:10.1007/s10278-016-9929-2]

[18] Mun J, Jang W D, Sung D J, et al. Comparison of objective functions in CNN-based prostate magnetic resonance image segmentation[C]//Proceedings of 2017 IEEE International Conference on Image Processing. Beijing, China: IEEE, 2017: 3859-3863.[DOI: 10.1109/ICIP.2017.8297005]

[19] Xu P Y, Sarikaya R. Convolutional neural network based triangular CRF for joint intent detection and slot filling[C]//Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic: IEEE, 2013: 78-83.[DOI: 10.1109/ASRU.2013.6707709]

[20] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848. [DOI:10.1109/TPAMI.2017.2699184]