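1. School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048;
2. CRRC Xi'an Yonge Electric Co., Ltd., Xi'an 710018
 Received: 2018-06-05; Revised: 2018-08-10. Supported by: National Natural Science Foundation of China (61102017). About the first author: Huang Long, born in 1995, male, master's student; main research interests: integrated circuits, deep learning image processing, and face recognition. E-mail: hl13700200646@163.com; Wang Qingjun, male, Ph.D.; main research interest: circuit system design. E-mail: qingjun.wang@qq.com; Guo Fei, male, doctoral student; main research interests: image processing and integrated circuit design. E-mail: guofei601@126.com; Gao Yong, male, professor and doctoral supervisor; main research interests: power electronic devices and power integration, and novel semiconductor devices. E-mail: gaoy@xaut.edu.cn.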

Indoor scene segmentation based on fully convolutional neural networks
Huang Long1, Yang Yuan1, Wang Qingjun2, Guo Fei1, Gao Yong1
1. Xi'an University of Technology, Xi'an 710048, China;
2. CRRC Corporation Limited Xi'an Yonge Electric Co., Ltd., Xi'an 710018, China
Supported by: National Natural Science Foundation of China(61102017)

Objective Vision is one of the most important ways by which humans obtain information. A visual prosthesis implants electrodes into the body of a blind person to stimulate the optic nerve so that the person perceives phosphenes (sensations of light). The objects perceived in this way show only general features, with low resolution and poor linearity, and in some cases the blind can hardly distinguish the phosphenes. Therefore, before the electrodes are stimulated, image segmentation is used to display the general position and outline of objects, helping blind people recognize familiar objects clearly. Considering the application characteristics of visual prostheses, a fast image segmentation method based on a convolutional neural network is proposed for indoor scenes.

Method To meet the real-time image processing demand of visual prostheses, the fast fully convolutional network (FFCN) proposed in this paper is built on the AlexNet classification network. AlexNet reduced the top-5 error rate on the ImageNet dataset to 16.4%, compared with 26.2% for the runner-up. AlexNet uses convolution layers to extract deep feature information, adds overlapping pooling layers to reduce the number of parameters that must be learned, and adopts the ReLU activation function to address the vanishing-gradient problem of the Sigmoid function in deeper networks. Compared with other networks, it is lightweight and fast to train. First, the FFCN for image segmentation in indoor scenes was constructed; it is composed of five convolution layers and one deconvolution layer. The loss of image feature information caused by successive convolutions is avoided through scale fusion. To verify the effectiveness of the network, a dataset of basic items that blind people can touch in an indoor environment was created (the XAUT dataset). The dataset is divided into nine categories and includes 664 items, such as beds, seats, lamps, televisions, cupboards, cups, and people. The type of each item was marked by a gray level in the original image, and a color table was added to map the gray image into a pseudo-color map that serves as the semantic label. The XAUT dataset was used to train the FFCN under the Caffe framework, and image features were extracted through the deep feature learning and scale fusion of the convolutional neural network to obtain an indoor-scene segmentation model suited to visual prostheses for the blind. To assess the validity of the model, the conventional models FCN-8s, FCN-16s, FCN-32s, and FCN-8s at-once were fine-tuned on the same dataset to obtain the corresponding indoor-scene segmentation models.

Results A comparative experiment was conducted on an Amax server running Ubuntu 16.04. Training took 13 h, and a model snapshot was saved every 4 000 iterations. The models were tested at 4 000, 12 000, 36 000, and 80 000 iterations. The pixel accuracy of all networks exceeded 85%, and the mean IU was above 60%. The FCN-8s at-once network had the highest mean IU (70.4%), but its segmentation speed was only one-fifth of that of the FFCN. While the other indicators differed only slightly, the average segmentation speed of the FFCN reached 40 frame/s.

Conclusion The FFCN can effectively use multi-layer convolution to extract image information and avoid the influence of low-level information such as brightness, color, and texture. Moreover, scale fusion avoids the loss of image feature information in the convolution and pooling layers. Compared with the other FCN networks, the FFCN is faster and improves the real-time performance of image preprocessing.

# Key words

indoor environment; visual prosthesis; semantic segmentation; convolution neural network; deep learning

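# 1.1 CNN (convolutional neural network) and FCN (fully convolutional network)

CNNs are mainly used for image classification, but recognizing the objects in specific parts of an image remains a difficult problem. The landmark event in image semantic segmentation was the fully convolutional network proposed by Long et al.[15] in 2014. A CNN follows its convolutional layers with fully connected layers to obtain a one-dimensional feature vector, which is then classified by a Softmax layer. In contrast, the fully convolutional network (FCN) replaces the fully connected layers of the CNN with convolutional layers, so it can accept input images of any size, and it upsamples the final feature map by deconvolution to obtain an output of the same size as the input image. The FCN produces a prediction for every pixel and preserves the spatial information of the input image; classification is finally performed pixel by pixel on the upsampled feature map. Compared with traditional methods that use convolutional networks for image segmentation, the FCN has two outstanding advantages: 1) the size of the input image is not restricted, so the training and test images need not all be preprocessed to the same size; 2) it is very efficient, because the end-to-end structure avoids the repeated storage and the limited receptive field caused by using pixel patches.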
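The per-pixel prediction described above can be illustrated with a minimal sketch. It assumes the network has already produced one coarse score map per class. Nearest-neighbour upsampling is used here only as a simple stand-in for the learned deconvolution, and the function names `upsample_nearest` and `per_pixel_labels` are hypothetical.

```python
def upsample_nearest(score_map, factor):
    """Repeat every value `factor` times along both axes.

    Stand-in for a learned deconvolution layer (assumption, for illustration only).
    """
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def per_pixel_labels(class_scores):
    """Pick, for every pixel, the index of the class with the highest score.

    `class_scores` is a list of 2D score maps, one per class, all of the same size.
    """
    rows, cols = len(class_scores[0]), len(class_scores[0][0])
    return [
        [max(range(len(class_scores)), key=lambda k: class_scores[k][r][c])
         for c in range(cols)]
        for r in range(rows)
    ]


# Two classes, one coarse 1x2 score map per class, upsampled by a factor of 2.
coarse = [
    [[0.9, 0.2]],  # scores for class 0
    [[0.1, 0.8]],  # scores for class 1
]
full = [upsample_nearest(m, 2) for m in coarse]
labels = per_pixel_labels(full)
print(labels)
```

Because both functions work on whatever map size they receive, the sketch also shows why no fixed input size is required.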

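# 1.2 FFCN (Fast FCN) network

1) Convolution layer. The convolution layer is the most important part of the neural network. As shown in Fig. 2, through the filter operation, the input of each node in the convolution layer is only a small patch of the previous layer. In general, the node matrix becomes deeper after being processed by a convolution layer, which yields features with a higher level of abstraction.

 $g\left( i \right) = f\left( {\sum\limits_{x = 1}^2 {\sum\limits_{y = 1}^2 {\sum\limits_{z = 2}^3 {{a_{x, y, z}} \times w_{x, y, z}^i} } } + {b^i}} \right)$ (1)

2) Pooling layer. The FFCN uses max pooling. A pooling layer reduces the size of the matrix and thus the number of parameters in the final fully connected layers. Using pooling layers speeds up computation and helps prevent overfitting.

3) Local response normalization layer. This layer creates a competition mechanism among local neurons: larger responses become relatively larger, while neurons with smaller responses are suppressed, which improves the generalization ability of the model.

4) Activation function layer. The FFCN uses the ReLU activation function, defined as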

 $f\left( x \right) = {\rm{max}}(x, 0)$ (2)

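5) Deconvolution layer. The deconvolution layer applies the inverse of the convolution operation to the intermediate deep feature maps. Put simply, convolution transforms an image $\mathit{\boldsymbol{C}}$ and a matrix $\mathit{\boldsymbol{T}}$ into $\mathit{\boldsymbol{C}} \times \mathit{\boldsymbol{T}}$, whereas deconvolution computes $(\mathit{\boldsymbol{C}} \times \mathit{\boldsymbol{T}}) \times {\mathit{\boldsymbol{T}}^{ - 1}}$, which restores the effect of the convolution.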
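The convolution node of Eq. (1), the ReLU of Eq. (2), and max pooling can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the FFCN implementation: the weighted sum runs over every element of the patch passed in, the bias is added once per node, and the names `conv_node` and `max_pool_2d` as well as the 2×2 window with stride 2 are hypothetical choices.

```python
def relu(x):
    """Eq. (2): f(x) = max(x, 0)."""
    return max(x, 0.0)


def conv_node(patch, weights, bias):
    """One output node in the spirit of Eq. (1).

    `patch` and `weights` are nested lists of identical shape [x][y][z].
    The weighted sum over the patch plus the bias is passed through ReLU.
    """
    total = bias
    for a_x, w_x in zip(patch, weights):
        for a_xy, w_xy in zip(a_x, w_x):
            for a, w in zip(a_xy, w_xy):
                total += a * w
    return relu(total)


def max_pool_2d(feature_map, size=2, stride=2):
    """Max pooling over a 2D list; windows that do not fit are dropped."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


# A 2x2x3 patch of ones and a filter whose weights are all 0.5.
patch = [[[1.0] * 3 for _ in range(2)] for _ in range(2)]
weights = [[[0.5] * 3 for _ in range(2)] for _ in range(2)]
print(conv_node(patch, weights, bias=-1.0))    # positive pre-activation is kept
print(conv_node(patch, weights, bias=-10.0))   # negative pre-activation is clipped to 0

grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(max_pool_2d(grid))                       # 4x4 map reduced to 2x2
```

Pooling halves each spatial dimension here, which is the size reduction item 2) refers to.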

# 3 Experimental scheme

Table 1 Comparison of structural parameters of different models

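| | FCN-VGG16 | FFCN |
|---|---|---|
| Convolution layers | 13 | 5 |
| Pooling layers | 5 | 3 |
| Maximum stride | 32 | 32 |
| Receptive field of output unit | 404 | 355 |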
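The maximum stride and receptive field values in Table 1 can be checked with a short calculation. The sketch below is an assumption-laden illustration: the layer lists follow the standard VGG16 and AlexNet definitions with the first fully connected layer converted to a convolution, and they are not taken from this paper. The FFCN layer configuration itself is not listed here, so the AlexNet-style list only shows that such a layout is consistent with the tabulated 355 and 32.

```python
def receptive_field(layers):
    """Return (receptive field, total stride) of one output unit.

    `layers` is a sequence of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input-space steps
        jump *= stride              # spacing between adjacent units, in input pixels
    return rf, jump


CONV3, POOL2 = (3, 1), (2, 2)

# Assumed: standard VGG16 layout, with fc6 converted to a 7x7 convolution.
VGG16_FCN = (
    [CONV3] * 2 + [POOL2]
    + [CONV3] * 2 + [POOL2]
    + [CONV3] * 3 + [POOL2]
    + [CONV3] * 3 + [POOL2]
    + [CONV3] * 3 + [POOL2]
    + [(7, 1)]
)

# Assumed: standard AlexNet layout, with fc6 converted to a 6x6 convolution.
ALEXNET_FCN = [
    (11, 4), (3, 2),          # conv1, pool1
    (5, 1), (3, 2),           # conv2, pool2
    (3, 1), (3, 1), (3, 1),   # conv3 to conv5
    (3, 2),                   # pool5
    (6, 1),                   # fc6 as a convolution
]

print(receptive_field(VGG16_FCN))    # (404, 32)
print(receptive_field(ALEXNET_FCN))  # (355, 32)
```

Padding is ignored because it shifts the receptive field but does not change its size.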

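1) Training stage. The FFCN model and the FCN-VGG16 series networks were each trained on the XAUT dataset for comparison, and the training logs and resulting models were saved.

2) Testing stage. The test samples were run through the trained models, and the outputs of the different models were compared.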

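# 4.1 Introduction to the experimental platform

1) The first mainstream industrial-grade deep learning tool, applicable to fields such as machine vision, robotics, speech recognition, neuroscience, and astronomy.

2) Specialized in image processing; provides a complete toolkit with a pure C++/CUDA architecture; supports command-line, Python, and MATLAB interfaces; switches seamlessly between CPU and GPU.

3) Well-encapsulated low-level code; the modular structure makes training and testing models straightforward.

# 4.2 Evaluation criteria for image segmentation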

 ${f_{{\rm{PA}}}} = \sum\limits_i {{q_{ii}}} /\sum\limits_i {{w_i}}$ (3)

 ${f_{{\rm{MA}}}} = \left( {1/{q_{mn}}} \right)\sum\limits_i {{q_{ii}}/{w_i}}$ (4)

 ${f_{{\rm{MIU}}}} = \left( {1/{q_{mn}}} \right)\sum\limits_i {{q_{ii}}/\left( {{w_i} + \sum\limits_j {{q_{ij}}} - {q_{ii}}} \right)}$ (5)

 ${f_{{\rm{FIU}}}} = {\left( {\sum\limits_k {{w_k}} } \right)^{ - 1}}\sum\limits_i {{w_i}{q_{ii}}/\left( {{w_i} + \sum\limits_j {{q_{ij}}} - {q_{ii}}} \right)}$ (6)

The formula for ${f_{{\rm{MIU}}}}$ can be rewritten as

 ${f_{{\rm{MIU}}}} = \frac{{|\mathit{\boldsymbol{R}} \cap \mathit{\boldsymbol{R}}\prime |}}{{|\mathit{\boldsymbol{R}} \cup {\rm{ }}\mathit{\boldsymbol{R}}\prime |}}$ (7)
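As a rough illustration, the four metrics of Eqs. (3) to (6) could be computed from a confusion matrix as sketched below. The symbol mapping is an assumption, since the definitions are not restated here: `conf[i][i]` plays the role of $q_{ii}$, the row sum plays the role of $w_i$ (all pixels of true class $i$), the class count plays the role of $q_{mn}$, and the sum over $j$ in the denominators is taken as the number of pixels predicted as class $i$, so that each denominator is the union of Eq. (7). The function name `segmentation_metrics` is hypothetical.

```python
def segmentation_metrics(conf):
    """Compute (PA, MA, MIU, FIU) from a square confusion matrix.

    Assumed convention: conf[i][j] = number of pixels of true class i
    that were predicted as class j.
    """
    n = len(conf)
    row = [sum(conf[i]) for i in range(n)]                        # pixels per true class
    col = [sum(conf[i][j] for i in range(n)) for j in range(n)]   # pixels per predicted class
    total = sum(row)
    # Guard: classes absent from the ground truth are skipped to avoid dividing by zero.
    present = [i for i in range(n) if row[i] > 0]

    pa = sum(conf[i][i] for i in range(n)) / total
    ma = sum(conf[i][i] / row[i] for i in present) / len(present)
    # Intersection over union per class: correct / (true + predicted - correct).
    iu = {i: conf[i][i] / (row[i] + col[i] - conf[i][i]) for i in present}
    miu = sum(iu.values()) / len(present)
    fiu = sum(row[i] * iu[i] for i in present) / total
    return pa, ma, miu, fiu


# Toy 2-class example with 10 pixels in total.
conf = [
    [4, 1],   # true class 0: 4 correct, 1 predicted as class 1
    [2, 3],   # true class 1: 2 predicted as class 0, 3 correct
]
pa, ma, miu, fiu = segmentation_metrics(conf)
print(f"PA={pa:.3f} MA={ma:.3f} MIU={miu:.3f} FIU={fiu:.3f}")
```

In this toy case both classes have the same number of pixels, so the frequency-weighted IU equals the mean IU.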

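# 4.3.2 Segmentation results of different networks

FCN-8s at-once fine-tunes the network structure of FCN-8s: a Scale layer is added after the Pool4 and Pool3 layers to normalize the data, which also improves the test results to some extent.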

Table 2 Comparison of training results of FFCN and FCN-VGG16 series models

| Metric /% | FCN-32s | FFCN | FCN-16s | FCN-8s | FCN-8s at-once |
|---|---|---|---|---|---|
| ${f_{{\rm{PA}}}}$ | 87.45 | 88.57 | 90.37 | 90.58 | 90.56 |
| ${f_{{\rm{MA}}}}$ | 75.74 | 76.52 | 76.53 | 77.15 | 77.28 |
| ${f_{{\rm{MIU}}}}$ | 63.87 | 65.95 | 69.58 | 69.97 | 70.14 |
| ${f_{{\rm{FIU}}}}$ | 78.01 | 78.35 | 82.48 | 82.89 | 82.87 |

Table 3 Comparison of the average segmentation time of FFCN and FCN-VGG16 series models

| Method | Average segmentation time /ms |
|---|---|
| FCN-32s | 127 |
| FFCN | 25 |
| FCN-16s | 128 |
| FCN-8s | 126 |
| FCN-8s at-once | 127 |
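The times in Table 3 can be converted to frame rates with simple arithmetic. The short sketch below only restates the tabulated values; the variable names are illustrative.

```python
# Average segmentation time per image in milliseconds, copied from Table 3.
avg_time_ms = {
    "FCN-32s": 127,
    "FFCN": 25,
    "FCN-16s": 128,
    "FCN-8s": 126,
    "FCN-8s at-once": 127,
}

# Frames per second = 1000 ms divided by the time per frame.
fps = {name: 1000 / t for name, t in avg_time_ms.items()}

# How many times longer each model takes than FFCN.
slowdown = {name: t / avg_time_ms["FFCN"] for name, t in avg_time_ms.items()}

for name in avg_time_ms:
    print(f"{name}: {fps[name]:.1f} frame/s, {slowdown[name]:.2f}x FFCN time")
```

At 25 ms per image, FFCN runs at 40 frame/s, while the FCN-VGG16 variants take roughly five times as long per image.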

