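Abstract: Objective A visual prosthesis implants electrodes in a blind person's body to stimulate the optic nerve and produce phosphenes. The blind user perceives only the rough outline of an object, so the object recognition rate is low. Considering the characteristics of indoor application scenarios for visual prostheses, a fast convolutional neural network image segmentation method is proposed to segment indoor scene images; the segmentation displays the approximate position and outline of objects to help the blind recognize them. Method An FFCN (Fast Fully Convolutional Networks) network is constructed for indoor scene image segmentation, and inter-layer fusion avoids the loss of image feature information caused by successive convolutions. To verify the effectiveness of the network, a dataset of basic everyday items in indoor environments (hereinafter the XAUT dataset) is created. The class of each item is marked in grayscale on the original image, and a color table is then attached to map the grayscale image to a pseudo-color image that serves as the semantic label. The FFCN network is trained on the XAUT dataset under the Caffe framework to obtain an indoor scene segmentation model suited to visual prostheses for the blind. For comparison, the structures of the traditional Fcn-8s, Fcn-16s, Fcn-32s and other models are fine-tuned, and the models are trained on the same dataset to obtain corresponding indoor scene segmentation models. Results The pixel accuracy of every network exceeds 85%, and the mean intersection over union (Mean IU) of every network exceeds 60%. The Fcn-8s at-once network has the highest Mean IU, 70.4%, but its segmentation speed is only 1/5 that of FFCN. With the other metrics differing little, the FFCN fast segmentation network reaches an average segmentation speed of 40 fps. Conclusion The proposed FFCN convolutional neural network makes effective use of multi-layer convolution to extract image information and avoids the influence of low-level information such as brightness, color and texture. Scale fusion largely avoids the loss of image feature information in the network's convolution and pooling. Compared with other FCN networks, FFCN is faster, which helps improve the real-time performance of image preprocessing.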
Indoor scene segmentation algorithm based on convolutional neural network
Abstract: Objective Vision is one of the most important ways for humans to obtain information. A visual prosthesis implants electrodes into a blind person's body to stimulate the optic nerve so that the person perceives phosphenes. The perceived image conveys only the general features of an object, with low resolution and poor linearity, and the blind user can sometimes hardly identify the object that the phosphenes represent. Before the electrodes are stimulated, image segmentation should therefore be used to display the general position and outline of each object, which helps blind users recognize familiar objects. Based on the characteristics of this application, a fast image segmentation method built on a convolutional neural network is proposed for indoor scenes in visual prostheses. Method To meet the demand of visual prostheses for real-time image processing, the FFCN network structure proposed in this paper is derived from the AlexNet classification network. AlexNet reduced the top-5 error rate on the ImageNet dataset to 16.4%, a large improvement over the 26.2% of the second-place entry. AlexNet uses convolution layers to extract deep feature information, adds overlapping pooling layers to reduce the number of parameters to be learned, and uses the ReLU activation function to address the vanishing gradient problem that the sigmoid function causes in deeper networks. Compared with other networks, it is lightweight and fast to train. The FFCN (Fast Fully Convolutional Networks) network for indoor scene image segmentation is constructed from five convolution layers and one deconvolution layer. Scale fusion avoids the loss of image feature information that successive convolutions would otherwise cause.
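The scale fusion mentioned above can be illustrated with a minimal sketch. The abstract does not specify how FFCN fuses layers, so this assumes the FCN-style approach: a coarse score map is upsampled to the resolution of a shallower one and the two are added element-wise. Nearest-neighbour upsampling stands in for the learned deconvolution layer, and the function names `upsample_nearest` and `fuse` are hypothetical.

```python
# Illustrative only: FCN-style skip fusion on tiny 2-D score maps.
# Assumptions (not from the paper): nearest-neighbour upsampling replaces
# the learned deconvolution, and fusion is an element-wise sum.

def upsample_nearest(score_map, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse(deep_scores, shallow_scores, factor=2):
    """Upsample the coarse map to the finer grid and add the two maps."""
    up = upsample_nearest(deep_scores, factor)
    if len(up) != len(shallow_scores) or len(up[0]) != len(shallow_scores[0]):
        raise ValueError("shapes do not match after upsampling")
    return [[a + b for a, b in zip(row_up, row_sh)]
            for row_up, row_sh in zip(up, shallow_scores)]


# Coarse scores from a deep layer (2x2) and finer scores from a
# shallower layer (4x4) for a single class.
deep = [[1.0, 2.0],
        [3.0, 4.0]]
shallow = [[0.1, 0.2, 0.3, 0.4],
           [0.5, 0.6, 0.7, 0.8],
           [0.9, 1.0, 1.1, 1.2],
           [1.3, 1.4, 1.5, 1.6]]
fused = fuse(deep, shallow)
```

The fused map keeps the coarse semantic evidence of the deep layer while recovering spatial detail from the shallower layer, which is the motivation usually given for this kind of fusion.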
To verify the effectiveness of the network, a dataset of basic items that blind people can touch in an indoor environment is created (hereinafter the XAUT dataset). It is divided into 9 categories, such as beds, seats, lamps, televisions, cupboards, cups, and people, with 664 items in total. The class of each item is marked in grayscale on the original image, and a color table is then added to map the grayscale image to a pseudo-color image that serves as the semantic label. The FFCN network is trained on the XAUT dataset under the Caffe framework; image features are extracted through the deep feature learning and scale fusion of the convolutional neural network, yielding an indoor scene segmentation model suited to visual prostheses for the blind. For comparison, the traditional models Fcn-8s, Fcn-16s, Fcn-32s, and Fcn-8s at-once are fine-tuned and trained on the same dataset to obtain corresponding indoor scene segmentation models. Results A comparative experiment is conducted on an Amax server running Ubuntu 16.04. Training the model takes 13 hours; a snapshot of the model is saved every 4000 iterations, and the snapshots at 4000, 12000, 36000, and 80000 iterations are tested. The pixel accuracy of every network exceeds 85%, and the Mean IU of every network is above 60%. The Fcn-8s at-once network has the highest Mean IU, 70.4%, but its segmentation speed is only 1/5 that of FFCN. With little difference in the other metrics, the FFCN fast convolutional neural network reaches an average segmentation speed of 40 fps. Conclusion The FFCN convolutional neural network makes effective use of multi-layer convolution to extract image information and avoids the influence of low-level information such as brightness, color, and texture.
Scale fusion avoids the loss of image feature information in the network's convolution and pooling layers. Compared with other FCN networks, FFCN is faster and can improve the real-time performance of image preprocessing.
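The two accuracy metrics reported in the Results can be sketched as follows. The abstract does not give formulas, so this assumes the definitions commonly used with FCN models: pixel accuracy is the fraction of correctly labelled pixels, and Mean IU is the per-class intersection over union averaged over the classes that occur. The function names and the toy label maps are hypothetical.

```python
# Sketch of pixel accuracy and Mean IU from a confusion matrix.
# Assumption: standard FCN-style definitions; not taken from the paper.

def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Correctly labelled pixels divided by all pixels."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def mean_iu(cm):
    """Average of intersection / union over classes that appear."""
    n = len(cm)
    ious = []
    for i in range(n):
        true_i = sum(cm[i])                       # pixels whose truth is i
        pred_i = sum(cm[j][i] for j in range(n))  # pixels predicted as i
        union = true_i + pred_i - cm[i][i]
        if union > 0:
            ious.append(cm[i][i] / union)
    return sum(ious) / len(ious)


# Toy 3x3 label maps with three classes.
truth = [[0, 0, 1],
         [0, 1, 1],
         [2, 2, 2]]
pred = [[0, 0, 1],
        [0, 0, 1],
        [2, 2, 1]]
cm = confusion_matrix(truth, pred, num_classes=3)
acc = pixel_accuracy(cm)   # 7 of 9 pixels are correct
miu = mean_iu(cm)          # mean of 3/4, 2/4 and 2/3
```

Mean IU penalizes errors on small classes more than pixel accuracy does, which is consistent with the reported Mean IU values being lower than the pixel accuracy values.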
Key words: indoor environment; visual prosthesis; semantic segmentation; convolutional neural network; deep learning
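The labelling scheme described in the Method, a grayscale class map plus a color table that turns it into a pseudo-color semantic label, can be sketched as below. The color values and the class-to-index assignment are hypothetical; the abstract does not list the actual XAUT color table.

```python
# Sketch of mapping a grayscale class-index label to a pseudo-color image.
# Assumption: COLOR_TABLE and the class order are invented for illustration.

COLOR_TABLE = [
    (0, 0, 0),       # 0: background
    (128, 0, 0),     # 1: hypothetically "bed"
    (0, 128, 0),     # 2: hypothetically "seat"
    (128, 128, 0),   # 3: hypothetically "lamp"
]


def to_pseudo_color(label_map, color_table):
    """Replace each class index with its (R, G, B) entry in the table."""
    colored = []
    for row in label_map:
        out_row = []
        for class_id in row:
            if not 0 <= class_id < len(color_table):
                raise ValueError(f"class id {class_id} has no color entry")
            out_row.append(color_table[class_id])
        colored.append(out_row)
    return colored


# Grayscale label: each pixel stores the class index of the item it covers.
label = [[0, 1, 1],
         [0, 2, 3]]
pseudo = to_pseudo_color(label, COLOR_TABLE)
```

Keeping the class index in the grayscale image and the colors in a separate table means the training label stays a single-channel map, while the pseudo-color view is only needed for inspection.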