Abstract: Objective In the field of computer vision, semantic segmentation is a key task for scene parsing and behavior recognition, and the emergence of deep convolutional neural networks has brought it new development opportunities. Semantic segmentation is pixel-level image understanding whose goal is to label each pixel of an image with the category it belongs to. The task of the semantic segmentation technique is to partition an image into several meaningful targets and to assign a specified semantic label to each target. This paper surveys the current mainstream image semantic segmentation methods based on deep convolutional neural networks, both domestic and international, to provide a basis for subsequent research. Method We first analyze and describe the difficulties and challenges in semantic segmentation, and summarize the common datasets and objective evaluation metrics used to assess segmentation algorithms. According to the way the networks are trained, existing methods are divided into two categories, semantic segmentation models based on supervised learning and those based on weakly supervised learning; we elaborate on both categories and analyze their respective strengths and weaknesses, and finally point out future research hotspots in this technology. Result Image semantic segmentation models based on deep convolutional neural networks have so far achieved breakthrough progress. This paper compares several supervised and weakly supervised semantic segmentation models on the PASCAL VOC 2012 dataset: among the supervised models, DeepLab v3+ performs best, with an MIoU of 89.0%; among the weakly supervised models, the method of Tang et al. performs best, with an MIoU of 74.5%. Conclusion This paper gives a detailed and fairly comprehensive account of the currently popular deep learning network models for image semantic segmentation, describes the problems and challenges of the task, introduces the common datasets and objective metrics used to evaluate segmentation algorithms, and reviews the state of the art in deep CNN-based image semantic segmentation from the supervised and weakly supervised perspectives. Finally, follow-up research directions built on the field of image semantic segmentation are proposed.
Abstract: Objective Semantic segmentation is a fundamental task in computer vision applications such as scene parsing and behavior recognition. The emergence of deep convolutional neural networks (CNNs) has brought new development opportunities to semantic segmentation. Semantic segmentation is a kind of pixel-level image understanding, whose target is to assign a semantic label to each pixel of a given image. The task of the semantic segmentation technique is to segment the image into several meaningful targets and assign a specific semantic label to each target. The difficulty of image semantic segmentation mainly comes from three aspects: target, category, and background. From the view of targets, images of the same object taken under different lighting, viewpoints, and distances, or while the object is still or moving, can differ significantly, and occlusion may also occur between adjacent objects. In terms of categories, targets of the same category can be dissimilar, while targets of different categories can be similar. From the background perspective, a simple background helps to produce accurate semantic segmentation results, but backgrounds in real scenes are complex. In this paper, we provide a systematic review of recent advances in deep convolutional neural network methods for semantic segmentation. Method In this paper, we summarize state-of-the-art image semantic segmentation methods based on deep convolutional neural networks. We first discuss the difficulties and challenges of semantic segmentation, and provide common datasets and quantitative metrics for evaluating the performance of these methods. According to the way the networks are trained, these methods are grouped into two categories, namely, supervised learning-based and weakly supervised learning-based semantic segmentation.
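Among the quantitative metrics mentioned above, mean intersection over union (MIoU) is the standard measure on benchmarks such as PASCAL VOC: for each class, the intersection of the predicted and ground-truth pixel sets is divided by their union, and the per-class ratios are averaged. A minimal NumPy sketch (the function name and toy label maps are illustrative, not from any specific library):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two per-pixel label maps.

    pred, gt: integer arrays of per-pixel class labels, same shape.
    """
    ious = []
    for c in range(num_classes):
        p = (pred == c)
        g = (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:          # class absent from both maps: skip it
            continue
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [1, 1, 1]])
gt   = np.array([[0, 0, 1], [0, 1, 1]])
# class 0: intersection 2, union 3 -> 2/3; class 1: intersection 3, union 4 -> 3/4
print(mean_iou(pred, gt, num_classes=2))
```

Skipping classes that are absent from both maps keeps them from artificially inflating the mean on images that contain only a few categories.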
Supervised semantic segmentation requires pixel-level annotations, while weakly supervised semantic segmentation aims to segment images using class labels, bounding boxes, scribbles, etc. In this paper, we divide the supervised semantic segmentation models into four groups, namely, encoder-decoder methods, feature map-based methods, probability map-based methods, and methods combining various strategies. In an encoder-decoder network, the encoder module gradually reduces the resolution of the feature maps and captures higher-level semantic information, and the decoder module gradually recovers the spatial information. Currently, most state-of-the-art deep convolutional neural networks for semantic segmentation stem from a common forerunner: the fully convolutional network (FCN), an encoder-decoder network. FCN transforms existing, well-known classification models (AlexNet, VGG (16-layer net), GoogLeNet, and ResNet) into fully convolutional ones by replacing the fully connected layers with convolutional layers, so that the network outputs spatial maps instead of classification scores. Those maps are upsampled using deconvolutions to produce dense per-pixel labeled outputs. The feature map-based methods aim to take full advantage of the context information of the feature map, including the spatial context (position) and the scale context (size), which facilitates segmenting and parsing the image. Such methods obtain spatial and scale context by enlarging the receptive field and fusing multi-scale information, which can effectively improve the performance of the network. Some models, such as the pyramid scene parsing network (PSPNet) and DeepLab v3, perform spatial pyramid pooling at several different scales (including image-level pooling) or apply several parallel atrous convolutions with different rates. These models have shown promising results by exploiting spatial and scale context.
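The atrous (dilated) convolutions mentioned above enlarge the receptive field without adding parameters by spacing the kernel taps `rate` samples apart; an ASPP-style module then runs several such branches with different rates in parallel and fuses them. A minimal 1-D NumPy sketch (real networks apply 2-D atrous convolutions to feature maps, and fuse branches with learned weights; the function and the mean fusion here are illustrative):

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """'Same'-padded 1-D convolution whose kernel taps are spaced
    `rate` samples apart; the receptive field grows to
    (len(w) - 1) * rate + 1 while the parameter count stays len(w)."""
    k = len(w)
    span = (k - 1) * rate          # extent covered by the dilated kernel
    xp = np.pad(x, span // 2)      # zero-pad so output length == input length
    out = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        taps = xp[i : i + span + 1 : rate]
        out[i] = np.dot(taps, w)
    return out

x = np.arange(8, dtype=float)
w = np.array([1.0, 1.0, 1.0])
# Parallel branches with increasing rates, as in an ASPP-style module:
branches = [atrous_conv1d(x, w, r) for r in (1, 2, 4)]
fused = np.mean(branches, axis=0)  # naive fusion, for illustration only
```

With rate 1 this reduces to an ordinary convolution; rates 2 and 4 sample the same three weights over windows of 5 and 9 inputs, which is how DeepLab-style models capture multi-scale context without downsampling.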
The probability map-based methods combine the semantic context (probability) and the spatial context (location) to post-process probability score maps and semantic label predictions, mainly through probabilistic graphical models. A probabilistic graphical model uses a graph to represent the conditional dependences between random variables; it is the combination of probability theory and graph theory. There are several kinds of probabilistic graphical models, such as conditional random fields (CRFs), Markov random fields (MRFs), and Bayesian networks. By establishing semantic relationships between pixels, the boundaries of the targets are refined and the performance of the network is improved. Typically, this family of approaches includes CRF-RNN (CRF as recurrent neural networks), the deep parsing network (DPN), EncNet, and so on. Some methods combine two or more of the above strategies to significantly improve the segmentation performance of the network, such as the global convolutional network (GCN), DeepLab v1, DeepLab v2, DeepLab v3+, and the discriminative feature network (DFN). According to the type of weak supervision used to train the network, the weakly supervised semantic segmentation models are divided into four groups, namely, class label-based, bounding box-based, scribble-based, and methods using various forms of annotations. Class-label annotations only indicate the presence of an object, and thus the substantial problem in class label-based methods is how to accurately assign image-level labels to their corresponding pixels. Generally, such problems can be solved by using a multiple instance learning (MIL)-based strategy to train models for semantic segmentation, or by adopting an alternating training procedure based on the expectation-maximization (EM) algorithm to dynamically predict semantic foreground and background pixels.
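The fully connected CRF commonly used for such post-processing (for example in the DeepLab family, following the dense CRF of Krähenbühl and Koltun) minimizes an energy over the pixel label assignment of the form:

```latex
E(\mathbf{x}) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),
\qquad
\psi_p(x_i, x_j) = \mu(x_i, x_j)\Big[
  w_1 \exp\Big(-\tfrac{\|p_i - p_j\|^2}{2\sigma_\alpha^2}
               -\tfrac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\Big)
  + w_2 \exp\Big(-\tfrac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\Big)\Big]
```

Here the unary potential $\psi_u$ comes from the network's per-pixel class scores, $\mu$ is a label compatibility function, and $p_i$ and $I_i$ are the position and color of pixel $i$: the first (appearance) kernel encourages nearby pixels with similar colors to take the same label, sharpening object boundaries, while the second (smoothness) kernel suppresses small isolated regions.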
More recent work tries to increase the quality of the object localization maps by integrating a seed region growing technique into the segmentation network, which significantly increases the pixel accuracy. Bounding box-based methods use bounding boxes and class labels as supervision information. By using region proposal methods and traditional image segmentation theory to generate candidate segmentation masks, the convolutional network is trained under the supervision of these approximate masks. BoxSup proposes a recursive training procedure, in which the convolutional network is trained under the supervision of segment object proposals and the updated network in turn improves the segmentation masks used for training. Scribble-supervised training models apply a graphical model that propagates information from the scribbles to the unmarked pixels based on spatial constraints, appearance, and semantic content, which amounts to two tasks. The first task is to propagate the class labels from the scribbles to the other pixels so as to fully annotate the image; the second task is to learn a convolutional network for semantic segmentation. Result Image semantic segmentation models based on deep convolutional neural networks have made significant progress in recent years. Among the supervised learning models, DeepLab v3+ achieves the best performance on the PASCAL VOC 2012 semantic image segmentation benchmark, with an MIoU of 89.0%. Among the weakly supervised learning models, the method of Tang et al. is optimal, with an MIoU of 74.5%. Conclusion According to the above analysis, we detail how recent CNN-based semantic segmentation methods work and analyze their strengths and limitations. Finally, we present further related research areas, including video semantic segmentation, 3D dataset semantic segmentation, real-time semantic segmentation, and instance segmentation.
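The MIL-based strategy used by the class label-based methods above treats an image as a bag of pixels: an image-level class score can be obtained by pooling the per-pixel score maps, for example with a global max, so that each image-level label only needs to be explained by at least one pixel. A minimal NumPy sketch (the shapes and toy scores are illustrative):

```python
import numpy as np

def image_level_scores(pixel_scores):
    """MIL-style aggregation: the image-level score of each class is
    the maximum of its per-pixel scores, so an image label is
    supported as soon as one (unknown) pixel fires for that class.

    pixel_scores: array of shape (num_classes, H, W).
    Returns: array of shape (num_classes,).
    """
    return pixel_scores.reshape(pixel_scores.shape[0], -1).max(axis=1)

# Toy score maps for 3 classes on a 2x2 image:
scores = np.array([
    [[0.1, 0.2], [0.0, 0.9]],   # class 0 fires strongly at one pixel
    [[0.3, 0.1], [0.2, 0.2]],
    [[0.0, 0.0], [0.1, 0.0]],
])
print(image_level_scores(scores))   # -> [0.9 0.3 0.1]
```

Training then backpropagates an image-level classification loss through this pooling, so supervision from class labels alone reaches the pixel score maps.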
Image semantic segmentation is a hot topic in the fields of computer vision and artificial intelligence, and many applications need accurate and efficient segmentation models, e.g., autonomous driving, indoor navigation, and smart healthcare. Thus, further work on semantic segmentation is needed to improve the accuracy of object boundaries and to further enhance segmentation performance.