Current Issue Cover
深度卷积神经网络图像语义分割研究进展

青晨, 禹晶, 肖创柏, 段娟(北京工业大学信息学部, 北京 100124)

摘 要
在计算机视觉领域中,语义分割是场景解析和行为识别的关键任务,基于深度卷积神经网络的图像语义分割方法已经取得突破性进展。语义分割的任务是对图像中的每一个像素分配所属的类别标签,属于像素级的图像理解。目标检测仅定位目标的边界框,而语义分割需要分割出图像中的目标。本文首先分析和描述了语义分割领域存在的困难和挑战,介绍了语义分割算法性能评价的常用数据集和客观评测指标。然后,归纳和总结了现阶段主流的基于深度卷积神经网络的图像语义分割方法的国内外研究现状,依据网络训练是否需要像素级的标注图像,将现有方法分为基于监督学习的语义分割和基于弱监督学习的语义分割两类,详细阐述并分析这两类方法各自的优势和不足。本文在PASCAL VOC(pattern analysis, statistical modelling and computational learning visual object classes)2012数据集上比较了部分监督学习和弱监督学习的语义分割模型,并给出了监督学习模型和弱监督学习模型中的最优方法,以及对应的MIoU(mean intersection-over-union)。最后,指出了图像语义分割领域未来可能的热点方向。
关键词
Deep convolutional neural network for semantic image segmentation

Qing Chen, Yu Jing, Xiao Chuangbai, Duan Juan(Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)

Abstract
Semantic segmentation is a fundamental task in computer vision applications, such as scene analysis and behavior recognition. The recent years have witnessed significant progress in semantic image segmentation based on deep convolutional neural network(DCNN). Semantic segmentation is a type of pixel-level image understanding with the objective of assigning a semantic label for each pixel of a given image. Object detection only locates the bounding box of the object, while the task of semantic segmentation is to segment an image into several meaningful objects and then assign a specific semantic label to each object. The difficulty of image semantic segmentation mostly originates from three aspects: object, category, and background. From the perspective of objects, when an object is in different lighting, angle of view, and distance, or when it is still or moving, the image taken will significantly differ. Occlusion may also occur between adjacent objects. In terms of categories, objects from the same category have dissimilarities and objects from different categories have similarities. From the background perspective, a simple background helps output accurate semantic segmentation results, but the background of real scenes is complex. In this study, we provide a systematic review of recent advances in DCNN methods for semantic segmentation. In this paper, we first discuss the difficulties and challenges in semantic segmentation and provide datasets and quantitative metrics for evaluating the performance of these methods. Then, we detail how recent CNN-based semantic segmentation methods work and analyze their strengths and limitations. According to whether to use pixel-level labeled images to train the network, these methods are grouped into two categories: supervised and weakly supervised learning-based semantic segmentation. Supervised semantic segmentation requires pixel-level annotations. By contrast, weakly supervised semantic segmentation aims to segment images by class labels, bounding boxes, and scribbles. In this study, we divide supervised semantic segmentation models into four groups: encoder-decoder methods, feature map-based methods, probability map-based methods, and various strategies. In an encoder-decoder network, an encoder module gradually reduces feature maps and captures high semantic information, while a decoder module gradually recovers spatial information. At present, most state-of-the-art deep CNN for semantic segmentation originate from a common forerunner, i.e., the fully convolutional network (FCN), which is an encoder-decoder network. FCN transforms existing and well-known classification models, such as AlexNet, visual geometry group 16-layer net (VGG16), GoogLeNet, and ResNet, into fully convolutional models by replacing fully connected layers with convolutional ones to output spatial maps instead of classification scores. Such maps are upsampled using deconvolutions to produce dense per-pixel labeled outputs. A feature map-based method aims to take complete advantage of the context information of a feature map, including its spatial context (position) and scale context (size), facilitating the segmentation and parsing of an image. These methods obtain the spatial and scale contexts by increasing the receptive field and fusing multiscale information, effectively improving the performance of the network. Some models, such as the pyramid scene parsing network or Deeplab v3, perform spatial pyramid pooling at several different scales (including image-level pooling) or apply several parallel atrous convolutions with different rates. These models have presented promising results by involving the spatial and scale contexts. A probability map-based method combines the semantic context (probability) and the spatial context (location) with postprocess probability score maps and semantic label predictions primarily through the use of a probabilistic graph model. A probabilistic graph is a probabilistic model that uses a graph to present conditional dependence between random variables. It is the combination of probability and graph theories. Probabilistic graph models have several types, such as conditional random fields (CRFs), Markov random fields, and Bayesian networks. Object boundary is refined and network performance is improved by establishing semantic relationships between pixels. This family of approaches typically includes CRF-recurrent neural networks, deep parsing networks, and EncNet. Some methods combine two or more of the aforementioned strategies to significantly improve the segmentation performance of a network, such as a global convolutional network, DeepLab v1, DeepLab v2, DeepLab v3+, and a discriminative feature network. In accordance with the type of weak supervision used by a training network, weakly supervised semantic segmentation methods are divided into four groups: class label-based, bounding box-based, scribble-based, and various forms of annotations. Class-label annotations only indicate the presence of an object. Thus, the substantial problem in class label-based methods is accurately assigning image-level labels to their corresponding pixels. In general, this problem can be solved by using the multiple instance learning-based strategy to train models for semantic segmentation or adopting an alternative training procedure based on the expectation-maximization algorithm to dynamically predict semantic foreground and background pixels. A recent work attempted to increase the quality of an object localization map by integrating a seed region growing technique into the segmentation network, significantly increasing pixel accuracy. Bounding box-based methods use bounding boxes and class labels as supervision information. By using region proposal methods and the traditional image segmentation theory to generate candidate segmentation masks, a convolutional network is trained under the supervision of these approximate segmentation masks. BoxSup proposes a recursive training procedure wherein a convolutional network is trained under the supervision of segment object proposals. In turn, the updated network improves the segmentation mask used for training. Scribble-supervised training methods apply a graphical model to propagate information from scribbles to unmarked pixels on the basis of spatial constraints, appearance, and semantic content, accounting for two tasks. The first task is to propagate the class labels from scribbles to other pixels and fully annotate an image. The second task is to learn a convolutional network for semantic segmentation. We compare some semantic segmentation methods of supervised learning and weakly supervised learning on the PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) 2012 dataset. We also give the optimal methods of supervised learning methools and wedakly supervised learning methods, and the corresponding MIoU(mean intersection-over-union). Lastly, we present related research areas, including video semantic segmentation, 3D dataset semantic segmentation, real-time semantic segmentation, and instance segmentation. Image semantic segmentation is a popular topic in the fields of computer vision and artificial intelligence. Many applications require accurate and efficient segmentation models, e.g., autonomous driving, indoor navigation, and smart medicine. Thus, further work should be conducted on semantic segmentation to improve the accuracy of object boundaries and the performance of semantic segmentation.
Keywords

订阅号|日报