Image semantic segmentation with manifold regularization constraints

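Xiao Zhenjiu1, Zong Jiaxu1, Lan Hai2, Wei Xian2, Tang Xiaoliang2 (1. College of Software, Liaoning Technical University, Huludao 125105; 2. Quanzhou Institute of Equipment Manufacturing, Quanzhou 362000)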

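Abstract
Objective In deep-learning-based image semantic segmentation, the loss function usually considers only the cross-entropy between the predicted and ground-truth values of individual pixels and simply sums these terms. Introducing contextual information between image pixels can effectively improve segmentation accuracy, but current methods for introducing context, such as attention mechanisms and conditional random fields, incur high computational and memory costs and therefore cannot be widely used. To address this problem, an image semantic segmentation algorithm constrained by manifold regularization is proposed. Method Starting from a residual network (ResNet) pretrained on the ImageNet dataset, DeepLabV3 is adopted as the backbone network, and the predicted segmentation image is obtained from it. The original image and the segmentation image are then divided into a number of sub-image patches of equal size. From the sub-image patches of the original image and the segmentation image, the latent geometric constraint relationship on the manifold surfaces where the input data and the prediction lie is computed. The result of the manifold constraint is used to optimize the parameters of the segmentation network. Result Adding the manifold regularization constraint captures the contextual information in the image, reduces the loss of intrinsic structure caused by the forward computation of the network, and improves accuracy. To verify the effectiveness of the proposed method, experiments are conducted on two datasets, Cityscapes and PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes). On the Cityscapes dataset, the accuracy is 78.0%, which is 0.5% higher than the original network; on the PASCAL VOC 2012 dataset, the accuracy is 69.5%, which is 2.1% higher than the original network. In addition, comparative experiments on the Cityscapes dataset verify the effectiveness of the algorithm and show that it improves semantic segmentation results. Conclusion The proposed semantic segmentation algorithm achieves good segmentation accuracy without increasing the computational complexity of the inference network, and has great practical value.
Keywords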
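The patch-division step described in the Method part can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and non-overlapping patches; the function name `split_into_patches` and the toy 4×4 image are hypothetical and are not taken from the paper, which does not specify its implementation in the abstract.

```python
def split_into_patches(image, patch_h, patch_w):
    """Split a 2D grid (list of rows) into non-overlapping patches.

    Each patch is returned flattened in row-major order. This sketch assumes
    the grid height and width are multiples of the patch size.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Collect the pixels of one patch, row by row.
            patch = [image[r][c]
                     for r in range(top, top + patch_h)
                     for c in range(left, left + patch_w)]
            patches.append(patch)
    return patches


# Toy 4x4 "image" split into four 2x2 patches (illustrative values only).
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
patches = split_into_patches(image, 2, 2)
```

The same function would be applied to both the input image and the predicted segmentation map, so that patch `i` of the input corresponds to patch `i` of the prediction.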
Image semantic segmentation based on manifold regularization constraint

Xiao Zhenjiu1, Zong Jiaxu1, Lan Hai2, Wei Xian2, Tang Xiaoliang2 (1. College of Software, Liaoning Technical University, Huludao 125105, China; 2. Quanzhou Institute of Equipment Manufacturing, Quanzhou 362000, China)

Abstract
Objective Image semantic segmentation is one of the essential problems in computer vision and image processing. It aims to assign every pixel in an image to a semantic category, that is, to produce pixel-level predictions. It is widely used in fields such as scene understanding, autonomous driving and computer-aided medical diagnosis. Its performance still suffers from challenges such as low contrast, uneven illumination and complicated scenes. The performance of semantic segmentation algorithms is mainly constrained by how well spatial context information is used, so current deep-learning-based methods focus on harnessing the context information between pixels. For instance, the attention mechanism builds an element-wise weight matrix that captures the similarity between pixels and uses it as the coefficients of a weighted sum over the input. Probabilistic graphical models, in turn, use the spatial context as a prior to enhance classification confidence. However, these methods require massive computational resources (e.g., GPU memory). This paper proposes a method for capturing contextual information based on manifold regularization. By assuming that the data in the input image and in the segmentation prediction share the same local geometric structure on a low-dimensional manifold, it shows that the relevance among pixels can be harnessed more efficiently. The resulting algorithm exploits spatial context relations from a geometric perspective and can be embedded into a deep learning framework to improve performance without increasing either the number of parameters or the inference time. Method Manifold regularization is used to capture the contextual information in the image effectively.
The DeepLab-v3 architecture, with a residual network (ResNet) as the backbone, is used to extract image features. The last two down-sampling layers of the model are removed, and dilated convolution is employed in the subsequent convolutional layers to control the resolution of the features. In conventional segmentation methods, the cost function involves only the cross-entropy between the prediction and the ground truth at each single pixel, and these terms are simply summed into the total loss without any context information. A manifold regularization penalty is designed to combine single-pixel information with the context information of the pixel's neighborhood. The geometric intuition is that the input image data have the same local geometric shape as the data in the segmentation result, which implies a correspondence between clusters of data points in the input image and those in the output. For instance, when two input data points are close in the manifold subspace, the corresponding segmentation result data points are also close, and vice versa. Furthermore, the image is divided into sub-image patches so that the constraints between pixels can be customized. Hierarchical manifold regularization constraints are obtained by dividing the image into sub-image patches of different sizes. When the patch size is at its minimum, the constraint acts essentially between pixels, and the approach behaves like other pixel-wise context-aware algorithms such as the fully connected conditional random field (CRF) model. Conversely, when the patch size is at its maximum and equals the input image size, the approach becomes a semi-supervised learning algorithm based on interconnected samples. The proposed model improves segmentation accuracy and achieves state-of-the-art performance. It is evaluated on two public datasets, Cityscapes and PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes 2012).
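The geometric intuition above (nearby input patches should yield nearby predictions) can be sketched with a classical graph-Laplacian style regularizer. The abstract does not give the paper's exact formula, so the Gaussian weight, the bandwidth `sigma`, the trade-off weight `lam`, and all function names below are assumptions made for illustration only.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manifold_penalty(input_patches, output_patches, sigma=1.0):
    """Assumed penalty: mean over all pairs (i, j) of w_ij * ||f_i - f_j||^2,
    where w_ij = exp(-||x_i - x_j||^2 / sigma^2) is large for similar inputs.
    """
    n = len(input_patches)
    total = 0.0
    for i in range(n):
        for j in range(n):
            weight = math.exp(
                -squared_distance(input_patches[i], input_patches[j]) / sigma ** 2
            )
            total += weight * squared_distance(output_patches[i], output_patches[j])
    return total / (n * n)


def total_loss(cross_entropy, penalty, lam=0.1):
    """Per-pixel cross-entropy plus the weighted context penalty (lam is hypothetical)."""
    return cross_entropy + lam * penalty


# Toy data: patches 0 and 1 are close in input space, patch 2 is far away.
x = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
f_consistent = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # similar inputs, same prediction
f_inconsistent = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # similar inputs, different prediction

penalty_consistent = manifold_penalty(x, f_consistent)
penalty_inconsistent = manifold_penalty(x, f_inconsistent)
```

In this toy case the inconsistent prediction receives the larger penalty, because the two similar input patches are assigned different outputs. The penalty only enters the training loss, which is consistent with the claim that inference cost is unchanged.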
Performance is measured by the mean intersection-over-union (mIoU) averaged across all classes. The open-source toolbox PyTorch is used to build the model, and stochastic gradient descent (SGD) is adopted as the optimizer. Data augmentation is performed by random cropping and flipping applied with a given probability. The experimental platform runs CentOS 7, with an NVIDIA RTX 2080Ti GPU and an Intel(R) Core(TM) i7-6850 CPU. Result Experiments are conducted to evaluate the effect of manifold regularization. The algorithm improves the accuracy of the segmentation model without increasing computational complexity at inference. On the PASCAL VOC 2012 dataset, manifold regularization improves the ResNet50 backbone model by 0.8% mIoU and the ResNet101 backbone model by 2.1% mIoU. These results suggest that manifold regularization performs better with a larger network model, and the results on the Cityscapes dataset support this inference: the ResNet50 model improves by 0.3% and the ResNet101 model by 0.5%. In comparison with other context aggregation methods, the proposed method achieves an mIoU of 78.0% on the Cityscapes dataset and 69.5% on the PASCAL VOC 2012 dataset. Furthermore, the segmentation results are visualized. With manifold regularization constraints, the generated segmentation results are more accurate at object edges and contain fewer errors. Conclusion This paper presents a novel algorithm that captures context information for image semantic segmentation via manifold regularization constraints. It can be integrated into a deep learning network model to improve segmentation performance without changing the network structure. The results verify that the proposed algorithm has good generalization capability in semantic segmentation.
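As an illustration of the mIoU metric used above, the following minimal sketch computes per-class intersection-over-union on flattened label lists and averages it over classes. It is a simplified assumption of how the metric is applied: benchmark implementations usually accumulate a confusion matrix over the whole dataset, and the choice to skip classes absent from both prediction and ground truth is made here only for the sketch. The function name `mean_iou` and the toy labels are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes, for flat lists of labels.

    Classes that appear in neither the prediction nor the ground truth are
    skipped (an assumption of this sketch).
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (illustrative values only).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2, so the mean is 0.5.
```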
Keywords
