Semantic image segmentation based on semi-supervised adversarial learning

Li Zhixin; Zhang Jia; Wu Jingli; Ma Huifang

Published: 2022-07-13
DOI: 10.11834/jig.200600
2022 | Volume 27 | Number 7

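Li Zhixin¹, Zhang Jia¹, Wu Jingli¹, Ma Huifang² (1. Guangxi Key Laboratory of Multi-source Information Mining and Security, Guangxi Normal University, Guilin 541004, China; 2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)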

Abstract

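Objective Applying semi-supervised adversarial learning to semantic image segmentation can effectively reduce the number of manually produced labels required during training. The convolution operators of the segmentation network that serves as the generator have only local receptive fields, so long-range dependencies between different image regions can only be modeled by stacking multiple convolutional layers or by enlarging the convolution kernels, which in turn sacrifices the computational efficiency gained from local convolutional structures. Another challenge in generative adversarial networks (GANs) is controlling the performance of the discriminator. In high-dimensional spaces, the density ratio estimation performed by the discriminator is often inaccurate and unstable. To address these issues, this paper proposes a semi-supervised adversarial learning method for semantic image segmentation. Method Two self-attention layers are added to the segmentation network of the GAN to model semantic dependencies in the spatial dimension. The self-attention module selectively aggregates features at each position through a weighted sum of the features at all positions. It can therefore handle relationships between widely separated spatial regions of the input image effectively, on the basis of pixel-level ground-truth labels. To address the stability of the proposed semi-supervised adversarial learning method, spectral normalization is applied to the discriminator of the adversarial network during training. This weight normalization method not only stabilizes the training of the discriminator network, but also achieves satisfactory performance without intensive tuning of its only hyperparameter; it is simple to implement and computationally cheap. Even without complementary regularization techniques, spectral normalization improves the quality of generated images more than weight normalization and gradient penalty. Result Experiments on the Cityscapes dataset and the PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes) dataset compare the proposed method with nine other methods. On Cityscapes, performance improves by 2.3% to 3.2% over the baseline model. On PASCAL VOC 2012, performance improves by 1.4% to 2.5% over the baseline model. Ablation experiments on PASCAL VOC 2012 further demonstrate the effectiveness of the method. Conclusion The proposed semi-supervised adversarial learning method for semantic segmentation captures dependencies between pixels on the feature map through the introduced self-attention mechanism and applies spectral normalization to improve the stability of the GAN, showing good robustness and effectiveness.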
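The weighted-sum aggregation performed by a self-attention module can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the query, key, and value projections are assumed to be identities (real modules typically learn them), and the residual scale `gamma` is an assumed parameter. Pure Python is used for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attention(features, gamma=1.0):
    """Aggregate features at every position from all positions.

    features: list of N position vectors, each a list of C floats
              (a flattened H*W feature map).
    gamma:    assumed residual scale; 0.0 returns the input unchanged.

    Assumption: query, key, and value are the raw features themselves;
    learned projections are omitted to keep the sketch short.
    """
    out = []
    for query in features:
        # Attention weights of this position over all positions.
        weights = softmax([dot(query, key) for key in features])
        # Weighted sum of the features at all positions.
        aggregated = [
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(len(query))
        ]
        # Residual connection: original feature plus scaled context.
        out.append([x + gamma * a for x, a in zip(query, aggregated)])
    return out
```

Because every position attends to every other position, two distant regions interact in a single step, at a cost that grows quadratically with the number of positions.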

Keywords

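semi-supervised learning; convolutional neural network (CNN); semantic image segmentation; generative adversarial network (GAN); self-attention mechanism; spectral normalization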

Semi-supervised adversarial learning based semantic image segmentation

Li Zhixin¹, Zhang Jia¹, Wu Jingli¹, Ma Huifang² (1. Guangxi Key Laboratory of Multi-source Information Mining and Security, Guangxi Normal University, Guilin 541004, China; 2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)

Abstract

Objective Training deep learning models depends on labeled data, yet pixel-level annotations for semantic segmentation are labor-intensive to obtain. In addition, the convolution operators of the segmentation network that serves as the generator have only local receptive fields: each convolution kernel is small, and each convolution operation covers only a small neighborhood of pixels. After multiple convolution and pooling layers, the height and width of the feature map shrink dramatically. The deeper the layer, the larger the region of the original image that a convolution kernel maps back to, which makes it difficult to capture long-range feature relationships. It is also difficult for an optimization algorithm to coordinate multiple convolutional layers so that their parameters capture these dependencies. Therefore, long-range dependencies between different regions of the image can only be modeled by stacking multiple convolutional layers or by enlarging the convolution kernels, but doing so sacrifices the computational efficiency of local convolutional structures. Another challenge for generative adversarial networks (GANs) is controlling the performance of the discriminator. Training the discriminator is equivalent to training an estimator of the density ratio between the generated distribution and the target distribution. In high-dimensional spaces, this discriminator-based density ratio estimation is often inaccurate and unstable. The better the discriminator is trained, the more severely the gradient returned to the generator vanishes, and training stalls once the gradient disappears completely. Conventional methods require the parameter matrix of the discriminator to satisfy a Lipschitz constraint, but they enforce it too coarsely.
Such methods limit each element of the parameter matrix so that it does not exceed a certain value. Although this guarantees the Lipschitz constraint, it disturbs the structure of the whole parameter matrix because the proportional relationships between the parameters change. Method Applying semi-supervised adversarial learning to semantic image segmentation can effectively reduce the number of manually produced labels needed during training. The segmentation network serves as the generator of the GAN and outputs a semantic label probability map for a given image, so that its output is spatially as close as possible to the ground-truth label map. A fully convolutional neural network (CNN) serves as the discriminator. During semi-supervised training, the discriminator distinguishes the ground-truth label map from the class probability map predicted by the segmentation network. The discriminator generates a confidence map that serves as a supervision signal to guide the cross-entropy loss. The confidence map indicates the regions where the predicted distribution is close to the ground-truth label distribution, and a masked cross-entropy loss then lets the segmentation network trust and train on these credible predictions. This approach is similar to a probabilistic graphical model. The network adds no computational load at test time, because no redundant post-processing module is required and the discriminator is not needed during inference. We add two self-attention layers to the segmentation network of the GAN to model semantic dependencies in the spatial dimension. Through this attention module, the segmentation network acting as the generator can precisely coordinate the fine details at each pixel position of the feature map with the fine details in distant parts of the image.
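The confidence-map masking described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name is hypothetical, the `threshold` value of 0.2 is an assumed placeholder, and pseudo-labels are assumed to be the argmax class of each trusted pixel.

```python
import math


def masked_cross_entropy(probs, confidence, threshold=0.2):
    """Cross-entropy restricted to pixels the discriminator trusts.

    probs:      per-pixel class probabilities from the segmentation
                network, one list of class probabilities per pixel.
    confidence: per-pixel discriminator confidence in [0, 1].
    threshold:  assumed cut-off; pixels at or below it are masked out.

    Returns the mean loss over trusted pixels, or 0.0 if none is trusted.
    """
    total, count = 0.0, 0
    for p, conf in zip(probs, confidence):
        if conf > threshold:
            # Assumption: the pseudo-label is the most probable class.
            pseudo_label = max(range(len(p)), key=p.__getitem__)
            # Clamp to avoid log(0) on degenerate predictions.
            total += -math.log(max(p[pseudo_label], 1e-12))
            count += 1
    return total / count if count else 0.0
```

Only the segmentation network receives this loss; the discriminator merely supplies the mask, which is why it can be dropped at inference time.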
The self-attention module selectively aggregates the features at each position through a weighted sum of the features at all positions. Therefore, relationships between widely separated spatial regions of the input image can be handled effectively on the basis of pixel-level ground-truth data. This achieves a good balance between long-range dependency modeling capability and computational efficiency. We apply spectral normalization to the discriminator of the adversarial network during training. This method introduces a Lipschitz continuity constraint through the spectral norm of the parameter matrix of each layer of the neural network. This makes the network less sensitive to perturbations of the input image and makes training more stable and easier to converge. It is a more refined way to make the discriminator satisfy Lipschitz continuity: it limits how sharply the function can change and makes the model more stable. This weight normalization method not only stabilizes the training of the discriminator network, but also achieves satisfactory performance without intensive tuning of its only hyperparameter, and it is easy to implement and computationally cheap. When spectral normalization is applied to GANs for semantic segmentation, the generated examples are also more diverse than with conventional weight normalization. Even without complementary regularization techniques, spectral normalization improves the quality of the generated images more than weight normalization and gradient penalty. Result We compare our method with nine recent methods on the Cityscapes dataset and the PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes) dataset. On Cityscapes, performance improves by 2.3% to 3.2% over the baseline model.
On PASCAL VOC 2012, performance improves by 1.4% to 2.5% over the baseline model. An ablation experiment is also conducted on the PASCAL VOC 2012 dataset. Conclusion The proposed semi-supervised adversarial learning method for semantic segmentation uses the introduced self-attention mechanism to capture dependencies between pixels on the feature map and applies spectral normalization to stabilize the GAN, showing good robustness and effectiveness.
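Spectral normalization, as applied to the discriminator above, divides a layer's weight matrix by its largest singular value so that the layer is approximately 1-Lipschitz. The sketch below estimates that value with power iteration. It is an illustration under assumptions, not the authors' implementation: the function names and the iteration count are hypothetical, the start vector is fixed rather than random, and the matrix is assumed to be nonzero.

```python
import math


def matvec(W, v):
    """Compute W @ v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matvec_t(W, u):
    """Compute W^T @ u."""
    return [sum(W[i][j] * u[i] for i in range(len(W)))
            for j in range(len(W[0]))]


def normalize(v):
    """Scale a nonzero vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def spectral_norm(W, iters=50):
    """Estimate the largest singular value of W by power iteration.

    Assumption: a fixed all-ones start vector; practical code usually
    starts from a random vector and keeps it across training steps.
    """
    v = normalize([1.0] * len(W[0]))
    for _ in range(iters):
        u = normalize(matvec(W, v))
        v = normalize(matvec_t(W, u))
    # With v close to the top right-singular vector, ||W v|| ~ sigma.
    return math.sqrt(sum(x * x for x in matvec(W, v)))


def spectrally_normalize(W, iters=50):
    """Return W / sigma(W), whose spectral norm is approximately 1."""
    sigma = spectral_norm(W, iters)
    return [[w / sigma for w in row] for row in W]
```

Unlike limiting each element separately, dividing by a single scalar preserves the proportional relationships between the parameters, which is the property the abstract emphasizes.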

Keywords

semi-supervised learning; convolutional neural network (CNN); semantic image segmentation; generative adversarial network (GAN); self-attention mechanism; spectral normalization
