Semi-supervised adversarial learning based semantic image segmentation
2022, Vol. 27, No. 7, pp. 2157-2170
Received: 2020-10-10; Revised: 2021-03-15; Accepted: 2021-03-22; Published in print: 2022-07-16
DOI: 10.11834/jig.200600
Objective
Applying semi-supervised adversarial learning to semantic image segmentation effectively reduces the number of manually produced annotations needed during training. However, the convolution operators of the segmentation network, which serves as the generator, have only local receptive fields, so long-range dependencies between different regions of an image can be modeled only by stacking many convolutional layers or enlarging the convolution kernels, which sacrifices the computational efficiency gained from local convolutional structures. A further challenge in generative adversarial networks (GANs) is controlling the performance of the discriminator: in high-dimensional spaces, the density-ratio estimation performed by the discriminator is usually inaccurate and unstable. To address these issues, this paper proposes a semi-supervised adversarial learning method for semantic image segmentation.
Method
Two self-attention layers are appended to the segmentation network of the GAN to model semantic dependencies in the spatial dimension. The self-attention module selectively aggregates features at each position by computing a weighted sum of the features at all positions, and can therefore, on the basis of pixel-level ground-truth data, effectively capture the relationships between widely separated spatial regions of the input image. Meanwhile, to stabilize the proposed semi-supervised adversarial training, spectral normalization is applied to the discriminator during training. This weight-normalization technique not only stabilizes the training of the discriminator network, but also achieves satisfactory performance without intensive tuning of its single hyperparameter; it is simple to implement and computationally cheap. Even in the absence of complementary regularization techniques, spectral normalization improves the quality of generated images better than weight normalization and gradient penalties.
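The weighted-sum aggregation performed by the self-attention module can be sketched in a few lines. This is an illustrative single-head NumPy sketch, not the paper's implementation; the projection matrices `wq`, `wk`, `wv` and the scaled-dot-product form are assumptions in the style of standard self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """x: (N, C) flattened feature map: N spatial positions, C channels."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))  # (N, N) weights over all positions
    return attn @ v  # each position aggregates a weighted sum over every position

rng = np.random.default_rng(0)
N, C, D = 16, 8, 4  # 16 positions, 8 channels, 4-dim projections
x = rng.standard_normal((N, C))
wq, wk, wv = (rng.standard_normal((C, D)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
assert out.shape == (N, D)
```

Because the attention matrix relates every position to every other position directly, a single such layer models dependencies that would otherwise require many stacked convolutions.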
Result
Experiments on the Cityscapes dataset and the PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes) dataset compare the proposed method with nine other methods. On Cityscapes, performance improves by 2.3%-3.2% over the baseline model; on PASCAL VOC 2012, it improves by 1.4%-2.5%. Ablation experiments on PASCAL VOC 2012 further demonstrate the effectiveness of the proposed method.
Conclusion
The proposed semi-supervised adversarial semantic segmentation method captures the dependencies among pixels on the feature map through the introduced self-attention mechanism and enhances the stability of the generative adversarial network with spectral normalization, demonstrating good robustness and effectiveness.
Objective
Training deep networks depends on labeled data, and obtaining pixel-level annotations for semantic segmentation is labor-intensive. Applying semi-supervised adversarial learning reduces the amount of manual labeling required, but two difficulties remain. First, the convolution operators of the segmentation network, which acts as the generator, have only local receptive fields: each kernel is small, and each convolution covers only a tiny neighborhood around a pixel. The deeper the layer, the larger the region of the original image its kernel maps back to, but the height and width of deep feature maps shrink drastically under repeated convolution and pooling, which makes long-range feature relationships difficult to capture; coordinating many convolutional layers so that optimization learns these dependencies precisely is itself a challenge. Long-range dependencies between different regions of an image can therefore be modeled only by stacking convolutional layers or enlarging the kernels, but this approach sacrifices the computational efficiency of the local convolutional structure. Second, another challenge in generative adversarial networks (GANs) is controlling the discriminator. Training the discriminator amounts to training an estimator of the density ratio between the generated distribution and the target distribution, and in high-dimensional spaces this estimate is usually inaccurate and unstable. The better the discriminator is trained, the more severely the gradient returned to the generator vanishes, and training halts once the gradient disappears completely. A traditional remedy requires the discriminator's parameter matrices to satisfy a Lipschitz constraint by clipping each entry so that it does not exceed a certain value. Although this guarantees the Lipschitz constraint, it is too coarse: it changes the proportional relationships between parameters and thereby disturbs the structure of the entire parameter matrix.
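To make the receptive-field argument concrete, a small calculation (illustrative, not from the paper) shows how slowly the receptive field of stacked local convolutions grows:

```python
def receptive_field(layers):
    """Receptive field (in pixels) of a stack of 3x3, stride-1 convolutions."""
    rf, jump = 1, 1          # start from a single pixel; stride 1 keeps jump = 1
    for _ in range(layers):
        rf += (3 - 1) * jump  # each layer adds (kernel_size - 1) * jump pixels
    return rf

# Growth is only linear: relating pixels 129 apart needs 64 stacked layers,
# while one self-attention layer relates any two positions directly.
assert receptive_field(1) == 3
assert receptive_field(64) == 129
```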
Method
Applying semi-supervised adversarial learning to semantic image segmentation effectively reduces the number of manually generated labels needed during training. Our segmentation network serves as the generator of the GAN and outputs a semantic label probability map for the target image, so its output can approach the ground-truth label map spatially. A fully convolutional neural network is used as the discriminator. During semi-supervised training, the discriminator distinguishes the ground-truth label map from the class probability map predicted by the segmentation network, and it generates a confidence map that serves as a supervision signal to guide a cross-entropy loss. The confidence map reveals which regions of the predicted distribution are close to the ground-truth label distribution; a masked cross-entropy loss then lets the segmentation network trust and train on these credible predictions. The scheme resembles a probabilistic graphical model, yet adds no computational load, because no redundant post-processing module appears in the test phase and the discriminator is not needed at inference. We append two self-attention layers to the segmentation network of the GAN to model semantic dependencies in the spatial dimension. Through this attention module, the generator can precisely coordinate the fine details at each position of the feature map with the fine details in distant parts of the image. The self-attention module selectively aggregates features at each position via a weighted sum of the features at all positions, so the relationships between widely separated spatial regions of the input image can be processed effectively on the basis of pixel-level ground-truth data, striking a good balance between long-range dependency modeling and computational efficiency. We further apply spectral normalization to the discriminator of the adversarial network during training. This method introduces a Lipschitz continuity constraint through the spectral norm of each layer's parameter matrix, making the network robust to perturbations of the input image and the training process more stable and easier to converge. It is a finer-grained way to make the discriminator satisfy the Lipschitz constraint, limiting how violently the function can change and making the model more stable. This weight normalization not only stabilizes the training of the discriminator network, but also obtains satisfactory performance without intensive tuning of its single hyperparameter, is easy to implement, and requires little computation. When spectral normalization is applied to the adversarial network on semantic segmentation tasks, the generated results are also more diverse than with traditional weight normalization. In the absence of complementary regularization techniques, spectral normalization can even improve the quality of the generated image better than weight normalization and gradient penalties.
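The core of spectral normalization is dividing each weight matrix by its largest singular value, estimated cheaply by power iteration. The following is a minimal NumPy sketch under that description; the matrix shape and `n_iter` are illustrative choices (in practice, a single power-iteration step per training update is typically carried over between updates).

```python
import numpy as np

def spectral_normalize(w, n_iter=100):
    """Divide w by its spectral norm, estimated via power iteration."""
    u = np.ones(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v       # estimate of the largest singular value
    return w / sigma

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 32))
w_sn = spectral_normalize(w)
# After normalization the layer is approximately 1-Lipschitz:
assert abs(np.linalg.svd(w_sn, compute_uv=False)[0] - 1.0) < 1e-2
```

Unlike clipping individual entries, this rescales the whole matrix uniformly, so the proportional relationships between parameters, and hence the structure of the matrix, are preserved.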
Result
Our experiments compare against nine recent methods on the Cityscapes dataset and the pattern analysis, statistical modeling and computational learning visual object classes (PASCAL VOC 2012) dataset. On Cityscapes, performance improves by 2.3% to 3.2% over the baseline model; on PASCAL VOC 2012, it improves by 1.4% to 2.5%. An ablation experiment is also conducted on the PASCAL VOC 2012 dataset.
Conclusion
The proposed semi-supervised adversarial learning method for semantic segmentation uses the introduced self-attention mechanism to capture the dependencies among pixels on the feature map and applies spectral normalization to stabilize the generative adversarial network, demonstrating good robustness and effectiveness.
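The confidence-guided masked cross-entropy described in the method can be sketched as follows. This is a minimal NumPy sketch; the threshold `t`, the array shapes, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_cross_entropy(prob, pseudo_label, confidence, t=0.2):
    """Cross-entropy over pixels whose discriminator confidence exceeds t.

    prob:         (H, W, K) class probabilities from the segmentation network
    pseudo_label: (H, W)    argmax pseudo labels for unlabeled images
    confidence:   (H, W)    discriminator confidence map in [0, 1]
    """
    mask = confidence > t  # trust only regions judged close to the label distribution
    if not mask.any():
        return 0.0
    picked = np.take_along_axis(prob, pseudo_label[..., None], axis=-1)[..., 0]
    return float(-np.log(picked[mask] + 1e-8).mean())

rng = np.random.default_rng(2)
H, W, K = 4, 4, 3
logits = rng.standard_normal((H, W, K))
prob = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
pseudo = prob.argmax(-1)
conf = rng.uniform(size=(H, W))
loss = masked_cross_entropy(prob, pseudo, conf)
assert loss > 0.0
```

Masking by the confidence map is what lets unlabeled images contribute a supervision signal without propagating the discriminator's uncertain judgments into the segmentation network.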