Image semantic segmentation based on manifold regularization constraint
2022, Vol. 27, No. 4, pp. 1204-1215
Received: 2020-09-08; Revised: 2020-12-08; Accepted: 2020-12-15; Published in print: 2022-04-16
DOI: 10.11834/jig.200527
Objective
In deep-learning-based image semantic segmentation, the loss function usually considers only the cross-entropy between the predicted value and the ground truth at each individual pixel and simply sums these terms. Introducing contextual information between image pixels can effectively improve segmentation accuracy, but current context-modeling methods such as attention mechanisms and conditional random fields incur high computational and memory costs and therefore cannot be widely applied. To address this problem, an image semantic segmentation algorithm constrained by manifold regularization is proposed.
Method
Based on a residual network (ResNet) pretrained on ImageNet, DeepLabV3 is adopted as the backbone network to obtain the predicted segmentation map. The original image and the segmentation map are then divided into a number of sub-image patches of equal size. From the corresponding patches of the original image and the segmentation map, the latent geometric constraint relation on the manifold surface where the input data and the prediction reside is computed, and the resulting manifold constraint is used to optimize the parameters of the segmentation network.
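Written out, a plausible form of the resulting objective is sketched below, assuming the manifold constraint takes the classical graph-Laplacian form; the Gaussian affinity W_ij, bandwidth sigma, and trade-off weight lambda are illustrative symbols rather than the paper's exact formulation.

```latex
% Hypothetical training objective: per-pixel cross-entropy plus a
% Laplacian-style manifold penalty over corresponding sub-image patches,
% where x_i denotes an input-image patch and f_i the corresponding
% patch of the predicted segmentation.
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}
  \;+\; \lambda \sum_{i,j} W_{ij}\,\bigl\lVert f_i - f_j \bigr\rVert_2^{2},
\qquad
W_{ij} \;=\; \exp\!\Bigl(-\tfrac{\lVert x_i - x_j \rVert_2^{2}}{2\sigma^{2}}\Bigr).
```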
Result
By adding the manifold regularization constraint, contextual information in the image is captured, the loss of intrinsic structure caused by the forward pass of the network is reduced, and accuracy is improved. To verify the effectiveness of the proposed method, experiments are conducted on two datasets, Cityscapes and PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes). On Cityscapes the accuracy is 78.0%, an improvement of 0.5% over the original network; on PASCAL VOC 2012 the accuracy is 69.5%, an improvement of 2.1% over the original network. Comparative experiments on Cityscapes further verify the effectiveness of the algorithm and show that it improves the quality of semantic segmentation.
Conclusion
The proposed semantic segmentation algorithm achieves good segmentation accuracy without increasing the computational complexity of the inference network, and therefore has considerable practical value.
Objective
Image semantic segmentation is one of the essential problems in computer vision and image processing. It aims to assign every pixel in an image to a semantic category, i.e., to make pixel-level predictions, and has been widely applied in fields such as scene understanding, autonomous driving, and computer-aided medical diagnosis. Competitive performance still suffers from challenges such as low contrast, uneven luminance, and complicated scenes, and the accuracy of semantic segmentation algorithms is largely constrained by how well spatial context information is exploited. Current deep-learning-based methods therefore focus on harnessing the contextual relations between pixels. For instance, the attention mechanism builds an element-wise weight matrix that captures the similarity between pixels and uses it as coefficients to aggregate the input features, while probabilistic graphical models exploit the spatial context as a prior to enhance classification confidence. However, these methodologies require massive computational resources (e.g., GPU memory). This paper presents a contextual-information capturing method based on manifold regularization. By assuming that the input image and the segmentation prediction share the same local geometric structure on a low-dimensional manifold, the relevance among pixels can be harnessed in a more efficient way. Accordingly, a novel algorithm based on manifold regularization is proposed to exploit the spatial context from a geometric perspective; it can be embedded into a deep learning framework to improve performance without increasing either the parameter count or the inference time.
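To illustrate the memory cost mentioned above, the following is a minimal sketch (not the authors' code) of pixel-wise self-attention in the non-local style: the pairwise similarity matrix has shape (H·W) × (H·W), which is what makes such context aggregation expensive at high resolution. Tensor names and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def pixelwise_attention(feat):
    """Toy non-local style attention over a feature map of shape (B, C, H, W).

    The (H*W) x (H*W) similarity matrix is the source of the large memory
    footprint referred to in the text.
    """
    b, c, h, w = feat.shape
    x = feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
    sim = torch.matmul(x, x.transpose(1, 2))     # (B, H*W, H*W) pairwise similarities
    attn = F.softmax(sim / c ** 0.5, dim=-1)     # normalized attention weights
    out = torch.matmul(attn, x)                  # context-aggregated features
    return out.transpose(1, 2).reshape(b, c, h, w)

# For a 1024x2048 Cityscapes-sized image downsampled by 8, H*W = 128*256,
# so the similarity matrix alone holds (128*256)^2 ≈ 1.07e9 entries per image.
```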
Method
Contextual information in the image can be effectively captured by manifold regularization. The DeepLab-v3 architecture, which uses a residual network (ResNet) as the backbone, is employed to extract image features. The last two down-sampling stages of the backbone are removed, and dilated convolution is applied in the subsequent convolutional layers to control the resolution of the features.
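As a concrete illustration of removing the last down-sampling stages, torchvision's ResNet can replace the strides of its last two stages with dilation; this is a hedged sketch of one common way to build such a backbone, not necessarily the authors' exact configuration.

```python
import torch
from torchvision import models

# Replace the strides of the last two ResNet stages with dilation so the
# output stride becomes 8 instead of 32 (a common DeepLab-v3 setup; the
# configuration used in the paper may differ). In practice the backbone
# would be initialized with ImageNet-pretrained weights.
backbone = models.resnet101(replace_stride_with_dilation=[False, True, True])

features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
x = torch.randn(1, 3, 513, 513)
print(features(x).shape)  # torch.Size([1, 2048, 65, 65]): spatial size /8, not /32
```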
In conventional segmentation methods, the loss function involves only the cross-entropy between the prediction and the ground truth at each single pixel, summed into the total loss without any context information. Here, a manifold regularization penalty is designed that integrates single-pixel information with the contextual information of its neighborhood. The geometric intuition is that the input image data have the same local geometric shape as the segmented result, which establishes correspondences between clusters of data points in the input image and clusters of data points in the output: when two input data points are close in the manifold sub-space, the corresponding points in the segmentation result should also be close, and vice versa.
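A minimal sketch of such a penalty follows, assuming it takes the classical graph-Laplacian form of manifold regularization; the Gaussian affinity, bandwidth sigma, trade-off lam, ignore label 255, and the patch-vector inputs are illustrative choices rather than the paper's released formulation.

```python
import torch
import torch.nn.functional as F

def manifold_regularizer(x_patches, p_patches, sigma=1.0):
    """x_patches: (N, Dx) flattened input-image patches,
       p_patches: (N, Dp) flattened predicted-probability patches.

    Builds a Gaussian affinity W from the input patches and penalizes
    predictions whose patches differ where the inputs are similar,
    i.e. a Laplacian-style term sum_ij W_ij * ||p_i - p_j||^2."""
    dist2 = torch.cdist(x_patches, x_patches).pow(2)    # (N, N) input distances
    w = torch.exp(-dist2 / (2 * sigma ** 2))            # affinity from inputs
    pdist2 = torch.cdist(p_patches, p_patches).pow(2)   # (N, N) prediction distances
    return (w * pdist2).sum() / x_patches.shape[0] ** 2

def total_loss(logits, target, x_patches, p_patches, lam=0.1):
    # Per-pixel cross-entropy plus the manifold term (255 is the usual
    # ignore label on Cityscapes / PASCAL VOC; an assumption here).
    ce = F.cross_entropy(logits, target, ignore_index=255)
    return ce + lam * manifold_regularizer(x_patches, p_patches)
```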
Furthermore, the image is divided into sub-image patches to capture the relationships between regions and to customize the constraints between pixels. Hierarchical manifold regularization constraints are obtained by dividing the image into patches of different sizes. When the patch size is minimized, the constraint essentially acts between individual pixels and the approach behaves like other pixel-wise context-aware algorithms such as the fully connected conditional random field (CRF) model; on the contrary, when the patch size is maximized to the size of the input image, the approach becomes a semi-supervised learning algorithm over interconnected samples.
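A minimal sketch of extracting equal-size patches from both the input image and the softmax prediction with F.unfold is given below; the patch size, the non-overlapping stride, and the tensor names are illustrative assumptions. The resulting rows can feed the regularizer sketched above.

```python
import torch
import torch.nn.functional as F

def to_patches(t, patch=32):
    """Split a (B, C, H, W) tensor into non-overlapping patch x patch blocks
    and flatten each block into a row vector of length C*patch*patch."""
    cols = F.unfold(t, kernel_size=patch, stride=patch)      # (B, C*patch*patch, L)
    return cols.transpose(1, 2).reshape(-1, cols.shape[1])   # (B*L, C*patch*patch)

image = torch.randn(2, 3, 256, 256)
probs = torch.softmax(torch.randn(2, 21, 256, 256), dim=1)
x_patches = to_patches(image)   # (2*64, 3*32*32) rows of input patches
p_patches = to_patches(probs)   # (2*64, 21*32*32) rows of prediction patches
```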
The resulting model improves segmentation accuracy and achieves state-of-the-art performance. It is evaluated on two public datasets, Cityscapes and PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes 2012). Performance is measured by the mean intersection-over-union (mIoU) averaged over all classes. The model is built with the open-source toolbox PyTorch, and stochastic gradient descent (SGD) is adopted as the optimizer. In addition, data augmentation is performed by random cropping and random flipping according to preset probabilities. The experimental platform runs CentOS 7 with an NVIDIA RTX 2080Ti GPU and an Intel(R) Core(TM) i7-6850 CPU.
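For reference, a minimal sketch of the mIoU metric computed from a confusion matrix (the class count and the ignore label of 255 are illustrative conventions, not values taken from the paper):

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """pred, gt: integer label maps of the same shape.
    Returns per-class IoU and the mean over classes that appear."""
    mask = gt != ignore_label
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    conf = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)                                  # true positives per class
    union = conf.sum(0) + conf.sum(1) - inter              # TP + FP + FN per class
    iou = inter / np.maximum(union, 1)
    return iou, iou[union > 0].mean()
```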
Result
The experiments evaluate the effect of the manifold regularization. The algorithm improves the accuracy of the segmentation model without increasing computational complexity during inference. On the PASCAL VOC 2012 dataset, the model with a ResNet50 backbone improves by 0.8% in mIoU when manifold regularization is adopted, while the ResNet101-backbone model gains 2.1%. These results indicate that manifold regularization yields larger gains with larger network models, and the results on the Cityscapes dataset support this inference: the ResNet50 model increases by 0.3% while the ResNet101 model increases by 0.5%. Compared with other context-aggregation methods, the proposed method achieves an mIoU of 78.0% on Cityscapes and 69.5% on PASCAL VOC 2012. Furthermore, the segmentation results are visualized; with the manifold regularization constraint, the generated segmentations are more accurate at object edges and contain fewer errors.
Conclusion
This paper presents a novel algorithm that introduces contextual information into image semantic segmentation through manifold regularization constraints, which can be incorporated into a deep learning model to improve segmentation performance without changing the network structure. The results verify that the proposed algorithm has good generalization capability in semantic segmentation.