Weakly supervised semantic segmentation based on dynamic mask generation
2020, Vol. 25, No. 6, pages 1190-1200
Received: 2019-10-11; Revised: 2019-11-10; Accepted: 2019-11-17; Published in print: 2020-06-16
DOI: 10.11834/jig.190458
Objective
The pixel-level annotations required by conventional image semantic segmentation are difficult to obtain in large quantities, so weakly supervised learning for image semantic segmentation is currently an important research direction. Weakly supervised learning refers to completing supervised learning with weakly labeled samples. Weak labels are faster and simpler to produce than pixel-level annotations and include points, bounding boxes, and scribbles.
Method
To address the insufficient use of multilayer features in existing methods, a weakly supervised semantic segmentation method based on dynamic mask generation is proposed. The method takes a bounding box as the initial foreground segmentation contour and iteratively extracts the edge information of the foreground object from the multilayer features of a convolutional neural network (CNN), generating a mask from this edge information. During the iteration, high-level features are first used to estimate the rough shape and position of the foreground object, yielding a coarse segmentation mask. The mask is then updated layer by layer with CNN features, as sketched below.
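The following is a minimal Python sketch of this coarse-to-fine loop, an illustration under assumptions rather than the paper's released code. The helpers fit_model and score_fn are hypothetical placeholders for the per-layer foreground model (the paper fits a Gaussian mixture model, as detailed in the Method section below).

    def refine_mask(feature_maps, init_mask, fit_model, score_fn, threshold=0.5):
        """Iteratively refine a bounding-box mask with CNN features,
        ordered from the deepest (semantic) layer to the shallowest."""
        mask = init_mask                   # boolean H x W array from the bounding box
        for feat in feature_maps:          # each feat: H x W x C, resized to image size
            model = fit_model(feat, mask)  # fit foreground model on pixels inside mask
            prob = score_fn(model, feat)   # per-pixel foreground probability map
            mask = prob > threshold        # sharper mask for the next, shallower layer
        return mask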
Result
On the Pascal VOC (visual object classes) 2012 dataset, the method achieves a segmentation accuracy of 78.06%, which is 14.71%, 4.04%, 3.10%, and 0.92% higher than that of the bounding-box supervision (BoxSup), weakly and semi-supervised (WSSL), mask ranking (SDI), and instance cut-and-paste (CaP) methods, respectively.
Conclusion
The method exploits high-level semantic features to reduce semantic-level errors in the segmentation mask, and it updates the mask with low-level features to improve the accuracy of the segmentation edges.
Objective
Image semantic segmentation is an important research topic in the field of computer vision. It refers to dividing an input image into multiple regions with semantic meaning, i.e., assigning a semantic category to each pixel in the image. Many studies on image semantic segmentation based on deep learning have recently been conducted in China and overseas, and current mainstream methods are based on supervised deep learning. However, deep learning requires a large number of training samples, and the image semantic segmentation problem requires category labeling for each pixel in the training sample. On the one hand, pixel-level labeling is difficult; on the other hand, a large number of sample labels means high manual labeling costs. Therefore, image semantic segmentation based on weak supervision has become a research focus in recent years. Weakly supervised learning uses weak labels that are faster and easier to obtain, such as points, bounding boxes, and scribbles, for training. The major difficulty in weakly supervised learning is that weakly labeled data do not contain the location and contour information required for training.
Method
To solve the problem of missing edge information in weak labels for semantic segmentation, our primary objective is to fully utilize the multilayer features extracted by a convolutional neural network (CNN). Our contributions are twofold. First, a dynamic mask generation method for extracting the edges of image foreground targets is proposed. The method uses a bounding box as the initial foreground edge contour and iteratively adjusts it with the multilayer features of a CNN and a Gaussian mixture model. The inputs of the dynamic mask generation method are the bounding box labels and the CNN feature maps. During each iteration, feature vectors from a specific feature map are normalized and used to initialize the Gaussian mixture model, whose training samples are selected in accordance with the edges generated in the previous iteration. The probability of every sample point with respect to the Gaussian mixture model is then calculated, and a fine-tuned contour is generated on the basis of these probabilities. In our dynamic mask generation process, the final iteration uses the original image features to improve edge accuracy, while high-level features are used for mask initialization to reduce semantic-level errors in the edge information.
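To make one refinement iteration concrete, here is a minimal sketch using scikit-learn's GaussianMixture. The use of separate foreground and background mixtures, the component count, and the 0.5 threshold are assumptions for illustration; the paper's exact sample-selection and thresholding rules may differ.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def refine_step(feat, prev_mask, n_components=3, thresh=0.5):
        """One dynamic-mask iteration: fit Gaussian mixtures on the feature
        vectors selected by the previous mask, then re-score every pixel.
        feat: (H, W, C) feature map resized to image resolution;
        prev_mask: (H, W) boolean mask from the previous iteration."""
        h, w, c = feat.shape
        x = feat.reshape(-1, c)
        # Normalize the feature vectors before initializing the mixtures.
        x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
        # Training samples are selected by the previously generated mask/edges.
        fg = GaussianMixture(n_components).fit(x[prev_mask.ravel()])
        bg = GaussianMixture(n_components).fit(x[~prev_mask.ravel()])
        # Turn the two log-likelihoods into a per-pixel foreground probability.
        log_fg, log_bg = fg.score_samples(x), bg.score_samples(x)
        prob = 1.0 / (1.0 + np.exp(log_bg - log_fg))
        return (prob > thresh).reshape(h, w)   # the fine-tuned mask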
Second, a weakly supervised semantic segmentation method based on dynamic mask generation is proposed. The generated dynamic mask is used as the supervision information in the semantic segmentation training process to provide feedback to the CNN. In each training step, the mask is dynamically generated in accordance with the forward-propagation result of each input image, and this mask is used instead of the traditional pixel-level annotation to compute the loss function. The semantic segmentation model is trained in an end-to-end manner. A dynamic mask is generated only during training; the test process requires only the forward propagation of the CNN.
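A training step under this scheme might look like the following PyTorch sketch. The feats_hook and generate_dynamic_mask callables are hypothetical stand-ins for capturing the multilayer features of the current forward pass and for the iterative refinement sketched above; the paper's experiments were implemented in MatConvNet, so this is an assumption-laden illustration, not the authors' code.

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, image, boxes, feats_hook, generate_dynamic_mask):
        """One end-to-end training step supervised by a dynamically generated mask."""
        model.train()
        logits = model(image)            # (N, num_classes, H, W) segmentation scores
        feats = feats_hook()             # multilayer features of this forward pass
        with torch.no_grad():            # mask generation itself is not differentiated
            mask = generate_dynamic_mask(feats, boxes)   # (N, H, W) long tensor
        # The dynamic mask replaces pixel-level ground truth in the loss.
        loss = F.cross_entropy(logits, mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

At test time none of this runs; inference is a single forward pass of the CNN.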
Result
The segmentation accuracy of our method on the Pascal visual object classes (VOC) 2012 dataset is 78.06%. Compared with existing weakly supervised semantic segmentation methods, namely, the box-supervised (BoxSup), weakly and semi-supervised learning (WSSL), simple does it (SDI), and cut-and-paste (CaP) methods, accuracy increases by 14.71%, 4.04%, 3.10%, and 0.92%, respectively. On the Berkeley deep drive (BDD100K) dataset, the segmentation accuracy of our method is 61.56%; compared with BoxSup, WSSL, SDI, and CaP, accuracy increases by 10.39%, 3.12%, 1.35%, and 2.04%, respectively. The method improves segmentation accuracy in the pedestrian, car, and traffic light categories. Improvements are also achieved in the truck and bus categories, whose foreground targets are typically large and for which simple features tend to yield unsatisfactory segmentation. After the fusion of low-, mid-, and high-level features in this study, the segmentation accuracy of such large targets improves markedly.
Conclusion
High-level features are used to estimate the approximate shape and position of the foreground object and to generate rough edges, which are then corrected layer by layer with multilayer features. High-level semantic features decrease edge errors at the semantic level, and low-level image features improve the accuracy of the edges. The training speed of our method is relatively slow because of the dynamic mask generation in each training step. However, test speed does not slow down because only the forward-propagation calculation of the CNN is required.
References
Ahn J, Cho S and Kwak S. 2019. Weakly supervised learning of instance segmentation with inter-pixel relations [EB/OL]. [2019-04-10]. https://arxiv.org/pdf/1904.05044v1.pdf
Arbeláez P, Pont-Tuset J, Barron J, Marques F and Malik J. 2014. Multiscale combinatorial grouping//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA: IEEE: 328-335 [DOI: 10.1109/CVPR.2014.49]
Bearman A, Russakovsky O, Ferrari V and Li F F. 2016. What's the point: semantic segmentation with point supervision//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer: 549-565 [DOI: 10.1007/978-3-319-46478-7_34]
Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L. 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI: 10.1109/TPAMI.2017.2699184]
Dai J F, He K M and Sun J. 2015. BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1635-1643 [DOI: 10.1109/ICCV.2015.191]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [DOI: 10.1109/CVPR.2009.5206848]
Everingham M, Van Gool L, Williams C K I, Winn J and Zisserman A. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303-338 [DOI: 10.1007/s11263-009-0275-4]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Huang Z L, Wang X G, Wang J S, Liu W Y and Wang J D. 2018. Weakly-supervised semantic segmentation network with deep seeded region growing//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: IEEE: 7014-7023 [DOI: 10.1109/CVPR.2018.00733]
Khoreva A, Benenson R, Hosang J, Hein M and Schiele B. 2017. Simple does it: weakly supervised instance and semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 1665-1674 [DOI: 10.1109/CVPR.2017.181]
Kolesnikov A and Lampert C H. 2016. Seed, expand and constrain: three principles for weakly-supervised image segmentation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam: Springer: 695-711 [DOI: 10.1007/978-3-319-46493-0_42]
Krähenbühl P and Koltun V. 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials//Advances in Neural Information Processing Systems. Granada, Spain: Springer: 109-117
Lin D, Dai J F, Jia J Y, He K M and Sun J. 2016. ScribbleSup: scribble-supervised convolutional networks for semantic segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, Nevada, USA: IEEE: 3159-3167 [DOI: 10.1109/CVPR.2016.344]
Lin G S, Milan A, Shen C H and Reid I. 2017. RefineNet: multi-path refinement networks for high-resolution semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5168-5177 [DOI: 10.1109/CVPR.2017.549]
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Maninis K K, Caelles S, Pont-Tuset J and Van Gool L. 2018. Deep extreme cut: from extreme points to object segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: IEEE: 616-625 [DOI: 10.1109/CVPR.2018.00071]
Oquab M, Bottou L, Laptev I and Sivic J. 2015. Is object localization for free? Weakly-supervised learning with convolutional neural networks//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, Massachusetts, USA: IEEE: 685-694 [DOI: 10.1109/CVPR.2015.7298668]
Papadopoulos D P, Uijlings J R R, Keller F and Ferrari V. 2017. Training object class detectors with click supervision//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 180-189 [DOI: 10.1109/CVPR.2017.27]
Papandreou G, Chen L C, Murphy K P and Yuille A L. 2015. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1742-1750 [DOI: 10.1109/ICCV.2015.203]
Pathak D, Krähenbühl P and Darrell T. 2015. Constrained convolutional neural networks for weakly supervised segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1796-1804 [DOI: 10.1109/ICCV.2015.209]
Rajchl M, Lee M C H, Oktay O, Kamnitsas K, Passerat-Palmbach J, Bai W J, Damodaram M, Rutherford M A, Hajnal J V, Kainz B and Rueckert D. 2017. DeepCut: object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging, 36(2): 674-683 [DOI: 10.1109/TMI.2016.2621185]
Remez T, Huang J and Brown M. 2018. Learning to segment via cut-and-paste//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 39-54 [DOI: 10.1007/978-3-030-01234-2_3]
Rother C, Kolmogorov V and Blake A. 2004. "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3): 309-314 [DOI: 10.1145/1186562.1015720]
Roy A and Todorovic S. 2017. Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE: 7282-7291 [DOI: 10.1109/CVPR.2017.770]
Tang M, Djelouah A, Perazzi F, Boykov Y and Schroers C. 2018a. Normalized cut loss for weakly-supervised CNN segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: IEEE: 1818-1827 [DOI: 10.1109/CVPR.2018.00195]
Tang M, Perazzi F, Djelouah A, Ayed I B, Schroers C and Boykov Y. 2018b. On regularized losses for weakly-supervised CNN segmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 524-540 [DOI: 10.1007/978-3-030-01270-0_31]
Vedaldi A and Lenc K. 2015. MatConvNet: convolutional neural networks for MATLAB//Proceedings of the 23rd ACM International Conference on Multimedia. Brisbane: ACM: 689-692 [DOI: 10.1145/2733373.2807412]
Xie S N and Tu Z W. 2015. Holistically-nested edge detection//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1395-1403 [DOI: 10.1109/ICCV.2015.164]
Yu F, Xian W Q, Chen Y Y, Liu F C, Liao M K, Madhavan V and Darrell T. 2018. BDD100K: a diverse driving video database with scalable annotation tooling [EB/OL]. [2018-05-12]. https://arxiv.org/pdf/1805.04687.pdf
Zeiler M D, Krishnan D, Taylor G W and Fergus R. 2010. Deconvolutional networks//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2528-2535 [DOI: 10.1109/CVPR.2010.5539957]