A Survey of Weakly Supervised Semantic Segmentation Methods Based on Deep Learning

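Xiang Weikang1, Zhou Quan1, Mo Zhiyi2, Cui Jingcheng3, Wu Xiaofu3, Ou Weihua4, Wang Jingdong5, Liu Wenyu6 (1. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications; Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University; 2. Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University; 3. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications; 4. School of Big Data and Computer Science, Guizhou Normal University; 5. Baidu; 6. School of Electronic Information and Communications, Huazhong University of Science and Technology)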

Abstract
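Semantic segmentation is a fundamental task in computer vision that assigns a semantic category label to every pixel, yielding a pixel-level understanding of the image. In recent years, driven by advances in deep learning, fully supervised semantic segmentation methods have made great progress. However, these methods typically require large amounts of training data with pixel-level annotations, and the heavy annotation cost limits their use in practical scenarios such as autonomous driving, medical image analysis, and industrial control. To reduce annotation cost and broaden the application scenarios of semantic segmentation, researchers are paying increasing attention to weakly supervised semantic segmentation based on deep learning, which aims to produce pixel-level segmentation predictions from weak annotations such as image-level, bounding-box, scribble, and point annotations. This paper first briefly introduces the semantic segmentation task and analyzes the difficulties faced by fully supervised semantic segmentation, which motivates weakly supervised semantic segmentation. It then introduces the relevant datasets and evaluation metrics. Next, according to the type of weak annotation and the degree of attention each has received, it reviews and discusses research progress from three perspectives: image-level annotations, other weak annotations, and assistance from large models. The second category covers weakly supervised semantic segmentation based on bounding-box, scribble, and point annotations. Finally, the paper analyzes the open problems and challenges in the field and offers suggestions on possible future research directions, with the aim of further advancing research on weakly supervised semantic segmentation.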
Keywords
A survey on weakly supervised semantic segmentation based on deep learning

(1. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications; 2. School of Big Data and Computer Science, Guizhou Normal University; 3. Baidu; 4. School of Electronic Information and Communications, Huazhong University of Science and Technology)

Abstract
Semantic segmentation is a fundamental task in computer vision. Its goal is to assign a semantic category label to each pixel in an image, achieving pixel-level understanding, and it is widely applied in areas such as autonomous driving, virtual reality, and medical image analysis. In recent years, thanks to the development of deep learning, significant progress has been made in fully supervised semantic segmentation, which requires a large amount of training data with pixel-level annotations. However, accurate pixel-level annotations are very hard to provide, since they consume substantial time, money, and human labeling effort, which limits the widespread application of these methods in practice. To reduce the cost of annotating data and further expand the application scenarios of semantic segmentation, researchers are paying increasing attention to weakly supervised semantic segmentation (WSSS) based on deep learning. The goal is to develop a semantic segmentation model that uses weak annotation information instead of dense pixel-level annotations to accurately predict pixel-level segmentation. Weak annotations mainly include image-level, bounding-box, scribble, and point annotations. The key problem in WSSS is how to effectively use the limited annotation information, incorporate appropriate training strategies, and design powerful models to bridge the gap between weak supervision and pixel-level annotations. This paper aims to classify and summarize WSSS methods based on deep learning, analyze the challenges and problems encountered by recent methods, and provide insights into future research directions. Firstly, we introduce WSSS as a solution to the limitations of fully supervised semantic segmentation. Secondly, we introduce the related datasets and evaluation metrics.
Thirdly, we review and discuss the research progress of WSSS in three categories: image-level annotations, other weak annotations, and assistance from large-scale models, where the second category includes bounding-box, scribble, and point annotations. More specifically, image-level annotations only provide the categories of the objects contained in the image, without specifying the positions of the target objects. Existing methods typically follow a two-stage training process: first producing class activation maps (CAMs), which serve as initial seed regions for generating high-quality pixel-level pseudo labels, and then training a fully supervised semantic segmentation model on the produced pixel-level pseudo labels. According to whether the pixel-level pseudo labels are updated during training in the second stage, WSSS based on image-level annotations can be further divided into offline and online approaches. For offline approaches, existing research treats the two stages independently: the initial seed regions are optimized to obtain more reliable pixel-level pseudo labels, which remain unchanged throughout the second stage. These approaches are commonly divided into six classes according to their optimization strategy: CAM ensembling, image erasing, co-occurrence relationship decoupling, affinity propagation, additional supervised information, and self-supervised learning. For online approaches, the pixel-level pseudo labels keep updating during the entire training process in the second stage, so the production of pixel-level pseudo labels and the semantic segmentation model are jointly optimized. Online approaches can be trained end-to-end, making the training process more efficient. Compared with image-level annotations, other weak annotations, including bounding-box, scribble, and point annotations, provide stronger supervisory signals.
Among them, bounding-box annotations provide not only object category labels but also information about object positions. The regions outside the bounding box are regarded as background, while the regions inside the box contain both foreground and background. Therefore, for bounding-box annotations, existing research mainly focuses on accurately distinguishing foreground from background within the bounding box, thereby producing more accurate pixel-level pseudo labels, which are then used to train the subsequent semantic segmentation network. Scribble and point annotations not only indicate the categories of the objects contained in the image, but also provide local positional information about the target objects. For scribble annotations, inferring the category of unlabeled regions from the annotated scribbles yields more complete pseudo labels to supervise semantic segmentation. For point annotations, the associated semantic information is expanded to the entire image through label propagation, distance metric learning, and loss function optimization. In addition, with the rapid development of large-scale models, this paper further discusses recent research on using large-scale models to assist WSSS tasks. Large-scale models can leverage their pretrained general knowledge to better understand images and generate more accurate pixel-level pseudo labels, thus improving the final segmentation performance. To compare the performance of different WSSS methods, this paper also reports quantitative segmentation results on the PASCAL VOC 2012 dataset. Finally, four challenges and potential future research directions are provided. Firstly, a performance gap still exists between weakly supervised and fully supervised methods. To bridge this gap, research should continue to improve the accuracy of pixel-level pseudo labels.
Secondly, when WSSS models are applied to real-world scenarios, they may encounter object categories that never appeared in the training data. This requires the models to have some ability to identify and segment unknown objects. Thirdly, existing research mainly focuses on improving accuracy, without considering the model size and inference speed of WSSS networks. This poses a significant challenge for deploying the models in real-world applications that require real-time estimation and online decisions. Fourthly, the scarcity of relevant datasets for evaluating different WSSS models and algorithms is also a major obstacle, which leads to performance degradation and limits generalization capability. Therefore, it is urgent to construct large-scale, high-quality WSSS datasets with greater diversity and various types of images.
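
As an illustration of the first stage of the image-level pipeline summarized in this abstract, the following minimal sketch derives a pixel-level pseudo-label map from class activation maps. It assumes a classifier built from global average pooling followed by a linear layer, so that each class map is a weighted sum of feature channels. The function name `cam_to_pseudo_labels`, the array shapes, and the background threshold of 0.3 are illustrative assumptions and are not taken from any surveyed method.

```python
import numpy as np


def cam_to_pseudo_labels(features, weights, image_labels, bg_threshold=0.3):
    """Turn CAMs into a pseudo-label map (hypothetical helper, for illustration).

    features:     (C, H, W) feature maps from the last convolutional layer
    weights:      (K, C) weights of the linear classifier over pooled features
    image_labels: (K,) binary image-level labels (1 = class present)
    Returns an (H, W) integer map: 0 = background, k + 1 = class k.
    """
    # The CAM of class k is the classifier-weighted sum of feature channels.
    cams = np.einsum("kc,chw->khw", weights, features)
    # Keep positive evidence only.
    cams = np.maximum(cams, 0.0)
    # Normalize each class map to [0, 1] by its own maximum.
    peak = cams.reshape(cams.shape[0], -1).max(axis=1)
    cams = cams / np.maximum(peak, 1e-8)[:, None, None]
    # Suppress classes that the image-level label marks as absent.
    cams = cams * np.asarray(image_labels, dtype=cams.dtype)[:, None, None]
    # Pixels whose best class score is below the threshold become background.
    best_class = cams.argmax(axis=0)
    best_score = cams.max(axis=0)
    return np.where(best_score >= bg_threshold, best_class + 1, 0)
```

In an offline approach, a map like this would be refined once and then held fixed while the segmentation network is trained; in an online approach, it would be recomputed as training proceeds.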
Keywords
