Weakly supervised semantic segmentation based on deep learning

Xiang Weikang1,2, Zhou Quan1,2, Cui Jingcheng1, Mo Zhiyi2, Wu Xiaofu1, Ou Weihua3, Wang Jingdong4, Liu Wenyu5 (1. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 2. Guangxi Colleges and Universities Key Laboratory of Intelligent Software, Wuzhou University, Wuzhou 543003, China; 3. School of Big Data and Computer Science, Guizhou Normal University, Guiyang 550025, China; 4. Baidu, Beijing 100085, China; 5. School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430071, China)

Abstract

Semantic segmentation is an important and fundamental task in computer vision. Its goal is to assign a semantic category label to each pixel in an image, achieving pixel-level understanding. It has wide applications in areas such as autonomous driving, virtual reality, and medical image analysis. With the development of deep learning in recent years, remarkable progress has been achieved in fully supervised semantic segmentation, which requires a large amount of training data with pixel-level annotations. However, accurate pixel-level annotations are difficult to obtain because they cost substantial time, money, and human labeling effort, which limits the widespread application of fully supervised methods in practice. To reduce the cost of annotating data and further expand the application scenarios of semantic segmentation, researchers are paying increasing attention to weakly supervised semantic segmentation (WSSS) based on deep learning. The goal is to develop a semantic segmentation model that uses weak annotation information instead of dense pixel-level annotations to predict pixel-level segmentation accurately. Weak annotations mainly include image-level, bounding-box, scribble, and point annotations. The key problem in WSSS is how to exploit the limited annotation information, incorporate appropriate training strategies, and design powerful models to bridge the gap between weak supervision and pixel-level annotations. This paper classifies and summarizes WSSS methods based on deep learning, analyzes the challenges and problems encountered by recent methods, and provides insights into future research directions. First, we introduce WSSS as a solution to the limitations of fully supervised semantic segmentation. Second, we introduce the related datasets and evaluation metrics.
Third, we review and discuss the research progress of WSSS in three categories: image-level annotations, other weak annotations, and assistance from large-scale models, where the second category includes bounding-box, scribble, and point annotations. Specifically, image-level annotations only provide the categories of the objects contained in the image, without specifying the positions of the target objects. Existing methods typically follow a two-stage training process: first, producing a class activation map (CAM), also known as the initial seed regions, which is used to generate high-quality pixel-level pseudo labels; and second, training a fully supervised semantic segmentation model using the produced pixel-level pseudo labels. According to whether the pixel-level pseudo labels are updated during the second-stage training, WSSS based on image-level annotations can be further divided into offline and online approaches. Offline approaches treat the two stages independently: the initial seed regions are optimized to obtain more reliable pixel-level pseudo labels, which remain unchanged throughout the second stage. These approaches are often divided into six classes according to their optimization strategies: the ensemble of CAM, image erasing, co-occurrence relationship decoupling, affinity propagation, additional supervised information, and self-supervised learning. In online approaches, the pixel-level pseudo labels keep updating during the entire second-stage training, and the production of pixel-level pseudo labels and the semantic segmentation model are jointly optimized. Online approaches can be trained end to end, making the training process more efficient. Compared with image-level annotations, other weak annotations, including bounding-box, scribble, and point annotations, provide stronger supervisory signals. Among them, bounding-box annotations provide not only object category labels but also information about object positions.
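The first stage of the image-level pipeline described above can be illustrated with a minimal sketch. It assumes the classic CAM formulation, in which the map for a class is the classifier-weighted sum of the last convolutional feature maps, followed by a simple threshold to obtain initial seed labels. The function names, the toy feature maps, the classifier weights, and the threshold of 0.5 are all hypothetical choices for illustration; the surveyed methods refine these seeds in many different ways.

```python
from typing import Dict, List

Grid = List[List[float]]


def class_activation_map(features: List[Grid], class_weights: List[float]) -> Grid:
    """Weighted sum of K feature maps for one class, clipped at 0 and scaled to [0, 1]."""
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(features, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    # Keep only positive evidence for the class, then normalize by the peak.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


def seed_labels(cams: Dict[int, Grid], fg_threshold: float = 0.5,
                background: int = 0) -> List[List[int]]:
    """Give each pixel the highest-scoring class, or background below the threshold."""
    first = next(iter(cams.values()))
    height, width = len(first), len(first[0])
    labels = [[background] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            best_class, best_score = background, fg_threshold
            for class_id, cam in cams.items():
                if cam[i][j] >= best_score:
                    best_class, best_score = class_id, cam[i][j]
            labels[i][j] = best_class
    return labels


# Toy example (hypothetical values): two 2x2 feature maps, two classes
# that the image-level label says are present in the image.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = {1: [1.0, -1.0], 2: [0.0, 1.0]}
cams = {c: class_activation_map(features, w) for c, w in weights.items()}
labels = seed_labels(cams)
print(labels)  # [[1, 0], [0, 2]]
```

Only the classes named in the image-level annotation are passed to `seed_labels`, which is how the weak label constrains the pseudo labels in this sketch.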
With bounding-box annotations, the regions outside the box are always considered background, whereas the regions inside the box contain both foreground and background. Therefore, existing research on bounding-box annotations mainly focuses on accurately distinguishing foreground from background within the box, thereby producing more accurate pixel-level pseudo labels for training the subsequent semantic segmentation network. Scribble and point annotations not only indicate the categories of objects contained in the image but also provide local positional information about the target objects. For scribble annotations, more complete pseudo labels can be produced to supervise semantic segmentation by inferring the categories of unlabeled regions from the annotated scribbles. For point annotations, the associated semantic information is expanded to the entire image through label propagation, distance metric learning, and loss function optimization. In addition, with the rapid development of large-scale models, this paper further discusses recent research on using large-scale models to assist WSSS tasks. Large-scale models can leverage their pretrained universal knowledge to understand images and generate accurate pixel-level pseudo labels, thus improving the final segmentation performance. This paper also reports quantitative segmentation results on the pattern analysis, statistical modeling and computational learning visual object classes 2012 (PASCAL VOC 2012) dataset to evaluate the performance of different WSSS methods. Finally, four challenges and potential future research directions are presented. First, a performance gap remains between weakly supervised and fully supervised methods. To bridge this gap, research should continue to improve the accuracy of pixel-level pseudo labels. Second, when WSSS models are applied to real-world scenarios, they may encounter object categories that never appeared in the training data.
This requires the models to have a certain adaptability to identify and segment unknown objects. Third, existing research mainly focuses on improving accuracy without considering the model size and inference speed of WSSS networks, which poses a major challenge for deploying models in real-world applications that require real-time estimation and online decisions. Fourth, the scarcity of datasets for evaluating different WSSS models and algorithms is also a major obstacle, as it leads to performance degradation and limits generalization capability. Therefore, large-scale WSSS datasets with high quality, great diversity, and wide variation of image types must be constructed.
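The quantitative comparison on PASCAL VOC 2012 mentioned above is usually expressed as mean intersection over union (mIoU). The abstract does not name the metric, so the following is a sketch under that assumption. The function name `mean_iou` is hypothetical, the ignore value of 255 follows the PASCAL VOC convention for void pixels, and the counts here are taken over a single toy image, whereas benchmark scores accumulate intersections and unions over the whole evaluation set before dividing.

```python
from typing import List


def mean_iou(pred: List[List[int]], target: List[List[int]],
             num_classes: int, ignore_index: int = 255) -> float:
    """Average per-class IoU over the classes that occur in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue  # void pixels are excluded from the score
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x2 example (hypothetical values): class 0 is background, class 1 is an object.
pred = [[0, 1], [1, 1]]
target = [[0, 1], [0, 255]]
score = mean_iou(pred, target, num_classes=2)
print(score)  # 0.5
```

In this example each class has one correctly labeled pixel and a union of two pixels, so both per-class IoU values are 0.5. The same function can compare pseudo labels against ground truth, which is one way to measure the pseudo-label accuracy that the first challenge refers to.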