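Bai Xuefei1, Lu Libin1, Wang Wenjian2 (1. School of Computer and Information Technology, Shanxi University; 2. School of Computer and Information Technology, Shanxi University, and Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education)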
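Objective Image-level weakly supervised semantic segmentation trains the segmentation network with category labels only, which significantly reduces annotation cost. Most existing methods use class activation maps (CAMs) to locate target objects. However, a conventional CAM highlights only the most discriminative regions of an object, so a segmentation network trained directly on it as a pseudo label has poor accuracy. This paper proposes a saliency-guided weakly supervised semantic segmentation algorithm that improves segmentation performance by first obtaining more complete CAMs. Method First, the saliency map guides complementary random hiding of the target to produce complementary image pairs, and the fused CAMs of each pair serve as supervision, improving the network's ability to produce complete CAMs. Second, a dual attention refinement module is introduced, which uses global information to correct the CAMs and generate the pseudo labels used to train the segmentation network. Finally, a label iteration refinement strategy combines the initial prediction of the segmentation network, the CAM, and the saliency map to generate more accurate pseudo labels, and the segmentation network is trained iteratively. Result CAM generation and semantic segmentation experiments were conducted on the PASCAL VOC 2012 dataset. The generated CAMs are more complete, with a 10.21% improvement in mean intersection over union (mIoU). The semantic segmentation results are better than those of all compared methods, with a 6.9% mIoU improvement. In addition, a multi-object semantic segmentation experiment on the COCO 2014 dataset yielded a 0.5% mIoU improvement. Conclusion The algorithm obtains more complete CAMs, alleviates the problem of insufficient supervision in weakly supervised semantic segmentation, and improves the accuracy of weakly supervised semantic segmentation models.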
Saliency guided object complementary hiding for weakly supervised semantic segmentation
(1. School of Computer and Information Technology, Shanxi University; 2. School of Computer and Information Technology, Shanxi University, and Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education)
Objective Fully supervised semantic segmentation based on deep learning has made great progress and has promoted practical applications such as autonomous driving and medical image analysis. However, it depends on complete pixel-wise annotation, and constructing large-scale pixel-wise annotated datasets requires considerable human labor and resources. To reduce the reliance on precise annotation, researchers have recently studied semantic segmentation under weaker, cheaper forms of supervision, such as bounding boxes, scribbles, points, and image-level labels. Weakly supervised semantic segmentation based on image-level labels uses only category labels to train the segmentation network, which significantly reduces annotation cost. Most existing weakly supervised semantic segmentation methods use the class activation map (CAM) to locate target objects. On the one hand, the CAM generated by a classification network is sparse and focuses only on the most discriminative regions of an object, and it contains some mis-activated pixels that may misguide the subsequent segmentation task. On the other hand, the performance of the segmentation network depends on the quality of the pseudo labels, and accurate pseudo labels require the shape and boundary of the object. This information cannot be obtained directly and accurately from image-level labels, so the quality of the pseudo labels is difficult to guarantee. This paper proposes a new saliency-guided weakly supervised semantic segmentation algorithm that improves the performance of the segmentation model by first obtaining more complete CAMs.
Method First, previous research shows that randomly hiding parts of the target in an image enhances the network's ability to locate the complete target, but hiding image regions directly at random leaves part of the image information unused. Complementary hiding, in contrast, uses all of the image information. However, because the hiding is random, there is no guarantee that the target object is hidden as intended; in some cases only background regions are hidden. We therefore propose a saliency-guided object complementary hiding method. Using the foreground information provided by the saliency map, complementary random hiding is applied to the object in the image to obtain a complementary image pair, and the CAMs of the pair are fused as supervision to improve the network's ability to produce more complete CAMs. Second, the convolution operations in the classification network used to generate CAMs have a local receptive field, so the features of same-class objects may differ under changes in scale, illumination, and viewing angle. These differences cause intra-class inconsistency, which negatively affects the activation and leads to mis-activation in the CAM. In addition, the classification network itself is weak at extracting complete objects, and object complementary hiding guided by saliency alone still struggles to expand the object region sufficiently. A dual attention refinement module is therefore introduced to further correct the CAM with global information, and the corrected CAM is used to generate the pseudo labels that train the segmentation network. The predictions of the segmentation network are more accurate than the original pseudo labels.
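The saliency-guided complementary hiding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the patch size, the saliency threshold, the 50/50 assignment of each foreground patch, and the element-wise maximum used to fuse the two CAMs are all assumptions made for illustration, and the names `complementary_hide` and `fuse_cams` are hypothetical. Images are plain 2-D lists of grayscale values to keep the example dependency-free.

```python
import random


def complementary_hide(image, saliency, patch=2, threshold=0.5, seed=0):
    """Return two copies of `image` with complementary foreground patches zeroed.

    image, saliency: 2-D lists of equal size.
    A patch counts as foreground if its mean saliency exceeds `threshold`
    (an assumed rule). Each foreground patch is hidden in exactly one of
    the two outputs, so the pair together still shows the whole object.
    Background patches stay visible in both outputs.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    img_a = [row[:] for row in image]
    img_b = [row[:] for row in image]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            cells = [(r, c)
                     for r in range(top, min(top + patch, h))
                     for c in range(left, min(left + patch, w))]
            mean_sal = sum(saliency[r][c] for r, c in cells) / len(cells)
            if mean_sal <= threshold:
                continue  # background patch: never hidden
            # Hide this foreground patch in one image only (assumed 50/50 split).
            target = img_a if rng.random() < 0.5 else img_b
            for r, c in cells:
                target[r][c] = 0
    return img_a, img_b


def fuse_cams(cam_a, cam_b):
    """Fuse two CAMs by element-wise maximum (an assumed fusion rule)."""
    return [[max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(cam_a, cam_b)]
```

In this sketch the fused CAM would act as the supervision target for the CAM of the unhidden image; how the fusion and the loss are actually defined is not stated in this abstract.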
However, the predictions of the segmentation network also contain some noise, so directly using them for iterative training cannot guarantee a performance improvement of the segmentation model. Finally, a label iteration refinement strategy combines the initial prediction of the segmentation network, the CAM, and the saliency map to generate pseudo labels, and the segmentation network is trained iteratively to further improve its performance. The saliency map effectively distinguishes foreground from background but cannot identify object categories. The CAM accurately locates object categories but lacks information about the complete shape of the objects. The prediction of the segmentation network provides relatively complete information about object boundaries but may contain misclassified pixels. By fully exploiting the information in these three types of maps to refine the pseudo labels, the impact of misclassified pixels is reduced as much as possible.
Result To verify the effectiveness of the algorithm, the experiments are divided into two parts. The first part verifies the proposed CAM generation algorithm and compares it with other methods. The second part compares the proposed method with several classical weakly supervised semantic segmentation algorithms and analyzes the effectiveness of each module through ablation experiments. The experiments are first conducted on the PASCAL VOC 2012 dataset. The CAMs generated by the proposed algorithm are more complete than those of the compared methods, and their mean intersection over union (mIoU) is improved by 10.21% over the baseline. The segmentation network achieves better results than the six compared methods, with a 6.9% mIoU improvement over the baseline. Our method outperforms the compared methods in 13 categories.
With an mIoU of 92% on the background category, our method achieves the highest performance among the compared methods, indicating that it makes effective use of saliency maps in training. A multi-object semantic segmentation experiment is also carried out on the COCO 2014 dataset. Compared with PASCAL VOC 2012, this dataset has richer categories and contains a large number of images with multiple object categories, which places higher demands on the algorithm. The experimental results show that the mIoU is improved by 0.5% on COCO 2014.
Conclusion The proposed algorithm obtains more complete class activation maps, effectively alleviates the problem of insufficient supervision information, and improves the accuracy of weakly supervised semantic segmentation models.
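For readers unfamiliar with the metric reported in the Result paragraph, the following minimal sketch shows how mean intersection over union (mIoU) is commonly computed for a single pair of label maps. It is an illustration only: the function name `mean_iou` is hypothetical, and benchmark evaluations typically accumulate intersections and unions over the whole dataset before averaging, whereas this sketch handles one image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in `pred` or `target`.

    pred, target: 2-D lists of integer class ids in [0, num_classes).
    Pixels whose target equals `ignore_index` are skipped; 255 is the
    void label used in PASCAL VOC annotations.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if t == ignore_index:
                continue
            if p == t:
                inter[t] += 1
                union[t] += 1
            else:
                # A mismatch adds one pixel to the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
# Class 0: IoU = 1/2; class 1: IoU = 2/3.
score = mean_iou(pred, target, num_classes=2)
```

Averaging the per-class IoU values means that small or rare categories weigh as much as the background class, which is why per-category results are usually reported alongside the mean.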