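Shi Deshuo1, Li Junxia2, Liu Qingshan2 (1. Nanjing University of Information Science and Technology, Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology; 2. School of Computer and Software, Nanjing University of Information Science and Technology)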
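Objective Semantic segmentation is a fundamental task in computer vision. Training its deep network models relies on large-scale pixel-level annotations, which are time-consuming and labor-intensive to obtain. Weakly supervised semantic segmentation, which trains the segmentation model from weak annotations only, has therefore become an active research topic. Most existing weakly supervised methods based on image-level labels use convolutional neural networks to obtain pseudo labels; these locate the target accurately but cover too small a part of the object. Transformer-based methods usually expand the class activation map with self-attention and obtain larger discriminative regions, but because the deep-layer attention is inaccurate, the refined pseudo labels contain considerable background noise. To exploit the advantages of both types of feature extraction network and the attention characteristics of different Transformer layers, a self-attention fusion and modulation network that combines convolutional features and Transformer features is constructed for weakly supervised semantic segmentation. Method First, to make full use of the local features extracted by the convolutional neural network and the global features extracted by the Transformer, this paper adopts the convolution-enhanced Transformer (Conformer) as the feature extraction network, which encodes the image more comprehensively and yields the initial class activation map. Second, experiments show that, after fusion with the convolutional network, the shallow-layer self-attention of the Transformer mostly focuses on target details, whereas the deep-layer attention emphasizes global information. A layer-wise adaptive self-attention fusion module is therefore designed, which generates fusion weights from the self-attention values and the layer importance; the fused self-attention suppresses background noise well. Furthermore, a self-attention modulation module is proposed, which uses the attention relations between pixel pairs to design a modulation function that increases the activation response of foreground pixels. The modulated attention is then used to refine the initial class activation map, which effectively enlarges its response range so that it covers more of the target region while suppressing background noise. Result In experimental comparisons with recent methods on the widely used PASCAL VOC 2012 and COCO 2014 datasets, the proposed algorithm achieves the best results on both: the mIoU reaches 70.2% on the PASCAL VOC validation set and 70.5% on the test set, and 40.1% on the COCO 2014 validation set. Conclusion The proposed weakly supervised semantic segmentation model combines the advantages of convolutional neural networks and the Transformer and, through adaptive fusion and modulation of the Transformer self-attention, obtains the best semantic segmentation results to date under image-level labels.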
Self-attention fusion and modulation for weakly supervised semantic segmentation
(School of Computer and Software, Nanjing University of Information Science and Technology)
Objective Semantic segmentation is a fundamental task in computer vision and image processing that aims to assign a class label to each pixel. Training a segmentation model usually relies on dense pixel-wise annotations, which are time-consuming and labor-intensive to collect. To remove the dependence on pixel-level labels, weakly supervised semantic segmentation (WSSS) has attracted much attention because it uses weaker and cheaper supervision, such as points, scribbles, image-level labels, and bounding boxes. Image-level labels are the weakest and easiest form of supervision to obtain, but they also make generating pseudo labels the most challenging. The main difficulty of WSSS based on a convolutional neural network with image-level supervision is the inherent gap between the classification and segmentation tasks, which leads to only a small part of the target region being activated and is insufficient for segmentation. A classifier based on the Transformer can activate most of the foreground object, but it also introduces considerable background noise, which lowers the quality of the pseudo masks. To take full advantage of these two types of feature extraction network and to combine the attention features of different Transformer layers, a self-attention fusion and modulation network is constructed for weakly supervised semantic segmentation. Method To make full use of the local features extracted by the convolutional neural network and the global features extracted by the Transformer, this paper adopts the convolution-enhanced Transformer (Conformer) as the feature extraction network, which encodes the image more comprehensively and yields the initial class activation maps. We observe that the attention maps learned by the Transformer branch differ from the shallow layers to the deep layers. 
Influenced by the convolutional information, the attention maps in shallow layers tend to capture the details of the target regions, while those in deeper layers tend to mine global information. Our analysis suggests that the noise in the background regions is caused by the attention maps in deeper layers, owing to incorrect relations between background and foreground. Directly summing the attention maps of different layers is therefore a suboptimal choice. We propose a self-attention adaptive fusion module that assigns a weight to each layer to balance the importance of different layers. On the one hand, we argue that the attention maps in shallow layers are more accurate than those in deeper layers, so we assign large weights to the shallow-layer maps and small weights to the deep-layer maps to reduce the influence of the noise introduced by the deep layers. On the other hand, we take into account the individual activation values of each attention map: the larger the value, the greater its importance. The fused self-attention suppresses background noise well while still describing the similarity between pixel pairs. To further increase the activation response of foreground pixels, a self-attention modulation module is designed. We first normalize the attention map and then map it through an exponential function to obtain the importance of each pixel pair. Because pixels of the target object are relatively similar to one another, the attention value of such a pixel pair tends to be larger than that of other pairs, and we strengthen this connection with a large modulation parameter. When the attention value of a pixel pair is small, the two pixels are probably not closely related and their connection can introduce noise, so we weaken it with a small modulation parameter. After modulation, the gap between foreground and background pixels becomes larger and the attention maps focus more on the foreground regions. 
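The fusion, modulation, and refinement steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the geometric layer prior (`layer_decay`), the exponential value term, the modulation strength (`gamma`), and the function names are all assumptions chosen only to match the qualitative description (shallow layers and large attention values weigh more; strong pixel-pair connections are amplified and weak ones damped). Plain Python lists are used to keep the sketch self-contained; a practical version would operate on tensors.

```python
import math


def fuse_attention(attn_layers, layer_decay=0.8):
    """Fuse per-layer N x N attention maps, ordered from shallow to deep."""
    num_layers = len(attn_layers)
    n = len(attn_layers[0])
    # Assumed layer prior: geometric decay, so shallow layers get larger weights.
    prior = [layer_decay ** l for l in range(num_layers)]
    fused = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            values = [attn_layers[l][i][j] for l in range(num_layers)]
            # Assumed value term: exp(value), so larger attention values count more.
            raw = [prior[l] * math.exp(values[l]) for l in range(num_layers)]
            total = sum(raw)
            # Weights sum to 1 over layers, so the result is a convex combination.
            fused[i][j] = sum(r / total * v for r, v in zip(raw, values))
    return fused


def modulate_attention(attn, gamma=4.0):
    """Amplify strong pixel-pair connections and damp weak ones."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    modulated = []
    for row in attn:
        # Min-max normalize, then apply an exponential factor that is above 1
        # for normalized values over 0.5 and below 1 otherwise (an assumption).
        scaled = [v * math.exp(gamma * ((v - lo) / span - 0.5)) for v in row]
        total = sum(scaled) or 1.0
        modulated.append([s / total for s in scaled])  # each row sums to 1
    return modulated


def refine_cam(cam, attn):
    """Propagate CAM scores: refined[c][i] = sum_j attn[i][j] * cam[c][j].

    cam holds one flattened list of N activations per class.
    """
    return [[sum(a * x for a, x in zip(row, scores)) for row in attn]
            for scores in cam]


if __name__ == "__main__":
    # Toy example with N = 3 tokens and two layers (hypothetical values).
    layers = [
        [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]],  # shallow layer
        [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]],  # deep layer
    ]
    cam = [[1.0, 0.2, 0.0]]  # one class, activations for the three tokens
    attn = modulate_attention(fuse_attention(layers))
    print(refine_cam(cam, attn))
```

In this sketch the fusion is a per-entry convex combination of the layer maps, so it can never exceed the range of the inputs, and the row normalization after modulation keeps the refined activations on the same scale as the initial class activation map.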
Result A series of experiments demonstrates that our model achieves state-of-the-art performance. On the widely used PASCAL VOC 2012 dataset, it obtains 70.2% mIoU on the validation set and 70.5% mIoU on the test set; on COCO 2014, it obtains 40.1% mIoU on the validation set. It is worth noting that we do not use saliency maps to provide background cues, yet our results are comparable to those of methods that do. Compared with recent state-of-the-art models that use the Transformer structure to extract features, our method outperforms MCTformer by 2% on the validation set and 2.1% on the test set in terms of mIoU. Compared with TransCAM, which directly uses the attention to adjust the class activation maps, we obtain a 0.9% improvement on both the validation set and the test set. This result suggests that our model is effective in reducing the noise in the background regions. Compared with existing methods that use a convolutional neural network as the backbone, namely IRNet, SEAM, AMR, SIPE, and URN, our method outperforms them by 6.7%, 5.7%, 1.4%, 1.4%, and 0.7% on the validation set, respectively. This shows that our dual-branch feature extraction structure is effective and feasible. Because we extract features from both a local and a global perspective, we also conduct an ablation experiment to show the importance of this complementary information. Using only the information from the convolution branch, the class activation map (CAM) reaches 27.7% mIoU; when it is fused with the global features generated by the Transformer branch, the CAM reaches 35.1% mIoU, which shows that both local and global information help to generate the CAM. Conclusion The self-attention adaptive fusion and modulation network proposed in this paper is effective for the image-level weakly supervised semantic segmentation task.