Multiscale feature fusion and cross-guidance for few-shot semantic segmentation

Guo Jing; Wang Fei

Published: 2024-05-20
DOI: 10.11834/jig.230550
2024 | Volume 29 | Number 5

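Guo Jing¹, Wang Fei² (1. Department of Electronic Information, Jinzhong Vocational and Technical College, Jinzhong 030600, China; 2. Intelligent and Available Systems Centre, University of Brighton, Brighton BN2 4AT, UK)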

Abstract

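Objective Building information interaction between the support branch and the query branch is important for improving the performance of few-shot semantic segmentation. A few-shot semantic segmentation algorithm based on multiscale feature fusion and cross-guidance is proposed.
Method A set of backbone networks with shared weights maps the input images of both branches into a deep feature space, and the low-level, intermediate-level, and high-level output features are fused across scales to construct multiscale features. With the help of the support-branch mask, the support features are decomposed into target foreground and background feature maps. A feature interaction module is designed to establish information interaction between the target foreground of the support branch and the entire feature map of the query branch, which strengthens the representation of task-relevant features, and a masked average pooling strategy generates prototype sets for the target foreground and background regions. A parameter-free metric computes the cosine similarity between the support features and the prototype sets and between the query features and the prototype sets, and the mask of the corresponding image is produced from these similarity values.
Result Experiments on the open-source PASCAL-5i (pattern analysis, statistical modeling and computational learning) and COCO-20i (common objects in context) datasets show that, with VGG-16 (Visual Geometry Group), ResNet-50 (residual neural network), and ResNet-101 as backbone networks, the proposed model achieves the following in the 1-way 1-shot task: a mean intersection over union (mIoU) of 50.2%, 53.2%, and 57.1% on PASCAL-5i and 23.9%, 35.1%, and 36.4% on COCO-20i, and a foreground and background intersection over union (FB-IoU) of 68.3%, 69.4%, and 72.3% on PASCAL-5i and 60.1%, 62.4%, and 64.1% on COCO-20i. In the 1-way 5-shot task, it achieves an mIoU of 52.9%, 55.7%, and 59.7% on PASCAL-5i and 32.5%, 37.3%, and 38.3% on COCO-20i, and an FB-IoU of 69.7%, 72.5%, and 74.6% on PASCAL-5i and 64.2%, 66.2%, and 66.7% on COCO-20i.
Conclusion Compared with current mainstream few-shot semantic segmentation models, the proposed model obtains higher mIoU and FB-IoU in both the 1-way 1-shot and 1-way 5-shot tasks, with a marked improvement in overall performance.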
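The masked average pooling and cosine-similarity steps named in the Method can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`masked_average_pooling`, `cosine`, `predict_mask`) are hypothetical, feature maps are toy nested lists rather than backbone tensors, each region has a single prototype instead of a prototype set, and the foreground prototype is pooled directly from the support features rather than from the interacted feature map.

```python
import math

def masked_average_pooling(features, mask):
    """Average the C-dim feature vectors at positions where mask == 1.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    Returns a length-C prototype (all zeros if the mask is empty).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c, v in enumerate(vec):
                    total[c] += v
    return [t / count for t in total] if count else total

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors (eps avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def predict_mask(features, fg_prototypes, bg_prototypes):
    """Label a position 1 if its most similar prototype is a foreground one."""
    pred = []
    for row in features:
        pred_row = []
        for vec in row:
            fg = max(cosine(vec, p) for p in fg_prototypes)
            bg = max(cosine(vec, p) for p in bg_prototypes)
            pred_row.append(1 if fg > bg else 0)
        pred.append(pred_row)
    return pred

# Toy 2x2 support feature map with 2 channels and its ground-truth mask.
support = [[[1.0, 0.0], [0.9, 0.1]],
           [[0.0, 1.0], [0.1, 0.9]]]
support_mask = [[1, 1],
                [0, 0]]
background_mask = [[1 - m for m in row] for row in support_mask]

fg_proto = masked_average_pooling(support, support_mask)
bg_proto = masked_average_pooling(support, background_mask)

# The same prototypes label both the support and the query features.
support_pred = predict_mask(support, [fg_proto], [bg_proto])
query = [[[0.2, 0.8], [0.8, 0.2]]]
query_pred = predict_mask(query, [fg_proto], [bg_proto])
```

Because the matching step has no learned parameters, the same prototypes can score both the support and the query features, which is what allows a loss to be placed on both branches.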

Keywords

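few-shot semantic segmentation; multiscale feature fusion; cross-branch cross-guidance; feature interaction; masked average pooling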

Multiscale feature fusion and cross-guidance for few-shot semantic segmentation

Guo Jing¹, Wang Fei² (1. Department of Electronic Information, Jinzhong Vocational and Technical College, Jinzhong 030600, China; 2. Intelligent and Available Systems Centre, University of Brighton, Brighton BN2 4AT, UK)

Abstract

Objective Few-shot semantic segmentation is a fundamental and challenging task in computer vision. It aims to use a small number of annotated support samples to guide the segmentation of unseen objects in a query image. Compared with traditional semantic segmentation, it alleviates two problems: the high cost of per-pixel annotation, which greatly limits the use of semantic segmentation in practical scenarios, and the weak generalization of segmentation models to novel classes. Existing few-shot semantic segmentation methods mainly adopt a meta-learning architecture with a dual-branch network. The support branch takes the support images and their per-pixel ground-truth masks, the query branch takes the new image to be segmented, and both branches share the same semantic classes. Information extracted from the support images is used to guide the segmentation of novel classes in the query images. However, different instances of the same semantic class may vary in appearance and scale, so information extracted solely from the support branch is insufficient to guide the segmentation of novel classes in query images. Although some researchers have attempted to improve performance through bidirectional guidance, existing bidirectional guidance models rely too heavily on the pseudo masks predicted by the query branch at an intermediate stage. If the initial predictions of the query branch are poor, the shared semantics generalize weakly, which hinders segmentation performance.
Method A multiscale feature fusion and cross-guidance network for few-shot semantic segmentation is proposed to alleviate these problems. It builds information interaction between the support branch and the query branch to improve few-shot segmentation performance. First, a set of pretrained backbone networks with shared weights serves as the feature extractor and maps the support and query images into the same deep feature space. The low-level, intermediate-level, and high-level features they output are then fused across scales to construct a multiscale feature set, which enriches the semantic information of the features and makes the feature representation more reliable. Second, with the help of the ground-truth mask of the support branch, the fused support features are decomposed into target-related foreground feature maps and task-irrelevant background feature maps. Third, a feature interaction module based on the cross-attention mechanism establishes information interaction between the target-related foreground feature maps of the support branch and the entire feature map of the query branch, which promotes interaction between the branches and strengthens the representation of task-related features. In addition, a masked average pooling strategy is applied to the interacted feature map to generate a prototype set for the target foreground region, and a background prototype set is generated from the support background feature map. Finally, cosine similarity is computed between the support features and the prototype sets and between the query features and the prototype sets, and the corresponding mask is generated from the maximum similarity value at each position.
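The cross-attention interaction between the support foreground and the query feature map can be sketched as follows. This is an illustrative simplification under stated assumptions, not the module from the paper: the function names are hypothetical, there are no learned query/key/value projections, a single attention head is used, and the residual addition at the end is an assumption of this sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query_feats, support_fg_feats):
    """Let each query vector attend over the support foreground vectors.

    query_feats: N_q vectors of length d (flattened query feature map).
    support_fg_feats: N_s vectors of length d (support foreground positions).
    Returns N_q vectors: each query vector plus its attended support context
    (the residual addition is an assumption of this sketch).
    """
    d = len(query_feats[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in query_feats:
        # Scaled dot-product scores between this query position and every
        # support foreground position.
        scores = [scale * sum(a * b for a, b in zip(q, k))
                  for k in support_fg_feats]
        weights = softmax(scores)
        # Attention-weighted sum of the support foreground vectors.
        context = [sum(w * k[c] for w, k in zip(weights, support_fg_feats))
                   for c in range(d)]
        out.append([a + b for a, b in zip(q, context)])
    return out

# A query position resembling the first support vector draws most of its
# context from that vector.
enhanced = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Restricting the keys and values to foreground positions keeps background clutter in the support image out of the context that reaches the query branch.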
Result Experimental results on the classic PASCAL-5i (pattern analysis, statistical modeling and computational learning) dataset show that when VGG-16 (Visual Geometry Group), ResNet-50 (residual neural network), and ResNet-101 are used as backbone networks, the proposed model achieves mean intersection over union (mIoU) scores of 50.2%/53.2%/57.1% and foreground and background intersection over union (FB-IoU) scores of 68.3%/69.4%/72.3% in the one-way one-shot task, and mIoU scores of 52.9%/55.7%/59.7% and FB-IoU scores of 69.7%/72.5%/74.6% in the one-way five-shot task. On the more challenging COCO-20i (common objects in context) dataset, with the same three backbones, the proposed model achieves mIoU scores of 23.9%/35.1%/36.4% and FB-IoU scores of 60.1%/62.4%/64.1% in the one-way one-shot task, and mIoU scores of 32.5%/37.3%/38.3% and FB-IoU scores of 64.2%/66.2%/66.7% in the one-way five-shot task. Furthermore, the performance gains of the proposed model on the PASCAL-5i and COCO-20i datasets are competitive.
Conclusion Compared with current mainstream few-shot semantic segmentation models, the proposed model achieves higher mIoU and FB-IoU in the one-way one-shot and one-way five-shot tasks, with a marked improvement in overall performance. Further validation shows that feature interaction between the support branch and the query branch effectively improves the model's ability to locate and segment unseen classes in query images. It also shows that a joint loss over the support and query branches promotes information flow between the dual-branch features, makes the prototype representation more reliable, and aligns the prototype sets across branches.
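For readers unfamiliar with the two reported metrics, the sketch below follows the convention commonly used in few-shot segmentation: mIoU averages the foreground IoU over classes, while FB-IoU ignores classes and averages the foreground IoU and the background IoU. Whether the paper's evaluation accumulates counts in exactly this way is an assumption, and the function names and toy masks are hypothetical.

```python
from collections import defaultdict

def intersection_union(pred, target, label):
    """Count pixels where both masks (intersection) or either mask (union)
    carry the given label."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p_hit, t_hit = (p == label), (t == label)
            inter += int(p_hit and t_hit)
            union += int(p_hit or t_hit)
    return inter, union

def evaluate(episodes):
    """episodes: iterable of (class_id, predicted_mask, ground_truth_mask),
    where masks are H x W nested lists of 0 (background) / 1 (foreground).
    Returns (mIoU, FB-IoU) as fractions in [0, 1]."""
    per_class = defaultdict(lambda: [0, 0])  # class_id -> [inter, union], foreground only
    fb = {0: [0, 0], 1: [0, 0]}              # background / foreground, classes ignored
    for class_id, pred, target in episodes:
        for label in (0, 1):
            inter, union = intersection_union(pred, target, label)
            fb[label][0] += inter
            fb[label][1] += union
            if label == 1:
                per_class[class_id][0] += inter
                per_class[class_id][1] += union
    iou = lambda i, u: i / u if u else 0.0
    miou = sum(iou(i, u) for i, u in per_class.values()) / len(per_class)
    fb_iou = sum(iou(i, u) for i, u in fb.values()) / 2
    return miou, fb_iou

# Two toy episodes from two different classes.
episodes = [
    (1, [[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    (2, [[1, 0], [0, 1]], [[1, 0], [0, 0]]),
]
miou, fb_iou = evaluate(episodes)
```

In this toy case each class has a foreground IoU of 1/2, so mIoU is 0.5, while the pooled background IoU of 4/6 lifts FB-IoU to 7/12. This is why FB-IoU values are typically higher than mIoU values for the same model.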

Keywords

few-shot semantic segmentation; multiscale feature fusion; cross-branch cross-guidance; feature interaction; masked average pooling