Multi-scale context information fusion for instance segmentation

Wan Xinjun1,2, Zhou Yiyun1,2, Shen Mingfei1, Zhou Tao3, Hu Fuyuan1,2 (1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China; 2. Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou 215009, China; 3. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China)

Abstract
Objective Instance segmentation classifies and localizes the different targets in an image with pixel-level instance masks. However, the targets in an image often differ in scale, and this multi-scale variation easily causes false and missed detections, which limits the improvement of instance segmentation accuracy. Existing methods mainly extract multi-scale information with a feature pyramid network (FPN), but the way FPN fuses adjacent-level features by interpolation and element-wise addition fails to fully exploit the semantic information of features at different scales. Therefore, building on Mask R-CNN (mask region-based convolutional neural network), this paper proposes an attention-guided feature pyramid network and fully fuses multi-scale context information for instance segmentation. Method First, an adjacent-layer feature adaptive fusion module is designed to optimize the fusion of adjacent FPN levels: features are upsampled by content-aware reassembly, and a channel attention mechanism weights the channels before adjacent features are fused, which enhances semantic consistency and alleviates semantic aliasing between targets of different scales in adjacent levels. Second, an attention feature fusion module and a global context module are designed with multi-scale channel attention to fuse region of interest (RoI) features with multi-scale context information, which strengthens the multi-scale feature representation of the classification-regression and mask prediction branches and thereby improves the quality of mask prediction for targets of different scales. Result Comprehensive experiments are conducted on the MS COCO 2017 (Microsoft common objects in context 2017) and Cityscapes datasets. On MS COCO 2017, the proposed algorithm improves on Mask R-CNN by 1.7% and 2.5% with ResNet50 and ResNet101 backbones, respectively. On Cityscapes, with ResNet50 as the backbone, it outperforms Mask R-CNN by 2.1% on the validation set and 2.3% on the test set. Visualization results show that the proposed method localizes targets of different scales more accurately, and segmentation at mutual occlusions and at the boundaries between different targets is improved significantly. Conclusion The proposed algorithm effectively improves the network's accuracy in detecting and segmenting targets of different scales.
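As a concrete illustration of the adjacent-layer adaptive fusion described above, the following PyTorch sketch weights the channels of the finer lateral feature and the upsampled coarser feature with a channel attention block before element-wise addition. It is a minimal sketch under assumptions: the module and parameter names (ChannelAttention, AdjacentLevelFusion, reduction) are illustrative rather than taken from the paper, and plain bilinear interpolation stands in for the content-aware reassembly upsampling.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation-style channel weighting (illustrative design).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # global average pooling -> (N, C)
        return x * w.view(x.size(0), -1, 1, 1)   # re-weight the channels

class AdjacentLevelFusion(nn.Module):
    # Fuse a coarser pyramid level into the next finer one; channels of both
    # inputs are attention-weighted before addition to keep semantics consistent.
    def __init__(self, channels=256):
        super().__init__()
        self.att_fine = ChannelAttention(channels)
        self.att_coarse = ChannelAttention(channels)

    def forward(self, fine, coarse):
        # Bilinear upsampling is only a stand-in for content-aware reassembly.
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.att_fine(fine) + self.att_coarse(up)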
Keywords
Multi-scale context information fusion for instance segmentation

Wan Xinjun1,2, Zhou Yiyun1,2, Shen Mingfei1, Zhou Tao3, Hu Fuyuan1,2(1.School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China;2.Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou 215009, China;3.School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China)

Abstract
Objective Instance segmentation is one of the essential tasks in image and video scene understanding, and accurate segmentation is widely required in real-world scenarios such as autonomous driving, medical image analysis, and video surveillance. It classifies and localizes multiple targets in an image with pixel-level instance masks. However, the targets in an image often appear at different scales. For large-scale targets, the receptive field may cover only a local area, which can lead to detection errors or incomplete and inaccurate segmentation. For small-scale targets, the receptive field is affected by more background noise, so they are easily misjudged as background and missed. Recognition and segmentation accuracy is also lower at target boundaries and occlusions. Most existing instance segmentation methods pursue general improvements without a solution oriented to multi-scale targets. To further improve segmentation accuracy, we develop an instance segmentation network based on mask region-based convolutional neural network (Mask R-CNN) that combines an improved feature pyramid network (FPN) with multi-scale context information. Method First, an attention-guided feature pyramid network (AgFPN) is proposed, which optimizes the fusion of adjacent FPN levels through an adaptive adjacent layer feature fusion module (AFAFM). To learn multi-scale features effectively, AgFPN upsamples features with content-aware reassembly and applies a channel attention mechanism to weight the channels before adjacent-level features are fused. Then, an attention feature fusion module (AFFM) and a global context module (GCM) are designed on the basis of multi-scale channel attention. By adding multi-scale context information to region of interest (RoI) features, they enhance the multi-scale feature representation of the classification and regression branch and the mask prediction branch, and thus improve the quality of mask prediction for multi-scale objects. The overall pipeline is as follows. First, AgFPN extracts multi-scale features. A region proposal network (RPN) then generates and filters candidate bounding boxes of target regions. Meanwhile, multi-scale context information is derived from the AgFPN outputs by the AFFM and GCM. Next, the RoIAlign operation maps each RoI onto the feature map to obtain a fixed-size feature map, which is fused with the multi-scale context information. Finally, bounding box regression and mask prediction are performed on the fused features.
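To make the RoI-level fusion step above more concrete, the sketch below blends a fixed-size RoI feature with context pooled to the same resolution, using a multi-scale channel attention weight in the spirit of the AFFM and GCM. It is only an illustrative PyTorch sketch under assumptions: the class names (MSChannelAttention, RoIContextFusion), the two-branch design, and the blending formula are hypothetical stand-ins, not the paper's actual modules.

import torch
import torch.nn as nn

class MSChannelAttention(nn.Module):
    # Multi-scale channel attention: a global (pooled) branch plus a local
    # point-wise branch, combined and squashed into per-channel weights.
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.local = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.glob = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.local(x) + self.glob(x))

class RoIContextFusion(nn.Module):
    # Blend a fixed-size RoI feature with context pooled to the same resolution.
    def __init__(self, channels=256):
        super().__init__()
        self.attn = MSChannelAttention(channels)

    def forward(self, roi_feat, ctx_feat):
        w = self.attn(roi_feat + ctx_feat)          # attention weights from the sum
        return w * roi_feat + (1.0 - w) * ctx_feat  # attention-weighted blending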
The proposed algorithm is implemented with the deep learning framework PyTorch. The experiments are run on the Ubuntu 16.04 operating system, and four NVIDIA 1080Ti graphics processing units (GPUs) are used to accelerate training. ResNet-50/101 is used as the backbone network, and the weights pre-trained on ImageNet initialize the network parameters. On the Microsoft common objects in context 2017 (MS COCO 2017) dataset, the network is optimized with stochastic gradient descent (SGD) for 160 000 iterations; the initial learning rate is 0.002, the batch size is 4, and the learning rate is divided by 10 at 130 000 and 150 000 iterations. On the Cityscapes dataset, the batch size is 4, the initial learning rate is 0.005, training runs for 48 000 iterations, and the learning rate is reduced to 0.000 5 at 36 000 iterations. The weight decay coefficient is set to 0.000 5 and the momentum coefficient to 0.9. The loss function and the remaining hyperparameters are set and initialized following this training strategy. Result The effectiveness of our method is evaluated through comprehensive experiments on the MS COCO 2017 and Cityscapes datasets. On MS COCO 2017, the proposed algorithm improves on the Mask R-CNN baseline by 1.7% and 2.5% with the ResNet50 and ResNet101 backbones, respectively. On Cityscapes, with ResNet50 as the backbone, the results on the validation set and the test set are 2.1% and 2.3% higher than those of Mask R-CNN, respectively. The ablation study shows that AgFPN delivers consistent gains and can easily be integrated into multiple detectors. The attention feature fusion module and the global context module improve average precision by 0.6% and 0.7%, respectively, and combining the two modules improves the baseline performance by 1.7%. The visualization results show that our method localizes multi-scale targets more accurately, and segmentation is improved significantly at mutual occlusions and at the boundaries between different targets. Conclusion The experimental results show that, by exploiting the multi-scale context information of targets, the proposed algorithm strengthens their multi-scale feature representation and further improves the accuracy of the network in detecting and segmenting targets at different scales.
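For reference, the reported MS COCO 2017 optimization settings map onto a standard PyTorch SGD setup roughly as follows. This is a hedged sketch, not the authors' training code: the placeholder model and the bare iteration loop only illustrate the stated schedule (initial learning rate 0.002, divided by 10 at 130 000 and 150 000 of 160 000 iterations, weight decay 0.000 5, momentum 0.9).

import torch
import torch.nn as nn

# Placeholder module standing in for the Mask R-CNN variant described above.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Reported MS COCO 2017 settings: SGD, initial lr 0.002, momentum 0.9,
# weight decay 0.0005, batch size 4 across four GPUs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.002,
                            momentum=0.9, weight_decay=5e-4)

# Learning rate divided by 10 at 130 000 and 150 000 of the 160 000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[130_000, 150_000], gamma=0.1)

for iteration in range(160_000):
    # A real run would compute the detection and segmentation losses on a batch
    # of 4 images here and call loss.backward() before stepping the optimizer.
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()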
Keywords
