Multi-scale context information fusion for instance segmentation

Xinjun Wan; Yiyun Zhou; Mingfei Shen; Tao Zhou; Fuyuan Hu

doi:10.11834/jig.211090

Image Understanding and Computer Vision

Vol. 28, Issue 2, Pages: 495-509 (2023)
Published: 16 February 2023
Accepted: 17 January 2022

Xinjun Wan, Yiyun Zhou, Mingfei Shen, Tao Zhou, Fuyuan Hu. Multi-scale context information fusion for instance segmentation. Journal of Image and Graphics, 28(2): 495-509 (2023). DOI: 10.11834/jig.211090.

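Summary

Objective

Instance segmentation classifies and localizes the different objects in an image by means of pixel-level instance masks. However, the objects in an image often differ in scale, and this multi-scale variation easily causes false and missed detections, which limits further gains in instance segmentation accuracy. Existing methods mainly extract multi-scale information with a feature pyramid network (FPN), but FPN fuses adjacent-layer features by interpolation and element-wise addition, which does not fully exploit the semantic information of features at different scales. Building on Mask R-CNN (mask region-based convolutional neural network), this paper therefore proposes an attention-guided feature pyramid network and fully fuses multi-scale context information for instance segmentation.

Method

First, an adaptive adjacent-layer feature fusion module is designed to optimize adjacent-layer feature fusion in FPN: features are upsampled by content-aware reassembly, and before adjacent features are fused, a channel attention mechanism weights the channels to strengthen semantic consistency and alleviate semantic aliasing between objects of different scales in adjacent layers. Second, multi-scale channel attention is used to design an attention feature fusion module and a global context module, which fuse region of interest (RoI) features with multi-scale context information. This enhances the multi-scale feature representation of the classification and regression branch and of the mask prediction branch, and thereby improves mask prediction quality for objects of different scales.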
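The adjacent-layer fusion idea can be illustrated with a deliberately simplified sketch. It is not the paper's module: nearest-neighbour upsampling stands in for content-aware reassembly, the channel gate is a plain sigmoid of each channel's mean with no learned weights, and gating both inputs before the addition is an assumption. All function names are hypothetical. Feature maps are nested lists of shape [C][H][W].

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_avg_pool(feat):
    """Return the mean of each channel of a [C][H][W] feature map."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]


def upsample_nearest_2x(feat):
    """Nearest-neighbour 2x upsampling (assumed stand-in for content-aware reassembly)."""
    out = []
    for ch in feat:
        rows = []
        for row in ch:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))  # repeat each row twice
        out.append(rows)
    return out


def channel_weight(feat):
    """Scale every channel by a gate in (0, 1) derived from its own mean (no learned layers)."""
    gates = [sigmoid(m) for m in global_avg_pool(feat)]
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feat)]


def fuse_adjacent(fine, coarse):
    """Upsample the coarser map, gate both maps per channel, then add element-wise."""
    up = upsample_nearest_2x(coarse)
    fine_w = channel_weight(fine)
    up_w = channel_weight(up)
    return [
        [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(fine_w, up_w)
    ]


# Usage: a 1-channel 2x2 fine map fused with a 1-channel 1x1 coarse map.
fused = fuse_adjacent([[[0.0, 0.0], [0.0, 0.0]]], [[[2.0]]])
```

Compared with plain FPN fusion (interpolate, then add), the only change in this sketch is the per-channel gate applied before the addition. A learned gate would replace `sigmoid(mean)` in a real implementation.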

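Result

Comprehensive experiments are conducted on the MS COCO 2017 (Microsoft common objects in context 2017) and Cityscapes datasets. On MS COCO 2017, the proposed algorithm improves on Mask R-CNN by 1.7% and 2.5% with ResNet50 and ResNet101 backbones, respectively. On Cityscapes, with ResNet50 as the backbone, it improves on Mask R-CNN by 2.1% and 2.3% on the validation set and the test set, respectively. Visualization results show that the proposed method localizes objects of different scales more precisely and markedly improves segmentation under mutual occlusion and at the boundaries between different objects.

Conclusion

The proposed algorithm effectively improves the accuracy of the network in detecting and segmenting objects of different scales.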

Abstract

Objective

Instance segmentation is one of the essential tasks in image and video scene understanding, and accurate instance segmentation is widely used in real-world applications such as autonomous driving, medical image analysis, and video surveillance. It classifies and localizes the multiple targets in an image by predicting pixel-level instance masks. However, the targets in an image often vary in scale. For larger-scale targets, the receptive field may cover only a local area of the target, which can lead to detection errors or to incomplete and inaccurate segmentation. For smaller-scale targets, the receptive field often takes in much more background noise, so the target is easily misjudged as background, which also leads to detection errors. Recognition and segmentation accuracy is also lower at target boundaries and under occlusion. Most instance segmentation methods improve segmentation accuracy in general and do not offer a solution aimed specifically at multi-scale targets. To improve segmentation accuracy further, we develop an instance segmentation network based on the mask region-based convolutional neural network (Mask R-CNN), using an improved feature pyramid network (FPN) and multi-scale context information.

Method

First, an attention-guided feature pyramid network (AgFPN) is proposed, which optimizes the way FPN fuses adjacent-layer features through an adaptive adjacent-layer feature fusion module (AFAFM). To learn multi-scale features effectively, AgFPN upsamples features by content-aware reassembly and applies a channel attention mechanism to weight the channels before adjacent-layer features are fused. Then, we design an attention feature fusion module (AFFM) and a global context module (GCM) based on multi-scale channel attention. By adding multi-scale context information to the region of interest (RoI) features, we enhance the multi-scale feature representation of the mask prediction branch and of the classification and regression branch, and thereby improve the quality of mask prediction for multi-scale objects.

The network proceeds as follows. First, AgFPN extracts multi-scale features. Next, multi-scale context information is extracted and fused within the network. The region proposal network (RPN) generates and filters the bounding boxes of candidate target regions. Meanwhile, multi-scale context information is derived from the output of AgFPN by AFFM and GCM. Then, to obtain a fixed-size feature map, RoIAlign maps each RoI onto the feature map, and the resulting RoI features are fused with the multi-scale context information. Finally, bounding box regression and mask prediction are performed on the fused features.

We implement the proposed algorithm with the deep learning framework PyTorch. The experiments run on Ubuntu 16.04, and four NVIDIA 1080Ti graphics processing units (GPUs) are used for acceleration. ResNet-50/101 serves as the backbone network, and weights pre-trained on ImageNet initialize the network parameters. For the Microsoft common objects in context 2017 (MS COCO 2017) dataset, we train with stochastic gradient descent (SGD) for 160 000 iterations. The initial learning rate is 0.002 and the batch size is 4. The learning rate is reduced by a factor of 10 at 130 000 iterations and again at 150 000 iterations. For the Cityscapes dataset, we set the batch size to 4 and the initial learning rate to 0.005, and train for 48 000 iterations. At 36 000 iterations, the learning rate is reduced to 0.000 5. The weight decay coefficient is 0.000 5 and the momentum coefficient is 0.9. The loss function and the related hyperparameters are set and initialized following the described strategy.
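The step learning-rate schedules stated above can be written down as a small sketch. This is an illustration of the reported numbers only, not the authors' training code: the function name `step_lr` and the two config dictionaries are hypothetical, and applying the drop exactly at the milestone iteration (rather than one iteration later) is an assumption.

```python
def step_lr(iteration, base_lr, milestones, gamma=0.1):
    """Return the learning rate at a given iteration for a step-decay schedule.

    The rate is multiplied by `gamma` once for every milestone that has been
    reached (assumption: the drop applies from the milestone iteration onward).
    """
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * (gamma ** drops)


# Settings as reported in the abstract (dictionary names are illustrative).
COCO = dict(base_lr=0.002, milestones=(130_000, 150_000))   # 160 000 iterations in total
CITYSCAPES = dict(base_lr=0.005, milestones=(36_000,))      # 48 000 iterations in total

# Usage: learning rate for each dataset at a few checkpoints.
coco_rates = [step_lr(i, **COCO) for i in (0, 130_000, 150_000)]
city_rates = [step_lr(i, **CITYSCAPES) for i in (0, 36_000)]
```

In PyTorch the same behaviour is usually obtained with a multi-step scheduler attached to the SGD optimizer. The plain function here only makes the reported milestones explicit.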

Result

The effectiveness of our method is evaluated through comprehensive experiments on two datasets, MS COCO 2017 and Cityscapes. On the COCO dataset, our algorithm improves on the Mask R-CNN baseline by 1.7% and 2.5% with ResNet50 and ResNet101 backbones, respectively. On the Cityscapes dataset, with ResNet50 as the backbone network, it is 2.1% and 2.3% higher than Mask R-CNN on the validation set and the test set, respectively. The ablation results show that AgFPN performs well and is easy to integrate into multiple detectors. Furthermore, through feature enhancement, the attention feature fusion module and the global context module improve average precision by 0.6% and 0.7%, respectively. When the two modules are combined, the benchmark performance is improved by 1.7%. The visualization results show that our method locates multi-scale targets more accurately, and that segmentation is significantly improved under mutual occlusion and at the boundaries between targets.

Conclusion

The experimental results show that, by exploiting the overall multi-scale context information of the target, our algorithm improves the multi-scale feature representation of the target. The algorithm is therefore effective: it further improves the accuracy of the network in detecting and segmenting targets at different scales.

Keywords

instance segmentation; mask region-based convolutional neural network (Mask R-CNN); feature pyramid network (FPN); multi-scale context information; multi-scale channel attention (MSCA)

