Multi-scale context information fusion for instance segmentation
Vol. 28, Issue 2, Pages: 495-509 (2023)
Published: 16 February 2023
Accepted: 17 January 2022
DOI: 10.11834/jig.211090
Xinjun Wan, Yiyun Zhou, Mingfei Shen, Tao Zhou, Fuyuan Hu. Multi-scale context information fusion for instance segmentation[J]. Journal of Image and Graphics, 28(2): 495-509 (2023)
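Objective
Instance segmentation classifies and localizes the different objects in an image by means of pixel-level instance masks. However, objects in an image often differ in scale, and multi-scale variation easily causes false and missed detections, which limits further gains in instance segmentation accuracy. Existing methods mainly extract multi-scale information with a feature pyramid network (FPN), but FPN fuses adjacent-layer features by interpolation and element-wise addition, which does not fully exploit the semantic information of features at different scales. Building on Mask R-CNN (mask region-based convolutional neural network), this paper therefore proposes an attention-guided feature pyramid network and fully fuses multi-scale context information for instance segmentation.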
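Method
First, an adjacent-layer feature adaptive fusion module is designed to optimize adjacent-layer feature fusion in FPN: features are upsampled by content-aware reassembly, and a channel attention mechanism weights the channels before adjacent features are fused, which strengthens semantic consistency and alleviates semantic aliasing between objects of different scales in adjacent layers. Second, multi-scale channel attention is used to design an attention feature fusion module and a global context module, which fuse region of interest (RoI) features with multi-scale context information, enhance the multi-scale feature representation of the classification/regression and mask prediction branches, and thereby improve mask prediction quality for objects of different scales.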
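The adjacent-layer fusion described above (upsample the coarser level, weight channels with channel attention, then add to the finer level) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the squeeze-and-excitation style attention, the random weights `w1` and `w2`, the reduction ratio `r`, the choice to weight the upsampled feature, and the nearest-neighbour upsampling used in place of content-aware reassembly are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style channel weights for a (C, H, W) feature map."""
    squeezed = feat.mean(axis=(1, 2))        # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # bottleneck + ReLU -> (C // r,)
    return sigmoid(w2 @ hidden)              # per-channel weights in (0, 1) -> (C,)

def upsample_nearest_2x(feat):
    """Nearest-neighbour 2x upsampling (assumed stand-in for content-aware reassembly)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fuse_adjacent(coarse, fine, w1, w2):
    """Upsample the coarser level, reweight its channels, then add it to the finer level."""
    up = upsample_nearest_2x(coarse)
    weights = channel_attention(up, w1, w2)
    return fine + weights[:, None, None] * up

# Hypothetical sizes and random (untrained) weights, for shape checking only.
rng = np.random.default_rng(0)
C, r = 8, 4
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
coarse = rng.standard_normal((C, 4, 4))  # lower-resolution pyramid level
fine = rng.standard_normal((C, 8, 8))    # adjacent higher-resolution level
fused = fuse_adjacent(coarse, fine, w1, w2)
print(fused.shape)  # (8, 8, 8)
```

Plain FPN corresponds to setting every channel weight to 1; the learned weights let the network suppress channels of the upsampled feature that would conflict semantically with the finer level.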
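Result
Comprehensive experiments are conducted on the MS COCO 2017 (Microsoft common objects in context 2017) and Cityscapes datasets. On MS COCO 2017, the proposed algorithm improves on Mask R-CNN by 1.7% and 2.5% with ResNet50 and ResNet101 backbones, respectively. On Cityscapes, with a ResNet50 backbone, it improves on Mask R-CNN by 2.1% and 2.3% on the validation and test sets, respectively. Visualizations show that the proposed method localizes objects of different scales more precisely and markedly improves segmentation under mutual occlusion and at the boundaries between different objects.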
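Conclusion
The proposed algorithm effectively improves the network's detection and segmentation accuracy for objects of different scales.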
Objective
Instance segmentation is one of the essential tasks for image and video scene understanding, and precise instance segmentation is widely used in real-world scenarios such as autonomous driving, medical image analysis, and video surveillance. It classifies and localizes multiple objects in an image by means of pixel-level instance masks. However, objects often appear at multiple scales. For larger objects, the receptive field covers only a local area, which can lead to detection errors or to incomplete and inaccurate segmentation. For smaller objects, the receptive field tends to include much more background noise, so they are easily misjudged as background, which again causes detection errors. Recognition and segmentation accuracy are also lower at object boundaries and under occlusion. Most instance segmentation methods improve accuracy in general terms without a solution aimed specifically at multi-scale objects. To improve segmentation accuracy further, we develop an instance segmentation network based on the mask region-based convolutional neural network (Mask R-CNN), built on an improved feature pyramid network (FPN) and multi-scale context information.
Method
First, we present an attention-guided feature pyramid network (AgFPN), which optimizes the fusion of adjacent-layer features in FPN through an adaptive adjacent-layer feature fusion module (AFAFM). To learn multi-scale features effectively, AgFPN upsamples features by content-aware reassembly and applies a channel attention mechanism to weight channels before adjacent-layer features are fused. Then, we design an attention feature fusion module (AFFM) and a global context module (GCM) based on multi-scale channel attention. By adding multi-scale context information to region of interest (RoI) features, we enhance the multi-scale feature representation of the mask prediction branch and of the classification and regression branch, which improves the quality of mask prediction for multi-scale objects.
The network works as follows. First, AgFPN extracts multi-scale features. Next, multi-scale context information is extracted and fused: the region proposal network (RPN) generates and filters bounding boxes of candidate object regions, while AFFM and GCM derive multi-scale context information from the output of AgFPN. Then, the RoIAlign algorithm maps each RoI onto the feature map to obtain a fixed-size feature map, which is fused with the multi-scale context information. Finally, bounding box regression and mask prediction are performed on the fused features.
We implement the proposed algorithm in the deep learning framework PyTorch. The experiments run on Ubuntu 16.04, with four NVIDIA 1080Ti graphics processing units (GPUs) used for acceleration. ResNet-50/101 serves as the backbone network, and weights pre-trained on ImageNet initialize the network parameters. For the Microsoft common objects in context 2017 (MS COCO 2017) dataset, we train with stochastic gradient descent (SGD) for 160 000 iterations, with an initial learning rate of 0.002 and a batch size of 4; the learning rate is reduced by a factor of 10 at 130 000 and at 150 000 iterations. For the Cityscapes dataset, we set the batch size to 4 and the initial learning rate to 0.005 and train for 48 000 iterations; the learning rate is reduced to 0.000 5 at 36 000 iterations. The weight decay coefficient is set to 0.000 5 and the momentum coefficient to 0.9. The loss function and related hyperparameters are set and initialized according to the described strategy.
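The two training schedules quoted above can be written as a single step-decay rule. The sketch below is only an illustration in plain Python: the function name `learning_rate`, the assumption that the two COCO reductions compound (0.002, then 0.000 2, then 0.000 02), and the assumption that a reduction takes effect exactly at the stated iteration are ours, not details confirmed by the abstract.

```python
def learning_rate(iteration, base_lr, milestones, gamma=0.1):
    """Step decay: multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed

# Settings quoted in the abstract; the dictionary layout itself is hypothetical.
COCO = dict(base_lr=0.002, milestones=(130_000, 150_000))  # 160 000 iterations in total
CITYSCAPES = dict(base_lr=0.005, milestones=(36_000,))     # 48 000 iterations in total

for it in (0, 130_000, 150_000):
    print("COCO", it, learning_rate(it, **COCO))
for it in (0, 36_000):
    print("Cityscapes", it, learning_rate(it, **CITYSCAPES))
```

With `gamma=0.1`, the Cityscapes schedule reproduces the stated drop from 0.005 to 0.000 5 at 36 000 iterations, which is consistent with the same rule being used for both datasets.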
Result
The effectiveness of our method is evaluated through comprehensive experiments on the MS COCO 2017 and Cityscapes datasets. On the COCO dataset, our algorithm improves on the Mask R-CNN baseline by 1.7% and 2.5% with ResNet50 and ResNet101 backbones, respectively. On the Cityscapes dataset, with ResNet50 as the backbone network, results on the validation set and test set are 2.1% and 2.3% higher than those of Mask R-CNN, respectively. The ablation results show that AgFPN performs well and is easy to integrate into multiple detectors. Furthermore, the attention feature fusion module and the global context module, which augment the features, improve average accuracy by 0.6% and 0.7%, respectively; when the two modules are combined, performance improves by 1.7%. The visualization results show that our method localizes multi-scale objects more accurately, and that segmentation improves significantly under mutual occlusion and at the boundaries between objects.
Conclusion
The experimental results show that, by exploiting the overall multi-scale context information of the target, our algorithm improves the multi-scale feature representation of the target. The algorithm is therefore shown to further improve the accuracy of the network for object detection and segmentation at different scales.
Keywords: instance segmentation; mask region-based convolutional neural network (Mask R-CNN); feature pyramid network (FPN); multi-scale context information; multi-scale channel attention (MSCA)