多尺度特征融合工件目标语义分割
Semantic segmentation of workpiece target based on multiscale feature fusion
- 2020年25卷第3期 页码:476-485
收稿:2019-05-30,
修回:2019-09-10,
录用:2019-09-17,
纸质出版:2020-03-16
DOI: 10.11834/jig.190218
目的
目标语义特征提取效果直接影响图像语义分割的精度,传统的单尺度特征提取方法对目标的语义分割精度较低,为此,提出一种基于多尺度特征融合的工件目标语义分割方法,利用卷积神经网络提取目标的多尺度局部特征语义信息,并将不同尺度的语义信息进行像素融合,使神经网络充分捕获图像中的上下文信息,获得更好的特征表示,有效实现工件目标的语义分割。
方法
使用常用的多类工件图像定义视觉任务,利用残差网络模块获得目标的单尺度语义特征图,再结合本文提出的多尺度特征提取方式获得不同尺度的局部特征语义信息,通过信息融合获得目标分割图。使用上述方法经多次迭代训练后得到与视觉任务相关的工件目标分割模型,并对训练权重与超参数进行保存。
结果
将本文方法和传统的单尺度特征提取方法做定性和定量的测试实验,结果表明,获得的分割网络模型对测试集中的目标都具有较精确的分割能力,与单尺度特征提取方法相比,本文方法的平均交并比mIOU(mean intersection over union)指标在验证集上训练精度提高了4.52%,在测试集上分割精度提高了4.84%。当测试样本中包含的目标种类较少且目标边缘清晰时,本文方法能够得到更精准的分割结果。
结论
本文提出的语义分割方法,通过多尺度特征融合的方式增强了神经网络模型对目标特征的提取能力,使训练得到的分割网络模型比传统的单尺度特征提取方式在测试集上具有更优秀的性能,从而验证了所提出方法的有效性。
Objective
Image segmentation is one of the most challenging problems in computer vision and image processing and is an indispensable step in understanding and analyzing image information. A key difficulty is that the size and orientation of targets in an image make the image hierarchy unpredictable. Meanwhile, segmenting images with complex backgrounds, varying brightness, and varying textures remains an open problem. The quality of target semantic feature extraction directly influences the accuracy of image semantic segmentation. On an automated production line where targets are segmented by machine vision, the image capturing device mounted on the robot has a variable spatial relationship with the target while the robot operates. When the device captures images from different distances and angles, the target appears at different scales in the image, and the traditional single-scale feature extraction method yields low precision for semantic segmentation of such targets. This study shows how to use the context information of the image to build a multiscale feature fusion module, strengthen the ability to extract rich target features, and improve the segmentation performance of the network model.
Method
This paper proposes a workpiece target semantic segmentation method based on multiscale feature fusion. A convolutional neural network is used to extract multiscale local semantic features of the target, and the semantic information of different scales is fused pixel by pixel, so that the network fully captures the context information of the image and obtains a better feature representation, thereby effectively achieving semantic segmentation of the workpiece target. The method uses ResNet as the backbone network and combines the image pyramid idea to construct a multiscale feature fusion module. An image pyramid is simply a change in image resolution; although it yields a multiscale representation of the image, the output of the fourth block of ResNet is already a feature map with small spatial dimensions. Further reducing the resolution of this feature map is not conducive to the feature response of the network model and tends to increase the amount of computation the model must perform. Therefore, the resolution reduction operation of the original image pyramid is replaced with atrous (dilated) convolution, which effectively enlarges the receptive field of the filter without reducing the resolution of the feature map, so that richer local feature information of the image can be obtained. In this study, a three-layer image pyramid is used: the bottom layer is the feature map output by the Block4 layer, the middle layer consists of several parallel atrous convolution layers with different sampling rates that extract local feature information at different scales, and the top layer fuses the local feature information extracted by the middle layer.
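A minimal sketch of the multiscale feature fusion head described above, assuming a PyTorch implementation. The module name MsFFModule, the dilation rates, the channel widths, the class count, and fusion by channel concatenation are illustrative assumptions, not the exact configuration reported in the paper (which only states that parallel atrous branches on the Block4 feature map are fused pixel-wise).

```python
import torch
import torch.nn as nn

class MsFFModule(nn.Module):
    """Illustrative multiscale feature fusion head (not the authors' exact code).

    Bottom layer: the Block4 feature map from a ResNet backbone.
    Middle layer: parallel atrous (dilated) convolutions with different rates.
    Top layer: fusion of the parallel branches into per-pixel class scores.
    """
    def __init__(self, in_channels=2048, branch_channels=256,
                 num_classes=5, rates=(1, 6, 12, 18)):  # assumed values
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True))
            for r in rates
        ])
        # Fuse the concatenated branch outputs and predict per-pixel class scores.
        self.fuse = nn.Sequential(
            nn.Conv2d(branch_channels * len(rates), branch_channels, 1, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(branch_channels, num_classes, 1))

    def forward(self, block4_feat):
        multi_scale = [branch(block4_feat) for branch in self.branches]
        return self.fuse(torch.cat(multi_scale, dim=1))

# Usage: logits are later upsampled to the input resolution for the segmentation map.
feat = torch.randn(1, 2048, 32, 32)   # e.g. a Block4 output of ResNet
logits = MsFFModule()(feat)           # shape: (1, num_classes, 32, 32)
```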
Result
The proposed method is compared with the traditional single-scale feature extraction method through qualitative and quantitative experiments, with mean intersection over union (mIOU) as the evaluation index. Experiments show that the segmentation network model obtained by the proposed method segments the targets in the test set more accurately. Compared with the traditional single-scale feature extraction network, the mIOU of the proposed method on the test set improves by 4.84%. Compared with the network that also adopts the atrous convolution strategy, the parallel structure proposed in this paper improves the mIOU on the test set by 3.57%. Compared with the network that uses the atrous spatial pyramid pooling strategy to improve semantic segmentation, the mIOU of the proposed method on the test set improves by 2.24%. When a test sample contains fewer types of targets and the target edges are clearer, more accurate segmentation results can be obtained. To verify that the method has a certain generalization ability, this study also applies it to a tennis court scene dataset. This dataset includes nine categories of targets: tennis ball, rackets, inside the tennis court, court lines, outside the tennis court, nets, people, tennis court fence, and sky. The sizes and scales of these targets differ, which is consistent with the multiscale feature extraction idea proposed in this paper. With the parameter settings of the proposed method adopted unchanged and without tuning the network model for the tennis court scene dataset, the mIOU on its test set increases from 54.68% to 56.43%.
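For reference, a minimal sketch of how the mIOU evaluation index used above is commonly computed from a per-class confusion matrix. This follows the standard definition of mean intersection over union rather than the authors' evaluation code; the two-class example values are purely illustrative.

```python
import numpy as np

def mean_iou(pred, label, num_classes):
    """Standard mean intersection-over-union computed from a confusion matrix."""
    mask = (label >= 0) & (label < num_classes)
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    hist = np.bincount(num_classes * label[mask].astype(int) + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)   # avoid division by zero
    return iou.mean()

# Example with two classes (0 = background, 1 = workpiece):
pred  = np.array([0, 1, 1, 0])
label = np.array([0, 1, 0, 0])
print(mean_iou(pred, label, num_classes=2))     # (2/3 + 1/2) / 2 ≈ 0.583
```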
Conclusion
This study introduces a labeling method for multi-workpiece datasets and uses techniques such as data augmentation and a defined learning rate update rule to prevent overfitting during network training and to improve the basic performance of the network model. The network depth and the hyperparameter values used during training are determined through comparative experiments. In addition, a multiscale feature fusion module is designed to extract multiscale semantic information of the target. Multiscale feature fusion enhances the ability of the neural network model to extract target features, and the designed MsFFNet model extracts the semantic features of the target more accurately. The method can therefore perform the semantic segmentation task for vision-guided robotic grasping of targets on an automated production line, where the spatial position between the image capturing device and the target is variable. The network model determined on this specific dataset also provides a reference for subsequent workpiece detection. Future work will focus on the generalization ability of the method on datasets from other industrial scenes.
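The abstract only states that a learning-rate update rule is defined as one of the training measures, without naming it. As a hedged illustration only, the sketch below shows the "poly" decay policy that is commonly used in semantic segmentation training; it should not be read as the authors' actual schedule.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Common 'poly' learning-rate decay used by many segmentation models.

    Assumed example only: the paper states that a learning-rate update rule
    is defined, but not which one.
    """
    return base_lr * (1 - iteration / max_iterations) ** power

# Example: decay from an initial rate of 0.007 over 30000 iterations.
for it in (0, 15000, 29999):
    print(it, round(poly_lr(0.007, it, 30000), 6))
```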