Semantic segmentation of workpiece targets via multiscale feature fusion

He Chao, Zhang Yinhui, He Zifen (College of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming 650500, China)

Abstract
Objective The quality of target semantic feature extraction directly affects the accuracy of image semantic segmentation, and traditional single-scale feature extraction methods achieve only low segmentation accuracy. This paper therefore proposes a workpiece target semantic segmentation method based on multiscale feature fusion: a convolutional neural network extracts multiscale local semantic features of the target, and the semantic information at different scales is fused pixel-wise, so that the network fully captures the contextual information in the image, obtains a better feature representation, and effectively segments the workpiece target. Method The visual task is defined on images of several common workpiece classes. A residual network module produces a single-scale semantic feature map of the target, the proposed multiscale feature extraction scheme then obtains local semantic information at different scales, and information fusion yields the target segmentation map. After multiple training iterations, this procedure produces a workpiece segmentation model matched to the visual task, and the trained weights and hyperparameters are saved. Result The proposed method is compared with the traditional single-scale feature extraction method in qualitative and quantitative experiments. The results show that the trained segmentation model segments the targets in the test set accurately; compared with single-scale feature extraction, the mean intersection over union (mIOU) of the proposed method improves by 4.52% on the validation set and by 4.84% on the test set. When a test sample contains few target classes with clear edges, the method produces even more precise segmentation results. Conclusion The proposed semantic segmentation method strengthens the network's ability to extract target features through multiscale feature fusion, and the trained segmentation model outperforms the traditional single-scale approach on the test set, which verifies the effectiveness of the method.
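For reference, the mIOU figures quoted above follow the standard definition of mean intersection over union: with k target classes and per-class pixel counts of true positives TP_i, false positives FP_i, and false negatives FN_i,

\mathrm{mIOU} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}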
Keywords
Semantic segmentation of workpiece target based on multiscale feature fusion

He Chao, Zhang Yinhui, He Zifen(College of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming 650500, China)

Abstract
Objective Image segmentation is one of the most difficult problems in computer vision and image processing and is an indispensable step in understanding and analyzing image information. Segmentation remains hard because the size and orientation of targets in an image are unpredictable, and segmenting images with complex backgrounds, varying brightness, and varying textures is still an open problem. The quality of target semantic feature extraction directly determines the accuracy of image semantic segmentation. On an automated production line where targets are segmented by machine vision, the image capture device mounted on the robot has a variable spatial relationship with the target while the robot operates. When the device captures images from different distances and angles, the target appears at different scales in the image, and the traditional single-scale feature extraction method yields low semantic segmentation precision for such targets. This study shows how to exploit the contextual information of the image to build a multiscale feature fusion module, strengthen the extraction of rich target features, and improve the segmentation performance of the network model.

Method This paper proposes a workpiece target semantic segmentation method based on multiscale feature fusion. A convolutional neural network extracts multiscale local semantic features of the target, and the semantic information at different scales is fused pixel-wise so that the network fully captures the contextual information of the image and obtains a better feature representation, thereby effectively segmenting the workpiece target. The method uses ResNet as the backbone and draws on image pyramid theory to construct the multiscale feature fusion module. Although a conventional image pyramid, which simply varies image resolution, can provide a multiscale representation, the output of the fourth block of the ResNet is already a feature map of small spatial size; reducing its resolution further would weaken the feature response of the model and tend to increase the number of parameters to be computed. Therefore, the resolution reduction of the original image pyramid is replaced by atrous convolution, which effectively enlarges the receptive field of the filter without reducing the resolution of the feature map, so that rich local feature information can be fully extracted. A three-layer pyramid is used: the bottom layer is the feature map output by the Block4 layer, the middle layer consists of several parallel atrous convolution layers with different sampling rates that extract local feature information at different scales, and the top layer fuses the local feature information extracted by the middle layer.
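As a concrete illustration of this three-layer structure, the following is a minimal PyTorch sketch. The module name, channel widths, and sampling rates (6, 12, 18) are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Sketch of the three-layer pyramid described above (names and
    sampling rates are illustrative, not the paper's exact settings).
    Bottom layer: the Block4 feature map from a ResNet backbone.
    Middle layer: parallel atrous convolutions with different rates.
    Top layer: pixel-level fusion of the multiscale responses."""

    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # One branch per sampling rate; dilation enlarges the receptive
        # field without reducing feature-map resolution.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Fusion layer: concatenate branch outputs, then project with a 1x1 conv.
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * len(rates), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: Block4 output, e.g. (N, 2048, H/16, W/16) for ResNet-50/101.
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```

Concatenation followed by a 1x1 projection is one common way to realize the pixel-level fusion layer; an element-wise sum of the branch outputs would be an equally plausible reading of the fusion step described in the abstract.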
Result The proposed method is compared with the traditional single-scale feature extraction method in qualitative and quantitative experiments, with mean intersection over union (mIOU) as the evaluation metric. The experiments show that the segmentation network model obtained by this method segments the targets in the test set more accurately. Compared with the traditional single-scale feature extraction network, the mIOU of the proposed method on the test set improves by 4.84%; compared with a network that also adopts the atrous convolution strategy, the proposed parallel structure improves test-set mIOU by 3.57%; and compared with a network that uses the atrous spatial pyramid pooling strategy to improve semantic segmentation, the test-set mIOU of the proposed method improves by 2.24%. When a test sample contains fewer types of targets and the target edges are clearer, more accurate segmentation results are obtained. To verify the generalization of the method, it is also evaluated on a tennis court scene dataset, which contains nine target categories: tennis ball, racket, area inside the court, court lines, area outside the court, net, person, court fence, and sky. These targets differ considerably in size and scale, which matches the multiscale feature extraction idea proposed in this paper. With the parameter settings of the proposed method adopted unchanged, that is, without tuning the network for the tennis court dataset, the test-set mIOU increases from 54.68% to 56.43%.

Conclusion This study describes the annotation of the multi-workpiece dataset and uses techniques such as data augmentation and a defined learning-rate update rule to prevent overfitting during training and to improve the baseline performance of the network model. The network depth and the hyperparameter values used during training are determined through comparative experiments. A multiscale feature fusion module is designed to extract multiscale semantic information of the target; this fusion enhances the model's ability to extract target features, and the designed MsFFNet model extracts the semantic features of the target more accurately. The method can therefore perform the semantic segmentation task for vision-guided robotic grasping on an automated production line under the condition that the spatial position between the image capture device and the target is variable. The network model determined on this specific dataset provides a reference for subsequent workpiece detection, and future work will focus on the generalization ability to datasets from other industrial scenes.
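The abstract mentions a learning-rate update rule without specifying it. As a hedged sketch, segmentation networks of this family are often trained with the "poly" decay schedule; the snippet below wires it up in PyTorch, where the placeholder model, base learning rate, and step count are illustrative assumptions, not values reported in the paper.

```python
import torch

def poly_lr_lambda(max_steps, power=0.9):
    # Multiplicative decay factor for the base learning rate:
    # lr(step) = base_lr * (1 - step / max_steps) ** power
    return lambda step: (1.0 - step / max_steps) ** power

# Hypothetical usage; the model, lr, and max_steps are placeholders.
model = torch.nn.Conv2d(3, 8, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.007, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, poly_lr_lambda(max_steps=30000))

for step in range(3):
    optimizer.step()   # one training step (loss/backward omitted)
    scheduler.step()   # decay the learning rate
```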
Keywords
