
Liu Wen, Wang Hairong, Zhou Beijing (School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China)

Abstract
Objective Classic convolutional neural networks cannot meet the accuracy requirements of feature extraction for extremely small targets in images. To solve this problem, this paper builds on the DeepLabv3plus algorithm and introduces a feature map cut module into the down-sampling process, proposing DeepLabv3plus-IRCNet (IR for inverted residual, C for feature map cut), an image semantic segmentation method that supports feature extraction for extremely small targets. Method Features are extracted by a deep convolutional neural network composed of ordinary convolutional layers followed by a series of inverted residual modules built on depthwise separable convolutions. When the feature map resolution falls to 1/16 of the input image, the feature map cut module is introduced: each cut sub-map is enlarged separately, and features are extracted from all sub-maps with shared parameters. The output feature maps are then stitched together at their corresponding positions and fused with the feature map enlarged to the same size in the decoding stage, improving the model's ability to extract features of small target objects. Result The feature map cut module increases the model's attention to small target objects, fully exploits image context information, and fuses intermediate-layer features at multiple scales, improving image segmentation accuracy. To verify the method's effectiveness, it is evaluated on the CamVid (Cambridge-driving labeled video database) dataset; its mean intersection over union (mIoU) improves over the DeepLabv3plus model, confirming the effectiveness of the proposed method. Conclusion By fully accounting for the attention paid to small target objects in image segmentation, the proposed DeepLabv3plus-IRCNet model improves image segmentation accuracy.
DeepLabv3plus-IRCNet: an image semantic segmentation method for small target feature extraction

Liu Wen, Wang Hairong, Zhou Beijing (School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China)

Objective A huge amount of image data has been generated with the development of the Internet of Things and artificial intelligence technology and their widespread application in various fields. Understanding image content quickly and accurately, and automatically segmenting the target area of an image in accordance with the requirements of the application scene, have become the focus of many researchers. In recent years, image semantic segmentation methods based on deep learning have developed steadily. These methods are widely used in automatic driving and robot engineering and have become a primary research task in computer vision. Common convolutional neural networks (CNNs) can efficiently extract image features, but they typically operate directly on the entire feature map. Extremely small targets, however, occupy only a local area of an image, so an ordinary convolution operation cannot extract their features effectively. To solve this problem, a feature map cut module is introduced into the down-sampling process. Method At present, the spatial pyramid pooling module and the encoder-decoder structure of a deep CNN (DCNN) are the mainstream approaches to image semantic segmentation. The former extracts features from the input feature map with filters or pooling operations at multiple rates and effective fields of view, and thus encodes multi-scale context information. The latter captures clearer object boundaries by gradually recovering spatial information. However, several difficulties and challenges persist. The first problem is that a DCNN model places extremely high demands on the hardware platform and is unsuitable for real-time engineering applications. The second problem is that the resolution of the feature map shrinks after the image is encoded, resulting in the loss of the spatial information of some pixels.
The third problem is that the segmentation process cannot effectively consider the image context information (i.e., the relationship among pixels) and cannot fully utilize rich spatial location information. The fourth problem is that DCNNs are not good at capturing the feature expression of small objects, so achieving a good semantic segmentation effect is difficult. To solve these problems, this study proposes DeepLabv3plus-IRCNet, an improved image semantic segmentation algorithm based on DeepLabv3+, to address the difficulty DCNNs have in extracting the features of small objects. In the encoder, features are extracted by a DCNN composed of a series of ordinary convolutional layers followed by multiple inverted residual modules, in which depthwise separable convolutions are used instead of ordinary convolutions. When the resolution of the feature map is reduced to 1/16 of the input image, the feature map is divided into equal parts, each part is enlarged to the size before division, and features are extracted from every part by a feature extraction module whose parameters are shared across parts. Consequently, the model can focus better on small target objects in each local area after the cut. On the main network, the extracted feature map is then fed into the atrous spatial pyramid pooling (ASPP) module to capture the multi-scale context information of the image; atrous convolutions with dilation rates of {6, 12, 18} are applied in parallel with a 1×1 convolutional layer and image-level pooling, the same rate choice as in DeepLabv3+, which improves segmentation performance. A 1×1 convolution is then used to obtain the output tensor of the target feature map.
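The cut-enlarge-share step described above can be sketched as follows. This is an illustrative NumPy reconstruction, not the paper's implementation: the four-way split, the nearest-neighbour enlargement (the paper would use bilinear interpolation inside a network), and the shared 1×1 projection standing in for the shared feature extractor are all assumptions made for clarity.

```python
import numpy as np

def feature_map_cut(x, shared_w):
    """Sketch of the feature map cut module: split a (C, H, W) feature map
    into four equal tiles, enlarge each tile back to (H, W), run the SAME
    weights over every tile (parameter sharing), and stitch the outputs
    at their corresponding positions, giving a (C_out, 2H, 2W) map."""
    c, h, w = x.shape
    quadrants = [[x[:, :h // 2, :w // 2], x[:, :h // 2, w // 2:]],
                 [x[:, h // 2:, :w // 2], x[:, h // 2:, w // 2:]]]
    rows = []
    for row in quadrants:
        outs = []
        for tile in row:
            # 2x enlargement of the tile (nearest-neighbour here for simplicity)
            up = tile.repeat(2, axis=1).repeat(2, axis=2)
            # shared 1x1 convolution: one (C_out, C) weight matrix applied
            # per pixel, identical for all four tiles
            outs.append(np.einsum('oc,chw->ohw', shared_w, up))
        rows.append(np.concatenate(outs, axis=2))   # stitch left/right
    return np.concatenate(rows, axis=1)             # stitch top/bottom

x = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)
out = feature_map_cut(x, np.eye(3))
print(out.shape)  # (3, 16, 16): double the input resolution, as in the fusion step
```

Because the enlarged tiles are stitched at their original positions, the module's output has twice the resolution of its input, which matches the decoder-side fusion with a feature map "enlarged to the same size".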
In the decoder, bilinear interpolation is used to up-sample the encoder output by a factor of two, and the up-sampled feature map is then fused with the output feature map of the feature map cut module in the encoder. Several 3×3 depthwise separable convolutions refine the features, and bilinear interpolation is used for a further up-sampling step. Finally, an image semantic segmentation map of the same size as the input image is output. Result In this study, the CamVid (Cambridge-driving labeled video database) dataset is used to verify the proposed method. The mean intersection over union (mIoU) increases by 1.5 percentage points compared with the DeepLabv3+ model. The verification results show the effectiveness of the proposed method. Conclusion In this study, a feature map cut module is introduced to increase the model's attention to small objects and address the problem of low semantic segmentation accuracy.
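The depthwise separable convolutions used in the inverted residual modules and the decoder replace one standard k×k convolution with a per-channel k×k spatial filter followed by a 1×1 pointwise channel mix. A quick parameter count (plain arithmetic, with an example channel width chosen for illustration, not taken from the paper) shows why this reduces model cost:

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise k x k filter per input channel, plus a 1x1 pointwise
    convolution that mixes channels: c_in*k*k + c_in*c_out weights."""
    return c_in * k * k + c_in * c_out

# Example: a 3x3 layer mapping 256 -> 256 channels
std = conv_params(256, 256, 3)       # 589824 weights
sep = separable_params(256, 256, 3)  # 67840 weights
print(std, sep, round(std / sep, 1))
```

The separable form needs roughly 1/k² + 1/c_out of the standard layer's weights (about an 8.7× reduction in this example), which is what makes stacking many such modules in the encoder affordable.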