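Cao Fengmei, Tian Haijie, Fu Jun, Liu Jing (School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190)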
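Objective Research on image semantic segmentation based on fully convolutional networks has become the mainstream direction in the field. In this framework, however, repeated downsampling of the feature maps progressively lowers the image resolution, so small objects are lost, edges become coarse, and segmentation quality suffers. To solve or alleviate this problem, a semantic segmentation method based on feature map slicing is proposed. Method The method consists of two operations: slicing middle-layer feature maps and extracting features from the slices. The feature map slice module takes a middle-layer feature map, cuts it into several equal parts, and upsamples each part to the size of the original feature map, which raises the resolution of every sliced region. Each sliced feature map then passes through a feature extraction module with shared parameters; the multi-scale convolution and attention mechanism in this module exploit the contextual and discriminative information of each slice, so that the network attends more closely to small objects in local regions and discriminates them better. The extracted features are then fused with the original network output, which reuses middle-layer features more efficiently and clearly improves the recognition and localization of small objects, the refinement of segmentation edges, and the semantic discrimination of the network. Result Validation experiments on two urban road datasets, CamVid and GATECH, demonstrate the effectiveness of the method. The mean intersection-over-union reaches 66.3% on CamVid and 52.6% on GATECH. Conclusion The segmentation method based on feature map slicing makes better use of the spatial distribution of regions in an image, strengthens the network's ability to decide the semantic category at different spatial locations and its attention to small objects, provides more effective contextual and global information, improves the network's discrimination of small objects, and improves overall segmentation performance.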
Feature map slice for semantic segmentation
Cao Fengmei, Tian Haijie, Fu Jun, Liu Jing (School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)
Objective Deep convolutional neural networks have recently shown outstanding performance in object recognition and have become the first choice for dense classification problems such as semantic segmentation. Methods based on fully convolutional networks are now the main research direction in image semantic segmentation. However, repeated downsampling operations in these methods, such as pooling or strided convolution, significantly reduce the initial image resolution, which results in poor object delineation, loss of small targets, and weak segmentation output. Although several recent studies have addressed this problem, how to handle it effectively remains an open question that deserves further attention. This study proposes a feature map slice module for semantic segmentation to address it. Method The proposed method consists of two parts: slicing of middle-layer feature maps and a corresponding feature extraction network. The feature map slice module operates on a middle-layer feature map. The feature map is sliced into several small cubes, and each cube is then upsampled to the resolution of the original feature map, which enlarges the small targets in each local area. Each cube produced by the slice module corresponds to a subregion of the original feature map, so after upsampling the objects in that subregion are enlarged. Small objects that are difficult to detect in the entire feature map can therefore be treated as relatively large objects within their subregions. Feature extraction should accordingly focus on these small target objects, and a weight-shared feature extraction network is designed for the sliced feature maps.
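The slicing and upsampling step described above can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: a single-channel 2-D map stored as nested lists, a 2 x 2 slicing grid, and nearest-neighbor upsampling (the interpolation method is not given in the source). The names `slice_map` and `upsample_nearest` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def slice_map(fmap: Grid, rows: int, cols: int) -> List[Grid]:
    """Cut a 2-D map into rows * cols equal tiles, returned in row-major order."""
    h, w = len(fmap), len(fmap[0])
    if h % rows or w % cols:
        raise ValueError("map size must be divisible by the grid size")
    th, tw = h // rows, w // cols
    return [
        [row[c * tw:(c + 1) * tw] for row in fmap[r * th:(r + 1) * th]]
        for r in range(rows)
        for c in range(cols)
    ]


def upsample_nearest(tile: Grid, fy: int, fx: int) -> Grid:
    """Nearest-neighbor upsampling (an assumption): repeat each value fx times
    horizontally and each row fy times vertically."""
    out: Grid = []
    for row in tile:
        wide = [v for v in row for _ in range(fx)]
        out.extend(list(wide) for _ in range(fy))
    return out


# Toy 4 x 4 "feature map"; real feature maps also have a channel dimension.
fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

tiles = slice_map(fmap, 2, 2)                           # four 2 x 2 tiles
enlarged = [upsample_nearest(t, 2, 2) for t in tiles]   # each back to 4 x 4
```

After this step every tile has the spatial size of the original map, so an object that covered one cell of the original map now covers a 2 x 2 block in its tile.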
The feature extraction network applies multiple convolution operations with different kernel sizes to extract feature information at different scales. For each input of the network, the dimension is halved to save memory, and dilated convolution is adopted to enlarge the network's receptive field. We then concatenate the different feature maps obtained by the different convolution operations and add a channel-attention operation. Because the feature extraction network combines multi-scale convolution with an attention mechanism, it can extract information about the different semantic categories in each subregion that passes through it and can effectively provide the contextual, global, and discriminative information of each slice. Accordingly, the network can focus on small objects in local areas and discriminate small target objects better. Each cube passes through the feature extraction network, and the extracted features are assembled at their corresponding positions to form a complete mosaic feature map. The original network output is upsampled and fused with the mosaic feature map by an element-wise max operation. In this way, middle-layer features are reused efficiently. To exploit middle-layer feature information further, the module is introduced at multiple scales, which enhances the capability to extract small-target characteristics and spatial information in local areas. It also uses semantic information at different scales and clearly improves small-target feature extraction, segmentation edge refinement, and network discrimination. Result The proposed method is verified on two urban scene-understanding datasets, namely, CamVid and GATECH. Both datasets contain many common urban scene objects, such as buildings, cars, and cyclists. Several ablation experiments are conducted on the two datasets, and strong performance is achieved on both.
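The mosaic assembly and element-wise max fusion described in the Method can be sketched as follows. This is only an illustration under assumptions: the tiles stand in for features already produced by the extraction network, the network output is assumed to have been upsampled to the mosaic size beforehand, maps are single-channel nested lists, and the names `assemble_mosaic` and `fuse_max` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def assemble_mosaic(tiles: List[Grid], rows: int, cols: int) -> Grid:
    """Put row-major tiles back at their grid positions to form one large map."""
    if len(tiles) != rows * cols:
        raise ValueError("expected rows * cols tiles")
    mosaic: Grid = []
    for r in range(rows):
        band = tiles[r * cols:(r + 1) * cols]   # tiles in the same grid row
        for y in range(len(band[0])):
            line: List[float] = []
            for tile in band:
                line.extend(tile[y])            # join tile rows side by side
            mosaic.append(line)
    return mosaic


def fuse_max(a: Grid, b: Grid) -> Grid:
    """Element-wise maximum of two maps with identical shape."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("maps must have the same shape")
    return [[max(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Stand-ins for per-tile features (assumed already extracted).
tile_features = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
    [[13, 14], [15, 16]],
]
mosaic = assemble_mosaic(tile_features, 2, 2)

# Stand-in for the network output, assumed already upsampled to 4 x 4.
network_output = [[8] * 4 for _ in range(4)]
fused = fuse_max(mosaic, network_output)
```

With a max fusion, each position keeps whichever branch responds more strongly, so a strong response from a slice is not averaged away by a weak response in the original output.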
In particular, mean intersection-over-union (mIoU) scores of 66.3% and 52.6% are obtained on CamVid and GATECH, respectively. Conclusion The proposed method exploits the spatial distribution information of images, enhances the network's capability to determine the semantic categories at different spatial locations, pays closer attention to small target objects, and provides effective contextual and global information. The method is extended to multiple resolutions of the network because different resolutions provide rich scale information. It thus makes use of middle-layer feature information, improves the network's capability to discriminate small target objects, and enhances the overall segmentation performance of the network.
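For readers unfamiliar with the intersection-over-union metric used in the Result, the following minimal sketch computes a per-class IoU and averages it over classes. The evaluation protocol is not described in the abstract, so the details here are assumptions: labels are flat lists of class indices, classes absent from both prediction and ground truth are skipped, and no void label is handled. The name `mean_iou` is hypothetical.

```python
from typing import List


def mean_iou(pred: List[int], truth: List[int], num_classes: int) -> float:
    """Average of per-class IoU = |pred AND truth| / |pred OR truth|."""
    if len(pred) != len(truth):
        raise ValueError("prediction and ground truth must have the same length")
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel belongs to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Skip classes that never occur (an assumption about the protocol).
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in the inputs")
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, truth, num_classes=3)   # per-class IoU: 1/3, 2/3, 1/2
```

Because every class contributes equally to the average, a small class that is segmented poorly lowers the score as much as a large one, which is why the metric is sensitive to small-object performance.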