Domain-Adaptive Semantic Segmentation of Urban Scenes

Zhang Guimei1, Pan Guofeng1, Liu Jianxin2 (1. Institute of Computer Vision, Nanchang Hangkong University, Nanchang 330063, China; 2. School of Mechanical Engineering, Xihua University, Chengdu 610039, China)

Abstract
Objective The domain-adaptive segmentation network AdaptSegNet achieves good results on urban scene semantic segmentation. However, it trains adversarially on the source dataset GTA5 (Grand Theft Auto 5) and the target dataset Cityscapes directly, despite the large domain gap between them, and it uses a fixed learning rate for the adversarial learning at the network's different feature levels, so its segmentation accuracy still leaves room for improvement. To address these problems, a new domain-adaptive method for urban scene semantic segmentation is proposed. Method The semantic-aware grad-generative adversarial network (SG-GAN) is used to preprocess the synthetic dataset GTA5, generating a new dataset, SG-GTA5, which is closer to the real-scene dataset Cityscapes in gray-level, structure, and edge information; the generated dataset replaces the original GTA5 as the network input. To address AdaptSegNet's fixed learning rate, an adaptive learning rate is introduced into the adversarial learning at the different feature levels; it adaptively scales the loss value of each level so that the network parameters are updated dynamically. In addition, a convolution layer is added to the discriminator of the adversarial network to strengthen its discriminative ability. Result The method is validated on the real-scene dataset Cityscapes and compared with related domain-adaptive segmentation models. The results show that the proposed model better segments the more complex objects in urban traffic scenes: the mean intersection over union (mIoU) for sidewalk, wall, pole, car, and sky improves by 9.6%, 5.9%, 4.9%, 5.5%, and 4.8%, respectively. Conclusion The proposed method narrows the domain gap between the source and target datasets, reduces the adversarial loss during training, and avoids gradient explosion during back-propagation, thereby effectively improving the model's segmentation accuracy. The proposed adaptive learning rate further improves segmentation performance, and the convolution layer added to the discriminator lets the model learn more high-level semantic information, effectively alleviating class drift.
Keywords
Domain adaptation for semantic segmentation based on adaptive learning rate

Zhang Guimei1, Pan Guofeng1, Liu Jianxin2(1.Institute of Computer Vision, Nanchang Hangkong University, Nanchang 330063, China;2.School of Mechanical Engineering, Xihua University, Chengdu 610039, China)

Abstract
Objective Semantic segmentation is a core computer vision task that aims to densely assign a label to each pixel of an input image, such as person, car, road, pole, traffic light, or tree. Convolutional neural network-based approaches achieve state-of-the-art performance on various semantic segmentation tasks, with applications in autonomous driving, image editing, and video monitoring. Despite such progress, these models often rely on massive amounts of pixel-level labels. For real urban scene tasks, however, large amounts of labeled data are unavailable because annotating segmentation ground truth is highly labor intensive. When a labeled dataset is difficult to obtain, adversarial-training-based methods are preferred. These methods adapt by confusing a domain discriminator, performing domain alignment separately from task-specific learning under its own loss. Another challenge is the large difference between source data and target data in real scenarios. For instance, the distribution of appearance for objects and scenes may vary across places, and even weather and lighting conditions can change significantly at the same place. Such differences are often called "domain gaps" and can significantly degrade performance. Unsupervised domain adaptation seeks to overcome this problem by learning domain-invariant feature representations that bridge the source and target domains without using target labels. Such efforts have been made with deep networks like AdaptSegNet, which achieves good results on urban scene semantic segmentation. However, that network is trained directly on the synthetic dataset GTA5 and the real urban scene dataset Cityscapes, which exhibit a domain gap in gray-level, structure, and edge information. Moreover, a fixed learning rate is employed during the adversarial learning at the different feature levels.
In sum, segmentation accuracy needs to be improved. Method To handle these problems, a new domain adaptation method based on adversarial learning is proposed for urban scene semantic segmentation. First, to reduce the domain gap between the source and target datasets, the semantic-aware grad-generative adversarial network (SG-GAN) is introduced to preprocess the synthetic dataset GTA5. The generated dataset, SG-GTA5, is considerably closer to the urban scene dataset Cityscapes in gray-level, structure, and edge information, and it replaces the original GTA5 as the input of our network. Second, to further enhance the adapted model and address the fixed learning rate of AdaptSegNet, a multi-level adversarial network is constructed to perform output-space domain adaptation at different feature levels. Third, an adaptive learning rate is introduced at these feature levels; it adjusts the loss value of each level so that the network parameters can be updated dynamically. Fourth, a new convolution layer is added to the discriminator of the GAN, enhancing the discriminative ability of the network. For the discriminator, we use a fully convolutional architecture, replacing all fully connected layers to retain spatial information. It is composed of six convolution layers with 4×4 kernels, a stride of 2 for the first four layers and a stride of 1 for the fifth, and channel numbers of 64, 128, 256, 512, 1 024, and 1, respectively. Except for the last layer, each convolution layer is followed by a leaky ReLU parameterized by 0.2.
An up-sampling layer is added after the last convolution layer to rescale the output to the input size. No batch-normalization layers are used because we jointly train the discriminator with the segmentation network using a small batch size. For the segmentation network, it is essential to build upon a good baseline model to achieve high-quality segmentation results. We adopt the DeepLab-v2 framework with a ResNet-101 model pre-trained on ImageNet as our segmentation baseline. Similar to recent work on semantic segmentation, we remove the last classification layer and modify the stride of the last two convolution layers from 2 to 1, making the resolution of the output feature maps 1/8 of the input image size. To enlarge the receptive field, we apply dilated convolutions in the conv4 and conv5 layers with dilation rates of 2 and 4, respectively. After the last layer, we use atrous spatial pyramid pooling (ASPP) as the final classifier. We implement our network with the PyTorch toolbox on a single GTX 1080Ti GPU with 11 GB of memory. Result The new model is verified on the Cityscapes dataset. Experimental results demonstrate that the presented model precisely segments the more complex targets in urban traffic scenes and performs well against existing state-of-the-art segmentation models in accuracy and visual quality. The segmentation accuracies of sidewalk, wall, pole, car, and sky are improved by 9.6%, 5.9%, 4.9%, 5.5%, and 4.8%, respectively. Conclusion The effectiveness of the proposed model is validated on the real urban scene dataset Cityscapes.
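As a minimal sketch, the discriminator described above can be written in PyTorch as follows. The stride of the sixth (output) layer, the padding, and the number of input channels (taken here as 19 Cityscapes classes) are assumptions, since the text does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Fully convolutional discriminator: six 4x4 conv layers with
    channels 64, 128, 256, 512, 1024, 1, a leaky ReLU (slope 0.2)
    after all but the last layer, and no batch normalization."""

    def __init__(self, in_channels=19):  # assumed: 19 Cityscapes classes
        super().__init__()
        chans = [64, 128, 256, 512, 1024, 1]
        strides = [2, 2, 2, 2, 1, 1]  # last stride is an assumption
        layers, prev = [], in_channels
        for i, (c, s) in enumerate(zip(chans, strides)):
            layers.append(nn.Conv2d(prev, c, kernel_size=4, stride=s, padding=1))
            if i < len(chans) - 1:
                layers.append(nn.LeakyReLU(0.2, inplace=True))
            prev = c
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        out = self.model(x)
        # up-sample the score map back to the input resolution
        return F.interpolate(out, size=x.shape[2:], mode="bilinear",
                             align_corners=False)
```

During adversarial training, this network would receive the softmax output of the segmentation network and predict, per spatial location, whether it came from the source or the target domain.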
The segmentation precision is improved by the presented preprocessing scheme, which applies the SG-GAN model to the synthetic dataset GTA5 and makes the new dataset SG-GTA5 much closer to the urban scene dataset Cityscapes in gray-level, structure, and edge information. This preprocessing also effectively reduces the adversarial loss value and avoids gradient explosion during back-propagation. The network's learning capability is further strengthened, and the model's segmentation precision improved, by the presented adaptive learning rate, which adjusts the loss value of each adversarial layer, updates the network parameters dynamically, and optimizes the performance of the generator and discriminator networks. Finally, the discrimination capability of the proposed model is further improved by adding a new convolution layer to the discriminator, which enables the model to learn higher-level semantic information. The domain shift is also alleviated to some extent.
Keywords
