Domain adaptation for semantic segmentation of urban scenes based on an adaptive learning rate
2020, Vol. 25, No. 5: 913-925
Received: 2019-08-20; Revised: 2019-10-28; Accepted: 2019-11-04; Published in print: 2020-05-16
DOI: 10.11834/jig.190424
Objective
The domain adaptation segmentation network AdaptSegNet achieves good results in the semantic segmentation of urban scenes, but it performs adversarial training directly on the source dataset GTA5 (Grand Theft Auto V) and the target dataset Cityscapes, between which a large domain gap exists, and it uses a fixed learning rate for the adversarial learning across the different feature layers of the network, so its segmentation accuracy still needs improvement. To address these problems, a new domain-adaptive semantic segmentation method for urban scenes is proposed.
Method
The SG-GAN (semantic-aware grad generative adversarial network) method is used to preprocess the synthetic GTA5 dataset, generating a new dataset, SG-GTA5, that is closer to the real-scene Cityscapes dataset in gray-level, structural, and edge information; this newly generated dataset replaces the original GTA5 dataset as the network input. To address the fixed learning rate used by AdaptSegNet, an adaptive learning rate is introduced for the adversarial learning at the different feature layers of the network; it adaptively adjusts the loss values of the different feature layers so that the network parameters are updated dynamically. In addition, a convolution layer is added to the discriminator of the adversarial network to strengthen its discriminative ability.
Result
The method is validated on the real-scene Cityscapes dataset and compared with related domain-adaptive segmentation models. The results show that the proposed network segments the more complex objects in urban traffic scenes better: the mean intersection over union (mIoU) for sidewalk, wall, pole, car, and sky is improved by 9.6%, 5.9%, 4.9%, 5.5%, and 4.8%, respectively.
Conclusion
The proposed method reduces the domain gap between the source and target datasets, lowers the adversarial loss during training, and avoids the gradient explosion that can occur during back-propagation, thereby effectively improving the segmentation accuracy of the network model. The proposed adaptive learning rate further improves segmentation performance, and the convolution layer newly added to the discriminator network learns more high-level semantic information from the images, effectively alleviating the class-drift problem.
Objective
Semantic segmentation is a core computer vision task that aims to densely assign a label, such as person, car, road, pole, traffic light, or tree, to every pixel of an input image. Convolutional neural network-based approaches achieve state-of-the-art performance on various semantic segmentation tasks, with applications in autonomous driving, image editing, and video monitoring. Despite such progress, these models often rely on massive amounts of pixel-level labels. However, for real urban scene tasks, large amounts of labeled data are unavailable because of the high labor cost of annotating segmentation ground truth. When a labeled dataset is difficult to obtain, adversarial-training-based methods are preferred; these methods adapt by confusing a domain discriminator, performing domain alignment separately from task-specific learning under its own loss. Another challenge is that a large difference exists between source data and target data in real scenarios. For instance, the distribution of appearance for objects and scenes may vary across places, and even weather and lighting conditions can change significantly at the same place. Such differences are often called "domain gaps" and can significantly degrade performance. Unsupervised domain adaptation seeks to overcome these problems without target-domain labels: it aims to bridge the source and target domains by learning domain-invariant feature representations. Such efforts have been made with deep networks like AdaptSegNet, which has achieved good results in the semantic segmentation of urban scenes. However, that network is trained directly on the synthetic dataset GTA5 and the real urban scene dataset Cityscapes, which exhibit a domain gap in gray-level, structural, and edge information, and it employs a fixed learning rate during the adversarial learning of the different feature layers. In sum, its segmentation accuracy still needs to be improved.
Method
To handle these problems, a new domain adaptation method is proposed for urban scene semantic segmentation. Knowledge transfer, or domain adaptation, is used to close the gap between the source and target domains, and this work is based on adversarial learning. First, the semantic-aware grad generative adversarial network (SG-GAN) is introduced to preprocess the synthetic GTA5 dataset, generating a new dataset, SG-GTA5, that is considerably closer to the urban scene dataset Cityscapes in gray-level, structural, and edge information and is therefore suitable to substitute for the original GTA5 dataset in AdaptSegNet. Second, the newly generated SG-GTA5 dataset is used as the input of our network. To further enhance the adapted model and address the fixed learning rate of AdaptSegNet, a multi-level adversarial network is constructed to perform output-space domain adaptation effectively at different feature levels. Third, an adaptive learning rate is introduced at the different feature levels of the network. Fourth, the loss values of the different levels are adjusted by this adaptive learning rate, so the network parameters can be updated dynamically (a sketch follows below). Fifth, a new convolution layer is added to the discriminator of the GAN, which enhances the discriminative ability of the network.
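The abstract fixes only the intent of the third and fourth steps, not the exact weighting rule, so the following is a minimal PyTorch sketch under our own assumptions: each level's adversarial loss gets a coefficient recomputed from the currently observed losses rather than a constant, so a level whose loss spikes is damped instead of dominating the gradient. The proportional rule used here is a hypothetical stand-in for the paper's scheme.

```python
# Hypothetical sketch: adaptively re-weight per-level adversarial losses.
# The actual rule proposed in the paper is not given in the abstract.

def combine_adversarial_losses(adv_losses, base_weights):
    """adv_losses: list of scalar loss tensors, one per feature level.
    base_weights: fixed trade-off constants, as in AdaptSegNet."""
    total = sum(loss.detach() for loss in adv_losses) + 1e-8
    combined = 0.0
    for loss, weight in zip(adv_losses, base_weights):
        # Down-weight a level whose loss currently dominates; this damps
        # large gradients and helps avoid gradient explosion during
        # back-propagation (an effect the paper also reports).
        alpha = 1.0 - loss.detach() / total
        combined = combined + weight * alpha * loss
    return combined
```

With two output levels, as in AdaptSegNet, base_weights would be the two fixed trade-off constants of the original method; the adaptive factor alpha then rescales them at every training step.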
For the discriminator, we use an architecture consisting entirely of convolution layers, replacing all fully connected layers to retain spatial information. It is composed of six convolution layers with 4×4 kernels, with a stride of 2 for the first four layers, a stride of 1 for the fifth, and channel numbers of 64, 128, 256, 512, 1 024, and 1, respectively. Except for the last layer, each convolution layer is followed by a leaky ReLU parameterized by 0.2. An up-sampling layer is added after the last convolution layer to re-scale the output to the input size. No batch-normalization layers are used because we jointly train the discriminator with the segmentation network using a small batch size.
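The text above specifies this discriminator almost completely; the PyTorch sketch below fills in only the padding and the number of input channels (assumed to be the 19-class output maps, as in output-space adaptation) as our own choices.

```python
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Six 4x4 convolutions (64, 128, 256, 512, 1024, 1 channels), stride 2
    for the first four and 1 afterwards, LeakyReLU(0.2) after every layer
    except the last, no batch normalization, and a final up-sampling."""

    def __init__(self, in_channels=19):  # assumption: 19 Cityscapes classes
        super().__init__()
        channels = [in_channels, 64, 128, 256, 512, 1024, 1]
        strides = [2, 2, 2, 2, 1, 1]
        layers = []
        for i in range(6):
            layers.append(nn.Conv2d(channels[i], channels[i + 1],
                                    kernel_size=4, stride=strides[i],
                                    padding=1))
            if i < 5:  # no activation after the last (classifier) layer
                layers.append(nn.LeakyReLU(0.2, inplace=True))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        score = self.model(x)
        # Re-scale the single-channel real/fake map to the input size.
        return F.interpolate(score, size=x.shape[2:], mode='bilinear',
                             align_corners=False)
```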
For the segmentation network, it is essential to build upon a good baseline model to achieve high-quality segmentation results. We adopt the DeepLab-v2 framework with a ResNet-101 model pre-trained on ImageNet as our segmentation baseline. Similar to recent work on semantic segmentation, we remove the last classification layer and modify the stride of the last two convolution layers from 2 to 1, making the resolution of the output feature maps 1/8 of the input image size. To enlarge the receptive field, we apply dilated convolutions in the conv4 and conv5 layers with dilation rates of 2 and 4, respectively. After the last layer, we use atrous spatial pyramid pooling (ASPP) as the final classifier. The batch normalization (BN) layers are removed because the discriminator is trained jointly with the generator network using a small batch size. We implement our network using the PyTorch toolbox on a single GTX 1080Ti GPU with 11 GB of memory.
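As a rough illustration of this baseline, the backbone and classifier can be assembled from torchvision as follows; the ASPP dilation rates (6, 12, 18, 24) are those of the original DeepLab-v2 paper, while the layer slicing and class count are our assumptions.

```python
import torch.nn as nn
from torchvision import models

class ASPPv2(nn.Module):
    """DeepLab-v2-style ASPP: parallel dilated 3x3 convolutions, summed."""

    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, num_classes, kernel_size=3,
                      padding=r, dilation=r) for r in rates)

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

class SegmentationBaseline(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        # replace_stride_with_dilation turns the stride-2 convolutions of
        # the last two stages (conv4/conv5) into dilated ones with rates
        # 2 and 4, giving feature maps at 1/8 of the input resolution.
        backbone = models.resnet101(
            pretrained=True,
            replace_stride_with_dilation=[False, True, True])
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.classifier = ASPPv2(2048, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))
```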
Result
The new model is verified on the Cityscapes dataset. Experimental results demonstrate that the presented model segments the more complex targets in urban traffic scenes precisely, and it performs well against existing state-of-the-art segmentation models in terms of accuracy and visual quality. The mean intersection over union (mIoU) for sidewalk, wall, pole, car, and sky is improved by 9.6%, 5.9%, 4.9%, 5.5%, and 4.8%, respectively.
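For reference, the quoted figures use the standard mean intersection over union: per-class IoU computed from a confusion matrix and averaged over classes. A minimal NumPy sketch (the void-pixel handling is our assumption):

```python
import numpy as np

def mean_iou(pred, label, num_classes=19):
    """pred, label: integer arrays of equal shape holding class indices."""
    mask = (label >= 0) & (label < num_classes)  # ignore void pixels
    hist = np.bincount(num_classes * label[mask] + pred[mask],
                       minlength=num_classes ** 2
                       ).reshape(num_classes, num_classes)
    intersection = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)
    return iou, iou.mean()
```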
Conclusion
The effectiveness of the proposed model is validated on the real urban scene dataset Cityscapes. Segmentation precision is improved through the presented preprocessing scheme, which applies the SG-GAN model to the synthetic GTA5 dataset and makes the newly generated SG-GTA5 dataset much closer to the urban scene dataset Cityscapes in gray-level, structural, and edge information. The presented data preprocessing also reduces the adversarial loss effectively and avoids gradient explosion during back-propagation. The network's learning capability is further strengthened, and the model's segmentation precision improved, by the presented adaptive learning rate, which adjusts the loss value of each adversarial layer; it also updates the network parameters dynamically and optimizes the performance of the generator and discriminator networks. Finally, the discrimination capability of the proposed model is further improved by adding a new convolution layer in the discriminator, which enables the model to learn higher-level semantic information and alleviates the domain shift to some extent.
Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L. 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI: 10.1109/TPAMI.2017.2699184]
Chen Y H, Chen W H, Chen Y T, Tsai B C, Frank Wang Y C and Sun M. 2017. No more discrimination: cross city adaptation of road scene segmenters//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 1992-2001 [DOI: 10.1109/ICCV.2017.220]
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S and Schiele B. 2016. The Cityscapes dataset for semantic urban scene understanding//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE: 3213-3223 [DOI: 10.1109/CVPR.2016.350]
Dai J F, He K M and Sun J. 2015. BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1635-1643 [DOI: 10.1109/ICCV.2015.191]
Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M and Lempitsky V. 2015. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1): 2096-2030 [DOI: 10.1007/978-3-319-58347-1_10]
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2672-2680
Hoffman J, Tzeng E, Park T, Zhu J Y, Isola P, Saenko K, Efros A and Darrell T. 2017. CyCADA: cycle-consistent adversarial domain adaptation//Proceedings of the 35th International Conference on Machine Learning (ICML). [s.l.]: PMLR: 1994-2003
Hoffman J, Wang D Q, Yu F and Darrell T. 2016. FCNs in the wild: pixel-level adversarial and constraint-based adaptation [EB/OL]. [2019-08-15]. https://arxiv.org/pdf/1612.02649.pdf
Hong S, Noh H and Han B. 2015. Decoupled deep neural network for semi-supervised semantic segmentation//Proceedings of the 29th International Conference on Neural Information Processing Systems. Montreal, Canada: Neural Information Processing Systems Foundation: 1495-1503
Johnson-Roberson M, Barto C, Mehta R, Sridhar S N, Rosaen K and Vasudevan R. 2017. Driving in the matrix: can virtual worlds replace human-generated annotations for real world tasks?//Proceedings of 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore: IEEE: 746-753 [DOI: 10.1109/ICRA.2017.7989092]
Khoreva A, Benenson R, Hosang J, Hein M and Schiele B. 2017. Simple does it: weakly supervised instance and semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE: 876-885 [DOI: 10.1109/CVPR.2017.181]
LeCun Y, Bengio Y and Hinton G. 2015. Deep learning. Nature, 521(7553): 436-444 [DOI: 10.1038/nature14539]
Li P L, Liang X D, Jia D Y and Xing E P. 2018. Semantic-aware grad-GAN for virtual-to-real urban scene adaption [EB/OL]. [2019-08-05]. https://arxiv.org/pdf/1801.01726.pdf
Long M S, Cao Y, Wang J M and Jordan M I. 2015. Learning transferable features with deep adaptation networks//Proceedings of the 32nd International Conference on Machine Learning (ICML). Lille, France: ACM: 97-105
Luo Y W, Zheng L, Guan T, Yu J Q and Yang Y. 2019. Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [s.l.]: [s.n.]: 2507-2516
Papandreou G, Chen L C, Murphy K and Yuille A L. 2015. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE Computer Society: 1742-1750 [DOI: 10.1109/ICCV.2015.203]
Pathak D, Krahenbuhl P and Darrell T. 2015. Constrained convolutional neural networks for weakly supervised segmentation//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1796-1804 [DOI: 10.1109/ICCV.2015.209]
Qu S R, Xi Y L and Ding S T. 2018. Image caption description of traffic scene based on deep learning. Journal of Northwestern Polytechnical University, 36(3): 522-527 [DOI: 10.3969/j.issn.1000-2758.2018.03.017]
Richter S R, Hayder Z and Koltun V. 2017. Playing for benchmarks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2213-2222 [DOI: 10.1109/ICCV.2017.243]
Tsai Y H, Hung W C, Schulter S, Sohn K, Yang M H and Chandraker M. 2018. Learning to adapt structured output space for semantic segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, UT, USA: IEEE: 7472-7481 [DOI: 10.1109/CVPR.2018.00780]
Zhang Y, David P and Gong B Q. 2017. Curriculum domain adaptation for semantic segmentation of urban scenes//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2020-2030 [DOI: 10.1109/ICCV.2017.223]
Zhu J Y, Park T, Isola P and Efros A A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2223-2232 [DOI: 10.1109/ICCV.2017.244]