Conditional Diffusion Model for Precise Image Translation in Driving Scenarios

Xu Yingfen1, Hu Xuemin1, Huang Tingyu1, Li Shen1, Chen Long2 (1. Hubei University; 2. Institute of Automation, Chinese Academy of Sciences)

Abstract
Objective To address the problems of scarce paired data samples, imprecise translation results, and unstable model training in virtual-to-real driving scene translation, a conditional diffusion model with multi-modal data fusion is proposed. Method First, to overcome the mode collapse and unstable training of current mainstream image translation methods based on generative adversarial networks, the image translation model is built on the diffusion model, which offers strong generative diversity and good training stability. Second, to address the problem that traditional diffusion models cannot incorporate prior information and therefore cannot control image generation, a multi-modal feature fusion method based on the multi-head self-attention mechanism is proposed; it injects multi-modal information into the denoising process of the diffusion model and thereby provides conditional control. Finally, exploiting the fact that semantic segmentation maps and depth maps characterize object contours and depth, respectively, the two maps are fused with the noise image and fed into the denoising network, yielding a conditional diffusion model with multi-modal data fusion and thus more precise driving-scene image translation. Result The proposed model is trained on the Cityscapes dataset and compared with recent state-of-the-art methods. The results show that our method produces driving-scene translations with finer contour details and more consistent distances, and achieves better scores on the Fréchet inception distance (FID) and learned perceptual image patch similarity (LPIPS) metrics, namely 44.20 and 0.377, respectively. Conclusion The proposed method effectively alleviates the problems of scarce data samples, imprecise translation results, and unstable training in existing image translation methods, improves the translation precision of driving scenes, and provides theoretical support and a data basis for safe and practical autonomous driving.
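The abstract describes fusing the noise image with the semantic segmentation and depth maps through a multi-head self-attention mechanism before they enter the denoising network. The following is a minimal PyTorch sketch of such a fusion block, written only to illustrate the idea; the module name, channel sizes, and layer layout are our own assumptions and are not taken from the paper.

```python
# Illustrative sketch of multi-modal fusion: the noisy image, segmentation map,
# and depth map are concatenated channel-wise (early fusion), projected by a
# convolution, and fused with multi-head self-attention over spatial positions.
# All names and dimensions are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    def __init__(self, in_channels=3 + 3 + 1, embed_dim=128, num_heads=8):
        super().__init__()
        # Early fusion: channel-wise concatenation followed by a conv projection
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, noisy_img, seg_map, depth_map):
        # Stack the three modalities along the channel axis: (B, C, H, W)
        x = torch.cat([noisy_img, seg_map, depth_map], dim=1)
        feat = self.proj(x)                           # (B, D, H, W)
        b, d, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)      # (B, H*W, D)
        # Self-attention lets image, segmentation, and depth features interact
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(fused + tokens)             # residual connection
        return fused.transpose(1, 2).reshape(b, d, h, w)


# Example usage with random tensors
if __name__ == "__main__":
    fusion = MultiModalFusion()
    rgb = torch.randn(2, 3, 64, 64)
    seg = torch.randn(2, 3, 64, 64)   # e.g. a color-encoded segmentation map
    dep = torch.randn(2, 1, 64, 64)
    print(fusion(rgb, seg, dep).shape)  # torch.Size([2, 128, 64, 64])
```

The fused feature map keeps the spatial layout of the input, so it can replace a self-attention sub-layer inside a U-Net-style denoising network, as the abstract suggests.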
Keywords
Precise image translation based on conditional diffusion model for driving scenarios

Xu Yingfen1, Hu Xuemin1, Huang Tingyu1, Li Shen1, Chen Long2 (1. Hubei University; 2. Institute of Automation, Chinese Academy of Sciences, Beijing)

Abstract
Objective Safety is the most significant consideration for autonomous vehicles. New autonomous driving methods require extensive training and testing before they can be deployed in real vehicles, but training and testing them directly in real-world scenarios is costly and risky. Many researchers therefore first train and test their methods in simulated scenarios and then transfer the learned knowledge to real-world scenarios. However, the many differences between the two domains in scene modeling, lighting, vehicle dynamics, and other factors mean that an autonomous driving model trained in simulation cannot generalize well to real-world scenarios. With the development of deep learning, image translation, which aims to transform the content of an image from one presentation form to another, has achieved good results in many fields such as image beautification, style transfer, scene design, and video special effects. Applying image translation to convert simulated driving scenes into realistic ones can not only alleviate the poor generalization of autonomous driving models but also effectively reduce the cost and risk of training in real scenarios. Unfortunately, existing image translation methods applied to autonomous driving lack datasets of paired simulated and real scenes, and most mainstream methods are based on generative adversarial networks (GANs), which suffer from mode collapse and unstable training. The generated images also exhibit many detail problems, such as distorted object contours and unnatural small objects in the scene, which degrade not only the perception and, in turn, the decision making of autonomous driving, but also the evaluation metrics of image translation. In this paper, we propose a multi-modal conditional diffusion model based on the denoising diffusion probabilistic model (DDPM), which has achieved remarkable success in various image generation tasks, to address the problems of scarce paired simulated-real data, mode collapse, unstable training, and insufficient diversity of generated data in existing image translation methods. Method First, to solve the problems of mode collapse and unstable training in mainstream GAN-based image translation methods, we build an image translation method on the diffusion model, which offers good training stability and generative diversity. Second, to address the problem that traditional diffusion models cannot incorporate prior information and therefore cannot control the image generation process, a multi-modal feature fusion method based on the multi-head self-attention mechanism is developed. The method feeds the early-fused data into convolutional layers to extract high-level features and then obtains high-level fused feature vectors through multi-head self-attention. Finally, inspired by the fact that semantic segmentation maps and depth maps precisely represent contour and depth information, respectively, we design the conditional diffusion model (CDM) by fusing the semantic segmentation and depth maps with the noise image before sending them to the denoising network, where the segmentation map, depth map, and noise image can perceive each other through the proposed multi-modal feature fusion method.
The output fusion features are fed to the next sub-layer of the network. After the iterative denoising process, the final output of the denoising network contains both semantic and depth information, so the semantic segmentation and depth maps play a conditional guiding role in the diffusion model. Following the settings of the DDPM, we use a U-Net as the denoising network; compared with the U-Net in the DDPM, its self-attention layers are replaced with the improved self-attention proposed in this paper to better learn the fused features. After the denoising network of the CDM is trained, the proposed model can be applied to simulated-to-real image translation: we add noise to simulated images collected from the CARLA simulator, feed the paired semantic segmentation and depth maps to the denoising network for a step-by-step denoising process, and finally obtain realistic driving-scene images, thereby achieving image translation with more precise contour details and more consistent distances between simulated and real images. Result We trained our model on the Cityscapes dataset and compared it with state-of-the-art (SOTA) methods from recent years. Experimental results indicate that our approach achieves better translation results with improved semantic precision and finer contour details. Our evaluation metrics are the Fréchet inception distance (FID) and the learned perceptual image patch similarity (LPIPS): a lower FID indicates better generation quality, i.e., a smaller gap between the generated and real image distributions, while a higher LPIPS indicates better generation diversity. Compared with the four SOTA baselines, our method achieves better results on both FID and LPIPS, with scores of 44.20 and 0.377, respectively. Conclusion In this paper, we propose a novel image-to-image translation method for autonomous driving scenarios based on a conditional diffusion model and a multi-modal fusion method with a multi-head attention mechanism. The experimental results show that our method effectively addresses the problems of insufficient paired datasets, imprecise translation results, unstable training, and insufficient generation diversity in existing image translation methods, improves the image translation precision of driving scenarios, and provides theoretical support and a data basis for realizing safe and practical autonomous driving systems.
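The simulated-to-real translation step described above (perturb a simulated frame with noise, then denoise it step by step while conditioning on the paired segmentation and depth maps) follows the standard DDPM reverse process. The sketch below illustrates that loop under standard DDPM assumptions; `denoise_net`, its call signature, the linear beta schedule, and all tensor names are placeholders of ours, not the authors' released code.

```python
# Schematic conditional DDPM reverse process for simulated-to-real translation.
# A simulated image is noised via the forward process and then iteratively
# denoised, with segmentation and depth maps passed as conditions at every step.
import torch


@torch.no_grad()
def translate(denoise_net, sim_img, seg_map, depth_map, T=1000, device="cuda"):
    # Linear noise schedule (a common DDPM choice, assumed here)
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Forward process: perturb the simulated image at the final timestep
    noise = torch.randn_like(sim_img)
    x = torch.sqrt(alpha_bars[-1]) * sim_img + torch.sqrt(1 - alpha_bars[-1]) * noise

    # Reverse process: step-by-step denoising conditioned on segmentation and depth
    for t in reversed(range(T)):
        t_batch = torch.full((x.size(0),), t, device=device, dtype=torch.long)
        eps = denoise_net(x, t_batch, seg_map, depth_map)   # predicted noise
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x  # translated, real-style driving image
```

Because the conditions enter through the denoising network at every step, the contours from the segmentation map and the distances from the depth map constrain the whole trajectory of the reverse process rather than only the final sample, which is what gives the conditional guidance described in the abstract.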
Keywords
