Tan Mingkui, Xu Shoukai, Zhang Shuhai, Chen Qi (School of Software Engineering, South China University of Technology, Guangzhou 510000)

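Abstract
Deep visual generation is a popular direction in computer vision that aims to enable computers to automatically generate the expected visual content from input data. It applies artificial intelligence to related industries and drives their reform and transformation toward automation and intelligence. Generative adversarial networks (GANs) are an effective tool for deep visual generation; they have received great attention in recent years and have become a rapidly developing research direction. GANs can accept input data of multiple modalities, including noise, images, text and video, and generate images and videos through an adversarial game. They have been applied successfully to many visual generation tasks. Using GANs to achieve realistic, diverse and controllable visual generation is of significant research value. This paper reviews recent work on deep adversarial visual generation. It first introduces the background of deep visual generation and typical generative models, then outlines the related algorithms according to the mainstream tasks of deep adversarial visual generation, summarizes the pain points the field currently faces, and on this basis analyzes its future development trends.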
A review on deep adversarial visual generation

Tan Mingkui, Xu Shoukai, Zhang Shuhai, Chen Qi (School of Software Engineering, South China University of Technology, Guangzhou 510000, China)

Deep visual generation aims to create synthetic, photo-realistic visual content (such as images and videos) that can fool or please human perception according to specific requirements. Many human activities fall within the field of visual generation, e.g., advertisement making, house design and film making. However, these tasks can normally be done only by experts with professional skills gained through long-term training, with the help of professional software such as Adobe Photoshop. Producing photo-realistic content can also take a very long time, since the process is tedious and cumbersome. Automating these processes is therefore an important yet non-trivial problem. Deep visual generation has become a significant research direction in computer vision and machine learning, and has been applied to many tasks, such as automatic content generation, beautification, rendering and data augmentation. Current deep generative methods can be categorized into two groups: variational auto-encoder (VAE) based methods and generative adversarial network (GAN) based methods. Built on an encoder-decoder architecture, VAE methods first map input data into a latent distribution, and then minimize the distance between the latent distribution and a prior distribution, e.g., a Gaussian distribution. A well-trained VAE model can be used for dimensionality reduction and image generation. However, an inevitable gap between the latent distribution and the prior distribution makes the generated images/videos blurred. Unlike the VAE model, a GAN learns a mapping between input and output distributions to synthesize sharper images/videos. A GAN model contains two major modules: a generator, which aims to generate fake data, and a discriminator, which distinguishes whether a sample is fake or not.
To produce plausible fake data, the generator has to match the distribution of real data and synthesize fake data that fulfill the requirements of realism and diversity. The optimization problem of learning the generator and discriminator is formulated as a two-player minimax game. During training, the two modules are optimized alternately using stochastic gradient methods. At the end of training, the generator and discriminator are supposed to reach a Nash equilibrium of the minimax game. With the development of GAN models, more deep visual generation applications and tasks have emerged. Six typical tasks for deep visual generation are as follows: 1) Image generation from noise: the earliest task of deep visual generation, in which a GAN model seeks to generate an image (e.g., a face image) from random noise. 2) Image generation from images: it transforms a given image into a new one (e.g., from a black-and-white image to a color image), and can be applied to style transfer and image reconstruction. 3) Image generation from text: a very natural task, similar to a person describing the content of a painting and a painter drawing the corresponding image from that description. 4) Video generation from images: it aims to turn a static image into a dynamic video, which can be used in time-lapse photography, making animated videos from pictures, etc. 5) Video generation from videos: it is mainly used for video style transfer, video super-resolution and so on. 6) Video generation from text: it is more difficult than image generation from text, since the generated videos must be both semantically aligned with the text and consistent across video frames. The challenges in deep visual generation are then analyzed and discussed. First, rather than 2D data, we should try to generate high-quality 3D data, which contain more information and details.
Second, we could pay more attention to video generation rather than to image generation alone. Third, we could conduct research on controllable deep visual generation methods, which are more practical in real-world applications. Finally, we could try to extend style transfer methods from two domains to multiple domains. This review summarizes recent work on deep adversarial visual generation through a systematic investigation. It covers the background of deep visual generation, typical generative models, and an overview of mainstream deep visual generation tasks and related algorithms. Future trends in deep adversarial visual generation research are also analyzed.
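The alternating optimization of generator and discriminator described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the review: it assumes the standard minimax value function V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], a hypothetical one-dimensional setting with a linear generator G(z) = a*z + b and a logistic discriminator D(x) = sigmoid(w*x + c), and hand-derived gradients. All names (`value`, `grads`, `train`) and hyperparameters are illustrative. The toy is meant to show the update order only, not to produce a good generator.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def value(d, g, x_real, z):
    """Mini-batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    w, c = d                      # discriminator parameters (assumed linear logit)
    a, b = g                      # generator parameters (assumed linear map)
    x_fake = a * z + b
    # Stable forms: log sigmoid(u) = -logaddexp(0, -u),
    #               log(1 - sigmoid(u)) = -logaddexp(0, u)
    return (-np.logaddexp(0.0, -(w * x_real + c)).mean()
            - np.logaddexp(0.0, w * x_fake + c).mean())

def grads(d, g, x_real, z):
    """Analytic gradients of V with respect to (w, c) and (a, b)."""
    w, c = d
    a, b = g
    x_fake = a * z + b
    s_real = sigmoid(w * x_real + c)
    s_fake = sigmoid(w * x_fake + c)
    grad_d = np.array([
        ((1.0 - s_real) * x_real).mean() - (s_fake * x_fake).mean(),  # dV/dw
        (1.0 - s_real).mean() - s_fake.mean(),                        # dV/dc
    ])
    grad_g = np.array([
        (-s_fake * w * z).mean(),  # dV/da
        (-s_fake * w).mean(),      # dV/db
    ])
    return grad_d, grad_g

def train(steps=500, batch=64, lr=0.05, seed=0):
    """Alternate one discriminator ascent step and one generator descent step."""
    rng = np.random.default_rng(seed)
    d = np.array([0.1, 0.0])      # initial (w, c), arbitrary
    g = np.array([1.0, 0.0])      # initial (a, b), arbitrary
    for _ in range(steps):
        x_real = rng.normal(3.0, 1.0, size=batch)  # assumed "real" data: N(3, 1)
        z = rng.normal(size=batch)
        grad_d, _ = grads(d, g, x_real, z)
        d = d + lr * grad_d       # discriminator maximizes V
        z = rng.normal(size=batch)
        _, grad_g = grads(d, g, x_real, z)
        g = g - lr * grad_g       # generator minimizes V
    return d, g

if __name__ == "__main__":
    d, g = train()
    print("discriminator (w, c):", d)
    print("generator (a, b):", g)
```

In practice both modules are deep networks trained with an automatic differentiation framework, and the generator is often trained with a modified loss; the explicit gradients here only make the two opposing update directions visible.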