
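Xu Zhengguo1, Pu Bicai1, Qin Jianming1, Xiang Yanping2, Peng Zhenjiang3, Song Chunfeng3 (1. Nujiang Power Supply Bureau, Yunnan Power Grid Co., Ltd.; 2. Zhihui Hutong Technology Co., Ltd. (智慧互通科技股份有限公司); 3. Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences)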

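Abstract
Objective Pose-guided person image generation has broad application potential and has attracted wide attention. In recent years, pose-guided person image generation has achieved great success in low-resolution scenes. In high-resolution scenes, however, existing human pose transfer datasets suffer from low resolution or poor diversity, and corresponding high-resolution image generation methods are lacking. To address this problem, this paper constructs PersonHD, a large-scale high-definition person image dataset with multimodal auxiliary data. Method The PersonHD dataset comprises 299,817 images of 100 different people. Building on PersonHD and following the common settings of existing datasets, this paper further constructs two evaluation benchmarks at different resolutions and designs a practical high-resolution person image generation framework, providing a new platform for evaluating state-of-the-art pose-guided person image generation methods. Result Compared with existing datasets, PersonHD offers clear advantages in image resolution, pose diversity, and sample scale. On PersonHD, the experiments systematically evaluate representative pose-guided person image generation methods on the two benchmarks and systematically validate the effectiveness of each module of the proposed framework. The experimental results show that the framework performs well. Conclusion The proposed high-definition person image generation benchmark dataset is characterized by large-scale high-resolution data and strong diversity, which supports a more comprehensive evaluation of pose-guided person image generation algorithms. The dataset and code are available at: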
Data benchmark and model framework for high-definition human image generation

(Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences)

Abstract: Objective Pose-guided person image generation has attracted great attention owing to its wide application potential. In the early stages of development, researchers mainly relied on manually designed features and models, matching key points between different persons and then achieving pose transfer through interpolation or transformation. With the rapid development of deep learning, the emergence of generative adversarial networks (GANs) has brought significant progress in pose transfer. GANs can learn to generate realistic images, and GAN variants have been widely used in pose transfer tasks. Meanwhile, deep learning has also advanced keypoint detection. Advanced keypoint detection models, such as OpenPose, capture human pose information more accurately, which greatly assists both algorithm development and dataset construction in related fields. Recent works have achieved great success in pose-guided person image generation in low-definition scenes. However, in high-resolution scenes, existing human pose transfer datasets suffer from low resolution or poor diversity, and relevant high-resolution image generation methods are lacking. To address this issue, a large-scale high-definition human image dataset with multimodal auxiliary data, PersonHD, is constructed. Method This study constructs a large-scale, high-resolution human image dataset, PersonHD. Compared with other datasets, it has several advantages. (1) Higher image resolution: the cropped human images in PersonHD have a resolution of 1520×880. (2) More diverse pose variations: the actions of the subjects are closer to real-life scenarios, introducing more fine-grained non-rigid deformation of the human body. (3) Larger scale: PersonHD contains 299,817 images of 100 different people from 4,000 videos.
Based on the proposed PersonHD, this study further constructs two benchmarks and designs a practical high-resolution human image generation framework. Since most existing work deals with human images at a resolution of 256×256, this study first establishes a low-resolution (256×256) benchmark for general evaluation, evaluating these methods on the PersonHD dataset and further improving the performance of the state-of-the-art methods. This study also constructs a high-definition benchmark (512×512) to verify the performance of the state-of-the-art methods on PersonHD. Together, the two benchmarks enable rigorous evaluation of existing and future human pose transfer methods. In addition, this paper proposes a practical framework for generating higher-resolution and higher-quality human images. Specifically, a semantic enhanced part-wise augmentation is first designed to address the challenging overfitting problem in human image generation. Then, a Conditional Up-sampling Module is introduced for the generation and further refinement of high-resolution images. Result Compared with existing datasets, PersonHD has significant advantages in image resolution, pose diversity, and sample scale. Based on PersonHD, the experiments systematically evaluate current representative pose-guided person image generation methods on the two benchmarks and systematically validate the effectiveness of each module of the proposed framework. Five metrics are used to quantitatively analyze model performance: SSIM, FID, LPIPS, mask LPIPS, and PCKh. For the low-resolution benchmark, because most current methods are designed for 256×256 images, the images are resized to 256×256 so that these methods can be evaluated on PersonHD.
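Of the five metrics listed above, PCKh is the one that directly measures pose accuracy: a keypoint counts as correct when its distance to the reference keypoint is within a fraction of the head-segment length. The following is a minimal sketch of that general definition, not the evaluation code used in this study. The function name `pckh`, the input layout, and the threshold factor `alpha=0.5` (a commonly used value) are assumptions made for illustration.

```python
import math


def pckh(pred, gt, head_sizes, alpha=0.5):
    """Fraction of keypoints whose error is within alpha * head size.

    pred, gt   : lists of poses; each pose is a list of (x, y) keypoints.
                 A ground-truth keypoint may be None when it is unannotated.
    head_sizes : one head-segment length per pose (hypothetical input layout).
    alpha      : threshold factor; 0.5 is an assumed, commonly used value.
    """
    correct = 0
    total = 0
    for pose_pred, pose_gt, head in zip(pred, gt, head_sizes):
        for p, g in zip(pose_pred, pose_gt):
            if g is None:
                # Skip joints that have no ground-truth annotation.
                continue
            total += 1
            if math.dist(p, g) <= alpha * head:
                correct += 1
    return correct / total if total else 0.0


# Toy example: one pose, three joints, the last one unannotated.
pred = [[(10.0, 10.0), (20.0, 20.0), (30.0, 35.0)]]
gt = [[(10.0, 11.0), (25.0, 20.0), None]]
score = pckh(pred, gt, head_sizes=[8.0])
print(score)
```

In a pose-guided generation setting, `pred` would typically hold keypoints detected on the generated images and `gt` the target poses, so a higher score indicates that the generated person follows the requested pose more closely.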
In addition, PersonHD is split into two subsets, a clean subset and a complex subset, to evaluate how well different models handle different backgrounds. Several recent methods are compared on the two subsets, including PATN, MUST-GAN, XingGAN, PISE, and SPGNet. With SPGNet as the baseline, the proposed Semantic Enhanced Part-wise Augmentation and One-shot Driven Personalization improve its performance on multiple metrics, showing that these modules of the framework significantly improve the model's performance. For the high-resolution benchmark, because little work addresses pose-guided person image generation at high resolution, this study extends the state-of-the-art SPGNet model with the Conditional Up-sampling Module and further improves performance with Semantic Enhanced Part-wise Augmentation and One-shot Driven Personalization. The experimental results indicate that the framework performs well. Conclusion We build a large-scale high-resolution person image dataset, PersonHD, which contains 299,817 high-quality images. Compared with existing datasets, PersonHD offers higher image resolution, more diverse poses, and a larger number of samples. These characteristics help to comprehensively evaluate pose-guided person image generation algorithms. We establish comprehensive benchmarks and conduct extensive experimental evaluations under the general low-definition protocol and the first proposed high-definition protocol, providing an important platform for analyzing recent state-of-the-art person image generation methods.
We further propose a unified framework for high-definition person image generation, including the semantic enhanced part-wise augmentation and the conditional up-sampling module. Both modules are flexible and can work separately in a plug-and-play manner. Our dataset and code are available at: