Face frontalization for uncontrolled scenes

Xin Jingwei1, Wei Zikai1, Wang Nannan1, Li Jie2, Gao Xinbo3 (1. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China; 2. School of Electronic Engineering, Xidian University, Xi'an 710071, China; 3. Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)

Abstract
Objective Face frontalization is a hot topic in the vision community. Existing methods place high demands on training data, such as accurate alignment between input and output images and complete facial prior information. Because such data are costly to collect, the usable datasets are small, and directly applying existing methods to real uncontrolled scenes often fails to achieve satisfactory performance. To address this problem, we propose a frontalization method for face images of arbitrary view that depends on neither image alignment nor prior information.

Method We first propose a face encoding network with two input paths, which learn the visual representation information and the semantic representation information of the input face, respectively; together, the two form a more complete facial representation model. We then build a decoding network that fuses multiple categories of representations: with the visual representation as the basis and the semantic representation as the guide, the two kinds of representation information are fused, and the fused information is decoded into the final frontalized face image.

Result We first evaluate the method against eight state-of-the-art methods on the Multi-PIE (multi-pose, illumination and expression) dataset. Quantitative and qualitative results show that the proposed method outperforms the competing methods in both objective metrics and visual quality. Moreover, compared with the current state-of-the-art flow-based feature warping model (FFWM), our method saves 79% of the parameters and 42% of the computational operations. We further evaluate the method in real uncontrolled scenes on the CASIA-WebFace (Institute of Automation, Chinese Academy of Sciences-WebFace) dataset, where its recognition accuracy exceeds that of existing methods by more than 10%.

Conclusion The proposed two-level representation integration inference network mines and combines the low-level visual features and the high-level semantic features of face images, making full use of the information in the image itself. It not only achieves better visual quality and identity recognition accuracy at lower computational complexity, but also shows excellent generalization performance in uncontrolled scenes.
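To make the encode-fuse-decode structure just described concrete, the following is a minimal PyTorch-style sketch of the forward pass. Every layer shape and submodule below (the toy convolutional encoders, the concatenation-based fusion, the single-layer decoder) is an illustrative assumption, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FrontalizationNet(nn.Module):
    """Minimal skeleton of the dual-path encode -> fuse -> decode pipeline
    described in the abstract. All layer shapes are toy placeholders,
    not the paper's architecture."""

    def __init__(self):
        super().__init__()
        # Low-level visual path: preserves texture detail.
        self.visual_enc = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        # High-level semantic path: stands in for the pre-trained
        # recognition-model features used in the paper.
        self.semantic_enc = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        # Fusion: visual features as the basis, semantic as the guide
        # (here a simple concatenation + 1x1 conv, an assumed stand-in).
        self.fuse = nn.Conv2d(64, 32, 1)
        # Decoder: upsamples fused features back to an RGB image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        visual = self.visual_enc(face)       # visual representation
        semantic = self.semantic_enc(face)   # semantic representation
        fused = self.fuse(torch.cat([visual, semantic], dim=1))
        return self.decoder(fused)           # frontalized face image

frontal = FrontalizationNet()(torch.randn(1, 3, 128, 128))
print(frontal.shape)  # torch.Size([1, 3, 128, 128])
```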
Keywords
Face frontalization for uncontrolled scenes

Xin Jingwei1, Wei Zikai1, Wang Nannan1, Li Jie2, Gao Xinbo3 (1. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China; 2. School of Electronic Engineering, Xidian University, Xi'an 710071, China; 3. Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)

Abstract
Objective Face recognition in uncontrolled scenes is challenged by a series of uncontrollable factors such as changes in imaging perspective and variations in face pose. Face frontalization provides an interface between uncontrolled scenarios and mature recognition techniques: it aims to synthesize a standardized frontal face image from a face image captured under arbitrary illumination and pose. The reconstructed image can then be fed to a commonly used face recognition method without introducing any additional inference. Beyond serving as a pre-processing step for facial imaging tasks (e.g., recognition, semantic parsing, and animation generation), frontalization also has potential in virtual and augmented reality applications such as face clipping, decoration, and reconstruction. The task is challenging: the model must predict the appearance of a face after a 3D rotation while preserving its identity across the generated views. Many approaches have been proposed, which can be categorized as model-driven, data-driven, or a combination of the two. Recent generative adversarial networks (GANs) have shown good results in multi-view generation. However, these methods place high requirements on the training dataset, such as accurate alignment between input and output images and rich facial priors. We therefore propose a novel face frontalization method that requires neither image alignment nor prior information.

Method Our two-level representation integration inference network consists of three components: a high-level facial semantic encoder, a low-level facial visual encoder, and a multi-information fusion decoder. The encoding stage learns rich identity representation information from a face image in an arbitrary pose. The semantic encoder is initialized with the convolution weights of a pre-trained face recognition model; because that model is trained on a large-scale dataset, the encoder inherits facial prior knowledge that lets it adapt to complex face variations, so the semantic path can extract accurate semantic representation information. To reconstruct facial texture information, we introduce a visual encoder that complements the network with texture features. In addition, the receptive field of a convolutional neural network has a significant impact on its nonlinear mapping capability, and a larger receptive field better meets the requirements of the mapping process in a multi-source network, yielding more accurate predictions and better texture extraction. We therefore combine downsampling and dilated convolutions for feature extraction: each downsampling module in the network consists of a downsampling convolution followed by a dilated convolution. The dilated convolution effectively enlarges the range of the input that each reconstructed pixel can draw on and increases the receptive field of every layer of the network. Image reconstruction is highly sensitive to low-level visual information; if the encoded visual representation were combined with the semantic representation directly, the reconstruction process would be dominated by the visual representation and would ignore the guiding function of the semantic representation. The key to our multi-information integration is therefore to first model the mapping between the probability distributions of facial images in the feature space and then fuse the two kinds of representation information for facial reconstruction accordingly; the fused representation is decoded at the end of the network to produce the final frontalized face image.
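As a rough illustration of the downsampling module described in the Method paragraph (a strided downsampling convolution followed by a dilated convolution), consider this PyTorch sketch; the channel counts, kernel sizes, and dilation rate are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DownsampleModule(nn.Module):
    """One encoder downsampling block: a strided ("downsampling")
    convolution followed by a dilated convolution that widens the
    receptive field without further reducing resolution.
    Channel counts, kernel sizes, and the dilation rate are
    illustrative assumptions, not the paper's exact settings."""

    def __init__(self, in_ch: int, out_ch: int, dilation: int = 2):
        super().__init__()
        # Strided convolution halves the spatial resolution.
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                              stride=2, padding=1)
        # Dilated convolution keeps resolution but enlarges the
        # neighborhood each output pixel can draw on.
        self.dilated = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1,
                                 padding=dilation, dilation=dilation)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.down(x))
        return self.act(self.dilated(x))

# Example: a 128x128 feature map is reduced to 64x64.
feat = DownsampleModule(64, 128)(torch.randn(1, 64, 128, 128))
print(feat.shape)  # torch.Size([1, 128, 64, 64])
```

With padding equal to the dilation rate, the dilated convolution preserves spatial resolution while each output pixel sees a wider input neighborhood, which is the receptive-field effect the abstract describes.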
Result Our method is compared with eight state-of-the-art face frontalization methods on the Multi-PIE (multi-pose, illumination and expression) dataset. The quantitative evaluation metric is the Rank-1 recognition rate, and the qualitative analysis is based on the visual quality of the reconstructed images. The experimental results show that our model obtains better results on both counts. In addition, compared with the current state-of-the-art flow-based feature warping model (FFWM), our method saves 79% of the parameters and 42% of the computational operations. To fully validate the effectiveness of the proposed method, we analyze the effects of the dual coding paths, the multi-information integration, and the different loss functions on model performance. To match real uncontrolled scenarios, we also evaluate the proposed method on the CASIA-WebFace (Institute of Automation, Chinese Academy of Sciences-WebFace) dataset, where it is compared quantitatively and qualitatively with existing methods.

Conclusion We develop a two-level representation integration inference network that integrates low-level visual facial features with high-level semantic features. It achieves better reconstruction results and higher recognition accuracy at lower computational complexity, and it also shows superior generalization performance in uncontrolled scenes.
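Because the quantitative comparison above is reported as the Rank-1 recognition rate, a short sketch of how that metric is commonly computed from face embeddings may help; the cosine-similarity matching and the toy data are assumptions about a typical evaluation protocol, not details taken from the paper.

```python
import numpy as np

def rank1_rate(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Rank-1 recognition rate: the fraction of probe faces whose
    nearest gallery embedding (by cosine similarity) shares the same
    identity. The cosine metric is an illustrative assumption."""
    # L2-normalize so the dot product equals cosine similarity.
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    nearest = np.argmax(p @ g.T, axis=1)  # best gallery match per probe
    return np.mean(gallery_ids[nearest] == probe_ids)

# Toy example: 3 probes matched against 2 gallery identities.
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])
g_ids = np.array([0, 1])
probes = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
p_ids = np.array([0, 1, 0])
print(rank1_rate(probes, p_ids, gallery, g_ids))  # 1.0
```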
Keywords
