Face frontalization for uncontrolled scenes
2022, Vol. 27, No. 9, Pages 2788-2800
Published in print: 2022-09-16
Accepted: 2022-06-20
DOI: 10.11834/jig.211256
Jingwei Xin, Zikai Wei, Nannan Wang, Jie Li, Xinbo Gao. Face frontalization for uncontrolled scenes[J]. Journal of Image and Graphics, 2022, 27(9): 2788-2800.
Objective
Face frontalization is a hot topic in computer vision. Existing methods place high demands on training data, such as accurate registration between input and output images and complete facial prior information. Such data are costly to collect and the usable datasets are small, so directly applying existing methods to real uncontrolled scenes rarely yields satisfactory performance. To address this problem, we propose a face frontalization method for arbitrary views that depends on neither image registration nor prior information.
Method
We first propose a face encoding network with dual input paths, which learn the visual representation and the semantic representation of the input face, respectively; jointly, they constitute a more complete facial representation model. We then build a decoding network that fuses the two kinds of representation, taking the visual representation as the basis and the semantic representation as guidance; the fused information is decoded into the final frontalized face image, as sketched below.
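To make the pipeline concrete, the following is a minimal PyTorch sketch of the dual-path encode/fuse/decode structure described above. All layer choices, channel counts, and the naive concatenation fusion are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class DualPathFrontalizer(nn.Module):
    """Toy dual-path model: visual path + semantic path -> fused decoding."""
    def __init__(self, feat_dim=256):
        super().__init__()
        def enc():  # two stride-2 conv blocks; placeholder for either path
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.visual_encoder = enc()    # learns low-level texture cues
        self.semantic_encoder = enc()  # stand-in for a recognition-based encoder
        self.decoder = nn.Sequential(  # decodes the fused representation
            nn.ConvTranspose2d(feat_dim * 2, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, profile):
        v = self.visual_encoder(profile)    # visual representation (basis)
        s = self.semantic_encoder(profile)  # semantic representation (guidance)
        return self.decoder(torch.cat([v, s], dim=1))  # simplistic fusion

frontal = DualPathFrontalizer()(torch.randn(1, 3, 128, 128))  # -> (1, 3, 128, 128)
```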
Result
We first evaluate our method against eight state-of-the-art methods on the Multi-PIE (multi-pose, illumination and expression) dataset. Quantitative and qualitative results show that the proposed method outperforms the compared methods in both objective metrics and visual quality. Moreover, compared with the state-of-the-art flow-based feature warping model (FFWM), our method saves 79% of the parameters and 42% of the computational operations. We further evaluate the proposed method in real uncontrolled scenes on the CASIA-WebFace (Institute of Automation, Chinese Academy of Sciences WebFace) dataset, where its recognition accuracy exceeds that of existing methods by more than 10%.
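As an aside, a parameter saving such as the 79% figure above is typically measured by summing the element counts of each model's trainable tensors. A minimal sketch follows; the two `nn.Linear` models are dummy stand-ins for our network and FFWM.

```python
import torch.nn as nn

def n_params(model: nn.Module) -> int:
    # Total trainable parameters: sum of element counts over all weight tensors.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Dummy stand-ins; with the real networks this ratio yields the reported 79%.
ours, ffwm = nn.Linear(10, 10), nn.Linear(100, 100)
saving = 1.0 - n_params(ours) / n_params(ffwm)
print(f"parameter saving: {saving:.0%}")
```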
Conclusion
The proposed two-level representation integration inference network mines and combines the low-level visual features and the high-level semantic features of a face image, making full use of the information in the image itself. It not only achieves better visual quality and identity recognition accuracy at lower computational complexity, but also exhibits excellent generalization in uncontrolled scenes.
Objective
Face recognition in uncontrolled scenes is challenged by a series of uncontrollable factors, such as changes in imaging perspective and variations in face pose. Face frontalization bridges these uncontrolled scenarios and mature recognition techniques: it aims to synthesize a standardized frontal image from a face captured under arbitrary illumination and pose, so that the reconstructed image can be fed to commonly used face recognition methods without any additional change to their inference. Beyond serving as a pre-processing model for facial analysis tasks (e.g., recognition, semantic parsing, and animation generation), frontalization has potential in virtual and augmented reality applications such as face editing, decoration, and reconstruction. The task is challenging because the model must predict the face's appearance under 3D rotation while preserving identity across the generated views. Many classical approaches have been proposed, which can be categorized into model-driven approaches, data-driven approaches, and combinations of both. Recent generative adversarial networks (GANs) have shown good results in multi-view generation. However, these methods place high requirements on the training dataset, such as accurately aligned input-output image pairs and rich facial priors. We therefore propose a novel face frontalization method that relies on neither image alignment nor prior information.
Method
Our two-level representation integration inference network is composed of three parts: a high-level facial semantic information encoder, a low-level facial visual information encoder, and a multi-information integration decoder. The encoding process addresses how to learn richer identity representation information from a facial image of arbitrary pose. The convolution weights of a pre-trained face recognition model are transferred into our semantic encoder. Because the recognition model is trained on a large-scale dataset, this transfer equips the encoder with facial prior knowledge for adapting to complex face variations. Benefiting from the recognition model's ability to capture identity information, the semantic encoder extracts accurate semantic representations.
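A minimal sketch of this weight transfer, assuming the semantic encoder shares its convolutional layout with the recognition backbone; both modules below are hypothetical stand-ins rather than the actual networks.

```python
import torch.nn as nn

# Hypothetical stand-ins with matching layer names, so pretrained convolution
# weights can be copied name-for-name into the semantic encoder.
def conv_stack():
    return nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, 128, 3, padding=1))

recognition_backbone = conv_stack()  # in practice: loaded from a pretrained checkpoint
semantic_encoder = conv_stack()

# strict=False tolerates recognition-specific heads absent from the encoder.
semantic_encoder.load_state_dict(recognition_backbone.state_dict(), strict=False)
```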
To recover facial texture, we add a visual encoder that complements the texture features missing from the semantic path. In addition, the receptive field of a convolutional neural network has a significant impact on its nonlinear mapping capability: a larger receptive field better supports the mapping between multiple information sources and yields more accurate predictions. To further optimize the extraction of texture information, we integrate downsampling and dilated convolutions for feature extraction, i.e., each downsampling module in the network consists of a downsampling convolution followed by a dilated convolution. The dilated convolution effectively enlarges the input region that each reconstructed pixel maps to across the network layers and increases the receptive field of every layer.
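A sketch of such a downsampling module, assuming 3×3 kernels and illustrative channel counts: a stride-2 convolution halves the resolution, and a dilated convolution then widens the receptive field without further downsampling.

```python
import torch
import torch.nn as nn

def down_block(in_ch: int, out_ch: int, dilation: int = 2) -> nn.Sequential:
    return nn.Sequential(
        # Downsampling convolution: halves the spatial resolution.
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.ReLU(inplace=True),
        # Dilated convolution: enlarges the receptive field at the same resolution.
        nn.Conv2d(out_ch, out_ch, kernel_size=3, dilation=dilation, padding=dilation),
        nn.ReLU(inplace=True),
    )

y = down_block(64, 128)(torch.randn(1, 64, 64, 64))  # -> (1, 128, 32, 32)
```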
Image reconstruction is highly sensitive to the low-level visual information of the image. If the encoded visual representation were combined with the semantic representation directly, the reconstruction process would be dominated by the visual representation and ignore the guiding function of the semantic representation. The key to our multi-information integration is therefore to first establish the mapping relationship between the probability distributions of facial images in feature space, and then derive from it the representation used for facial reconstruction. This representation is decoded at the end of the network to obtain the final frontalization result.
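The exact fusion operator is not specified here; one hypothetical realization consistent with "visual as basis, semantic as guidance" is feature-wise modulation, in which the semantic representation predicts per-channel scale and shift parameters applied to the visual features.

```python
import torch
import torch.nn as nn

class GuidedFusion(nn.Module):
    """Hypothetical fusion: semantic features modulate visual features."""
    def __init__(self, ch):
        super().__init__()
        self.to_scale = nn.Conv2d(ch, ch, 1)  # semantic -> channel-wise gain
        self.to_shift = nn.Conv2d(ch, ch, 1)  # semantic -> channel-wise bias

    def forward(self, visual, semantic):
        # Visual features form the base; semantic features steer them.
        return visual * (1 + self.to_scale(semantic)) + self.to_shift(semantic)

v = torch.randn(1, 256, 32, 32)   # low-level visual representation
s = torch.randn(1, 256, 32, 32)   # high-level semantic representation
fused = GuidedFusion(256)(v, s)   # decoded into the frontal image downstream
```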
Result
Our method is compared with eight state-of-the-art face frontalization methods on the Multi-PIE (multi-pose, illumination and expression) dataset. The quantitative evaluation metric is the Rank-1 recognition rate, and the qualitative analysis is based on the visual quality of the reconstructed images. The experimental results show that our model obtains better results on both. In addition, our method saves 79% of the parameters and 42% of the computational operations compared with the current state-of-the-art flow-based feature warping model (FFWM). To fully validate the effectiveness of the proposed method, we analyze the effects of the dual coding paths, the multi-information integration, and the different loss functions on model performance. To match real uncontrolled scenarios, we also evaluate the proposed method on the CASIA-WebFace (Institute of Automation, Chinese Academy of Sciences WebFace) dataset, comparing it quantitatively and qualitatively with existing methods.
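For clarity, the Rank-1 recognition rate used above counts a probe image as correct when its nearest gallery embedding carries the same identity; a straightforward NumPy implementation (function and argument names are ours):

```python
import numpy as np

def rank1_rate(probe_feats, probe_ids, gallery_feats, gallery_ids):
    # L2-normalize so the dot product equals cosine similarity.
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    nearest = np.argmax(p @ g.T, axis=1)  # index of the best gallery match per probe
    return float(np.mean(np.asarray(gallery_ids)[nearest] == np.asarray(probe_ids)))
```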
Conclusion
We develop a two-level representation integration inference network that integrates the low-level visual facial features and the high-level semantic features. It achieves better reconstruction results and higher recognition accuracy at lower computational complexity, and it also shows superior generalization performance in uncontrolled scenes.
face frontalization; arbitrary pose; dual encoding path; visual representation; semantic representation; fusion algorithm
Blanz V and Vetter T. 1999. A morphable model for the synthesis of 3D faces//Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. Los Angeles, USA: ACM: 187-194 [DOI: 10.1145/311535.311556]
Cao J, Hu Y B, Zhang H W, He R and Sun Z N. 2018. Learning a high fidelity pose invariant model for high-resolution face frontalization//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc.: 2872-2882
Cao S H, Liu X H, Mao X Q and Zou Q. 2022. A review of human face forgery and forgery-detection technologies. Journal of Image and Graphics, 27(4): 1023-1038 [DOI: 10.11834/jig.200466]
Cole F, Belanger D, Krishnan D, Sarna A, Mosseri I and Freeman W T. 2017. Synthesizing normalized faces from facial identity features//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3386-3395 [DOI: 10.1109/CVPR.2017.361]
Deng J K, Guo J, Xue N N and Zafeiriou S. 2019. ArcFace: additive angular margin loss for deep face recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4685-4694 [DOI: 10.1109/CVPR.2019.00482]
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2672-2680
Gross R, Matthews I, Cohn J, Kanade T and Baker S. 2010. Multi-PIE. Image and Vision Computing, 28(5): 807-813 [DOI: 10.1016/j.imavis.2009.08.002]
Guo Y D, Zhang L, Hu Y X, He X D and Gao J F. 2016. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 87-102 [DOI: 10.1007/978-3-319-46487-9_6]
Hassner T, Harel S, Paz E and Enbar R. 2015. Effective face frontalization in unconstrained images//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 4295-4304 [DOI: 10.1109/CVPR.2015.7299058]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu Y B, Wu X, Yu B, He R and Sun Z N. 2018. Pose-guided photorealistic face rotation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8398-8406 [DOI: 10.1109/CVPR.2018.00876]
Huang R, Zhang S, Li T Y and He R. 2017. Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2458-2467 [DOI: 10.1109/ICCV.2017.267]
Kan M N, Shan S G, Chang H and Chen X L. 2014. Stacked progressive auto-encoders (SPAE) for face recognition across poses//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1883-1890 [DOI: 10.1109/CVPR.2014.243]
Kingma D P and Welling M. 2014. Auto-encoding variational Bayes [EB/OL]. [2021-05-01]. https://arxiv.org/pdf/1312.6114.pdf
Rezende D J, Eslami S M A, Mohamed S, Battaglia P, Jaderberg M and Heess N. 2016. Unsupervised learning of 3D structure from images//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 5003-5011
Sagonas C, Panagakis Y, Zafeiriou S and Pantic M. 2015. Robust statistical face frontalization//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 3871-3879 [DOI: 10.1109/ICCV.2015.441]
Tran L, Yin X and Liu X M. 2017. Disentangled representation learning GAN for pose-invariant face recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1283-1292 [DOI: 10.1109/CVPR.2017.141]
Tu X G, Zhao J, Liu Q K, Ai W J, Guo G D, Li Z F, Liu W and Feng J S. 2022. Joint face image restoration and frontalization for recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(3): 1285-1298 [DOI: 10.1109/TCSVT.2021.3078517]
Wang H, Wu C D, Chi J N, Yu X S and Hu Q. 2020. Face super-resolution reconstruction based on multitask joint learning. Journal of Image and Graphics, 25(2): 229-240 [DOI: 10.11834/jig.190233]
Wei Y X, Liu M, Wang H L, Zhu R F, Hu G S and Zuo W M. 2020. Learning flow-based feature warping for face frontalization with illumination inconsistent supervision//Proceedings of the 16th European Conference on Computer Vision. Edinburgh, UK: Springer: 558-574 [DOI: 10.1007/978-3-030-58610-2_33]
Wu X, He R, Sun Z N and Tan T N. 2018. A light CNN for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11): 2884-2896 [DOI: 10.1109/TIFS.2018.2833032]
Yan X C, Yang J M, Yumer E, Guo Y J and Lee H. 2017. Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision [EB/OL]. [2021-08-13]. https://arxiv.org/pdf/1612.00814.pdf
Yang J M, Reed S, Yang M H and Lee H. 2015. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis//Proceedings of the 28th International Conference on Neural Information Processing Systems. Quebec, Canada: MIT Press: 1099-1107
Yi D, Lei Z, Liao S C and Li S Z. 2014. Learning face representation from scratch [EB/OL]. [2021-11-28]. https://arxiv.org/pdf/1411.7923.pdf
Yim J, Jung H, Yoo B, Choi C, Park D S and Kim J. 2015. Rotating your face using multi-task deep neural network//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 676-684 [DOI: 10.1109/CVPR.2015.7298667]
Yin X, Yu X, Sohn K, Liu X M and Chandraker M. 2017. Towards large-pose face frontalization in the wild//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4010-4019 [DOI: 10.1109/ICCV.2017.430]
Yin Y, Jiang S Y, Robinson J P and Fu Y. 2020. Dual-attention GAN for large-pose face frontalization//Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition. Buenos Aires, Argentina: IEEE: 249-256 [DOI: 10.1109/FG47880.2020.00004]
Zhang J, He H, Zhan X S and Xiao J. 2014. Three dimensional face reconstruction via feature adaptation and Laplace deformation. Journal of Image and Graphics, 19(9): 1349-1359 [DOI: 10.11834/jig.20140912]
Zhao B, Wu X, Cheng Z Q, Liu H, Jie Z Q and Feng J S. 2018a. Multi-view image generation from a single-view//Proceedings of the 26th ACM International Conference on Multimedia. Seoul, Korea (South): ACM: 383-391 [DOI: 10.1145/3240508.3240536]
Zhao J, Xiong L, Cheng Y, Cheng Y, Li J S, Zhou L, Xu Y, Karlekar J, Pranata S, Shen S M, Xing J L, Yan S C and Feng J S. 2018b. 3D-aided deep pose-invariant face recognition//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: AAAI Press: 1184-1190 [DOI: 10.24963/ijcai.2018/165]
Zhao J, Cheng Y, Xu Y, Xiong L, Li J S, Zhao F, Jayashree K, Pranata S, Shen S M, Xing J L, Yan S C and Feng J S. 2018c. Towards pose invariant face recognition in the wild//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2207-2216 [DOI: 10.1109/CVPR.2018.00235]
Zhu K M, Xu W B, Lu W and Zhao X F. 2022. Deepfake video detection with feature interaction amongst key frames. Journal of Image and Graphics, 27(1): 188-202 [DOI: 10.11834/jig.210408]
Zhu X Y, Lei Z, Liu X M, Shi H L and Li S Z. 2016. Face alignment across large poses: a 3D solution//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 146-155 [DOI: 10.1109/CVPR.2016.23]
Zhu X Y, Lei Z, Yan J J, Yi D and Li S Z. 2015. High-fidelity pose and expression normalization for face recognition in the wild//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 787-796 [DOI: 10.1109/CVPR.2015.7298679]
Zhu Z Y, Luo P, Wang X G and Tang X O. 2014. Multi-view perceptron: a deep model for learning face identity and view representations//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 217-225