戴昀书1, 费建伟2, 夏志华2,3, 刘家男3, 翁健3(1.中山大学网络空间安全学院, 深圳 518107;2.南京信息工程大学计算机学院, 南京 210044;3.暨南大学网络空间安全学院, 广州 510632)
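Objective Face forgery technology is developing rapidly and poses a serious threat to social information security, so forged-face detection algorithms with strong generalization are urgently needed to resist the wide variety of forgery models. Recent studies have found that forgery algorithms generally include a step that blends the face with the background, which means that any forgery method can hardly avoid leaving forgery traces along the face boundary. Based on this finding, this paper shifts the learning objective of the model from specific forgery-trace features to more universal local similarity features of face images, and proposes a DeepFake face detection algorithm based on local similarity anomalies. Method A local similarity predicator (LSP) module is first proposed, in which a group of local similarity predictors separately compute local anomalies in the intermediate-layer feature maps of the RGB image. In addition, to capture real/fake cues in the frequency domain, a learnable spatial rich model convolutional pyramid (SRMCP) is proposed to extract multi-scale high-frequency noise features. Result Extensive experiments were conducted on multiple datasets. In terms of generalization, the proposed model with ResNet18 as its backbone surpasses the compared methods in cross-dataset detection accuracy on the four subsets of FF++ by 0.77%, 5.59%, 6.11%, and 4.28%, respectively. In terms of robustness to image compression, it surpasses the compared methods by 2.48%, 4.83%, and 10.10% under three different compression levels. Conclusion The proposed method substantially improves the detection performance of lightweight convolutional neural networks and achieves better generalization and robustness than the vast majority of existing work.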
Local similarity anomaly for general face forgery detection
Dai Yunshu1, Fei Jianwei2, Xia Zhihua2,3, Liu Jianan3, Weng Jian3 (1. School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen 518107, China; 2. School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China; 3. School of Cyberspace Security, Jinan University, Guangzhou 510632, China)
Objective In recent years, DeepFake technology has advanced rapidly, and the highly realistic forged face images it produces threaten not only personal privacy and security but also the international political situation. Detection methods with good generalization ability are therefore needed. Early forged faces had low fidelity and obvious defects, so traditional digital forensic algorithms and deep learning models could detect them well. As DeepFake has matured, however, forged faces have become increasingly realistic and increasingly challenging for detection algorithms. To improve detection performance, researchers have focused on the essential differences between real and forged faces. The DeepFake process can be decomposed into the following steps: 1) detect and crop the face in the target image; 2) forge the face using a forgery algorithm; 3) paste the forged face back into the original image and use image fusion to eliminate boundary defects and improve the visual effect. Step 3 often leaves easily detectable local forgery traces, which are important cues for distinguishing real faces from fake ones. Many researchers have attempted to build models that learn such traces to improve accuracy or to localize tampering. However, because both the local traces and the image fusion methods differ widely across forgery techniques, detectors trained on specific techniques have limited generalization ability. Therefore, although the local traces caused by Step 3 are universal, directly learning them to classify real and forged faces contributes little to generalizability. Method This paper proposes a DeepFake detection method based on local similarity anomalies to achieve high generalizability.
Instead of directly learning local forgery traces to distinguish real faces from fake ones, this method transforms the learning objective into the similarity of local features. Specifically, the face region of a forged image has source features that differ from those of the background region. Each of the two regions is internally uniform in its source features, but the fusion boundary between the face and the background contains conflicting source features and therefore has a low level of local similarity. These local similarity anomalies are independent of both the specific forgery algorithm and the fusion algorithm, and they can be regarded as heterogeneous features that are highly consistent with the essential difference between real and fake faces. To capture these traces, this paper proposes the local similarity predicator module. The local deep features of a face image are decomposed into horizontal and vertical groups, and the similarity between each local deep feature and its neighbors is calculated. The learning objective is thereby converted from recognizing specific forgery traces to predicting the similarity of source features within the image, which captures the essential differences between real and fake faces in a general way. In addition, previous studies have found that frequency-domain features contain important clues for distinguishing real from fake faces. The proposed method draws on domain knowledge from steganalysis and constructs a learnable convolutional pyramid module based on the spatial rich model (SRM), which compensates for the limited ability of the RGB space to express real/fake features and improves in-domain detection performance.
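The horizontal and vertical neighbor-similarity computation described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the tensor layout, so cosine similarity, the nested-list feature map, and the names `cosine` and `local_similarity` are assumptions made only for illustration; this is not the authors' LSP module, which is learned rather than computed in closed form.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def local_similarity(feat):
    """feat: H x W grid of C-dimensional feature vectors (nested lists).

    Returns (horizontal, vertical), where
      horizontal[i][j] compares feat[i][j] with its right neighbor  -> H x (W-1)
      vertical[i][j]   compares feat[i][j] with its lower neighbor  -> (H-1) x W
    """
    h, w = len(feat), len(feat[0])
    horizontal = [[cosine(feat[i][j], feat[i][j + 1]) for j in range(w - 1)]
                  for i in range(h)]
    vertical = [[cosine(feat[i][j], feat[i + 1][j]) for j in range(w)]
                for i in range(h - 1)]
    return horizontal, vertical

# Toy 4x4 map (hypothetical data): a "background" source feature everywhere,
# and a different "face" source feature in the central 2x2 block.
BG, FACE = [1.0, 0.0], [0.0, 1.0]
feat = [[FACE if 1 <= i <= 2 and 1 <= j <= 2 else BG for j in range(4)]
        for i in range(4)]
horizontal, vertical = local_similarity(feat)

# Similarity stays high inside each region and drops only across the boundary.
for row in horizontal:
    print(row)
```

In this toy case the low-similarity entries trace the outline of the pasted block, regardless of what the two source features actually are, which is the property the text relies on for generalization.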
This study also proposes the spatial rich model convolutional pyramid (SRMCP), which inherits the high-frequency noise extraction ability of the SRM kernels, can be continuously updated during training, and can be extended to a pyramid architecture with different receptive fields to effectively capture high-frequency noise features at different scales. Result The overall results on FF++ are compared under three compression levels. The proposed method, which uses ResNet18 as its backbone, achieves high detection accuracy on both raw and compressed datasets. It not only significantly outperforms classical digital forensic algorithms but also surpasses some recently proposed advanced algorithms for deep forgery detection. Specifically, the proposed method achieves accuracies of 99.72%, 98.34%, and 90.73% on RAW, C23, and C40, respectively, and its average accuracy is 2.31% and 13.33% (20.26% on the C40 dataset) higher than those of Xception and MesoNet, respectively. The proposed method also outperforms a metric learning method published in CVPR 2021 that incorporates the frequency and spatial domains, with accuracies that are 0.29%, 1.63%, and 1.22% higher on RAW, C23, and C40, respectively. Overall, the proposed method takes the lead in terms of accuracy. Experimental results reveal that the local similarity module can effectively capture the inherent features of forged faces, thus substantially improving detection accuracy and achieving high accuracy even with a simple ResNet18 as the backbone. The average cross-domain areas under the curve (AUCs) of the proposed method reach 91.40%, 96.03%, 99.08%, and 96.05% on the four subsets of FF++, which are 15.41%, 16.47%, 21.11%, and 14.7% higher than those of Xception, respectively. In addition, the average accuracies of the proposed method are improved by 0.77%, 5.59%, 6.11%, and 4.28%, respectively, compared with state-of-the-art methods.
The cross-domain results on Celeb-DF show that the proposed method outperforms the existing methods even with ResNet18 as its backbone. Although recently introduced methods have made significant progress in cross-domain detection, with an average accuracy exceeding 70%, the cross-domain accuracies of the proposed method are 1.11%, 3.73%, and 5.17% higher than those of state-of-the-art methods. Conclusion The method proposed in this paper can greatly improve the detection performance of lightweight convolutional neural networks and achieves better generalization and robustness than other recently proposed methods. In future work, the local similarity learning module will be further optimized so that it can predict local anomalies for different types of forged faces, further improving its generalizability to unknown forged faces.
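The SRM-based convolutional pyramid mentioned in the Method part can be illustrated with a small sketch of its forward pass. The abstract does not give the kernels or say how the different receptive fields are obtained, so this sketch assumes a single 3x3 second-order high-pass kernel of the kind commonly used in SRM-based forensics, and assumes that the pyramid levels are produced by dilating that kernel; `conv2d_same` and `srm_pyramid` are hypothetical names. In a learnable version, the kernel values would only serve as the initialization of trainable convolution weights.

```python
# 3x3 second-order high-pass kernel (assumed SRM-style initialization).
# Its entries sum to zero, so flat image regions produce no response.
SRM_3X3 = [[-1 / 4.0, 2 / 4.0, -1 / 4.0],
           [2 / 4.0, -4 / 4.0, 2 / 4.0],
           [-1 / 4.0, 2 / 4.0, -1 / 4.0]]

def conv2d_same(img, kernel, dilation=1):
    """Zero-padded, same-size 2-D cross-correlation with a dilated square kernel."""
    h, w = len(img), len(img[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(k):
                for kx in range(k):
                    yy = y + (ky - half) * dilation
                    xx = x + (kx - half) * dilation
                    if 0 <= yy < h and 0 <= xx < w:  # outside pixels count as zero
                        acc += kernel[ky][kx] * img[yy][xx]
            out[y][x] = acc
    return out

def srm_pyramid(img, dilations=(1, 2, 3)):
    """One high-frequency residual map per dilation (receptive fields 3, 5, 7)."""
    return [conv2d_same(img, SRM_3X3, d) for d in dilations]

# Toy 9x9 grayscale image (hypothetical data): flat intensity with one spike.
img = [[5.0] * 9 for _ in range(9)]
img[4][4] = 9.0
maps = srm_pyramid(img)

for d, m in zip((1, 2, 3), maps):
    print("dilation", d, "response at spike:", m[4][4])
```

Each level suppresses the smooth image content and keeps the local residual, while larger dilations compare a pixel with more distant neighbors. Because of the zero padding, responses near the image border are not meaningful in this sketch.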