CNN and Transformer-coordinated deepfake detection
2023, Vol. 28, No. 3, pp. 804-819
Print publication date: 2023-03-16
Accepted: 2022-11-03
DOI: 10.11834/jig.220519
Ying Li, Shan Bian, Chuntao Wang, Wei Lu. CNN and Transformer-coordinated deepfake detection[J]. Journal of Image and Graphics, 2023,28(3):804-819.
Objective
Deepfake video detection is currently a hot research problem in computer vision. Convolutional neural networks (CNNs) and the Vision Transformer (ViT) are both basic building blocks of deepfake detection models. Although each has its own strengths, both suffer from long training and testing times and from a marked drop in accuracy in cross-compression scenarios. Considering the respective advantages and drawbacks of these two model families, as well as the suitability of features from different domains for detection, we propose an efficient joint model combining a CNN with a Transformer.
Method
We design an EfficientNet-based spatial-domain feature extraction branch and a frequency-domain feature extraction branch to enrich the feature representation of a single branch. The two branches are then connected to a Transformer encoder and a cross-attention structure to model feature correlations across global regions. To address the accuracy drop of deepfake detection models in cross-compression and cross-dataset scenarios, we design an attention mechanism and its embedding scheme and combine them with a data augmentation strategy, improving the model's robustness across compression rates and datasets.
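The fusion of the two branches described above can be illustrated with a minimal numpy sketch of scaled dot-product cross-attention, where queries come from the spatial branch and keys/values from the frequency branch. This is not the paper's implementation; the projection weights, dimensions, and function names are illustrative assumptions (a trained model would learn the projections).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_spatial, x_freq, d_k=64, seed=0):
    """Cross-attention sketch: each spatial-branch token attends
    over all frequency-branch tokens (hypothetical projections)."""
    rng = np.random.default_rng(seed)
    d = x_spatial.shape[-1]
    # random stand-ins for learned projection matrices
    w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    q = x_spatial @ w_q                       # (n_spatial, d_k)
    k = x_freq @ w_k                          # (n_freq, d_k)
    v = x_freq @ w_v                          # (n_freq, d_k)
    attn = softmax(q @ k.T / np.sqrt(d_k))    # (n_spatial, n_freq)
    return attn @ v                           # (n_spatial, d_k)

# toy token sequences standing in for the two branch outputs
spatial_tokens = np.random.default_rng(1).standard_normal((49, 128))
freq_tokens = np.random.default_rng(2).standard_normal((49, 128))
fused = cross_attention(spatial_tokens, freq_tokens)
print(fused.shape)  # (49, 64)
```

The key design point mirrored here is that information flows between domains: the attention weights are computed across branches rather than within one.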
Result
On the four datasets of FaceForensics++, our method is compared with nine other methods in terms of cross-compression-rate accuracy. In the cross-compression-rate experiments, our method achieves detection accuracies of 90.35%, 71.79% and 80.71% on Deepfake, Face2Face and Neural Textures forged images, respectively, outperforming the compared algorithms. In the cross-dataset experiments, our model also outperforms the other methods, with training time on the same device substantially reduced.
Conclusion
The proposed joint model combines the advantages of convolutional neural networks and the Vision Transformer, exploits the detection characteristics of features from different domains together with attention and data augmentation mechanisms, and improves deepfake detection under cross-compression and cross-dataset settings, making the model more accurate and efficient.
Objective
Research on deepfake detection has recently become a hot topic, aiming to identify fake videos synthesized by deep forgery techniques and spread on social networks such as WeChat, Instagram and TikTok. Most existing methods extract forged features with a convolutional neural network (CNN) and determine the final classification score with a classifier over those features. When facing low-quality or highly compressed forged videos, these methods improve detection performance by extracting deeper spatial-domain information. However, the forged traces left in the spatial domain diminish with compression, and local features tend to become similar, which degrades performance severely. This motivates us to retain the frequency-domain information of forged image artifacts as an additional forensic clue, since it suffers less interference from JPEG compression. CNN-based spatial feature extraction captures facial artifacts by stacking convolutions, but its receptive field is limited: it excels at modelling local information while ignoring relationships between distant pixels. The Transformer, with its strength in long-range dependency modelling in natural language processing and computer vision tasks, is therefore employed to model relationships between image regions and to compensate for the CNN's weakness in acquiring global information. However, the Transformer can only process sequence inputs, so it still requires the cooperation of a convolutional neural network in computer vision tasks.
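Why compression interferes less with a frequency-domain representation can be seen in a minimal numpy sketch of JPEG-style DCT quantization (not from the paper; the quantization step map is a toy assumption). Coarser steps at higher frequencies erase fine spatial detail, while the band structure of the spectrum stays explicit and analyzable.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

n = 8
C = dct_matrix(n)
block = np.random.default_rng(0).uniform(0, 255, size=(n, n))  # toy 8x8 pixel block
coeffs = C @ block @ C.T                                       # 2-D DCT spectrum

# toy JPEG-style quantization: step grows with frequency index,
# so high-frequency coefficients are rounded away more aggressively
step = 1.0 + 4.0 * np.add.outer(np.arange(n), np.arange(n))
quantized = np.round(coeffs / step) * step

# per-coefficient quantization error is bounded by half the step,
# i.e. low-frequency content is preserved far more precisely
print(np.all(np.abs(quantized - coeffs) <= step / 2))  # True
```

The orthonormal basis makes the transform exactly invertible (`C.T @ coeffs @ C` recovers the block), which is why band-wise clues can be extracted without losing information.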
Method
First, we develop a novel joint detection model that leverages the advantages of both the CNN and the Transformer and enriches the feature representation with frequency-domain information. EfficientNet-b0 serves as the feature extractor. In the spatial feature extraction stage, to capture more forensic features, an attention module is embedded in a shallow layer, and the deep features are multiplied by the activation map produced by this module. In the frequency-domain feature extraction stage, to better learn frequency-domain features, we use the discrete cosine transform as the frequency-domain transform and add an adaptive component to the frequency-band decomposition. During training, we adopt mixed-precision training to accelerate memory-efficient training. Then, to construct the joint model, we link the two feature extraction branches to a modified Transformer structure. The Transformer encoder models inter-region feature correlations through global self-attention, and cross attention is computed between the branches on the basis of a cross-attention structure to further realize information interaction between the dual-domain features. Furthermore, we design and implement a random data augmentation strategy that is coordinated with the attention mechanism to improve the detection accuracy of the model in cross-compression-rate and cross-dataset scenarios.
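The frequency-band decomposition with an adaptive component mentioned above can be sketched as follows. This is a hedged numpy illustration, not the paper's implementation: the band edges, the number of fixed bands, and the use of a given weight map in place of a learnable mask are all assumptions.

```python
import numpy as np

def dct2(x):
    """Orthonormal 2-D DCT-II via a separable basis matrix."""
    n = x.shape[0]
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c @ x @ c.T

def band_masks(n, n_bands=3):
    """Fixed low/mid/high masks, partitioned by diagonal frequency index."""
    idx = np.add.outer(np.arange(n), np.arange(n))
    edges = np.linspace(0, idx.max() + 1, n_bands + 1)
    return [(idx >= edges[i]) & (idx < edges[i + 1]) for i in range(n_bands)]

def decompose(img, adaptive_mask=None):
    """Split the DCT spectrum into fixed bands plus one adaptive band.
    `adaptive_mask` stands in for the learnable component described in
    the abstract; here it is simply a given soft weight map."""
    d = dct2(img)
    bands = [d * m for m in band_masks(img.shape[0])]
    if adaptive_mask is not None:
        bands.append(d * adaptive_mask)  # assumption: learned, sigmoid-weighted in the real model
    return bands

img = np.random.default_rng(0).standard_normal((32, 32))
soft = 1 / (1 + np.exp(-np.random.default_rng(1).standard_normal((32, 32))))
bands = decompose(img, adaptive_mask=soft)
print(len(bands))  # 4
```

The fixed masks partition the spectrum exactly (their sum reconstructs the full DCT), while the adaptive band lets the network re-weight frequencies that the hand-picked edges would miss.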
Result
Our joint model is compared with nine state-of-the-art deepfake detection methods on two datasets, FaceForensics++ (FF++) and Celeb-DF. In the cross-compression-rate experiments on FF++, our detection accuracy reaches 90.35%, 71.79% and 80.71% on Deepfakes, Face2Face and NeuralTextures (NT) manipulated images, respectively. In the cross-dataset experiments, i.e., training on FaceForensics++ and testing on Celeb-DF, our model likewise outperforms the compared methods, and its training time is substantially reduced.
Conclusion
The experiments demonstrate that the proposed joint model improves cross-dataset and cross-compression-rate detection accuracy. It takes advantage of EfficientNet and the Transformer and combines the characteristics of different domain features with attention and data augmentation mechanisms, making the model more accurate and efficient.
Keywords: deepfake detection; convolutional neural network (CNN); Vision Transformer (ViT); spatial domain; frequency domain