CNN and Transformer-coordinated deepfake detection
2023, Vol. 28, No. 3, pp. 804-819
Print publication date: 2023-03-16
Accepted: 2022-11-03
DOI: 10.11834/jig.220519
Ying Li, Shan Bian, Chuntao Wang, Wei Lu. CNN and Transformer-coordinated deepfake detection[J]. Journal of Image and Graphics, 2023,28(3):804-819.
Objective
Deepfake video detection is currently a hot research problem in computer vision. Convolutional neural networks (CNNs) and the Vision Transformer (ViT) are both basic building blocks of deepfake detection models. Although each has its own strengths, both suffer from long training and testing times and from a marked drop in accuracy in cross-compression scenarios. Considering the respective advantages and drawbacks of these two model families, as well as the suitability of features from different domains for detection, we propose an efficient joint model combining a CNN with a Transformer.
Method
We design an EfficientNet-based spatial-domain feature extraction branch and a frequency-domain feature extraction branch to enrich the feature representation of a single branch. The two branches are then connected to a Transformer encoder and a cross-attention structure to model feature correlations across global regions. To address the accuracy drop of deepfake detection models in cross-compression and cross-dataset scenarios, we design an attention mechanism and its embedding scheme and combine them with a data augmentation strategy, improving the model's robustness across compression rates and datasets.
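The fusion of the two branches described above can be illustrated with a minimal numpy sketch of scaled dot-product cross-attention, where queries come from the spatial branch and keys/values from the frequency branch. This is not the paper's implementation; the projection weights, dimensions, and function names are illustrative assumptions (a trained model would learn the projections).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_spatial, x_freq, d_k=64, seed=0):
    """Cross-attention sketch: each spatial-branch token attends
    over all frequency-branch tokens (hypothetical projections)."""
    rng = np.random.default_rng(seed)
    d = x_spatial.shape[-1]
    # random stand-ins for learned projection matrices
    w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    q = x_spatial @ w_q                       # (n_spatial, d_k)
    k = x_freq @ w_k                          # (n_freq, d_k)
    v = x_freq @ w_v                          # (n_freq, d_k)
    attn = softmax(q @ k.T / np.sqrt(d_k))    # (n_spatial, n_freq)
    return attn @ v                           # (n_spatial, d_k)

# toy token sequences standing in for the two branch outputs
spatial_tokens = np.random.default_rng(1).standard_normal((49, 128))
freq_tokens = np.random.default_rng(2).standard_normal((49, 128))
fused = cross_attention(spatial_tokens, freq_tokens)
print(fused.shape)  # (49, 64)
```

The key design point mirrored here is that information flows between domains: the attention weights are computed across branches rather than within one.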
Result
On the four datasets of FaceForensics++, our method is compared with nine other methods in terms of cross-compression-rate accuracy. In the cross-compression-rate experiments, our method achieves detection accuracies of 90.35%, 71.79% and 80.71% on Deepfake, Face2Face and Neural Textures forged images, respectively, outperforming the compared algorithms. In the cross-dataset experiments, our model also outperforms the other methods, with training time on the same device substantially reduced.
Conclusion
The proposed joint model combines the advantages of convolutional neural networks and the Vision Transformer, exploits the detection characteristics of features from different domains together with attention and data augmentation mechanisms, and improves deepfake detection under cross-compression and cross-dataset settings, making the model more accurate and efficient.
Objective
Research on deepfake detection has recently become a hot topic, aiming to identify fake videos synthesized by deep forgery techniques and spread on social networks such as WeChat, Instagram and TikTok. Most existing methods extract forged features with a convolutional neural network (CNN) and determine the final classification score with a classifier over those features. When facing low-quality or highly compressed forged videos, these methods improve detection performance by extracting deeper spatial-domain information. However, the forged traces left in the spatial domain diminish with compression, and local features tend to become similar, which degrades performance severely. This motivates us to retain the frequency-domain information of forged image artifacts as an additional forensic clue, since it suffers less interference from JPEG compression. CNN-based spatial feature extraction captures facial artifacts by stacking convolutions, but its receptive field is limited: it excels at modelling local information while ignoring relationships between distant pixels. The Transformer, with its strength in long-range dependency modelling in natural language processing and computer vision tasks, is therefore employed to model relationships between image regions and to compensate for the CNN's weakness in acquiring global information. However, the Transformer can only process sequence inputs, so it still requires the cooperation of a convolutional neural network in computer vision tasks.
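Why compression interferes less with a frequency-domain representation can be seen in a minimal numpy sketch of JPEG-style DCT quantization (not from the paper; the quantization step map is a toy assumption). Coarser steps at higher frequencies erase fine spatial detail, while the band structure of the spectrum stays explicit and analyzable.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

n = 8
C = dct_matrix(n)
block = np.random.default_rng(0).uniform(0, 255, size=(n, n))  # toy 8x8 pixel block
coeffs = C @ block @ C.T                                       # 2-D DCT spectrum

# toy JPEG-style quantization: step grows with frequency index,
# so high-frequency coefficients are rounded away more aggressively
step = 1.0 + 4.0 * np.add.outer(np.arange(n), np.arange(n))
quantized = np.round(coeffs / step) * step

# per-coefficient quantization error is bounded by half the step,
# i.e. low-frequency content is preserved far more precisely
print(np.all(np.abs(quantized - coeffs) <= step / 2))  # True
```

The orthonormal basis makes the transform exactly invertible (`C.T @ coeffs @ C` recovers the block), which is why band-wise clues can be extracted without losing information.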
Method
First, we develop a novel joint detection model that leverages the advantages of both the CNN and the Transformer and enriches the feature representation with frequency-domain information. EfficientNet-b0 serves as the feature extractor. In the spatial feature extraction stage, to capture more forensic features, an attention module is embedded in a shallow layer, and the deep features are multiplied by the activation map produced by this module. In the frequency-domain feature extraction stage, to better learn frequency-domain features, we use the discrete cosine transform as the frequency-domain transform and add an adaptive component to the frequency-band decomposition. During training, we adopt mixed-precision training to accelerate memory-efficient training. Then, to construct the joint model, we link the two feature extraction branches to a modified Transformer structure. The Transformer encoder models inter-region feature correlations through global self-attention, and cross attention is computed between the branches on the basis of a cross-attention structure to further realize information interaction between the dual-domain features. Furthermore, we design and implement a random data augmentation strategy that is coordinated with the attention mechanism to improve the detection accuracy of the model in cross-compression-rate and cross-dataset scenarios.
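The frequency-band decomposition with an adaptive component mentioned above can be sketched as follows. This is a hedged numpy illustration, not the paper's implementation: the band edges, the number of fixed bands, and the use of a given weight map in place of a learnable mask are all assumptions.

```python
import numpy as np

def dct2(x):
    """Orthonormal 2-D DCT-II via a separable basis matrix."""
    n = x.shape[0]
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c @ x @ c.T

def band_masks(n, n_bands=3):
    """Fixed low/mid/high masks, partitioned by diagonal frequency index."""
    idx = np.add.outer(np.arange(n), np.arange(n))
    edges = np.linspace(0, idx.max() + 1, n_bands + 1)
    return [(idx >= edges[i]) & (idx < edges[i + 1]) for i in range(n_bands)]

def decompose(img, adaptive_mask=None):
    """Split the DCT spectrum into fixed bands plus one adaptive band.
    `adaptive_mask` stands in for the learnable component described in
    the abstract; here it is simply a given soft weight map."""
    d = dct2(img)
    bands = [d * m for m in band_masks(img.shape[0])]
    if adaptive_mask is not None:
        bands.append(d * adaptive_mask)  # assumption: learned, sigmoid-weighted in the real model
    return bands

img = np.random.default_rng(0).standard_normal((32, 32))
soft = 1 / (1 + np.exp(-np.random.default_rng(1).standard_normal((32, 32))))
bands = decompose(img, adaptive_mask=soft)
print(len(bands))  # 4
```

The fixed masks partition the spectrum exactly (their sum reconstructs the full DCT), while the adaptive band lets the network re-weight frequencies that the hand-picked edges would miss.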
Result
Our joint model is compared with nine state-of-the-art deepfake detection methods on two datasets, FaceForensics++ (FF++) and Celeb-DF. In the cross-compression-rate experiments on FF++, our detection accuracy reaches 90.35%, 71.79% and 80.71% on Deepfakes, Face2Face and NeuralTextures (NT) manipulated images, respectively. In the cross-dataset experiments, i.e., training on FaceForensics++ and testing on Celeb-DF, our model likewise outperforms the compared methods, and its training time is substantially reduced.
Conclusion
The experiments demonstrate that the proposed joint model improves cross-dataset and cross-compression-rate detection accuracy. It takes advantage of EfficientNet and the Transformer and combines the characteristics of different domain features with attention and data augmentation mechanisms, making the model more accurate and efficient.
Keywords: deepfake detection; convolutional neural network (CNN); Vision Transformer (ViT); spatial domain; frequency domain