Deepfake video detection with feature interaction amongst key frames
Journal of Image and Graphics, 2022, Vol. 27, No. 1: 188-202
Print publication date: 2022-01-16
Accepted: 2021-10-21
DOI: 10.11834/jig.210408
Kaiman Zhu, Wenbo Xu, Wei Lu, Xianfeng Zhao. Deepfake video detection with feature interaction amongst key frames[J]. Journal of Image and Graphics, 2022,27(1):188-202.
Objective
Deepfake is an emerging technique that manipulates images and videos by means of deep learning, and manipulation targeting face videos poses a serious threat to society and individuals. Detection methods that exploit temporal or multi-frame information are still at an early stage of research, and existing work often overlooks how the way frames are extracted from a video affects both detection performance and efficiency. For face-swap manipulated videos, this paper proposes an efficient detection framework that extracts per-frame features from multiple key frames and lets those features interact across frames.
Method
A fixed number of key frames are extracted directly from the video stream, avoiding inter-frame decoding; a convolutional neural network maps each single-frame face image to a unified feature space; multiple layers of self-attention-based encoder units, together with linear and non-linear transformations, allow the features of each frame to aggregate information from the other frames for learning and updating, and to extract the anomalies that manipulated frames exhibit in the feature space; an additional indicator token aggregates global information and makes the final detection decision.
Result
The proposed framework achieves detection accuracies above 96.79% on all three face-swap datasets in FaceForensics++, and 99.61% on the Celeb-DF dataset. Comparative experiments on detection time also confirm that using key frames as samples improves detection efficiency and that the proposed framework is highly efficient.
Conclusion
The proposed detection framework for face-swap manipulated videos reduces the computational cost and time of video-level detection by extracting key frames, maps each frame's face image to a feature space with a convolutional neural network, and uses a self-attention-based inter-frame interaction learning mechanism so that the features of different frames can attend to one another and learn discriminative information, making detection more accurate and the overall process more efficient.
Objective
With the development of deep learning, image and video manipulation has become easier to perform and harder to distinguish. Deepfake is a family of face manipulation techniques that poses a great threat to social security and individual rights. Researchers have proposed various detection models and frameworks, which can be divided into three categories according to their input: frame level, clip level, and video level. Frame-level models focus on single frames and ignore temporal information, which can lead to low confidence when detecting videos. Clip-level models exploit a sequence of frames simultaneously, but the sequence is far shorter than a whole video, so a clip cannot represent the video well; moreover, consecutive frames in a short clip differ little from one another and introduce redundant information, which may degrade detection performance. Video-level methods take frames sampled at large intervals as input and capture more of the key features that represent the whole video. However, existing methods ignore the impact of the sample extraction procedure and the expensive computation of decoding the video stream. To address this problem and provide a more efficient detection method for face-swap manipulated videos, a detection framework based on the interaction of key-frame features is presented.
Method
The proposed detection framework consists of two parts: extraction of key frames and face region images, and the detection model. First, a number of key frames are extracted directly from the video stream and validated. Because key frames can be decoded independently, inter-frame decoding is avoided and preprocessing time is reduced.
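A minimal sketch of this extraction step is given below, assuming PyAV as the demuxing library (the framework itself does not prescribe a specific tool); instructing the decoder to skip non-key frames means P- and B-frames are never decoded at all:

import av  # PyAV: Python binding to the FFmpeg libraries

def extract_key_frames(video_path, max_frames=16):
    """Decode only the key (I-) frames of a video stream."""
    frames = []
    with av.open(video_path) as container:
        stream = container.streams.video[0]
        # Skip every frame that is not a key frame, so no
        # inter-frame (P-/B-frame) decoding is performed.
        stream.codec_context.skip_frame = "NONKEY"
        for frame in container.decode(stream):
            frames.append(frame.to_ndarray(format="rgb24"))
            if len(frames) == max_frames:
                break
    return frames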
Next, a multitask cascaded convolutional network (MTCNN) locates the face region in each extracted frame, and face images are cropped with a margin of 80 pixels; MTCNN is then re-applied to these crops to obtain compact face images. The face images are mapped into a high-dimensional embedding space by Inception-ResNet-V1, which is initialized with parameters pre-trained on a face recognition task and updated end to end.
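The two-pass face extraction and the embedding step could be sketched as follows, assuming the facenet-pytorch implementations of MTCNN and Inception-ResNet-V1; the 80-pixel margin matches the description above, while the image sizes are illustrative assumptions rather than the authors' published configuration:

import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

device = "cuda" if torch.cuda.is_available() else "cpu"

# First pass: locate the face and crop it with an 80-pixel margin.
coarse_detector = MTCNN(image_size=320, margin=80, post_process=False, device=device)
# Second pass on the loose crop yields a compact face image.
refine_detector = MTCNN(image_size=160, margin=0, device=device)
# Backbone pre-trained for face recognition (VGGFace2), fine-tuned end to end.
embedder = InceptionResnetV1(pretrained="vggface2").train().to(device)

def embed_face(frame_rgb):
    """Map one key frame (H x W x 3 uint8 array) to a 512-d embedding."""
    loose = coarse_detector(frame_rgb)  # (3, 320, 320) tensor, or None
    if loose is None:
        return None
    compact = refine_detector(loose.permute(1, 2, 0).byte().cpu().numpy())
    if compact is None:
        return None
    return embedder(compact.unsqueeze(0).to(device))[0]  # shape (512,)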
Finally, the key-frame features are fed into an interaction learning module composed of several self-attention-based encoders. In this module, each key-frame feature can learn from every other key frame and update itself, and the distinctive abnormal features of manipulated images are extracted through linear and non-linear transformations. A global classification vector is prepended to the sequence of key-frame features, is updated along with them, and makes the final decision.
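A minimal sketch of such an interaction module, built here from standard PyTorch transformer encoder layers with illustrative hyperparameters (depth, heads, and dimensions are assumptions, not the authors' configuration):

import torch
import torch.nn as nn

class InteractionModule(nn.Module):
    """Self-attention over key-frame features with a global decision token."""

    def __init__(self, dim=512, heads=8, depth=4, num_classes=2):
        super().__init__()
        # Learnable global vector, prepended to the frame features.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):  # frame_feats: (batch, num_frames, dim)
        cls = self.cls_token.expand(frame_feats.size(0), -1, -1)
        x = torch.cat([cls, frame_feats], dim=1)
        x = self.encoder(x)          # every frame attends to all the others
        return self.head(x[:, 0])    # classify from the global token

# Example: real/fake logits for a batch of 2 videos, 16 key frames each.
# logits = InteractionModule()(torch.randn(2, 16, 512))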
Result
The detection framework was evaluated on five mainstream datasets: Deepfakes, FaceSwap, FaceShifter, DeepFakeDetection, and Celeb-DF, where the first three come from FaceForensics++. With a small number of key frames, it achieves accuracies of 97.50%, 97.14%, 96.79%, 97.09%, and 98.64% on these datasets, respectively. With 16 key frames as input, the proposed model was compared on Celeb-DF against standard 3D convolution models and an LSTM-based model, as well as a lightweight 3D model (L3D) designed for deepfake detection. Because the sample size is smaller than in existing work, R3D, C3D, I3D, and L3D show poor detection performance, while the LSTM-based model reaches an accuracy of 98.06%; the proposed model performs markedly better, at 99.61%. When the input is changed to consecutive frames, the proposed model still performs well, at 98.64%. The time cost of detection was also evaluated: the proposed framework detects a video in an average of 3.17 s, less than the compared models and less than using consecutive frames as input, confirming the efficiency of both the key frame extraction strategy and the framework itself. A realistic scenario was further considered in which the number of key frames per video varies. Slightly more frames than in training yields higher accuracy, since the model has learned the relations amongst frames well and generalizes to longer inputs, whereas fewer frames provide insufficient information and degrade performance. In general, the proposed model achieves good and stable detection performance when trained with 16 key frames.
Conclusion
An efficient detection framework for face-swap manipulated videos is presented. It takes advantage of key frame extraction to skip inter-frame decoding, reducing time spent in the preprocessing step. Face region images are cropped from the valid key frames and mapped into a unified embedding space by Inception-ResNet-V1, followed by several layers of self-attention-based encoders with linear and non-linear transformations. Since every frame feature can learn from the others, more meaningful and discriminative information is captured. Experiments on the Celeb-DF dataset demonstrate that the proposed model outperforms sequential models and 3D convolutional neural networks, while the time cost is reduced and the efficiency of the framework is improved.
Keywords: Deepfake detection; face-swap manipulation videos; key frames; hierarchical structure; multi-frame interaction; self-attention mechanism
Afchar D, Nozick V, Yamagishi J and Echizen I. 2018. MesoNet: a compact facial video forgery detection network//Proceedings of 2018 IEEE International Workshop on Information Forensics and Security (WIFS). Hong Kong, China: IEEE: 1-7 [DOI: 10.1109/WIFS.2018.8630761]
Bonettini N, Cannas E D, Mandelli S, Bondi L, Bestagini P and Tubaro S. 2021. Video face manipulation detection through ensemble of CNNs//Proceedings of the 25th International Conference on Pattern Recognition (ICPR). Milan, Italy: IEEE: 5012-5019 [DOI: 10.1109/ICPR48806.2021.9412711]
Cao Q, Shen L, Xie W D, Parkhi O M and Zisserman A. 2018. VGGFace2: a dataset for recognising faces across pose and age//Proceedings of the 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018). Xi'an, China: IEEE: 67-74 [DOI: 10.1109/FG.2018.00020]
Carreira J and Zisserman A. 2017. Quo Vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Chao J, Jiang X H and Sun T F. 2013. A novel video inter-frame forgery model detection scheme based on optical flow consistency//Proceedings of the 11th International Workshop on Digital Forensics and Watermarking 2012. Shanghai, China: Springer: 267-281 [DOI: 10.1007/978-3-642-40099-5_22]
Chen P, Liu J, Liang T, Zhou G Z, Gao H C, Dai J and Han J Z. 2020. FSSPOTTER: spotting face-swapped video by spatial and temporal clues//Proceedings of 2020 IEEE International Conference on Multimedia and Expo (ICME). London, UK: IEEE: 1-6 [DOI: 10.1109/ICME46284.2020.9102914]
Chen X Y, Xu C, Yang X K, Song L and Tao D C. 2019. Gated-GAN: adversarial gated networks for multi-collection style transfer. IEEE Transactions on Image Processing, 28(2): 546-560 [DOI: 10.1109/TIP.2018.2869695]
Creswell A, White T, Dumoulin V, Arulkumaran K, Sengupta B and Bharath A A. 2018. Generative adversarial networks: an overview. IEEE Signal Processing Magazine, 35(1): 53-65 [DOI: 10.1109/MSP.2017.2765202]
de Lima O, Franklin S, Basu S, Karwoski B and George A. 2020. Deepfake detection using spatiotemporal convolutional networks [EB/OL]. [2020-06-26]. https://arxiv.org/pdf/2006.14749.pdf
Dufour N and Gully A. 2019. Contributing data to deepfake detection research [EB/OL]. [2020-06-26]. https://torontoai.org/2019/09/23/contributing-data-to-deepfake-detection-research/
Durall R, Keuper M, Pfreundt F J and Keuper J. 2019. Unmasking deepfakes with simple features [EB/OL]. [2020-06-26]. https://arxiv.org/pdf/1911.00686.pdf
Ganiyusufoglu I, Ngô L M, Savov N, Karaoglu S and Gevers T. 2020. Spatio-temporal features for generalized detection of deepfake videos [EB/OL]. [2020-10-22]. https://arxiv.org/pdf/2010.11844.pdf
Güera D and Delp E J. 2018. Deepfake video detection using recurrent neural networks//Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Auckland, New Zealand: IEEE: 1-6 [DOI: 10.1109/AVSS.2018.8639163]
Guo Z Q, Yang G B, Chen J Y and Sun X M. 2020. Fake face detection via adaptive manipulation traces extraction network [EB/OL]. [2020-05-11]. https://arxiv.org/pdf/2005.04945.pdf
Korshunov P and Marcel S. 2018. DeepFakes: a new threat to face recognition? Assessment and detection [EB/OL]. [2020-06-26]. https://arxiv.org/pdf/1812.08685.pdf
Kumar A, Bhavsar A and Verma R. 2020. Detecting deepfakes with metric learning//Proceedings of the 8th International Workshop on Biometrics and Forensics (IWBF). Porto, Portugal: IEEE: 1-6 [DOI: 10.1109/IWBF49977.2020.9107962]
Lai Y C, Huang T Q and Jiang R X. 2015. Image region copy-move forgery detection based on Exponential-Fourier moments. Journal of Image and Graphics, 20(9): 1212-1221 [DOI: 10.11834/jig.20150908]
Li H D, Li B, Tan S Q and Huang J W. 2020a. Identification of deep network generated images using disparities in color components. Signal Processing, 174: #107616 [DOI: 10.1016/j.sigpro.2020.107616]
Li X R and Yu K. 2020. A Deepfakes detection technique based on two-stream network. Journal of Cyber Security, 5(2): 84-91 [DOI: 10.19363/J.cnki.cn10-1380/tn.2020.02.07]
Li Y Z, Chang M C and Lyu S. 2018. In Ictu Oculi: exposing AI created fake videos by detecting eye blinking//Proceedings of 2018 IEEE International Workshop on Information Forensics and Security (WIFS). Hong Kong, China: IEEE: 1-7 [DOI: 10.1109/WIFS.2018.8630787]
Li Y Z, Yang X, Sun P, Qi H G and Lyu S. 2020b. Celeb-DF: a large-scale challenging dataset for DeepFake forensics//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 3207-3216 [DOI: 10.1109/CVPR42600.2020.00327]
Liang T, Chen P, Zhou G Z, Gao H C, Liu J, Li Z X and Dai J. 2020. SDHF: spotting DeepFakes with hierarchical features//Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI). Baltimore, USA: IEEE: 675-680 [DOI: 10.1109/ICTAI50040.2020.00108]
Liu J R, Zhu K M, Lu W, Luo X Y and Zhao X F. 2021. A lightweight 3D convolutional neural network for deepfake detection. International Journal of Intelligent Systems, 36(9): 4990-5004 [DOI: 10.1002/int.22499]
Matern F, Riess C and Stamminger M. 2019. Exploiting visual artifacts to expose deepfakes and face manipulations//Proceedings of 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). Waikoloa, USA: IEEE: 83-92 [DOI: 10.1109/WACVW.2019.00020]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI: 10.1007/s11263-015-0816-y]
Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J and Nießner M. 2018. FaceForensics: a large-scale video dataset for forgery detection in human faces [EB/OL]. [2020-06-17]. https://arxiv.org/pdf/1803.09179.pdf
Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J and Nießner M. 2019. FaceForensics++: learning to detect manipulated facial images//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 1-11 [DOI: 10.1109/ICCV.2019.00009]
Sabir E, Cheng J X, Jaiswal A, AbdAlmageed W, Masi I and Natarajan P. 2019. Recurrent convolutional strategies for face manipulation detection in videos//Proceedings of 2019 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Long Beach, USA: IEEE: 80-87
Schroff F, Kalenichenko D and Philbin J. 2015. FaceNet: a unified embedding for face recognition and clustering//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 815-823 [DOI: 10.1109/CVPR.2015.7298682]
Suwajanakorn S, Seitz S M and Kemelmacher-Shlizerman I. 2017. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics, 36(4): #95 [DOI: 10.1145/3072959.3073640]
Szegedy C, Ioffe S, Vanhoucke V and Alemi A A. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI: 4278-4284
Thies J, Zollhofer M, Stamminger M, Theobalt C and Nießner M. 2016. Face2Face: real-time face capture and reenactment of RGB videos//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 2387-2395 [DOI: 10.1109/CVPR.2016.262]
Tolosana R, Vera-Rodriguez R, Fierrez J, Morales A and Ortega-Garcia J. 2020. Deepfakes and beyond: a survey of face manipulation and fake detection. Information Fusion, 64: 131-148 [DOI: 10.1016/j.inffus.2020.06.014]
Tran D, Bourdev L, Fergus R, Torresani L and Paluri M. 2015. Learning spatiotemporal features with 3D convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 4489-4497 [DOI: 10.1109/ICCV.2015.510]
Tran D, Wang H, Torresani L, Ray J, LeCun Y and Paluri M. 2018. A closer look at spatiotemporal convolutions for action recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6450-6459 [DOI: 10.1109/CVPR.2018.00675]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: ACM: 6000-6010
Wei W, Fan X L, Song H B and Wang H H. 2019. Video tamper detection based on multi-scale mutual information. Multimedia Tools and Applications, 78(19): 27109-27126 [DOI: 10.1007/s11042-017-5083-1]
Xu W J, Keshmiri S and Wang G H. 2019. Adversarially approximated autoencoder for image generation and manipulation. IEEE Transactions on Multimedia, 21(9): 2387-2396 [DOI: 10.1109/TMM.2019.2898777]
Xu Z P, Liu J R, Lu W, Xu B Z, Zhao X F, Li B and Huang J W. 2021. Detecting facial manipulated videos based on set convolutional neural networks. Journal of Visual Communication and Image Representation, 77: #103119 [DOI: 10.1016/j.jvcir.2021.103119]
Yang X, Li Y Z and Lyu S. 2019a. Exposing deep fakes using inconsistent head poses//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 8261-8265 [DOI: 10.1109/ICASSP.2019.8683164]
Yang X, Li Y Z, Qi H G and Lyu S. 2019b. Exposing GAN-synthesized faces using landmark locations//Proceedings of 2019 ACM Workshop on Information Hiding and Multimedia Security. Paris, France: ACM: 113-118 [DOI: 10.1145/3335203.3335724]
Yu N, Davis L and Fritz M. 2019. Attributing fake images to GANs: learning and analyzing GAN fingerprints//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7556-7566 [DOI: 10.1109/ICCV.2019.00765]
Zhang K P, Zhang Z P, Li Z F and Qiao Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10): 1499-1503 [DOI: 10.1109/LSP.2016.2603342]
Zhang Y J. 2012. Image Engineering (1): Image Processing. 3rd ed. Beijing: Tsinghua University Press
Zhang Y X, Li G, Cao Y and Zhao X F. 2020. A method for detecting human-face-tampered videos based on interframe difference. Journal of Cyber Security, 5(2): 49-72 [DOI: 10.19363/J.cnki.cn10-1380/tn.2020.02.05]
Zhao J, Guo J C, Zhang Y and Zhang Z W. 2015. Automatic detection and localization of image forgery regions based on offset estimation of double JPEG compression. Journal of Image and Graphics, 20(10): 1304-1312 [DOI: 10.11834/jig.20151003]