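Speech deepfake technology uses deep learning methods to synthesize or generate speech. The rapid iteration and optimization of artificial intelligence-generated content technology has markedly improved the naturalness, fidelity, and diversity of forged speech, and at the same time poses great challenges for speech deepfake detection. This paper comprehensively reviews recent research progress on speech deepfake and its detection techniques. First, it introduces forgery techniques represented by speech synthesis (SS) and voice conversion (VC). Then, it introduces the datasets and evaluation metrics commonly used in speech deepfake detection. On this basis, it categorizes and analyzes existing speech deepfake detection techniques in depth along the processing pipeline: data augmentation, feature extraction and optimization, and learning mechanisms. Specifically, from the perspective of data augmentation, it analyzes how speech noise addition, mask augmentation, channel augmentation, and compression augmentation affect detection performance; from the perspective of feature extraction and optimization, it compares the strengths and weaknesses of handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based detection; and from the perspective of learning mechanisms, it discusses training approaches including self-supervised learning, adversarial training, and multi-task learning. Finally, it summarizes the open challenges in speech deepfake detection and outlines future research directions. The datasets and code collected in this paper are available at https://github.com/media-sec-lab/Audio-Deepfake-Detection.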
Research progress on speech deepfake and its detection techniques
Xu Yuxiong, Li Bin, Tan Shunquan, Huang Jiwu (Shenzhen University)
Speech deepfake technology, which utilizes deep learning methods to synthesize or generate speech, has emerged as a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content technology have significantly advanced speech deepfake techniques, markedly enhancing the naturalness, fidelity, and diversity of synthesized speech. However, these advances have also presented great challenges for speech deepfake detection technology. In response to these challenges, this paper comprehensively reviews recent research progress on speech deepfake generation and its detection techniques. Based on an extensive literature survey, this paper first introduces the research background of speech forgery and its detection, and compares and analyzes previously published reviews in this field. Secondly, this paper provides a concise overview of speech deepfake generation, focusing on speech synthesis (SS) and voice conversion (VC). SS, commonly known as text-to-speech (TTS), analyzes text and generates speech that aligns with the provided input by applying linguistic rules for text description. Various deep models are employed in TTS, including sequence-to-sequence (Seq2Seq) models, flow models, generative adversarial network (GAN) models, variational auto-encoder (VAE) models, and diffusion models. VC modifies acoustic properties of speech, such as emotion, accent, pronunciation, and speaker identity, to produce natural, human-like speech. Depending on the number of target speakers, VC algorithms can be categorized as single-target, multiple-target, and arbitrary-target speech conversion. Thirdly, this paper briefly introduces commonly used datasets in speech deepfake detection and provides access links to the open-source datasets. It also introduces two commonly used evaluation metrics in speech deepfake detection: equal error rate (EER) and tandem detection cost function (t-DCF).
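As an illustration of the EER metric mentioned above, the following is a minimal sketch rather than the official ASVspoof scoring code. It assumes that higher scores mean "more likely bona fide", and the function name `compute_eer` is hypothetical. The EER is approximated as the operating point at which the false acceptance rate (spoofed speech accepted) and the false rejection rate (bona fide speech rejected) are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold); assumes higher scores mean 'more bona fide'."""
    # Every observed score is tried as a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

On a finite score set the two error curves may not cross exactly, so this sketch reports the average of the two rates at the closest point; production scoring tools typically interpolate instead.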
Additionally, this paper analyzes and categorizes the existing speech deepfake detection techniques in detail, and studies and compares the pros and cons of the different techniques in depth, mainly from the perspectives of data augmentation, feature extraction and optimization, and learning mechanisms. Notably, this paper summarizes the experimental results of existing detection techniques on the ASVspoof 2019 and 2021 datasets in tabular form. Within this context, the primary focus of this paper is to investigate the generality of current detection techniques in the field of speech deepfake detection without focusing on specific forgery attack methods. Data augmentation (DA) applies a series of transformations to the original speech data and can be roughly divided into speech noise addition, mask enhancement, channel enhancement, and compression enhancement. Among them, speech noise addition is one of the most common; it perturbs the speech signal with added noise to simulate, as closely as possible, the complex acoustic environments of real scenarios. Mask enhancement applies masking operations to the time or frequency domain of speech to improve the accuracy and robustness of speech detection techniques. Channel enhancement focuses on the signal attenuation, data loss, and noise interference caused by changes in the codec and transmission channel of speech data. Compression enhancement addresses the degradation of speech quality during data compression; the main compression formats considered are MP3, M4A, and OGG. From the feature extraction and optimization perspective, speech deepfake detection can be divided into handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based methods.
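The noise addition and masking operations described above can be sketched as follows. This is an illustrative example under stated assumptions, not the pipeline of any surveyed method: the noise is white Gaussian noise scaled to a target signal-to-noise ratio, the spectrogram is a plain list of frames, and the function names `add_noise` and `mask_spectrogram` are hypothetical.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def mask_spectrogram(spec, t_start, t_width, f_start, f_width):
    """Zero out one block of time frames and one band of frequency bins.

    spec is a list of frames, each frame a list of frequency-bin values.
    The input is left unchanged; a masked copy is returned.
    """
    masked = [frame[:] for frame in spec]
    for t, frame in enumerate(masked):
        for f in range(len(frame)):
            in_time_mask = t_start <= t < t_start + t_width
            in_freq_mask = f_start <= f < f_start + f_width
            if in_time_mask or in_freq_mask:
                frame[f] = 0.0
    return masked


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
    noisy = add_noise(tone, snr_db=10, rng=rng)
    spec = [[1.0] * 3 for _ in range(4)]
    print(mask_spectrogram(spec, t_start=1, t_width=2, f_start=0, f_width=1))
```

In training, the SNR and the mask positions and widths would normally be drawn at random for each utterance; they are fixed arguments here only to keep the sketch deterministic.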
Handcrafted features are speech features extracted with the help of prior knowledge, mainly the constant-Q transform (CQT), linear frequency cepstral coefficients (LFCC), and the mel spectrogram. In contrast, hybrid feature-based forgery detection methods use the domain knowledge provided by handcrafted features to mine richer speech representations through deep learning networks. End-to-end forgery detection methods learn feature representations and classification models directly from raw speech signals, eliminating the need for handcrafted feature extraction and allowing the model to discover discriminative features from the input data automatically. These detection techniques can be trained using a single feature. Alternatively, feature-level fusion can be employed to combine multiple features, of the same or different types, using techniques such as weighted aggregation and feature concatenation. By fusing these features, detection techniques can capture richer speech information and improve performance. For the learning mechanism, this study explores the impact of different training methods on forgery detection techniques, focusing on self-supervised learning, adversarial training, and multi-task learning. Self-supervised learning plays an important role in forgery detection by automatically generating auxiliary targets or labels from speech data to train models; a fine-tuned self-supervised pre-trained model can effectively distinguish between real and forged speech. Adversarial training-based forgery detection enhances the robustness and generalization of the model by adding adversarial samples to the training data.
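The two feature-level fusion strategies named above, feature concatenation and weighted aggregation, can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes each feature is already a fixed-length vector per utterance, and it uses fixed weights where a real detector would typically learn them.

```python
def concat_fusion(features):
    """Concatenate per-utterance feature vectors into one longer vector."""
    fused = []
    for vec in features:
        fused.extend(vec)
    return fused


def weighted_fusion(features, weights):
    """Weighted sum of equal-length feature vectors.

    Weights are normalised to sum to 1, so the fused vector stays on the
    same scale as the inputs.
    """
    dim = len(features[0])
    if any(len(vec) != dim for vec in features):
        raise ValueError("weighted fusion needs vectors of equal length")
    total = sum(weights)
    return [
        sum(w / total * vec[i] for w, vec in zip(weights, features))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Toy stand-ins for two embeddings of the same utterance.
    feat_a = [1.0, 2.0]
    feat_b = [3.0, 4.0]
    print(concat_fusion([feat_a, feat_b]))
    print(weighted_fusion([feat_a, feat_b], weights=[1, 3]))
```

Concatenation preserves all information but grows the input dimension of the classifier, whereas weighted aggregation keeps the dimension fixed at the cost of requiring features of matching shape.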
In contrast to plain binary classification, forgery detection based on multi-task learning shares the underlying feature representations across different speech-related tasks and thereby learns more comprehensive and useful speech features, which improves the model's detection performance while making effective use of the speech training data. Although speech deepfake detection techniques have achieved excellent performance on some datasets, their performance is less satisfactory on speech data from natural scenarios. By analyzing existing research, this paper concludes that the main future research directions are to establish diversified speech deepfake datasets, to study adversarial samples or data augmentation methods that enhance the robustness of speech deepfake detection techniques, to establish generalized speech deepfake detection techniques, and to explore interpretable speech deepfake detection techniques. The relevant datasets and code are linked at https://github.com/media-sec-lab/Audio-Deepfake-Detection.