A time-frequency adversarial network-guided anti-forensics method for forged speech
-
Abstract
To address the imbalance between perturbation stealthiness and attack efficiency in existing speech anti-forensics methods, this paper builds a time-frequency adversarial speech generation framework on a generative adversarial network (GAN) and proposes a speech anti-forensics method based on it. By controlling the distribution of the speech signal's time-frequency features, the method achieves a high attack success rate while keeping the perturbation imperceptible to the human ear. First, because existing methods struggle to control frequency-domain perturbations precisely, a frequency-domain perturbation generation module is designed. It combines the discrete cosine transform (DCT) and its inverse with cross-domain residual convolution to steer the frequency energy distribution in a chosen direction, which raises the attack success rate of adversarial examples against speech recognition systems. Second, because traditional time-domain perturbation methods introduce abrupt noise artifacts, a time-domain perturbation generation module is introduced. It uses depthwise separable convolutions together with a dynamic gating mechanism to optimize the time-domain waveform, removing signal spikes while maintaining attack strength and thereby improving the signal quality of the adversarial speech. Finally, to balance attack strength against imperceptibility, a multi-objective joint loss function is constructed. It dynamically weights and combines an adversarial loss, a signal-to-noise ratio (SNR) constraint loss, and a Carlini-Wagner (CW) loss, so that the discriminator and the target surrogate model are jointly optimized. Experiments on the Speech Commands dataset show that the proposed method achieves a 99.39% attack success rate on the WideResNet model, an improvement of 2.97% over the existing TSSA method. The speech intelligibility score reaches 0.80, indicating good imperceptibility.
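As a rough illustration of two ideas named in the abstract (perturbing a signal in the DCT domain, and combining an adversarial term, an SNR constraint, and a CW margin term into one objective), the following is a minimal sketch in plain Python. It is not the paper's implementation: the proposed modules are learned networks, whereas here the perturbation is hand-written, the weights are fixed rather than dynamic, and the names `snr_loss`, `joint_loss`, `target_db`, and `weights` are hypothetical choices made only for this example.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def snr_db(clean, adv):
    """SNR in dB, treating (adv - clean) as the noise."""
    signal = sum(v * v for v in clean)
    noise = sum((a - v) ** 2 for v, a in zip(clean, adv))
    return 10.0 * math.log10(signal / (noise + 1e-12))

def snr_loss(clean, adv, target_db=20.0):
    """Hinge penalty that vanishes once the SNR reaches target_db.
    The threshold is an assumed value, not one taken from the paper."""
    return max(0.0, target_db - snr_db(clean, adv))

def cw_loss(logits, true_label, kappa=0.0):
    """Untargeted Carlini-Wagner margin on surrogate-model logits:
    drops below zero once another class outscores the true label."""
    best_other = max(v for i, v in enumerate(logits) if i != true_label)
    return max(logits[true_label] - best_other, -kappa)

def joint_loss(adv_loss, clean, adv, logits, true_label, weights=(1.0, 0.1, 1.0)):
    """Weighted sum of the three terms. Fixed weights are an assumption;
    the abstract describes dynamic weighting without giving its form."""
    w_adv, w_snr, w_cw = weights
    return (w_adv * adv_loss
            + w_snr * snr_loss(clean, adv)
            + w_cw * cw_loss(logits, true_label, kappa=0.0))

# Usage: nudge one DCT coefficient, return to the time domain, score the result.
clean = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
coeffs = dct(clean)
coeffs[-1] += 0.05                      # small high-frequency perturbation
adv = idct(coeffs)
print("SNR (dB):", snr_db(clean, adv))
print("joint loss:", joint_loss(0.3, clean, adv, [2.0, 0.5, 1.0], 0))
```

Because the transform is orthonormal, the perturbation energy added in the DCT domain equals the noise energy in the waveform, which is why a frequency-domain change can be budgeted directly against an SNR constraint.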
-
-