Anti-forensics method for forged speech guided by a time-frequency adversarial network

袁程胜; 陈益飞; 张雪原; 刘庆程; 李欣亭; 夏志华

doi:10.12395/0371-0025.2025236

Time-frequency adversarial network-guided method for anti-forensics for forged speech

Abstract

Existing speech anti-forensics methods struggle to balance perturbation stealthiness against attack efficiency. To address this, the paper builds a time-frequency adversarial speech generation framework on a generative adversarial network and proposes a speech anti-forensics method based on it. By precisely controlling the distribution of the speech signal's features in the time-frequency domain, the method attacks speech recognition systems with a high success rate while keeping the perturbation imperceptible to human listeners. First, because existing adversarial attack methods lack precise control over frequency-domain perturbations, a frequency-domain perturbation generation module is designed. It fuses the discrete cosine transform and its inverse with residual convolution across domains, enabling directional optimization of the frequency-domain energy distribution and raising the attack success rate of adversarial examples against speech recognition systems. Second, because traditional time-domain perturbation methods tend to introduce abrupt noise artifacts, a time-domain perturbation generation module is proposed. It uses depthwise separable convolutions and a dynamic gating mechanism to jointly optimize the time-domain waveform, removing abrupt points from the signal while maintaining attack strength and thereby improving the perceived audio quality of the adversarial speech. Finally, to balance attack strength against imperceptibility, a multi-objective joint loss function is constructed. It dynamically weights and combines the adversarial loss, a signal-to-noise ratio constraint loss, and the CW loss, so that the discriminator and the target surrogate model are optimized jointly. Experiments on the Speech Commands dataset show that the proposed method achieves an attack success rate of 99.39% on the WideResNet model, an improvement of 2.97% over the existing TSSA method. The speech intelligibility measure STOI reaches 0.80, indicating good imperceptibility.
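
The abstract names the three terms of the joint loss but not their exact form. The sketch below is only an illustration of how such a combination could look, under several assumptions that do not come from the paper: the CW term is the untargeted Carlini-Wagner logit margin, the signal-to-noise ratio constraint is a hinge penalty below a hypothetical `target_db` threshold, and the weights `w_adv`, `w_snr`, and `w_cw` are fixed numbers rather than the dynamic weighting the paper describes. All function and parameter names are hypothetical.

```python
import math

def snr_db(clean, adversarial):
    """Signal-to-noise ratio in dB, treating (adversarial - clean) as the noise."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((a - x) ** 2 for x, a in zip(clean, adversarial))
    if noise_power == 0.0:
        return math.inf           # no perturbation at all
    if signal_power == 0.0:
        return -math.inf          # silent input, any perturbation dominates
    return 10.0 * math.log10(signal_power / noise_power)

def snr_penalty(clean, adversarial, target_db=20.0):
    """Hinge penalty (assumed form): zero once the SNR reaches target_db."""
    return max(0.0, target_db - snr_db(clean, adversarial))

def cw_margin_loss(logits, true_label, kappa=0.0):
    """Untargeted Carlini-Wagner margin (assumed variant).

    Positive while the true class still has the largest logit;
    clipped at -kappa once the attack succeeds with that confidence.
    """
    best_other = max(z for i, z in enumerate(logits) if i != true_label)
    return max(logits[true_label] - best_other, -kappa)

def joint_loss(adv_loss, clean, adversarial, logits, true_label,
               w_adv=1.0, w_snr=0.1, w_cw=1.0):
    """Weighted sum of the three terms; the weights here are fixed, not dynamic."""
    return (w_adv * adv_loss
            + w_snr * snr_penalty(clean, adversarial)
            + w_cw * cw_margin_loss(logits, true_label))

if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    adversarial = [1.1, -1.0, 1.0, -1.0]
    surrogate_logits = [2.0, 0.5, 1.0]   # made-up surrogate model output
    print(snr_db(clean, adversarial))
    print(joint_loss(0.3, clean, adversarial, surrogate_logits, true_label=0))
```

In this form, lowering the CW term pushes the surrogate model away from the true class, while the SNR penalty only activates when the perturbation becomes loud enough to drop below the threshold, which is one simple way to express the trade-off between attack strength and imperceptibility.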
