Enhancing speaker diversity through a dual-optimizing loop between text-to-speech and speaker recognition
Abstract: To address the scarcity of speaker diversity in real-world speech data, and to enable iterative bidirectional optimization of multi-speaker speech synthesis and speaker recognition models using generated speech data, a method called the dual optimization loop (DOL) is proposed. The DOL combines the data generation capability of the multi-speaker speech synthesis model with the feature filtering capability of the speaker recognition model. Starting from a limited amount of manually labeled speech data, it iteratively increases speaker diversity by generating and filtering speech data, and in doing so iteratively optimizes both models in the loop. Experiments on Aishell-1, Aishell-3, MagicData-READ, and LibriTTS show that the method can be combined with conventional data augmentation algorithms, effectively expands the speaker diversity of the speech data, and significantly improves the generalization ability of the multi-speaker speech synthesis model and the discriminative ability of the speaker recognition model.
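The generate-and-filter loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the models, the filtering criterion, or any thresholds. Here speakers are represented by plain vectors, `generate_candidates` is a hypothetical stand-in for the speech synthesis model, `filter_novel` is a hypothetical stand-in for the speaker recognition filter (assumed, for illustration only, to reject candidates whose cosine similarity to an existing speaker is too high), and the retraining step is left as a comment.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def generate_candidates(rng, n, dim):
    """Hypothetical stand-in for the TTS model: n random 'speaker embeddings'."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def filter_novel(pool, candidates, max_sim):
    """Hypothetical stand-in for the speaker recognition filter.

    Keeps a candidate only if it is less similar than max_sim to every
    speaker already in the pool and to every candidate kept so far.
    """
    kept = []
    for cand in candidates:
        if all(cosine(cand, ref) < max_sim for ref in pool + kept):
            kept.append(cand)
    return kept


def dual_loop(seed_pool, rounds=3, per_round=20, max_sim=0.5, seed=0):
    """Iteratively grow the speaker pool by generating and filtering."""
    rng = random.Random(seed)
    pool = list(seed_pool)
    dim = len(pool[0])
    history = [len(pool)]  # pool size before the first round and after each round
    for _ in range(rounds):
        candidates = generate_candidates(rng, per_round, dim)
        pool.extend(filter_novel(pool, candidates, max_sim))
        # Assumption: in the actual method, both models would be
        # re-trained on the enlarged data at this point.
        history.append(len(pool))
    return pool, history


if __name__ == "__main__":
    seeds = [[1.0] + [0.0] * 7, [0.0, 1.0] + [0.0] * 6]
    pool, history = dual_loop(seeds)
    print("pool size per round:", history)
```

In this sketch the pool can only grow or stay the same size from round to round, and every retained pair of speakers stays below the similarity threshold. The threshold, the dimensionality, and the number of rounds are arbitrary illustrative values.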