Indexed by EI / SCOPUS / CSCD

Chinese Core Journal

Noise-Robust Voice Conversion with the Fusion of Mel-Spectrum Enhancement and Feature Disentanglement

Abstract: This paper proposes MENR-VC, a noise-robust voice conversion model that fuses Mel-spectrum enhancement with feature disentanglement. The model uses three encoders to extract speech-content, fundamental-frequency, and speaker-identity vector features, and introduces mutual information as a correlation metric: minimizing the mutual information between these features disentangles them and enables speaker-identity conversion. To improve the spectral quality of noisy speech, the model employs a deep complex recurrent convolutional network to enhance the noisy Mel spectrum, which is then fed to the speaker encoder; in addition, a Mel-spectrum enhancement loss term is introduced during training to improve the model's overall loss function. Simulation results show that, compared with the best existing noise-robust voice conversion methods, the proposed model raises the mean opinion scores of the converted speech by 0.12 for naturalness and 0.07 for speaker similarity. The model thereby resolves the problem that training a voice conversion model on noisy speech makes the deep neural network difficult to converge and sharply degrades converted-speech quality.
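The abstract describes an overall objective that combines a reconstruction loss, a mutual-information (MI) penalty that decouples the three encoder outputs, and the added Mel-spectrum enhancement loss. The paper's actual MI term is a neural estimator; as a minimal, self-contained sketch, the snippet below stands in a simple histogram-based MI estimate and combines the terms with hypothetical weights `w_mi` and `w_enh` (both names and values are assumptions, not from the paper):

```python
import numpy as np

def histogram_mi(x, y, bins=16):
    """Toy MI estimate (in nats) between two 1-D feature sequences,
    via a joint histogram. A stand-in for the paper's neural MI
    estimator; minimizing this kind of term pushes encoder outputs
    toward statistical independence (disentanglement)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def total_loss(recon, mi_pairs, enh, w_mi=0.1, w_enh=1.0):
    """Overall objective sketched from the abstract: reconstruction
    loss + weighted MI terms over pairs of encoder outputs
    (content/pitch/speaker) + the Mel-spectrum enhancement loss.
    The weights are hypothetical placeholders."""
    return recon + w_mi * sum(mi_pairs) + w_enh * enh
```

In this sketch, identical feature sequences yield a large MI value while independent ones stay near zero, so gradient steps that shrink the MI terms drive the content, pitch, and speaker features apart, which is the disentanglement behavior the abstract relies on.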

     
