Noise-robust voice conversion with the fusion of Mel-spectrum enhancement and feature disentanglement
Abstract
This paper proposes MENR-VC, a novel noise-robust voice conversion model that combines Mel-spectrum enhancement with feature disentanglement. The model extracts speech content, fundamental frequency, and speaker identity features with three encoders, and introduces mutual information as a correlation metric: by minimizing the mutual information between features, it decouples them and achieves speaker identity conversion. To overcome the limitations of noisy speech, the model employs a deep complex recurrent convolutional network to enhance the noisy Mel spectrum, which serves as input to the speaker encoder; a Mel-spectrum enhancement loss term is also added to the overall training loss. Simulation results show that the proposed model outperforms comparable state-of-the-art noise-robust voice conversion methods, improving the mean opinion scores for naturalness and speaker similarity of the converted speech by 0.12 and 0.07, respectively. Furthermore, training converges readily even when noisy speech is used as training data, and the quality of the converted speech remains satisfactory.
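The abstract does not specify how mutual information is estimated in MENR-VC, so the following is only an illustrative sketch (not the paper's method) of why mutual information works as a correlation metric for disentanglement. For jointly Gaussian variables, MI has the closed form I(X; Y) = -0.5 ln(1 - rho^2), so driving the dependence between two feature streams toward zero drives their MI toward zero. The feature names below are hypothetical:

```python
import numpy as np

def gaussian_mi(x, y):
    """Mutual information (in nats) between two jointly Gaussian 1-D
    variables, computed from their sample correlation coefficient:
    I(X; Y) = -0.5 * ln(1 - rho^2)."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

rng = np.random.default_rng(0)
content = rng.standard_normal(10_000)       # stand-in "content" feature
residual = rng.standard_normal(10_000)      # independent residual

# A "speaker" feature that leaks content information (entangled) versus
# one that carries none of it (decoupled).
entangled_speaker = 0.8 * content + 0.6 * residual
decoupled_speaker = residual

print(f"MI before decoupling: {gaussian_mi(content, entangled_speaker):.3f} nats")
print(f"MI after decoupling:  {gaussian_mi(content, decoupled_speaker):.3f} nats")
```

Minimizing such an MI term during training, as the abstract describes, pushes the speaker representation toward the decoupled case, so that speaker identity can be swapped without disturbing content or fundamental frequency.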