Abstract:
For multilingual speech synthesis, the scarcity of single-speaker multilingual data makes it very difficult for one voice to support synthesis in multiple languages simultaneously. Unlike previous methods that only decouple timbre and pronunciation within acoustic models, this paper proposes an end-to-end multilingual speech synthesis method with cross-speaker prosody transfer, which uses a two-level hierarchical conditional variational auto-encoder to directly model text-to-waveform generation and to decouple timbre, pronunciation, and prosody. The method improves the prosody of cross-lingual synthesis by transferring the prosody style of existing speakers in the target language. Experiments show that the proposed model achieves mean opinion scores of 3.91 and 4.01 for naturalness and similarity, respectively, in cross-lingual speech generation. Objective metrics also show that the method reduces the word error rate to 5.85%, lower than the baselines. In addition, prosody-transfer and ablation experiments further confirm the effectiveness of the proposed method.