Abstract:
Based on the source-filter model, a hybrid parametric model for voice conversion is presented. It consists of a voiced acoustic model, an unvoiced acoustic model, and a prosody compensation model, and is built by statistical learning. The voiced acoustic model uses linear prediction analysis and mel-cepstrum analysis to characterize the resonance of the speaker's vocal tract. The unvoiced acoustic model uses linear prediction and noise-source modeling to reflect the characteristics of the speaker's unvoiced speech. The prosody compensation model is trained by statistical learning and characterizes the distributions of pitch, energy, and duration separately. An algorithm based on the hybrid parametric model is proposed and applied to voice conversion of Mandarin syllables. Experiments demonstrate that the proposed algorithm not only improves the articulation and intelligibility of the converted speech but also significantly reduces the perceptual distance between the target and converted speech. Formal listening tests also show that the prosodic features of the target speakers are present in the converted speech.
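As an illustration of what a pitch component of a prosody compensation model might look like, the following is a minimal sketch that assumes a simple mean and variance matching of log-F0 between source and target speakers. The abstract does not specify the transformation actually used, so the function names (`fit_log_f0_stats`, `convert_f0`) and the Gaussian log-F0 assumption are hypothetical and serve only to make the idea of matching pitch distributions concrete.

```python
import math
import statistics


def fit_log_f0_stats(f0_values):
    """Return (mean, std) of log-F0 over voiced frames (F0 > 0).

    Hypothetical helper: assumes unvoiced frames are marked with F0 <= 0.
    """
    logs = [math.log(f) for f in f0_values if f > 0]
    return statistics.fmean(logs), statistics.pstdev(logs)


def convert_f0(f0_values, src_stats, tgt_stats):
    """Map source F0 values onto the target speaker's log-F0 statistics.

    Assumption (not from the paper): log-F0 is shifted and scaled so that
    its mean and standard deviation match those of the target speaker.
    """
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    # Fall back to a pure shift if the source pitch is constant.
    scale = tgt_std / src_std if src_std > 0 else 1.0
    converted = []
    for f in f0_values:
        if f <= 0:
            converted.append(0.0)  # keep unvoiced frames unvoiced
        else:
            log_f = tgt_mean + scale * (math.log(f) - src_mean)
            converted.append(math.exp(log_f))
    return converted
```

Under these assumptions, the statistics would be estimated once per speaker from training data and then applied frame by frame at conversion time. Energy and duration could be treated analogously with their own statistics.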