Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

HAO Xiaoyang, ZHANG Pengyuan. Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder[J]. ACTA ACUSTICA, 2022, 47(3): 405-416. DOI: 10.15949/j.cnki.0371-0025.2022.03.004

Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder

Abstract: Speaker adaptation and speaker labels are two common approaches to multi-speaker speech synthesis. A model obtained by speaker adaptation can only synthesize speech for the adapted speaker and is not robust enough, while conventional speaker labels require speaker information obtained with supervision and cannot be learned from the speech itself in an unsupervised manner. To solve these problems, an autoregressive multi-speaker framework based on a variational autoencoder is proposed. First, speaker information is learned unsupervisedly by the variational autoencoder and encoded into speaker labels. Then, the speaker labels are fed, together with linguistic features, into an autoregressive acoustic model. In addition, the acoustic model adopts multi-task learning to avoid over-fitting of the fundamental frequency. Pre-experiments show that the autoregressive network structure decreases cepstral distortion by 1.018 dB, and multi-task learning reduces the root mean square error of the fundamental frequency by 6.861 Hz. In the subsequent comparative experiments on three multi-speaker speech synthesis sub-tasks, the proposed method achieves Mean Opinion Scores (MOS) of 3.71, 3.55, and 3.15 and Pinyin Error Rates of 6.71%, 7.54%, and 9.87%, respectively, showing that it markedly improves the quality of the synthesized speech.
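As a rough illustration of the pipeline the abstract describes (a minimal sketch; all function names, weights, and dimensions here are hypothetical and not taken from the paper), a VAE-style speaker encoder can map utterance-level features to a latent speaker label via the reparameterization trick, and that label can then be broadcast over frames and concatenated with the linguistic features that condition the acoustic model:

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_encoder(utterance_feats, w_mu, w_logvar):
    """VAE-style encoder: mean-pool utterance features, predict the
    latent mean/log-variance, and sample a speaker label with the
    reparameterization trick. Returns the label and the KL term
    that regularizes the latent space toward N(0, I)."""
    pooled = utterance_feats.mean(axis=0)        # (feat_dim,)
    mu = pooled @ w_mu                           # (latent_dim,)
    logvar = pooled @ w_logvar                   # (latent_dim,)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps          # sampled speaker label
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return z, kl

def condition_acoustic_input(linguistic_feats, speaker_label):
    """Tile the utterance-level speaker label to every frame and
    concatenate it with the frame-level linguistic features."""
    frames = linguistic_feats.shape[0]
    tiled = np.tile(speaker_label, (frames, 1))
    return np.concatenate([linguistic_feats, tiled], axis=1)

# Hypothetical dimensions, for illustration only
feat_dim, latent_dim, ling_dim, n_frames = 80, 16, 300, 50
w_mu = rng.standard_normal((feat_dim, latent_dim)) * 0.01
w_logvar = rng.standard_normal((feat_dim, latent_dim)) * 0.01

utterance = rng.standard_normal((n_frames, feat_dim))   # e.g. spectral frames
linguistic = rng.standard_normal((n_frames, ling_dim))

z, kl = speaker_encoder(utterance, w_mu, w_logvar)
conditioned = condition_acoustic_input(linguistic, z)
print(conditioned.shape)  # (50, 316): linguistic features + speaker label
```

In a real system the encoder and the projections would be trained networks, and the conditioned features would drive an autoregressive decoder; the point of the sketch is only the sampling and conditioning flow.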
