Whisper-to-normal speech conversion using deep convolutional neural networks
Abstract: Whispering is a special mode of phonation, and converting whispered speech to normal speech is a key way to improve its quality and intelligibility. To make full use of the frequency-domain and time-domain correlations of speech in this conversion, we propose a deep convolutional neural network (DCNN). Its convolutional layers extract frequency-domain and time-domain correlation features from the spectral envelopes of consecutive speech frames, while its fully connected layers fit the mapping between the whisper features extracted by the convolutional layers and the corresponding normal speech. Experimental results show that, compared with a deep neural network (DNN) model, the DCNN model reduces the mel-cepstral distortion (Cepstral Distance, CD) of the converted speech by 4.64% and increases the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Mean Opinion Score (MOS) by 5.41%, 5.77%, and 9.68%, respectively.
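The two-stage structure described in the abstract (convolution over consecutive frames of the spectral envelope, followed by fully connected layers that map to normal speech) can be sketched as a forward pass. This is only a minimal illustration under assumed settings, not the authors' configuration: the context window of 9 frames, the 40-dimensional envelope, the single convolutional layer with 8 kernels of size 3x3, the 64-unit hidden layer, the ReLU activations, and the random weights are all hypothetical, and the function names are invented for this example.

```python
import numpy as np


def conv2d_valid(x, kernels):
    """Valid 2-D cross-correlation of x (T, F) with kernels (K, kt, kf).

    Returns an array of shape (K, T - kt + 1, F - kf + 1).
    """
    _, kt, kf = kernels.shape
    # Every (kt, kf) time-frequency patch of the input: (T-kt+1, F-kf+1, kt, kf)
    patches = np.lib.stride_tricks.sliding_window_view(x, (kt, kf))
    return np.einsum("tfij,kij->ktf", patches, kernels)


def relu(x):
    return np.maximum(x, 0.0)


def init_params(rng, n_frames=9, n_bins=40, n_kernels=8, ksize=3, n_hidden=64):
    """Random weights; all sizes are illustrative assumptions."""
    flat = n_kernels * (n_frames - ksize + 1) * (n_bins - ksize + 1)
    return {
        "kernels": rng.normal(0.0, 0.1, (n_kernels, ksize, ksize)),
        "w1": rng.normal(0.0, 0.01, (flat, n_hidden)),
        "b1": np.zeros(n_hidden),
        "w2": rng.normal(0.0, 0.01, (n_hidden, n_bins)),
        "b2": np.zeros(n_bins),
    }


def dcnn_forward(params, whisper_frames):
    """Map a (n_frames, n_bins) block of whisper envelopes to one output frame."""
    # Convolutional stage: joint time-frequency features across consecutive frames
    feats = relu(conv2d_valid(whisper_frames, params["kernels"]))
    # Fully connected stage: regress the normal-speech envelope from those features
    hidden = relu(feats.reshape(-1) @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]


rng = np.random.default_rng(0)
params = init_params(rng)
whisper_frames = rng.normal(size=(9, 40))  # stand-in for real whisper features
predicted = dcnn_forward(params, whisper_frames)
print(predicted.shape)  # (40,)
```

Because each kernel spans several frames and several frequency bins at once, every convolutional feature depends on both time-domain and frequency-domain context. In practice the weights would be trained on time-aligned whisper and normal speech pairs, which this sketch omits.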