Speech enhancement combining accurate ratio masking and deep neural network

柏浩钧; 张天骐; 刘鉴兴; 叶绍鹏

doi:10.15949/j.cnki.0371-0025.2022.03.009

Abstract

Current supervised speech enhancement methods overlook how the similarity between the magnitude spectra of clean speech, noise, and noisy speech affects enhancement performance. To address this, a monaural speech enhancement method that combines Accurate Ratio Masking (ARM) with a Deep Neural Network (DNN) is proposed. First, using the normalized cross-correlation coefficients of the magnitude spectra between clean speech and noisy speech, and between noise and noisy speech, an accurate ratio mask based on the ideal ratio mask in the time-frequency domain is designed as the target mask. Then, a DNN whose training targets are the magnitude spectra of clean speech and noise serves as the baseline; the target mask is estimated from the outputs of this DNN, the baseline DNN and the target mask are jointly optimized, and the enhanced speech is estimated from the noisy speech by applying the target mask. In addition, to exploit the discriminative information between clean speech and noise, a discriminative training function replaces the Mean Square Error (MSE) as the objective function of the baseline DNN, making the network output more accurate. Experimental results show that the discriminative training function improves the enhancement performance of both the baseline DNN and the overall jointly optimized network. Under both matched and mismatched noise conditions, the proposed method achieves higher average Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) scores than other common DNN-based methods; the enhanced speech retains more speech components, and the noise is suppressed more effectively.
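The abstract does not give the ARM formula, so the following is only a rough sketch of the two ingredients it names: a commonly used definition of the ideal ratio mask and a per-frame normalized cross-correlation coefficient between magnitude spectra. The function names, array shapes, random stand-in spectra, and the choice to correlate across frequency bins within each frame are all assumptions made for illustration; how the paper combines the coefficients with the ideal ratio mask is not reproduced here.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """A common IRM definition per time-frequency unit: sqrt(S^2 / (S^2 + N^2))."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

def normalized_cross_correlation(a_mag, b_mag, eps=1e-8):
    """Per-frame normalized cross-correlation across frequency bins.

    Inputs have shape (freq_bins, frames); the result has shape (frames,).
    Correlating per frame is an assumption, not taken from the paper.
    """
    num = np.sum(a_mag * b_mag, axis=0)
    den = np.sqrt(np.sum(a_mag ** 2, axis=0) * np.sum(b_mag ** 2, axis=0)) + eps
    return num / den

# Hypothetical magnitude spectra (129 frequency bins, 10 frames).
rng = np.random.default_rng(0)
clean = np.abs(rng.normal(size=(129, 10)))
noise = 0.5 * np.abs(rng.normal(size=(129, 10)))
noisy = clean + noise  # crude stand-in; real magnitudes are not exactly additive

irm = ideal_ratio_mask(clean, noise)
rho_speech = normalized_cross_correlation(clean, noisy)  # clean vs. noisy
rho_noise = normalized_cross_correlation(noise, noisy)   # noise vs. noisy

# Masking-based enhancement: scale the noisy magnitude by the mask.
enhanced = irm * noisy
```

In the proposed method the mask would be estimated from DNN outputs rather than computed from oracle spectra as above; the oracle version is shown only to make the quantities concrete.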
