Abstract:
Recent supervised speech enhancement methods neglect how the similarity between the amplitude spectra of clean speech, noise, and noisy speech affects enhancement performance. To address this, a method combining Accurate Ratio Masking (ARM) and a Deep Neural Network (DNN) is proposed for monaural speech enhancement. First, an accurate ratio mask is designed in the time-frequency domain on the basis of the ideal ratio mask; it uses the normalized cross-correlation coefficients between the amplitude spectra of clean speech and noisy speech, and between those of noise and noisy speech. Then, the target mask is estimated from the output of a baseline DNN that takes the amplitude spectra of clean speech and noise as training targets, and the target mask is further used to optimize the baseline DNN and obtain the enhanced speech from the noisy speech. Moreover, to exploit the discriminative information between clean speech and noise, a discriminative training function replaces the Mean Square Error (MSE) as the objective function of the baseline DNN, making the network output more accurate. Experimental results show that the discriminative training function improves the enhancement performance of both the baseline DNN and the overall jointly optimized network. Under matched and mismatched noise conditions, compared with other common DNN methods, the proposed method achieves higher average Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) scores, and the enhanced speech retains more speech components while suppressing noise more markedly.
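The masking idea can be illustrated with a minimal sketch. The first two functions below are standard definitions: the ideal ratio mask and a per-frame normalized cross-correlation coefficient between two amplitude spectra. The third function, `correlation_weighted_mask`, is a hypothetical combination written only for illustration; the abstract does not give the exact ARM formula, so the weighting shown here is an assumption and should not be read as the proposed method. All function and variable names are likewise illustrative.

```python
import numpy as np


def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Standard ideal ratio mask: sqrt(S^2 / (S^2 + N^2)) per time-frequency unit."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))


def normalized_cross_correlation(a, b, eps=1e-8):
    """Per-frame normalized cross-correlation of two amplitude spectra.

    a, b: arrays of shape (frames, freq_bins); returns an array of shape (frames,).
    """
    num = np.sum(a * b, axis=1)
    den = np.sqrt(np.sum(a ** 2, axis=1) * np.sum(b ** 2, axis=1)) + eps
    return num / den


def correlation_weighted_mask(speech_mag, noise_mag, noisy_mag, eps=1e-8):
    """ASSUMPTION: one possible way to fold the correlation coefficients into
    an IRM-style mask. This is not the ARM definition from the paper."""
    # Correlation of each source with the noisy mixture, broadcast over frequency.
    rho_s = normalized_cross_correlation(speech_mag, noisy_mag)[:, None]
    rho_n = normalized_cross_correlation(noise_mag, noisy_mag)[:, None]
    s2 = (rho_s * speech_mag) ** 2
    n2 = (rho_n * noise_mag) ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))
```

In use, a mask of this kind would be multiplied element-wise with the noisy amplitude spectrum, and the result combined with the noisy phase to resynthesize the enhanced waveform.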