Single-channel speech enhancement based on time-frequency mask estimation with a deep encoder-decoder neural network
Abstract: A speech enhancement method that combines a deep encoder-decoder neural network with time-frequency mask estimation is proposed. The network estimates a time-frequency mask, and the mask is combined with the magnitude spectrum of the noisy speech so that the network learns the nonlinear mapping between the magnitude spectra of noisy and clean speech. The deep encoder-decoder network uses a convolutional-deconvolutional structure. At the encoder, the local perception property of convolutional layers is used to model the time-frequency structure of the noisy speech, extracting speech features while suppressing background noise. At the decoder, the features extracted by the encoder are used to recover local details layer by layer and to reconstruct the speech signal. In addition, skip connections are introduced between corresponding encoder and decoder layers to reduce the loss of low-level detail caused by pooling and fully connected operations. Simulation experiments are carried out with speech from the TIMIT database and noise from the NOISEX-92 database under not fully matched noise conditions. The results show that the proposed method effectively suppresses noise and recovers the detailed components of speech well.
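The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sigmoid that bounds the mask, the mean-squared-error objective on magnitudes, the additive toy mixture, and all function names (`enhance_magnitude`, `magnitude_mse`) are assumptions made for illustration. The raw network output is stood in for by an array of logits, so no convolutional layers or skip connections are modelled here.

```python
import numpy as np


def sigmoid(x):
    """Squash raw outputs into (0, 1) so they can act as a mask."""
    return 1.0 / (1.0 + np.exp(-x))


def enhance_magnitude(mask_logits, noisy_mag):
    """Apply an estimated time-frequency mask to the noisy magnitude spectrum.

    mask_logits: stand-in for the raw network output (frames x frequency bins).
    noisy_mag:   magnitude spectrum of the noisy speech, same shape.
    """
    mask = sigmoid(mask_logits)
    return mask * noisy_mag


def magnitude_mse(mask_logits, noisy_mag, clean_mag):
    """Assumed training objective: MSE between masked noisy and clean magnitudes."""
    estimate = enhance_magnitude(mask_logits, noisy_mag)
    return float(np.mean((estimate - clean_mag) ** 2))


# Toy spectra (5 frames x 8 bins); magnitudes are simply added, a crude assumption.
rng = np.random.default_rng(0)
clean_mag = rng.uniform(0.1, 1.0, size=(5, 8))
noise_mag = rng.uniform(0.1, 1.0, size=(5, 8))
noisy_mag = clean_mag + noise_mag

# Oracle logits whose sigmoid equals the ratio clean / noisy.
ratio = clean_mag / noisy_mag
oracle_logits = np.log(ratio / (1.0 - ratio))

# Untrained stand-in: zero logits give a flat mask of 0.5 everywhere.
flat_logits = np.zeros_like(noisy_mag)

loss_oracle = magnitude_mse(oracle_logits, noisy_mag, clean_mag)
loss_flat = magnitude_mse(flat_logits, noisy_mag, clean_mag)
print(loss_oracle, loss_flat)
```

Because the loss is computed on the masked magnitude and not on the mask itself, the mask is learned jointly with the noisy spectrum, which is the coupling the abstract describes. In this toy setting the oracle mask drives the loss to numerical zero, while the flat mask leaves a clearly positive error.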