Consistency self-supervised learning method for robust automatic speech recognition
Abstract: A robust automatic speech recognition (ASR) method based on consistency self-supervised learning (CSSL) is proposed. The method uses speech signal simulation to generate copies of an utterance in different acoustic environments. While learning speech representations in a self-supervised manner, it maximizes the similarity between the representations of the same utterance across those environments. This yields representations that are invariant to environmental interference and improves the performance of the downstream ASR model. The method is evaluated on the far-field dataset CHiME-4 and the meeting dataset AMI. With CSSL and an appropriate pre-training pipeline, it achieves a relative word error rate reduction of more than 30% compared with the wav2vec2.0 self-supervised learning baseline. These results indicate that CSSL is an effective way to obtain noise-invariant speech representations and improve robust ASR performance.
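The consistency idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the additive white-noise simulation, the framing "encoder", the cosine-based loss, and the names `add_noise`, `toy_encoder`, `consistency_loss`, `total_loss`, and `weight` are all assumptions made for illustration. In practice the encoder would be a wav2vec2.0-style network and the simulation would cover reverberation and real noise.

```python
import math
import random


def add_noise(waveform, snr_db, rng):
    """Hypothetical stand-in for acoustic simulation: add white Gaussian noise at a target SNR."""
    signal_power = sum(x * x for x in waveform) / len(waveform)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in waveform]


def toy_encoder(waveform, frame_len=4):
    """Hypothetical encoder: cut the waveform into non-overlapping frames used as 'representations'."""
    n_frames = len(waveform) // frame_len
    return [waveform[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def consistency_loss(frames_a, frames_b):
    """Mean (1 - cosine similarity) over time-aligned frames; minimizing it maximizes similarity."""
    if len(frames_a) != len(frames_b):
        raise ValueError("both views must have the same number of frames")
    sims = [cosine_similarity(a, b) for a, b in zip(frames_a, frames_b)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(ssl_loss, frames_a, frames_b, weight=1.0):
    """Assumed combination: self-supervised loss plus a weighted consistency term."""
    return ssl_loss + weight * consistency_loss(frames_a, frames_b)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(0.3 * t) for t in range(64)]   # toy "utterance"
    noisy = add_noise(clean, snr_db=10.0, rng=rng)   # simulated copy of the same utterance
    loss = total_loss(ssl_loss=1.0,
                      frames_a=toy_encoder(clean),
                      frames_b=toy_encoder(noisy))
    print(f"combined loss: {loss:.4f}")
```

The two views are time-aligned because they come from the same utterance, which is what allows a frame-by-frame similarity term. How the consistency term is weighted against the self-supervised objective is a design choice this sketch leaves as a free parameter.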