Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

A non-intrusive speech quality evaluation algorithm combining auxiliary target learning and a convolutional recurrent network

Abstract: Objective speech quality evaluation can replace expensive human listening tests, but most objective metrics require a clean reference signal, which is hard to obtain in many practical acoustic systems. To address this, a non-intrusive speech quality evaluation algorithm is proposed that combines auxiliary target learning with a convolutional recurrent network (CRN). To reduce complexity, the algorithm uses Bark frequency cepstral coefficients (BFCCs), computed with filters that mimic human auditory characteristics, as the CRN input. First, a convolutional neural network (CNN) extracts frame-level features from the BFCCs. Then, a bidirectional long short-term memory (BiLSTM) network models long-term temporal dependencies and sequence characteristics in the frame-level features. Finally, a self-attention mechanism adaptively selects useful information from the frame-level features and aggregates it into utterance-level features, which are mapped to an objective score. To improve the effectiveness of the quality assessment, a multi-task training strategy is adopted, with voice activity detection (VAD) introduced as an auxiliary learning target. Experiments on open-source databases show that the proposed algorithm correlates better with the mean opinion score (MOS) than other non-intrusive algorithms. Moreover, the algorithm has a small parameter count and generalizes well to the ITU-T P.808 distorted-speech database with subjective MOS labels, approaching the accuracy of the Perceptual Evaluation of Speech Quality (PESQ) metric.
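The last two stages described in the abstract (attention-based aggregation of frame-level features into an utterance-level feature, and multi-task training with VAD as the auxiliary target) can be illustrated with a minimal sketch. Everything specific here is an assumption, not the authors' implementation: a single scoring vector `w` stands in for the full self-attention mechanism, the loss is taken to be squared error on MOS plus a weighted frame-level binary cross-entropy on VAD, and the weight `alpha` and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Aggregate T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors, each a list of D floats
            (stand-ins for the CNN + BiLSTM outputs).
    w:      hypothetical learned scoring vector of length D.
    Returns (pooled_vector, attention_weights).
    """
    # One scalar relevance score per frame (dot product with w).
    scores = [sum(f_d * w_d for f_d, w_d in zip(f, w)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted average of the frames, dimension by dimension.
    pooled = [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def multitask_loss(mos_pred, mos_true, vad_probs, vad_labels, alpha=0.5):
    """Assumed joint loss: squared MOS error + alpha * mean VAD cross-entropy."""
    mse = (mos_pred - mos_true) ** 2
    eps = 1e-7  # clamp probabilities so log() stays finite
    bce = 0.0
    for p, y in zip(vad_probs, vad_labels):
        p = min(max(p, eps), 1.0 - eps)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(vad_probs)
    return mse + alpha * bce


# Usage with made-up numbers: three frames of two-dimensional features.
frames = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.0]]
utt_vec, attn = attention_pool(frames, w=[1.0, 0.5])
loss = multitask_loss(3.1, 3.5, [0.2, 0.9, 0.1], [0, 1, 0])
```

In this sketch the VAD term only shapes training; at inference time only the utterance-level vector would be mapped to a quality score, so no reference signal is needed.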

     
