Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

Semantic class induction and its application for voice search system

  • Abstract: This study addresses the scarcity of in-domain corpora when a Chinese voice search system is extended to a new domain. Starting from a small existing corpus, semantic classes are first induced using a novel similarity measure based on co-occurrence probabilities. Clustering with this measure outperforms clustering with the widely used distance measure based on Kullback-Leibler divergence in precision, recall, and F1. The induced semantic classes and the sentence structures found in the small corpus are then used to build sentence templates; domain information is added to the templates, and a large in-domain corpus is generated from them. Finally, the generated corpus is used for language model adaptation, which raises character recognition accuracy from 85.2% to 91%. The results show that the corpus shortage for a new voice search domain can be effectively addressed by semantic class induction, template generation, and in-domain data generation.
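The template-based generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<CLASS>` slot syntax, the function name `expand_template`, and the example classes and template are all hypothetical, chosen only to show how a template with semantic-class slots could be expanded into many in-domain sentences.

```python
from itertools import product


def is_slot(token):
    """A token written as <NAME> marks a semantic-class slot (hypothetical syntax)."""
    return token.startswith("<") and token.endswith(">")


def expand_template(template, classes):
    """Fill every slot in the template with each member of its semantic class.

    template: list of tokens, some of which are slots such as "<CITY>"
    classes:  dict mapping a class name to its list of member words
    """
    slot_names = [tok[1:-1] for tok in template if is_slot(tok)]
    sentences = []
    # Enumerate every combination of class members, one member per slot.
    for combo in product(*(classes[name] for name in slot_names)):
        fillers = iter(combo)
        words = [next(fillers) if is_slot(tok) else tok for tok in template]
        sentences.append(" ".join(words))
    return sentences


# Hypothetical semantic classes; in the paper these would come from class
# induction on the small corpus plus added domain information.
classes = {"CITY": ["Beijing", "Shanghai"], "POI": ["hotels", "restaurants"]}
template = ["find", "<POI>", "in", "<CITY>"]

corpus = expand_template(template, classes)
for sentence in corpus:
    print(sentence)
```

A corpus produced this way would then serve as adaptation data for the language model; how the templates are actually derived and weighted is not specified in the abstract.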

     
