Semantic class induction and its application to a voice search system
Abstract
A method was studied to address the shortage of in-domain corpus data for a Chinese voice search system. First, semantic classes were induced from the existing corpus using a novel similarity measure based on co-occurrence probabilities. Clustering with the new similarity measure outperformed clustering with the widely used distance measure based on Kullback-Leibler divergence in precision, recall and F1. Then a corpus was generated from the induced semantic classes and structures. Finally, the generated corpus was used for language model adaptation, which improved the character recognition rate from 85.2% to 91%. The experimental results show that the shortage of corpus data for a new voice search system can be addressed through semantic class induction, template generation and then in-domain data generation.
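As a rough illustration of two steps named in the abstract, the sketch below computes the Kullback-Leibler baseline distance between the context distributions of two words and expands a template whose slots are semantic classes. It does not implement the novel co-occurrence measure, which the abstract does not define. The one-token context window, the additive smoothing, the function names, the `<NAME>` slot label and the English toy tokens are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product


def context_distribution(word, sentences):
    """Relative frequencies of the tokens directly left and right of `word`.

    Assumption: a context window of one token on each side.
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            if i > 0:
                counts[("L", tokens[i - 1])] += 1
            if i + 1 < len(tokens):
                counts[("R", tokens[i + 1])] += 1
    total = sum(counts.values())
    return {ctx: n / total for ctx, n in counts.items()} if total else {}


def symmetric_kl(p, q, eps=1e-6):
    """Symmetrised KL divergence, KL(p||q) + KL(q||p).

    Assumption: additive smoothing with `eps` so that contexts seen
    for only one of the two words do not produce log(0).
    """
    support = set(p) | set(q)

    def smooth(d):
        z = sum(d.get(c, 0.0) for c in support) + eps * len(support)
        return {c: (d.get(c, 0.0) + eps) / z for c in support}

    ps, qs = smooth(p), smooth(q)
    return sum((ps[c] - qs[c]) * math.log(ps[c] / qs[c]) for c in support)


def expand_template(template, classes):
    """Generate sentences by filling each class slot with every member.

    Tokens that are not keys of `classes` are kept as literal words.
    """
    slots = [classes.get(tok, [tok]) for tok in template]
    return [list(combo) for combo in product(*slots)]


# Hypothetical toy corpus; a real system would use segmented Chinese queries.
corpus = [
    ["call", "alice", "now"],
    ["call", "bob", "now"],
    ["play", "jazz", "music"],
]
d_same = symmetric_kl(context_distribution("alice", corpus),
                      context_distribution("bob", corpus))
d_diff = symmetric_kl(context_distribution("alice", corpus),
                      context_distribution("jazz", corpus))
generated = expand_template(["call", "<NAME>", "now"],
                            {"<NAME>": ["alice", "bob"]})
```

In this toy corpus "alice" and "bob" share identical contexts, so their distance is smaller than the distance between "alice" and "jazz"; a clustering step would group the first pair into one class, and the template then yields one generated sentence per class member.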