Abstract:
Emotional speech is a non-stationary signal in both time and frequency, and local features extracted from each frame have been shown to contribute substantially to speech emotion recognition. However, local features alone are inadequate for building a robust speech emotion classification system, because features extracted from frame-divided speech cannot accurately reflect the dynamic characteristics of the emotional speech signal. In this paper, utterance-level global features, extracted without frame division using multi-scale optimal wavelet packet decomposition, are combined with 384-dimensional utterance-level local features to improve the robustness and recognition rate of the classification system. Given few training samples, the dimensionality of the feature vectors is reduced by Fisher discriminant analysis, and a fusion strategy based on metric learning, called weak metric learning in this work, is adopted to fuse the global and local utterance-level features. Experimental results with LIBSVM show that our method achieves significant improvements of 4.2% to 13.8% over using local utterance-level features alone, and that the speech emotion recognition rate fluctuates less, especially in the case of small sample sizes.
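The pipeline outlined above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the feature values are synthetic placeholders standing in for the wavelet-packet global features and 384-dimensional local features, Fisher discriminant analysis is approximated by scikit-learn's `LinearDiscriminantAnalysis`, and the metric-learning fusion step is replaced by simple feature concatenation; `SVC` is used because it wraps LIBSVM.

```python
# Hedged sketch of the classification pipeline: fuse utterance-level
# global and local features, reduce dimensionality with a Fisher
# discriminant (LDA), and classify with an SVM (sklearn's SVC wraps
# LIBSVM). All feature values below are random placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_utterances, n_local, n_global = 120, 384, 16  # sizes are assumptions
X_local = rng.normal(size=(n_utterances, n_local))    # per-utterance local features
X_global = rng.normal(size=(n_utterances, n_global))  # wavelet-packet global features
y = rng.integers(0, 4, size=n_utterances)             # 4 emotion classes (assumed)

# Simple early fusion by concatenation (the paper's weak metric
# learning would weight the two feature groups instead).
X = np.hstack([X_global, X_local])

# LDA projects to at most (n_classes - 1) dimensions before the SVM.
clf = make_pipeline(LinearDiscriminantAnalysis(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracies
print(scores.shape)
```

With real emotional speech corpora, `X_global` and `X_local` would come from the wavelet packet decomposition and the frame-based feature extractor, respectively, and the fusion step would carry the learned metric.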