Low-Computational-Cost End-to-End Speech Recognition Based on Discrete Wavelet Transform and High/Low-Frequency Subband Decoupling
Abstract: To address the high computational cost of current end-to-end automatic speech recognition (E2E ASR) models, this paper proposes WLformer, a method that integrates the discrete wavelet transform (DWT) into E2E ASR, substantially reducing computing resource usage while also improving recognition performance. WLformer is built on the Conformer model widely used in E2E ASR. It introduces the proposed DWT-based Signal Compression Module, which compresses the model's intermediate hidden representations by removing their low-information high-frequency components, thereby lowering the model's resource usage. In addition, a new sub-module structure named the DWT Subband Decoupling Feed-Forward Network (DSD-FFN) is proposed to replace part of the original feed-forward networks and further reduce the model's computation. Experiments on three widely used Chinese and English datasets, Aishell-1, HKUST, and LibriSpeech, show that WLformer reduces GPU memory usage by a relative 47.4% and Gflops by a relative 39.2% compared with Conformer, while also achieving an average 13.1% relative character/word error rate reduction. Moreover, WLformer achieves better recognition performance than other mainstream E2E ASR models while consuming fewer computing resources, further verifying the effectiveness of the proposed method.
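The core compression idea can be sketched in a few lines. The following is a minimal illustration, assuming a single-level Haar DWT applied along the time axis of a hidden representation; the paper's actual wavelet filter and module design are not specified in the abstract, and the function names `haar_dwt` and `dwt_compress` are illustrative only:

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT along the time axis (axis 0).

    x: array of shape (T, D) with T even, e.g. an encoder's
    intermediate hidden representation (T frames, D features).
    Returns (low, high), each of shape (T // 2, D).
    """
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # approximation (low-frequency) subband
    high = (even - odd) / np.sqrt(2.0)  # detail (high-frequency) subband
    return low, high

def dwt_compress(x):
    """Keep only the low-frequency subband, halving the sequence length.

    This mirrors the abstract's description: the high-frequency
    components carry less information and are discarded.
    """
    low, _ = haar_dwt(x)
    return low

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))  # 8 frames, 4-dim features
y = dwt_compress(x)
print(y.shape)  # sequence length halved: (4, 4)
```

Because the Haar transform is orthogonal, the two subbands together reconstruct the input exactly; dropping the high-frequency half trades a small amount of detail for a 2x shorter sequence, which is what reduces downstream attention and feed-forward cost.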