Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

TIAN Sanli, LI Ta, YE Lingxuan, WU Shisong, ZHAO Qingwei, ZHANG Pengyuan. Low computational cost end-to-end speech recognition based on discrete wavelet transform and subband decoupling[J]. ACTA ACUSTICA, 2025, 50(2): 373-383. DOI: 10.12395/0371-0025.2024205

Low computational cost end-to-end speech recognition based on discrete wavelet transform and subband decoupling

  • To address the high computational cost of current end-to-end automatic speech recognition (E2E ASR), a method (WLformer) that integrates the discrete wavelet transform (DWT) with E2E ASR is proposed, which significantly reduces computing resource usage while improving recognition performance. WLformer is built upon the widely used Conformer model and introduces the proposed DWT Signal Compression Module, which compresses the model’s intermediate hidden representation by removing its high-frequency components, which carry less information. In addition, a new module structure named the DWT Subband Decoupling Feed-Forward Network (DSD-FFN) is proposed to further reduce the model’s computational cost. Experiments are conducted on the Aishell-1, HKUST, and LibriSpeech datasets. The results show that, compared with Conformer, WLformer achieves a 47.4% relative reduction in memory usage, a 39.2% relative reduction in GFLOPs, and an average 13.1% relative reduction in character/word error rate. WLformer also achieves better recognition performance while occupying fewer computing resources than other mainstream E2E ASR models, further verifying its effectiveness.
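The core compression idea in the abstract — applying a DWT to the hidden representation and discarding the high-frequency subband — can be sketched with a single-level Haar transform. This is a minimal illustration, not the paper's actual module: the function name, the use of NumPy, and the choice of the Haar wavelet are all assumptions for demonstration purposes.

```python
import numpy as np

def haar_dwt_compress(x):
    """Single-level Haar DWT along the time axis (axis 0).

    Keeps only the low-frequency (approximation) subband, halving the
    sequence length. This is a simplified stand-in for a DWT-based
    compression module that drops the high-frequency subband, which
    typically carries less information.
    """
    T = x.shape[0] - x.shape[0] % 2        # drop an odd trailing frame
    even, odd = x[0:T:2], x[1:T:2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency subband (kept)
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency subband (dropped)
    return approx, detail

# Toy hidden representation: 8 time steps, 4 feature dimensions
hidden = np.arange(32, dtype=np.float64).reshape(8, 4)
approx, detail = haar_dwt_compress(hidden)
print(approx.shape)  # (4, 4): sequence length halved, feature dim kept
```

Because the Haar transform is orthonormal, the energies of the two subbands sum to the energy of the input, so discarding `detail` removes exactly the high-frequency share of the signal while halving the cost of all subsequent layers.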
