Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

TIAN Sanli, LI Ta, YE Lingxuan, WU Shisong, ZHAO Qingwei, ZHANG Pengyuan. Low computational cost end-to-end speech recognition based on discrete wavelet transform and subband decoupling[J]. ACTA ACUSTICA, 2025, 50(2): 373-383. DOI: 10.12395/0371-0025.2024205

Low computational cost end-to-end speech recognition based on discrete wavelet transform and subband decoupling

  • To address the high computational cost of current end-to-end automatic speech recognition (E2E ASR), a method (WLformer) that integrates the discrete wavelet transform (DWT) with E2E ASR is proposed, which significantly reduces computing resource usage while improving recognition performance. WLformer is built upon the widely used Conformer model and introduces the proposed DWT Signal Compression Module, which compresses the model’s intermediate hidden representation by removing its high-frequency components, which carry less information. In addition, a new module structure named the DWT Subband Decoupling Feed-Forward Network (DSD-FFN) is proposed to further reduce the model’s computational cost. Experiments are conducted on the Aishell-1, HKUST, and LibriSpeech datasets. The results show that, compared with Conformer, WLformer achieves a 47.4% relative reduction in memory usage, a 39.2% relative reduction in GFLOPs, and an average 13.1% relative reduction in character/word error rate. WLformer also achieves better recognition performance while occupying fewer computing resources than other mainstream E2E ASR models, further verifying its effectiveness.
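The core compression idea in the abstract — applying a DWT to the hidden representation and discarding the high-frequency subband — can be sketched with a single-level Haar transform. This is a minimal illustration, not the paper's actual module: the function name, the use of NumPy, and the choice of the Haar wavelet are all assumptions for demonstration purposes.

```python
import numpy as np

def haar_dwt_compress(x):
    """Single-level Haar DWT along the time axis (axis 0).

    Keeps only the low-frequency (approximation) subband, halving the
    sequence length. This is a simplified stand-in for a DWT-based
    compression module that drops the high-frequency subband, which
    typically carries less information.
    """
    T = x.shape[0] - x.shape[0] % 2        # drop an odd trailing frame
    even, odd = x[0:T:2], x[1:T:2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency subband (kept)
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency subband (dropped)
    return approx, detail

# Toy hidden representation: 8 time steps, 4 feature dimensions
hidden = np.arange(32, dtype=np.float64).reshape(8, 4)
approx, detail = haar_dwt_compress(hidden)
print(approx.shape)  # (4, 4): sequence length halved, feature dim kept
```

Because the Haar transform is orthonormal, the energies of the two subbands sum to the energy of the input, so discarding `detail` removes exactly the high-frequency share of the signal while halving the cost of all subsequent layers.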
