Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

XIAO Feiyang, ZHU Qiaoxi, GUAN Jian, LIU Xubo, LIU Haohe, ZHANG Kejia, HE Guangjun, WANG Wenwu. Enhancing automated audio captioning with synthetic supervision[J]. ACTA ACUSTICA, 2024, 49(6): 1315-1323. DOI: 10.12395/0371-0025.2024232

Enhancing automated audio captioning with synthetic supervision

  • Data-driven automated audio captioning methods are limited by the quantity and quality of available audio-text pairs, which leads to insufficient cross-modal representation and degrades captioning performance. To address this, this paper proposes SynthAC, an audio captioning framework enhanced with synthetic supervision. The framework takes text from widely available, high-quality image captioning corpora and uses a text-to-audio generative model to synthesize the corresponding audio signals. SynthAC thereby expands the pool of audio-text pairs and strengthens the cross-modal text-audio representation by learning the relations within the synthetic pairs. Experiments demonstrate that SynthAC significantly improves audio captioning performance by incorporating high-quality text corpora from image captioning, providing an effective solution to data scarcity. Additionally, SynthAC can be easily adapted to various state-of-the-art methods, significantly improving their captioning performance without modifying the existing model structures.