Enhancing automated audio captioning with synthetic supervision

XIAO Feiyang; ZHU Qiaoxi; GUAN Jian; LIU Xubo; LIU Haohe; ZHANG Kejia; HE Guangjun; WANG Wenwu

doi:10.12395/0371-0025.2024232

XIAO Feiyang, ZHU Qiaoxi, GUAN Jian, LIU Xubo, LIU Haohe, ZHANG Kejia, HE Guangjun, WANG Wenwu. Enhancing automated audio captioning with synthetic supervision[J]. ACTA ACUSTICA, 2024, 49(6): 1315-1323. DOI: 10.12395/0371-0025.2024232

Abstract

Data-driven automated audio captioning methods are limited by the quantity and quality of available audio-text pairs, which leads to insufficient cross-modal representation and undermines captioning performance. To address this, this paper proposes SynthAC, an audio captioning framework enhanced with synthetic supervision. The framework takes captions from widely available, high-quality image captioning text corpora and feeds them to a text-to-audio generative model to create synthetic audio signals. SynthAC thereby expands the pool of audio-text pairs and enhances the cross-modal text-audio representation by learning the relations within the synthetic audio-text pairs. Experiments demonstrate that incorporating high-quality text corpora from image captioning in this way significantly improves audio captioning performance, providing an effective solution to the challenge of data scarcity. In addition, SynthAC can be easily applied to various state-of-the-art methods, significantly enhancing their audio captioning performance without modifying the existing model structures.
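The data expansion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `AudioTextPair`, `placeholder_text_to_audio`, `build_synthetic_pairs`, and `expand_training_set` are hypothetical, and the placeholder generator only returns deterministic pseudo-random samples where a real text-to-audio generative model would be called.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class AudioTextPair:
    """One training example: waveform samples plus the caption describing them."""
    audio: List[float]
    caption: str
    synthetic: bool  # True when the audio was generated from the caption


def placeholder_text_to_audio(caption: str, num_samples: int = 16) -> List[float]:
    """Hypothetical stand-in for a text-to-audio generative model.

    Assumption: a real system would call a trained generative model here.
    This placeholder only returns pseudo-random samples seeded by the caption,
    so the same caption always yields the same "audio".
    """
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


def build_synthetic_pairs(
    captions: Sequence[str],
    text_to_audio: Callable[[str], List[float]],
) -> List[AudioTextPair]:
    """Turn each text-only caption into a synthetic audio-text pair."""
    return [AudioTextPair(text_to_audio(c), c, synthetic=True) for c in captions]


def expand_training_set(
    real_pairs: Sequence[AudioTextPair],
    image_captions: Sequence[str],
    text_to_audio: Callable[[str], List[float]],
) -> List[AudioTextPair]:
    """Append synthetic pairs built from image captions to the real pairs."""
    return list(real_pairs) + build_synthetic_pairs(image_captions, text_to_audio)


if __name__ == "__main__":
    real = [AudioTextPair([0.0] * 16, "a dog barks in the distance", synthetic=False)]
    captions = ["a train passes a station", "waves crash on a beach"]
    for pair in expand_training_set(real, captions, placeholder_text_to_audio):
        print(pair.synthetic, pair.caption, len(pair.audio))
```

Because the sketch only changes the training data, any captioning model that consumes audio-text pairs could be trained on the expanded set unchanged, which mirrors the abstract's statement that existing model structures need no modification.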
