Enhancing automated audio captioning with synthetic supervision
Abstract
Data-driven automated audio captioning methods are limited by the quantity and quality of available audio-text pairs, which leads to insufficient cross-modal representations and degrades captioning performance. To address this limitation, this paper proposes SynthAC, an audio captioning framework enhanced with synthetic supervision. SynthAC leverages widely available, high-quality text corpora from image captioning together with a text-to-audio generative model to create synthetic audio signals. The framework thereby effectively expands the pool of audio-text pairs and strengthens cross-modal text-audio representations by learning relations within the synthetic pairs. Experiments demonstrate that SynthAC significantly improves audio captioning performance by incorporating high-quality text from image captioning, offering an effective remedy for data scarcity. Moreover, SynthAC can be readily applied to various state-of-the-art methods, yielding notable gains without modifying existing model structures.
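To make the synthetic-supervision idea concrete, the sketch below outlines the data-expansion step described above: captions from an image-captioning corpus are fed to a text-to-audio generative model, and each generated clip is paired with the caption that produced it. The file path, function names, and the placeholder synthesizer are illustrative assumptions, not the paper's actual implementation; a real pipeline would call a trained text-to-audio model where the placeholder stands.

```python
# Minimal sketch of the synthetic-supervision pipeline (illustrative only).
from typing import List, Tuple

def load_image_captions(path: str) -> List[str]:
    """Load captions from an image-captioning corpus.
    One caption per line is assumed here for simplicity."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def synthesize_audio(caption: str,
                     sample_rate: int = 16000,
                     duration_s: float = 10.0) -> List[float]:
    """Placeholder for a text-to-audio generative model call.
    Returns silence of the target length so the sketch runs end to end;
    substitute a real TTA model (an assumption of this sketch)."""
    return [0.0] * int(sample_rate * duration_s)

def build_synthetic_pairs(captions: List[str]) -> List[Tuple[List[float], str]]:
    # Pair each synthetic clip with its source caption, yielding
    # additional audio-text pairs for captioning training.
    return [(synthesize_audio(c), c) for c in captions]

if __name__ == "__main__":
    captions = load_image_captions("image_captions.txt")  # hypothetical path
    synthetic_pairs = build_synthetic_pairs(captions)
    # synthetic_pairs can be mixed with real audio-text data when training
    # an audio captioning model, without changing the model architecture.
```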