Enhancing automated audio captioning with synthetic supervision
-
Abstract: Data-driven automated audio captioning methods are limited by the quantity and quality of available audio-text pairs, resulting in insufficient cross-modal representations that undermine captioning performance. To address this, this paper proposes SynthAC, an audio captioning framework enhanced with synthetic supervision. SynthAC leverages widely available, high-quality text corpora from image captioning together with a text-to-audio generative model to synthesize audio signals, thereby effectively expanding the pool of audio-text pairs; by learning the correspondences within these synthetic pairs, it strengthens cross-modal text-audio representations. Experiments demonstrate that SynthAC significantly improves audio captioning performance by incorporating high-quality text corpora from image captioning, providing an effective solution to the challenge of audio-text data scarcity. Additionally, SynthAC can be easily adapted to various state-of-the-art methods, significantly enhancing audio captioning performance without modifying the existing model structures.
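The data-expansion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `text_to_audio` is a hypothetical stand-in for a real text-to-audio generative model (the abstract does not name one), and it emits a placeholder waveform so the sketch runs end to end.

```python
import random

def text_to_audio(caption, sr=16000, duration=1.0):
    # Hypothetical stand-in for a text-to-audio generative model.
    # A real system would synthesize audio matching the caption;
    # here we emit a deterministic placeholder waveform of the
    # requested length so the pipeline is runnable.
    rng = random.Random(caption)
    return [rng.uniform(-1.0, 1.0) for _ in range(int(sr * duration))]

def build_synthetic_pairs(image_captions):
    # Each high-quality image-caption sentence becomes one
    # synthetic (audio, text) supervision pair, which can then be
    # mixed into the audio captioning model's training data.
    return [(text_to_audio(caption), caption) for caption in image_captions]

captions = [
    "a dog barks near a passing car",
    "rain falls steadily on a tin roof",
]
pairs = build_synthetic_pairs(captions)
```

Downstream, the captioning model would be trained on the union of real and synthetic pairs, with no change to its architecture.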