Adversarial training-based imbalanced speaker diarization system using short-phrase prior
Abstract: Mainstream speaker diarization systems degrade when the cumulative speaking durations of the speakers are imbalanced. To address this, this paper proposes a speaker diarization method based on adversarial learning and a short-phrase prior. At the data level, the method exploits the short-phrase prior: it performs imbalanced sampling between the long and short segments of every speaker's speech, which narrows the gap in cumulative speaking duration among speakers. At the representation extraction and clustering level, it uses an integrated training and inference scheme for speaker representation extraction and clustering. During training, the scheme increases the separability of the clusters obtained after imbalanced data sampling and keeps their distribution similar to that of balanced data; adversarial learning moves this optimization from the high-dimensional space to a low-dimensional embedding space, avoiding the optimization difficulty caused by sample sparsity in high dimensions. During inference, the method constrains the clustering results to be consistent across replicas of the same short utterance augmented under different acoustic environments. Compared with existing imbalanced clustering algorithms, the proposed method reduces the diarization error rate (DER) by 6.15% and 4.27% absolute (22.2% and 21.7% relative) on imbalanced-duration subsets of the VoxConverse and AISHELL-4 datasets, respectively. These results indicate that the proposed method effectively mitigates the performance degradation of speaker diarization systems when cumulative speaking durations are imbalanced.
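The data-level idea of narrowing the duration gap among speakers can be illustrated with a minimal sketch. The abstract does not specify the actual sampling procedure, so the code below is only an assumption: a greedy oversampling with replacement that always draws the next segment for the speaker with the least sampled speech so far. The function name `balance_by_duration` and the `(speaker, duration)` input format are hypothetical.

```python
import random


def balance_by_duration(segments, target_seconds, seed=0):
    """Greedy, with-replacement oversampling (illustrative assumption only).

    segments: list of (speaker, duration_seconds) pairs, durations > 0.
    Draws segments until every speaker has at least target_seconds of
    sampled speech, always serving the speaker who currently has the least.
    """
    rng = random.Random(seed)

    # Group the available segment durations by speaker.
    pools = {}
    for speaker, duration in segments:
        pools.setdefault(speaker, []).append(duration)

    sampled = {speaker: 0.0 for speaker in pools}  # sampled seconds per speaker
    chosen = []
    while min(sampled.values()) < target_seconds:
        # Serve the speaker with the smallest sampled duration so far.
        speaker = min(sampled, key=sampled.get)
        duration = rng.choice(pools[speaker])
        sampled[speaker] += duration
        chosen.append((speaker, duration))
    return chosen, sampled


# Speaker A dominates the recording; speaker B has only two short segments.
segments = [("A", 30.0)] * 6 + [("A", 2.0)] * 2 + [("B", 2.0), ("B", 3.0)]
chosen, totals = balance_by_duration(segments, target_seconds=60.0)
```

Because each draw goes to the speaker who is currently furthest behind, the sampled totals can never differ by more than the longest single segment, whatever the original imbalance was. The clustering-aware training objective and the adversarial mapping to a low-dimensional space described in the abstract are not modeled here.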