Abstract:
An end-to-end time-domain music source separation method based on dual-dimension sequential attention is proposed. It combines the structural characteristics of instrument sound sources with the content of the song pattern to address the insufficient specificity with which instrument sound sources in music are characterized. First, because different instrument sound sources occur with significant regularity in different parts of the song pattern, the characteristic basis functions are weighted with different attention along two dimensions: time and feature channels. Second, a multi-resolution frequency term is introduced into the loss function so that the difference between the separated sound sources and the ideal ones is measured in the time and frequency domains simultaneously. Experimental results on the MUSDB18 dataset show that the separation of instrumental sound sources improves when special attention is given to both the time-domain song pattern structure features and the discrete harmonic features of the sound sources. Compared with Demucs, the most advanced end-to-end time-domain music source separation method, the proposed method improves the signal-to-noise ratio by 0.40 dB. Its performance is particularly strong on the drum and bass sources, whose signal-to-noise ratios improve by 0.13 dB and 0.60 dB, respectively. Making full use of multi-dimensional a priori knowledge, such as the semantic content and acoustic features of the sound sources, improves the specificity with which the sources are characterized and thus their separability.
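
The idea of a loss that compares separated and ideal sources in both domains at once can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption for illustration only: an L1 waveform term plus an L1 distance between magnitude spectrograms, averaged over several FFT sizes. The function names, FFT sizes, hop length, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram of a 1-D signal using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def time_frequency_loss(estimate, target, fft_sizes=(512, 1024, 2048), weight=1.0):
    """Assumed form: time-domain L1 plus a multi-resolution spectral L1 term."""
    # Time-domain term: mean absolute error between the waveforms.
    time_term = np.mean(np.abs(estimate - target))

    # Frequency-domain term: compare magnitudes at several resolutions.
    freq_terms = []
    for n_fft in fft_sizes:
        hop = n_fft // 4  # assumed 75% overlap
        est_mag = stft_magnitude(estimate, n_fft, hop)
        tgt_mag = stft_magnitude(target, n_fft, hop)
        freq_terms.append(np.mean(np.abs(est_mag - tgt_mag)))

    return time_term + weight * np.mean(freq_terms)


# Toy usage: a 220 Hz tone as the "ideal" source and a noisy estimate of it.
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 220 * np.arange(8192) / 44100)
estimate = target + 0.01 * rng.standard_normal(8192)
print(time_frequency_loss(estimate, target))
```

Short FFT windows resolve transients well and long windows resolve closely spaced harmonics, so averaging over several sizes penalizes errors that a single resolution could miss.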