Open Access · Posted Content

Music Source Separation in the Waveform Domain

TLDR
Demucs, a new waveform-to-waveform model whose architecture is closer to models for audio generation, with more capacity in the decoder, is proposed; human evaluations show that Demucs has significantly higher audio quality than Conv-Tasnet but slightly more contamination from other sources, which explains the difference in SDR.
Abstract
Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrary to many audio synthesis tasks where the best performances are achieved by models that directly generate the waveform, the state of the art in source separation for music is to compute masks on the magnitude spectrum. In this paper, we compare two waveform-domain architectures. We first adapt Conv-Tasnet, initially developed for speech source separation, to the task of music source separation. While Conv-Tasnet beats many existing spectrogram-domain methods, it suffers from significant artifacts, as shown by human evaluations. We instead propose Demucs, a novel waveform-to-waveform model with a U-Net structure and a bidirectional LSTM. Experiments on the MusDB dataset show that, with proper data augmentation, Demucs beats all existing state-of-the-art architectures, including Conv-Tasnet, with 6.3 SDR on average (and up to 6.8 with 150 extra training songs, even surpassing the IRM oracle for the bass source). Using recent developments in model quantization, Demucs can be compressed down to 120MB without any loss of accuracy. We also provide human evaluations showing that Demucs benefits from a large advantage in terms of the naturalness of the audio. However, it suffers from some bleeding, especially between the vocals and other sources.
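To make the architecture description concrete, here is a minimal PyTorch sketch of a Demucs-like separator: a strided 1D convolutional encoder, a bidirectional LSTM bottleneck, and a transposed-convolution decoder tied together with U-Net skip connections. This is an illustration of the idea, not the authors' implementation; the mono input and all layer sizes are simplifying assumptions (the real model is larger and operates on stereo).

```python
import torch
import torch.nn as nn

class DemucsLike(nn.Module):
    """Illustrative waveform-to-waveform separator in the Demucs style."""

    def __init__(self, sources=4, channels=48, depth=4):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        in_ch = 1  # mono mixture for simplicity
        for i in range(depth):
            out_ch = channels * 2 ** i
            # Each encoder layer reduces the temporal resolution by 4.
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=8, stride=4, padding=2),
                nn.ReLU()))
            # The decoder mirrors the encoder; the last layer emits one
            # waveform per source and has no activation.
            self.decoder.insert(0, nn.Sequential(
                nn.ConvTranspose1d(out_ch, sources if i == 0 else in_ch,
                                   kernel_size=8, stride=4, padding=2),
                nn.ReLU() if i > 0 else nn.Identity()))
            in_ch = out_ch
        # Bidirectional LSTM bottleneck over the downsampled time axis.
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * in_ch, in_ch)

    def forward(self, mix):
        # mix: (batch, 1, time); time must be divisible by 4 ** depth
        # so the skip connections line up exactly.
        skips, x = [], mix
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        y, _ = self.lstm(x.transpose(1, 2))
        x = self.proj(y).transpose(1, 2)
        for dec in self.decoder:
            x = dec(x + skips.pop())  # U-Net skip connection
        return x  # (batch, sources, time): one estimated waveform per stem

# Usage: stems = DemucsLike()(torch.randn(1, 1, 4 ** 4 * 100))
```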



Citations
Proceedings ArticleDOI

Real Time Speech Enhancement in the Waveform Domain.

TL;DR: Empirical evidence shows that the proposed causal speech enhancement model, based on an encoder-decoder architecture with skip connections, is capable of removing various kinds of background noise, including stationary and non-stationary noise, as well as room reverb.
Journal ArticleDOI

Spleeter: a fast and efficient music source separation tool with pre-trained models

TL;DR: The performance of the pre-trained models is very close to the published state of the art, and Spleeter is one of the best-performing publicly released 4-stem separation models on the common musdb18 benchmark.
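As a usage illustration, Spleeter exposes its pre-trained models through a small documented Python API; the file names below are placeholders.

```python
from spleeter.separator import Separator

# Load the pre-trained 4-stem model (vocals, drums, bass, other).
separator = Separator('spleeter:4stems')
# Write one audio file per estimated stem into the output directory.
separator.separate_to_file('song.mp3', 'output/')
```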
Posted Content

Voice Separation with an Unknown Number of Multiple Speakers

TL;DR: A new method is presented for separating a mixed audio sequence in which multiple voices speak simultaneously; it greatly outperforms the current state of the art, which is shown not to be competitive for more than two speakers.
Proceedings ArticleDOI

Sudo rm -rf: Efficient Networks for Universal Audio Source Separation

TL;DR: The backbone of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF), together with their aggregation, which is performed through simple one-dimensional convolutions.
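A rough sketch of that backbone idea, under simplifying assumptions (depthwise convolutions, nearest-neighbor resampling, illustrative sizes); this is not the authors' code, only the downsample/resample/aggregate pattern the summary describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UConvBlock(nn.Module):
    """Successive downsampling, resampling back, and 1D-conv aggregation."""

    def __init__(self, channels=128, scales=4):
        super().__init__()
        # Each depthwise conv halves the temporal resolution.
        self.down = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=5, stride=2,
                      padding=2, groups=channels)
            for _ in range(scales))
        self.agg = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        feats, y = [], x
        for down in self.down:          # successive downsampling
            y = down(y)
            feats.append(y)
        out = torch.zeros_like(x)
        for f in feats:                 # resample every scale back and sum
            out = out + F.interpolate(f, size=x.shape[-1], mode='nearest')
        return x + self.agg(out)        # aggregate with a simple 1x1 conv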
Journal ArticleDOI

High Fidelity Neural Audio Compression

TL;DR: In this article, the authors propose a loss balancer mechanism to stabilize training, where the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss.
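A simplified sketch of the balancer idea: rescale each loss's gradient with respect to the model output so that its contribution matches its weight, independently of the loss's natural scale. The published balancer additionally tracks a moving average of gradient norms, which this sketch omits.

```python
import torch

def balanced_backward(output, losses, weights):
    """Simplified loss balancer: output is the model output tensor;
    losses and weights are same-length lists of scalar losses and floats."""
    total_weight = sum(weights)
    grad = torch.zeros_like(output)
    for loss, w in zip(losses, weights):
        # Gradient of this loss w.r.t. the model output only.
        g, = torch.autograd.grad(loss, output, retain_graph=True)
        # Normalize away the loss's scale, then give it its weighted share.
        grad = grad + g * (w / total_weight) / (g.norm() + 1e-12)
    # One backward pass through the model body with the balanced gradient.
    output.backward(grad)
```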
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
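The update rule itself is compact; below is a minimal NumPy sketch of one Adam step, with the bias-corrected first- and second-moment estimates the summary refers to.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m, v are the running moment estimates, t >= 1."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```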
Book ChapterDOI

U-Net: Convolutional Networks for Biomedical Image Segmentation

TL;DR: Ronneberger et al. propose a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently; the network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopy stacks.
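The essential structure is the contracting/expanding path with skip connections; here is a compact PyTorch sketch with illustrative channel sizes (the paper's network is much deeper and uses unpadded convolutions).

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net pattern: encode, decode, concatenate skip features."""

    def __init__(self, in_ch=1, classes=2):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # After concatenation the decoder sees encoder + upsampled channels.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, classes, 1)

    def forward(self, x):                 # x: (batch, in_ch, H, W), H, W even
        s = self.enc1(x)                  # full-resolution features
        y = self.enc2(self.pool(s))       # contracting path
        y = self.up(y)                    # expanding path
        y = self.dec(torch.cat([s, y], dim=1))  # skip connection
        return self.head(y)               # per-pixel class scores
```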
Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
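The mechanism behind this result is normalizing each feature over the mini-batch and then applying a learned affine transform; below is a minimal NumPy sketch of the training-time forward pass (inference uses running averages of the batch statistics, omitted here).

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: learned (features,) parameters."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize over the mini-batch
    return gamma * x_hat + beta               # learned scale and shift
```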
Trending Questions (1)
How do models that separate audio into stems work?

Music source separation models isolate instrument stems either by computing masks on the magnitude spectrogram of the mixture or by operating directly on the waveform with architectures such as Demucs, which combines a U-Net structure with a bidirectional LSTM.
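A hedged sketch of the masking approach just described: estimate a per-stem mask over the magnitude spectrogram, apply it, and invert using the mixture's phase. Here `mask_model` is a hypothetical placeholder for any learned mask estimator.

```python
import torch

def separate_by_masking(mix, mask_model, n_fft=2048, hop=512):
    """mix: (time,) waveform -> (stems, time) estimated sources."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mix, n_fft, hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    # mask_model is assumed to return (stems, freq, frames) values in [0, 1].
    masks = mask_model(mag)
    stems = []
    for mask in masks:
        # Apply the mask to the magnitude and reuse the mixture phase.
        masked = torch.polar(mask * mag, phase)
        stems.append(torch.istft(masked, n_fft, hop, window=window,
                                 length=mix.shape[-1]))
    return torch.stack(stems)
```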