Open Access · Journal Article

Speaker-Independent Speech Separation With Deep Attractor Network

TL;DR
In this article, a neural network is used to project the time-frequency representation of the mixture signal into a high-dimensional embedding space and a reference point (attractor) is created to represent each speaker.
Abstract
Despite the recent success of deep learning for many speech processing tasks, single-microphone, speaker-independent speech separation remains challenging for two main reasons. The first is the arbitrary order of the target and masker speakers in the mixture (the permutation problem), and the second is the unknown number of speakers in the mixture (the output dimension problem). We propose a novel deep learning framework for speech separation that addresses both issues. We use a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space. A reference point (attractor), defined as the centroid of a speaker's embeddings, is created in the embedding space to represent that speaker. The time-frequency embeddings of each speaker are then forced to cluster around the corresponding attractor point, which is used to determine the speaker's time-frequency assignment. We propose three methods for finding the attractors for each source in the embedding space and compare their advantages and limitations. The objective function for the network is the standard signal reconstruction error, which enables end-to-end operation during both training and test phases. We evaluate our system on two- and three-speaker mixtures from the Wall Street Journal dataset (WSJ0) and report performance comparable to or better than other state-of-the-art deep learning methods for speech separation.
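The attractor mechanism described in the abstract can be sketched in a few lines of NumPy. The shapes, variable names, and softmax masking below are illustrative assumptions for the training case (where ideal assignments are known), not the paper's exact implementation:

```python
import numpy as np

def danet_masks(embeddings, ideal_assignments):
    """Compute attractors and soft separation masks.

    embeddings:        (TF, K) array - K-dim embedding per time-frequency bin
    ideal_assignments: (TF, C) one-hot speaker assignment per bin (training only)
    """
    # Attractor for each speaker: centroid of that speaker's embeddings
    counts = ideal_assignments.sum(axis=0, keepdims=True)       # (1, C)
    attractors = embeddings.T @ ideal_assignments / counts      # (K, C)

    # Similarity of every T-F embedding to every attractor
    logits = embeddings @ attractors                            # (TF, C)

    # Soft masks via softmax over speakers; each bin's masks sum to 1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    masks = e / e.sum(axis=1, keepdims=True)
    return attractors, masks
```

Multiplying each mask with the mixture spectrogram then yields the per-speaker estimates whose reconstruction error serves as the training objective.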


Citations
Journal Article

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

TL;DR: A fully convolutional time-domain audio separation network (Conv-TasNet) is proposed: a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
Proceedings Article

SDR – Half-baked or Well Done?

TL;DR: The scale-invariant signal-to-distortion ratio (SI-SDR) is proposed as a more robust measure for single-channel separation than the SDR implemented in the BSS_eval toolkit.
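The SI-SDR computation the summary refers to can be sketched as follows; this is a minimal single-channel version assuming 1-D NumPy signals, not the reference implementation:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for a single-channel estimate."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Because the target is obtained by projection, rescaling the estimate leaves the value unchanged, which is exactly the scale invariance the measure is named for.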
Proceedings Article

Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation

TL;DR: A dual-path recurrent neural network (DPRNN) is proposed for modeling extremely long sequences: standard RNNs are ineffective on such sequences due to optimization difficulties, while one-dimensional CNNs cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length.
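The dual-path idea rests on splitting a long sequence into short overlapping chunks, which intra-chunk and inter-chunk RNNs then process alternately. A minimal segmentation sketch (shapes and names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def segment(x, k):
    """Split a sequence x of shape (T, N) into 50%-overlapping chunks of
    length k, stacked as (num_chunks, k, N); the tail is zero-padded."""
    hop = k // 2
    t, n = x.shape
    num = int(np.ceil(max(t - k, 0) / hop)) + 1
    pad = (num - 1) * hop + k - t
    x = np.vstack([x, np.zeros((pad, n))])
    return np.stack([x[i * hop : i * hop + k] for i in range(num)])
```

With chunk length roughly the square root of the sequence length, both the intra-chunk and inter-chunk passes stay short, which is what makes very long inputs tractable for RNNs.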
Proceedings Article

Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

TL;DR: It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
Posted Content

Wavesplit: End-to-End Speech Separation by Speaker Clustering

TL;DR: Wavesplit redefines the state of the art on clean mixtures of two or three speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
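The update rule summarized above can be sketched for a single step; hyperparameter defaults follow common convention, and the scalar formulation is a didactic simplification, not the paper's pseudocode:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the first and second
    moments of the gradient, with bias correction (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Dividing by the square root of the second-moment estimate gives each parameter its own effective step size, which is the "adaptive" part of the method.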
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
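The gating described above can be sketched for a single timestep. This is a didactic NumPy cell with one combined weight matrix and biases omitted for brevity, an assumption rather than the paper's formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM step. The additive cell update is the 'constant error
    carousel' that lets gradients flow across long time lags.

    x: (D,) input, h: (H,) hidden state, c: (H,) cell state,
    W: (4H, D + H) combined weights for input, forget, output, cell gates.
    """
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated additive update
    h = sigmoid(o) * np.tanh(c)
    return h, c
```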
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
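The mechanism is simple enough to sketch directly; the "inverted" rescaling below is a common variant that keeps the expected activation unchanged, an implementation choice rather than the paper's exact recipe:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training,
    and rescale survivors by 1/(1-p) so the expected output equals x."""
    if not train or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```

At test time the function is the identity, so no rescaling of weights is needed at inference.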
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; it is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
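The negative-sampling objective mentioned above can be sketched for one training pair; the vector shapes and the simple loss form below are illustrative assumptions, not the paper's full training loop:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(v_w, v_c, v_neg):
    """Skip-gram negative-sampling loss for one (word, context) pair.

    v_w:   (D,) center-word vector
    v_c:   (D,) true context vector
    v_neg: (K, D) vectors of K sampled noise words
    """
    pos = np.log(sigmoid(v_c @ v_w))            # pull true pair together
    neg = np.log(sigmoid(-(v_neg @ v_w))).sum() # push noise words away
    return -(pos + neg)
```

Replacing the full softmax over the vocabulary with K binary discriminations per pair is what makes training on millions of phrases feasible.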