Open Access · Journal Article

Speaker-Independent Speech Separation With Deep Attractor Network

TL;DR
In this article, a neural network is used to project the time-frequency representation of the mixture signal into a high-dimensional embedding space and a reference point (attractor) is created to represent each speaker.
Abstract
Despite the recent success of deep learning for many speech processing tasks, single-microphone, speaker-independent speech separation remains challenging for two main reasons. The first is the arbitrary order of the target and masker speakers in the mixture (the permutation problem), and the second is the unknown number of speakers in the mixture (the output dimension problem). We propose a novel deep learning framework for speech separation that addresses both issues. We use a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space. A reference point (attractor), defined as the centroid of a speaker's embeddings, is created in the embedding space to represent that speaker. The time-frequency embeddings of each speaker are then forced to cluster around the corresponding attractor point, which is used to determine the speaker's time-frequency assignment. We propose three methods for finding the attractors for each source in the embedding space and compare their advantages and limitations. The objective function for the network is the standard signal reconstruction error, which enables end-to-end operation during both training and test phases. We evaluate our system on two- and three-speaker mixtures from the Wall Street Journal dataset (WSJ0) and report performance comparable to or better than other state-of-the-art deep learning methods for speech separation.
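The attractor mechanism described in the abstract can be sketched in a few lines of NumPy. The shapes, variable names, and softmax masking below are illustrative assumptions for the training case (where ideal assignments are known), not the paper's exact implementation:

```python
import numpy as np

def danet_masks(embeddings, ideal_assignments):
    """Compute attractors and soft separation masks.

    embeddings:        (TF, K) array - K-dim embedding per time-frequency bin
    ideal_assignments: (TF, C) one-hot speaker assignment per bin (training only)
    """
    # Attractor for each speaker: centroid of that speaker's embeddings
    counts = ideal_assignments.sum(axis=0, keepdims=True)       # (1, C)
    attractors = embeddings.T @ ideal_assignments / counts      # (K, C)

    # Similarity of every T-F embedding to every attractor
    logits = embeddings @ attractors                            # (TF, C)

    # Soft masks via softmax over speakers; each bin's masks sum to 1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    masks = e / e.sum(axis=1, keepdims=True)
    return attractors, masks
```

Multiplying each mask with the mixture spectrogram then yields the per-speaker estimates whose reconstruction error serves as the training objective.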


Citations
Journal Article

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

TL;DR: A fully convolutional time-domain audio separation network (Conv-TasNet) is proposed: a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
Proceedings Article

SDR – Half-baked or Well Done?

TL;DR: The scale-invariant signal-to-distortion ratio (SI-SDR) is proposed as a more robust measure for single-channel separation than the SDR implemented in the BSS_eval toolkit.
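The SI-SDR computation the summary refers to can be sketched as follows; this is a minimal single-channel version assuming 1-D NumPy signals, not the reference implementation:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for a single-channel estimate."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Because the target is obtained by projection, rescaling the estimate leaves the value unchanged, which is exactly the scale invariance the measure is named for.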
Proceedings Article

Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation

TL;DR: A dual-path recurrent neural network (DPRNN) is proposed for modeling extremely long sequences: standard RNNs are ineffective on such sequences due to optimization difficulties, while one-dimensional CNNs cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length.
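The dual-path idea rests on splitting a long sequence into short overlapping chunks, which intra-chunk and inter-chunk RNNs then process alternately. A minimal segmentation sketch (shapes and names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def segment(x, k):
    """Split a sequence x of shape (T, N) into 50%-overlapping chunks of
    length k, stacked as (num_chunks, k, N); the tail is zero-padded."""
    hop = k // 2
    t, n = x.shape
    num = int(np.ceil(max(t - k, 0) / hop)) + 1
    pad = (num - 1) * hop + k - t
    x = np.vstack([x, np.zeros((pad, n))])
    return np.stack([x[i * hop : i * hop + k] for i in range(num)])
```

With chunk length roughly the square root of the sequence length, both the intra-chunk and inter-chunk passes stay short, which is what makes very long inputs tractable for RNNs.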
Proceedings Article

Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

TL;DR: It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
Posted Content

Wavesplit: End-to-End Speech Separation by Speaker Clustering

TL;DR: Wavesplit redefines the state of the art on clean mixtures of two or three speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
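The update rule summarized above can be sketched for a single step; hyperparameter defaults follow common convention, and the scalar formulation is a didactic simplification, not the paper's pseudocode:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the first and second
    moments of the gradient, with bias correction (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Dividing by the square root of the second-moment estimate gives each parameter its own effective step size, which is the "adaptive" part of the method.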
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
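The gating described above can be sketched for a single timestep. This is a didactic NumPy cell with one combined weight matrix and biases omitted for brevity, an assumption rather than the paper's formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM step. The additive cell update is the 'constant error
    carousel' that lets gradients flow across long time lags.

    x: (D,) input, h: (H,) hidden state, c: (H,) cell state,
    W: (4H, D + H) combined weights for input, forget, output, cell gates.
    """
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated additive update
    h = sigmoid(o) * np.tanh(c)
    return h, c
```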
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
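The mechanism is simple enough to sketch directly; the "inverted" rescaling below is a common variant that keeps the expected activation unchanged, an implementation choice rather than the paper's exact recipe:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training,
    and rescale survivors by 1/(1-p) so the expected output equals x."""
    if not train or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```

At test time the function is the identity, so no rescaling of weights is needed at inference.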
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; it is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
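The negative-sampling objective mentioned above can be sketched for one training pair; the vector shapes and the simple loss form below are illustrative assumptions, not the paper's full training loop:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(v_w, v_c, v_neg):
    """Skip-gram negative-sampling loss for one (word, context) pair.

    v_w:   (D,) center-word vector
    v_c:   (D,) true context vector
    v_neg: (K, D) vectors of K sampled noise words
    """
    pos = np.log(sigmoid(v_c @ v_w))            # pull true pair together
    neg = np.log(sigmoid(-(v_neg @ v_w))).sum() # push noise words away
    return -(pos + neg)
```

Replacing the full softmax over the vocabulary with K binary discriminations per pair is what makes training on millions of phrases feasible.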