Journal ArticleDOI

Blind separation of speech mixtures via time-frequency masking

TL;DR: The results demonstrate that there exist ideal binary time-frequency masks that can separate several speech signals from one mixture, and show that the W-disjoint orthogonality of speech holds approximately, so that the ideal masks can be approximated in the case where two anechoic mixtures are provided.
Abstract: Binary time-frequency masks are powerful tools for the separation of sources from a single mixture. Perfect demixing via binary time-frequency masks is possible provided the time-frequency representations of the sources do not overlap: a condition we call W-disjoint orthogonality. We introduce here the concept of approximate W-disjoint orthogonality and present experimental results demonstrating the level of approximate W-disjoint orthogonality of speech in mixtures of various orders. The results demonstrate that there exist ideal binary time-frequency masks that can separate several speech signals from one mixture. While determining these masks blindly from just one mixture is an open problem, we show that we can approximate the ideal masks in the case where two anechoic mixtures are provided. Motivated by the maximum likelihood mixing parameter estimators, we define a power weighted two-dimensional (2-D) histogram constructed from the ratio of the time-frequency representations of the mixtures that is shown to have one peak for each source with peak location corresponding to the relative attenuation and delay mixing parameters. The histogram is used to create time-frequency masks that partition one of the mixtures into the original sources. Experimental results on speech mixtures verify the technique. Example demixing results can be found online at http://alum.mit.edu/www/rickard/bss.html.
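The power-weighted 2-D histogram described in the abstract can be illustrated in a few lines of numpy. The sketch below is illustrative, not the paper's exact estimator: the symmetric attenuation variable and the linear frequency mapping are common DUET-style conventions assumed here.

```python
import numpy as np

def duet_histogram(X1, X2, n_bins=35, eps=1e-12):
    """Power-weighted 2-D histogram of relative attenuation and delay,
    built from the ratio of two mixture STFTs (shape: n_freq x n_frames).
    Under approximate W-disjoint orthogonality, one peak is expected per
    source, located at that source's mixing parameters."""
    n_freq = X1.shape[0]
    k = np.arange(n_freq).reshape(-1, 1)
    # illustrative frequency mapping; k=0 is clamped to avoid division by zero
    omega = np.pi * np.maximum(k, 1) / max(n_freq - 1, 1)
    R = (X2 + eps) / (X1 + eps)          # ratio of time-frequency points
    a = np.abs(R)                        # relative attenuation estimate
    alpha = a - 1.0 / a                  # symmetric attenuation (DUET-style)
    delta = -np.angle(R) / omega         # relative delay estimate (samples)
    w = np.abs(X1 * X2)                  # power weighting
    hist, a_edges, d_edges = np.histogram2d(
        alpha.ravel(), delta.ravel(), bins=n_bins, weights=w.ravel())
    return hist, a_edges, d_edges
```

For a single synthetic source mixed with attenuation a and delay d, the histogram peak lands in the bin containing (a - 1/a, d).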
Citations
Book
08 Mar 2010
TL;DR: This handbook provides the definitive reference on Blind Source Separation, giving a broad and comprehensive description of all the core principles and methods, numerical algorithms and major applications in the fields of telecommunications, biomedical engineering and audio, acoustic and speech processing.
Abstract: Edited by the people who were forerunners in creating the field, together with contributions from 34 leading international experts, this handbook provides the definitive reference on Blind Source Separation, giving a broad and comprehensive description of all the core principles and methods, numerical algorithms and major applications in the fields of telecommunications, biomedical engineering and audio, acoustic and speech processing. Going beyond a machine learning perspective, the book reflects recent results in signal processing and numerical analysis, and includes topics such as optimization criteria, mathematical tools, the design of numerical algorithms, convolutive mixtures, and time-frequency approaches. This Handbook is an ideal reference for university researchers, R&D engineers and graduate students. It covers algebraic identification of under-determined mixtures, time-frequency methods, Bayesian approaches, blind identification under non-negativity constraints, and semi-blind methods for communications, and shows the applications of the methods to key application areas such as telecommunications, biomedical engineering, speech, acoustic, audio and music processing, while also giving a general method for developing applications.

1,627 citations


Cites background from "Blind separation of speech mixtures..."

  • ...Some ten years elapsed before these techniques began to be fully exploited for blind source separation [72, 106, 66, 70, 116, 61, 15, 111]....


Journal ArticleDOI
TL;DR: A comprehensive overview of deep learning-based supervised speech separation can be found in this paper, where three main components of supervised separation are discussed: learning machines, training targets, and acoustic features.
Abstract: Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
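One of the training targets discussed in this overview, the ideal binary mask, is simple to state. A minimal sketch, assuming the magnitude STFTs of the target and the interference are available during training (as they are for synthetic mixtures):

```python
import numpy as np

def ideal_binary_mask(S_target, S_noise, lc_db=0.0):
    """Ideal binary mask: 1 where the target dominates the interference
    by more than a local criterion (lc_db, in dB), else 0. Inputs are
    magnitude (or complex) STFTs of the same shape."""
    snr_db = 20.0 * np.log10(
        np.abs(S_target) / np.maximum(np.abs(S_noise), 1e-12))
    return (snr_db > lc_db).astype(float)
```

Applying the resulting mask to the mixture STFT keeps only the time-frequency units where the target is locally stronger than the interference.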

1,009 citations

Journal ArticleDOI
TL;DR: Several commonly-used sparsity measures are compared based on whether or not they satisfy these six propositions, and only two of these measures satisfy all six: the pq-mean with p ≤ 1, q > 1 and the Gini index.
Abstract: Sparsity of representations of signals has been shown to be a key concept of fundamental importance in fields such as blind source separation, compression, sampling and signal analysis. The aim of this paper is to compare several commonly-used sparsity measures based on intuitive attributes. Intuitively, a sparse representation is one in which a small number of coefficients contain a large proportion of the energy. In this paper, six properties are discussed (Robin Hood, Scaling, Rising Tide, Cloning, Bill Gates, and Babies), each of which a sparsity measure should have. The main contributions of this paper are the proofs and the associated summary table which classify commonly-used sparsity measures based on whether or not they satisfy these six propositions. Only two of these measures satisfy all six: the pq-mean with p ≤ 1, q > 1 and the Gini index.
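Of the two measures that pass all six tests, the Gini index is easy to compute. The sketch below uses a standard closed form over sorted coefficient magnitudes; it is a common formulation, not necessarily the exact one used in the paper:

```python
import numpy as np

def gini_index(x):
    """Gini index of a coefficient vector: 0 for perfectly uniform energy,
    approaching 1 as all energy concentrates in a single coefficient."""
    c = np.sort(np.abs(np.asarray(x, dtype=float)))  # ascending magnitudes
    N = c.size
    s = c.sum()
    if s == 0:
        return 0.0
    k = np.arange(1, N + 1)
    return 1.0 - 2.0 * np.sum((c / s) * (N - k + 0.5) / N)
```

A uniform vector scores 0, while a one-hot vector of length 4 scores 0.75 (tending to 1 as the length grows), matching the intuition that sparse representations concentrate energy in few coefficients.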

667 citations


Cites background from "Blind separation of speech mixtures..."

  • ...There has also been research in the uniqueness of sparse solutions in overcomplete representations [22], [23]....


Journal ArticleDOI
TL;DR: This paper proposes to analyze a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering.
Abstract: Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial preprocessing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this paper, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering. We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
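As a concrete instance of axis 2), the simplest spatial filter design is the delay-and-sum beamformer. A minimal STFT-domain sketch, assuming the steering vectors toward the target are already known (e.g. from a known direction of arrival):

```python
import numpy as np

def delay_and_sum(X, steering):
    """Delay-and-sum beamformer in the STFT domain.
    X        : (n_mics, n_freq, n_frames) complex mixture STFTs.
    steering : (n_mics, n_freq) complex steering vectors toward the target
               (unit-modulus phase shifts for a pure-delay propagation model).
    The mixtures are phase-aligned to the target and averaged, so the target
    adds coherently while interference from other directions does not."""
    w = steering / steering.shape[0]              # uniform weights, 1/M each
    return np.einsum('mf,mft->ft', np.conj(w), X)
```

With matched unit-modulus steering vectors and no interference, the beamformer output recovers the source STFT exactly, which is why this design is often the baseline against which optimal filters (e.g. MVDR) are compared.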

452 citations

Journal ArticleDOI
TL;DR: This review presents an overview of a challenging problem in auditory perception, the cocktail party phenomenon, the delineation of which goes back to a classic paper by Cherry in 1953.
Abstract: This review presents an overview of a challenging problem in auditory perception, the cocktail party phenomenon, the delineation of which goes back to a classic paper by Cherry in 1953. In this review, we address the following issues: (1) human auditory scene analysis, which is a general process carried out by the auditory system of a human listener; (2) insight into auditory perception, which is derived from Marr's vision theory; (3) computational auditory scene analysis, which focuses on specific approaches aimed at solving the machine cocktail party problem; (4) active audition, the proposal for which is motivated by analogy with active vision, and (5) discussion of brain theory and independent component analysis, on the one hand, and correlative neural firing, on the other.

408 citations


Cites background from "Blind separation of speech mixtures..."

  • ...three approaches being reviewed here, several other approaches, some of them quite promising, have been discussed in the literature: Bayesian approaches (e.g., Knuth, 1999; Mohammad-Djafari, 1999; Rowe, 2002; Attias, 1999; Chan, Lee, & Sejnowski, 2003), time-frequency analysis approaches (e.g., Belouchrani & Amin, 1998; Rickard, Balan, & Rosca, 2001; Rickard & Yilmaz, 2002; Yilmaz & Rickard, 2004), and neural network ...


  • ...…Attias, 1999; Chan, Lee, & Sejnowski, 2003), time-frequency analysis approaches (e.g., Belouchrani & Amin, 1998; Rickard, Balan, & Rosca, 2001; Rickard & Yilmaz, 2002; Yilmaz & Rickard, 2004), and neural network approaches (e.g., Amari & Cichocki, 1998; Grossberg, Govindarajan, Wyse, & Cohen, 2004)....


References
Book
01 May 1992
TL;DR: This book presents the what, why, and how of wavelets, covering the continuous and discrete wavelet transforms, frames, time-frequency density, and the construction of orthonormal bases of compactly supported wavelets via multiresolution analysis.
Abstract: Introduction Preliminaries and notation The what, why, and how of wavelets The continuous wavelet transform Discrete wavelet transforms: Frames Time-frequency density and orthonormal bases Orthonormal bases of wavelets and multiresolutional analysis Orthonormal bases of compactly supported wavelets More about the regularity of compactly supported wavelets Symmetry for compactly supported wavelet bases Characterization of functional spaces by means of wavelets Generalizations and tricks for orthonormal wavelet bases References Indexes.

16,073 citations


"Blind separation of speech mixtures..." refers methods in this paper

  • ...In Section IV, we verify the method presenting demixing results for speech signals mixed synthetically and in both anechoic and echoic rooms....


Journal ArticleDOI
TL;DR: In this article, the regularity of compactly supported wavelets and the symmetry of wavelet bases are discussed, with the focus on orthonormal bases of wavelets rather than the continuous wavelet transform.
Abstract: Introduction Preliminaries and notation The what, why, and how of wavelets The continuous wavelet transform Discrete wavelet transforms: Frames Time-frequency density and orthonormal bases Orthonormal bases of wavelets and multiresolutional analysis Orthonormal bases of compactly supported wavelets More about the regularity of compactly supported wavelets Symmetry for compactly supported wavelet bases Characterization of functional spaces by means of wavelets Generalizations and tricks for orthonormal wavelet bases References Indexes.

14,157 citations

Journal ArticleDOI
TL;DR: This work suggests a two-stage separation process: a priori selection of a possibly overcomplete signal dictionary in which the sources are assumed to be sparsely representable, followed by unmixing the sources by exploiting their sparse representability.
Abstract: The blind source separation problem is to extract the underlying source signals from a set of linear mixtures, where the mixing matrix is unknown. This situation is common in acoustics, radio, medical signal and image processing, hyperspectral imaging, and other areas. We suggest a two-stage separation process: a priori selection of a possibly overcomplete signal dictionary (for instance, a wavelet frame or a learned dictionary) in which the sources are assumed to be sparsely representable, followed by unmixing the sources by exploiting their sparse representability. We consider the general case of more sources than mixtures, but also derive a more efficient algorithm in the case of a nonovercomplete dictionary and an equal number of sources and mixtures. Experiments with artificial signals and musical sounds demonstrate significantly better separation than other known techniques.
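The second stage, unmixing via sparsity, can be sketched with a winner-take-all assignment. This is a simplification for illustration, assuming the mixing matrix has already been estimated; it is not the authors' exact algorithm:

```python
import numpy as np

def sparse_unmix(X, A):
    """Winner-take-all unmixing in a sparse domain.
    X : (n_mix, n_coef) mixture coefficients (e.g. wavelet-frame analysis).
    A : (n_mix, n_src) estimated mixing matrix; may have more columns
        (sources) than rows (mixtures), i.e. the degenerate case.
    Each coefficient is attributed to the single mixing column that best
    explains it, exploiting the assumption that sparse sources rarely
    overlap on the same dictionary coefficient."""
    A = A / np.linalg.norm(A, axis=0)        # unit-norm columns
    proj = A.T @ X                           # correlation with each direction
    labels = np.argmax(np.abs(proj), axis=0) # winning column per coefficient
    S = np.zeros((A.shape[1], X.shape[1]))
    cols = np.arange(X.shape[1])
    S[labels, cols] = proj[labels, cols]     # keep only the winner's value
    return S
```

With three sources, two mixtures, and disjointly supported coefficients, the assignment recovers the sources exactly, which is the degenerate case the abstract refers to.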

829 citations

Proceedings ArticleDOI
05 Jun 2000
TL;DR: A novel method for blind separation of any number of sources using only two mixtures when sources are (W-)disjoint orthogonal, that is, when the supports of the (windowed) Fourier transform of any two signals in the mixture are disjoint sets.
Abstract: We present a novel method for blind separation of any number of sources using only two mixtures. The method applies when sources are (W-)disjoint orthogonal, that is, when the supports of the (windowed) Fourier transform of any two signals in the mixture are disjoint sets. We show that, for anechoic mixtures of attenuated and delayed sources, the method allows one to estimate the mixing parameters by clustering ratios of the time-frequency representations of the mixtures. The estimates of the mixing parameters are then used to partition the time-frequency representation of one mixture to recover the original sources. The technique is valid even in the case when the number of sources is larger than the number of mixtures. The general results are verified on both speech and wireless signals.
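Once the attenuation and delay pairs have been estimated (e.g. as histogram peaks), partitioning one mixture reduces to a nearest-model assignment per time-frequency point. A hedged sketch with an illustrative linear frequency mapping, not the paper's exact maximum-likelihood assignment:

```python
import numpy as np

def duet_partition(X1, X2, params):
    """Assign each time-frequency point of mixture X1 to the source whose
    estimated (attenuation a, delay d) best predicts X2 from X1, then apply
    the resulting binary masks.
    X1, X2 : (n_freq, n_frames) complex mixture STFTs.
    params : list of (a, d) pairs, one per source.
    Returns a list of masked copies of X1, one estimated source each."""
    n_freq = X1.shape[0]
    omega = np.pi * np.arange(n_freq).reshape(-1, 1) / max(n_freq - 1, 1)
    costs = [np.abs(X2 - a * np.exp(-1j * omega * d) * X1) ** 2
             for a, d in params]             # mismatch under each (a, d) model
    labels = np.argmin(np.stack(costs), axis=0)
    return [np.where(labels == i, X1, 0) for i in range(len(params))]
```

For two sources with disjoint time-frequency supports, the masks recover each source exactly from one mixture, mirroring the W-disjoint orthogonality argument.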

477 citations


"Blind separation of speech mixtures..." refers background or methods in this paper

  • ...The mixing model in [1]–[3], [5], [8], [9], and [11] is “instantaneous” (sources have different amplifications in different mixtures), whereas [4], [6], [7], [10], and [12] use an anechoic mixing model (sources have different amplifications and time delays in different mixtures)....


  • ...In Section IV, we verify the method presenting demixing results for speech signals mixed synthetically and in both anechoic and echoic rooms....


  • ...Based on this, we extend the degenerate unmixing estimation technique (DUET), which was originally presented in [4] for sources with disjointly supported STFTs, to anechoic mixtures of speech signals....


Proceedings Article
01 Jan 2000
TL;DR: A technique called refiltering is presented which recovers sources by a nonstationary reweighting of frequency sub-bands from a single recording, and it is argued for the application of statistical algorithms to learning this masking function.
Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function. I present results of a simple factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking function and then refiltering.
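Refiltering itself is a one-line operation once the sub-band decomposition and the learned masking function are in hand. A minimal sketch; the masks are assumed given here (in the paper they are inferred by the factorial-HMM system):

```python
import numpy as np

def refilter(subbands, masks):
    """Refiltering: apply a nonstationary (time-varying) gain to each
    frequency sub-band of a single recording and sum across bands to
    resynthesize one estimated source.
    subbands, masks : (n_bands, n_samples) arrays; mask values in [0, 1]."""
    return np.sum(subbands * masks, axis=0)
```

Because the masks vary over time, this is a nonstationary reweighting: at each sample, a different combination of sub-bands survives, which is what lets a single observation signal be split into sources.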

476 citations


"Blind separation of speech mixtures..." refers background or methods in this paper

  • ...The fact that such a mask exists has also been observed in [14] in the context of BSS of speech signals from one mixture and in [15] in the context of source localization....


  • ...In Section IV, we verify the method presenting demixing results for speech signals mixed synthetically and in both anechoic and echoic rooms....
