
HAL Id: hal-00626962
https://hal.archives-ouvertes.fr/hal-00626962v2
Submitted on 22 Jun 2012 (v2), last revised 5 Jan 2016 (v4)
A General Flexible Framework for the Handling of Prior Information in Audio Source Separation
Alexey Ozerov, Emmanuel Vincent, Frédéric Bimbot

To cite this version:
Alexey Ozerov, Emmanuel Vincent, Frédéric Bimbot. A General Flexible Framework for the Handling of Prior Information in Audio Source Separation. IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2012, 20 (4), pp. 1118-1133. hal-00626962v2

A General Flexible Framework for the Handling of Prior Information in Audio Source Separation
Alexey Ozerov, Member, IEEE, Emmanuel Vincent, Senior Member, IEEE, and Frédéric Bimbot
Abstract—Most audio source separation methods are developed for a particular scenario characterized by the number of sources and channels and the characteristics of the sources and the mixing process. In this paper we introduce a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints. While this framework generalizes several existing audio source separation methods, it also makes it possible to imagine and implement new efficient methods that have not yet been reported in the literature. We first introduce the framework by describing the model structure and constraints, explaining its generality, and summarizing its algorithmic implementation using a generalized expectation-maximization algorithm. Finally, we illustrate the above-mentioned capabilities of the framework by applying it in several new and existing configurations to different source separation problems. We have released a software tool named Flexible Audio Source Separation Toolbox (FASST) implementing a baseline version of the framework in Matlab.
Index Terms—Audio source separation, local Gaussian model, nonnegative matrix factorization, expectation-maximization
I. INTRODUCTION
Separating audio sources from multichannel mixtures is still challenging in most situations. The main difficulty is that audio source separation problems are usually mathematically ill-posed: to succeed, one needs to incorporate additional knowledge about the mixing process and/or the source signals. Thus, efficient source separation methods are usually developed for a particular source separation problem characterized by a certain problem dimensionality, e.g., determined or underdetermined, certain mixing process characteristics, e.g., instantaneous or convolutive, and certain source characteristics, e.g., speech, singing voice, drums, bass or noise [1]. For example, a source separation problem may be formulated as follows:

"Separate bass, drums, melody and the remaining instruments from a stereo professionally produced music recording."

Given a source separation problem, one typically must introduce as much knowledge about this problem as possible into the corresponding separation method so as to achieve good separation performance. However, there is often no common formulation describing the methods applied to different problems, and this makes it difficult to reuse a method for a problem it was not originally conceived for. Thus, given a new source separation problem, the common approach consists in (i) model design, taking the problem formulation into account, (ii) algorithm design and (iii) implementation (see Fig. 1, top).

A. Ozerov and E. Vincent are with INRIA, Rennes - Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes cedex, France (e-mails: alexey.ozerov@inria.fr, emmanuel.vincent@inria.fr). F. Bimbot is with IRISA, CNRS - UMR 6074, Campus de Beaulieu, 35042 Rennes cedex, France (e-mail: frederic.bimbot@irisa.fr). This work was partly supported by OSEO, the French State agency for innovation, under the Quaero program, and by the French Ministry of Foreign and European Affairs, the French Ministry of Higher Education and Research and the German Academic Exchange Service under project Procope 20142UD.
[Fig. 1. Current way of addressing a new source separation problem (top) and the way of addressing it using the proposed flexible framework (bottom). Top path: source separation problem → model design → algorithm design → algorithm implementation → source separation. Bottom path: source separation problem → specification of constraints from a library → source separation.]
The motivation of this work is to improve over this time-consuming process by designing a general audio source separation framework that can be applied to virtually any source separation problem by simply selecting, from a library of constraints, suitable constraints accounting for the available information about each source (see Fig. 1, bottom). More precisely, we wish such a framework to be
- general, i.e., generalizing existing methods and making it possible to combine them,
- flexible, allowing easy incorporation of a priori knowledge about the particular problem considered.

To achieve the property of generality, we need a common formulation for the methods we would like to generalize. Many recently proposed methods for audio source separation and/or characterization [2]–[19] (see also [1] and references therein) are based on the same so-called local Gaussian model describing both the properties of the sources and of the mixing process. Thus, we chose this model as the core of our framework. To achieve flexibility, we fix the global structure of the Gaussian covariances and, by means of a parametric model, allow the introduction of knowledge about each individual source and its mixing characteristics via constraints on individual parameter subsets. The global

structure we consider corresponds to a generative model of the data that is motivated by the physics of the modeled processes, e.g., the source-filter model to represent a sound source and an approximation of the convolutive filter to represent its mixing characteristics. In summary, our framework generalizes the methods from [2]–[19] and, thanks to its flexibility, becomes applicable in many other scenarios one can imagine.

We implement our framework using a generalized expectation-maximization (GEM) algorithm [20], where the M-step is solved by alternating optimization of different parameter subsets, taking the corresponding constraints into account and using multiplicative update (MU) rules inspired by the nonnegative matrix factorization (NMF) methodology (see, e.g., [9]) to update the nonnegative spectral parameters. Such an implementation is possible thanks to the Gaussianity assumption, which leads to closed-form update equations. The idea of combining a GEM algorithm with MU rules was already reported in [21] in the case of plain NMF spectral models and rank-1 spatial models, and we extend it here to the newly proposed structures. Our algorithmic contribution consists of (i) identifying the GEM-MU approach as suitable, thanks to its implementability within the configurable framework, the simplicity of the update rules, the implicit enforcement of nonnegativity constraints and its good convergence speed; and (ii) deriving the update rules for the new model structures.
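To give a feel for the kind of MU rules referred to here, the following minimal NumPy sketch (ours) implements plain NMF under the Itakura-Saito divergence, in the spirit of [9]; it covers only the unstructured single-channel case, not the framework's full structured updates.

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative-update NMF under the Itakura-Saito divergence.

    A minimal sketch in the spirit of [9]: V is a nonnegative F x N power
    spectrogram, K the number of components. Not the framework's
    structured update rules.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps   # spectral patterns
    H = rng.random((K, N)) + eps   # time activations
    for _ in range(n_iter):
        Vh = W @ H
        # IS-divergence MU rule for H: ratio of the negative to the
        # positive part of the gradient (keeps H nonnegative)
        H *= (W.T @ (V / (Vh ** 2))) / (W.T @ (1.0 / Vh))
        Vh = W @ H
        # IS-divergence MU rule for W
        W *= ((V / (Vh ** 2)) @ H.T) / ((1.0 / Vh) @ H.T)
    return W, H
```

For instance, W, H = is_nmf(np.abs(X)**2, K=8) factorizes a power spectrogram into 8 spectral patterns and their activations; nonnegativity is preserved automatically by the multiplicative form, which is one of the properties that makes MU rules attractive within the GEM M-step.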
Our approach is in line with the library of components by Cardoso et al. [22] developed for the separation of components in astrophysical images. However, we consider advanced audio-specific structures inspired by [1], [23] for the source spectral power, as opposed to the unique block structure in [22] based on the assumption that source power is constant in some pre-defined region of time and space. In that sense, our framework is more flexible than [22]. Besides the framework itself, we propose a new structure for NMF-like decompositions of source power spectrograms, where the temporal envelope associated with each spectral pattern is represented as a nonnegative linear combination of time-localized temporal patterns. This structure can be used to ensure temporal continuity, but also to model more complex temporal characteristics, such as the attack or decay parts of a note. In line with time-localized patterns, we include in our framework the so-called narrowband spectral patterns that allow constraining spectral patterns to be harmonic, inharmonic or noise-like. These structures were already reported in [14], [15], but only in the case of harmonic constraints. Moreover, they had not been applied to source separation so far. As compared to [24], where some preliminary aspects of this work were presented, we here present the framework in detail, describe its implementation, and extend the experimental part illustrating the framework. Moreover, we propose an original mixing model formulation that allows the representation and estimation of rank-1 [5] and full-rank [19] (actually any rank) spatial mixing models in a homogeneous way, thus enabling the combination of both models within a given mixture. Finally, we provide a proper probabilistic formulation of local Gaussian modeling for quadratic time-frequency representations [18] that supports and justifies the formulation given in [18].
We have also implemented and released a baseline version of the framework in Matlab. The corresponding software tool, named Flexible Audio Source Separation Toolbox (FASST), is available at [25] together with a user guide, examples of usage (where the constraints are specified) and the corresponding audio examples. Given a source separation problem, one can choose one or a few suitable constraint combinations based on his/her expertise and a priori knowledge, and then test all of them using FASST so as to select the best one.

In summary, the main contributions of this work include
- a general modeling structure,
- a general estimation algorithm,
- new spectral and temporal structures (time-localized patterns, narrowband spectral patterns),
- the implementation and distribution of a baseline version of the framework (the FASST toolbox [25]).
The rest of this paper is organized as follows. In Section II, existing approaches generalized by the proposed framework are discussed and an overview of the framework is given. Sections III and IV provide a detailed description of the framework and its algorithmic implementation. Thus, Section II is devoted to a reader interested in understanding the main principles of the framework and the physical meaning of the objects, and Sections III and IV to one willing to go deeper into the technical details. The results of a few source separation experiments are given in Section V to illustrate the flexibility of our framework and its potential performance improvement compared to individual approaches. Conclusions are drawn in Section VI.
II. RELATED EXISTING APPROACHES AND FRAMEWORK OVERVIEW

Source separation methods based on the local Gaussian model can be characterized by the following assumptions [1], [2], [5], [13], [19], formalized in equation form below:
1) Gaussianity: in some time-frequency (TF) representation, the sources are modeled in each TF bin by zero-mean Gaussian random variables.
2) Independence: conditionally to their covariance matrices, these random variables are independent over time, frequency and between sources.
3) Factorization of spectral and spatial characteristics: for each TF bin, the covariance matrix of each source is expressed as the product of a spatial covariance matrix representing its spatial characteristics and a scalar spectral power representing its spectral characteristics.
4) Linearity of mixing: the mixing process translates into addition in the covariance domain.
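In equation form, these four assumptions can be restated compactly as follows (the symbols are ours, chosen to match common local Gaussian modeling notation: x_fn denotes the mixture coefficient vector and y_j,fn the spatial image of source j in TF bin (f, n)):

```latex
% Zero-mean Gaussian sources, independent across TF bins and sources
% (assumptions 1-2)
\mathbf{x}_{fn} = \sum_{j=1}^{J} \mathbf{y}_{j,fn}, \qquad
\mathbf{y}_{j,fn} \sim \mathcal{N}_c\big(\mathbf{0},\, \boldsymbol{\Sigma}_{j,fn}\big)

% Factorization of spectral and spatial characteristics (assumption 3)
% and additivity of covariances under linear mixing (assumption 4)
\boldsymbol{\Sigma}_{j,fn} = v_{j,fn}\, \mathbf{R}_{j,fn}, \qquad
\boldsymbol{\Sigma}_{\mathbf{x},fn} = \sum_{j=1}^{J} v_{j,fn}\, \mathbf{R}_{j,fn}
```

where v_{j,fn} ≥ 0 is the scalar spectral power and R_{j,fn} the spatial covariance matrix of source j in bin (f, n).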
A. State-of-the-art approaches based on the local Gaussian model

The state-of-the-art approaches [2]–[19] cover a wide range of source separation problems and models expressed via particular structures of local Gaussian covariances, including:
1) Problem dimensionality: Denoting by I and J, respectively, the number of channels of the observed mixture and the number of sources to separate, the single-channel (I = 1) case is addressed in [6], and underdetermined (1 < I < J) and (over-)determined (I ≥ J) cases are addressed in [5] and [2], respectively.
2) Spatial covariance model: Instantaneous and convolutive mixtures of point sources are modeled by rank-1 spatial covariance matrices in [5] and [3], respectively. In [19], reverberant convolutive mixtures of point sources are modeled by full-rank spatial covariance matrices that, in contrast to rank-1 covariance matrices, can account for the spatial spread of each source induced by the reverberation.
3) Spectral power model: Several models were proposed for the spectral power, e.g., unconstrained models [10], block constant models [5], Gaussian mixture models (GMM) or hidden Markov models (HMM) [2], Gaussian scaled mixture models (GSMM) or scaled HMMs (S-HMM) [13], NMF [4] together with its variants, harmonic NMF [14] or temporal activation constrained NMF [9], and source-filter models [16]. These models are suitable for the representation of different types of sources; for example, a GSMM is rather suitable for a monophonic source, e.g., speech, while NMF suits a polyphonic one, e.g., a polyphonic musical instrument [13].
4) Input representation: While most of the considered methods use the short-time Fourier transform (STFT) as the input TF representation, some of them, e.g., [14], [15], [18], use the auditory-motivated equivalent rectangular bandwidth (ERB) quadratic representation. More generally, we consider here both linear representations, where the signal is represented by a vector of complex-valued coefficients in each TF bin, and quadratic representations, where the signal is represented via its local covariance matrix in each TF bin [26].
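As a rough illustration of this distinction (our notation; the precise construction for a given quadratic transform is detailed in [26]), a linear representation keeps the complex coefficient vector of each bin, while a quadratic one keeps a local covariance, e.g., an empirical average of outer products over a TF neighborhood B_fn:

```latex
% Linear representation: one complex coefficient vector per TF bin
\text{linear:}\quad \mathbf{x}_{fn} \in \mathbb{C}^{I}

% Quadratic representation: a local covariance matrix per TF bin,
% e.g., averaged over a TF neighborhood B_fn (illustrative construction)
\text{quadratic:}\quad
\widehat{\mathbf{R}}_{\mathbf{x},fn}
  = \frac{1}{|\mathcal{B}_{fn}|} \sum_{(f',n') \in \mathcal{B}_{fn}}
    \mathbf{x}_{f'n'}\, \mathbf{x}_{f'n'}^{\mathsf{H}}
  \;\in\; \mathbb{C}^{I \times I}
```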
Table I provides an overview of some of the local Gaussian model-based approaches considered here, where the specificities of each method are marked by crosses (×). We see from Table I that a few of these methods have already been combined; for example, GSMM and NMF were combined in [8], and NMF [9] was combined with rank-1 and full-rank mixing models in [13] and [17], respectively. However, many combinations have not yet been investigated. Indeed, assuming that each source follows one of the 3 spatial covariance models and one of the 8 spectral variance models from Table I, the total number of configurations equals 2 × 24^J for J sources, e.g., 1152 configurations for J = 2 (in fact many more, since each source can follow several spectral variance models at the same time), while Table I reports only 16 existing configurations.
B. Other related state-of-the-art approaches

While the local Gaussian model-based framework offers maximal flexibility, there exist some methods that do not satisfy (fully or partially) the aforementioned assumptions and are thus not strictly covered by the framework. Nevertheless, our framework allows the implementation of similar structures. Let us give some examples. Binary masking-based source estimation [27], [28] does not satisfy the source independence assumption. However, it is known to perform poorly compared to local Gaussian model-based separation, as was shown in [13], [18] for convolutive mixtures¹ and demonstrated through the signal separation evaluation campaigns SiSEC 2008 [30] and SiSEC 2010 [29], where for instantaneous mixtures local Gaussian model-based approaches gave better results than the oracle (using the ground truth) binary masks. The methods proposed in [31], [32] are also based on Gaussian models, albeit in the time domain. Notably, time sample-based GMMs and time-varying autoregressive models are considered as source models in [31] and [32], respectively. However, the number of existing time-domain structures is fairly limited. Our TF domain models make it possible to account for these structures by means of suitable constraints over the spectral power, while allowing their combination with more advanced structures. There are also many works on NMF and its extensions [33]–[38] and on GMMs / HMMs [39], [40] based on nongaussian models of the complex-valued STFT coefficients. These models are essentially covered by our framework in the sense that we can implement similar or equivalent model structures, albeit under Gaussian assumptions. The benefit of local Gaussian modeling is that it naturally leads to closed-form expressions in the multichannel case and allows the modeling of diffuse sources [19], contrary to the models in [33]–[40]. Finally, according to Cardoso [41], nongaussianity and nonstationarity are alternative routes to source separation, such that nonstationary nongaussian models would offer little benefit compared to nonstationary Gaussian models in terms of separation performance, despite a considerably greater computation cost.
C. Framework overview

We now present an overview of the proposed framework, focusing on the most important concepts. An exhaustive description is given in Sections III and IV.

The framework is based on a flexible model described by parameters θ = {θ_j}_{j=1}^{J}, where θ_j denotes the parameters of the j-th source (j = 1, . . . , J). Each θ_j is split in turn into nine parameter subsets according to a fixed structure, as described below and summarized in Table II.
1) Model structure: The parameters of the j-th source include a complex-valued tensor A_j modeling its spatial covariance, and eight nonnegative matrices (θ_{j,2}, . . . , θ_{j,9}) modeling its spectral power over all TF bins.

The spectral power, denoted V_j, is assumed to be the product of an excitation spectral power V_j^ex, representing, e.g., the excitation of the glottal source for voice or the plucking of a guitar string, and a filter spectral power V_j^ft, representing, e.g., the vocal tract or the impedance of the guitar body [23], [35]. While such a model is usually called a source-filter model, we call it here the excitation-filter model in order to avoid possible confusion with the "sources" to be separated.
¹ Binary masking-based approaches can still be quite powerful for convolutive mixtures, as demonstrated in [29]. Thus, a good way to proceed is probably to use them to initialize local Gaussian model-based approaches, as is done in [13] and as we do in the experimental part.

TABLE I
SOME STATE-OF-THE-ART LOCAL GAUSSIAN MODEL-BASED APPROACHES FOR AUDIO SOURCE SEPARATION.
[Table: for each of the references [7], [6], [8], [16], [4], [14], [15], [9], [5], [11], [13], [19], [18], [17], [3], [2], crosses mark the problem dimensionality addressed (single-channel, underdetermined, (over-)determined), the spatial covariance model (rank-1 instantaneous, rank-1 convolutive, full-rank), the spectral variance model (unconstrained, block constant, GMM/HMM, GSMM/S-HMM, NMF, harmonic NMF, temporally constrained NMF, source-filter) and the input representation (linear, quadratic).]
The excitation spectral power V_j^ex is further decomposed as the sum of characteristic spectral patterns E_j^ex modulated by time activation coefficients P_j^ex [4], [9]. Each characteristic spectral pattern may be associated, for instance, with one specific pitch, so that the time activation coefficients denote which pitches are active on each time frame. In order to further constrain the fine structure of the spectral patterns, they are represented as linear combinations of narrowband spectral patterns W_j^ex [14] with weights U_j^ex. These narrowband patterns may be, for instance, harmonic, inharmonic or noise-like, and the weights determine the overall spectral envelope. Following the same idea, we propose here to represent the series of time activation coefficients P_j^ex as sums of time-localized patterns H_j^ex with weights G_j^ex. The time-localized patterns may represent the typical temporal shape of the notes, while the weights encode their onset times. Different temporal fine structures, such as continuity or specific rhythm patterns, may also be accounted for in this way. Note that temporal models of the activation coefficients have been proposed in the state of the art, using probabilistic priors [9], [34], note-specific Gaussian-shaped time-localized patterns [42], or unstructured TF patterns [33]. Our proposition is complementary to [9], [34] in that it accounts for temporal behaviour in the model structure itself, in addition to possible priors on the model parameters. Moreover, it is more flexible than [9], [34], [42], since it allows the modeling of characteristics other than continuity or sparsity. Finally, while it can model TF patterns similar to [33], it involves much fewer parameters, which typically leads to more robust parameter estimation.
The filter spectral power V_j^ft is similarly expressed in terms of characteristic spectral patterns E_j^ft modulated by time activation coefficients [16], which are in turn decomposed into narrowband spectral patterns W_j^ft with weights U_j^ft and time-localized patterns H_j^ft with weights G_j^ft, respectively. In the case of speech or singing voice, each characteristic spectral pattern may represent the spectral formants of a given phoneme, while the plosiveness and the sequence of pronounced phonemes may be encoded by the time-localized patterns and the associated weights.
In summary, as will be explained in detail in Section III-E, the spectral power of each source obeys a three-level hierarchical nonnegative matrix decomposition structure (see equations (9), (10), (12), (13) and Figures 3 and 4 below), including at the bottom level the eight parameter subsets W_j^ex, U_j^ex, G_j^ex, H_j^ex, W_j^ft, U_j^ft, G_j^ft and H_j^ft (see Eq. (13)).
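Written out (our rendering, consistent with the matrix sizes in Table II; the framework's own equations (9)–(13) appear in Section III-E), this structure reads:

```latex
% Excitation-filter spectral power: entrywise product of two
% three-level nonnegative matrix decompositions
\mathbf{V}_j = \mathbf{V}_j^{\mathrm{ex}} \odot \mathbf{V}_j^{\mathrm{ft}}, \qquad
\mathbf{V}_j^{\mathrm{ex}} = \mathbf{W}_j^{\mathrm{ex}}\, \mathbf{U}_j^{\mathrm{ex}}\, \mathbf{G}_j^{\mathrm{ex}}\, \mathbf{H}_j^{\mathrm{ex}}, \qquad
\mathbf{V}_j^{\mathrm{ft}} = \mathbf{W}_j^{\mathrm{ft}}\, \mathbf{U}_j^{\mathrm{ft}}\, \mathbf{G}_j^{\mathrm{ft}}\, \mathbf{H}_j^{\mathrm{ft}}
```

where ⊙ denotes entrywise multiplication; each chain of F × L, L × K, K × M and M × N nonnegative factors yields an F × N power spectrogram.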
Parameter subset                                              Size               Range
θ_{j,1} = A_j       mixing parameters                         I × R_j × F × N    C
θ_{j,2} = W_j^ex    ex. narrowband spectral patterns          F × L_j^ex         R_+
θ_{j,3} = U_j^ex    ex. spectral pattern weights              L_j^ex × K_j^ex    R_+
θ_{j,4} = G_j^ex    ex. time pattern weights                  K_j^ex × M_j^ex    R_+
θ_{j,5} = H_j^ex    ex. time-localized patterns               M_j^ex × N         R_+
θ_{j,6} = W_j^ft    ft. narrowband spectral patterns          F × L_j^ft         R_+
θ_{j,7} = U_j^ft    ft. spectral pattern weights              L_j^ft × K_j^ft    R_+
θ_{j,8} = G_j^ft    ft. time pattern weights                  K_j^ft × M_j^ft    R_+
θ_{j,9} = H_j^ft    ft. time-localized patterns               M_j^ft × N         R_+

TABLE II
PARAMETER SUBSETS θ_{j,k} (j = 1, . . . , J, k = 1, . . . , 9) ENCODING THE STRUCTURE OF EACH SOURCE.
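To make the bookkeeping concrete, here is a minimal NumPy sketch (ours, with arbitrary illustrative dimensions; not FASST code) that instantiates the nonnegative subsets θ_{j,2}, . . . , θ_{j,9} with the sizes of Table II and composes the excitation-filter spectral power:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not prescribed by the framework):
F, N = 513, 400                     # frequency bins x time frames
L_ex, K_ex, M_ex = 30, 20, 50       # excitation-part inner dimensions
L_ft, K_ft, M_ft = 10, 8, 25        # filter-part inner dimensions

# Nonnegative parameter subsets theta_{j,2..9}, sized as in Table II
W_ex = rng.random((F, L_ex));    U_ex = rng.random((L_ex, K_ex))
G_ex = rng.random((K_ex, M_ex)); H_ex = rng.random((M_ex, N))
W_ft = rng.random((F, L_ft));    U_ft = rng.random((L_ft, K_ft))
G_ft = rng.random((K_ft, M_ft)); H_ft = rng.random((M_ft, N))

# Excitation and filter spectral powers (three-level decompositions)
V_ex = W_ex @ U_ex @ G_ex @ H_ex    # F x N
V_ft = W_ft @ U_ft @ G_ft @ H_ft    # F x N

# Excitation-filter spectral power of source j (entrywise product)
V_j = V_ex * V_ft                   # F x N
```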
2) Constraints: Given the above fixed model structure, prior information about each source can now be exploited by specifying deterministic or probabilistic constraints over each parameter subset of Table II. Examples of such constraints are given in Table III. Each parameter subset can be fixed² (i.e., unchanged during estimation), adaptive (i.e., fully fitted to the mixture) or partially adaptive (only some parameters within the subset are adaptive). In the latter two cases, a probabilistic prior, such as a continuity prior [9] or a sparsity-inducing prior [4], can be specified over the parameters. The mixing parameters A_j can be time-varying or time-invariant (only the latter case is considered in Table III), and frequency-dependent for convolutive mixtures or frequency-independent for instantaneous mixtures. The mixing parameters A_j can be given a probabilistic prior as well; e.g., a Gaussian prior with the mean corresponding to the parameters of a presumed direction and with the covariance matrix representing the uncertainty about that direction.
² The fixed parameters can be either set manually or learned beforehand from some training data. Learning is equivalent to model parameter estimation over the training data and can thus be achieved using our framework.
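As an illustration of what such a per-source constraint specification might look like in practice, here is a hypothetical sketch (ours; field names and values are invented for illustration, and the toolbox's actual interface is documented in the FASST user guide [25]):

```python
import numpy as np

rng = np.random.default_rng(0)
F, L_ex = 513, 30
# Hypothetical stand-in for precomputed harmonic narrowband patterns.
harmonic_patterns = rng.random((F, L_ex))

# Hypothetical constraint specification for one source: each parameter
# subset is declared fixed (set beforehand) or adaptive (fitted to the
# mixture), optionally with a probabilistic prior.
source_j = {
    "A":    {"mixing": "convolutive", "time_invariant": True,
             "adaptability": "adaptive"},
    "W_ex": {"value": harmonic_patterns, "adaptability": "fixed"},
    "U_ex": {"adaptability": "adaptive"},
    "G_ex": {"adaptability": "adaptive", "prior": "continuity"},  # cf. [9]
    "H_ex": {"adaptability": "fixed"},  # e.g., predefined note-shaped time patterns
}
```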
