
HAL Id: hal-00626962
https://hal.archives-ouvertes.fr/hal-00626962v2
Submitted on 22 Jun 2012 (v2), last revised 5 Jan 2016 (v4)
A General Flexible Framework for the Handling of Prior Information in Audio Source Separation
Alexey Ozerov, Emmanuel Vincent, Frédéric Bimbot

To cite this version:
Alexey Ozerov, Emmanuel Vincent, Frédéric Bimbot. A General Flexible Framework for the Handling of Prior Information in Audio Source Separation. IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2012, 20 (4), pp. 1118-1133. hal-00626962v2

A General Flexible Framework for the Handling of Prior Information in Audio Source Separation
Alexey Ozerov, Member, IEEE, Emmanuel Vincent, Senior Member, IEEE, and Frédéric Bimbot
Abstract—Most audio source separation methods are developed for a particular scenario characterized by the number of sources and channels and the characteristics of the sources and the mixing process. In this paper we introduce a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints. While this framework generalizes several existing audio source separation methods, it also makes it possible to imagine and implement new efficient methods that have not yet been reported in the literature. We first introduce the framework by describing the model structure and constraints, explaining its generality, and summarizing its algorithmic implementation using a generalized expectation-maximization algorithm. Finally, we illustrate the above-mentioned capabilities of the framework by applying it in several new and existing configurations to different source separation problems. We have released a software tool named Flexible Audio Source Separation Toolbox (FASST) implementing a baseline version of the framework in Matlab.
Index Terms—Audio source separation, local Gaussian model, nonnegative matrix factorization, expectation-maximization
I. INTRODUCTION
Separating audio sources from multichannel mixtures is still challenging in most situations. The main difficulty is that audio source separation problems are usually mathematically ill-posed: to succeed, one needs to incorporate additional knowledge about the mixing process and/or the source signals. Thus, efficient source separation methods are usually developed for a particular source separation problem characterized by a certain problem dimensionality, e.g., determined or underdetermined, certain mixing process characteristics, e.g., instantaneous or convolutive, and certain source characteristics, e.g., speech, singing voice, drums, bass or noise [1]. For example, a source separation problem may be formulated as follows:

"Separate bass, drums, melody and the remaining instruments from a stereo professionally produced music recording."

Given a source separation problem, one typically must introduce as much knowledge about this problem as possible into the corresponding separation method so as to achieve good separation performance. However, there is often no common formulation describing the methods applied to different problems, and this makes it difficult to reuse a method for a problem it was not originally conceived for. Thus, given a new source separation problem, the common approach consists in (i) model design, taking the problem formulation into account, (ii) algorithm design and (iii) implementation (see Fig. 1, top).

A. Ozerov and E. Vincent are with INRIA, Rennes - Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes cedex, France (e-mails: alexey.ozerov@inria.fr, emmanuel.vincent@inria.fr). F. Bimbot is with IRISA, CNRS - UMR 6074, Campus de Beaulieu, 35042 Rennes cedex, France (e-mail: frederic.bimbot@irisa.fr). This work was partly supported by OSEO, the French State agency for innovation, under the Quaero program, and by the French Ministry of Foreign and European Affairs, the French Ministry of Higher Education and Research and the German Academic Exchange Service under project Procope 20142UD.
[Fig. 1. Current way of addressing a new source separation problem (top) and the way of addressing it using the proposed flexible framework (bottom). Top path: source separation problem → model design → algorithm design → algorithm implementation → source separation. Bottom path: source separation problem → specification of constraints from a library → source separation.]
The motivation of this work is to improve over this time-consuming process by designing a general audio source separation framework that can be applied to virtually any source separation problem by simply selecting, from a library of constraints, suitable constraints accounting for the available information about each source (see Fig. 1, bottom). More precisely, we wish such a framework to be
- general, i.e., generalizing existing methods and making it possible to combine them,
- flexible, allowing easy incorporation of a priori knowledge about the particular problem considered.

To achieve the property of generality, we need a common formulation for the methods we would like to generalize. Many recently proposed methods for audio source separation and/or characterization [2]–[19] (see also [1] and references therein) are based on the same so-called local Gaussian model describing both the properties of the sources and of the mixing process. Thus, we chose this model as the core of our framework. To achieve flexibility, we fix the global structure of the Gaussian covariances and, by means of a parametric model, allow the introduction of knowledge about each individual source and its mixing characteristics via constraints on individual parameter subsets. The global

structure we consider corresponds to a generative model of the data that is motivated by the physics of the modeled processes, e.g., the source-filter model to represent a sound source and an approximation of the convolutive filter to represent its mixing characteristics. In summary, our framework generalizes the methods from [2]–[19] and, thanks to its flexibility, becomes applicable in many other scenarios one can imagine.

We implement our framework using a generalized expectation-maximization (GEM) algorithm [20], where the M-step is solved by alternating optimization of different parameter subsets, taking the corresponding constraints into account and using multiplicative update (MU) rules inspired by the nonnegative matrix factorization (NMF) methodology (see, e.g., [9]) to update the nonnegative spectral parameters. Such an implementation is possible thanks to the Gaussianity assumption, which leads to closed-form update equations. The idea of combining a GEM algorithm with MU rules was already reported in [21] in the case of plain NMF spectral models and rank-1 spatial models, and we extend it here to the newly proposed structures. Our algorithmic contribution consists of (i) identifying the GEM-MU approach as suitable, thanks to its implementability within the configurable framework, the simplicity of the update rules, the implicit enforcement of nonnegativity constraints and its good convergence speed; and (ii) deriving the update rules for the new model structures.
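To give a feel for the kind of MU rules referred to here, the following minimal NumPy sketch (ours) implements plain NMF under the Itakura-Saito divergence, in the spirit of [9]; it covers only the unstructured single-channel case, not the framework's full structured updates.

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative-update NMF under the Itakura-Saito divergence.

    A minimal sketch in the spirit of [9]: V is a nonnegative F x N power
    spectrogram, K the number of components. Not the framework's
    structured update rules.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps   # spectral patterns
    H = rng.random((K, N)) + eps   # time activations
    for _ in range(n_iter):
        Vh = W @ H
        # IS-divergence MU rule for H: ratio of the negative to the
        # positive part of the gradient (keeps H nonnegative)
        H *= (W.T @ (V / (Vh ** 2))) / (W.T @ (1.0 / Vh))
        Vh = W @ H
        # IS-divergence MU rule for W
        W *= ((V / (Vh ** 2)) @ H.T) / ((1.0 / Vh) @ H.T)
    return W, H
```

For instance, W, H = is_nmf(np.abs(X)**2, K=8) factorizes a power spectrogram into 8 spectral patterns and their activations; nonnegativity is preserved automatically by the multiplicative form, which is one of the properties that makes MU rules attractive within the GEM M-step.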
Our approach is in line with the library of components by Cardoso et al. [22] developed for the separation of components in astrophysical images. However, we consider advanced audio-specific structures inspired by [1], [23] for the source spectral power, as opposed to the unique block structure in [22] based on the assumption that source power is constant in some pre-defined region of time and space. In that sense, our framework is more flexible than [22]. Besides the framework itself, we propose a new structure for NMF-like decompositions of source power spectrograms, where the temporal envelope associated with each spectral pattern is represented as a nonnegative linear combination of time-localized temporal patterns. This structure can be used to ensure temporal continuity, but also to model more complex temporal characteristics, such as the attack or decay parts of a note. In line with time-localized patterns, we include in our framework the so-called narrowband spectral patterns that allow constraining spectral patterns to be harmonic, inharmonic or noise-like. These structures were already reported in [14], [15], but only in the case of harmonic constraints. Moreover, they had not been applied to source separation so far. As compared to [24], where some preliminary aspects of this work were presented, we here present the framework in detail, describe its implementation, and extend the experimental part illustrating the framework. Moreover, we propose an original mixing model formulation that allows the representation and estimation of rank-1 [5] and full-rank [19] (actually any rank) spatial mixing models in a homogeneous way, thus enabling the combination of both models within a given mixture. Finally, we provide a proper probabilistic formulation of local Gaussian modeling for quadratic time-frequency representations [18] that supports and justifies the formulation given in [18].
We have also implemented and released a baseline version of the framework in Matlab. The corresponding software tool, named Flexible Audio Source Separation Toolbox (FASST), is available at [25] together with a user guide, examples of usage (where the constraints are specified) and the corresponding audio examples. Given a source separation problem, one can choose one or a few suitable constraint combinations based on his/her expertise and a priori knowledge, and then test all of them using FASST so as to select the best one.

In summary, the main contributions of this work include
- a general modeling structure,
- a general estimation algorithm,
- new spectral and temporal structures (time-localized patterns, narrowband spectral patterns),
- the implementation and distribution of a baseline version of the framework (the FASST toolbox [25]).
The rest of this paper is organized as follows. In Section II, existing approaches generalized by the proposed framework are discussed and an overview of the framework is given. Sections III and IV provide a detailed description of the framework and its algorithmic implementation. Thus, Section II is devoted to a reader interested in understanding the main principles of the framework and the physical meaning of the objects, and Sections III and IV to one willing to go deeper into the technical details. The results of a few source separation experiments are given in Section V to illustrate the flexibility of our framework and its potential performance improvement compared to individual approaches. Conclusions are drawn in Section VI.
II. RELATED EXISTING APPROACHES AND FRAMEWORK OVERVIEW

Source separation methods based on the local Gaussian model can be characterized by the following assumptions [1], [2], [5], [13], [19], formalized in equation form below:
1) Gaussianity: in some time-frequency (TF) representation, the sources are modeled in each TF bin by zero-mean Gaussian random variables.
2) Independence: conditionally to their covariance matrices, these random variables are independent over time, frequency and between sources.
3) Factorization of spectral and spatial characteristics: for each TF bin, the covariance matrix of each source is expressed as the product of a spatial covariance matrix representing its spatial characteristics and a scalar spectral power representing its spectral characteristics.
4) Linearity of mixing: the mixing process translates into addition in the covariance domain.
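In equation form, these four assumptions can be restated compactly as follows (the symbols are ours, chosen to match common local Gaussian modeling notation: x_fn denotes the mixture coefficient vector and y_j,fn the spatial image of source j in TF bin (f, n)):

```latex
% Zero-mean Gaussian sources, independent across TF bins and sources
% (assumptions 1-2)
\mathbf{x}_{fn} = \sum_{j=1}^{J} \mathbf{y}_{j,fn}, \qquad
\mathbf{y}_{j,fn} \sim \mathcal{N}_c\big(\mathbf{0},\, \boldsymbol{\Sigma}_{j,fn}\big)

% Factorization of spectral and spatial characteristics (assumption 3)
% and additivity of covariances under linear mixing (assumption 4)
\boldsymbol{\Sigma}_{j,fn} = v_{j,fn}\, \mathbf{R}_{j,fn}, \qquad
\boldsymbol{\Sigma}_{\mathbf{x},fn} = \sum_{j=1}^{J} v_{j,fn}\, \mathbf{R}_{j,fn}
```

where v_{j,fn} ≥ 0 is the scalar spectral power and R_{j,fn} the spatial covariance matrix of source j in bin (f, n).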
A. State-of-the-art approaches based on the local Gaussian model

The state-of-the-art approaches [2]–[19] cover a wide range of source separation problems and models expressed via particular structures of local Gaussian covariances, including:
1) Problem dimensionality: Denoting by I and J, respectively, the number of channels of the observed mixture and the number of sources to separate, the single-channel (I = 1) case is addressed in [6], and underdetermined (1 < I < J) and (over-)determined (I ≥ J) cases are addressed in [5] and [2], respectively.
2) Spatial covariance model: Instantaneous and convolutive mixtures of point sources are modeled by rank-1 spatial covariance matrices in [5] and [3], respectively. In [19], reverberant convolutive mixtures of point sources are modeled by full-rank spatial covariance matrices that, in contrast to rank-1 covariance matrices, can account for the spatial spread of each source induced by the reverberation.
3) Spectral power model: Several models were proposed for the spectral power, e.g., unconstrained models [10], block constant models [5], Gaussian mixture models (GMM) or hidden Markov models (HMM) [2], Gaussian scaled mixture models (GSMM) or scaled HMMs (S-HMM) [13], NMF [4] together with its variants, harmonic NMF [14] or temporal activation constrained NMF [9], and source-filter models [16]. These models are suitable for the representation of different types of sources; for example, a GSMM is rather suitable for a monophonic source, e.g., speech, while NMF suits a polyphonic one, e.g., a polyphonic musical instrument [13].
4) Input representation: While most of the considered methods use the short-time Fourier transform (STFT) as the input TF representation, some of them, e.g., [14], [15], [18], use the auditory-motivated equivalent rectangular bandwidth (ERB) quadratic representation. More generally, we consider here both linear representations, where the signal is represented by a vector of complex-valued coefficients in each TF bin, and quadratic representations, where the signal is represented via its local covariance matrix in each TF bin [26].
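As a rough illustration of this distinction (our notation; the precise construction for a given quadratic transform is detailed in [26]), a linear representation keeps the complex coefficient vector of each bin, while a quadratic one keeps a local covariance, e.g., an empirical average of outer products over a TF neighborhood B_fn:

```latex
% Linear representation: one complex coefficient vector per TF bin
\text{linear:}\quad \mathbf{x}_{fn} \in \mathbb{C}^{I}

% Quadratic representation: a local covariance matrix per TF bin,
% e.g., averaged over a TF neighborhood B_fn (illustrative construction)
\text{quadratic:}\quad
\widehat{\mathbf{R}}_{\mathbf{x},fn}
  = \frac{1}{|\mathcal{B}_{fn}|} \sum_{(f',n') \in \mathcal{B}_{fn}}
    \mathbf{x}_{f'n'}\, \mathbf{x}_{f'n'}^{\mathsf{H}}
  \;\in\; \mathbb{C}^{I \times I}
```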
Table I provides an overview of some of the local Gaussian model-based approaches considered here, where the specificities of each method are marked by crosses (×). We see from Table I that a few of these methods have already been combined; for example, GSMM and NMF were combined in [8], and NMF [9] was combined with rank-1 and full-rank mixing models in [13] and [17], respectively. However, many combinations have not yet been investigated. Indeed, assuming that each source follows one of the 3 spatial covariance models and one of the 8 spectral variance models from Table I, the total number of configurations equals 2 × 24^J for J sources, e.g., 1152 configurations for J = 2 (in fact many more, since each source can follow several spectral variance models at the same time), while Table I reports only 16 existing configurations.
B. Other related state-of-the-art approaches

While the local Gaussian model-based framework offers maximal flexibility, there exist some methods that do not satisfy (fully or partially) the aforementioned assumptions and are thus not strictly covered by the framework. Nevertheless, our framework allows the implementation of similar structures. Let us give some examples. Binary masking-based source estimation [27], [28] does not satisfy the source independence assumption. However, it is known to perform poorly compared to local Gaussian model-based separation, as was shown in [13], [18] for convolutive mixtures¹ and demonstrated through the signal separation evaluation campaigns SiSEC 2008 [30] and SiSEC 2010 [29], where for instantaneous mixtures local Gaussian model-based approaches gave better results than the oracle (using the ground truth) binary masks. The methods proposed in [31], [32] are also based on Gaussian models, albeit in the time domain. Notably, time sample-based GMMs and time-varying autoregressive models are considered as source models in [31] and [32], respectively. However, the number of existing time-domain structures is fairly limited. Our TF domain models make it possible to account for these structures by means of suitable constraints over the spectral power, while allowing their combination with more advanced structures. There are also many works on NMF and its extensions [33]–[38] and on GMMs / HMMs [39], [40] based on nongaussian models of the complex-valued STFT coefficients. These models are essentially covered by our framework in the sense that we can implement similar or equivalent model structures, albeit under Gaussian assumptions. The benefit of local Gaussian modeling is that it naturally leads to closed-form expressions in the multichannel case and allows the modeling of diffuse sources [19], contrary to the models in [33]–[40]. Finally, according to Cardoso [41], nongaussianity and nonstationarity are alternative routes to source separation, such that nonstationary nongaussian models would offer little benefit compared to nonstationary Gaussian models in terms of separation performance, despite a considerably greater computation cost.
C. Framework overview

We now present an overview of the proposed framework, focusing on the most important concepts. An exhaustive description is given in Sections III and IV.

The framework is based on a flexible model described by parameters θ = {θ_j}_{j=1}^{J}, where θ_j denotes the parameters of the j-th source (j = 1, . . . , J). Each θ_j is split in turn into nine parameter subsets according to a fixed structure, as described below and summarized in Table II.
1) Model structure: The parameters of the j-th source include a complex-valued tensor A_j modeling its spatial covariance, and eight nonnegative matrices (θ_{j,2}, . . . , θ_{j,9}) modeling its spectral power over all TF bins.

The spectral power, denoted V_j, is assumed to be the product of an excitation spectral power V_j^ex, representing, e.g., the excitation of the glottal source for voice or the plucking of a guitar string, and a filter spectral power V_j^ft, representing, e.g., the vocal tract or the impedance of the guitar body [23], [35]. While such a model is usually called a source-filter model, we call it here the excitation-filter model in order to avoid possible confusion with the "sources" to be separated.
¹ Binary masking-based approaches can still be quite powerful for convolutive mixtures, as demonstrated in [29]. Thus, a good way to proceed is probably to use them to initialize local Gaussian model-based approaches, as is done in [13] and as we do in the experimental part.

TABLE I
SOME STATE-OF-THE-ART LOCAL GAUSSIAN MODEL-BASED APPROACHES FOR AUDIO SOURCE SEPARATION.
[Table: for each of the references [7], [6], [8], [16], [4], [14], [15], [9], [5], [11], [13], [19], [18], [17], [3], [2], crosses mark the problem dimensionality addressed (single-channel, underdetermined, (over-)determined), the spatial covariance model (rank-1 instantaneous, rank-1 convolutive, full-rank), the spectral variance model (unconstrained, block constant, GMM/HMM, GSMM/S-HMM, NMF, harmonic NMF, temporally constrained NMF, source-filter) and the input representation (linear, quadratic).]
The excitation spectral power V_j^ex is further decomposed as the sum of characteristic spectral patterns E_j^ex modulated by time activation coefficients P_j^ex [4], [9]. Each characteristic spectral pattern may be associated, for instance, with one specific pitch, so that the time activation coefficients denote which pitches are active on each time frame. In order to further constrain the fine structure of the spectral patterns, they are represented as linear combinations of narrowband spectral patterns W_j^ex [14] with weights U_j^ex. These narrowband patterns may be, for instance, harmonic, inharmonic or noise-like, and the weights determine the overall spectral envelope. Following the same idea, we propose here to represent the series of time activation coefficients P_j^ex as sums of time-localized patterns H_j^ex with weights G_j^ex. The time-localized patterns may represent the typical temporal shape of the notes, while the weights encode their onset times. Different temporal fine structures, such as continuity or specific rhythm patterns, may also be accounted for in this way. Note that temporal models of the activation coefficients have been proposed in the state of the art, using probabilistic priors [9], [34], note-specific Gaussian-shaped time-localized patterns [42], or unstructured TF patterns [33]. Our proposition is complementary to [9], [34] in that it accounts for temporal behaviour in the model structure itself, in addition to possible priors on the model parameters. Moreover, it is more flexible than [9], [34], [42], since it allows the modeling of characteristics other than continuity or sparsity. Finally, while it can model TF patterns similar to [33], it involves much fewer parameters, which typically leads to more robust parameter estimation.
The filter spectral power V_j^ft is similarly expressed in terms of characteristic spectral patterns E_j^ft modulated by time activation coefficients [16], which are in turn decomposed into narrowband spectral patterns W_j^ft with weights U_j^ft and time-localized patterns H_j^ft with weights G_j^ft, respectively. In the case of speech or singing voice, each characteristic spectral pattern may represent the spectral formants of a given phoneme, while the plosiveness and the sequence of pronounced phonemes may be encoded by the time-localized patterns and the associated weights.
In summary, as will be explained in detail in Section III-E, the spectral power of each source obeys a three-level hierarchical nonnegative matrix decomposition structure (see equations (9), (10), (12), (13) and Figures 3 and 4 below), including at the bottom level the eight parameter subsets W_j^ex, U_j^ex, G_j^ex, H_j^ex, W_j^ft, U_j^ft, G_j^ft and H_j^ft (see Eq. (13)).
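Written out (our rendering, consistent with the matrix sizes in Table II; the framework's own equations (9)–(13) appear in Section III-E), this structure reads:

```latex
% Excitation-filter spectral power: entrywise product of two
% three-level nonnegative matrix decompositions
\mathbf{V}_j = \mathbf{V}_j^{\mathrm{ex}} \odot \mathbf{V}_j^{\mathrm{ft}}, \qquad
\mathbf{V}_j^{\mathrm{ex}} = \mathbf{W}_j^{\mathrm{ex}}\, \mathbf{U}_j^{\mathrm{ex}}\, \mathbf{G}_j^{\mathrm{ex}}\, \mathbf{H}_j^{\mathrm{ex}}, \qquad
\mathbf{V}_j^{\mathrm{ft}} = \mathbf{W}_j^{\mathrm{ft}}\, \mathbf{U}_j^{\mathrm{ft}}\, \mathbf{G}_j^{\mathrm{ft}}\, \mathbf{H}_j^{\mathrm{ft}}
```

where ⊙ denotes entrywise multiplication; each chain of F × L, L × K, K × M and M × N nonnegative factors yields an F × N power spectrogram.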
Parameter subset                                              Size               Range
θ_{j,1} = A_j       mixing parameters                         I × R_j × F × N    C
θ_{j,2} = W_j^ex    ex. narrowband spectral patterns          F × L_j^ex         R_+
θ_{j,3} = U_j^ex    ex. spectral pattern weights              L_j^ex × K_j^ex    R_+
θ_{j,4} = G_j^ex    ex. time pattern weights                  K_j^ex × M_j^ex    R_+
θ_{j,5} = H_j^ex    ex. time-localized patterns               M_j^ex × N         R_+
θ_{j,6} = W_j^ft    ft. narrowband spectral patterns          F × L_j^ft         R_+
θ_{j,7} = U_j^ft    ft. spectral pattern weights              L_j^ft × K_j^ft    R_+
θ_{j,8} = G_j^ft    ft. time pattern weights                  K_j^ft × M_j^ft    R_+
θ_{j,9} = H_j^ft    ft. time-localized patterns               M_j^ft × N         R_+

TABLE II
PARAMETER SUBSETS θ_{j,k} (j = 1, . . . , J, k = 1, . . . , 9) ENCODING THE STRUCTURE OF EACH SOURCE.
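To make the bookkeeping concrete, here is a minimal NumPy sketch (ours, with arbitrary illustrative dimensions; not FASST code) that instantiates the nonnegative subsets θ_{j,2}, . . . , θ_{j,9} with the sizes of Table II and composes the excitation-filter spectral power:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not prescribed by the framework):
F, N = 513, 400                     # frequency bins x time frames
L_ex, K_ex, M_ex = 30, 20, 50       # excitation-part inner dimensions
L_ft, K_ft, M_ft = 10, 8, 25        # filter-part inner dimensions

# Nonnegative parameter subsets theta_{j,2..9}, sized as in Table II
W_ex = rng.random((F, L_ex));    U_ex = rng.random((L_ex, K_ex))
G_ex = rng.random((K_ex, M_ex)); H_ex = rng.random((M_ex, N))
W_ft = rng.random((F, L_ft));    U_ft = rng.random((L_ft, K_ft))
G_ft = rng.random((K_ft, M_ft)); H_ft = rng.random((M_ft, N))

# Excitation and filter spectral powers (three-level decompositions)
V_ex = W_ex @ U_ex @ G_ex @ H_ex    # F x N
V_ft = W_ft @ U_ft @ G_ft @ H_ft    # F x N

# Excitation-filter spectral power of source j (entrywise product)
V_j = V_ex * V_ft                   # F x N
```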
2) Constraints: Given the above fixed model structure, prior information about each source can now be exploited by specifying deterministic or probabilistic constraints over each parameter subset of Table II. Examples of such constraints are given in Table III. Each parameter subset can be fixed² (i.e., unchanged during estimation), adaptive (i.e., fully fitted to the mixture) or partially adaptive (only some parameters within the subset are adaptive). In the latter two cases, a probabilistic prior, such as a continuity prior [9] or a sparsity-inducing prior [4], can be specified over the parameters. The mixing parameters A_j can be time-varying or time-invariant (only the latter case is considered in Table III), and frequency-dependent for convolutive mixtures or frequency-independent for instantaneous mixtures. The mixing parameters A_j can be given a probabilistic prior as well; e.g., a Gaussian prior with the mean corresponding to the parameters of a presumed direction and with the covariance matrix representing the uncertainty about that direction.
² The fixed parameters can be either set manually or learned beforehand from some training data. Learning is equivalent to model parameter estimation over the training data and can thus be achieved using our framework.
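As an illustration of what such a per-source constraint specification might look like in practice, here is a hypothetical sketch (ours; field names and values are invented for illustration, and the toolbox's actual interface is documented in the FASST user guide [25]):

```python
import numpy as np

rng = np.random.default_rng(0)
F, L_ex = 513, 30
# Hypothetical stand-in for precomputed harmonic narrowband patterns.
harmonic_patterns = rng.random((F, L_ex))

# Hypothetical constraint specification for one source: each parameter
# subset is declared fixed (set beforehand) or adaptive (fitted to the
# mixture), optionally with a probabilistic prior.
source_j = {
    "A":    {"mixing": "convolutive", "time_invariant": True,
             "adaptability": "adaptive"},
    "W_ex": {"value": harmonic_patterns, "adaptability": "fixed"},
    "U_ex": {"adaptability": "adaptive"},
    "G_ex": {"adaptability": "adaptive", "prior": "continuity"},  # cf. [9]
    "H_ex": {"adaptability": "fixed"},  # e.g., predefined note-shaped time patterns
}
```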
