IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING
Supervised and Unsupervised Speech Enhancement
Using Nonnegative Matrix Factorization
Nasser Mohammadiha*, Student Member, IEEE, Paris Smaragdis, Member, IEEE, Arne Leijon, Member, IEEE
Abstract—Reducing the interference noise in a monaural noisy speech signal has been a challenging task for many years. Compared to traditional unsupervised speech enhancement methods, e.g., Wiener filtering, supervised approaches, such as algorithms based on hidden Markov models (HMM), lead to higher-quality enhanced speech signals. However, the main practical difficulty of these approaches is that for each noise type a model has to be trained a priori. In this paper, we investigate a new class of supervised speech denoising algorithms using nonnegative matrix factorization (NMF). We propose a novel speech enhancement method that is based on a Bayesian formulation of NMF (BNMF). To circumvent the mismatch problem between the training and testing stages, we propose two solutions. First, we use an HMM in combination with BNMF (BNMF-HMM) to derive a minimum mean square error (MMSE) estimator for the speech signal with no information about the underlying noise type. Second, we suggest a scheme to learn the required noise BNMF model online, which is then used to develop an unsupervised speech enhancement system. Extensive experiments are carried out to investigate the performance of the proposed methods under different conditions. Moreover, we compare the performance of the developed algorithms with state-of-the-art speech enhancement schemes using various objective measures. Our simulations show that the proposed BNMF-based methods outperform the competing algorithms substantially.
Index Terms—Nonnegative matrix factorization (NMF), speech enhancement, PLCA, HMM, Bayesian inference.
I. INTRODUCTION
Estimating the clean speech signal in a single-channel
recording of a noisy speech signal has been a research topic
for a long time and is of interest for various applications
including hearing aids, speech/speaker recognition, and speech
communication over telephone and internet. A major outcome
of these techniques is the improved quality and reduced
listening effort in the presence of an interfering noise signal.
In general, speech enhancement methods can be catego-
rized into two broad classes: unsupervised and supervised.
Unsupervised methods include a wide range of approaches
such as spectral subtraction [1], Wiener and Kalman filtering,
e.g., [2], [3], short-time spectral amplitude (STSA) estimators
[4], estimators based on super-Gaussian prior distributions
for speech DFT coefficients [5]–[8], and schemes based on
Copyright (c) 2013 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
N. Mohammadiha and A. Leijon are with the Department of Electrical
Engineering, KTH Royal Institute of Technology, SE-100 44 Stockholm,
Sweden (e-mail: nmoh@kth.se; leijon@kth.se).
P. Smaragdis is with the Department of Computer Science and Department
of Electrical and Computer Engineering, University of Illinois at Urbana-
Champaign, IL, USA (e-mail: paris@illinois.edu).
periodic models of the speech signal [9]. In these methods, a
statistical model is assumed for the speech and noise signals,
and the clean speech is estimated from the noisy observation
without any prior information on the noise type or speaker
identity. However, the main difficulty of most of these methods
is estimation of the noise power spectral density (PSD) [10]–
[12], which is a challenging task if the background noise is
non-stationary.
For the supervised methods, a model is considered for both
the speech and noise signals and the model parameters are
estimated using the training samples of that signal. Then, an
interaction model is defined by combining speech and noise
models and the noise reduction task is carried out. Some
examples of this class of algorithms include the codebook-
based approaches, e.g., [13], [14] and hidden Markov model
(HMM) based methods [15]–[19]. One advantage of these
methods is that there is no need to estimate the noise PSD
using a separate algorithm.
The supervised approaches have been shown to produce
better quality enhanced speech signals compared to the unsu-
pervised methods [14], [16], which can be expected as more
prior information is fed to the system in these cases and
the considered models are trained for each specific type of
signals. The required prior information on noise type (and
speaker identity in some cases) can be given by the user, or
can be obtained using a built-in classification scheme [14],
[16], or can be provided by a separate acoustic environment
classification algorithm [20]. The primary goal of this work is
to propose supervised and unsupervised speech enhancement
algorithms based on nonnegative matrix factorization (NMF)
[21], [22].
NMF is a technique to approximate a nonnegative matrix y by the product of two nonnegative matrices, y ≈ bv, where the columns of b are a set of basis vectors and v holds the combination weights. In speech processing, y is usually the magnitude spectrogram of the speech signal with spectral vectors stored by column, b is the basis matrix or dictionary, and v is referred to as the NMF coefficient or activation matrix. NMF has been widely
used as a source separation technique applied to monaural
mixtures, e.g., [23]–[25]. More recently, NMF has also been
used to estimate the clean speech from a noisy observation
[26]–[31].
When applied to speech source separation, good separation can be expected only when speaker-dependent basis matrices are learned. In contrast, for noise reduction, good enhancement can be achieved even if a general speaker-independent basis matrix of speech is learned [29], [31]. Nevertheless, there
might be some scenarios (such as speech degraded with

multitalker babble noise) for which the basis matrices of
speech and noise are quite similar. In these cases, although
the traditional NMF-based approaches can be used to get state-
of-the-art performance, other constraints can be imposed into
NMF to obtain a better noise reduction. For instance, assuming
that the babble waveform is obtained as a sum of different
speech signals, a nonnegative hidden Markov model is pro-
posed in [26] to model the babble noise in which the babble
basis is identical to the speech basis. Another fundamental
issue in basic NMF is that it ignores the important temporal
dependencies of the audio signals. Different approaches have
been proposed in the literature to employ temporal dynamics
in NMF, e.g., [23]–[25], [27], [30], [31].
In this paper, we first propose a new supervised NMF-
based speech enhancement system. In the proposed method,
the temporal dependencies of speech and noise signals are used
to construct informative prior distributions that are applied in
a Bayesian framework to perform NMF (BNMF). We then
develop an HMM structure with output density functions given
by BNMF to simultaneously classify the environmental noise
and enhance the noisy signal. Therefore, the noise type does not need to be specified a priori. Here, the classification is done using the noisy input and is not restricted to the speech pauses only, as it is in [16]; nor does it require an additional noise PSD tracking algorithm, as is required in [14].
Moreover, we propose an unsupervised NMF-based ap-
proach in which the noise basis matrix is learned online from
the noisy mixture. Although online dictionary learning from
clean data has been addressed in some prior works, e.g., [32],
[33], our causal method learns the noise basis matrix from
the noisy mixture. The main contributions of this work can be
summarized as:
1) We present a review of state-of-the-art NMF-based noise
reduction approaches.
2) We propose a speech enhancement method based on
BNMF that inherently captures the temporal dependen-
cies in the form of hierarchical prior distributions. Some
preliminary results of this approach have been presented in [31]. Here, we further develop the method and evaluate its performance comprehensively. In particular, we
present an approach to construct SNR-dependent prior
distributions.
3) An environmental noise classification technique is sug-
gested and is combined with the above BNMF approach
(BNMF-HMM) to develop an unsupervised speech en-
hancement system.
4) A causal online dictionary learning scheme is proposed
that learns the noise basis matrix from the noisy obser-
vation. Our simulations show that the final unsupervised
noise reduction system outperforms state-of-the-art ap-
proaches significantly.
The rest of the paper is organized as follows: The review of the NMF-based speech enhancement algorithms is presented in Section II. In Section III, we describe our main contributions, namely the BNMF-based noise reduction, BNMF-HMM structure, and online noise dictionary learning. Section IV presents our experiments and results with supervised and unsupervised noise reduction systems. Finally, Section V concludes the study.

TABLE I
SUMMARY OF THE NOTATION THAT IS CONSISTENTLY USED IN THE PAPER.

k                  frequency index
t                  time index
X                  a scalar random variable
Y = [Y_kt]         a matrix of random variables
Y_t                t-th column of Y
y = [y_kt]         a matrix of observed magnitude spectrogram values
y_t                t-th column of y
b^(s)              speech parameters (b^(s) is the speech basis matrix)
b^(n)              noise parameters (b^(n) is the noise basis matrix)
b = [b^(s) b^(n)]  mixture parameters (b is the mixture basis matrix)
II. REVIEW OF STATE-OF-THE-ART NMF-BASED SPEECH
ENHANCEMENT
In this section, we first explain a basic NMF approach, and then we review NMF-based speech enhancement. Let us represent the random variables associated with the magnitudes of the discrete Fourier transform (DFT) coefficients of the speech, noise, and noisy signals as S = [S_kt], N = [N_kt], and Y = [Y_kt], respectively, where k and t denote the frequency and time indices, respectively. The actual realizations are denoted by lowercase letters, e.g., y = [y_kt]. Table I summarizes some of the notation that is frequently used in the paper.
To obtain a nonnegative decomposition of a given matrix y, a cost function is usually defined and minimized. Let us denote the basis matrix and the NMF coefficient matrix by b and v, respectively. Nonnegative factorization is achieved by solving the following optimization problem:

(b, v) = arg min_{b,v} D(y ‖ bv) + µ h(b, v),   (1)

where D(y ‖ ŷ) is a cost function, h(·) is an optional regularization term, and µ is the regularization weight. The minimization in (1) is performed under the nonnegativity constraint of b and v. Common choices for the cost function include the Euclidean distance [21], the generalized Kullback-Leibler divergence [21], [34], the Itakura-Saito divergence [25], and the negative likelihood of data in the probabilistic NMFs [35]. Depending on the application, sparsity of the activations v and temporal dependencies of the input data y are two popular motivations for designing the regularization function, e.g., [24], [27], [36], [37]. Since (1) is not a convex problem, iterative gradient descent or expectation-maximization (EM) algorithms are usually used to obtain a locally optimal solution for the problem [21], [25], [35].
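For the generalized Kullback-Leibler divergence with no regularization (µ = 0), the locally optimal solution of (1) is typically found with the well-known multiplicative update rules of [21]. A minimal NumPy sketch follows; the matrix sizes, random initialization, and iteration count are our own illustrative choices, not values from the paper:

```python
import numpy as np

def kl_nmf(y, n_basis, n_iter=200, eps=1e-12):
    """Multiplicative-update NMF minimizing the generalized KL divergence
    D(y || b v).  y is a nonnegative (K x T) magnitude spectrogram."""
    K, T = y.shape
    rng = np.random.default_rng(0)
    b = rng.random((K, n_basis)) + eps        # basis matrix (dictionary)
    v = rng.random((n_basis, T)) + eps        # NMF coefficients (activations)
    for _ in range(n_iter):
        # activation update: v <- v * (b^T (y / bv)) / (b^T 1)
        bv = b @ v + eps
        v *= (b.T @ (y / bv)) / (b.T @ np.ones_like(y) + eps)
        # basis update: b <- b * ((y / bv) v^T) / (1 v^T)
        bv = b @ v + eps
        b *= ((y / bv) @ v.T) / (np.ones_like(y) @ v.T + eps)
    return b, v
```

Each update is guaranteed not to increase the KL cost, which is why these rules are the standard workhorse despite the nonconvexity of (1).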
Let us consider a supervised denoising approach where the basis matrix of speech b^(s) and the basis matrix of noise b^(n) are learned in advance using appropriate training data. The common assumption used to model the noisy speech signal is the additivity of the speech and noise spectrograms, i.e., y = s + n. Although this assumption is not completely justified in real-world problems, the developed algorithms have been shown to produce satisfactory results, e.g., [24]. The basis matrix of

the noisy signal is obtained by concatenating the speech and noise basis matrices as b = [b^(s) b^(n)]. Given the magnitudes of the DFT coefficients of the noisy speech at time t, y_t, the problem in (1) is now solved, with b held fixed, to obtain the noisy NMF coefficients v_t. The NMF decomposition takes the form y_t ≈ bv_t = [b^(s) b^(n)][(v_t^(s))^T (v_t^(n))^T]^T, where T denotes transposition. Finally, an estimate of the clean speech DFT magnitudes is obtained by a Wiener-type filtering as:

ŝ_t = (b^(s) v_t^(s)) / (b^(s) v_t^(s) + b^(n) v_t^(n)) ⊙ y_t,   (2)

where the division is performed element-wise and ⊙ denotes element-wise multiplication. The clean speech waveform is estimated using the noisy phase and the inverse DFT. One advantage of the NMF-based approaches over the HMM-based [16], [17] or codebook-driven [14] approaches is that NMF automatically captures the long-term levels of the signals, and no additional gain modeling is necessary.
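The supervised pipeline described above (fixed concatenated dictionary, per-frame activation estimation, and the Wiener-type filter of (2)) can be sketched as follows. This is an illustrative implementation, not the paper's: it assumes KL-type multiplicative updates with b held fixed, and all dimensions and iteration counts are our own choices:

```python
import numpy as np

def enhance_frame(y_t, b_s, b_n, n_iter=200, eps=1e-12):
    """Supervised NMF denoising of one noisy magnitude spectrum y_t (K,).

    b_s, b_n: pre-trained speech / noise basis matrices (K x I_s, K x I_n),
    held fixed while the activations are estimated."""
    b = np.hstack([b_s, b_n])                 # concatenated dictionary
    rng = np.random.default_rng(0)
    v = rng.random(b.shape[1]) + eps          # activations for this frame
    for _ in range(n_iter):                   # KL updates; only v changes
        bv = b @ v + eps
        v *= (b.T @ (y_t / bv)) / (b.sum(axis=0) + eps)
    i_s = b_s.shape[1]
    s_hat = b_s @ v[:i_s]                     # speech part of the model
    n_hat = b_n @ v[i_s:]                     # noise part of the model
    gain = s_hat / (s_hat + n_hat + eps)      # Wiener-type filter of (2)
    return gain * y_t                         # enhanced magnitude spectrum
```

The returned magnitude spectrum would then be combined with the noisy phase and inverse-transformed to obtain the time-domain signal.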
Schmidt et al. [28] presented an NMF-based unsupervised
batch algorithm for noise reduction. In this approach, it is
assumed that the entire noisy signal is observed, and then the
noise basis vectors are learned during the speech pauses. In
the intervals of speech activity, the noise basis matrix is kept
fixed and the rest of the parameters (including speech basis and
speech and noise NMF coefficients) are learned by minimizing
the Euclidean distance with an additional regularization term
to impose sparsity on the NMF coefficients. The enhanced
signal is then obtained similarly to (2). The reported results
show that this method outperforms a spectral subtraction al-
gorithm, especially for highly non-stationary noises. However,
the NMF approach is sensitive to the performance of the voice
activity detector (VAD). Moreover, the proposed algorithm in
[28] is applicable only in the batch mode, which is usually
not practical in the real world.
In [27], a supervised NMF-based denoising scheme is proposed in which a heuristic regularization term is added to the cost function. By doing so, the factorization is forced to follow pre-obtained statistics. In this method, the basis matrices of speech and noise are learned offline from training data. Also, as part of the training, the mean and covariance of the log of the NMF coefficients are computed. Using these statistics, the negative likelihood of a Gaussian distribution (with the calculated mean and covariance) is used to regularize the cost function during the enhancement. The clean speech signal is then estimated as ŝ_t = b^(s) v_t^(s). Although it is not explicitly mentioned in [27], to make the regularization meaningful, the statistics of the speech and noise NMF coefficients have to be adjusted according to the long-term levels of the speech and noise signals.
In [29], the authors propose a linear minimum mean square error (MMSE) estimator for NMF-based speech enhancement. In this work, NMF is applied to y_t^p (i.e., y_t^p ≈ bv_t, where p = 1 corresponds to using magnitudes of the DFT coefficients and p = 2 corresponds to using magnitude-squared DFT coefficients) in a frame-by-frame routine. Then, a gain variable g_t is estimated to filter the noisy signal as ŝ_t = (g_t ⊙ y_t^p)^(1/p). Assuming that the basis matrices of speech and noise are obtained during the training stage, and that the NMF coefficients V_t are random variables, g_t is derived such that the mean square error between S_t^p and Ŝ_t^p is minimized. The optimal gain is shown to be:

g_t = (ξ_t + c_p² √ξ_t) / (ξ_t + 1 + 2 c_p² √ξ_t),   (3)

where c_p is a constant that depends on p [29] and ξ_t is called the smoothed speech-to-noise ratio, which is estimated using a decision-directed approach. For a theoretical comparison of (3) to a usual Wiener filter, see [29]. The conducted simulations show that the results using p = 1 are superior to those using p = 2 (which is in line with previously reported observations, e.g., [24]) and that both are better than the results of a state-of-the-art Wiener filter.
A semi-supervised approach is proposed in [30] to denoise a
noisy signal using NMF. In this method, a nonnegative hidden
Markov model (NHMM) is used to model the speech mag-
nitude spectrogram. Here, the HMM state-dependent output
density functions are assumed to be a mixture of multinomial
distributions, and thus, the model is closely related to proba-
bilistic latent component analysis (PLCA) [35]. An NHMM is
described by a set of basis matrices and a Markovian transition
matrix that captures the temporal dynamics of the underlying
data. To describe a mixture signal, the corresponding NHMMs
are then used to construct a factorial HMM. When applied for
noise reduction, first a speaker-dependent NHMM is trained
on a speech signal. Then, assuming that the whole noisy signal
is available (batch mode), the EM algorithm is run to simul-
taneously estimate a single-state NHMM for noise and also to
estimate the NMF coefficients of the speech and noise signals.
The proposed algorithm doesn’t use a VAD to update the noise
dictionary, as was done in [28]. But the algorithm requires
the entire spectrogram of the noisy signal, which makes it
difficult for practical applications. Moreover, the employed
speech model is speaker-dependent, and requires a separate
speaker identification algorithm in practice. Finally, similar to
the other approaches based on the factorial models, the method
in [30] suffers from high computational complexity.
A linear nonnegative dynamical system is presented in [38]
to model temporal dependencies in NMF. The proposed causal
filtering and fixed-lag smoothing algorithms use Kalman-
like prediction in NMF and PLCA. Compared to the ad-hoc
methods that use temporal correlations to design regularity
functions, e.g., [27], [37], this approach provides a solid framework for incorporating temporal dynamics into the system. Also, the computational complexity of this method is significantly lower than that of [30].
Raj et al. [39] proposed a phoneme-dependent approach to use NMF for speech enhancement, in which a set of basis vectors is learned for each phoneme a priori. Given the noisy recording, an iterative NMF-based speech enhancer combined with an automatic speech recognizer (ASR) is used to estimate the clean speech signal. In the experiments, a mixture of speech and music is considered, and the clean speech is estimated using a set of speaker-dependent basis matrices.
NMF-based noise PSD estimation is addressed in [37]. In this work, the speech and noise basis matrices are trained offline, after which a constrained NMF is applied to the noisy spectrogram on a frame-by-frame basis. To utilize the time dependencies of the speech and noise signals, an l2-norm regularization term is added to the cost function. This penalty term encourages consecutive speech and noise NMF coefficients to take similar values and hence models the signals' time dependencies. The instantaneous noise periodogram is obtained similarly to (2) by switching the roles of the speech and noise approximations. This estimate is then smoothed over time using exponential smoothing to get a less-fluctuating estimate of the noise PSD, which can be combined with any algorithm that needs a noise PSD, e.g., a Wiener filter.
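The exponential smoothing mentioned above is a first-order recursive average over the frame-wise estimates; a small sketch (the smoothing factor 0.9 is an assumed, typical value, not one taken from [37]):

```python
import numpy as np

def smooth_noise_psd(inst_estimates, alpha=0.9):
    """Exponentially smooth instantaneous noise periodogram estimates over
    time: psd_t = alpha * psd_{t-1} + (1 - alpha) * inst_t."""
    psd = np.asarray(inst_estimates[0], dtype=float)
    smoothed = [psd]
    for inst in inst_estimates[1:]:
        psd = alpha * psd + (1.0 - alpha) * np.asarray(inst, dtype=float)
        smoothed.append(psd)
    return smoothed
```

Larger alpha gives a steadier (but slower-reacting) noise PSD track.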
III. SPEECH ENHANCEMENT USING BAYESIAN NMF
In this section, we present our Bayesian NMF (BNMF) based speech enhancement methods. In the following, we first provide an overview of the employed BNMF, which was originally proposed in [34]. Our proposed extensions of this BNMF to modeling a noisy signal, namely BNMF-HMM and Online BNMF, are given in Subsections III-A and III-B, respectively. Subsection III-C presents a method to construct informative priors that use temporal dynamics in NMF.
The probabilistic NMF in [34] assumes that the input matrix is stochastic, and to perform NMF as y ≈ bv the following model is considered:

Y_kt = Σ_i Z_kit,   (4)

f_{Z_kit}(z_kit) = PO(z_kit; b_ki v_it) = (b_ki v_it)^{z_kit} e^{−b_ki v_it} / (z_kit!),   (5)

where Z_kit are latent variables, PO(z; λ) denotes the Poisson distribution, and z! is the factorial of z. A schematic representation of this model is shown in Fig. 1.
As a result of (4) and (5), Y_kt is assumed Poisson-distributed and integer-valued. In practice, the observed spectrogram is first scaled up and then rounded to the closest integers to avoid large quantization errors. The maximum likelihood (ML) estimate of the parameters b and v can be obtained using an EM algorithm [34], and the result is identical to the well-known multiplicative update rules for NMF using the Kullback-Leibler (KL-NMF) divergence [21].
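The scale-and-round preprocessing required by the integer-valued Poisson model can be illustrated as follows (the scale factor is an arbitrary assumed value; in practice it would be chosen relative to the range of the data):

```python
import numpy as np

def quantize_spectrogram(y, scale=1000.0):
    """Scale up a magnitude spectrogram and round to the closest integers,
    so the Poisson observation model (4)-(5) applies with small
    quantization error."""
    return np.rint(scale * y).astype(np.int64)
```

A larger scale factor makes the relative quantization error smaller at the cost of larger counts.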
In the Bayesian formulation, the nonnegative factors are further assumed to be random variables. In this hierarchical model, gamma prior distributions are considered to govern the basis (B) and NMF coefficient (V) matrices:

f_{V_it}(v_it) = G(v_it; φ_it, θ_it/φ_it),
f_{B_ki}(b_ki) = G(b_ki; ψ_ki, γ_ki/ψ_ki),   (6)

in which G(v; φ, θ) = exp((φ − 1) log v − v/θ − log Γ(φ) − φ log θ) denotes the gamma density function with φ as the shape parameter and θ as the scale parameter, and Γ(φ) is the gamma function. φ, θ, ψ, and γ are referred to as the hyperparameters.
As the exact Bayesian inference for (4), (5), and (6) is difficult, a variational Bayes approach has been proposed in [34] to obtain the approximate posterior distributions of B and V. In this approximate inference, it is assumed that the posterior distributions of the parameters are independent, and these uncoupled posteriors are inferred iteratively by maximizing a lower bound on the marginal log-likelihood of the data.

Fig. 1. A schematic representation of (4) and (5) [34]. Each time-frequency bin of a magnitude spectrogram (Y_kt) is assumed to be a sum of some Poisson-distributed hidden random variables (Z_kit).

More specifically, for this Bayesian NMF, in an iterative scheme the current estimates of the posterior distributions of Z are used to update the posterior distributions of B and V, and these new posteriors are used to update the posteriors of Z in the next iteration. The iterations are carried on until convergence. The posterior distributions for Z_{k,:,t} are shown to be multinomial density functions (: denotes 'all the indices'), while for B_ki and V_it they are gamma density functions. Full details of the update rules can be found in [34]. This variational approach is much faster than an alternative Gibbs sampler, and its computational complexity is comparable to that of the ML estimate of the parameters (KL-NMF).
A. BNMF-HMM for Simultaneous Noise Classification and
Reduction
In the following, we describe the proposed BNMF-HMM
noise reduction scheme in which the state-dependent output
density functions are instances of the BNMF explained in
the introductory part of this section. Each state of the HMM
corresponds to one specific noise type. Let us consider a set
of noise types for which we are able to gather some training
data, and let us denote the cardinality of the set by M. We
can train a BNMF model for each of these noise types given
its training data. Moreover, we consider a universal BNMF
model for speech that can be trained a priori. Note that the
considered speech model doesn’t introduce any limitation in
the method since we train a model for the speech signal in
general, and we don’t use any assumption on the identity or
gender of the speakers.
The structure of the BNMF-HMM is shown in Fig. 2. Each
state of the HMM has some state-dependent parameters, which
are the noise BNMF model parameters. Also, all the states
share some state-independent parameters, which consist of the
speech BNMF model and an estimate of the long-term signal
to noise ratio (SNR) that will be used for the enhancement.
To complete the Markovian model, we need to predefine an
empirical state transition matrix (whose dimension is M ×

Fig. 2. A block diagram representation of the BNMF-HMM with three states. Each state holds the BNMF model of one noise type (babble, factory, and traffic noise in this example), while the state-independent parameters, (1) the BNMF model of speech and (2) an estimate of the long-term SNR, are shared by all states.
M) and an initial state probability vector. For this purpose,
we assign some high values to the diagonal elements of the
transition matrix, and we set the rest of its elements to some
small values such that each row of the transition matrix sums
to one. Each element of the initial state probability vector is
also set to 1/M.
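The construction of the transition matrix and initial probabilities described above can be sketched as follows (the self-transition probability is an assumed value; the paper only states that the diagonal entries are set to high values and that each row sums to one):

```python
import numpy as np

def make_transition_matrix(M, self_prob=0.99):
    """Diagonal-heavy empirical transition matrix: each state stays with
    probability self_prob; the remainder is spread uniformly over the
    other M-1 states, so every row sums to one."""
    A = np.full((M, M), (1.0 - self_prob) / (M - 1))
    np.fill_diagonal(A, self_prob)
    return A

M = 3                                  # e.g. babble, factory, traffic noise
A = make_transition_matrix(M)
pi = np.full(M, 1.0 / M)               # uniform initial state probabilities
```

A strong diagonal encodes the assumption that the acoustic environment changes slowly relative to the frame rate.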
We model the magnitude spectrogram of the clean speech
and noise signals by (4). To obtain a BNMF model, we need
to find the posterior distribution of the basis matrix, and
optimize for the hyperparameters if desired. During training,
we assign some sparse and broad prior distributions to B and
V according to (6). For this purpose, ψ and γ are chosen
such that the mean of the prior distribution for B is small,
and its variance is very high. On the other hand, φ and
θ are chosen such that the prior distribution of V has a
mean corresponding to the scale of the data and has a high
variance to represent uncertainty. To have good initializations
for the posterior means, the multiplicative update rules for
KL-NMF are applied first for a few iterations, and the result
is used as the initial values for the posterior means. After
the initialization, variational Bayes (as explained before) is
run until convergence. We also optimize the hyperparameters
using Newton’s method, as proposed in [34].
In the following, the speech and noise random basis matrices are denoted by B^(s) and B^(n), respectively. A similar notation is used to distinguish all the speech and noise parameters. Let us denote the hidden state variable at each time frame t by X_t, which can take one of the M possible outcomes x_t = 1, 2, . . . , M. The noisy magnitude spectrogram, given the state X_t, is modeled using (4). Here, we use the additivity assumption to approximate the state-dependent distribution of the noisy signal, i.e., y_t = s_t + n_t. To obtain the distribution of the noisy signal, given the state X_t, the parameters of the speech and noise basis matrices (B^(s) and B^(n)) are concatenated to obtain the parameters of the noisy basis matrix B. Since the sum of independent Poisson random variables is Poisson, (4) leads to:

f_{Y_kt}(y_kt | x_t, b, v_t) = λ_kt^{y_kt} e^{−λ_kt} / (y_kt!),   (7)

where λ_kt = Σ_i b_ki v_it. Note that although the basis matrix b is state-dependent, to keep the notation uncluttered we do not write this dependency explicitly.
The state-conditional likelihood of the noisy signal can now be computed by integrating over B and V_t as:

f_{Y_kt}(y_kt | x_t) = ∫∫ f_{Y_kt, B, V_t}(y_kt, b, v_t | x_t) db dv_t
                     = ∫∫ f_{Y_kt}(y_kt | b, v_t, x_t) f_{B, V_t}(b, v_t | x_t) db dv_t.   (8)
The distribution of y_t is obtained by assuming that different frequency bins are independent [5], [7]:

f_{Y_t}(y_t | x_t) = Π_k f_{Y_kt}(y_kt | x_t).   (9)
As the first step of the enhancement, the variational Bayes approach is applied to approximate the posterior distributions of the NMF coefficient vector V_t by maximizing the variational lower bound on (9). Here, we assume that the state-dependent posterior distributions of B are time-invariant and identical to those obtained during the training. Moreover, we use the temporal dynamics of noise and speech to construct informative prior distributions for V_t, which is explained in Subsection III-C. After convergence of the variational learning, we will have the parameters (including expected values) of the posterior distributions of V_t as well as of the latent variables Z_t.
The MMSE estimate [40] of the speech DFT magnitudes can be shown to be [15], [26]:

ŝ_kt = E(S_kt | y_t) = [ Σ_{x_t=1}^{M} ξ_t(y_t, x_t) E(S_kt | x_t, y_t) ] / [ Σ_{x_t=1}^{M} ξ_t(y_t, x_t) ],   (10)

where

ξ_t(y_t, x_t) = f_{Y_t, X_t}(y_t, x_t | y_1^{t−1}) = f_{Y_t}(y_t | x_t) f_{X_t}(x_t | y_1^{t−1}),   (11)

in which y_1^{t−1} = {y_1, . . . , y_{t−1}}. Here, f_{X_t}(x_t | y_1^{t−1}) is computed using the forward algorithm [41]. Since (8) cannot be evaluated analytically, one can either use numerical methods or use approximations to calculate f_{Y_kt}(y_kt | x_t). Instead of expensive stochastic integration, we approximate (8) by evaluating the integrand at the mean values of the posterior distributions of B and V_t:

f_{Y_kt}(y_kt | x_t) ≈ f_{Y_kt}(y_kt | b′, v′_t, x_t),   (12)

where b′ = E(B | y_t, x_t) and v′_t = E(V_t | y_t, x_t) are the posterior means of the basis matrix and the NMF coefficient vector, obtained using variational Bayes. Other types of point approximations have also been used for gain modeling in the context of HMM-based speech enhancement [17], [18].
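Combining (10)-(12) with the forward-algorithm probabilities, the final estimate is a convex combination of the state-conditional estimates. A schematic sketch follows; the state-conditional log-likelihoods and estimates are assumed to be supplied by the variational BNMF step, and the log-domain normalization is our own numerical safeguard, not part of the paper's derivation:

```python
import numpy as np

def mmse_combine(state_estimates, state_loglik, state_pred_prob):
    """Weighted MMSE estimate over HMM states, as in (10)-(11).

    state_estimates: (M, K) speech magnitude estimates E(S_t | x_t, y_t)
    state_loglik:    (M,)   log f(y_t | x_t) from (9) and (12)
    state_pred_prob: (M,)   f(x_t | y_1^{t-1}) from the forward algorithm"""
    log_xi = state_loglik + np.log(state_pred_prob)
    log_xi -= log_xi.max()                 # subtract max to avoid underflow
    xi = np.exp(log_xi)
    weights = xi / xi.sum()                # normalized state posteriors
    return weights @ state_estimates       # (K,) combined MMSE estimate
```

When one noise state dominates the likelihood, the combination collapses to that state's estimate, which is how classification and enhancement happen simultaneously.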
To finish our derivation, we need to calculate the state-dependent MMSE estimate of the speech DFT magnitudes E(S_kt | x_t, y_t). First, let us rewrite (4) for the noisy signal as:

Y_kt = S_kt + N_kt = Σ_{i=1}^{I^(s)} Z_kit^(s) + Σ_{i=1}^{I^(n)} Z_kit^(n) = Σ_{i=1}^{I^(s)+I^(n)} Z_kit,
