IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING
Supervised and Unsupervised Speech Enhancement
Using Nonnegative Matrix Factorization
Nasser Mohammadiha*, Student Member, IEEE, Paris Smaragdis, Member, IEEE, Arne Leijon, Member, IEEE
Abstract—Reducing the interference noise in a monaural noisy speech signal has been a challenging task for many years. Compared to traditional unsupervised speech enhancement methods, e.g., Wiener filtering, supervised approaches, such as algorithms based on hidden Markov models (HMM), lead to higher-quality enhanced speech signals. However, the main practical difficulty of these approaches is that for each noise type a model has to be trained a priori. In this paper, we investigate a new class of supervised speech denoising algorithms using nonnegative matrix factorization (NMF). We propose a novel speech enhancement method that is based on a Bayesian formulation of NMF (BNMF). To circumvent the mismatch problem between the training and testing stages, we propose two solutions. First, we use an HMM in combination with BNMF (BNMF-HMM) to derive a minimum mean square error (MMSE) estimator for the speech signal with no information about the underlying noise type. Second, we suggest a scheme to learn the required noise BNMF model online, which is then used to develop an unsupervised speech enhancement system. Extensive experiments are carried out to investigate the performance of the proposed methods under different conditions. Moreover, we compare the performance of the developed algorithms with state-of-the-art speech enhancement schemes using various objective measures. Our simulations show that the proposed BNMF-based methods outperform the competing algorithms substantially.
Index Terms—Nonnegative matrix factorization (NMF), speech enhancement, PLCA, HMM, Bayesian inference.
I. INTRODUCTION
Estimating the clean speech signal in a single-channel
recording of a noisy speech signal has been a research topic
for a long time and is of interest for various applications
including hearing aids, speech/speaker recognition, and speech
communication over telephone and internet. A major outcome
of these techniques is the improved quality and reduced
listening effort in the presence of an interfering noise signal.
In general, speech enhancement methods can be catego-
rized into two broad classes: unsupervised and supervised.
Unsupervised methods include a wide range of approaches
such as spectral subtraction [1], Wiener and Kalman filtering,
e.g., [2], [3], short-time spectral amplitude (STSA) estimators
[4], estimators based on super-Gaussian prior distributions
for speech DFT coefficients [5]–[8], and schemes based on
Copyright (c) 2013 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
N. Mohammadiha and A. Leijon are with the Department of Electrical
Engineering, KTH Royal Institute of Technology, SE-100 44 Stockholm,
Sweden (e-mail: nmoh@kth.se; leijon@kth.se).
P. Smaragdis is with the Department of Computer Science and Department
of Electrical and Computer Engineering, University of Illinois at Urbana-
Champaign, IL, USA (e-mail: paris@illinois.edu).
periodic models of the speech signal [9]. In these methods, a
statistical model is assumed for the speech and noise signals,
and the clean speech is estimated from the noisy observation
without any prior information on the noise type or speaker
identity. However, the main difficulty of most of these methods
is estimation of the noise power spectral density (PSD) [10]–
[12], which is a challenging task if the background noise is
non-stationary.
For the supervised methods, a model is considered for both
the speech and noise signals and the model parameters are
estimated using the training samples of that signal. Then, an
interaction model is defined by combining speech and noise
models and the noise reduction task is carried out. Some
examples of this class of algorithms include the codebook-
based approaches, e.g., [13], [14] and hidden Markov model
(HMM) based methods [15]–[19]. One advantage of these
methods is that there is no need to estimate the noise PSD
using a separate algorithm.
The supervised approaches have been shown to produce
better quality enhanced speech signals compared to the unsu-
pervised methods [14], [16], which can be expected as more
prior information is fed to the system in these cases and
the considered models are trained for each specific type of
signals. The required prior information on noise type (and
speaker identity in some cases) can be given by the user, or
can be obtained using a built-in classification scheme [14],
[16], or can be provided by a separate acoustic environment
classification algorithm [20]. The primary goal of this work is
to propose supervised and unsupervised speech enhancement
algorithms based on nonnegative matrix factorization (NMF)
[21], [22].
NMF is a technique to approximate a nonnegative matrix y by the product of two nonnegative matrices, y ≈ bv, where the columns of b are a set of basis vectors and v holds the combination weights. In speech processing, y is usually the magnitude spectrogram of the speech signal with spectral vectors stored by column, b is the basis matrix or dictionary, and v is referred to as the NMF coefficient or activation matrix. NMF has been widely
used as a source separation technique applied to monaural
mixtures, e.g., [23]–[25]. More recently, NMF has also been
used to estimate the clean speech from a noisy observation
[26]–[31].
When applied to speech source separation, good separation can be expected only when speaker-dependent basis matrices are learned. In contrast, for noise reduction, good enhancement can be achieved even if a general speaker-independent basis matrix of speech is learned [29], [31]. Nevertheless, there
might be some scenarios (such as speech degraded with

multitalker babble noise) for which the basis matrices of
speech and noise are quite similar. In these cases, although
the traditional NMF-based approaches can be used to get state-
of-the-art performance, other constraints can be imposed into
NMF to obtain a better noise reduction. For instance, assuming
that the babble waveform is obtained as a sum of different
speech signals, a nonnegative hidden Markov model is pro-
posed in [26] to model the babble noise in which the babble
basis is identical to the speech basis. Another fundamental
issue in basic NMF is that it ignores the important temporal
dependencies of the audio signals. Different approaches have
been proposed in the literature to employ temporal dynamics
in NMF, e.g., [23]–[25], [27], [30], [31].
In this paper, we first propose a new supervised NMF-
based speech enhancement system. In the proposed method,
the temporal dependencies of speech and noise signals are used
to construct informative prior distributions that are applied in
a Bayesian framework to perform NMF (BNMF). We then
develop an HMM structure with output density functions given
by BNMF to simultaneously classify the environmental noise
and enhance the noisy signal. Therefore, the noise type does not need to be specified a priori. Here, the classification is done using the noisy input and is not restricted to the speech pauses only, as it is in [16]; nor does it require an additional noise PSD tracking algorithm, as is required in [14].
Moreover, we propose an unsupervised NMF-based ap-
proach in which the noise basis matrix is learned online from
the noisy mixture. Although online dictionary learning from
clean data has been addressed in some prior works, e.g., [32],
[33], our causal method learns the noise basis matrix from
the noisy mixture. The main contributions of this work can be
summarized as:
1) We present a review of state-of-the-art NMF-based noise
reduction approaches.
2) We propose a speech enhancement method based on
BNMF that inherently captures the temporal dependen-
cies in the form of hierarchical prior distributions. Some
preliminary results of this approach have been presented in [31]. Here, we further develop the method and evaluate its performance comprehensively. In particular, we
present an approach to construct SNR-dependent prior
distributions.
3) An environmental noise classification technique is sug-
gested and is combined with the above BNMF approach
(BNMF-HMM) to develop an unsupervised speech en-
hancement system.
4) A causal online dictionary learning scheme is proposed
that learns the noise basis matrix from the noisy obser-
vation. Our simulations show that the final unsupervised
noise reduction system outperforms state-of-the-art ap-
proaches significantly.
The rest of the paper is organized as follows: The review of the NMF-based speech enhancement algorithms is presented in Section II. In Section III, we describe our main contributions, namely the BNMF-based noise reduction, BNMF-HMM structure, and online noise dictionary learning. Section IV presents our experiments and results with supervised and unsupervised noise reduction systems. Finally, Section V concludes the study.

TABLE I
SUMMARY OF THE NOTATION THAT IS CONSISTENTLY USED IN THE PAPER.

k                  frequency index
t                  time index
X                  a scalar random variable
Y = [Y_kt]         a matrix of random variables
Y_t                t-th column of Y
y = [y_kt]         a matrix of observed magnitude spectrogram values
y_t                t-th column of y
b^(s)              speech parameters (b^(s) is the speech basis matrix)
b^(n)              noise parameters (b^(n) is the noise basis matrix)
b = [b^(s) b^(n)]  mixture parameters (b is the mixture basis matrix)
II. REVIEW OF STATE-OF-THE-ART NMF-BASED SPEECH
ENHANCEMENT
In this section, we first explain a basic NMF approach, and then we review NMF-based speech enhancement. Let us represent the random variables associated with the magnitudes of the discrete Fourier transform (DFT) coefficients of the speech, noise, and noisy signals as S = [S_kt], N = [N_kt], and Y = [Y_kt], respectively, where k and t denote the frequency and time indices, respectively. The actual realizations are denoted by lowercase letters, e.g., y = [y_kt]. Table I summarizes some of the notation that is frequently used in the paper.
To obtain a nonnegative decomposition of a given matrix y, a cost function is usually defined and minimized. Let us denote the basis matrix and the NMF coefficient matrix by b and v, respectively. Nonnegative factorization is achieved by solving the following optimization problem:

(b, v) = arg min_{b,v} D(y ‖ bv) + µ h(b, v),   (1)

where D(y ‖ ŷ) is a cost function, h(·) is an optional regularization term, and µ is the regularization weight. The minimization in (1) is performed under the nonnegativity constraint of b and v. Common choices for the cost function include the Euclidean distance [21], the generalized Kullback-Leibler divergence [21], [34], the Itakura-Saito divergence [25], and the negative likelihood of data in the probabilistic NMFs [35]. Depending on the application, sparsity of the activations v and temporal dependencies of the input data y are two popular motivations for designing the regularization function, e.g., [24], [27], [36], [37]. Since (1) is not a convex problem, iterative gradient descent or expectation-maximization (EM) algorithms are usually used to obtain a locally optimal solution for the problem [21], [25], [35].
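For the generalized Kullback-Leibler divergence with no regularization (µ = 0), the locally optimal solution of (1) is typically found with the well-known multiplicative update rules of [21]. A minimal NumPy sketch follows; the matrix sizes, random initialization, and iteration count are our own illustrative choices, not values from the paper:

```python
import numpy as np

def kl_nmf(y, n_basis, n_iter=200, eps=1e-12):
    """Multiplicative-update NMF minimizing the generalized KL divergence
    D(y || b v).  y is a nonnegative (K x T) magnitude spectrogram."""
    K, T = y.shape
    rng = np.random.default_rng(0)
    b = rng.random((K, n_basis)) + eps        # basis matrix (dictionary)
    v = rng.random((n_basis, T)) + eps        # NMF coefficients (activations)
    for _ in range(n_iter):
        # activation update: v <- v * (b^T (y / bv)) / (b^T 1)
        bv = b @ v + eps
        v *= (b.T @ (y / bv)) / (b.T @ np.ones_like(y) + eps)
        # basis update: b <- b * ((y / bv) v^T) / (1 v^T)
        bv = b @ v + eps
        b *= ((y / bv) @ v.T) / (np.ones_like(y) @ v.T + eps)
    return b, v
```

Each update is guaranteed not to increase the KL cost, which is why these rules are the standard workhorse despite the nonconvexity of (1).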
Let us consider a supervised denoising approach where the basis matrix of speech b^(s) and the basis matrix of noise b^(n) are learned in advance using appropriate training data. The common assumption used to model the noisy speech signal is the additivity of the speech and noise spectrograms, i.e., y = s + n. Although this assumption is not completely justified in real-world problems, the developed algorithms have been shown to produce satisfactory results, e.g., [24]. The basis matrix of

the noisy signal is obtained by concatenating the speech and noise basis matrices as b = [b^(s) b^(n)]. Given the magnitudes of the DFT coefficients of the noisy speech at time t, y_t, the problem in (1) is now solved, with b held fixed, to obtain the noisy NMF coefficients v_t. The NMF decomposition takes the form y_t ≈ bv_t = [b^(s) b^(n)][(v_t^(s))^T (v_t^(n))^T]^T, where T denotes transposition. Finally, an estimate of the clean speech DFT magnitudes is obtained by a Wiener-type filtering as:

ŝ_t = (b^(s) v_t^(s)) / (b^(s) v_t^(s) + b^(n) v_t^(n)) ⊙ y_t,   (2)

where the division is performed element-wise and ⊙ denotes element-wise multiplication. The clean speech waveform is estimated using the noisy phase and the inverse DFT. One advantage of the NMF-based approaches over the HMM-based [16], [17] or codebook-driven [14] approaches is that NMF automatically captures the long-term levels of the signals, and no additional gain modeling is necessary.
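The supervised pipeline described above (fixed concatenated dictionary, per-frame activation estimation, and the Wiener-type filter of (2)) can be sketched as follows. This is an illustrative implementation, not the paper's: it assumes KL-type multiplicative updates with b held fixed, and all dimensions and iteration counts are our own choices:

```python
import numpy as np

def enhance_frame(y_t, b_s, b_n, n_iter=200, eps=1e-12):
    """Supervised NMF denoising of one noisy magnitude spectrum y_t (K,).

    b_s, b_n: pre-trained speech / noise basis matrices (K x I_s, K x I_n),
    held fixed while the activations are estimated."""
    b = np.hstack([b_s, b_n])                 # concatenated dictionary
    rng = np.random.default_rng(0)
    v = rng.random(b.shape[1]) + eps          # activations for this frame
    for _ in range(n_iter):                   # KL updates; only v changes
        bv = b @ v + eps
        v *= (b.T @ (y_t / bv)) / (b.sum(axis=0) + eps)
    i_s = b_s.shape[1]
    s_hat = b_s @ v[:i_s]                     # speech part of the model
    n_hat = b_n @ v[i_s:]                     # noise part of the model
    gain = s_hat / (s_hat + n_hat + eps)      # Wiener-type filter of (2)
    return gain * y_t                         # enhanced magnitude spectrum
```

The returned magnitude spectrum would then be combined with the noisy phase and inverse-transformed to obtain the time-domain signal.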
Schmidt et al. [28] presented an NMF-based unsupervised
batch algorithm for noise reduction. In this approach, it is
assumed that the entire noisy signal is observed, and then the
noise basis vectors are learned during the speech pauses. In
the intervals of speech activity, the noise basis matrix is kept
fixed and the rest of the parameters (including speech basis and
speech and noise NMF coefficients) are learned by minimizing
the Euclidean distance with an additional regularization term
to impose sparsity on the NMF coefficients. The enhanced
signal is then obtained similarly to (2). The reported results
show that this method outperforms a spectral subtraction al-
gorithm, especially for highly non-stationary noises. However,
the NMF approach is sensitive to the performance of the voice
activity detector (VAD). Moreover, the proposed algorithm in
[28] is applicable only in the batch mode, which is usually
not practical in the real world.
In [27], a supervised NMF-based denoising scheme is proposed in which a heuristic regularization term is added to the cost function. By doing so, the factorization is forced to follow pre-obtained statistics. In this method, the basis matrices of speech and noise are learned offline from training data. Also, as part of the training, the mean and covariance of the log of the NMF coefficients are computed. Using these statistics, the negative likelihood of a Gaussian distribution (with the calculated mean and covariance) is used to regularize the cost function during the enhancement. The clean speech signal is then estimated as ŝ_t = b^(s) v_t^(s). Although it is not explicitly mentioned in [27], to make the regularization meaningful, the statistics of the speech and noise NMF coefficients have to be adjusted according to the long-term levels of the speech and noise signals.
In [29], the authors propose a linear minimum mean square error (MMSE) estimator for NMF-based speech enhancement. In this work, NMF is applied to y_t^p (i.e., y_t^p ≈ bv_t, where p = 1 corresponds to using magnitudes of the DFT coefficients and p = 2 corresponds to using magnitude-squared DFT coefficients) in a frame-by-frame routine. Then, a gain variable g_t is estimated to filter the noisy signal as ŝ_t = (g_t ⊙ y_t^p)^(1/p). Assuming that the basis matrices of speech and noise are obtained during the training stage, and that the NMF coefficients V_t are random variables, g_t is derived such that the mean square error between S_t^p and Ŝ_t^p is minimized. The optimal gain is shown to be:

g_t = (ξ_t + c_p² √ξ_t) / (ξ_t + 1 + 2 c_p² √ξ_t),   (3)

where c_p is a constant that depends on p [29] and ξ_t is called the smoothed speech-to-noise ratio, which is estimated using a decision-directed approach. For a theoretical comparison of (3) to a usual Wiener filter, see [29]. The conducted simulations show that the results using p = 1 are superior to those using p = 2 (which is in line with previously reported observations, e.g., [24]) and that both are better than the results of a state-of-the-art Wiener filter.
A semi-supervised approach is proposed in [30] to denoise a
noisy signal using NMF. In this method, a nonnegative hidden
Markov model (NHMM) is used to model the speech mag-
nitude spectrogram. Here, the HMM state-dependent output
density functions are assumed to be a mixture of multinomial
distributions, and thus, the model is closely related to proba-
bilistic latent component analysis (PLCA) [35]. An NHMM is
described by a set of basis matrices and a Markovian transition
matrix that captures the temporal dynamics of the underlying
data. To describe a mixture signal, the corresponding NHMMs
are then used to construct a factorial HMM. When applied for
noise reduction, first a speaker-dependent NHMM is trained
on a speech signal. Then, assuming that the whole noisy signal
is available (batch mode), the EM algorithm is run to simul-
taneously estimate a single-state NHMM for noise and also to
estimate the NMF coefficients of the speech and noise signals.
The proposed algorithm doesn’t use a VAD to update the noise
dictionary, as was done in [28]. But the algorithm requires
the entire spectrogram of the noisy signal, which makes it
difficult for practical applications. Moreover, the employed
speech model is speaker-dependent, and requires a separate
speaker identification algorithm in practice. Finally, similar to
the other approaches based on the factorial models, the method
in [30] suffers from high computational complexity.
A linear nonnegative dynamical system is presented in [38]
to model temporal dependencies in NMF. The proposed causal
filtering and fixed-lag smoothing algorithms use Kalman-
like prediction in NMF and PLCA. Compared to the ad-hoc
methods that use temporal correlations to design regularity
functions, e.g., [27], [37], this approach provides a solid framework for incorporating temporal dynamics into the system. Also, the computational complexity of this method is significantly lower than that of [30].
Raj et al. [39] proposed a phoneme-dependent approach to use NMF for speech enhancement, in which a set of basis vectors is learned for each phoneme a priori. Given the noisy recording, an iterative NMF-based speech enhancer combined with an automatic speech recognizer (ASR) is used to estimate the clean speech signal. In the experiments, a mixture of speech and music is considered, and the clean speech is estimated using a set of speaker-dependent basis matrices.
NMF-based noise PSD estimation is addressed in [37]. In this work, the speech and noise basis matrices are trained offline, after which a constrained NMF is applied to the noisy spectrogram on a frame-by-frame basis. To utilize the time dependencies of the speech and noise signals, an l2-norm regularization term is added to the cost function. This penalty term encourages consecutive speech and noise NMF coefficients to take similar values and hence models the signals' time dependencies. The instantaneous noise periodogram is obtained similarly to (2) by switching the roles of the speech and noise approximations. This estimate is then smoothed over time using exponential smoothing to get a less-fluctuating estimate of the noise PSD, which can be combined with any algorithm that needs a noise PSD, e.g., a Wiener filter.
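The exponential smoothing mentioned above is a first-order recursive average over the frame-wise estimates; a small sketch (the smoothing factor 0.9 is an assumed, typical value, not one taken from [37]):

```python
import numpy as np

def smooth_noise_psd(inst_estimates, alpha=0.9):
    """Exponentially smooth instantaneous noise periodogram estimates over
    time: psd_t = alpha * psd_{t-1} + (1 - alpha) * inst_t."""
    psd = np.asarray(inst_estimates[0], dtype=float)
    smoothed = [psd]
    for inst in inst_estimates[1:]:
        psd = alpha * psd + (1.0 - alpha) * np.asarray(inst, dtype=float)
        smoothed.append(psd)
    return smoothed
```

Larger alpha gives a steadier (but slower-reacting) noise PSD track.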
III. SPEECH ENHANCEMENT USING BAYESIAN NMF
In this section, we present our Bayesian NMF (BNMF) based speech enhancement methods. In the following, we first provide an overview of the employed BNMF, which was originally proposed in [34]. Our proposed extensions of this BNMF to modeling a noisy signal, namely BNMF-HMM and Online BNMF, are given in Subsections III-A and III-B, respectively. Subsection III-C presents a method to construct informative priors that use temporal dynamics in NMF.
The probabilistic NMF in [34] assumes that the input matrix is stochastic, and to perform NMF as y ≈ bv the following model is considered:

Y_kt = Σ_i Z_kit,   (4)

f_{Z_kit}(z_kit) = PO(z_kit; b_ki v_it) = (b_ki v_it)^{z_kit} e^{−b_ki v_it} / (z_kit!),   (5)

where Z_kit are latent variables, PO(z; λ) denotes the Poisson distribution, and z! is the factorial of z. A schematic representation of this model is shown in Fig. 1.
As a result of (4) and (5), Y_kt is assumed Poisson-distributed and integer-valued. In practice, the observed spectrogram is first scaled up and then rounded to the closest integers to avoid large quantization errors. The maximum likelihood (ML) estimate of the parameters b and v can be obtained using an EM algorithm [34], and the result is identical to the well-known multiplicative update rules for NMF using the Kullback-Leibler (KL-NMF) divergence [21].
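The scale-and-round preprocessing required by the integer-valued Poisson model can be illustrated as follows (the scale factor is an arbitrary assumed value; in practice it would be chosen relative to the range of the data):

```python
import numpy as np

def quantize_spectrogram(y, scale=1000.0):
    """Scale up a magnitude spectrogram and round to the closest integers,
    so the Poisson observation model (4)-(5) applies with small
    quantization error."""
    return np.rint(scale * y).astype(np.int64)
```

A larger scale factor makes the relative quantization error smaller at the cost of larger counts.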
In the Bayesian formulation, the nonnegative factors are further assumed to be random variables. In this hierarchical model, gamma prior distributions are considered to govern the basis (B) and NMF coefficient (V) matrices:

f_{V_it}(v_it) = G(v_it; φ_it, θ_it/φ_it),
f_{B_ki}(b_ki) = G(b_ki; ψ_ki, γ_ki/ψ_ki),   (6)

in which G(v; φ, θ) = exp((φ − 1) log v − v/θ − log Γ(φ) − φ log θ) denotes the gamma density function with φ as the shape parameter and θ as the scale parameter, and Γ(φ) is the gamma function. φ, θ, ψ, and γ are referred to as the hyperparameters.
As the exact Bayesian inference for (4), (5), and (6) is difficult, a variational Bayes approach has been proposed in [34] to obtain the approximate posterior distributions of B and V. In this approximate inference, it is assumed that the posterior distributions of the parameters are independent, and these uncoupled posteriors are inferred iteratively by maximizing a lower bound on the marginal log-likelihood of the data.

Fig. 1. A schematic representation of (4) and (5) [34]. Each time-frequency bin of a magnitude spectrogram (Y_kt) is assumed to be a sum of some Poisson-distributed hidden random variables (Z_kit).

More specifically, for this Bayesian NMF, in an iterative scheme the current estimates of the posterior distributions of Z are used to update the posterior distributions of B and V, and these new posteriors are used to update the posteriors of Z in the next iteration. The iterations are carried on until convergence. The posterior distributions for Z_{k,:,t} are shown to be multinomial density functions (: denotes 'all the indices'), while for B_ki and V_it they are gamma density functions. Full details of the update rules can be found in [34]. This variational approach is much faster than an alternative Gibbs sampler, and its computational complexity is comparable to that of the ML estimate of the parameters (KL-NMF).
A. BNMF-HMM for Simultaneous Noise Classification and
Reduction
In the following, we describe the proposed BNMF-HMM
noise reduction scheme in which the state-dependent output
density functions are instances of the BNMF explained in
the introductory part of this section. Each state of the HMM
corresponds to one specific noise type. Let us consider a set
of noise types for which we are able to gather some training
data, and let us denote the cardinality of the set by M. We
can train a BNMF model for each of these noise types given
its training data. Moreover, we consider a universal BNMF
model for speech that can be trained a priori. Note that the
considered speech model doesn’t introduce any limitation in
the method since we train a model for the speech signal in
general, and we don’t use any assumption on the identity or
gender of the speakers.
The structure of the BNMF-HMM is shown in Fig. 2. Each
state of the HMM has some state-dependent parameters, which
are the noise BNMF model parameters. Also, all the states
share some state-independent parameters, which consist of the
speech BNMF model and an estimate of the long-term signal
to noise ratio (SNR) that will be used for the enhancement.
To complete the Markovian model, we need to predefine an
empirical state transition matrix (whose dimension is M ×

Fig. 2. A block diagram representation of the BNMF-HMM with three states. Each state holds the BNMF model of one noise type (babble, factory, and traffic noise in this example), while the state-independent parameters, (1) the BNMF model of speech and (2) an estimate of the long-term SNR, are shared by all states.
M) and an initial state probability vector. For this purpose,
we assign some high values to the diagonal elements of the
transition matrix, and we set the rest of its elements to some
small values such that each row of the transition matrix sums
to one. Each element of the initial state probability vector is
also set to 1/M.
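The construction of the transition matrix and initial probabilities described above can be sketched as follows (the self-transition probability is an assumed value; the paper only states that the diagonal entries are set to high values and that each row sums to one):

```python
import numpy as np

def make_transition_matrix(M, self_prob=0.99):
    """Diagonal-heavy empirical transition matrix: each state stays with
    probability self_prob; the remainder is spread uniformly over the
    other M-1 states, so every row sums to one."""
    A = np.full((M, M), (1.0 - self_prob) / (M - 1))
    np.fill_diagonal(A, self_prob)
    return A

M = 3                                  # e.g. babble, factory, traffic noise
A = make_transition_matrix(M)
pi = np.full(M, 1.0 / M)               # uniform initial state probabilities
```

A strong diagonal encodes the assumption that the acoustic environment changes slowly relative to the frame rate.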
We model the magnitude spectrogram of the clean speech
and noise signals by (4). To obtain a BNMF model, we need
to find the posterior distribution of the basis matrix, and
optimize for the hyperparameters if desired. During training,
we assign some sparse and broad prior distributions to B and
V according to (6). For this purpose, ψ and γ are chosen
such that the mean of the prior distribution for B is small,
and its variance is very high. On the other hand, φ and
θ are chosen such that the prior distribution of V has a
mean corresponding to the scale of the data and has a high
variance to represent uncertainty. To have good initializations
for the posterior means, the multiplicative update rules for
KL-NMF are applied first for a few iterations, and the result
is used as the initial values for the posterior means. After
the initialization, variational Bayes (as explained before) is
run until convergence. We also optimize the hyperparameters
using Newton’s method, as proposed in [34].
In the following, the speech and noise random basis matrices are denoted by B^(s) and B^(n), respectively. A similar notation is used to distinguish all the speech and noise parameters. Let us denote the hidden state variable at each time frame t by X_t, which can take one of the M possible outcomes x_t = 1, 2, . . . , M. The noisy magnitude spectrogram, given the state X_t, is modeled using (4). Here, we use the additivity assumption to approximate the state-dependent distribution of the noisy signal, i.e., y_t = s_t + n_t. To obtain the distribution of the noisy signal, given the state X_t, the parameters of the speech and noise basis matrices (B^(s) and B^(n)) are concatenated to obtain the parameters of the noisy basis matrix B. Since the sum of independent Poisson random variables is Poisson, (4) leads to:

f_{Y_kt}(y_kt | x_t, b, v_t) = λ_kt^{y_kt} e^{−λ_kt} / (y_kt!),   (7)

where λ_kt = Σ_i b_ki v_it. Note that although the basis matrix b is state-dependent, to keep the notation uncluttered we do not write this dependency explicitly.
The state-conditional likelihood of the noisy signal can now be computed by integrating over B and V_t as:

f_{Y_kt}(y_kt | x_t) = ∫∫ f_{Y_kt, B, V_t}(y_kt, b, v_t | x_t) db dv_t
                     = ∫∫ f_{Y_kt}(y_kt | b, v_t, x_t) f_{B, V_t}(b, v_t | x_t) db dv_t.   (8)
The distribution of y_t is obtained by assuming that different frequency bins are independent [5], [7]:

f_{Y_t}(y_t | x_t) = Π_k f_{Y_kt}(y_kt | x_t).   (9)
As the first step of the enhancement, the variational Bayes approach is applied to approximate the posterior distributions of the NMF coefficient vector V_t by maximizing the variational lower bound on (9). Here, we assume that the state-dependent posterior distributions of B are time-invariant and identical to those obtained during the training. Moreover, we use the temporal dynamics of noise and speech to construct informative prior distributions for V_t, which is explained in Subsection III-C. After convergence of the variational learning, we will have the parameters (including expected values) of the posterior distributions of V_t as well as of the latent variables Z_t.
The MMSE estimate [40] of the speech DFT magnitudes can be shown to be [15], [26]:

ŝ_kt = E(S_kt | y_t) = [ Σ_{x_t=1}^{M} ξ_t(y_t, x_t) E(S_kt | x_t, y_t) ] / [ Σ_{x_t=1}^{M} ξ_t(y_t, x_t) ],   (10)

where

ξ_t(y_t, x_t) = f_{Y_t, X_t}(y_t, x_t | y_1^{t−1}) = f_{Y_t}(y_t | x_t) f_{X_t}(x_t | y_1^{t−1}),   (11)

in which y_1^{t−1} = {y_1, . . . , y_{t−1}}. Here, f_{X_t}(x_t | y_1^{t−1}) is computed using the forward algorithm [41]. Since (8) cannot be evaluated analytically, one can either use numerical methods or use approximations to calculate f_{Y_kt}(y_kt | x_t). Instead of expensive stochastic integration, we approximate (8) by evaluating the integrand at the mean values of the posterior distributions of B and V_t:

f_{Y_kt}(y_kt | x_t) ≈ f_{Y_kt}(y_kt | b′, v′_t, x_t),   (12)

where b′ = E(B | y_t, x_t) and v′_t = E(V_t | y_t, x_t) are the posterior means of the basis matrix and the NMF coefficient vector, obtained using variational Bayes. Other types of point approximations have also been used for gain modeling in the context of HMM-based speech enhancement [17], [18].
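Combining (10)-(12) with the forward-algorithm probabilities, the final estimate is a convex combination of the state-conditional estimates. A schematic sketch follows; the state-conditional log-likelihoods and estimates are assumed to be supplied by the variational BNMF step, and the log-domain normalization is our own numerical safeguard, not part of the paper's derivation:

```python
import numpy as np

def mmse_combine(state_estimates, state_loglik, state_pred_prob):
    """Weighted MMSE estimate over HMM states, as in (10)-(11).

    state_estimates: (M, K) speech magnitude estimates E(S_t | x_t, y_t)
    state_loglik:    (M,)   log f(y_t | x_t) from (9) and (12)
    state_pred_prob: (M,)   f(x_t | y_1^{t-1}) from the forward algorithm"""
    log_xi = state_loglik + np.log(state_pred_prob)
    log_xi -= log_xi.max()                 # subtract max to avoid underflow
    xi = np.exp(log_xi)
    weights = xi / xi.sum()                # normalized state posteriors
    return weights @ state_estimates       # (K,) combined MMSE estimate
```

When one noise state dominates the likelihood, the combination collapses to that state's estimate, which is how classification and enhancement happen simultaneously.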
To finish our derivation, we need to calculate the state-dependent MMSE estimate of the speech DFT magnitudes E(S_kt | x_t, y_t). First, let us rewrite (4) for the noisy signal as:

Y_kt = S_kt + N_kt = Σ_{i=1}^{I^(s)} Z_kit^(s) + Σ_{i=1}^{I^(n)} Z_kit^(n) = Σ_{i=1}^{I^(s)+I^(n)} Z_kit,
