NON-NEGATIVE MATRIX FACTORIZATION BASED COMPENSATION OF MUSIC FOR
AUTOMATIC SPEECH RECOGNITION
Bhiksha Raj¹, Tuomas Virtanen², Sourish Chaudhuri¹, Rita Singh¹
¹ Carnegie Mellon University, Pittsburgh PA, USA
² Department of Signal Processing, Tampere University of Technology, Finland
bhiksha@cs.cmu.edu, tuomas.virtanen@tut.fi, sourishc@cs.cmu.edu, rsingh@cs.cmu.edu
ABSTRACT
This paper proposes to use non-negative matrix factorization based
speech enhancement in robust automatic recognition of mixtures
of speech and music. We represent magnitude spectra of noisy
speech signals as non-negative weighted linear combinations of speech and noise spectral basis vectors, which are obtained from training corpora of speech and music. We use overcomplete dictionaries consisting of random exemplars of the training data. The method is tested on the Wall Street Journal large-vocabulary speech corpus, artificially corrupted with polyphonic music from the RWC music database. Various music styles and speech-to-music ratios are evaluated. The proposed methods are shown to produce a consistent, significant improvement in recognition performance compared with the baseline method.
Audio demonstrations of the enhanced signals are available at
http://www.cs.tut.fi/~tuomasv.
Index Terms: noise robustness, automatic speech recognition,
non-negative matrix factorization, speech enhancement
1. INTRODUCTION
The problem of recognizing speech in the presence of non-
stationary noises remains a difficult one without a satisfactory so-
lution to date. A large number of algorithms have been proposed
in the literature to address the more general problem of mitigating
the effect of noise on the speech signal. Many of these attempt to
reduce the noise from the speech signal itself [1, 2]. Other techniques, e.g. [3, 4], modify features derived from the signal to reduce
the effect of noise.
Many of the above mentioned methods are quite effective,
often achieving dramatic improvements in recognition accuracy
when speech is corrupted by stationary or slowly-varying noise.
Unfortunately, the same improvements are not achieved when the
noise is non-stationary. The reason for this is simple to intuit:
these algorithms require (statistical) estimates of the spectral char-
acteristics of the noise. Non-stationary noise changes quickly, of-
ten as quickly as the speech signal itself. Any local characteris-
tics that one may estimate for the noise, based on past samples of
speech, or even on a window of samples around the current sample,
are unlikely to be representative of the noise affecting the current
segment.
It can be reasoned, therefore, that techniques that can effec-
tively estimate the noise at the current instant based on the cur-
rent sample of the noisy speech are more likely to be effective
when the noise is fast varying. However, since we have only the
noisy speech to estimate the instantaneous noise from, we require
stronger a priori information about the signals involved, namely
the speech and the noise.
A number of techniques have been proposed based on this
principle. These techniques generally attempt to model the tem-
poral dynamics of either the speech [5], or the corrupting noise
[6] or both [7, 8] by HMMs or linear dynamical systems, in order
to aid localization of the current noise characteristics. However, a simple characterization such as a linear dynamical system (or the coarser Gaussian mixture model) is insufficiently detailed for signals such as music or speech, which have a nearly unlimited range of variation. More detailed characterizations such as HMMs or graphical models [8] are only useful for very restricted noises, as, being very detailed, they tend to overfit to the specific instances of noise they are trained from.
In this paper we follow a different approach. In previous
work we (and other researchers) have demonstrated that non-
negative spectral factorization methods, including those based on
non-negative matrix factorization (NMF) [9] and latent-variable
analysis (LVA) [10], can be effectively used for signal separation.
These methods represent signals by a compositional model that
characterizes their spectra as a weighted linear combination of ad-
ditive units, or bases that combine to compose it. By appropri-
ately learning these bases, it becomes possible to attempt to sep-
arate out mixtures of sounds. The mixed sound is modelled as a
composition of the bases of all the contributing sources. Through the application of appropriate constraints, we can then estimate the contributions of the individual bases, and thereby reconstitute the individual sources contributing to the mixture.
In this paper we show that the NMF-based approach, described
in Sections 2-3, is also capable of generating enhanced signals that
significantly improve recognition on speech corrupted by a highly
non-stationary signal, specifically music. Unlike the methods of
[9, 10], we will use an exemplar-based method [11, 12] to learn
overcomplete sets of bases for the signals. As in the previous tech-
niques, both for NMF-based separation and statistical techniques
such as [6, 7, 8], we require characterizations of both signals, i.e.
speech and the corrupting signal (music in our case). However,
we also demonstrate that the separation obtained using the exem-
plar based method generalizes to cases where the specific music
type is not known. Experiments described in Section 4 on speech
corrupted by music, a notoriously difficult non-stationary signal to
compensate for, show that while large improvements in recogni-
tion accuracy can be obtained if the characterizations for both the
speaker and the specific genre of music in the signal are known, the
method is equally effective even if only generic characterizations
that represent ensembles of music or speakers are available.

2. FEATURE REPRESENTATION
The basic feature representation we employ for NMF-based enhancement is the magnitude spectrogram. To obtain this, we compute a series of short-time Fourier transforms (STFTs) from Hamming-windowed frames of the signal, and take the absolute values of the resulting spectral vectors. For automatic speech recognition, Mel-frequency cepstral coefficients are computed from the enhanced magnitude spectrum representation.
The optimal analysis window lengths for NMF-based signal enhancement and for speech recognition differ: for the former we have found an analysis window between 40 ms and 64 ms to be optimal, whereas speech recognition works best with an analysis window of about 25 ms. As a result, we revert the NMF-enhanced magnitude spectrogram back to a time-domain signal, as described in Section 3.4. The reconstituted signal is used to compute features for recognition.
A key aspect of the magnitude spectrographic representation is that the magnitude spectrogram of the sum of two signals is approximately equal to the sum of the magnitude spectrograms of the individual signals. Let y[n] be a noisy speech signal that is the sum of a clean speech signal s[n] and noise m[n], n being the discrete time index. Let us denote the magnitude spectrum vectors of the signals in frame t as y_t, s_t, and m_t, respectively. The sequences of spectral column vectors from the whole recording are grouped into matrices Y, S, and M, respectively. Thus, we can approximate Y ≈ S + M.
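To make the representation concrete, the following is a minimal NumPy sketch of how such magnitude spectrograms might be computed. The 60 ms window and 15 ms shift follow the analysis settings reported in Section 4.2; the function name and the 16 kHz default are illustrative assumptions, not part of the paper.

import numpy as np

def magnitude_spectrogram(x, fs=16000, win_ms=60, hop_ms=15):
    """Return the D x T non-negative magnitude spectrogram of signal x."""
    win_len = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    assert len(x) >= win_len, "signal shorter than one analysis window"
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[t * hop : t * hop + win_len] * window
                       for t in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))   # D = win_len // 2 + 1 frequency bins

# Note that |STFT(s + m)| is only approximately |STFT(s)| + |STFT(m)|, since the
# phases of the two signals interact; the approximation Y ≈ S + M relies on this.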
3. SEPARATING SPEECH FROM NOISE
The principle behind NMF-based separation of speech from noise
is this. Per the compositional model presented in Section 3.1, the
spectral vector for clean speech is composed from the bases of
speech, and the spectrum for the noise that corrupts the speech
is composed from the bases for noise. The noisy speech, being
an addition of speech and noise, is therefore a composition of the
combined bases from speech and noise. The contributions of the individual bases to the noisy speech can be estimated as described in Section 3.3, segregating the contributions of the speech bases from those of the noise bases, so that the speech in the mixture can be separated from the noise and synthesized as described in Section 3.4.
3.1. The Compositional Model
The compositional model represents the magnitude spectrum s_t of speech in frame t as a weighted linear non-negative combination of basis vectors b_i^s as

s_t = \sum_{i=1}^{S} b_i^s w_{i,t}^s    (1)

where b_i^s is the i-th speech basis vector, w_{i,t}^s is the weight of that basis in frame t, and S is the number of speech basis vectors.
If we represent the set of basis vectors by the matrix B^s = [b_1^s, ..., b_S^s] and the weights by the matrix [W^s]_{i,t} = w_{i,t}^s, we can write the model for the speech spectrogram as the product of matrices B^s and W^s:

S = B^s W^s    (2)
Similarly, the noise is modeled as the weighted sum of noise basis vectors b_i^m, i = 1, ..., M, where M is the number of noise basis vectors. When the noise basis vectors are grouped into the matrix B^m and the noise weights into the matrix W^m, the model for the noise spectrogram can be written as M = B^m W^m.
The model for the noisy speech spectrogram Y ≈ S + M can then be written as

Y ≈ B W,    (3)

where B = [B^s B^m] is a matrix that combines the bases for speech and noise into a single matrix, and W = [W^{s\top} W^{m\top}]^\top combines the weights into a single matrix. If there are S bases for speech (i.e., B^s is D × S, where D is the dimensionality of the spectral vectors) and M bases for the noise (B^m is D × M), then the total number of bases in B is S + M, i.e., B is a D × (S + M) matrix, and W is an (S + M) × T matrix, where T is the number of frames.
All bases and weights in (3) are strictly non-negative. The intuition behind this is that any sound is composed by constructive composition of various components. For instance, a segment of music may be composed by additive composition of the notes that comprise it. Cancellation, which is a major component of any decomposition in terms of additive bases, rarely, if ever, factors into the composition of a sound, except by careful design. In fact, it has been found that the non-negativity restriction alone is sufficient even for blind separation of sources [9].
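As a concrete illustration of the block structure in (3), the following small NumPy sketch assembles B and W from speech and noise parts; the dimensions and the random matrices are purely illustrative assumptions.

import numpy as np

D, S, M, T = 481, 3000, 3000, 200                        # illustrative sizes; D = 481
                                                         # matches a 60 ms window at 16 kHz
B_s, B_m = np.random.rand(D, S), np.random.rand(D, M)    # speech and noise bases
W_s, W_m = np.random.rand(S, T), np.random.rand(M, T)    # non-negative weights

B = np.hstack([B_s, B_m])                                # D x (S + M), bases side by side
W = np.vstack([W_s, W_m])                                # (S + M) x T, weights stacked
Y_model = B @ W                                          # non-negative model of the noisy spectrogram
assert Y_model.shape == (D, T)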
3.2. The bases
The bases b_i^s and b_i^m which form the columns of B reside in the same domain as the data vectors y_t, i.e., they too are spectral vectors and represent the spectral magnitudes of the composing signals. Equation (1) does not specify how the bases are obtained, and this remains a matter of choice. It is possible to obtain a data-driven estimate of a set of bases by analysis of example data. Two methods are immediately available: latent variable decompositions [10] and non-negative matrix factorization (NMF) [9]. However, in this paper we have found an exemplar-based characterization [11, 12] to be most effective.
Exemplar-based characterizations use realizations of spectral vectors from the source signal itself as the bases. These bases
may simply be drawn randomly from a collection of spectral vec-
tors for the source. Thus each spectral vector is explained as being
a linear combination of the exemplar vectors from the source. Al-
though this defies any clear semantic interpretation (such as notes
being the elementary units of music), such bases nevertheless have
useful theoretical properties, particularly in the context of signal
enhancement, as explained in [11].
We obtain the set of speech basis vectors B^s as the magnitudes of the DFTs of randomly drawn frames from training examples of speech. Similarly, the set of noise basis vectors B^m is obtained from training examples of the corrupting signal. The data sets are described in detail in Section 4.
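A minimal sketch of this exemplar sampling is given below; magnitude_spectrogram is the helper sketched in Section 2, the 3000-exemplar default follows Section 4.2, and all names are illustrative rather than taken from the paper.

import numpy as np

def exemplar_dictionary(training_signals, n_exemplars=3000, seed=None):
    """Draw random spectral frames from the training signals as basis vectors."""
    rng = np.random.default_rng(seed)
    # Concatenate the magnitude spectrograms of all training signals: D x N
    spectra = np.hstack([magnitude_spectrogram(x) for x in training_signals])
    idx = rng.choice(spectra.shape[1], size=n_exemplars, replace=False)
    return spectra[:, idx]                  # D x n_exemplars dictionary

# B_s = exemplar_dictionary(speech_training_signals)   # speaker or multi-speaker speech
# B_m = exemplar_dictionary(music_training_signals)    # music genre or mixed genres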
3.3. Estimating Weights
Once a set of bases B is given, the weights with which they must
be combined to optimally compose the spectral vectors in Y can
be determined using either the EM algorithm from [10] or one of
various NMF-based update rules [13]. In this paper we employ the NMF update rule that minimizes a generalized Kullback-Leibler divergence between the spectral vectors in Y and the composition BW. This rule estimates the weights through iterations of:
W \leftarrow W \otimes \frac{B^\top \left[ \frac{Y}{B W} \right]}{B^\top \mathbf{1}}    (4)

where \mathbf{1} is a D × T matrix of ones. The operation ⊗ represents element-wise multiplication, and all divisions too are element-wise. We initialize all the weights W to unity and apply the update 200 times. After that, the weight matrices W^s and W^m for speech and noise, respectively, are obtained by splitting W as W = [W^{s\top} W^{m\top}]^\top.
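A minimal NumPy sketch of this update is shown below; the initialization to one and the 200 iterations follow the text above, while the small epsilon guarding against division by zero is an implementation assumption.

import numpy as np

def estimate_weights(Y, B, n_iter=200, eps=1e-12):
    """Multiplicative updates of Eq. (4), minimizing the generalized KL divergence
    between the noisy magnitudes Y (D x T) and the model BW (B is D x (S+M))."""
    W = np.ones((B.shape[1], Y.shape[1]))   # initialize all weights to unity
    ones = np.ones_like(Y)                  # the D x T matrix of ones in Eq. (4)
    for _ in range(n_iter):
        W *= (B.T @ (Y / (B @ W + eps))) / (B.T @ ones + eps)
    return W

# W = estimate_weights(Y, np.hstack([B_s, B_m]))
# W_s, W_m = W[:B_s.shape[1], :], W[B_s.shape[1]:, :]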
3.4. Signal reconstruction
The minimum mean-squared-error estimate of S, i.e., the contribution of speech to Y, can be extracted as:

S = Y \otimes \frac{B^s W^s}{B^s W^s + B^m W^m}    (5)
The reconstituted speech spectrogram is then converted back to a time-domain signal by combining it with the phase obtained from the complex spectrogram of the noisy signal, applying an inverse STFT, and overlap-add combination of the frames. The time-domain signal is then further used for speech recognition. The above procedure can also be viewed as filtering the noisy signal with a time-varying filter defined by (B^s W^s)/(B^s W^s + B^m W^m), similarly to Wiener filtering.
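The sketch below illustrates this reconstruction. It assumes the weights were estimated from the magnitudes of the same STFT, uses scipy.signal.stft/istft for the transform and the overlap-add (the paper does not prescribe a particular implementation), and adds a small epsilon only to avoid division by zero.

import numpy as np
from scipy.signal import stft, istft

def enhance(y, B_s, B_m, W_s, W_m, fs=16000, win_ms=60, hop_ms=15, eps=1e-12):
    """Apply the Wiener-like mask of Eq. (5) to the noisy signal y and resynthesize."""
    nperseg = int(fs * win_ms / 1000)
    noverlap = nperseg - int(fs * hop_ms / 1000)
    _, _, Z = stft(y, fs=fs, window='hamming', nperseg=nperseg, noverlap=noverlap)
    mask = (B_s @ W_s) / (B_s @ W_s + B_m @ W_m + eps)    # time-varying filter, Eq. (5)
    S_hat = np.abs(Z) * mask                              # enhanced magnitude spectrogram
    # Recombine with the noisy phase and overlap-add back to a time-domain signal.
    _, s_hat = istft(S_hat * np.exp(1j * np.angle(Z)), fs=fs, window='hamming',
                     nperseg=nperseg, noverlap=noverlap)
    return s_hat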
4. EXPERIMENTAL EVALUATION
4.1. Acoustic material and recognizer
We performed speech recognition experiments on digital mixtures
of speech and music. The CMU-Sphinx HMM-based continuous-
density speech recognizer was used for all experiments. Since
NMF-based separation typically requires bases to characterize the
speaker, we used a somewhat unconventional setup. Acoustic
models with 1000 tied states, each modelled by a mixture of 8
Gaussians, were trained from the Resource Management database.
As features we use MFCCs and their deltas and double-deltas cal-
culated in 25 ms frames. Speakers from the training components of the Wall Street Journal corpus were used for testing: a total of 3775 utterances, distributed approximately uniformly across 83 speakers, formed our test set. The remaining data from each speaker were used to train bases for the speaker where necessary.
The test utterances were corrupted by digital addition of mu-
sic. For the music, we used the RWC database [14], which is a pro-
fessionally produced polyphonic music database containing many
different music styles. The included styles are “classical” from the
RWC Classical database, and “jazz”, “latin”, and “world” from
the RWC Genre database. Some of the recordings contain vo-
cals. As speech may be confused with sung vocals, we simplify the recognition task by semi-automatically discarding music material containing singing: shorter segments are discarded using simple rules derived from MIDI references, and the rest is screened by listening.
The first minute of each recording was segmented out and added
to a collection to be used as “training data”. Random segments
from the remaining material were used for corrupting the speech.
The above procedure resulted in a total of 339, 149, and 281 seconds of training material for the jazz, latin, and world main categories, respectively, and 1015, 368, and 674 seconds of testing material. For classical music we had nearly 3 hours of test data and over 20 minutes of training data. All the material was downsampled from 44.1 kHz to a 16 kHz sampling frequency and downmixed to mono. The
test data were used to corrupt the speech and the training data were
used to learn bases for the music types.
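For reference, the sketch below shows one standard way of mixing speech and music at a target speech-to-music ratio; the paper does not spell out its mixing procedure, so this is an assumption, and the function name is illustrative.

import numpy as np

def mix_at_snr(speech, music, snr_db):
    """Scale `music` so that the speech-to-music ratio is snr_db, then add the two."""
    music = music[:len(speech)]                        # assumes the music segment is long enough
    p_speech = np.mean(np.asarray(speech, float) ** 2)
    p_music = np.mean(np.asarray(music, float) ** 2)
    gain = np.sqrt(p_speech / (p_music * 10.0 ** (snr_db / 10.0)))
    return speech + gain * music

# noisy = mix_at_snr(clean_utterance, random_music_segment, snr_db=0)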
[Figure 1 appears here: four panels, (a)-(d), plotting Accuracy (100 - WER) against SNR (dB) from -5 to 20 dB.]
Fig. 1. Recognition performance on speech corrupted by (a) classical, (b) jazz, (c) latin, (d) world music. In each panel, the lower dotted curve is the performance on uncompensated speech and the upper curve is the performance on enhanced speech.
4.2. Speaker and music style dependent bases
We ran a number of different experiments. In the first set of experi-
ments, we corrupted the test speech with each of four music types (classical, jazz, latin, and world) at a number of different SNRs, namely -5 dB, 0 dB, 10 dB, 15 dB, and 20 dB. NMF-based separation is usually assumed to require detailed knowledge of the corrupting noise and the speaker. This is often not such an unrealistic assumption: the identity of the speaker will often be known, and the bases for music can be learned from music-only segments that have been detected by a voice-activity detector. For this test we
used the training sets of recordings from each speaker to learn
speaker-dependent bases for each speaker. For each of the mu-
sic types we also learned music-specific bases. A total of 3000 bases were obtained for each music type and each speaker. In all experiments, the signals were analyzed using 60 ms windows with a 15 ms frame shift between windows to compute spectrograms.
Figure 1 shows the performance on speech corrupted by each
of the four categories of music. We observe, firstly, that signifi-
cant improvements in recognition accuracy are obtained on speech
corrupted by all music types. The largest improvement, however, is obtained on speech corrupted by classical music.
4.3. Multi-speaker and multi-style music bases
This first experiment makes some fairly stringent assumptions. It
assumes that the type of music affecting the signal and the identity
of the speaker are both known. In the next two experiments we
relaxed both assumptions. Speech was corrupted with randomly
selected segments of music from any of the music types from the
RWC Genre database. In the first experiment the identity of the
corrupting music was assumed to be unknown. A total of 3000
bases were drawn randomly from all music types and used for
separation. Although this is not an open set of music types, it is
still a fairly large closed set since the music segments are diverse
in their variety. In the second experiment it was assumed that the

[Figure 2 appears here: Accuracy (100 - WER) plotted against SNR (dB) from -5 to 20 dB.]
Fig. 2. Recognition of speech corrupted by assorted music. The lowest curve shows performance on uncompensated speech. The central black dash-dotted curve is obtained with mixed music bases and speaker-specific speech bases. The top red curve is the performance with mixed music bases and multi-speaker speech bases.
[Figure 3 appears here: bar chart of Accuracy (100 - WER) at -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, and clean conditions, with bars for NMF+MLLR, Matched Condition, NMF, Baseline MLLR, VTS, and Baseline.]
Fig. 3. Recognition performance on speech corrupted by classical music. Results shown are (i) baseline on uncompensated speech, (ii) on speech compensated by VTS [3], (iii) with baseline models adapted to the corrupted data using MLLR, (iv) on NMF-enhanced speech, (v) with a “matched” recognizer, and (vi) after adaptation of the models using hypotheses obtained from NMF-enhanced speech, on NMF-enhanced speech.
identity of the speaker too was unknown. A common set of 6000
bases drawn from all speakers was used for separation. Again,
although this is not an open set of speakers, the number of speakers
(83) is very large. Figure 2 shows the performance obtained in both
experiments. The algorithm not only holds up when the identity of
the music and speaker are unknown, but the performance obtained
when the speaker identity was not known is actually slightly higher
than that obtained when the identity is known.
4.4. Recognizer adaptation
Speech recognition systems frequently employ adaptation tech-
niques to improve recognition. It is often unclear whether compensation methods combine well with adaptation, and, if they do, what the upper bound on performance might be. Figure 3 shows the performance obtained with maximum likelihood linear regression (MLLR) adaptation. For this experiment we used the speech data corrupted by music, and speaker-specific speech bases were employed. Adaptation was performed per speaker.
Not only does adaptation improve performance greatly, but the final performance is also better than that obtained with a matched rec-
ognizer that was trained on speech corrupted by exactly the same
music and music level as the test speech. This is the best perfor-
mance we have obtained to date on speech corrupted by music.
5. CONCLUSIONS
We have shown that NMF-based compensation of speech cor-
rupted by music can result in large improvements in recognition
accuracy. We have also shown that although the compensation re-
quires bases drawn from the music and speech, it functions very
well even when the identity of the music or speaker are unknown.
Interestingly, we observe that large improvements in recognition accuracy are obtained even when the perceptual reduction in the background music level of the enhanced signal is not as great.
Various direct enhancements to NMF, including the enforce-
ment of temporal continuity constraints, can improve performance
greatly. In addition, NMF provides an instantaneous characteriza-
tion of the distribution of music, through the weights assigned to
the bases. This can in fact be used directly to adapt the models
in the recognizer for improved recognition. This and other tech-
niques remain topics for future work.
6. REFERENCES
[1] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. on ASSP, 1979.
[2] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error log-spectral amplitude estimator,” IEEE Trans. on ASSP, pp. 443-445, 1985.
[3] P. J. Moreno, B. Raj, and R. M. Stern, “A Vector Taylor Series Approach for Environment-Independent Speech Recognition,” Proc. ICASSP, 1996.
[4] D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, “A Minimum-Mean-Square-Error Noise Reduction Algorithm on Mel-Frequency Cepstra for Robust Speech Recognition,” Proc. ICASSP, 2008.
[5] J. Droppo and A. Acero, “Noise Robust Speech Recognition with a Switching Linear Dynamic Model,” Proc. ICASSP, 2004.
[6] B. Raj, R. Singh, and R. M. Stern, “On Tracking Noise with Linear Dynamical System Models,” Proc. ICASSP, 2004.
[7] A. P. Varga and R. K. Moore, “Hidden Markov Model decomposition of speech and noise,” Proc. ICASSP, 1990.
[8] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, “Super-human multi-talker speech recognition: A graphical modeling approach,” Computer Speech and Language, 2010.
[9] T. Virtanen, “Monaural Sound Source Separation by Non-Negative Matrix Factorization with Temporal Continuity and Sparseness Criteria,” IEEE Trans. on ASLP, vol. 15, 2007.
[10] M. V. Shashanka, B. Raj, and P. Smaragdis, “Probabilistic Latent Variable Models as Non-Negative Factorizations,” Computational Intelligence and Neuroscience, May 2008.
[11] P. Smaragdis, M. Shashanka, and B. Raj, “A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds,” Proc. NIPS, 2009.
[12] J. F. Gemmeke and T. Virtanen, “Noise-robust exemplar-based connected digit recognition,” Proc. ICASSP, 2010.
[13] A. Cichocki, “Csiszár Divergences for Non-negative Matrix Factorization: Family of New Algorithms,” ICA and BSS, vol. 3889/2006, pp. 32-39, 2006.
[14] M. Goto, “Development of the RWC music database,” in Proc. 18th International Congress on Acoustics, 2004.