CEPSTRAL ANALYSIS BASED ON THE GLIMPSE PROPORTION MEASURE FOR
IMPROVING THE INTELLIGIBILITY OF HMM-BASED SYNTHETIC SPEECH IN NOISE

Cassia Valentini-Botinhao¹, Ranniery Maia², Junichi Yamagishi¹, Simon King¹ and Heiga Zen²

¹ The Centre for Speech Technology Research, University of Edinburgh, UK
² Cambridge Research Laboratory, Toshiba Research Europe Limited, UK
ABSTRACT

In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve the intelligibility of synthetic speech in speech shaped noise.
Index Terms: cepstral coefficient extraction, objective measure for speech intelligibility, Lombard speech, HMM-based speech synthesis
1. INTRODUCTION

This work focuses on compensating for background additive noise by increasing the intelligibility of synthetic speech generated by a parametric statistical model. Our method modifies clean speech before it is added to noise. Applications of such an approach include car navigation systems and any public announcement system that makes use of text-to-speech technology.

The intelligibility of state-of-the-art hidden Markov model (HMM) generated synthetic speech can be comparable to that of natural speech in clean environments [1], but in noisy environments the situation is quite different and most often natural speech is more intelligible. The statistical and parametric nature of HMM-based speech synthesis, however, offers a high degree of control over the generated speech. By modifying the models or the extracted parameters we are able to control the acoustic characteristics of the generated speech without the need for new data. It is then possible to generate synthetic speech that is more intelligible in noise than the natural speech used for training [2]. One way to achieve this is to imitate the acoustic properties found in natural speech produced in noise, also known as Lombard speech. However, not all observed acoustic changes improve intelligibility. It has, for example, been found that changes in the fundamental frequency contribute little to intelligibility gains [3, 4]. What remains unknown is which acoustic modifications do in fact have a positive impact on intelligibility and how they relate to the noise characteristics.

We believe that it is possible to increase the intelligibility of speech in noise by modifying clean speech automatically according to the noise characteristics. Because we do not know how speech production and background noise are related, we need a model of intelligibility, or simply an objective measure of speech intelligibility in noise, to control how speech should be modified. This is what we refer to here as an auditory perception based approach, as the modifications are no longer inspired by speech production in noise but by how the human auditory system perceives them. Previously we have shown that simple changes in the spectral domain can result in significant gains in intelligibility for HMM-generated synthetic speech in noise and that some intelligibility measures can predict these intelligibility gains [4]. Our idea here is to use one of these measures, the Glimpse Proportion (GP) measure [5], to modify the spectral envelope of speech. To do this we alter the optimization criterion of the cepstral coefficient extraction method [6] commonly used in the HMM-based synthesis framework.

In Section 2 of this paper we outline the cepstral coefficient extraction method and in Section 3 we describe the Glimpse Proportion measure. In Section 4 we show how we can reformulate the Glimpse Proportion measure for use as a cost function for cepstral extraction, and then we define the proposed cepstral extraction method, showing how to solve the new optimization problem. Section 5 gives the experimental results on the acoustic analysis of the modifications and the intelligibility evaluation of vocoded and HMM-generated synthetic speech.
2. UELS-BASED CEPSTRAL ANALYSIS

The cepstral coefficient extraction described in [6] is a method commonly used to extract spectral parameters for an HMM-based speech synthesizer. The method is based on the Unbiased Estimator of Log Spectrum (UELS) [7].

The cepstral coefficients $\{c_m\}_{m=0}^{M}$ define the spectrum of the speech signal $s(n)$ in the following way:

$$H(e^{j\omega}) = \exp \sum_{m=0}^{M} c_m e^{-jm\omega} = K D(e^{j\omega}) \quad (1)$$

where $K = \exp c_0$ and $D(e^{j\omega})$ is the gain-normalized version of $H(e^{j\omega})$.

The authors in [6] propose to extract cepstral coefficients by minimizing the criterion defined for the unbiased condition as described in [7]. Since $H(e^{j\omega})$ as defined in Eq. (1) is a minimum phase system, it is possible to prove that minimizing the unbiased criterion with respect to $\{c_m\}_{m=1}^{M}$ is the same as minimizing the following cost function:

$$\varepsilon = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{I_N(\omega)}{|D(e^{j\omega})|^2} \, d\omega \quad (2)$$

where $I_N(\omega)$ is the modified periodogram of a wide-sense stationary process $s(n)$. Likewise we find that $K = \sqrt{\varepsilon_{\min}}$, where $\varepsilon_{\min}$ is the minimum value of $\varepsilon$.
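As a concrete illustration, the cost of Eq. (2) can be evaluated numerically on a discrete frequency grid. This is a minimal sketch of the cost function only, not the recursive estimation algorithm of [6]; the function name and grid resolution are our own choices.

```python
import numpy as np

def uels_cost(periodogram, c):
    """Numerically evaluate the UELS cost of Eq. (2).

    periodogram : samples of I_N(omega) on a uniform grid over [0, pi]
    c           : real cepstral coefficients c_0 .. c_M defining Eq. (1)
    """
    omega = np.linspace(0.0, np.pi, len(periodogram))
    m = np.arange(len(c))
    # For a real cepstrum, log|H(e^{jw})| = sum_m c_m cos(m w)
    log_mag = np.cos(np.outer(omega, m)) @ np.asarray(c, dtype=float)
    d_mag = np.exp(log_mag - c[0])  # gain-normalised |D(e^{jw})|, K = exp(c_0)
    # The integrand is even, so (1/2pi) * integral over [-pi, pi]
    # equals (1/pi) * integral over [0, pi]
    return np.trapz(periodogram / d_mag ** 2, omega) / np.pi
```

With a flat periodogram and a flat model the cost is 1; any spectral mismatch raises it, which is the property the minimization exploits.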
3. THE GLIMPSE PROPORTION MEASURE

The Glimpse Proportion (GP) measure for speech intelligibility in noise [5] is based on the idea that in a noisy environment humans focus on glimpses of speech that are not masked by noise. It correlates well with subjective scores for intelligibility in noise of both natural [5] and HMM-based synthetic speech [8], and also when the spectral envelope of HMM-based synthetic speech is modified [4]. The GP measure outperforms most existing measures for the intelligibility of speech in noise and it does not require any time delays.

The measure is the proportion of spectro-temporal regions, so-called glimpses, where speech is more energetic than noise. This comparison takes place in the Spectro Temporal Excitation Pattern (STEP) domain. To represent a signal in this domain, the following operations are performed over the speech and noise waveforms separately: Gammatone filtering into frequency channels, envelope extraction, envelope smoothing, averaging over time frames, and level compression. The centre frequencies of the Gammatone filters are linearly spaced on the equivalent rectangular bandwidth (ERB) scale [9].
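Once speech and noise are represented in the STEP domain, the glimpse count itself is straightforward. The sketch below assumes the STEP front end has already been applied and uses Cooke's local-SNR criterion; the 3 dB margin and the function name are illustrative choices, not taken from this paper.

```python
import numpy as np

def glimpse_proportion(step_speech, step_noise, margin_db=3.0):
    """Percentage of spectro-temporal cells where speech exceeds the
    noise by at least margin_db (a 'glimpse'), per Cooke's model [5].

    step_speech, step_noise : (N_f, N_t) arrays of STEP levels in dB.
    """
    glimpsed = step_speech > step_noise + margin_db
    return 100.0 * glimpsed.mean()
```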
4. PROPOSED CEPSTRAL ANALYSIS INCORPORATING THE GP MEASURE

In this section we show how we can approximate the GP measure and integrate it into the existing cost function for cepstral coefficient extraction shown in Section 2.
4.1. Proposed GP approximation

To obtain a closed-form and differentiable formula that relates spectral parameters to the Glimpse Proportion measure we have to make some approximations and correspondences. We first replace the hard decision for counting glimpses by a soft one defined by a sigmoid function. The proposed approximated Glimpse Proportion measure is then given by:

$$\mathrm{GP} = \frac{100}{N_f N_t} \sum_{t=1}^{N_t} \sum_{f=1}^{N_f} L(y^{sp}_{t,f} - y^{ns}_{t,f}) \quad (3)$$

where $y^{sp}_{t,f}$ and $y^{ns}_{t,f}$ are the approximated STEP representations for speech and noise respectively at analysis window $t$ and frequency channel $f$; $N_t$ and $N_f$ are the number of time frames and frequency channels respectively; $L(\cdot)$ is the logistic sigmoid function with zero offset and slope $\eta$.
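Eq. (3) replaces the hard glimpse threshold with a sigmoid so that the measure becomes differentiable in the spectral parameters. A minimal sketch (the default slope eta = 1 is an arbitrary choice for illustration):

```python
import numpy as np

def soft_gp(y_sp, y_ns, eta=1.0):
    """Approximated GP of Eq. (3): a logistic sigmoid L with zero
    offset and slope eta replaces the hard glimpse decision.

    y_sp, y_ns : (N_t, N_f) approximated STEP values for speech / noise.
    """
    diff = np.asarray(y_sp) - np.asarray(y_ns)
    sigmoid = 1.0 / (1.0 + np.exp(-eta * diff))  # L(y_sp - y_ns)
    return 100.0 * sigmoid.mean()                # 100/(N_t N_f) * sum
```

When speech and noise STEP values are equal everywhere the soft measure returns 50, and it approaches the hard count of the previous section as eta grows.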
We approximate the calculation of the STEP signal for speech and noise by performing it over the magnitude spectrum of the speech and the discrete Fourier transform representation of the noise, respectively. The absolute value operation representing the envelope extraction step is replaced by a circular convolution of the signal with itself. The filtering operations are replaced by truncated multiplications and the level compression is no longer considered. The STEP approximation, as shown in Fig. 1, is given by:

$$y^{sp}_{t,f} = \frac{1}{N} \, (G_f h_t \circledast_N G_f h_t)^{\top} S \, b \quad (4)$$

where $N$ is the number of frequency bins of the spectrum, $\circledast_N$ is the circular convolution operation of dimension $N$, and:
[Figure: block diagram of the proposed processing chain, from the magnitude spectrum h_t through Gammatone filtering (G_f), circular convolution (envelope extraction), envelope smoothing (S) and averaging (b) to y^sp_{t,f}.]

Fig. 1. Proposed approximation for the Spectro Temporal Excitation Pattern (STEP) calculation.
$h_t = [\,|H_t(\omega_1)| \ldots |H_t(\omega_N)|\,]^{\top}$ is an $N \times 1$ vector containing the magnitude spectrum of the windowed speech signal at analysis window $t$;

$G_f = \mathrm{diag}(g_{f,1} \ldots g_{f,N})$ is an $N \times N$ diagonal matrix whose diagonal contains the Gammatone filter frequency response for frequency channel $f$;

$S = \mathrm{diag}(\upsilon_1 \ldots \upsilon_N)$ is an $N \times N$ diagonal matrix whose diagonal contains the frequency response of the smoothing filter;

$b = [\,b_1 \ldots b_N\,]^{\top}$ is an $N \times 1$ vector containing the coefficients of the averaging filter.
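For one frame t and channel f, Eq. (4) can be computed directly; the circular convolution is conveniently done via the FFT. A sketch under the paper's definitions (the function and argument names are ours):

```python
import numpy as np

def step_approx(h_t, g_f, s_diag, b):
    """Approximated STEP value of Eq. (4) for one frame and channel.

    h_t    : magnitude spectrum of the windowed frame, length N
    g_f    : diagonal of the Gammatone response matrix G_f, length N
    s_diag : diagonal of the smoothing-filter matrix S, length N
    b      : averaging-filter coefficients, length N
    """
    n = len(h_t)
    x = g_f * h_t                                    # G_f h_t
    # circular convolution of x with itself (dimension N), via the FFT
    conv = np.real(np.fft.ifft(np.fft.fft(x) ** 2))
    # (conv)^T S b with S diagonal, scaled by 1/N
    return (conv * s_diag) @ b / n
```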
The approximated version of the GP measure proposed here obtains correlation coefficients that are smaller than, but still comparable to, those obtained by the original GP measure, and higher than those obtained by any other spectrum-based measure, when using the subjective data from the experiment described in [4].
4.2. Cost function reformulation

To maintain a compromise between the minimization of the cost function defined in Eq. (2) and the maximization of the intelligibility measure given by Eq. (3), we define a parameter β that controls the weight given to each criterion. The redefined cost function is:

$$E_t = \varepsilon_t - \beta \, \mathrm{GP}_t \quad (5)$$

where $\varepsilon_t$ is the value of the function described in Eq. (2) in time frame $t$ and $\mathrm{GP}_t$ is the time evolution of the GP as defined in Eq. (3):

$$\mathrm{GP}_t = \frac{100}{N_f} \sum_{f=1}^{N_f} L(y^{sp}_{t,f} - y^{ns}_{t,f}) \quad (6)$$

The cepstral coefficient vector $c_t = [c_{t,1} \ldots c_{t,m} \ldots c_{t,M}]^{\top}$ is given by:

$$c_t = \operatorname*{argmin}_{c_t} \big[\, \varepsilon_t - \beta \, \mathrm{GP}_t \,\big] \quad (7)$$

It is clear that when β = 0 the proposed cepstral extraction method reduces to the original method of Section 2.
4.3. Solving the optimization problem using Steepest Descent

The update equation for the cepstral coefficients using Steepest Descent is:

$$c_t^{(i+1)} = c_t^{(i)} - \mu \nabla E_t^{(i)} \quad (8)$$

where μ is the step size and the index i refers to iterations. From now on we drop the index i for clarity.

According to the definition of the error given by Eq. (5), the gradient vector is:

$$\nabla E_t = \nabla \varepsilon_t - \beta \nabla \mathrm{GP}_t \quad (9)$$

The formula expressing the value of $\nabla \varepsilon_t$ can be found in [6].
Considering the definition of the STEP function and of $\mathrm{GP}_t$ as given by Eqs. (4) and (6), we have that:

$$\nabla \mathrm{GP}_t = \frac{100}{N_f N} \sum_{f=1}^{N_f} \eta \, L(y^{sp}_{t,f} - y^{ns}_{t,f}) \big[ 1 - L(y^{sp}_{t,f} - y^{ns}_{t,f}) \big] \cdot \nabla H_{c_t} G_f (2 \, \Gamma_N \circledast_N G_f h_t) \, S \, b \quad (10)$$

where $\nabla H_{c_t}$ is an $M \times N$ matrix whose elements are $\{\nabla H_{c_t}\}_{m,j} = \partial |H_t(\omega_j)| / \partial c_{t,m}$, and the operation $(\Gamma_N \circledast_N G_f h_t)$ defines an $N \times N$ matrix of the following form:

$$\begin{bmatrix} e_1 \circledast_N (G_f h_t)^{\top} \\ e_2 \circledast_N (G_f h_t)^{\top} \\ \vdots \\ e_N \circledast_N (G_f h_t)^{\top} \end{bmatrix}$$

where $e_n$ is the $n$-th column of the identity matrix $\Gamma_N$.

When the spectrum is modeled by cepstral coefficients as defined in Eq. (1), the elements of the matrix $\nabla H_{c_t}$ are given by:

$$\frac{\partial |H_t(\omega_j)|}{\partial c_{t,m}} = |H_t(\omega_j)| \cos(m \omega_j) \quad (11)$$
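The update of Eq. (8) is a standard steepest-descent loop. The sketch below uses a gradient-normalised step, as in the experiments of Section 5 where the step size is scaled by $1/\|\nabla E_t^{(i)}\|$; grad_fn stands in for the gradient of Eq. (9), and the names and defaults are illustrative.

```python
import numpy as np

def steepest_descent(grad_fn, c0, mu=0.1, n_iter=200, tol=1e-8):
    """Iterate Eq. (8) with a gradient-normalised step:
    c <- c - mu * grad / ||grad||.

    grad_fn : callable returning the gradient of E_t (Eq. (9)) at c
    c0      : initial coefficients (e.g. the minimum phase cepstrum)
    """
    c = np.asarray(c0, dtype=float).copy()
    for _ in range(n_iter):
        g = grad_fn(c)
        norm = np.linalg.norm(g)
        if norm < tol:        # error-convergence stopping criterion
            break
        c = c - mu * g / norm
    return c
```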
4.4. Energy normalization

To avoid the trivial solution of maximizing the number of glimpses by increasing the overall energy level, and to see how much we can improve intelligibility given a fixed Signal to Noise Ratio (SNR), we need to make sure that the optimization does not change the total energy of the signal in each time frame.

We assume that the excitation signal has unit power, with a magnitude response that is constant over the whole frequency range for both voiced (single pulse) and unvoiced (white noise) segments. Under this assumption, and considering that the cepstral extraction method does not modify the excitation signal, Parseval's theorem tells us that in order to keep the energy in the time domain constant it is sufficient to keep the following quantity constant:

$$\psi = \sum_{j=1}^{N} |H(\omega_j)|^2 \quad (12)$$

An alternative to explicitly adding a constraint to the optimization problem is to normalize the spectrum at each iteration so that the signal in that frame has fixed energy. For this solution the only term that needs changing in the gradient vector $\nabla E_t$ is the one given by Eq. (11), which for $m \neq 0$ becomes:

$$\frac{\partial |H'_t(\omega_j)|}{\partial c_{t,m}} = |H'_t(\omega_j)| \left[ \cos(m \omega_j) - \frac{1}{\psi} \sum_{l=1}^{N} |H_t(\omega_l)|^2 \cos(m \omega_l) \right] \quad (13)$$

where $|H'_t(\omega_j)|$ is the energy-normalized magnitude spectrum. It is possible to prove that there is no need to update the first cepstral coefficient $c_0$ in this solution, as the normalization operation updates $c_0$ at each iteration to a certain value regardless of an additional $c_0$ term.
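The normalization alternative amounts to rescaling the magnitude spectrum after each update so that ψ of Eq. (12) stays fixed. A minimal sketch of that rescaling step:

```python
import numpy as np

def normalise_energy(h, psi):
    """Rescale a magnitude spectrum h = [|H(w_1)| ... |H(w_N)|] so that
    sum_j |H(w_j)|^2 equals psi (Eq. (12)), i.e. the frame energy is
    unchanged by the cepstral update (Parseval, unit-power excitation)."""
    return h * np.sqrt(psi / np.sum(h ** 2))
```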
5. EVALUATION

We conducted experiments with vocoded and synthetic speech. The results for HMM-synthetic speech show us the impact of the acoustic modelling on the effectiveness of the method.

5.1. Experimental conditions

The speech material we used to generate vocoded speech was the semantically unpredictable sentences (SUS) set from the Blizzard Challenge 2010. The samples were of a British male speaker sampled at 20 kHz. To train the models we used 1000 other sentences from the same speaker, also at 20 kHz. The same sentences used to generate vocoded speech were used as test sentences for the HMM-generated synthetic speech. As the synthesis filter we used the log spectrum approximation filter [6] with simple excitation as input.

Using the proposed method we extracted 52 cepstral coefficients for different β values, including the β = 0 case for comparison. The periodogram was set to be the smoothed spectrum extracted using STRAIGHT [10]. We initialized the algorithm with the first M + 1 values of the minimum phase cepstrum. The step size was set to $\mu^{(i)} = 1 / \|\nabla E_t^{(i)}\|$. We used both error convergence and maximum distortion as stopping criteria.

The acoustic model we used for synthetic speech was a hidden semi-Markov model. The observation vectors for the spectral and excitation parameters contained static, delta and delta-delta values. We used one stream for the spectrum and three streams for log F0. We used the Global Variance method [11] to compensate for the oversmoothing effect of the acoustic modeling.

For these experiments we added vocoded and HMM-generated synthetic speech to two different types of stationary noise: speech shaped noise (ssn) and high frequency noise (hf). Each noise type was added at a different SNR: 0 dB for ssn and 20 dB for hf.

For the listening test we played all signals over headphones to participants in soundproof booths. Each sentence could be played only once before the participant had to type in what he or she heard. A total of eight native English speakers participated in the experiment with vocoded speech and another eight participants were assigned to the experiment with synthetic speech. Each participant heard twelve different sentences per listening situation.
5.2. Results and discussion

Fig. 2 shows the Long Term Average Spectrum (LTAS) of vocoded speech generated using the original and the proposed method when the noise is speech shaped and the SNR is 0 dB. The figure also shows the LTAS of the noise. We can see that on average the proposed method reallocates energy mostly to the frequency range between 800 Hz and 4.8 kHz, the band where the human auditory system is most sensitive. The attenuation occurs mostly in the lower frequency regions, below 800 Hz. For the high frequency noise the energy boost occurs in a similar region, and we also observed some attenuation in the high frequency region, as this region is heavily masked by the noise. We observed that the proposed method improves not only the approximated GP measure introduced above but the original GP measure as well. This improvement was observed for all noise types and for both vocoded and synthetic speech.

Fig. 3 shows the word accuracy rates obtained in the listening test with vocoded (left) and synthetic speech (right). Each group mean is represented by a circle; two means are significantly different at the 0.05 level only if their intervals are disjoint.

[Figure: LTAS curves (sound pressure level in dB versus frequency, 0 to 10 kHz) for the original method, the proposed method and the noise.]

Fig. 2. Long term average spectrum curves extracted from vocoded speech generated using the original method (β = 0) and the proposed method (β ≠ 0) for speech shaped noise at 0 dB SNR.
We can see that the proposed method does not produce any significant difference in word accuracy for vocoded speech. However, for synthetic speech in speech shaped noise there is a significant improvement in word accuracy from 31% to 44% (a gain of 44%). For the high frequency noise case it seems that, although not significantly, the proposed method decreases the word accuracy rates. We believe this happens because the modifications imposed by such noise lead to less natural speech, which in turn degrades intelligibility. This could be solved by changing the acceptable amount of distortion and GP improvement, or by stating the amount of distortion as a constraint instead of a stopping criterion.

The impact of the proposed method seems to be stronger for synthetic speech even though the GP gains were smaller or similar for synthetic speech, most probably because in harder tasks smaller glimpse variations lead to stronger effects.
6. CONCLUSION

In this paper we showed how to use a measure of speech intelligibility in noise to modify HMM-synthetic speech and make it more intelligible in a given noise. We proposed a new cepstral extraction method that aims not only to minimize the mismatch between the periodogram and the modelled spectrum but also to maximize speech intelligibility in noise, as defined by the Glimpse Proportion measure, given that the noise is known and the SNR is known and fixed. The listening tests with vocoded and synthetic speech showed the effectiveness of the method for speech shaped noise but not for high frequency noise, which might indicate that the amount of distortion introduced into the speech by the modification was too large. Our next step is to handle distortion in a better way and then consider other types of constraints as well, for instance loudness. We also plan to compare our approach to natural Lombard speech, in particular in situations where humans are not fully able to change their own voice to successfully overcome the background noise.
Acknowledgment

The research leading to these results was partly funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements 213850 and 256230 (SCALE and LISTA).
[Figure: word accuracy rates (%) by noise type (high frequency, speech shaped) for the original and proposed methods; left panel vocoded speech, right panel synthetic speech.]

Fig. 3. Word accuracy rates of the listening test with vocoded (left) and synthetic (right) speech.
7. REFERENCES

[1] J. Yamagishi, H. Zen, Y.-J. Wu, T. Toda, and K. Tokuda, “Yet another evaluation of the speaker-adaptive HMM-based speech synthesis system in the 2008 Blizzard Challenge,” in Proc. Blizzard Challenge Workshop, Brisbane, Australia, Sept. 2008, vol. 5.

[2] A. Suni, T. Raitio, M. Vainio, and P. Alku, “The GlottHMM speech synthesis entry for Blizzard Challenge 2010,” in Proc. Blizzard Challenge Workshop, Kyoto, Japan, Sept. 2010.

[3] Y. Lu and M. Cooke, “The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise,” Speech Comm., vol. 51, no. 12, pp. 1253–1262, 2009.

[4] C. Valentini-Botinhao, J. Yamagishi, and S. King, “Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise?,” in Proc. Interspeech, Florence, Italy, August 2011.

[5] M. Cooke, “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, 2006.

[6] K. Tokuda, T. Kobayashi, and S. Imai, “Adaptive cepstral analysis of speech,” IEEE Trans. Speech and Audio Processing, vol. SA-3, no. 6, pp. 481–489, Nov. 1995.

[7] S. Imai and C. Furuichi, “Unbiased estimator of log spectrum and its application to speech signal processing,” in Proc. EURASIP, Grenoble, France, Sep. 1988, pp. 203–206.

[8] C. Valentini-Botinhao, J. Yamagishi, and S. King, “Evaluation of objective measures for intelligibility prediction of HMM-based synthetic speech in noise,” in Proc. ICASSP, Prague, Czech Republic, May 2011.

[9] B. C. J. Moore and B. R. Glasberg, “A revision of Zwicker's loudness model,” Acta Acustica, vol. 82, pp. 335–345, 1996.

[10] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds,” Speech Comm., vol. 27, pp. 187–207, 1999.

[11] T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007.