Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise

Cassia Valentini-Botinhao, Junichi Yamagishi, Simon King

01 Sep 2012, pp. 631-634

Edinburgh Research Explorer

Citation for published version:
Valentini-Botinhao, C, Yamagishi, J & King, S 2012, Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. in Proc. Interspeech.

Document Version: Peer reviewed version

Published In: Proc. Interspeech

Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise

Cassia Valentini-Botinhao, Junichi Yamagishi, Simon King

The Centre for Speech Technology Research, University of Edinburgh, UK

C.Valentini-Botinhao@sms.ed.ac.uk, jyamagis@inf.ed.ac.uk, Simon.King@ed.ac.uk

Abstract

We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.

Index Terms: intelligibility of speech in noise, Mel cepstral coefficients, HMM-based speech synthesis

1. Introduction

Humans change their speaking style when conversing in a noisy environment so that communication success is ensured, often producing what is called Lombard speech. It is unclear what aspects of Lombard speech actually contribute to intelligibility increases and how they relate to the nature of the noise. Solving this problem will enable practical applications which automatically modify natural or synthetic speech to increase intelligibility in noise.

The parametric statistical framework of HMM-based speech synthesis offers many different ways to approach this problem. If Lombard speech data are available for the speaker whose TTS voice we want to modify, we can use adaptation techniques to produce new Lombard-like speech for that speaker [1]. If such data are not available, then we can apply noise-independent modifications at the feature level based on known acoustic properties of Lombard speech, such as F0 increase, flattening of spectral tilt and duration stretch [1]. However, if we want to employ noise-dependent techniques then we need to be able to automatically detect what sort of modifications should take place for certain pairs of speech and noise signals. One way in which this can be done is by using an intelligibility measure of speech [2]. Such an approach is limited by the performance of the objective measure: if it fails to accurately predict intelligibility then any modification based on that prediction is likely to fail. Therefore, it is important to find a specific domain of modifications where the intelligibility model behaves well and ensure that the modifications applied in this domain remain within the working range of the objective model.

We have observed that the Glimpse Proportion (GP) measure for speech intelligibility in noise [3] has a high correlation coefficient with subjective intelligibility scores for HMM-generated synthetic speech whose spectral envelope has been modified [4]. Moreover, modifications in the spectral envelope domain can achieve quite high intelligibility gains. We then proposed a cepstral extraction method based on the GP measure for the HMM-based synthesis framework [5]. This method was shown to provide significant intelligibility improvement, although not for all noise types. We hypothesise this is due to distortions introduced by the method itself. A disadvantage of that approach is having to train a different model for each noise type, because the noise-dependent modifications are performed as part of feature extraction. Now, we propose a method that can be applied at generation time and that does not require any information about the spectral envelope of natural speech to achieve distortion control. Rather, we propose to control the distortion in two ways: using a stopping criterion based on the mismatch between the auditory representations of modified and unmodified speech, as proposed by the GP measure, and only modifying the first few cepstral coefficients, thus limiting the frequency resolution of the modifications. A further extension proposed in this paper is the possibility of using this method for Mel cepstral coefficients, which can provide higher speech quality with fewer coefficients [6].

In Sections 2 and 3 we show how Mel cepstral coefficients model the spectrum, how the GP measure works and how we previously approximated it for the purpose of cepstral coefficient optimization. In Section 4 we introduce the new method for Mel cepstral modification based on the GP measure. We then provide experimental results from listening experiments to support our conclusions.

2. Mel cepstral coefficients

We can represent the spectrum by M-th order Mel cepstral coefficients $\{c_m\}_{m=0}^{M}$ in the following manner [6]:

$$H(e^{j\omega}) = \exp \sum_{m=0}^{M} c_m e^{-jm\tilde{\omega}} \tag{1}$$

$$\tilde{\omega} = \tan^{-1} \frac{(1 - \alpha^2)\sin\omega}{(1 + \alpha^2)\cos\omega - 2\alpha} \tag{2}$$

where α is a warping factor which can be chosen to represent, for instance, the Mel scale [6].
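As an illustration of Eqs. (1) and (2), the following minimal Python sketch evaluates the warped frequency and the log magnitude spectrum implied by a set of Mel cepstral coefficients. The function names are hypothetical, and the use of `atan2` to select the branch of the inverse tangent that maps [0, π] onto [0, π] is an assumption made for this sketch, not something stated in the text.

```python
import math

def warped_frequency(omega, alpha):
    """Warped frequency of Eq. (2) for omega in [0, pi] and |alpha| < 1."""
    num = (1.0 - alpha ** 2) * math.sin(omega)
    den = (1.0 + alpha ** 2) * math.cos(omega) - 2.0 * alpha
    # Assumption: atan2 picks the branch that grows monotonically from 0 to pi.
    return math.atan2(num, den)

def log_magnitude(c, omega, alpha):
    """log|H(e^{j omega})| from Eq. (1): sum over m of c[m] * cos(m * warped omega)."""
    w = warped_frequency(omega, alpha)
    return sum(cm * math.cos(m * w) for m, cm in enumerate(c))

# With alpha = 0 there is no warping; with alpha > 0 low frequencies are stretched.
print(warped_frequency(0.5, 0.0))
print(warped_frequency(0.5, 0.77))
```

The magnitude follows from Eq. (1) because the modulus of the exponential of a complex sum is the exponential of its real part, which leaves only the cosine terms.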

C. Valentini-Botinhao, J. Yamagishi, and S. King. Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proc. Interspeech, Portland, USA, September 2012.

3. The Glimpse Proportion measure

The Glimpse Proportion (GP) measure for speech intelligibility in noise [3] is the proportion of spectral-temporal regions, called glimpses, where speech is more energetic than noise. The motivation behind this measure is that when humans listen to speech in noise they tend to focus on such regions. The Spectro Temporal Excitation Pattern (STEP) representation used by the measure is obtained in the following manner: Gammatone filtering, envelope extraction and smoothing, averaging over a time frame and level compression [3].

In [5] we showed how to approximate the GP measure in a way that provides a closed and differentiable formulation:

$$GP = \frac{100}{N_t N_f} \sum_{t=1}^{N_t} \sum_{f=1}^{N_f} L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \tag{3}$$

where $N_t$ and $N_f$ are the number of time frames and frequency channels, $L(\cdot)$ is a logistic sigmoid function of zero offset and slope η, and $y^{sp}_{t,f}$ and $y^{ns}_{t,f}$ are the approximated STEP representations for speech and noise respectively at analysis window t and frequency channel f.
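The approximation in Eq. (3) can be sketched in a few lines of Python, assuming the STEP representations of speech and noise are already available as nested lists indexed by time frame and frequency channel. The function names and the example values are hypothetical; computing the STEP representations themselves is outside this sketch.

```python
import math

def logistic(x, eta):
    """Logistic sigmoid with zero offset and slope eta (eta > 0 assumed)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-eta * x))
    z = math.exp(eta * x)  # avoids overflow for large negative arguments
    return z / (1.0 + z)

def approx_glimpse_proportion(step_speech, step_noise, eta):
    """Eq. (3): 100 / (N_t * N_f) times the sum of L(y_sp - y_ns) over all t, f."""
    n_t = len(step_speech)
    n_f = len(step_speech[0])
    total = 0.0
    for row_sp, row_ns in zip(step_speech, step_noise):
        for y_sp, y_ns in zip(row_sp, row_ns):
            total += logistic(y_sp - y_ns, eta)
    return 100.0 * total / (n_t * n_f)

# Hypothetical 2 x 2 STEP values: speech dominates in two of the four regions.
speech = [[10.0, 0.0], [10.0, 0.0]]
noise = [[0.0, 10.0], [0.0, 10.0]]
print(approx_glimpse_proportion(speech, noise, eta=1.0))
```

Replacing the hard comparison "speech is more energetic than noise" by a sigmoid is what makes the measure differentiable; a larger slope η brings the approximation closer to a hard count of glimpses.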

The STEP representation for speech is given by:

$$y^{sp}_{t,f} = \left(\mathbf{h}_t^{\top}\mathbf{G}_f \circledast \mathbf{h}_t^{\top}\mathbf{G}_f\right)\mathbf{S}\,\mathbf{b} \tag{4}$$

where N is the number of frequency bins of the spectrum, $\circledast$ is circular convolution of dimension N, $\mathbf{h}_t$ is an N×1 vector containing the magnitude spectrum of the windowed speech signal at analysis window t, $\mathbf{G}_f$ is an N×N diagonal matrix whose diagonal contains the Gammatone filter frequency response for frequency channel f, $\mathbf{S}$ is an N×N diagonal matrix whose diagonal contains the frequency response of the smoothing filter and $\mathbf{b}$ is an N×1 vector containing the coefficients of the average filter.

4. Mel cepstral modifications based on the GP measure

Given a set of Mel cepstral coefficients and a noise signal we want to obtain a new set of Mel cepstral coefficients $\mathbf{c}_t = [c_{t,1} \ldots c_{t,m} \ldots c_{t,M}]^{\top}$ that maximizes $GP_t$, the value of the function described in Eq. (3) in time frame t. We then have:

$$\hat{\mathbf{c}}_t = \arg\max_{\mathbf{c}_t} GP_t \tag{5}$$

$$GP_t = \frac{100}{N_f} \sum_{f=1}^{N_f} L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \tag{6}$$

As this function is not necessarily convex with respect to the Mel cepstral coefficients, we use a Steepest Descent method to solve the optimization. The update equation is:

$$\mathbf{c}_t^{(i+1)} = \mathbf{c}_t^{(i)} + \mu \nabla GP_t^{(i)} \tag{7}$$

where µ is the step size and the i index refers to iterations. From now on we drop the i index for clarity. The gradient vector is given by:

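$$\nabla GP_t = \frac{100}{N_f} \sum_{f=1}^{N_f} \eta\, L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \left[1 - L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right)\right] \mathbf{H}_t \left(2\,\mathbf{\Gamma}_N \circledast \mathbf{G}_f \mathbf{h}_t\right) \mathbf{S}\,\mathbf{b} \tag{8}$$

where $\mathbf{H}_t$ is an M×N matrix whose elements are

$$\{\mathbf{H}_t\}_{m,j} = \frac{\partial |H_t(\omega_j)|}{\partial c_{t,m}}$$

and the operation $(\mathbf{\Gamma}_N \circledast \mathbf{G}_f \mathbf{h}_t)$ defines an N×N matrix whose n-th row is equal to $\mathbf{e}_n^{\top} \circledast (\mathbf{G}_f \mathbf{h}_t)^{\top}$, $\mathbf{e}_n$ being the n-th column of the identity matrix $\mathbf{\Gamma}_N$.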

When the spectrum is modelled by Mel cepstral coefficients as defined in Eq. (1) the elements of the matrix $\mathbf{H}_t$ are given by:

$$\frac{\partial |H_t(\omega_j)|}{\partial c_{t,m}} = |H_t(\omega_j)| \cos(m\tilde{\omega}_j) \tag{9}$$

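However, because we do not wish to modify the energy of the speech signal we have:

$$\frac{\partial |\bar{H}_t(\omega_j)|}{\partial c_{t,m}} = |\bar{H}_t(\omega_j)| \left( \cos(m\tilde{\omega}_j) - \frac{1}{\psi} \sum_{l=1}^{N} |H_t(\omega_l)|^2 \cos(m\tilde{\omega}_l) \right) \tag{10}$$

where $|\bar{H}_t(\omega_j)|$ is the energy-normalized magnitude spectrum and $\psi = \sum_{j=1}^{N} |H_t(\omega_j)|^2$. There is no need to update the first Mel cepstral coefficient $c_0$ as the normalization operation updates it to a certain value regardless of an additional $\Delta c_0$ term.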
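The energy-normalized derivative can be checked numerically. The sketch below assumes that normalization means scaling the magnitude spectrum to unit energy, i.e. dividing |H| by the square root of ψ (the target energy is not stated in the text), and compares the analytic derivative with a central finite difference. All function names are hypothetical.

```python
import math

def magnitude(c, warped):
    """|H(w_j)| = exp(sum_m c[m] * cos(m * warped_j)); c[0] plays the role of c_0."""
    return [math.exp(sum(cm * math.cos(m * w) for m, cm in enumerate(c)))
            for w in warped]

def normalized_magnitude(c, warped):
    """Assumed normalization: unit energy, |H| / sqrt(psi), psi = sum_j |H(w_j)|^2."""
    h = magnitude(c, warped)
    psi = sum(v * v for v in h)
    return [v / math.sqrt(psi) for v in h]

def grad_normalized(c, warped, m):
    """Analytic derivative of every normalized bin with respect to c[m]."""
    h = magnitude(c, warped)
    psi = sum(v * v for v in h)
    weighted_cos = sum(v * v * math.cos(m * w) for v, w in zip(h, warped)) / psi
    return [v / math.sqrt(psi) * (math.cos(m * w) - weighted_cos)
            for v, w in zip(h, warped)]

def finite_difference(c, warped, m, eps=1e-6):
    """Central finite difference of the normalized magnitude with respect to c[m]."""
    up, down = list(c), list(c)
    up[m] += eps
    down[m] -= eps
    hu = normalized_magnitude(up, warped)
    hd = normalized_magnitude(down, warped)
    return [(a - b) / (2.0 * eps) for a, b in zip(hu, hd)]

coeffs = [0.3, -0.5, 0.2]                          # hypothetical c_0, c_1, c_2
bins = [k * math.pi / 8 for k in range(9)]         # hypothetical warped frequencies
error = max(abs(a - b) for a, b in
            zip(grad_normalized(coeffs, bins, 1), finite_difference(coeffs, bins, 1)))
print(error)
```

For m = 0 the cosine term equals one at every bin, so the bracket vanishes and the derivative is zero, which is consistent with the remark that the first coefficient does not need to be updated.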

An issue we face when using the GP measure as an optimization criterion on its own is the need to limit the distortions caused by the modifications. To define an audible distortion we use the Euclidean distance between the STEP representations of modified and unmodified speech. Including this as an explicit constraint is unfortunately rather cumbersome, so instead we use it as a stopping criterion whilst at the same time limiting the frequency resolution of the modifications. To implement that, we simply set the gradient vector for higher dimensions to zero, thus modifying only the first few Mel cepstral coefficients, which represent the coarse properties of the spectrum.

5. Evaluation

In this section we show how we built the TTS voices, give an acoustic analysis, and present the results of a listening test.

5.1. Voice building

To build the voices used in this evaluation we used two different datasets recorded by the same British male speaker: normal (plain, read-text) speech data and Lombard speech. The Lombard dataset was recorded while the speaker listened to speech-modulated noise based on another male speaker [7] played over headphones at an absolute value of 84 dBA.

We built eight different voices as outlined in Table 1. Voice N was created from a high quality average voice model adapted to 2803 sentences of the normal speech database, corresponding to three hours of material. We decided to use an average voice model rather than building a speaker-dependent voice because the normal speech dataset was not phonetically balanced. Voices N-M59, N-M10 and N-M2 are variations of N in which we modify all, just the first ten ($c_1$ until $c_{10}$), or just the first two ($c_1$ and $c_2$) Mel cepstral coefficients using our proposed method.

Lombard voice L was based on voice N, further adapted using 780 sentences from the Lombard speech dataset, corresponding to 53 minutes of recorded material. Again, the reason for using adaptation was the lack of phonetic balance in the speech dataset. Voice N-L was also created from voice N but this time only the Mel cepstral coefficients were adapted to the Lombard data. Voices L-E and L-E-M2 are versions of voice L where we extrapolated the adaptation (voice L-E), and then also modified the first two Mel cepstral coefficients using the proposed method (voice L-E-M2).

Table 1: Voices built for the evaluation.

Voice     Adaptation                      Modification
N         -                               -
N-M59     -                               all coefficients
N-M10     -                               first 10 coefficients
N-M2      -                               first 2 coefficients
N-L       only spectral parameters        -
L         all dimensions                  -
L-E       all dimensions extrapolated     -
L-E-M2    all dimensions extrapolated     first 2 coefficients

The training and adaptation data had a sampling rate of 48 kHz. To train, adapt and generate speech we extracted: 59 Mel cepstral coefficients with α = 0.77, Mel scale F0, and 25 aperiodicity energy bands extracted using STRAIGHT [8]. We used a hidden semi-Markov model. The observation vectors for the spectral and excitation parameters contained static, delta and delta-delta values; one stream for the spectrum, three streams for the logF0 and one for the band-limited aperiodicity. The Global Variance method [9] was also applied to compensate for the over-smoothing effect of the acoustical modelling.

To modify the generated Mel cepstral coefficients we used the method proposed in the previous section, obtaining the STEP representation by using Gammatone filters that covered the range 50-7500 Hz as the noise signal used for testing is sampled at 16 kHz. The step size was normalized: $\mu^{(i)} = \mu / \|\nabla GP^{(i)}\|$, and we set µ = 0.4 for N-M59 and µ = 0.8 for N-M10 and N-M2. We used as stopping criteria both error convergence and a maximum distortion threshold set to be 10 % of relative increase in the Euclidean distance between the STEP representation of original and modified speech.
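The iteration described above (normalized step size, zeroed gradient for the higher coefficients, and two stopping criteria) can be sketched as a generic loop. This is only an illustration under stated assumptions: the objective, gradient and distortion are passed in as caller-supplied functions rather than the GP measure and STEP distance themselves, and the names `modify_coefficients`, `tol` and `max_iter` are hypothetical.

```python
import math

def modify_coefficients(c, gradient, objective, distortion, mu, n_modified,
                        max_distortion, tol=1e-4, max_iter=100):
    """Gradient ascent with normalized step, updating only the first n_modified entries.

    c          : starting coefficients [c_1, ..., c_M]
    gradient   : function mapping c to the gradient of the objective
    objective  : function mapping c to the value being maximized
    distortion : function mapping c to a distortion value (assumed supplied by caller)
    """
    c = list(c)
    previous = objective(c)
    for _ in range(max_iter):
        g = gradient(c)
        # Zero the gradient for higher dimensions to limit frequency resolution.
        g = [gm if m < n_modified else 0.0 for m, gm in enumerate(g)]
        norm = math.sqrt(sum(gm * gm for gm in g))
        if norm == 0.0:
            break
        candidate = [cm + (mu / norm) * gm for cm, gm in zip(c, g)]
        if distortion(candidate) > max_distortion:
            break  # distortion threshold reached: keep the last accepted point
        c = candidate
        current = objective(c)
        if abs(current - previous) < tol:
            break  # convergence of the objective
        previous = current
    return c

# Toy stand-ins (not the GP measure): a concave objective with maximum at all ones.
toy_objective = lambda c: -sum((v - 1.0) ** 2 for v in c)
toy_gradient = lambda c: [-2.0 * (v - 1.0) for v in c]
toy_distortion = lambda c: math.sqrt(sum(v * v for v in c))  # distance from the start
print(modify_coefficients([0.0, 0.0, 0.0], toy_gradient, toy_objective,
                          toy_distortion, mu=0.4, n_modified=2, max_distortion=0.5))
```

Because the step is normalized, every accepted update moves the coefficient vector by exactly µ in Euclidean norm, so in this toy setting the distortion check is what ends the loop.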

5.2. Acoustic analysis

Fig. 1 shows the Long Term Average Spectrum (LTAS) of the normal (N), modified (N-M2) and Lombard (L) voices, for the case of speech-shaped noise. Compared to voice N, voice N-M2 exhibits enhanced energy in the frequency region of 1-4 kHz and attenuated energy below 1 kHz. Voice L shows enhancement and attenuation in the same regions as N-M2, although these changes are not as pronounced; attenuation is also seen between 4-5.5 kHz and enhancement at frequencies above this.

Table 2 provides an acoustic analysis of the voices (average duration of speech and pauses, average spectral tilt, and F0) across all sentences used in the listening test for the normal (N), modified (N-M2) and Lombard (L) voices. We can see that, as expected, the Lombard voice produces sentences with longer duration and longer pauses, greatly increased F0 mean and flattening of the spectral tilt. The spectral tilt reflects changes in both spectral envelope and excitation signal. The modified voice N-M2 also presents a flatter spectral tilt, though not to the same extent as the Lombard voice.

5.3. Listening experiments design

We mixed the eight different synthetic voices with two noises: speech-shaped noise and speech from a single competing female talker. For intelligibility testing, it is important to avoid floor or ceiling effects on word error rate. Therefore, in order to obtain intelligibility scores in similar ranges for each noise, we mixed them at differing SNRs: -4 dB for speech-shaped noise and -14 dB for the competing talker. Across the different voices we made sure that the root mean square value was the same.

[Figure 1: Long term average spectrum of the normal N, normal modified N-M2 and Lombard L voices for speech-shaped noise. Axes: Freq. (kHz) against Sound Pressure Level (dB).]

Table 2: Acoustic properties observed in normal N, modified N-M2 and Lombard L voices.

Voice   speech (secs.)   pauses (secs.)   F0 (Hz)   spectral tilt (dB/octave)
N       2.11             0.16             104.5     -2.24
N-M2                                                -1.88
L       2.80             0.19             145.0     -1.70
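Mixing at a target SNR with matched RMS levels can be sketched as follows. The sketch assumes that the SNR is defined over the whole utterance as the ratio of the RMS values of speech and noise, which the text does not specify, and the function names and sample values are hypothetical.

```python
import math

def rms(x):
    """Root mean square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def scale_to_rms(x, target_rms):
    """Scale a signal so that all voices share the same RMS value."""
    gain = target_rms / rms(x)
    return [gain * v for v in x]

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 20*log10(rms(speech)/rms(noise)) equals snr_db, then add."""
    noise_gain = rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    scaled_noise = [noise_gain * v for v in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

# Hypothetical samples; -4 dB is the SNR used for speech-shaped noise in the text.
speech = scale_to_rms([1.0, -1.0, 1.0, -1.0], 0.1)
noise = [0.5, 0.5, -0.5, -0.5]
mixture, scaled_noise = mix_at_snr(speech, noise, -4.0)
print(20.0 * math.log10(rms(speech) / rms(scaled_noise)))
```

Scaling the noise rather than the speech keeps the speech level identical across voices, so differences between conditions come from the modifications and not from level.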

For the listening test we used 32 native English speakers listening to the noisy samples over headphones in soundproof booths and typing in what they heard. Each participant heard six different sentences per condition, i.e., voice and noise type, and each sentence could only be played once. We used the first ten sets of the Harvard sentences [10]; another one of the sets was used as a practice session which listeners completed before the test proper.

5.4. Results and discussion

Figs. 2 and 3 show the mean word accuracy rate (WAR) obtained by each voice when mixed with speech-shaped noise and a competing talker respectively, along with 95 % confidence intervals. Fig. 2 shows that the modified voices N-M59, N-M10 and N-M2 achieve higher WAR than the unmodified voice N (40.9 %), and this is significantly higher for N-M10 (50.6 %) and N-M2 (57.8 %). The N-M2 voice obtains a higher WAR than the N-L voice (49.4 %). The Lombard voices L (63.5 %), L-E (68.1 %) and L-E-M2 (70.1 %) performed better than the normal speech voices although we did not find a significant difference between N-M2 and L. The extrapolated voice L-E is more intelligible than voice L, a trend that is further enhanced by applying our modifications to it, as in voice L-E-M2. The results obtained for the competing talker situation are displayed in Fig. 3 and show a slightly different trend. There is a drop in performance for N-M59 and N-M10 when compared to N (36.6 %), although this is not significant. The N-M2 (42.7 %) voice performs better than the unmodified counterpart N and obtains a similar WAR to N-L (43.6 %). All Lombard voices performed significantly better than the other voices, in particular the L voice (62.2 %). The other versions, L-E (60.5 %) and L-E-M2 (59.3 %), do not appear to increase intelligibility.

[Figure 2: Word accuracy rates for speech-shaped noise. Voices: N, N-M59, N-M10, N-M2, N-L, L, L-E, L-E-M2. Vertical axis: Word accuracy rate (%).]

As predicted by our hypothesis that distortions were defeating potential gains in intelligibility in our previously-published experiments [5], the voices where we modify only the first few Mel cepstral coefficients achieved a better WAR, indicating that very fine frequency modifications cause distortions that cancel out any potential intelligibility gain they may offer. Compared to the N-L voice, for which the spectral parameters were obtained using Lombard speech, the modifications proposed here obtained a similar or higher intelligibility score. The intelligibility gains obtained by the full Lombard voice L over the N-L voice reflect the impact of changes in duration patterns, F0 and the aperiodicity parameters that define the excitation signal, as pointed out in Table 2. We can see, then, that there is a lot to gain from modifying those parameters in addition to the spectral ones. The spectral modifications proposed here increased the gains obtained with the Lombard voice for speech-shaped noise, as we can see from the results for voice L-E-M2, which shows that there are still gains to be had over and above simply building voices on recorded Lombard speech.

For the competing talker, spectral changes seem to contribute less than for speech-shaped noise, while duration stretches and F0 increases are more important. This suggests that for non-stationary noise it is more effective to perform temporal energy re-allocation (e.g., taking advantage of quiet or silent regions in the noise signal) than it is to reallocate energy across different frequencies.

6. Conclusions

We have proposed a new method for modifying Mel cepstral coefficients based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. We showed how to control distortion by modifying only the first few Mel cepstral coefficients, which is a natural way of limiting the frequency resolution of the modifications. In the evaluation, we compared synthetic voices whose spectral parameters were modified with voices using spectral parameters from Lombard speech. Listening tests using speech-shaped noise and a competing talker indicate that we only need to modify two Mel cepstral coefficients to obtain a similar or higher intelligibility to Lombard spectral modifications. Moreover, we observed that, for the competing talker, the intelligibility gain obtained by the Lombard voice over the modified voice was mainly due to changes in duration, F0 and excitation parameters. In terms of what can be achieved when modifying only Mel cepstral coefficients, our method obtains either higher or similar intelligibility scores to Lombard Mel cepstral coefficients. We are currently making a more extensive comparison of our method to other intelligibility enhancement methods. In future, we plan to investigate reallocating energy across time. We also plan to operate under a loudness constraint rather than an energy one.

[Figure 3: Word accuracy rates for competing talker. Voices: N, N-M59, N-M10, N-M2, N-L, L, L-E, L-E-M2. Vertical axis: Word accuracy rate (%).]

7. Acknowledgment

The research leading to these results was partly funded from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 213850 (SCALE) and 256230 (LISTA), and from EPSRC grants EP/I031022/1 and EP/J002526/1.

8. References

[1] T. Raitio, A. Suni, M. Vainio, and P. Alku, “Analysis of HMM-based Lombard speech synthesis,” in Proc. Interspeech, Florence, Italy, August 2011.

[2] B. Sauert and P. Vary, “Near end listening enhancement: Speech intelligibility improvement in noisy environments,” in Proc. ICASSP, Toulouse, France, May 2006, pp. 493–496.

[3] M. Cooke, “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, 2006.

[4] C. Valentini-Botinhao, J. Yamagishi, and S. King, “Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise?” in Proc. Interspeech, Florence, Italy, August 2011.

[5] C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen, “Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise,” in Proc. ICASSP, Kyoto, Japan, March 2012.

[6] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” in Proc. ICASSP, vol. 1, San Francisco, USA, March 1992, pp. 137–140.

[7] W. Dreschler, H. Verschuure, C. Ludvigsen, and S. Westermann, “ICRA noises: artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment. International Collegium for Rehabilitative Audiology.” Audiology, vol. 40, no. 3, pp. 148–57, 2001.

[8] H. Kawahara, I. Masuda-Katsuse, and A. Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds,” Speech Comm., vol. 27, pp. 187–207, 1999.

[9] T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007.

[10] “IEEE recommended practice for speech quality measurements,” Audio and Electroacoustics, IEEE Transactions on, vol. 17, no. 3, pp. 225–246, Sep. 1969.
