CEPSTRAL ANALYSIS BASED ON THE GLIMPSE PROPORTION MEASURE FOR
IMPROVING THE INTELLIGIBILITY OF HMM-BASED SYNTHETIC SPEECH IN NOISE

Cassia Valentini-Botinhao¹, Ranniery Maia², Junichi Yamagishi¹, Simon King¹ and Heiga Zen²

¹ The Centre for Speech Technology Research, University of Edinburgh, UK
² Cambridge Research Laboratory, Toshiba Research Europe Limited, UK
ABSTRACT

In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve the intelligibility of synthetic speech in speech shaped noise.
Index Terms: cepstral coefficient extraction, objective measure for speech intelligibility, Lombard speech, HMM-based speech synthesis
1. INTRODUCTION

This work focuses on compensating for background additive noise by increasing the intelligibility of synthetic speech generated by a parametric statistical model. Our method modifies clean speech before it is added to noise. Applications of such an approach include car navigation systems and any public announcement system that makes use of text-to-speech technology.

The intelligibility of state-of-the-art hidden Markov model (HMM) generated synthetic speech can be comparable to that of natural speech in clean environments [1], but in noisy environments the situation is quite different and most often natural speech is more intelligible. The statistical and parametric nature of HMM-based speech synthesis, however, offers a high degree of control over the generated speech. By modifying the models or the extracted parameters we are able to control the acoustic characteristics of the generated speech without the need for new data. It is then possible to generate synthetic speech that is more intelligible in noise than the natural speech used for training [2]. One way to achieve this is to imitate the acoustic properties found in natural speech produced in noise, also known as Lombard speech. However, not all observed acoustic changes improve intelligibility. It has, for example, been found that changes in the fundamental frequency contribute little to intelligibility gains [3, 4]. What remains unknown is which acoustic modifications do in fact have a positive impact on intelligibility and how they relate to the noise characteristics.

We believe that it is possible to increase the intelligibility of speech in noise by modifying clean speech automatically according to the noise characteristics. Because we do not know how speech production and background noise are related, we need a model of intelligibility, or simply an objective measure of speech intelligibility in noise, to control how speech should be modified. This is what we refer to here as an auditory perception based approach, as the modifications are no longer inspired by speech production in noise but by how the human auditory system perceives them. Previously we have shown that simple changes in the spectral domain can result in significant gains in intelligibility for HMM-generated synthetic speech in noise and that some intelligibility measures can predict these intelligibility gains [4]. Our idea here is to use one of these measures, the Glimpse Proportion (GP) measure [5], to modify the spectral envelope of speech. To do this we alter the optimization criterion of the cepstral coefficient extraction method [6] commonly used in the HMM-based synthesis framework.

In Section 2 of this paper we outline the cepstral coefficient extraction method and in Section 3 we describe the Glimpse Proportion measure. In Section 4 we show how we can reformulate the Glimpse Proportion measure for use as a cost function for cepstral extraction, and then we define the proposed cepstral extraction method, showing how to solve the new optimization problem. Section 5 gives the experimental results on the acoustic analysis of the modifications and the intelligibility evaluation of vocoded and HMM-generated synthetic speech.
2. UELS-BASED CEPSTRAL ANALYSIS

The cepstral coefficient extraction described in [6] is a method commonly used to extract spectral parameters for an HMM-based speech synthesizer. The method is based on the Unbiased Estimator of Log Spectrum (UELS) [7].

The cepstral coefficients $\{c_m\}_{m=0}^{M}$ define the spectrum of the speech signal $s(n)$ in the following way:

$$H(e^{j\omega}) = \exp \sum_{m=0}^{M} c_m e^{-jm\omega} = K D(e^{j\omega}) \quad (1)$$

where $K = \exp c_0$ and $D(e^{j\omega})$ is the gain-normalized version of $H(e^{j\omega})$.

The authors in [6] propose to extract cepstral coefficients by minimizing the criterion defined for the unbiased condition as described in [7]. Since $H(e^{j\omega})$ as defined in Eq. (1) is a minimum phase system, it is possible to prove that minimizing the unbiased criterion with respect to $\{c_m\}_{m=1}^{M}$ is the same as minimizing the following cost function:

$$\varepsilon = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{I_N(\omega)}{|D(e^{j\omega})|^2} \, d\omega \quad (2)$$

where $I_N(\omega)$ is the modified periodogram of a wide-sense stationary process $s(n)$. Likewise we find that $K = \sqrt{\varepsilon_{\min}}$, where $\varepsilon_{\min}$ is the minimum value of $\varepsilon$.
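As a concrete illustration, the cost of Eq. (2) can be evaluated numerically on a discrete frequency grid. This is a minimal sketch of the cost function only, not the recursive estimation algorithm of [6]; the function name and grid resolution are our own choices.

```python
import numpy as np

def uels_cost(periodogram, c):
    """Numerically evaluate the UELS cost of Eq. (2).

    periodogram : samples of I_N(omega) on a uniform grid over [0, pi]
    c           : real cepstral coefficients c_0 .. c_M defining Eq. (1)
    """
    omega = np.linspace(0.0, np.pi, len(periodogram))
    m = np.arange(len(c))
    # For a real cepstrum, log|H(e^{jw})| = sum_m c_m cos(m w)
    log_mag = np.cos(np.outer(omega, m)) @ np.asarray(c, dtype=float)
    d_mag = np.exp(log_mag - c[0])  # gain-normalised |D(e^{jw})|, K = exp(c_0)
    # The integrand is even, so (1/2pi) * integral over [-pi, pi]
    # equals (1/pi) * integral over [0, pi]
    return np.trapz(periodogram / d_mag ** 2, omega) / np.pi
```

With a flat periodogram and a flat model the cost is 1; any spectral mismatch raises it, which is the property the minimization exploits.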
3. THE GLIMPSE PROPORTION MEASURE

The Glimpse Proportion (GP) measure for speech intelligibility in noise [5] is based on the idea that in a noisy environment humans focus on glimpses of speech that are not masked by noise. It correlates well with subjective scores for intelligibility in noise of both natural [5] and HMM-based synthetic speech [8], and also when the spectral envelope of HMM-based synthetic speech is modified [4]. The GP measure outperforms most existing measures for the intelligibility of speech in noise and it does not require any time delays.

The measure is the proportion of spectro-temporal regions, so-called glimpses, where speech is more energetic than noise. This comparison takes place in the Spectro Temporal Excitation Pattern (STEP) domain. To represent a signal in this domain, the following operations are performed over the speech and noise waveforms separately: Gammatone filtering into frequency channels, envelope extraction, envelope smoothing, averaging over time frames, and level compression. The centre frequencies of the Gammatone filters are linearly spaced on the equivalent rectangular bandwidth (ERB) scale [9].
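Once speech and noise are represented in the STEP domain, the glimpse count itself is straightforward. The sketch below assumes the STEP front end has already been applied and uses Cooke's local-SNR criterion; the 3 dB margin and the function name are illustrative choices, not taken from this paper.

```python
import numpy as np

def glimpse_proportion(step_speech, step_noise, margin_db=3.0):
    """Percentage of spectro-temporal cells where speech exceeds the
    noise by at least margin_db (a 'glimpse'), per Cooke's model [5].

    step_speech, step_noise : (N_f, N_t) arrays of STEP levels in dB.
    """
    glimpsed = step_speech > step_noise + margin_db
    return 100.0 * glimpsed.mean()
```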
4. PROPOSED CEPSTRAL ANALYSIS INCORPORATING THE GP MEASURE

In this section we show how we can approximate the GP measure and integrate it into the existing cost function for cepstral coefficient extraction shown in Section 2.
4.1. Proposed GP approximation

To obtain a closed-form and differentiable formula that relates spectral parameters to the Glimpse Proportion measure we have to make some approximations and correspondences. We first replace the hard decision for counting glimpses by a soft one defined by a sigmoid function. The proposed approximated Glimpse Proportion measure is then given by:

$$\mathrm{GP} = \frac{100}{N_f N_t} \sum_{t=1}^{N_t} \sum_{f=1}^{N_f} L(y^{sp}_{t,f} - y^{ns}_{t,f}) \quad (3)$$

where $y^{sp}_{t,f}$ and $y^{ns}_{t,f}$ are the approximated STEP representations for speech and noise respectively at analysis window $t$ and frequency channel $f$; $N_t$ and $N_f$ are the number of time frames and frequency channels respectively; $L(\cdot)$ is the logistic sigmoid function with zero offset and slope $\eta$.
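Eq. (3) replaces the hard glimpse threshold with a sigmoid so that the measure becomes differentiable in the spectral parameters. A minimal sketch (the default slope eta = 1 is an arbitrary choice for illustration):

```python
import numpy as np

def soft_gp(y_sp, y_ns, eta=1.0):
    """Approximated GP of Eq. (3): a logistic sigmoid L with zero
    offset and slope eta replaces the hard glimpse decision.

    y_sp, y_ns : (N_t, N_f) approximated STEP values for speech / noise.
    """
    diff = np.asarray(y_sp) - np.asarray(y_ns)
    sigmoid = 1.0 / (1.0 + np.exp(-eta * diff))  # L(y_sp - y_ns)
    return 100.0 * sigmoid.mean()                # 100/(N_t N_f) * sum
```

When speech and noise STEP values are equal everywhere the soft measure returns 50, and it approaches the hard count of the previous section as eta grows.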
We approximate the calculation of the STEP signal for speech and noise by performing it over the magnitude spectrum of the speech and the discrete Fourier transform representation of the noise, respectively. The absolute value operation representing the envelope extraction step is replaced by a circular convolution of the signal with itself. The filtering operations are replaced by truncated multiplications and the level compression is no longer considered. The STEP approximation, as shown in Fig. 1, is given by:

$$y^{sp}_{t,f} = \frac{1}{N} \, (G_f h_t \circledast_N G_f h_t)^{\top} S \, b \quad (4)$$

where $N$ is the number of frequency bins of the spectrum, $\circledast_N$ is the circular convolution operation of dimension $N$, and:
[Figure: block diagram of the proposed processing chain, from the magnitude spectrum h_t through Gammatone filtering (G_f), circular convolution (envelope extraction), envelope smoothing (S) and averaging (b) to y^sp_{t,f}.]

Fig. 1. Proposed approximation for the Spectro Temporal Excitation Pattern (STEP) calculation.
$h_t = [\,|H_t(\omega_1)| \ldots |H_t(\omega_N)|\,]^{\top}$ is an $N \times 1$ vector containing the magnitude spectrum of the windowed speech signal at analysis window $t$;

$G_f = \mathrm{diag}(g_{f,1} \ldots g_{f,N})$ is an $N \times N$ diagonal matrix whose diagonal contains the Gammatone filter frequency response for frequency channel $f$;

$S = \mathrm{diag}(\upsilon_1 \ldots \upsilon_N)$ is an $N \times N$ diagonal matrix whose diagonal contains the frequency response of the smoothing filter;

$b = [\,b_1 \ldots b_N\,]^{\top}$ is an $N \times 1$ vector containing the coefficients of the averaging filter.
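For one frame t and channel f, Eq. (4) can be computed directly; the circular convolution is conveniently done via the FFT. A sketch under the paper's definitions (the function and argument names are ours):

```python
import numpy as np

def step_approx(h_t, g_f, s_diag, b):
    """Approximated STEP value of Eq. (4) for one frame and channel.

    h_t    : magnitude spectrum of the windowed frame, length N
    g_f    : diagonal of the Gammatone response matrix G_f, length N
    s_diag : diagonal of the smoothing-filter matrix S, length N
    b      : averaging-filter coefficients, length N
    """
    n = len(h_t)
    x = g_f * h_t                                    # G_f h_t
    # circular convolution of x with itself (dimension N), via the FFT
    conv = np.real(np.fft.ifft(np.fft.fft(x) ** 2))
    # (conv)^T S b with S diagonal, scaled by 1/N
    return (conv * s_diag) @ b / n
```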
The approximated version of the GP measure proposed here obtains correlation coefficients that are smaller than, but still comparable to, those obtained by the original GP measure, and higher than those obtained by any other spectrum-based measure, when using the subjective data from the experiment described in [4].
4.2. Cost function reformulation

To maintain a compromise between the minimization of the cost function defined in Eq. (2) and the maximization of the intelligibility measure given by Eq. (3), we define a parameter β that controls the weight given to each criterion. The redefined cost function is:

$$E_t = \varepsilon_t - \beta \, \mathrm{GP}_t \quad (5)$$

where $\varepsilon_t$ is the value of the function described in Eq. (2) in time frame $t$ and $\mathrm{GP}_t$ is the time evolution of the GP as defined in Eq. (3):

$$\mathrm{GP}_t = \frac{100}{N_f} \sum_{f=1}^{N_f} L(y^{sp}_{t,f} - y^{ns}_{t,f}) \quad (6)$$

The cepstral coefficient vector $c_t = [c_{t,1} \ldots c_{t,m} \ldots c_{t,M}]^{\top}$ is given by:

$$c_t = \operatorname*{argmin}_{c_t} \big[\, \varepsilon_t - \beta \, \mathrm{GP}_t \,\big] \quad (7)$$

It is clear that when β = 0 the proposed cepstral extraction method reduces to the original method of Section 2.
4.3. Solving the optimization problem using Steepest Descent

The update equation for the cepstral coefficients using Steepest Descent is:

$$c_t^{(i+1)} = c_t^{(i)} - \mu \nabla E_t^{(i)} \quad (8)$$

where μ is the step size and the index i refers to iterations. From now on we drop the index i for clarity.

According to the definition of the error given by Eq. (5), the gradient vector is:

$$\nabla E_t = \nabla \varepsilon_t - \beta \nabla \mathrm{GP}_t \quad (9)$$

The formula expressing the value of $\nabla \varepsilon_t$ can be found in [6].
Considering the definition of the STEP function and of $\mathrm{GP}_t$ as given by Eqs. (4) and (6), we have that:

$$\nabla \mathrm{GP}_t = \frac{100}{N_f N} \sum_{f=1}^{N_f} \eta \, L(y^{sp}_{t,f} - y^{ns}_{t,f}) \big[ 1 - L(y^{sp}_{t,f} - y^{ns}_{t,f}) \big] \cdot \nabla H_{c_t} G_f (2 \, \Gamma_N \circledast_N G_f h_t) \, S \, b \quad (10)$$

where $\nabla H_{c_t}$ is an $M \times N$ matrix whose elements are $\{\nabla H_{c_t}\}_{m,j} = \partial |H_t(\omega_j)| / \partial c_{t,m}$, and the operation $(\Gamma_N \circledast_N G_f h_t)$ defines an $N \times N$ matrix of the following form:

$$\begin{bmatrix} e_1 \circledast_N (G_f h_t)^{\top} \\ e_2 \circledast_N (G_f h_t)^{\top} \\ \vdots \\ e_N \circledast_N (G_f h_t)^{\top} \end{bmatrix}$$

where $e_n$ is the $n$-th column of the identity matrix $\Gamma_N$.

When the spectrum is modeled by cepstral coefficients as defined in Eq. (1), the elements of the matrix $\nabla H_{c_t}$ are given by:

$$\frac{\partial |H_t(\omega_j)|}{\partial c_{t,m}} = |H_t(\omega_j)| \cos(m \omega_j) \quad (11)$$
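The update of Eq. (8) is a standard steepest-descent loop. The sketch below uses a gradient-normalised step, as in the experiments of Section 5 where the step size is scaled by $1/\|\nabla E_t^{(i)}\|$; grad_fn stands in for the gradient of Eq. (9), and the names and defaults are illustrative.

```python
import numpy as np

def steepest_descent(grad_fn, c0, mu=0.1, n_iter=200, tol=1e-8):
    """Iterate Eq. (8) with a gradient-normalised step:
    c <- c - mu * grad / ||grad||.

    grad_fn : callable returning the gradient of E_t (Eq. (9)) at c
    c0      : initial coefficients (e.g. the minimum phase cepstrum)
    """
    c = np.asarray(c0, dtype=float).copy()
    for _ in range(n_iter):
        g = grad_fn(c)
        norm = np.linalg.norm(g)
        if norm < tol:        # error-convergence stopping criterion
            break
        c = c - mu * g / norm
    return c
```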
4.4. Energy normalization

To avoid the trivial solution of maximizing the number of glimpses by increasing the overall energy level, and to see how much we can improve intelligibility given a fixed Signal to Noise Ratio (SNR), we need to make sure that the optimization does not change the total energy of the signal in each time frame.

We assume that the excitation signal has unit power, with a magnitude response that is constant over the whole frequency range for both voiced (single pulse) and unvoiced (white noise) segments. Under this assumption, and considering that the cepstral extraction method does not modify the excitation signal, Parseval's theorem tells us that in order to keep the energy in the time domain constant it is sufficient to keep the following quantity constant:

$$\psi = \sum_{j=1}^{N} |H(\omega_j)|^2 \quad (12)$$

An alternative to explicitly adding a constraint to the optimization problem is to normalize the spectrum at each iteration so that the signal in that frame has fixed energy. For this solution the only term that needs changing in the gradient vector $\nabla E_t$ is the one given by Eq. (11), which for $m \neq 0$ becomes:

$$\frac{\partial |H'_t(\omega_j)|}{\partial c_{t,m}} = |H'_t(\omega_j)| \left[ \cos(m \omega_j) - \frac{1}{\psi} \sum_{l=1}^{N} |H_t(\omega_l)|^2 \cos(m \omega_l) \right] \quad (13)$$

where $|H'_t(\omega_j)|$ is the energy-normalized magnitude spectrum. It is possible to prove that there is no need to update the first cepstral coefficient $c_0$ in this solution, as the normalization operation updates $c_0$ at each iteration to a certain value regardless of an additional $c_0$ term.
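The normalization alternative amounts to rescaling the magnitude spectrum after each update so that ψ of Eq. (12) stays fixed. A minimal sketch of that rescaling step:

```python
import numpy as np

def normalise_energy(h, psi):
    """Rescale a magnitude spectrum h = [|H(w_1)| ... |H(w_N)|] so that
    sum_j |H(w_j)|^2 equals psi (Eq. (12)), i.e. the frame energy is
    unchanged by the cepstral update (Parseval, unit-power excitation)."""
    return h * np.sqrt(psi / np.sum(h ** 2))
```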
5. EVALUATION

We conducted experiments with vocoded and synthetic speech. The results for HMM-synthetic speech show us the impact of the acoustic modelling on the effectiveness of the method.

5.1. Experimental conditions

The speech material we used to generate vocoded speech was the semantically unpredictable sentences (SUS) set from the Blizzard Challenge 2010. The samples were of a British male speaker sampled at 20 kHz. To train the models we used 1000 other sentences from the same speaker, also at 20 kHz. The same sentences used to generate vocoded speech were used as test sentences for the HMM-generated synthetic speech. As the synthesis filter we used the log spectrum approximation filter [6] with simple excitation as input.

Using the proposed method we extracted 52 cepstral coefficients for different β values, including the β = 0 case for comparison. The periodogram was set to be the smoothed spectrum extracted using STRAIGHT [10]. We initialized the algorithm with the first M + 1 values of the minimum phase cepstrum. The step size was set to $\mu^{(i)} = 1 / \|\nabla E_t^{(i)}\|$. We used both error convergence and maximum distortion as stopping criteria.

The acoustic model we used for synthetic speech was a hidden semi-Markov model. The observation vectors for the spectral and excitation parameters contained static, delta and delta-delta values. We used one stream for the spectrum and three streams for log F0. We used the Global Variance method [11] to compensate for the oversmoothing effect of the acoustic modeling.

For these experiments we added vocoded and HMM-generated synthetic speech to two different types of stationary noise: speech shaped noise (ssn) and high frequency noise (hf). Each noise type was added at a different SNR: 0 dB for ssn and 20 dB for hf.

For the listening test we played all signals over headphones to participants in soundproof booths. Each sentence could be played only once before the participant had to type in what he or she heard. A total of eight native English speakers participated in the experiment with vocoded speech and another eight participants were assigned to the experiment with synthetic speech. Each participant heard twelve different sentences per listening situation.
5.2. Results and discussion

Fig. 2 shows the Long Term Average Spectrum (LTAS) of vocoded speech generated using the original and the proposed method when the noise is speech shaped and the SNR is 0 dB. The figure also shows the LTAS of the noise. We can see that on average the proposed method reallocates energy mostly to the frequency range between 800 Hz and 4.8 kHz, the band where the human auditory system is most sensitive. The attenuation occurs mostly in the lower frequency regions, below 800 Hz. For the high frequency noise the energy boost occurs in a similar region, and we also observed some attenuation in the high frequency region, as this region is heavily masked by the noise. We observed that the proposed method improves not only the approximated GP measure introduced above but the original GP measure as well. This improvement was observed for all noise types and for both vocoded and synthetic speech.

Fig. 3 shows the word accuracy rates obtained in the listening test with vocoded (left) and synthetic speech (right). Each group mean is represented by a circle; two means are significantly different at the 0.05 level only if their intervals are disjoint.

[Figure: LTAS curves (sound pressure level in dB versus frequency, 0 to 10 kHz) for the original method, the proposed method and the noise.]

Fig. 2. Long term average spectrum curves extracted from vocoded speech generated using the original method (β = 0) and the proposed method (β ≠ 0) for speech shaped noise at 0 dB SNR.
We can see that the proposed method does not produce any significant difference in word accuracy for vocoded speech. However, for synthetic speech in speech shaped noise there is a significant improvement in word accuracy from 31% to 44% (a gain of 44%). For the high frequency noise case it seems that, although not significantly, the proposed method decreases the word accuracy rates. We believe this happens because the modifications imposed by such noise lead to less natural speech, which in turn degrades intelligibility. This could be solved by changing the acceptable amount of distortion and GP improvement, or by stating the amount of distortion as a constraint instead of a stopping criterion.

The impact of the proposed method seems to be stronger for synthetic speech even though the GP gains were smaller or similar for synthetic speech, most probably because in harder tasks smaller glimpse variations lead to stronger effects.
6. CONCLUSION

In this paper we showed how to use a measure of speech intelligibility in noise to modify HMM-synthetic speech and make it more intelligible in a given noise. We proposed a new cepstral extraction method that aims not only to minimize the mismatch between the periodogram and the modelled spectrum but also to maximize speech intelligibility in noise, as defined by the Glimpse Proportion measure, given that the noise is known and the SNR is known and fixed. The listening tests with vocoded and synthetic speech showed the effectiveness of the method for speech shaped noise but not for high frequency noise, which might indicate that the amount of distortion introduced into the speech by the modification was too large. Our next step is to handle distortion in a better way and then consider other types of constraints as well, for instance loudness. We also plan to compare our approach to natural Lombard speech, in particular in situations where humans are not fully able to change their own voice to successfully overcome the background noise.
Acknowledgment

The research leading to these results was partly funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements 213850 and 256230 (SCALE and LISTA).
[Figure: word accuracy rates (%) by noise type (high frequency, speech shaped) for the original and proposed methods; left panel vocoded speech, right panel synthetic speech.]

Fig. 3. Word accuracy rates of the listening test with vocoded (left) and synthetic (right) speech.
7. REFERENCES

[1] J. Yamagishi, H. Zen, Y.-J. Wu, T. Toda, and K. Tokuda, “Yet another evaluation of the speaker-adaptive HMM-based speech synthesis system in the 2008 Blizzard Challenge,” in Proc. Blizzard Challenge Workshop, Brisbane, Australia, Sept. 2008, vol. 5.

[2] A. Suni, T. Raitio, M. Vainio, and P. Alku, “The GlottHMM speech synthesis entry for Blizzard Challenge 2010,” in Proc. Blizzard Challenge Workshop, Kyoto, Japan, Sept. 2010.

[3] Y. Lu and M. Cooke, “The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise,” Speech Comm., vol. 51, no. 12, pp. 1253–1262, 2009.

[4] C. Valentini-Botinhao, J. Yamagishi, and S. King, “Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise?,” in Proc. Interspeech, Florence, Italy, August 2011.

[5] M. Cooke, “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, 2006.

[6] K. Tokuda, T. Kobayashi, and S. Imai, “Adaptive cepstral analysis of speech,” IEEE Trans. Speech and Audio Processing, vol. SA-3, no. 6, pp. 481–489, Nov. 1995.

[7] S. Imai and C. Furuichi, “Unbiased estimator of log spectrum and its application to speech signal processing,” in Proc. EURASIP, Grenoble, France, Sep. 1988, pp. 203–206.

[8] C. Valentini-Botinhao, J. Yamagishi, and S. King, “Evaluation of objective measures for intelligibility prediction of HMM-based synthetic speech in noise,” in Proc. ICASSP, Prague, Czech Republic, May 2011.

[9] B. C. J. Moore and B. R. Glasberg, “A revision of Zwicker's loudness model,” Acta Acustica, vol. 82, pp. 335–345, 1996.

[10] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds,” Speech Comm., vol. 27, pp. 187–207, 1999.

[11] T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007.