
Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise

TL;DR: A method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise is proposed.
Abstract: We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.



Edinburgh Research Explorer
Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise
Citation for published version:
Valentini-Botinhao, C, Yamagishi, J & King, S 2012, Mel cepstral coefficient modification based on the
Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. in
Proc. Interspeech.
Document Version:
Peer reviewed version
Published In:
Proc. Interspeech

Mel cepstral coefficient modification based on the Glimpse Proportion measure
for improving the intelligibility of HMM-generated synthetic speech in noise
Cassia Valentini-Botinhao, Junichi Yamagishi, Simon King
The Centre for Speech Technology Research, University of Edinburgh, UK
C.Valentini-Botinhao@sms.ed.ac.uk, jyamagis@inf.ed.ac.uk, Simon.King@ed.ac.uk
Index Terms: intelligibility of speech in noise, Mel cepstral
coefficients, HMM-based speech synthesis
1. Introduction
Humans change their speaking style when conversing in a noisy environment so that communication success is ensured, often producing what is called Lombard speech. It is unclear what aspects of Lombard speech actually contribute to intelligibility increases and how they relate to the nature of the noise. Solving this problem will enable practical applications which automatically modify natural or synthetic speech to increase intelligibility in noise.
The statistical parametric framework of HMM-based speech synthesis offers many different ways to approach this problem. If Lombard speech data are available for the speaker whose TTS voice we want to modify, we can use adaptation techniques to produce new Lombard-like speech for that speaker [1]. If such data are not available, then we can apply noise-independent modifications at the feature level based on known acoustic properties of Lombard speech, such as F0 increase, flattening of spectral tilt and duration stretch [1]. However, if we want to employ noise-dependent techniques, then we need to be able to automatically detect what sort of modifications should take place for a given pair of speech and noise signals. One way in which this can be done is by using an objective intelligibility measure of speech [2]. Such an approach is limited by the performance of the objective measure: if it fails to accurately predict intelligibility then any modification based on that prediction is likely to fail. Therefore, it is important to find a specific domain of modifications where the intelligibility model behaves well and to ensure that the modifications applied in this domain remain within the working range of the objective model.
We have observed that the Glimpse Proportion (GP) measure for speech intelligibility in noise [3] has a high correlation coefficient with subjective intelligibility scores for HMM-generated synthetic speech whose spectral envelope has been modified [4]. Moreover, modifications in the spectral envelope domain can achieve quite high intelligibility gains. We then proposed a cepstral extraction method based on the GP measure for the HMM-based synthesis framework [5]. This method was shown to provide significant intelligibility improvement, although not for all noise types. We hypothesise this is due to distortions introduced by the method itself. A disadvantage of that approach is having to train a different model for each noise type, because the noise-dependent modifications are performed as part of feature extraction. Here, we propose a method that can be applied at generation time and that does not require any information about the spectral envelope of natural speech to achieve distortion control. Rather, we propose to control the distortion in two ways: using a stopping criterion based on the mismatch between the auditory representations of modified and unmodified speech, as defined by the GP measure, and only modifying the first few cepstral coefficients, thus limiting the frequency resolution of the modifications. A further extension proposed in this paper is the possibility of using this method for Mel cepstral coefficients, which can provide higher speech quality with fewer coefficients [6].
In Sections 2 and 3 we show how Mel cepstral coefficients model the spectrum, how the GP measure works and how we previously approximated it for the purpose of cepstral coefficient optimization. In Section 4 we introduce the new method for Mel cepstral modification based on the GP measure. We then provide experimental results from listening experiments to support our conclusions.
2. Mel cepstral coefficients
We can represent the spectrum by M-th order Mel cepstral coefficients $\{c_m\}_{m=0}^{M}$ in the following manner [6]:

$$H(e^{j\omega}) = \exp \sum_{m=0}^{M} c_m \, e^{-jm\tilde{\omega}} \qquad (1)$$

$$\tilde{\omega} = \tan^{-1} \frac{(1 - \alpha^2)\sin\omega}{(1 + \alpha^2)\cos\omega - 2\alpha} \qquad (2)$$

where α is a warping factor which can be chosen to represent, for instance, the Mel scale [6].
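As an illustration of Eqs. (1) and (2), the following minimal Python sketch evaluates the warped frequency and the magnitude spectrum implied by a set of Mel cepstral coefficients. The function names, the uniform grid of frequency bins on [0, π] and the use of atan2 (to keep the warped frequency on the same branch as ω) are assumptions made for this example, not details taken from the paper.

```python
import math


def warp_frequency(omega, alpha):
    """Warped frequency of Eq. (2); atan2 keeps the result in [0, pi] for omega in [0, pi]."""
    numerator = (1.0 - alpha ** 2) * math.sin(omega)
    denominator = (1.0 + alpha ** 2) * math.cos(omega) - 2.0 * alpha
    return math.atan2(numerator, denominator)


def magnitude_spectrum(cepstrum, alpha, n_bins):
    """|H(e^{j omega})| on n_bins frequencies, uniformly spaced on [0, pi] (assumed grid).

    From Eq. (1), |H| = exp(sum_m c_m cos(m * warped_omega)), because the
    magnitude of exp(z) is exp(Re z). Requires n_bins >= 2.
    """
    spectrum = []
    for j in range(n_bins):
        omega = math.pi * j / (n_bins - 1)
        warped = warp_frequency(omega, alpha)
        log_magnitude = sum(c * math.cos(m * warped) for m, c in enumerate(cepstrum))
        spectrum.append(math.exp(log_magnitude))
    return spectrum
```

With α = 0 the warping is the identity, so the cepstrum [0, 1] gives a magnitude that falls from e at ω = 0 to 1/e at ω = π. With a positive α, low frequencies are stretched on the warped axis.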
C. Valentini-Botinhao, J. Yamagishi, and S. King. Mel cepstral coefficient modification based on the Glimpse
Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proc.
Interspeech, Portland, USA, September 2012.

3. The Glimpse Proportion measure
The Glimpse Proportion (GP) measure for speech intelligibility in noise [3] is the proportion of spectro-temporal regions, called glimpses, where speech is more energetic than noise. The motivation behind this measure is that when humans listen to speech in noise they tend to focus on such regions. The Spectro-Temporal Excitation Pattern (STEP) representation used by the measure is obtained in the following manner: Gammatone filtering, envelope extraction and smoothing, averaging over a time frame and level compression [3].
In [5] we showed how to approximate the GP measure in a way that provides a closed and differentiable formulation:

$$GP = \frac{100}{N_f N_t} \sum_{t=1}^{N_t} \sum_{f=1}^{N_f} L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \qquad (3)$$

where $N_t$ and $N_f$ are the number of time frames and frequency channels, $L(\cdot)$ is a logistic sigmoid function of zero offset and slope η, and $y^{sp}_{t,f}$ and $y^{ns}_{t,f}$ are the approximated STEP representations for speech and noise respectively at analysis window t and frequency channel f.
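A minimal sketch of Eq. (3) in Python may help to make the roles of the sigmoid and the normalization concrete. It assumes the STEP values are already available as nested lists (time frames by frequency channels); the function names and any slope value passed in are illustrative choices, not values from the paper.

```python
import math


def logistic(x, slope):
    """Logistic sigmoid with zero offset, written to avoid overflow for large |x|."""
    z = slope * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def glimpse_proportion(step_speech, step_noise, slope):
    """Approximated GP of Eq. (3), in percent.

    step_speech and step_noise are lists of time frames, each a list of
    frequency-channel values (the approximated STEP representations).
    """
    n_t = len(step_speech)
    n_f = len(step_speech[0])
    total = 0.0
    for speech_frame, noise_frame in zip(step_speech, step_noise):
        for y_sp, y_ns in zip(speech_frame, noise_frame):
            total += logistic(y_sp - y_ns, slope)
    return 100.0 * total / (n_f * n_t)
```

When speech and noise have identical STEP values every sigmoid evaluates to 0.5 and the measure is 50; it approaches 100 as speech dominates the noise in every time-frequency region.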
The STEP representation for speech is given by:

$$y^{sp}_{t,f} = \frac{1}{N} \left(G_f h_t \circledast_N G_f h_t\right)^{\top} S\, b \qquad (4)$$

where N is the number of frequency bins of the spectrum, $\circledast_N$ is circular convolution of dimension N, $h_t$ is an N×1 vector containing the magnitude spectrum of the windowed speech signal at analysis window t, $G_f$ is an N×N diagonal matrix whose diagonal contains the Gammatone filter frequency response for frequency channel f, S is an N×N diagonal matrix whose diagonal contains the frequency response of the smoothing filter and b is an N×1 vector containing the coefficients of the averaging filter.
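The structure of Eq. (4) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the diagonal matrices are passed as plain lists holding their diagonals, all quantities are real-valued, and the function names are hypothetical.

```python
def circular_convolution(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]


def step_value(gammatone, magnitude, smoothing, averaging):
    """Approximated STEP value of Eq. (4) for one frame and one channel.

    gammatone and smoothing hold the diagonals of G_f and S, magnitude is h_t
    and averaging is b. All four are length-N lists.
    """
    n = len(magnitude)
    # G_f h_t: Gammatone-filtered magnitude spectrum.
    filtered = [g * h for g, h in zip(gammatone, magnitude)]
    # Circular convolution of the filtered spectrum with itself.
    convolved = circular_convolution(filtered, filtered)
    # (...)^T S b reduces to a weighted sum because S is diagonal.
    return sum(v * s * b for v, s, b in zip(convolved, smoothing, averaging)) / n
```

For example, with N = 2, unit filters and h_t = [1, 2], the circular convolution is [5, 4] and the resulting value is (5 + 4) / 2 = 4.5.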
4. Mel cepstral modifications based on the GP measure
Given a set of Mel cepstral coefficients and a noise signal we want to obtain a new set of Mel cepstral coefficients $c_t = [c_{t,1} \ldots c_{t,m} \ldots c_{t,M}]^{\top}$ that maximizes $GP_t$, the value of the function described in Eq. (3) in time frame t. We then have:

$$c_t = \operatorname{argmax} GP_t \qquad (5)$$

$$GP_t = \frac{100}{N_f} \sum_{f=1}^{N_f} L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \qquad (6)$$
As this function is not necessarily convex with respect to the Mel cepstral coefficients, we use a Steepest Descent method to solve the optimization. The update equation is:

$$c_t^{(i+1)} = c_t^{(i)} + \mu \nabla GP_t^{(i)} \qquad (7)$$

where µ is the step size and the i index refers to iterations. From now on we drop the i index for clarity. The gradient vector is given by:

$$\nabla GP_t = \frac{100}{N_f N} \sum_{f=1}^{N_f} \eta\, L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \left[1 - L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right)\right] \cdot H_{c_t} G_f \left(2\, \Gamma_N \circledast_N G_f h_t\right) S\, b \qquad (8)$$

where $H_{c_t}$ is an M×N matrix whose elements are $\{H_{c_t}\}_{m,j} = \partial |H_t(\omega_j)| / \partial c_{t,m}$ and the operation $(\Gamma_N \circledast_N G_f h_t)$ defines an N×N matrix whose n-th row is equal to $e_n \circledast_N (G_f h_t)^{\top}$, $e_n$ being the n-th column of the identity matrix $\Gamma_N$.
When the spectrum is modelled by Mel cepstral coefficients as defined in Eq. (1) the elements of the matrix $H_{c_t}$ are given by:

$$\frac{\partial |H_t(\omega_j)|}{\partial c_{t,m}} = |H_t(\omega_j)| \cos(m\tilde{\omega}_j) \qquad (9)$$

However, because we do not wish to modify the energy of the speech signal we have:

$$\frac{\partial |H'_t(\omega_j)|}{\partial c_{t,m}} = |H'_t(\omega_j)| \left[\cos(m\tilde{\omega}_j) - \frac{1}{\psi} \sum_{l=1}^{N} |H_t(\omega_l)|^2 \cos(m\tilde{\omega}_l)\right] \qquad (10)$$

where $|H'_t(\omega_j)|$ is the energy-normalized magnitude spectrum and $\psi = \sum_{j=1}^{N} |H_t(\omega_j)|^2$. There is no need to update the first Mel cepstral coefficient $c_0$ as the normalization operation updates it to a certain value regardless of an additional $c_0$ term.
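The energy-normalized derivative in Eq. (10) can be checked numerically. The sketch below assumes that the normalized spectrum is |H'| = |H| / sqrt(ψ), i.e. scaled to unit energy (the paper does not spell out the scaling; any constant factor leaves the form of Eq. (10) unchanged), and compares the analytic expression with a central finite difference. All function names are hypothetical.

```python
import math


def magnitudes(cepstrum, warped):
    """|H_t(omega_j)| = exp(sum_m c_m cos(m * warped_j)) for each warped frequency."""
    return [math.exp(sum(c * math.cos(m * w) for m, c in enumerate(cepstrum)))
            for w in warped]


def normalized_magnitudes(cepstrum, warped):
    """Energy-normalized spectrum |H'| = |H| / sqrt(psi) (assumed unit-energy scaling)."""
    mags = magnitudes(cepstrum, warped)
    psi = sum(x * x for x in mags)
    return [x / math.sqrt(psi) for x in mags]


def normalized_derivative(cepstrum, warped, m):
    """Right-hand side of Eq. (10): d|H'(omega_j)| / dc_m for every bin j."""
    mags = magnitudes(cepstrum, warped)
    psi = sum(x * x for x in mags)
    weighted_cos = sum(x * x * math.cos(m * w) for x, w in zip(mags, warped)) / psi
    return [x / math.sqrt(psi) * (math.cos(m * w) - weighted_cos)
            for x, w in zip(mags, warped)]


def finite_difference(cepstrum, warped, m, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    plus = list(cepstrum)
    minus = list(cepstrum)
    plus[m] += eps
    minus[m] -= eps
    upper = normalized_magnitudes(plus, warped)
    lower = normalized_magnitudes(minus, warped)
    return [(a - b) / (2.0 * eps) for a, b in zip(upper, lower)]
```

For m = 0 the bracket in Eq. (10) is 1 - 1 = 0 in every bin, which is the reason given in the text for leaving $c_0$ out of the update.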
An issue we face when using the GP measure as an optimization criterion on its own is the need to limit the distortions caused by the modifications. To quantify audible distortion we use the Euclidean distance between the STEP representations of modified and unmodified speech. Including this as an explicit constraint is unfortunately rather cumbersome, so instead we use it as a stopping criterion whilst at the same time limiting the frequency resolution of the modifications. To implement the latter, we simply set the gradient vector for higher dimensions to zero, thus modifying only the first few Mel cepstral coefficients, which represent the coarse properties of the spectrum.
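The idea of zeroing the gradient for higher dimensions can be sketched as a single masked update of Eq. (7). The function name and argument layout below are illustrative assumptions; the gradient itself would come from Eq. (8).

```python
def masked_update(coefficients, gradient, step_size, n_modified):
    """One iteration of Eq. (7) that only changes the first n_modified coefficients.

    coefficients and gradient are lists [c_1, ..., c_M]; c_0 is excluded, as in
    the text. Gradient entries beyond n_modified are treated as zero.
    """
    updated = []
    for index, (c, g) in enumerate(zip(coefficients, gradient)):
        masked_g = g if index < n_modified else 0.0
        updated.append(c + step_size * masked_g)
    return updated
```

With n_modified = 2, for instance, only $c_1$ and $c_2$ move, whatever the gradient says about the higher-order coefficients.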
5. Evaluation
In this section we show how we built the TTS voices, give an
acoustic analysis, and present the results of a listening test.
5.1. Voice building
To build the voices used in this evaluation we used two different datasets recorded by the same British male speaker: normal (plain, read-text) speech data and Lombard speech. The Lombard dataset was recorded while the speaker listened to speech-modulated noise based on another male speaker [7], played over headphones at an absolute level of 84 dBA.
We built eight different voices as outlined in Table 1. Voice N was created from a high quality average voice model adapted to 2803 sentences of the normal speech database, corresponding to three hours of material. We decided to use an average voice model rather than building a speaker-dependent voice because the normal speech dataset was not phonetically balanced. Voices N-M59, N-M10 and N-M2 are variations of N in which we modify all, just the first ten ($c_1$ to $c_{10}$), or just the first two ($c_1$ and $c_2$) Mel cepstral coefficients using our proposed method.
Lombard voice L was based on voice N, further adapted using 780 sentences from the Lombard speech dataset, corresponding to 53 minutes of recorded material. Again, the reason for using adaptation was the lack of phonetic balance in the speech dataset. Voice N-L was also created from voice N but this time only the Mel cepstral coefficients were adapted to the Lombard data. Voices L-E and L-E-M2 are versions of voice L where we extrapolated the adaptation (voice L-E), and then also modified the first two Mel cepstral coefficients using the proposed method (voice L-E-M2).

Voice     Adaptation                      Modification
N         -                               -
N-M59     -                               all coefficients
N-M10     -                               first 10 coefficients
N-M2      -                               first 2 coefficients
N-L       only spectral parameters        -
L         all dimensions                  -
L-E       all dimensions, extrapolated    -
L-E-M2    all dimensions, extrapolated    first 2 coefficients
Table 1: Voices built for the evaluation.
The training and adaptation data had a sampling rate of 48 kHz. To train, adapt and generate speech we extracted: 59 Mel cepstral coefficients with α = 0.77, Mel scale F0, and 25 aperiodicity energy bands extracted using STRAIGHT [8]. We used a hidden semi-Markov model. The observation vectors for the spectral and excitation parameters contained static, delta and delta-delta values; one stream for the spectrum, three streams for the logF0 and one for the band-limited aperiodicity. The Global Variance method [9] was also applied to compensate for the over-smoothing effect of the acoustic modelling.
To modify the generated Mel cepstral coefficients we used the method proposed in the previous section, obtaining the STEP representation by using Gammatone filters that covered the range 50-7500 Hz, as the noise signal used for testing is sampled at 16 kHz. The step size was normalized, $\mu^{(i)} = \mu / \|\nabla GP_t^{(i)}\|$, and we set µ = 0.4 for N-M59 and µ = 0.8 for N-M10 and N-M2. We used two stopping criteria: error convergence and a maximum distortion threshold, set to a 10 % relative increase in the Euclidean distance between the STEP representations of original and modified speech.
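The iteration described above (normalized step size, gradient limited to the first few coefficients, and two stopping criteria) can be sketched as follows. This is not the authors' implementation: the objective, gradient and distortion functions are caller-supplied placeholders, the distortion is assumed to be returned as a relative value so that it can be compared with 0.10, and the tolerance and iteration limit are hypothetical defaults.

```python
import math


def optimise(coefficients, objective, gradient, distortion,
             step=0.8, n_modified=2, max_distortion=0.10,
             tolerance=1e-4, max_iterations=100):
    """Normalized-step gradient ascent with two stopping criteria.

    objective, gradient and distortion are functions of the coefficient list
    (hypothetical interfaces). distortion is assumed to return the relative
    STEP distance between modified and unmodified speech.
    """
    current = list(coefficients)
    value = objective(current)
    for _ in range(max_iterations):
        # Zero the gradient beyond the first n_modified coefficients.
        grad = [g if k < n_modified else 0.0
                for k, g in enumerate(gradient(current))]
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        # Step size normalized by the gradient norm.
        candidate = [c + step / norm * g for c, g in zip(current, grad)]
        if distortion(candidate) > max_distortion:
            break  # maximum distortion threshold reached
        new_value = objective(candidate)
        if new_value - value < tolerance:
            break  # objective no longer improving
        current, value = candidate, new_value
    return current
```

A toy call with a quadratic objective, such as `optimise([0.0], lambda c: -(c[0] - 1.0) ** 2, lambda c: [-2.0 * (c[0] - 1.0)], lambda c: 0.0, step=0.5, n_modified=1)`, shows the loop stopping once the gradient vanishes.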
5.2. Acoustic analysis
Fig. 1 shows the Long Term Average Spectrum (LTAS) of the normal (N), modified (N-M2) and Lombard (L) voices, for the case of speech-shaped noise. Compared to voice N, voice N-M2 exhibits enhanced energy in the frequency region of 1-4 kHz and attenuated energy below 1 kHz. Voice L shows enhancement and attenuation in the same regions as N-M2, although these changes are not as pronounced; attenuation is also seen between 4 and 5.5 kHz, and enhancement at frequencies above this.
Table 2 provides an acoustic analysis of the voices: average duration of speech and pauses, average spectral tilt, and F0 across all sentences used in the listening test for the normal (N), modified (N-M2) and Lombard (L) voices. We can see that, as expected, the Lombard voice produces sentences with longer duration and longer pauses, greatly increased F0 mean and flattening of the spectral tilt. The spectral tilt reflects changes in both spectral envelope and excitation signal. The modified voice N-M2 also presents a flatter spectral tilt, though not to the same extent as the Lombard voice.
Figure 1: Long term average spectrum of the normal N, normal modified N-M2 and Lombard L voices for speech-shaped noise. [Plot: sound pressure level (dB) against frequency (kHz), with curves for N, N-M2, L and the noise.]

Voice    speech (secs.)    pauses (secs.)    F0 (Hz)    spectral tilt (dB/octave)
N        2.11              0.16              104.5      -2.24
N-M2     (as N)            (as N)            (as N)     -1.88
L        2.80              0.19              145.0      -1.70
Table 2: Acoustic properties observed in normal N, modified N-M2 and Lombard L voices.

5.3. Listening experiments design
We mixed the eight different synthetic voices with two noises: speech-shaped noise and speech from a single competing female talker. For intelligibility testing, it is important to avoid floor or ceiling effects on word error rate. Therefore, in order to obtain intelligibility scores in similar ranges for each noise, we mixed them at differing SNRs: -4 dB for speech-shaped noise and -14 dB for the competing talker. Across the different voices we made sure that the root mean square value was the same.
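As a rough illustration of mixing at a fixed SNR with RMS-equalized speech, the sketch below scales the noise relative to the speech. The SNR is taken as 20 log10 of the ratio of RMS values computed over the whole signals, which is an assumption, since the paper does not state the convention used; the function names are hypothetical.

```python
import math


def rms(signal):
    """Root mean square value of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def scale_to_rms(signal, target_rms):
    """Rescale a signal so that its RMS equals target_rms."""
    gain = target_rms / rms(signal)
    return [gain * x for x in signal]


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling the noise to the requested SNR in dB.

    SNR is taken here as 20 * log10(rms(speech) / rms(noise)) over the whole
    signals (assumed convention).
    """
    noise_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
    scaled_noise = scale_to_rms(noise, noise_rms)
    return [s + n for s, n in zip(speech, scaled_noise)]
```

Calling `scale_to_rms` on every voice with the same target before `mix_at_snr(speech, noise, -4.0)` would give all voices the same RMS value and the same nominal SNR.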
For the listening test we used 32 native English speakers, who listened to the noisy samples over headphones in soundproof booths and typed in what they heard. Each participant heard six different sentences per condition, i.e., voice and noise type, and each sentence could only be played once. We used the first ten sets of the Harvard sentences [10]; another one of the sets was used for a practice session which listeners completed before the test proper.
5.4. Results and discussion
Figs. 2 and 3 show the mean word accuracy rate (WAR) obtained by each voice when mixed with speech-shaped noise and a competing talker respectively, along with 95 % confidence intervals. Fig. 2 shows that the modified voices N-M59, N-M10 and N-M2 achieve higher WAR than the unmodified voice N (40.9 %), and this is significantly higher for N-M10 (50.6 %) and N-M2 (57.8 %). The N-M2 voice obtains a higher WAR than the N-L voice (49.4 %). The Lombard voices L (63.5 %), L-E (68.1 %) and L-E-M2 (70.1 %) performed better than the normal speech voices although we did not find a significant difference between N-M2 and L. The extrapolated voice L-E is more intelligible than voice L, a trend that is further enhanced by applying our modifications to it, as in voice L-E-M2. The results obtained for the competing talker situation are displayed in Fig. 3 and show a slightly different trend. There is a drop in performance for N-M59 and N-M10 when compared to N (36.6 %), although this is not significant. The N-M2 (42.7 %) voice performs better than the unmodified counterpart N and obtains a similar WAR to N-L (43.6 %). All Lombard voices performed significantly better than the other voices, in particular the L voice (62.2 %). The other versions, L-E (60.5 %) and L-E-M2 (59.3 %), do not appear to increase intelligibility.

Figure 2: Word accuracy rates for speech-shaped noise. [Bar plot: word accuracy rate (%) for voices N, N-M59, N-M10, N-M2, N-L, L, L-E and L-E-M2.]
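To make the size of the reported differences easier to compare, the short Python sketch below computes the absolute gain of each voice over voice N, using only the mean WAR values quoted in the text. Voices whose values are not quoted are left out, and the helper name is an illustrative choice.

```python
# Mean word accuracy rates (%) as quoted in the text.
# N-M59 is omitted for both noises, and N-M10 for the competing talker,
# because their values are not quoted.
WAR_SPEECH_SHAPED = {"N": 40.9, "N-M10": 50.6, "N-M2": 57.8, "N-L": 49.4,
                     "L": 63.5, "L-E": 68.1, "L-E-M2": 70.1}
WAR_COMPETING_TALKER = {"N": 36.6, "N-M2": 42.7, "N-L": 43.6,
                        "L": 62.2, "L-E": 60.5, "L-E-M2": 59.3}


def gains_over_baseline(rates, baseline="N"):
    """Absolute gain in percentage points of each voice over the baseline voice."""
    return {voice: round(rate - rates[baseline], 1)
            for voice, rate in rates.items() if voice != baseline}
```

For example, N-M2 gains 16.9 percentage points over N in speech-shaped noise but 6.1 points with the competing talker, while L gains 22.6 and 25.6 points respectively.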
As predicted by our hypothesis that distortions were defeating potential gains in intelligibility in our previously-published experiments [5], the voices where we modify only the first few Mel cepstral coefficients achieved a better WAR, indicating that very fine frequency modifications cause distortions that cancel out any potential intelligibility gain they may offer. Compared to the N-L voice, for which the spectral parameters were obtained using Lombard speech, the modifications proposed here obtained a similar or higher intelligibility score. The intelligibility gains obtained by the full Lombard voice L over the N-L voice reflect the impact of changes in duration patterns, F0 and the aperiodicity parameters that define the excitation signal, as pointed out in Table 2. We can see, then, that there is a lot to gain from modifying those parameters in addition to the spectral ones. The spectral modifications proposed here increased the gains obtained with the Lombard voice for speech-shaped noise, as we can see from the results for voice L-E-M2, which shows that there are still gains to be had over and above simply building voices on recorded Lombard speech.
For the competing talker, spectral changes seem to contribute less than for speech-shaped noise, while duration stretches and F0 increases are more important. This suggests that for non-stationary noise it is more effective to perform temporal energy re-allocation (e.g., taking advantage of quiet or silent regions in the noise signal) than it is to reallocate energy across different frequencies.
Figure 3: Word accuracy rates for competing talker. [Bar plot: word accuracy rate (%) for voices N, N-M59, N-M10, N-M2, N-L, L, L-E and L-E-M2.]

6. Conclusions
We have proposed a new method for modifying Mel cepstral coefficients based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. We showed how to control distortion by modifying only the first few Mel cepstral coefficients, which is a natural way of limiting the frequency resolution of the modifications. In the evaluation, we compared synthetic voices whose spectral parameters were modified by our method with voices using spectral parameters from Lombard speech. Listening tests using speech-shaped noise and a competing talker indicate that we only need to modify two Mel cepstral coefficients to obtain a similar or higher intelligibility to Lombard spectral modifications. Moreover we observed that, for the competing talker, the intelligibility gain obtained by the Lombard voice over the modified voice was mainly due to changes in duration, F0 and excitation parameters. In terms of what can be achieved when modifying only Mel cepstral coefficients, our method obtains either higher or similar intelligibility scores to Lombard Mel cepstral coefficients. We are currently making a more extensive comparison of our method to other intelligibility enhancement methods. In future, we plan to investigate reallocating energy across time. We also plan to operate under a loudness constraint rather than an energy one.
7. Acknowledgment
The research leading to these results was partly funded from
the European Community’s Seventh Framework Programme
(FP7/2007-2013) under grant agreement 213850 (SCALE) and
256230 (LISTA), and from EPSRC grants EP/I031022/1 and
EP/J002526/1.
8. References
[1] T. Raitio, A. Suni, M. Vainio, and P. Alku, “Analysis of HMM-based Lombard speech synthesis,” in Proc. Interspeech, Florence, Italy, August 2011.
[2] B. Sauert and P. Vary, “Near end listening enhancement: Speech intelligibility improvement in noisy environments,” in Proc. ICASSP, Toulouse, France, May 2006, pp. 493–496.
[3] M. Cooke, “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, 2006.
[4] C. Valentini-Botinhao, J. Yamagishi, and S. King, “Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise?” in Proc. Interspeech, Florence, Italy, August 2011.
[5] C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen, “Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise,” in Proc. ICASSP, Kyoto, Japan, March 2012.
[6] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” in Proc. ICASSP, vol. 1, San Francisco, USA, March 1992, pp. 137–140.
[7] W. Dreschler, H. Verschuure, C. Ludvigsen, and S. Westermann, “ICRA noises: artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment. International Collegium for Rehabilitative Audiology,” Audiology, vol. 40, no. 3, pp. 148–157, 2001.
[8] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds,” Speech Comm., vol. 27, pp. 187–207, 1999.
[9] T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007.
[10] “IEEE recommended practice for speech quality measurements,” Audio and Electroacoustics, IEEE Transactions on, vol. 17, no. 3, pp. 225–246, Sep. 1969.
Citations
More filters
Journal ArticleDOI
TL;DR: The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech.

115 citations


Cites methods from "Mel cepstral coefficient modificati..."

  • ...The first two Mel cepstral coefficients were modified (excluding the log-energy coefficient) in order to maximise intelligibility of speech in noise as given by an approximated version of the glimpse proportion measure (Cooke, 2006; Valentini-Botinhao et al., 2012a)....

    [...]

  • ...To create the ‘TTSGP’ type a Mel cepstral coefficient modification method (Valentini-Botinhao et al., 2012b) was applied to the spectral parameters generated by the TTS type....

    [...]

  • ...…audio power reallocation based on the Speech Intelligibility Index (Sauert and Vary, 2010, 2011) or glimpse proportion (Tang and Cooke, 2012), cepstral extraction based on the glimpse proportion measure (Valentini-Botinhao et al., 2012a), and the insertion of small pauses (Tang and Cooke, 2011)....

    [...]

Proceedings ArticleDOI
25 Aug 2013
TL;DR: Surprisingly, for most conditions the largest gains were observed for noise-independent algorithms, suggesting that performance in this task can be further improved by exploiting information in the masking signal.
Abstract: Speech output is used extensively, including in situations where correct message reception is threatened by adverse listening conditions. Recently, there has been a growing interest in algorithmic modifications that aim to increase the intelligibility of both natural and synthetic speech when presented in noise. The Hurricane Challenge is the first large-scale open evaluation of algorithms designed to enhance speech intelligibility. Eighteen systems operating on a common data set were subjected to extensive listening tests and compared to unmodified natural and text-to-speech (TTS) baselines. The best-performing systems achieved gains over unmodified natural speech of 4.4 and 5.1 dB in competing speaker and stationary noise respectively, while TTS systems made gains of 5.6 and 5.1 dB over their baseline. Surprisingly, for most conditions the largest gains were observed for noise-independent algorithms, suggesting that performance in this task can be further improved by exploiting information in the masking signal.

73 citations


Cites methods from "Mel cepstral coefficient modificati..."

  • ...To enhance the spectral envelope a noise-dependent optimisation based on the glimpse proportion measure was performed [29]....

    [...]

Journal ArticleDOI
TL;DR: A method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech such that the glimpse proportion - an objective measure of the intelligibility of speech in noise - increases, while keeping the speech energy fixed.

28 citations


Cites methods from "Mel cepstral coefficient modificati..."

  • ...We then proposed a method to extract cepstral coefficients which maximized the GP measure (Valentini-Botinhao et al., 2012a)....

  • ...Our solution to this was to modify the generated speech instead (Valentini-Botinhao et al., 2012b), by modifying the Mel cepstral coefficients....

Journal ArticleDOI
TL;DR: Evaluates how well seven OIMs predict listener responses in three large datasets of modified and synthetic speech, which together represent 396 combinations of speech modification, masker type and signal-to-noise ratio.

26 citations

References
Journal ArticleDOI
TL;DR: It is demonstrated that the ICRA noises show the effectiveness of the noise reduction schemes, and some initial steps are proposed to develop a standard method of technical specification of noise reduction based on the modulation characteristics.
Abstract: Current standards involving technical specification of hearing aids provide limited possibilities for assessing the influence of the spectral and temporal characteristics of the input signal, and these characteristics have a significant effect on the output signal of many recent types of hearing aids. This is particularly true of digital hearing instruments, which typically include non-linear amplification in multiple channels. Furthermore, these instruments often incorporate additional non-linear functions such as “noise reduction” and “feedback cancellation”. The output signal produced by a non-linear hearing instrument relates to the characteristics of the input signal in a complex manner. Therefore, the choice of input signal significantly influences the outcome of any acoustic or psychophysical assessment of a non-linear hearing instrument. For this reason, the International Collegium for Rehabilitative Audiology (ICRA) has introduced a collection of noise signals that can be used for hearing aid tes...

248 citations

Proceedings ArticleDOI
14 May 2006
TL;DR: A digital signal processing algorithm to improve intelligibility of clean far end speech for the near end listener who is located in an environment with background noise is presented.
Abstract: In contrast to common noise reduction systems, this contribution presents a digital signal processing algorithm to improve intelligibility of clean far end speech for the near end listener who is located in an environment with background noise. Since the noise reaches the ears of the near end listener directly and therefore can hardly be influenced, a sensible option is to manipulate the far end speech. The proposed algorithm raises the average speech spectrum over the average noise spectrum and takes precautions to prevent hearing damage. Informal listening tests and the Speech Intelligibility Index indicate an improved speech intelligibility.

114 citations


"Mel cepstral coefficient modificati..." refers background in this paper

  • ...One way in which this can be done is by using an intelligibility measure of speech [2]....

Proceedings ArticleDOI
27 Aug 2011
TL;DR: Comparing several methods of synthesizing speech in noise using a physiologically based statistical speech synthesis system (GlottHMM) shows that in a realistic street noise situation the synthetic Lombard speech is judged by listeners both as appropriate for the situation and as intelligible as natural Lombard speech.
Abstract: Humans modify their voice in interfering noise in order to maintain the intelligibility of their speech – this is called the Lombard effect. This ability, however, has not been extensively modeled in speech synthesis. Here we compare several methods of synthesizing speech in noise using a physiologically based statistical speech synthesis system (GlottHMM). The results show that in a realistic street noise situation the synthetic Lombard speech is judged by listeners both as appropriate for the situation and as intelligible as natural Lombard speech. Of the different types of models, one using adaptation and extrapolation performed the best.

51 citations

Proceedings Article
01 Aug 2011
TL;DR: Analyses the impact on intelligibility, and on how well objective measures predict it, when speaking rate, fundamental frequency, line spectral pairs and spectral peaks are modified separately.
Abstract: Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility – and on how well objective measures predict it – when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.

39 citations


"Mel cepstral coefficient modificati..." refers background in this paper

  • ...We have observed that the Glimpse Proportion (GP) measure for speech intelligibility in noise [3] has a high correlation coefficient with subjective intelligibility scores for HMM-generated synthetic speech whose spectral envelope has been modified [4]....

Proceedings ArticleDOI
25 Mar 2012
TL;DR: A new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure, that can significantly improve intelligibility of synthetic speech in speech shaped noise is introduced.
Abstract: In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve intelligibility of synthetic speech in speech shaped noise.

14 citations


"Mel cepstral coefficient modificati..." refers background or methods in this paper

  • ...In [5] we showed how to approximate the GP measure in a way that provides a closed and differentiable formulation:...

  • ...As predicted by our hypothesis that distortions were defeating potential gains in intelligibility in our previously-published experiments [5], the voices where we modify only the first few Mel cepstral coefficients achieved a better WAR, indicating that very fine frequency modifications cause distortions that cancel out any potential intelligibility gain they may offer....

  • ...We then proposed a cepstral extraction method based on the GP measure for the HMM-based synthesis framework [5]....

Frequently Asked Questions (10)
Q1. What are the contributions mentioned in the paper "Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise"?

The authors propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation the authors previously proposed for the Glimpse Proportion measure. Here the authors show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. 

In future, the authors plan to investigate reallocating energy across time. They also plan to operate under a loudness constraint rather than an energy constraint.

As stopping criteria, the authors used both error convergence and a maximum distortion threshold, set to a 10% relative increase in the Euclidean distance between the STEP representations of the original and modified speech.
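The two stopping criteria can be sketched as a single check run after each update. This is only an illustration under stated assumptions: the function name `should_stop`, the convergence tolerance `tol`, and the normalisation of the distance by the norm of the original STEP vector are all hypothetical choices, since the text above does not specify how the relative increase is computed.

```python
import math


def should_stop(errors, step_original, step_modified,
                tol=1e-4, max_rel_distortion=0.10):
    """Return True when either stopping criterion is met.

    errors             -- history of the optimisation error, one value per iteration
    step_original      -- STEP values of the unmodified speech (flat list of floats)
    step_modified      -- STEP values of the modified speech (same length)
    tol                -- hypothetical convergence tolerance (not from the paper)
    max_rel_distortion -- 0.10 mirrors the 10% threshold quoted above
    """
    # Criterion 1: the error has stopped changing between iterations.
    converged = len(errors) >= 2 and abs(errors[-2] - errors[-1]) < tol

    # Criterion 2: Euclidean distance between the two STEP vectors,
    # normalised by the norm of the original (an assumed normalisation).
    distance = math.sqrt(sum((o - m) ** 2
                             for o, m in zip(step_original, step_modified)))
    norm = math.sqrt(sum(o * o for o in step_original))
    rel_distortion = distance / norm if norm > 0 else 0.0

    return converged or rel_distortion >= max_rel_distortion
```

Calling `should_stop` once per iteration lets an optimisation loop exit as soon as the error settles or the modification becomes too audible, whichever comes first.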

To train, adapt and generate speech, the authors extracted 59 Mel cepstral coefficients with α = 0.77, Mel-scale F0, and 25 aperiodicity energy bands extracted using STRAIGHT [8].
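In Mel cepstral analysis, α is conventionally the coefficient of a first-order all-pass filter that warps the frequency axis towards a Mel-like scale. Assuming that is the role of α = 0.77 here, the sketch below shows the standard warping function for such a filter. The function name `warp_frequency` is hypothetical and the code is not taken from the paper.

```python
import math


def warp_frequency(omega, alpha=0.77):
    """Warped angular frequency (radians) for a first-order all-pass filter.

    omega -- angular frequency in [0, pi]
    alpha -- all-pass coefficient, |alpha| < 1; 0.77 is the value quoted above
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


# Print how a few linear frequencies map onto the warped axis.
for fraction in (0.0, 0.1, 0.25, 0.5, 1.0):
    omega = fraction * math.pi
    print(f"{fraction:4.2f} * pi -> {warp_frequency(omega) / math.pi:5.3f} * pi")
```

With a positive α the mapping keeps the endpoints 0 and π fixed and stretches the low-frequency region, so the cepstral coefficients give finer resolution to low frequencies.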

For the listening test, the authors used 32 native English speakers, who listened to the noisy samples over headphones in soundproof booths and typed in what they heard.

The Glimpse Proportion (GP) measure for speech intelligibility in noise [3] is the proportion of spectral-temporal regions called glimpses where speech is more energetic than noise. 
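The definition above can be illustrated with a minimal sketch that counts glimpses on a time-frequency grid. It assumes the speech and noise are already available as per-cell energies in dB, which skips the auditory front end that produces those representations. The function name `glimpse_proportion` and the optional margin `threshold_db` are hypothetical, and this is the hard-count definition rather than the differentiable approximation used for optimisation.

```python
def glimpse_proportion(speech_db, noise_db, threshold_db=0.0):
    """Fraction of time-frequency cells in which speech exceeds noise.

    speech_db, noise_db -- lists of frames, each frame a list of energies in dB
                           (same shape for both signals)
    threshold_db        -- assumed optional margin by which speech must exceed
                           noise; 0.0 matches "more energetic than noise"
    """
    total = 0
    glimpsed = 0
    for speech_frame, noise_frame in zip(speech_db, noise_db):
        for speech_energy, noise_energy in zip(speech_frame, noise_frame):
            total += 1
            if speech_energy > noise_energy + threshold_db:
                glimpsed += 1
    return glimpsed / total if total else 0.0


# Toy example: 2 frames x 2 frequency channels.
speech = [[10.0, 0.0], [5.0, 5.0]]
noise = [[0.0, 10.0], [5.0, 0.0]]
print(glimpse_proportion(speech, noise))
```

In the toy example two of the four cells have speech above noise, so the function returns 0.5. Raising `threshold_db` makes the criterion stricter and can only lower the result.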

If Lombard speech data are not available, then the authors can apply noise-independent modifications at the feature level based on known acoustic properties of Lombard speech, such as F0 increase, flattening of spectral tilt and duration stretch [1].

The intelligibility gains obtained by the full Lombard voice L over the N-L voice reflect the impact of changes in duration patterns, F0 and the aperiodicity parameters that define the excitation signal, as pointed out in Table 2. 

The authors decided to use an average voice model rather than building a speaker-dependent voice because the normal speech dataset was not phonetically balanced. 

Moreover, the authors observed that, for the competing talker, the intelligibility gain obtained by the Lombard voice over the modified voice was mainly due to changes in duration, F0 and excitation parameters.