Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise

01 Sep 2012, pp. 631-634



Citation for published version:
Valentini-Botinhao, C, Yamagishi, J & King, S 2012, Mel cepstral coefficient modification based on the
Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. in
Proc. Interspeech.
Document Version:
Peer reviewed version
Published In:
Proc. Interspeech

Mel cepstral coefficient modification based on the Glimpse Proportion measure
for improving the intelligibility of HMM-generated synthetic speech in noise
Cassia Valentini-Botinhao, Junichi Yamagishi, Simon King
The Centre for Speech Technology Research, University of Edinburgh, UK
C.Valentini-Botinhao@sms.ed.ac.uk, jyamagis@inf.ed.ac.uk, Simon.King@ed.ac.uk
Abstract
We propose a method that modifies the Mel cepstral coefficients
of HMM-generated synthetic speech in order to increase the in-
telligibility of the generated speech when heard by a listener
in the presence of a known noise. This method is based on
an approximation we previously proposed for the Glimpse Pro-
portion measure. Here we show how to update the Mel cep-
stral coefficients using this measure as an optimization crite-
rion and how to control the amount of distortion by limiting
the frequency resolution of the modifications. To evaluate the
method we built eight different voices from normal read-text
speech data from a male speaker. Some voices were also built
from Lombard speech data produced by the same speaker. Lis-
tening experiments with speech-shaped noise and with a sin-
gle competing talker indicate that our method significantly im-
proves intelligibility when compared to unmodified synthetic
speech. The voices built from Lombard speech outperformed
the proposed method particularly for the competing talker case.
However, compared to a voice using only the spectral parame-
ters from Lombard speech, the proposed method obtains similar
or higher performance.
Index Terms: intelligibility of speech in noise, Mel cepstral
coefficients, HMM-based speech synthesis
1. Introduction
Humans change their speaking style when conversing in a noisy
environment so that communication success is ensured, often
producing what is called Lombard speech. It is unclear what
aspects of Lombard speech actually contribute to intelligibility
increases and how they relate to the nature of the noise. Solving
this problem will enable practical applications which automati-
cally modify natural or synthetic speech to increase intelligibil-
ity in noise.
The statistical parametric framework of HMM-based
speech synthesis offers many different ways to approach this
problem. If Lombard speech data are available for the speaker
whose TTS voice we want to modify, we can use adapta-
tion techniques to produce new Lombard-like speech for that
speaker [1]. If such data are not available, then we can apply
noise-independent modifications at the feature level based on
known acoustic properties of Lombard speech, such as F0 in-
crease, flattening of spectral tilt and duration stretch [1]. How-
ever if we want to employ noise-dependent techniques then we
need to be able to automatically detect what sort of modifica-
tions should take place for certain pairs of speech and noise
signals. One way in which this can be done is by using an in-
telligibility measure of speech [2]. Such an approach is limited
by the performance of the objective measure: if it fails to ac-
curately predict intelligibility then any modification based on
that prediction is likely to fail. Therefore, it is important to
find a specific domain of modifications where the intelligibility
model behaves well and ensure that the modifications applied
in this domain remain within the working range of the objective
model.
We have observed that the Glimpse Proportion (GP) mea-
sure for speech intelligibility in noise [3] has a high correla-
tion coefficient with subjective intelligibility scores for HMM-
generated synthetic speech whose spectral envelope has been
modified [4]. Moreover, modifications in the spectral envelope
domain can achieve quite high intelligibility gains. We then
proposed a cepstral extraction method based on the GP mea-
sure for the HMM-based synthesis framework [5]. This method
was shown to provide significant intelligibility improvement,
although not for all noise types. We hypothesise this is due to
distortions introduced by the method itself. A disadvantage of
that approach is having to train a different model for each noise
type, because the noise-dependent modifications are performed
as part of feature extraction. Here we propose a method that can
be applied at generation time and that does not require any
information about the spectral envelope of natural speech to
achieve distortion control. Instead, we control the distortion in
two ways: by using a stopping criterion based on the mismatch
between the auditory representations of modified and unmodified
speech, as used by the GP measure, and by modifying only the
first few cepstral coefficients, thus limiting the frequency
resolution of the modifications. A further extension proposed in
this paper is the possibility of using this method for Mel cepstral
coefficients, which can provide higher speech quality with fewer
coefficients [6].
In Sections 2 and 3 we show how Mel cepstral coefficients
model the spectrum, how the GP measure works and how we
previously approximated it for the purpose of cepstral
coefficient optimization. In Section 4 we introduce the new method
for Mel cepstral modification based on the GP measure. We
then provide experimental results from listening experiments to
support our conclusions.
2. Mel cepstral coefficients
We can represent the spectrum by M-th order Mel cepstral coefficients $\{c_m\}_{m=0}^{M}$ in the following manner [6]:
$$H(e^{j\omega}) = \exp \sum_{m=0}^{M} c_m e^{-jm\tilde{\omega}} \qquad (1)$$
$$\tilde{\omega} = \tan^{-1} \frac{(1-\alpha^2)\sin\omega}{(1+\alpha^2)\cos\omega - 2\alpha} \qquad (2)$$
where $\alpha$ is a warping factor which can be chosen to represent,
for instance, the Mel scale [6].
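As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates the warped frequency and the magnitude spectrum implied by a set of Mel cepstral coefficients. The function names, the uniform frequency grid and the use of `atan2` (so that the warped frequency stays in [0, π]) are assumptions made for this example and are not taken from the paper. For real-valued coefficients, the magnitude follows from Eq. (1) as the exponential of the real part of the sum.

```python
import math


def warp_frequency(omega, alpha):
    """Warped frequency of Eq. (2) for a frequency omega in [0, pi]."""
    num = (1.0 - alpha ** 2) * math.sin(omega)
    den = (1.0 + alpha ** 2) * math.cos(omega) - 2.0 * alpha
    # atan2 keeps the result in [0, pi] when the denominator turns negative.
    return math.atan2(num, den)


def magnitude_spectrum(c, alpha, n_bins):
    """|H(e^{jw})| from Eq. (1) on n_bins frequencies equally spaced in [0, pi]."""
    mags = []
    for k in range(n_bins):
        omega = math.pi * k / (n_bins - 1)
        w = warp_frequency(omega, alpha)
        # For real c_m, |exp(sum c_m e^{-j m w})| = exp(sum c_m cos(m w)).
        log_mag = sum(cm * math.cos(m * w) for m, cm in enumerate(c))
        mags.append(math.exp(log_mag))
    return mags
```

With `alpha = 0` the warping reduces to the identity, and with a positive `alpha` low frequencies are stretched, which is the behaviour one would expect from a Mel-like scale.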
C. Valentini-Botinhao, J. Yamagishi, and S. King. Mel cepstral coefficient modification based on the Glimpse
Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proc.
Interspeech, Portland, USA, September 2012.

3. The Glimpse Proportion measure
The Glimpse Proportion (GP) measure for speech intelligibility
in noise [3] is the proportion of spectro-temporal regions,
called glimpses, where speech is more energetic than noise. The
motivation behind this measure is that when humans listen to
speech in noise they tend to focus on such regions. The
Spectro-Temporal Excitation Pattern (STEP) representation used by
the measure is obtained in the following manner: Gammatone
filtering, envelope extraction and smoothing, averaging over a
time frame and level compression [3].
In [5] we showed how to approximate the GP measure in a
way that provides a closed and differentiable formulation:
$$GP = \frac{100}{N_f N_t} \sum_{t=1}^{N_t} \sum_{f=1}^{N_f} L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \qquad (3)$$
where $N_t$ and $N_f$ are the number of time frames and frequency
channels, $L(\cdot)$ is a logistic sigmoid function of zero offset and
slope $\eta$, and $y^{sp}_{t,f}$ and $y^{ns}_{t,f}$ are the approximated STEP
representations for speech and noise respectively at analysis window $t$ and
frequency channel $f$.
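The approximation in Eq. (3) can be sketched in a few lines once the STEP representations are available. In the sketch below, the STEP values are assumed to be given as nested lists indexed by time frame and frequency channel, and the names `logistic` and `glimpse_proportion` are hypothetical. The paper does not state a value for the slope η in this section, so it is left as a required argument.

```python
import math


def logistic(x, eta):
    """Logistic sigmoid with zero offset and slope eta."""
    return 1.0 / (1.0 + math.exp(-eta * x))


def glimpse_proportion(y_sp, y_ns, eta):
    """Approximated GP of Eq. (3), in percent.

    y_sp and y_ns are lists of time frames, each a list of per-channel
    STEP values for speech and noise respectively.
    """
    n_t = len(y_sp)
    n_f = len(y_sp[0])
    total = 0.0
    for t in range(n_t):
        for f in range(n_f):
            total += logistic(y_sp[t][f] - y_ns[t][f], eta)
    return 100.0 * total / (n_f * n_t)
```

When speech and noise have identical STEP values the measure sits at 50 %, and it approaches 100 % as speech dominates every time-frequency cell, which is the smooth counterpart of counting glimpses.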
The STEP representation for speech is given by:
$$y^{sp}_{t,f} = \frac{1}{N} \left(G_f h_t \circledast_N G_f h_t\right)^{\top} S\, b \qquad (4)$$
where $N$ is the number of frequency bins of the spectrum,
$\circledast_N$ is circular convolution of dimension $N$, $h_t$ is an $N \times 1$ vector
containing the magnitude spectrum of the windowed speech signal
at analysis window $t$, $G_f$ is an $N \times N$ diagonal matrix whose
diagonal contains the Gammatone filter frequency response for
frequency channel $f$, $S$ is an $N \times N$ diagonal matrix whose
diagonal contains the frequency response of the smoothing filter
and $b$ is an $N \times 1$ vector containing the coefficients of the
averaging filter.
4. Mel cepstral modifications based on the GP measure
Given a set of Mel cepstral coefficients and a noise signal we
want to obtain a new set of Mel cepstral coefficients
$c_t = [c_{t,1} \ldots c_{t,m} \ldots c_{t,M}]^{\top}$ that maximizes $GP_t$, the value of
the function described in Eq. (3) in time frame $t$. We then have:
$$c_t = \operatorname{argmax}\, GP_t \qquad (5)$$
$$GP_t = \frac{100}{N_f} \sum_{f=1}^{N_f} L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \qquad (6)$$
As this function is not necessarily convex with respect to
the Mel cepstral coefficients, we use a Steepest Descent method
to solve the optimization. The update equation is:
$$c_t^{(i+1)} = c_t^{(i)} + \mu \nabla GP_t^{(i)} \qquad (7)$$
where $\mu$ is the step size and the $i$ index refers to iterations. From
now on we drop the $i$ index for clarity. The gradient vector is
given by:
$$\nabla GP_t = \frac{100}{N_f N} \sum_{f=1}^{N_f} \eta\, L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right) \left[1 - L\left(y^{sp}_{t,f} - y^{ns}_{t,f}\right)\right] \cdot H_{c_t} G_f \left(2\, \Gamma_N \circledast_N G_f h_t\right) S\, b \qquad (8)$$
where $H_{c_t}$ is an $M \times N$ matrix whose elements are
$\{H_{c_t}\}_{m,j} = \partial |H_t(\omega_j)| / \partial c_{t,m}$
and the operation $(\Gamma_N \circledast_N G_f h_t)$
defines an $N \times N$ matrix whose $n$-th row is equal to
$(e_n \circledast_N G_f h_t)^{\top}$, $e_n$ being the $n$-th column of the identity
matrix $\Gamma_N$.
When the spectrum is modelled by Mel cepstral coefficients
as defined in Eq. (1) the elements of the matrix $H_{c_t}$ are given
by:
$$\frac{\partial |H_t(\omega_j)|}{\partial c_{t,m}} = |H_t(\omega_j)| \cos(m \tilde{\omega}_j) \qquad (9)$$
However, because we do not wish to modify the energy of
the speech signal we have:
$$\frac{\partial |H'_t(\omega_j)|}{\partial c_{t,m}} = |H'_t(\omega_j)| \left[ \cos(m \tilde{\omega}_j) - \frac{1}{\psi} \sum_{l=1}^{N} |H_t(\omega_l)|^2 \cos(m \tilde{\omega}_l) \right] \qquad (10)$$
where $|H'_t(\omega_j)|$ is the energy-normalized magnitude spectrum
and $\psi = \sum_{j=1}^{N} |H_t(\omega_j)|^2$. There is no need to update the first
Mel cepstral coefficient $c_0$ as the normalization operation
updates it to a certain value regardless of an additional $c_0$
term.
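The energy-normalized derivative of Eq. (10) can be checked numerically with a short sketch. It assumes that the normalized spectrum is the magnitude spectrum divided by the square root of ψ, which is one normalization consistent with Eq. (10) but is not spelled out in the text. The function names and the list of warped frequencies passed in are likewise hypothetical.

```python
import math


def magnitude(c, warped):
    """|H_t(w_j)| = exp(sum_m c_m cos(m * w~_j)), following Eq. (1)."""
    return [math.exp(sum(cm * math.cos(m * w) for m, cm in enumerate(c)))
            for w in warped]


def normalised_magnitude(c, warped):
    """Assumed normalization: divide by sqrt(psi) so the energy equals one."""
    h = magnitude(c, warped)
    psi = sum(v * v for v in h)
    return [v / math.sqrt(psi) for v in h]


def d_normalised_magnitude(c, warped, m):
    """Eq. (10): derivative of |H'_t(w_j)| w.r.t. c_{t,m}, for every bin j."""
    h = magnitude(c, warped)
    psi = sum(v * v for v in h)
    # Energy-weighted average of cos(m * w~_l) over all bins l.
    weighted_cos = sum(v * v * math.cos(m * w)
                       for v, w in zip(h, warped)) / psi
    return [(v / math.sqrt(psi)) * (math.cos(m * w) - weighted_cos)
            for v, w in zip(h, warped)]
```

Under this assumption the derivative with respect to $c_0$ vanishes in every bin, because the cosine term is one everywhere and equals its own energy-weighted average. This agrees with the remark that $c_0$ does not need to be updated.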
An issue we face when using the GP measure as an
optimization criterion on its own is the need to limit the distortions
caused by the modifications. To define an audible distortion we
use the Euclidean distance between the STEP representations of
modified and unmodified speech. Including this as an explicit
constraint is unfortunately rather cumbersome, so instead we
use it as a stopping criterion whilst at the same time limiting the
frequency resolution of the modifications. To implement that,
we simply set the gradient vector for higher dimensions to zero,
thus modifying only the first few Mel cepstral coefficients, which
represent the coarse properties of the spectrum.
5. Evaluation
In this section we show how we built the TTS voices, give an
acoustic analysis, and present the results of a listening test.
5.1. Voice building
To build the voices used in this evaluation we used two different
datasets recorded by the same British male speaker: normal
(plain, read-text) speech data and Lombard speech. The Lombard
dataset was recorded while the speaker listened to
speech-modulated noise based on another male speaker [7], played over
headphones at an absolute level of 84 dBA.
We built eight different voices as outlined in Table 1. Voice
N was created from a high quality average voice model adapted
to 2803 sentences of the normal speech database, correspond-
ing to three hours of material. We decided to use an average
voice model rather than building a speaker-dependent voice be-
cause the normal speech dataset was not phonetically balanced.
Voices N-M59, N-M10 and N-M2 are variations of N in which
we modify all, just the first ten ($c_1$ to $c_{10}$), or just the first
two ($c_1$ and $c_2$) Mel cepstral coefficients using our proposed
method.
Lombard voice L was based on voice N, further adapted
using 780 sentences from the Lombard speech dataset,
corresponding to 53 minutes of recorded material. Again, the
reason for using adaptation was the lack of phonetic balance in the
speech dataset. Voice N-L was also created from voice N but
this time only the Mel cepstral coefficients were adapted to the
Lombard data. Voices L-E and L-E-M2 are versions of voice L
where we extrapolated the adaptation (voice L-E), and then also
modified the first two Mel cepstral coefficients using the proposed
method (voice L-E-M2).

Voice    Adaptation                    Modification
N        -                             -
N-M59    -                             all coefficients
N-M10    -                             first 10 coefficients
N-M2     -                             first 2 coefficients
N-L      only spectral parameters      -
L        all dimensions                -
L-E      all dimensions extrapolated   -
L-E-M2   all dimensions extrapolated   first 2 coefficients
Table 1: Voices built for the evaluation.
The training and adaptation data had a sampling rate of
48 kHz. To train, adapt and generate speech we extracted: 59
Mel cepstral coefficients with α = 0.77, Mel scale F0, and 25
aperiodicity energy bands extracted using STRAIGHT [8]. We
used a hidden semi-Markov model. The observation vectors for
the spectral and excitation parameters contained static, delta and
delta-delta values; one stream for the spectrum, three streams
for the logF0 and one for the band-limited aperiodicity. The
Global Variance method [9] was also applied to compensate for
the over-smoothing effect of the acoustic modelling.
To modify the generated Mel cepstral coefficients we used
the method proposed in the previous section, obtaining the
STEP representation by using Gammatone filters that covered
the range 50-7500 Hz, as the noise signal used for testing
is sampled at 16 kHz. The step size was normalized,
$\mu^{(i)} = \mu / \|\nabla GP_t^{(i)}\|$, and we set $\mu = 0.4$ for N-M59 and $\mu = 0.8$
for N-M10 and N-M2. We used as stopping criteria both error
convergence and a maximum distortion threshold, set to a 10 %
relative increase in the Euclidean distance between the STEP
representations of original and modified speech.
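The per-frame procedure described above (normalized step size, gradient truncated to the first few coefficients, and two stopping criteria) can be outlined as follows. This is only a sketch under stated assumptions: `objective_fn`, `grad_fn` and `distortion_fn` are hypothetical callbacks standing in for GP_t, its gradient and the relative STEP-domain distortion, and the convergence test on the objective is one possible reading of "error convergence".

```python
import math


def modify_frame(c, objective_fn, grad_fn, distortion_fn, mu, n_modified,
                 max_distortion=0.10, max_iter=200, tol=1e-9):
    """Iteratively update c_1..c_{n_modified} of one frame.

    objective_fn(c) -> value to maximize (stand-in for GP_t)
    grad_fn(c)      -> gradient of the objective, same length as c
    distortion_fn(c)-> relative distortion of c against the original frame
    """
    c = list(c)
    for _ in range(max_iter):
        grad = grad_fn(c)
        # Zero the gradient for c_0 and for orders above n_modified.
        grad = [g if 1 <= m <= n_modified else 0.0
                for m, g in enumerate(grad)]
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        # Normalized step: mu divided by the gradient norm.
        candidate = [cm + (mu / norm) * g for cm, g in zip(c, grad)]
        if distortion_fn(candidate) > max_distortion:
            break  # distortion threshold reached
        if objective_fn(candidate) - objective_fn(c) < tol:
            break  # objective no longer improves
        c = candidate
    return c
```

Because the step is normalized, every accepted update moves the coefficient vector by the same distance `mu`, so the distortion check is what bounds how far a frame can drift from the generated coefficients.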
5.2. Acoustic analysis
Fig. 1 shows the Long Term Average Spectrum (LTAS) of the
normal (N), modified (N-M2) and Lombard (L) voices, for the
case of speech-shaped noise. Compared to voice N, voice N-M2
exhibits enhanced energy in the frequency region of 1-4 kHz
and attenuated energy below 1 kHz. Voice L shows enhancement
and attenuation in the same regions as N-M2, although
these changes are not as pronounced; attenuation is also seen
between 4 and 5.5 kHz, and enhancement at frequencies above this.
Table 2 provides an acoustic analysis of the voices (average
duration of speech and pauses, average spectral tilt, and
F0) across all sentences used in the listening test for the
normal (N), modified (N-M2) and Lombard (L) voices. We can
see that, as expected, the Lombard voice produces sentences
with longer duration and longer pauses, greatly increased F0
mean and flattening of the spectral tilt. The spectral tilt reflects
changes in both spectral envelope and excitation signal. The
modified voice N-M2 also presents a flatter spectral tilt, though
not to the same extent as the Lombard voice.
Figure 1: Long term average spectrum of the normal N, normal
modified N-M2 and Lombard L voices for speech-shaped noise.
[Plot: sound pressure level (dB) against frequency (0-8 kHz) for N, N-M2, L and the noise.]

Voice   speech (secs.)   pauses (secs.)   F0 (Hz)   spectral tilt (dB/octave)
N       2.11             0.16             104.5     -2.24
N-M2    2.11             0.16             104.5     -1.88
L       2.80             0.19             145.0     -1.70
Table 2: Acoustic properties observed in normal N, modified
N-M2 and Lombard L voices.

5.3. Listening experiments design
We mixed the eight different synthetic voices with two noises:
speech-shaped noise and speech from a single competing
female talker. For intelligibility testing, it is important to avoid
floor or ceiling effects on word error rate. Therefore, in order to
obtain intelligibility scores in similar ranges for each noise, we
mixed them at differing SNRs: -4 dB for speech-shaped noise
and -14 dB for the competing talker. Across the different voices
we made sure that the root mean square value was the same.
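The mixing step can be sketched as follows, under the assumption that the SNR is defined as the ratio of the overall RMS values of speech and noise; the paper does not state how the SNR was computed, and the helper names are hypothetical. Each voice is first scaled to a common RMS value, then the noise is scaled to reach the target SNR before the signals are added.

```python
import math


def rms(x):
    """Root mean square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_to_rms(x, target_rms):
    """Scale a signal so that its RMS equals target_rms."""
    gain = target_rms / rms(x)
    return [gain * v for v in x]


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, scaled_noise) with speech-to-noise RMS ratio snr_db."""
    gain = rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise
```

For example, `mix_at_snr(speech, noise, -4.0)` would correspond to the speech-shaped noise condition and `mix_at_snr(speech, noise, -14.0)` to the competing talker condition, given signals of equal length.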
For the listening test we used 32 native English speakers
listening to the noisy samples over headphones in soundproof
booths and typing in what he or she heard. Each participant
heard six different sentences per condition, i.e., voice and noise
type, and each sentence could only be played once. We used the
first ten sets of the Harvard sentences [10]; another one of the
sets was used as a practice session which listeners completed
before the test proper.
5.4. Results and discussion
Figs. 2 and 3 show the mean word accuracy rate (WAR)
obtained by each voice when mixed with speech-shaped noise
and a competing talker respectively, along with 95 %
confidence intervals. Fig. 2 shows that the modified voices N-M59,
N-M10 and N-M2 achieve a higher WAR than the unmodified
voice N (40.9 %), and that the difference is significant for
N-M10 (50.6 %) and N-M2 (57.8 %). The N-M2 voice obtains a
higher WAR than the N-L voice (49.4 %). The Lombard voices
L (63.5 %), L-E (68.1 %) and L-E-M2 (70.1 %) performed bet-
ter than the normal speech voices although we did not find a sig-
nificant difference between N-M2 and L. The extrapolated voice
L-E is more intelligible than voice L, a trend that is further en-
hanced by applying our modifications to it, as in voice L-E-M2.
The results obtained for the competing talker situation are dis-
played in Fig. 3 and show a slightly different trend. There is a
drop in performance for N-M59 and N-M10 when compared to
N (36.6 %), although this is not significant. The N-M2 (42.7 %)
voice performs better than the unmodified counterpart N and
obtains a similar WAR to N-L (43.6 %). All Lombard voices
performed significantly better than the other voices, in
particular the L voice (62.2 %). The other versions, L-E (60.5 %) and
L-E-M2 (59.3 %), do not appear to increase intelligibility.

Figure 2: Word accuracy rates for speech-shaped noise.
[Bar chart: word accuracy rate (%) for voices N, N-M59, N-M10, N-M2, N-L, L, L-E and L-E-M2.]

As predicted by our hypothesis that distortions were defeat-
ing potential gains in intelligibility in our previously-published
experiments [5], the voices where we modify only the first few
Mel cepstral coefficients achieved a better WAR, indicating that
very fine frequency modifications cause distortions that cancel
out any potential intelligibility gain they may offer. Compared
to the N-L voice, for which the spectral parameters were ob-
tained using Lombard speech, the modifications proposed here
obtained a similar or higher intelligibility score. The intelligi-
bility gains obtained by the full Lombard voice L over the N-L
voice reflect the impact of changes in duration patterns, F0 and
the aperiodicity parameters that define the excitation signal, as
pointed out in Table 2. We can see, then, that there is a lot to
gain from modifying those parameters in addition to the spec-
tral ones. The spectral modifications proposed here increased
the gains obtained with the Lombard voice for speech-shaped
noise, as we can see from the results for voice L-E-M2, which
shows that there are still gains to be had over and above simply
building voices on recorded Lombard speech.
For the competing talker, spectral changes seem to con-
tribute less than for speech-shaped noise. For the competing
talker, duration stretches as well as F0 increases are more im-
portant. This suggests that for non-stationary noise it is more
effective to perform temporal energy re-allocation (e.g., taking
advantage of quiet or silent regions in the noise signal) than it is
to reallocate energy across different frequencies.
6. Conclusions
We have proposed a new method for modifying Mel cepstral
coefficients based on an intelligibility measure for speech in noise,
the Glimpse Proportion measure. We showed how to control
distortion by modifying only the first few Mel cepstral coeffi-
cients, which is a natural way of limiting the frequency resolu-
tion of the modifications. In the evaluation, we compared syn-
thetic voices whose spectral parameters were modified as well
as using spectral parameters from Lombard speech. Listening
tests using speech-shaped noise and a competing talker indicate
that we only need to modify two Mel cepstral coefficients to ob-
tain a similar or higher intelligibility to Lombard spectral modi-
fications. Moreover we observed that, for the competing talker,
the intelligibility gain obtained by the Lombard voice over the
modified voice was mainly due to changes in duration, F0 and
excitation parameters. In terms of what can be achieved when
modifying only Mel cepstral coefficients, our method obtains
either higher or similar intelligibility scores to Lombard Mel
cepstral coefficients. We are currently making a more extensive
comparison of our method to other intelligibility enhancement
methods. In future, we plan to investigate reallocating energy
across time. We also plan to operate under a loudness constraint
rather than an energy one.

Figure 3: Word accuracy rates for competing talker.
[Bar chart: word accuracy rate (%) for voices N, N-M59, N-M10, N-M2, N-L, L, L-E and L-E-M2.]

7. Acknowledgment
The research leading to these results was partly funded from
the European Community’s Seventh Framework Programme
(FP7/2007-2013) under grant agreement 213850 (SCALE) and
256230 (LISTA), and from EPSRC grants EP/I031022/1 and
EP/J002526/1.
8. References
[1] T. Raitio, A. Suni, M. Vainio, and P. Alku, “Analysis of HMM-based
Lombard speech synthesis,” in Proc. Interspeech, Florence,
Italy, August 2011.
[2] B. Sauert and P. Vary, “Near end listening enhancement: Speech
intelligibility improvement in noisy environments,” in Proc.
ICASSP, Toulouse, France, May 2006, pp. 493–496.
[3] M. Cooke, “A glimpsing model of speech perception in noise,” J.
Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, 2006.
[4] C. Valentini-Botinhao, J. Yamagishi, and S. King, “Can objective
measures predict the intelligibility of modified HMM-based
synthetic speech in noise?” in Proc. Interspeech, Florence, Italy,
August 2011.
[5] C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and
H. Zen, “Cepstral analysis based on the Glimpse proportion measure
for improving the intelligibility of HMM-based synthetic
speech in noise,” in Proc. ICASSP, Kyoto, Japan, March 2012.
[6] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive
algorithm for mel-cepstral analysis of speech,” in Proc. ICASSP,
vol. 1, San Francisco, USA, March 1992, pp. 137–140.
[7] W. Dreschler, H. Verschuure, C. Ludvigsen, and S. Westermann,
“ICRA noises: artificial noise signals with speech-like spectral
and temporal properties for hearing instrument assessment.
International Collegium for Rehabilitative Audiology,” Audiology,
vol. 40, no. 3, pp. 148–157, 2001.
[8] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring
speech representations using a pitch-adaptive time-frequency
smoothing and an instantaneous-frequency-based F0 extraction:
possible role of a repetitive structure in sounds,” Speech Comm.,
vol. 27, pp. 187–207, 1999.
[9] T. Toda and K. Tokuda, “A speech parameter generation algorithm
considering global variance for HMM-based speech synthesis,”
IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007.
[10] “IEEE recommended practice for speech quality measurements,”
Audio and Electroacoustics, IEEE Transactions on, vol. 17, no. 3,
pp. 225–246, Sep. 1969.
Citations
More filters

Journal ArticleDOI
TL;DR: The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech.
Abstract: The use of live and recorded speech is widespread in applications where correct message reception is important. Furthermore, the deployment of synthetic speech in such applications is growing. Modifications to natural and synthetic speech have therefore been proposed which aim at improving intelligibility in noise. The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech. Listeners identified keywords in phonetically-balanced sentences representing ten different types of speech: plain and Lombard speech, five types of modified speech, and three forms of synthetic speech. Sentences were masked by either a stationary or a competing speech masker. Modification methods varied in the manner and degree to which they exploited estimates of the masking noise. The best-performing modifications led to equivalent intensity changes of around 5dB in moderate and high noise levels for the stationary masker, and 3-4dB in the presence of competing speech. These gains exceed those produced by Lombard speech. Synthetic speech in noise was always less intelligible than plain natural speech, but modified synthetic speech reduced this deficit by a significant amount.

101 citations


Cites methods from "Mel cepstral coefficient modificati..."

  • ...The first two Mel cepstral coefficients were modified (excluding the log-energy coefficient) in order to maximise intelligibility of speech in noise as given by an approximated version of the glimpse proportion measure (Cooke, 2006; Valentini-Botinhao et al., 2012a)....

    [...]

  • ...To create the ‘TTSGP’ type a Mel cepstral coefficient modification method (Valentini-Botinhao et al., 2012b) was applied to the spectral parameters generated by the TTS type....

    [...]

  • ...…audio power reallocation based on the Speech Intelligibility Index (Sauert and Vary, 2010, 2011) or glimpse proportion (Tang and Cooke, 2012), cepstral extraction based on the glimpse proportion measure (Valentini-Botinhao et al., 2012a), and the insertion of small pauses (Tang and Cooke, 2011)....

    [...]


Proceedings ArticleDOI
25 Aug 2013-
TL;DR: Surprisingly, for most conditions the largest gains were observed for noise-independent algorithms, suggesting that performance in this task can be further improved by exploiting information in the masking signal.
Abstract: Speech output is used extensively, including in situations where correct message reception is threatened by adverse listening conditions. Recently, there has been a growing interest in algorithmic modifications that aim to increase the intelligibility of both natural and synthetic speech when presented in noise. The Hurricane Challenge is the first large-scale open evaluation of algorithms designed to enhance speech intelligibility. Eighteen systems operating on a common data set were subjected to extensive listening tests and compared to unmodified natural and text-to-speech (TTS) baselines. The best-performing systems achieved gains over unmodified natural speech of 4.4 and 5.1 dB in competing speaker and stationary noise respectively, while TTS systems made gains of 5.6 and 5.1 dB over their baseline. Surprisingly, for most conditions the largest gains were observed for noise-independent algorithms, suggesting that performance in this task can be further improved by exploiting information in the masking signal.

69 citations


Cites methods from "Mel cepstral coefficient modificati..."

  • ...To enhance the spectral envelope a noise-dependent optimisation based on the glimpse proportion measure was performed [29]....

    [...]


01 Jan 2013-

31 citations


Journal ArticleDOI
TL;DR: A method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech such that the glimpse proportion - an objective measure of the intelligibility of speech in noise - increases, while keeping the speech energy fixed.
Abstract: This paper describes speech intelligibility enhancement for Hidden Markov Model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the glimpse proportion - an objective measure of the intelligibility of speech in noise - increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1-4kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.

26 citations


Cites methods from "Mel cepstral coefficient modificati..."

  • ...We then proposed a method to extract cepstral coefficients which maximized the GP measure (Valentini-Botinhao et al., 2012a)....

    [...]

  • ...Our solution to this was to modify the generated speech instead (Valentini-Botinhao et al., 2012b), by modifying the Mel cepstral coefficients....

    [...]


Journal ArticleDOI
TL;DR: How well seven OIMs predict listener responses in three large datasets of modified and synthetic speech which together represent 396 combinations of speech modification, masker type and signal-to-noise ratio is evaluated.
Abstract: HighlightsAlgorithmically modified speech is used to assess objective intelligibility metrics.Reduced predictive power of the metrics for the given speech is demonstrated.Metrics show two opposite predictive patterns in fluctuating and stationary maskers.The glimpse proportion metric is extended. Several modification algorithms that alter natural or synthetic speech with the goal of improving intelligibility in noise have been proposed recently. A key requirement of many modification techniques is the ability to predict intelligibility, both offline during algorithm development, and online, in order to determine the optimal modification for the current noise context. While existing objective intelligibility metrics (OIMs) have good predictive power for unmodified natural speech in stationary and fluctuating noise, little is known about their effectiveness for other forms of speech. The current study evaluated how well seven OIMs predict listener responses in three large datasets of modified and synthetic speech which together represent 396 combinations of speech modification, masker type and signal-to-noise ratio. The chief finding is a clear reduction in predictive power for most OIMs when faced with modified and synthetic speech. Modifications introducing durational changes are particularly harmful to intelligibility predictors. OIMs that measure masked audibility tend to over-estimate intelligibility in the presence of fluctuating maskers relative to stationary maskers, while OIMs that estimate the distortion caused by the masker to a clean speech prototype exhibit the reverse pattern.

21 citations


References

Journal Article
TL;DR: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters by using pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region.
Abstract: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters. The proposed method uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region. The method also consists of a fundamental frequency (F0) extraction using instantaneous frequency calculation based on a new concept called 'fundamentalness'. The proposed procedures preserve the details of time–frequency surfaces while almost perfectly removing fine structures due to signal periodicity. This close-to-perfect elimination of interferences and smooth F0 trajectory allow for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, while maintaining high reproductive quality.

1,676 citations


"Mel cepstral coefficient modificati..." refers to methods in this paper

  • ...To train, adapt and generate speech we extracted: 59 Mel cepstral coefficients with α = 0.77, Mel scale F0, and 25 aperiodicity energy bands extracted using STRAIGHT [8]....

    [...]



Journal Article
Martin Cooke
TL;DR: An automatic speech recognition system, adapted for use with partially specified inputs, was used to identify consonants in noise. The proportion of the time-frequency plane glimpsed was a good predictor of intelligibility, and a transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
Abstract: Do listeners process noisy speech by taking advantage of "glimpses", spectrotemporal regions in which the target signal is least affected by the background? This study used an automatic speech recognition system, adapted for use with partially specified inputs, to identify consonants in noise. Twelve masking conditions were chosen to create a range of glimpse sizes. Several different glimpsing models were employed, differing in the local signal-to-noise ratio (SNR) used for detection, the minimum glimpse size, and the use of information in the masked regions. Recognition results were compared with behavioral data. A quantitative analysis demonstrated that the proportion of the time-frequency plane glimpsed is a good predictor of intelligibility. Recognition scores in each noise condition confirmed that sufficient information exists in glimpses to support consonant identification. Close fits to listeners' performance were obtained at two local SNR thresholds: one at around 8 dB and another in the range -5 to -2 dB. A transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.

615 citations
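
The glimpse idea in the abstract above can be illustrated with a minimal Python sketch. It assumes the speech and noise have already been reduced to time-frequency energies in dB (the paper works with an auditory representation, which is not reproduced here). The function name `glimpse_proportion` and the default threshold of 3 dB are assumptions made for illustration, not values taken from this page.

```python
def glimpse_proportion(speech_db, noise_db, threshold_db=3.0):
    """Fraction of time-frequency cells whose local SNR exceeds threshold_db.

    speech_db, noise_db: lists of frames, each frame a list of band energies in dB.
    threshold_db: local SNR threshold (hypothetical default, chosen for illustration).
    """
    total = 0
    glimpsed = 0
    for speech_frame, noise_frame in zip(speech_db, noise_db):
        for s, n in zip(speech_frame, noise_frame):
            total += 1
            if s - n > threshold_db:  # speech dominates the masker in this cell
                glimpsed += 1
    return glimpsed / total if total else 0.0


# Toy example: 2 frames x 3 bands, flat 50 dB masker.
speech = [[60.0, 40.0, 55.0], [30.0, 62.0, 50.0]]
noise = [[50.0, 50.0, 50.0], [50.0, 50.0, 50.0]]
gp_default = glimpse_proportion(speech, noise)      # local SNRs 10, 5, 12 pass
gp_strict = glimpse_proportion(speech, noise, 8.0)  # only 10 and 12 pass
```

Raising the threshold can only shrink the set of glimpsed cells, which is why the choice of local SNR threshold matters when fitting listener data.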


Journal Article
Tomoki Toda, Keiichi Tokuda
Abstract: This paper describes a novel parameter generation algorithm for an HMM-based speech synthesis technique. The conventional algorithm generates a parameter trajectory of static features that maximizes the likelihood of a given HMM for the parameter sequence consisting of the static and dynamic features under an explicit constraint between those two features. The generated trajectory is often excessively smoothed due to the statistical processing. Using the over-smoothed speech parameters usually causes muffled sounds. In order to alleviate the over-smoothing effect, we propose a generation algorithm considering not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for a global variance (GV) of the generated trajectory. The latter likelihood works as a penalty for the over-smoothing, i.e., a reduction of the GV of the generated trajectory. The result of a perceptual evaluation demonstrates that the proposed algorithm causes considerably large improvements in the naturalness of synthetic speech.

454 citations
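
As a rough illustration of the global variance (GV) mentioned in the abstract above, the following Python sketch computes the per-dimension variance of a parameter trajectory over time and shows that smoothing the trajectory lowers it. The moving average is only a stand-in for the over-smoothing caused by statistical modelling; it is not the generation algorithm from the paper, and the helper names are hypothetical.

```python
def global_variance(trajectory):
    """Per-dimension variance over time of a list of feature frames."""
    n_frames = len(trajectory)
    n_dims = len(trajectory[0])
    means = [sum(frame[d] for frame in trajectory) / n_frames for d in range(n_dims)]
    return [
        sum((frame[d] - means[d]) ** 2 for frame in trajectory) / n_frames
        for d in range(n_dims)
    ]


def moving_average(trajectory, half_width=1):
    """Smooth each dimension with a centred window (shorter at the edges)."""
    n_frames = len(trajectory)
    n_dims = len(trajectory[0])
    smoothed = []
    for t in range(n_frames):
        lo = max(0, t - half_width)
        hi = min(n_frames, t + half_width + 1)
        window = trajectory[lo:hi]
        smoothed.append([sum(f[d] for f in window) / len(window) for d in range(n_dims)])
    return smoothed


# One-dimensional toy trajectory that alternates between +1 and -1.
raw = [[1.0], [-1.0], [1.0], [-1.0], [1.0], [-1.0]]
gv_raw = global_variance(raw)
gv_smooth = global_variance(moving_average(raw))
```

In this toy case the smoothed trajectory has a much smaller GV than the raw one, which is the kind of reduction the GV likelihood term is meant to penalize.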


Proceedings Article
23 Mar 1992
TL;DR: The authors apply the criterion used in the unbiased estimation of log spectrum to the spectral model represented by the mel-cepstral coefficients, solve the resulting nonlinear minimization problem with an iterative algorithm whose convergence is guaranteed, and derive an adaptive algorithm for the analysis.
Abstract: The authors describe a mel-cepstral analysis method and its adaptive algorithm. In the proposed method, the authors apply the criterion used in the unbiased estimation of log spectrum to the spectral model represented by the mel-cepstral coefficients. To solve the nonlinear minimization problem involved in the method, they give an iterative algorithm whose convergence is guaranteed. Furthermore, they derive an adaptive algorithm for the mel-cepstral analysis by introducing an instantaneous estimate for gradient of the criterion. The adaptive mel-cepstral analysis system is implemented with an IIR adaptive filter which has an exponential transfer function, and whose stability is guaranteed. The authors also present examples of speech analysis and results of an isolated word recognition experiment.

363 citations


"Mel cepstral coefficient modificati..." refers to background or methods in this paper

  • ...A further extension proposed in this paper is the possibility of using this method for Mel cepstral coefficients, which can provide higher speech quality with fewer coefficients [6]....

    [...]

  • ...We can represent the spectrum by M-th order Mel cepstral coefficients $\{c_m\}_{m=0}^{M}$ in the following manner [6]:...

    [...]

  • ...where α is a warping factor which can be chosen to represent, for instance, the Mel scale [6]....

    [...]
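
The excerpts above elide the equation itself. As a hedged illustration, the Python sketch below assumes the standard mel-cepstral form, in which the log spectrum is a sum of the coefficients c_m multiplied by powers of a first-order all-pass function with warping factor α. The function name `mel_log_magnitude` is hypothetical, and the default α = 0.77 is borrowed from the feature-extraction excerpt quoted earlier on this page.

```python
import cmath
import math


def mel_log_magnitude(mel_cepstrum, omega, alpha=0.77):
    """Log magnitude spectrum at angular frequency omega (radians per sample).

    Assumes H(z) = exp(sum_m c_m * z_tilde^{-m}) with the all-pass warping
    z_tilde^{-1} = (z^{-1} - alpha) / (1 - alpha * z^{-1}).
    """
    z_inv = cmath.exp(-1j * omega)
    warped = (z_inv - alpha) / (1.0 - alpha * z_inv)  # lies on the unit circle
    log_h = sum(c * warped ** m for m, c in enumerate(mel_cepstrum))
    return log_h.real  # real part of log H is log |H|


# With only c_1 set, the log magnitude follows a cosine of the warped frequency.
coeffs = [0.0, 1.0]
at_dc = mel_log_magnitude(coeffs, 0.0)
at_nyquist = mel_log_magnitude(coeffs, math.pi)
flat = mel_log_magnitude([2.5], 1.0)  # c_0 alone gives a constant log gain
```

Because the warping maps 0 to 0 and π to π, the endpoints of the spectrum are unaffected by α; only the frequencies in between are redistributed.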


Journal Article
TL;DR: This paper describes the design criteria, the realisation process, and the final selection of nine ICRA noise test signals on a CD, and proposes some initial steps to develop a standard method of technical specification of noise reduction based on the modulation characteristics.
Abstract: Current standards involving technical specification of hearing aids provide limited possibilities for assessing the influence of the spectral and temporal characteristics of the input signal, and these characteristics have a significant effect on the output signal of many recent types of hearing aids. This is particularly true of digital hearing instruments, which typically include non-linear amplification in multiple channels. Furthermore, these instruments often incorporate additional non-linear functions such as "noise reduction" and "feedback cancellation". The output signal produced by a non-linear hearing instrument relates to the characteristics of the input signal in a complex manner. Therefore, the choice of input signal significantly influences the outcome of any acoustic or psychophysical assessment of a non-linear hearing instrument. For this reason, the International Collegium for Rehabilitative Audiology (ICRA) has introduced a collection of noise signals that can be used for hearing aid testing (including real-ear measurements) and psychophysical evaluation. This paper describes the design criteria, the realisation process, and the final selection of nine test signals on a CD. Also, the spectral and temporal characteristics of these signals are documented. The ICRA noises provide a well-specified set of speech-like noises with spectra shaped according to gender and vocal effort, and with different amounts of speech modulation simulating one or more speakers. These noises can be applied as well-specified background noise in psychophysical experiments. They can also serve as test signals for the evaluation of digital hearing aids with noise reduction. It is demonstrated that the ICRA noises show the effectiveness of the noise reduction schemes. Based on these initial measurements, some initial steps are proposed to develop a standard method of technical specification of noise reduction based on the modulation characteristics. 
For this purpose, the sensitivity of different noise reduction schemes is compared by measurements with ICRA noises with a varying ratio between unmodulated and modulated test signals: a modulated-unmodulated ratio. It can be anticipated that this information is important to understand the differences between the different implementations of noise reduction schemes in different hearing aid models and makes.

241 citations


Performance Metrics
No. of citations received by the paper in previous years

Year    Citations
2020    1
2019    4
2017    2
2016    2
2014    7
2013    8