What is the definition of the identification rate of a measure?

The authors define the identification rate of a measure to be the fraction of larynx cycles that contain exactly one NZC and the detection rate to be the fraction that contain either one or two NZCs.

How does the detection rate of a measure change as the window length increases?

As the window length is increased the accuracy steadily worsens but the identification rate improves and reaches a peak of over 90% at a window length of 10 ms.

How many repetitions of the following sentences were recorded?

The database includes ten repetitions from each of ten British English speakers (five male, five female) of the following sentences:

What is the way to identify a measure?

To take a specific example, the measure is identified by circles and the authors see from the first point on the graph that for a 4 ms window, its identification accuracy is 0.34 ms but its identification rate is only 36%.

How many larynx cycles contain exactly one NZC?

For this example, the standard deviation of these “closest” NZCs is 0.97 ms and if the authors combine these with the single-NZC cycles, the authors can detect the GCI in over 97% of larynx cycles with a standard deviation of 0.6 ms.

How can the authors reduce the computational cost of the measures?

The authors have shown how the computational cost of all the measures can be reduced greatly by calculating them recursively provided that a suitable window function is used.

How many NZCs are in the larynx?

Of the remaining 12% of larynx cycles, over three quarters contain exactly two NZCs; in most cases these occur at glottal opening and closure, respectively, giving rise to the histogram shown in Fig. 8(b).

What is the detection rate for the and measures?

The and measures again show the best performance and reach a detection rate of 97.1% for window lengths of 8 ms and 7 ms, respectively.

(Open Access) A Quantitative Assessment of Group Delay Methods for Identifying Glottal Closures in Voiced Speech (2006) | Mike Brookes

Q: What is the effect of noise on the median value of a measure?

It follows that the noise will not affect the median value of $d_{EP}$ unless the noise amplitude is large enough to cause the value of the summation to cross the positive real axis where there is a discontinuity in the $\arg$ function.

456 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 2, MARCH 2006

A Quantitative Assessment of Group Delay Methods for Identifying Glottal Closures in Voiced Speech

Mike Brookes, Member, IEEE, Patrick A. Naylor, Member, IEEE, and Jon Gudnason, Member, IEEE

Abstract—Measures based on the group delay of the LPC residual have been used by a number of authors to identify the time instants of glottal closure in voiced speech. In this paper, we discuss the theoretical properties of three such measures and we also present a new measure having useful properties. We give a quantitative assessment of each measure's ability to detect glottal closure instants evaluated using a speech database that includes a direct measurement of glottal activity from a Laryngograph/EGG signal. We find that when using a fixed-length analysis window, the best measures can detect the instant of glottal closure in 97% of larynx cycles with a standard deviation of 0.6 ms and that in 9% of these cycles an additional excitation instant is found that normally corresponds to glottal opening. We show that some improvement in detection rate may be obtained if the analysis window length is adapted to the speech pitch. If the measures are applied to the preemphasized speech instead of to the LPC residual, we find that the timing accuracy worsens but the detection rate improves slightly. We assess the computational cost of evaluating the measures and we present new recursive algorithms that give a substantial reduction in computation in all cases.

Index Terms—Closed phase, glottal closure, group delay, speech analysis.

I. INTRODUCTION

IN VOICED SPEECH, the primary acoustic excitation normally occurs at the instant of vocal-fold closure. This marks the start of the closed-phase interval during which there is little or no airflow through the glottis. There are several areas of speech processing in which it is helpful to be able to identify the glottal closure instants (GCIs) and/or the closed-phase intervals. Recent interest has concentrated on PSOLA-based concatenative synthesis and voice-morphing techniques in which the identification of the GCIs is necessary to preserve coherence across segment boundaries [1], [2]. More generally, accurate identification of the closed phases allows the blind deconvolution of the vocal tract and glottal source through the use of closed phase analysis and modeling [3]–[8]. The resultant characterization of the glottal source gives benefits to speaker identification systems [9]–[11] and potential benefits to speech recognition systems and low-bit rate coders. The determination of glottal closure instants is also important in the clinical diagnosis and treatment of voice pathologies.

Manuscript received June 10, 2003; revised February 16, 2005. This work was supported by EPSRC under Grant GR/N01569. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ramesh A. Gopinath.

The authors are with Imperial College, London SW7 2BT, U.K. (e-mail: mike.brookes@imperial.ac.uk; p.naylor@imperial.ac.uk; jon.gudnason@imperial.ac.uk).

Digital Object Identiﬁer 10.1109/TSA.2005.857810

Fig. 1. (a) A 12.5 ms speech waveform of male voice, phoneme /a/, (b) laryngograph waveform, (c) estimated glottal volume velocity, and (d) autocorrelation LPC residual from preemphasised speech.

The accurate identification of GCIs has been an aim of speech researchers for many years and numerous techniques have been proposed. The most widely used approach is to look for discontinuities in a linear model of speech production [11]–[14]. An alternative is to search for energy peaks in waveforms derived from the speech signal [8], [15], [16] or for features in its time-frequency representation [17], [18]. To obtain good results in closed-phase speech processing, it is essential to identify the time of glottal excitation at closure to within a fraction of 1 ms whereas locating the precise glottal opening instant is normally much less critical [3], [10], [19].

In Fig. 1, waveform (a) shows a 12.5 ms segment of male speech from the vowel /a/. Waveform (b) shows a simultaneous Laryngograph recording (also called Electroglottograph or EGG) which measures the electrical conductance of the larynx at 2 MHz and provides a direct indication of glottal activity [5], [20]. The positions of the glottal closure and opening instants are indicated on this waveform as P and Q, respectively, and the interval PQ is the closed phase of the larynx cycle. Acoustic theory shows that, for vowel sounds, the vocal tract acts as an all-pole filter whose input is the volume velocity (also called volume flow rate) of air through the glottis [21]. The estimate of this volume velocity shown as waveform (c) was obtained by applying covariance LPC to the closed-phase speech segment PQ, filtering the speech by the resultant all-zero inverse filter and then applying a leaky integrator to the result to compensate for lip radiation [13], [21]. By restricting the analysis to the closed-phase in this way, we obtain an estimate of the vocal tract filter that is unperturbed by the glottal excitation. The low frequency fidelity of the volume velocity waveform estimate can be improved by correcting for phase distortion in the recording process [22] but the important features can be seen in the uncorrected waveform, namely a rapid decrease at glottal closure (P) and a less abrupt increase at opening (Q).

Waveform (d) is the LPC residual obtained by applying the LPC inverse filter to a preemphasised speech waveform. The use of preemphasis and the omission of any compensation for

BROOKES et al.: GROUP DELAY METHODS FOR IDENTIFYING GLOTTAL CLOSURES IN VOICED SPEECH 457

lip radiation mean that the waveform is approximately equal to the second derivative of the volume velocity. It can be seen that this waveform includes an impulsive feature at closure (P) and a similar but smaller impulse at opening (Q). The use of this LPC residual waveform for detecting glottal closure instants using methods such as those proposed in [12]–[14], [23]–[25] requires the following assumptions: (i) the vocal tract acts as an all-pole filter, (ii) the filter can be estimated adequately from the speech waveform alone and (iii) the LPC residual will contain an identifiable impulse at closure for voiced speech sounds. Assumptions (i) and (ii) are discussed later in this Section.

The main contributions of this paper are (a) to demonstrate that assumption (iii) is correct for a large proportion of larynx cycles, (b) to introduce a new energy-weighted group-delay measure as a means of locating the impulse, (c) to give a quantitative assessment of the new measure's performance and a comparative evaluation of three other measures based on group-delay, and (d) to provide efficient recursive algorithms for the computation of all four measures.

The all-pole filter model of the vocal tract is less good for voiced consonants than for vowel sounds for two reasons. Firstly, the closed oral cavity in nasal consonants introduces zeros into the vocal tract filter response. For these phonemes therefore, the vocal tract is poorly modeled and in some speakers closure impulses are not apparent in the residual. A method is proposed in [26] for improving the robustness of the LPC analysis in these cases by averaging the inverse filters obtained for different orders but this has not been evaluated in this study. Secondly, in voiced consonants there are often additional excitations arising from turbulence at points of vocal tract constriction. The effect of these on the speech signal is equivalent to the addition of colored noise onto the glottal volume velocity waveform. This noise will partially mask the closure impulses and may also have an adverse effect on the filter obtained from the LPC analysis. It is our experience however, that these phonemes nevertheless generate detectable energy peaks in the LPC residual at closure; this is confirmed by the results reported in Section IV. Although covariance LPC is preferred for estimating inverse filtered waveforms such as Fig. 1(c) [13], we have used autocorrelation LPC to derive the residual signal that is used for GCI detection because it offers increased robustness and has less sensitivity to the alignment between analysis frames and larynx cycles [27].

The use of a group delay measure to determine the acoustic excitation instants was first proposed in [23] and later refined in [24] and [25]. The method calculates the frequency-averaged group delay over a sliding window applied to the LPC residual. It has been found to be an effective way of locating the GCIs and the authors have demonstrated its robustness to additive noise. The technique was extended in [28], [29] in order to capture GCIs that were missed by the original algorithms and, through the use of dynamic programming, to eliminate spurious detections so as to identify more reliably those that correspond to true glottal closures rather than to glottal openings or other events. In [2], two alternative methods of identifying excitation instants were proposed, both related to the group delay. These were applied to the problem of inter-segment coherence in concatenative speech synthesis.

In Section II we define the four group delay measures to be evaluated in this paper. Three of these have been described elsewhere [2], [25] and the fourth is a new energy-weighted measure which we introduce here. In Section III we examine the theoretical properties of the measures and illustrate aspects of their behavior using synthetic signals. In Section IV we provide a quantitative evaluation of their performance in identifying GCIs in real speech. Included in our database recordings is a Laryngograph signal which provides a direct measurement of glottal activity and allows an objective assessment of accuracy. We examine in detail the effects of analysis window length on performance and we identify the tradeoffs that exist between detection rate and timing accuracy. We also evaluate the use of input signals other than the LPC residual. In Section V we examine the computational cost of evaluating the measures and we propose new efficient recursive procedures that significantly reduce this cost.

II. GROUP DELAY

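Given an input signal $u(n)$, we consider an $N$-sample windowed segment beginning at sample $n$

$$x_n(r) = w(r)\,u(n+r), \qquad r = 0, 1, \ldots, N-1. \tag{1}$$

The Fourier transform of $x_n(r)$ at a frequency $2\pi k/N$ is

$$X_n(k) = \sum_{r=0}^{N-1} x_n(r)\,e^{-j2\pi kr/N} \tag{2}$$

where $k$ can vary continuously. The group delay of $x_n(r)$ is given by [24]

$$\tau_n(k) = -\frac{N}{2\pi}\,\frac{d\,\arg X_n(k)}{dk} = \Re\!\left(\frac{\tilde{X}_n(k)}{X_n(k)}\right) \tag{3}$$

where $\tilde{X}_n(k)$ is the Fourier transform of $r\,x_n(r)$.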

The motivation for using the group delay is that it is able to identify the position of an impulse within the analysis window. If $x_n(r) = \delta(r - r_0)$, where $\delta(r)$ is the unit impulse function, then it follows directly from (3) that $\tau_n(k) = r_0$ for all $k$. In the presence of noise, however, $\tau_n(k)$ will no longer be constant and we need to form some sort of average over $k$. In Section II-A, we sample the spectrum by restricting $k$ to integer values and we describe four measures, $d_{AV}(n)$, $d_{ZF}(n)$, $d_{EW}(n)$, and $d_{EP}(n)$, that perform this averaging in different ways to generate alternative estimates of the delay from the start of the window to the impulse.
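The impulse property stated above can be checked numerically. The following is a minimal sketch, assuming the group delay at integer frequency index $k$ is computed as the real part of the ratio of two DFTs (the DFT of $r\,x(r)$ over the DFT of $x(r)$); the function name `group_delay` and the values of `N` and `r0` are illustrative choices, not taken from the paper.

```python
import numpy as np

def group_delay(x):
    """Group delay at integer frequency indices k = 0..N-1.

    Computed as Re(X_tilde(k) / X(k)), where X is the DFT of x(r)
    and X_tilde is the DFT of r * x(r).
    """
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    X = np.fft.fft(x)
    X_tilde = np.fft.fft(r * x)
    return np.real(X_tilde / X)

N, r0 = 16, 5          # illustrative window length and impulse position
x = np.zeros(N)
x[r0] = 1.0            # unit impulse at position r0 within the window
tau = group_delay(x)   # expected to equal r0 at every frequency index
```

For a noise-free impulse every element of `tau` equals the impulse position, which is why any sensible average over $k$ recovers it.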

A. Average Group Delay

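The frequency-averaged group delay is given by

$$d_{AV}(n) = \frac{1}{N}\sum_{k=0}^{N-1}\tau_n(k) = \frac{1}{N}\sum_{k=0}^{N-1}\frac{\tilde{X}_n(k)}{X_n(k)} \tag{4}$$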


where the conjugate symmetry of $X_n(k)$ and $\tilde{X}_n(k)$ ensures that the latter summation is real. The use of $d_{AV}(n)$ was proposed in [23] as a way of estimating the GCIs and was later refined in [24] and [25]. Direct evaluation of (4) requires two Fourier transforms per output sample but the computation may be reduced by the recursive formulae given in Section V. A disadvantage of this measure is that if $X_n(k)$ approaches zero for some $k$, then the resultant quotient will dominate the summation in (4) and may result in a very large value for $d_{AV}(n)$. To avoid such extreme values we have found it essential to follow the recommendation in [25] that a 3-term median filter be applied to $\tau_n(k)$ along the $k$ axis before performing the summation in (4).
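A direct (non-recursive) evaluation of the averaged group delay for a single windowed segment might look like the sketch below. It assumes the 3-term median filter wraps around at the ends of the $k$ axis, which the text does not specify; the function name `average_group_delay` and the test signal are hypothetical.

```python
import numpy as np

def average_group_delay(x, use_median=True):
    """Frequency-averaged group delay of one windowed segment x(r)."""
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    X = np.fft.fft(x)
    X_tilde = np.fft.fft(r * x)      # DFT of r * x(r)
    tau = np.real(X_tilde / X)       # group delay at each integer k
    if use_median:
        # 3-term median along k; circular wrapping is an assumption here
        neighbours = np.stack([np.roll(tau, 1), tau, np.roll(tau, -1)])
        tau = np.median(neighbours, axis=0)
    return float(np.mean(tau))

x = np.zeros(32)
x[7] = 1.0                           # impulse at position 7
d_clean = average_group_delay(x)     # noise-free case

rng = np.random.default_rng(0)       # fixed seed for repeatability
d_noisy = average_group_delay(x + 0.05 * rng.standard_normal(32))
```

The median step only suppresses isolated outliers in the per-bin group delay; it does not bound the measure if several adjacent bins have near-zero magnitude.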

B. Zero-Frequency Group Delay

The group delay at $k = 0$ was proposed in [2] as a way of estimating the instant of excitation and is given by

$$d_{ZF}(n) = \tau_n(0) = \frac{\tilde{X}_n(0)}{X_n(0)} = \frac{\sum_{r=0}^{N-1} r\,x_n(r)}{\sum_{r=0}^{N-1} x_n(r)} \tag{5}$$

This measure may be interpreted as the "center of gravity" of $x_n(r)$. Although easy to calculate, it is, as we shall see, sensitive to noise and its value is unbounded if the mean value of $x_n(r)$ approaches zero. Because of this, we have found it necessary to apply a median filter to $d_{ZF}(n)$ after evaluating (5).

C. Energy-Weighted Group Delay

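The problem of unbounded terms in the summation of (4) may be circumvented by weighting each term by $|X_n(k)|^2$, the energy at frequency index $k$. This leads us to propose a new measure, the energy-weighted group delay, defined by

$$d_{EW}(n) = \frac{\sum_{k=0}^{N-1} |X_n(k)|^2\,\tau_n(k)}{\sum_{k=0}^{N-1} |X_n(k)|^2} \tag{6}$$

This expression may be simplified by noting that

$$\sum_{k=0}^{N-1} |X_n(k)|^2\,\tau_n(k) = \Re\sum_{k=0}^{N-1} \tilde{X}_n(k)\,X_n^*(k) = N\sum_{r=0}^{N-1} r\,x_n^2(r) \tag{7}$$

Substituting this into (6), and applying Parseval's theorem to the denominator, gives

$$d_{EW}(n) = \frac{\sum_{r=0}^{N-1} r\,x_n^2(r)}{\sum_{r=0}^{N-1} x_n^2(r)} \tag{8}$$

which may be viewed as the "center of energy" of $x_n(r)$. The new measure, $d_{EW}(n)$, thus has an efficient time-domain formulation. Unlike the previous measures it is bounded and lies in the range 0 to $N-1$ provided that $x_n(r)$ is not identically zero.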
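The equivalence between the frequency-domain definition and the time-domain "center of energy" form can be illustrated numerically. This is a sketch under the assumption that the energy-weighted sum of group delays equals the real part of the sum of $\tilde{X}(k)X^*(k)$; the function names and the random test segment are illustrative only.

```python
import numpy as np

def d_ew_frequency(x):
    """Energy-weighted group delay from the DFT-domain definition."""
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    X = np.fft.fft(x)
    X_tilde = np.fft.fft(r * x)
    # |X|^2 * Re(X_tilde / X) equals Re(X_tilde * conj(X)) at each k
    numerator = np.sum(np.real(X_tilde * np.conj(X)))
    denominator = np.sum(np.abs(X) ** 2)
    return numerator / denominator

def d_ew_time(x):
    """Energy-weighted group delay as the center of energy of x(r)."""
    x = np.asarray(x, dtype=float)
    energy = x ** 2
    return np.sum(np.arange(len(x)) * energy) / np.sum(energy)

rng = np.random.default_rng(1)       # fixed seed for repeatability
segment = rng.standard_normal(64)    # arbitrary windowed segment
freq_value = d_ew_frequency(segment)
time_value = d_ew_time(segment)      # agrees with freq_value to rounding error
```

The time-domain form needs no Fourier transform at all, which is the practical appeal of this measure.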

D. Energy-Weighted Phase

Equation (8) may be viewed as a weighted average of $r$ using $x_n^2(r)$ as the weighting factors. An alternative way of averaging $r$ is to associate the sample positions within the window with complex numbers of the form $e^{j2\pi(r+\frac{1}{2})/N}$, evenly spaced around the unit circle on the complex plane. To form the energy-weighted phase, we take a weighted average of these complex numbers using $x_n^2(r)$ as the weighting factors and then multiply its argument by $N/2\pi$ to convert back to a delay. This gives

$$d_{EP}(n) = \frac{N}{2\pi}\,\arg\!\left(\sum_{r=0}^{N-1} x_n^2(r)\,e^{j2\pi(r+\frac{1}{2})/N}\right) - \frac{1}{2} \tag{9}$$

where $0 \le \arg(\cdot) < 2\pi$. The discontinuity in $\arg(\cdot)$ has been chosen to lie midway between the complex numbers associated with $r = 0$ and $r = N-1$. It is clear from (9) that $d_{EP}(n)$ always lies in the range $-\frac{1}{2}$ to $N-\frac{1}{2}$. A measure similar to $d_{EP}(n)$ was used in [2] for aligning waveform segments in a speech synthesis system. The relationship to the energy-weighted group delay as described above and the noise immunity described in Section III-B provide useful new insights into the properties of this measure.
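The energy-weighted phase can be sketched in a few lines. The code below assumes that sample position $r$ maps to the complex number $e^{j2\pi(r+1/2)/N}$ and that the argument is taken in $[0, 2\pi)$, which places the discontinuity on the positive real axis, midway between the points for $r = N-1$ and $r = 0$; the function name is hypothetical.

```python
import numpy as np

def energy_weighted_phase(x):
    """Energy-weighted phase delay of one windowed segment x(r)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.arange(n)
    # Energy-weighted sum of unit-circle points, offset by half a sample
    s = np.sum(x ** 2 * np.exp(2j * np.pi * (r + 0.5) / n))
    # np.angle returns (-pi, pi]; map it into [0, 2*pi)
    angle = np.angle(s) % (2.0 * np.pi)
    return n * angle / (2.0 * np.pi) - 0.5

N = 101
first = np.zeros(N); first[0] = 1.0        # impulse at r = 0
middle = np.zeros(N); middle[50] = 1.0     # impulse at the window center
last = np.zeros(N); last[N - 1] = 1.0      # impulse at r = N - 1

d_first = energy_weighted_phase(first)
d_middle = energy_weighted_phase(middle)
d_last = energy_weighted_phase(last)
```

With this convention an impulse at the window center maps onto the negative real axis, as far as possible from the discontinuity.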

III. PROPERTIES OF GROUP DELAY MEASURES

In Section IV we will use the delay measures defined above to identify the excitation instants in the LPC residual from real speech. In this Section however, we gain insight into their properties by examining their behavior with synthetic signals that consist of impulses with additive white Gaussian noise. The properties that we observe are consistent with those reported in [23], [25] but we extend the study here to include an analysis of multiple impulses and a quantitative comparison between the different measures.

A. Effect of Window Length

An idealized version of the LPC residual waveform is shown in Fig. 2(a) and consists of an impulse train with additive white Gaussian noise at 10 dB SNR. The dominant pulse period is 100 samples with an additional pulse in the fourth period and with the amplitude of the third pulse half that of the others. It is convenient to shift the time-origin of the sliding window, $n$ in (1), to its central point by defining

$$y_*(n) = d_*\!\left(n - \frac{N-1}{2}\right) - \frac{N-1}{2} \tag{10}$$

where $*$ is one of $AV$, $ZF$, $EW$, $EP$. Note that if $N$ is even, $y_*(n)$ is defined for values of $n$ midway between the integers since the argument of $d_*$ must always be an integer.

Fig. 2(b)–(e) shows the waveform of $y_{EW}(n)$ for four different values of window length, $N$, where $w(r)$ is chosen to be a symmetric Hamming window of period $N$. The effect of varying the window length is broadly similar for all measures, so we will discuss it in detail only for $y_{EW}(n)$.

Fig. 2. (a) Impulse train with a dominant period of 100 samples and an SNR of 10 dB. (b)–(e) The waveform of $y_{EW}(n)$ for different window lengths, $N$. The circles mark the negative-going zero crossings (NZCs).

All four measures from Section II give the correct result for a noise-free impulse; i.e., if $x_n(r) = \delta(r - r_0)$ then $d_*(n) = r_0$.

All the measures also possess a form of shift invariance so that

and then

(11)

and so the graph of $y_*(n)$ has a gradient of $-1$ under these circumstances. Although these conditions do not quite hold in this example because of the added noise, they are almost true when an impulse is near the center of the window and $N$ does not exceed the impulse period. For these cases therefore, we see in Fig. 2(b) and (c) that $y_{EW}(n)$ has a negative-going zero crossing (NZC) with a gradient of approximately $-1$ whenever an impulse is present at $n$. Each NZC is marked with a circle.

In Fig. 2(c), the window size equals the period, resulting in a clearly defined NZC for each impulse without the introduction of any spurious NZCs. However when the window size is much less than the period as in Fig. 2(b), there are intervals between each impulse where the window contains only noise. In these intervals $y_{EW}(n)$ is almost flat and numerous spurious NZCs are introduced. The local gradient at these spurious NZCs is close to 0 rather than $-1$ and this provides a possible way of identifying them.

As the window size is increased, it becomes common for two or more impulses to lie within the window and individual impulses may no longer be resolved. Thus in Fig. 2(d), we see that the two impulses that are closest together (40 samples separation) have resulted in a single NZC approximately midway between them. As the window length is increased further in Fig. 2(e), each impulse now contains only a small fraction of the energy in the window. This means that the amplitude of the $y_{EW}(n)$ waveform is low and the timing accuracy with which impulse locations can be identified degrades. In this example, the low amplitude third impulse contains so little energy compared to other nearby pulses that it fails to generate an NZC at all.

The example of Fig. 2 therefore illustrates the way in which the ability of $y_{EW}(n)$ to detect impulses depends on the ratio of the window length to the input signal period. As we shall see in Section IV the choice of window length is a compromise: a window that is too short will introduce many spurious NZCs while a window that is too long may result in failure to detect some of the true GCIs.
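The sliding-window behavior described in this subsection can be reproduced on a small noise-free example. The sketch below assumes the energy-weighted measure, a symmetric Hamming window of odd length, and a delay expressed relative to the window center; the helper names, the 81-sample window and the 400-sample test signal are illustrative and not taken from the paper.

```python
import numpy as np

def centered_ew_delay(u, n_win):
    """Energy-weighted delay relative to the window center, per sample.

    n_win must be odd. Samples where the window overruns the signal, or
    where the window contains no energy, are returned as NaN.
    """
    w = np.hamming(n_win)
    half = (n_win - 1) // 2
    r = np.arange(n_win) - half              # positions relative to center
    y = np.full(len(u), np.nan)
    for c in range(half, len(u) - half):
        energy = (w * u[c - half:c + half + 1]) ** 2
        total = energy.sum()
        if total > 0:
            y[c] = np.dot(r, energy) / total
    return y

def negative_zero_crossings(y):
    """Indices n where y(n) >= 0 and y(n + 1) < 0."""
    return np.flatnonzero((y[:-1] >= 0) & (y[1:] < 0))

u = np.zeros(400)
u[[100, 200, 300]] = 1.0                     # impulse train, period 100
y = centered_ew_delay(u, n_win=81)
nzc = negative_zero_crossings(y)             # one crossing per impulse
gradient = y[101] - y[100]                   # slope through the crossing
```

Because the example is noise-free and the window is shorter than the period, the stretches between impulses contain no energy and are left as NaN; with added noise those stretches would instead produce the flat segments and spurious crossings discussed above.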

Fig. 3. Variation of $d_{AV}$, $d_{ZF}$, $d_{EW}$ and $d_{EP}$ as the signal-to-noise ratio (SNR) varies over a range extending to +30 dB for an input consisting of a single impulse at $r_0 = 20$ with additive white Gaussian noise in a window of length $N = 101$. For each measure, the graph shows the median value of $d$ and the upper and lower quartiles.

B. Robustness to Noise

To assess the effect of noise on the delay measures, we have applied them to a signal $u(n)$ consisting of a single impulse with additive white Gaussian noise. Fig. 3 shows the behavior of each measure as the SNR is varied for an impulse at sample $r_0 = 20$ within a rectangular window of length $N = 101$. For each measure, the corresponding graph shows the median value of $d$ and the upper and lower quartiles. We use the median rather than the mean because of the unbounded values sometimes generated by $d_{AV}$ and $d_{ZF}$. At an SNR of +30 dB all measures correctly give $d = 20$ with a very small inter-quartile range. As the SNR is reduced all measures show an increasing spread and a progressive bias with the median values tending to 50, the center of the window. The most robust measure is $d_{EP}$ whose median value is barely affected by noise

until the SNR falls below

. For this measure, the effect of

the noise is to add onto the summation in (9) a random complex number of arbitrary phase. It follows that the noise will not affect the median value of $d_{EP}$ unless the noise amplitude is large enough to cause the value of the summation to cross the positive real axis where there is a discontinuity in the $\arg$ function. For impulses near the centre of the window, the summation in (9) lies on or near the negative real axis and so for positive SNR values, the noise has little effect on the median of $d_{EP}$.

The measure whose median is most sensitive to noise is $d_{EW}$ for which the effects are noticeable in Fig. 3 for SNRs as high as 14 dB. Since this measure calculates the center of energy of the windowed signal, the bias introduced depends directly on the SNR and at an SNR of 0 dB, for example, $d_{EW}$ will be halfway between $r_0$ and the window center. The median curves for $d_{AV}$ and $d_{ZF}$ are almost identical to each other and lie between those of the other two measures with significant bias only for SNRs worse than 5 dB. Although low levels of noise have little effect

on the median value of

, they have a substantial effect on

its inter-quartile range which is considerably larger than that of

the other measures.
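The "halfway" behavior of the energy-weighted measure at 0 dB can be seen from a short approximate calculation. The sketch below assumes a rectangular window, an impulse of amplitude $A$ at position $r_0$, zero-mean white noise of variance $\sigma^2$, an SNR defined as $\rho = A^2/(N\sigma^2)$ (impulse energy over total noise energy in the window), and that the expectation of the ratio in (8) is approximated by the ratio of expectations. These definitions are assumptions made for illustration and are not stated in the text.

```latex
\begin{align*}
E\!\left[x_n^2(r)\right] &= A^2\,\delta(r - r_0) + \sigma^2 \\
d_{EW} &\approx \frac{\sum_{r=0}^{N-1} r\,E[x_n^2(r)]}{\sum_{r=0}^{N-1} E[x_n^2(r)]}
        = \frac{A^2 r_0 + \sigma^2\,\frac{N(N-1)}{2}}{A^2 + N\sigma^2}
        = \frac{\rho\,r_0 + \frac{N-1}{2}}{\rho + 1} \\
\rho = 1 \;(0\ \text{dB}) \quad&\Rightarrow\quad d_{EW} \approx \frac{1}{2}\left(r_0 + \frac{N-1}{2}\right)
\end{align*}
```

Under these assumptions, $r_0 = 20$ and $N = 101$ give $d_{EW} \approx 35$ at 0 dB, midway between the impulse position and the window center at 50.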

Fig. 4. Graph shows, as a function of SNR, how far an impulse must be from the center of a 101 sample window to ensure that $y_{AV}$, $y_{ZF}$, $y_{EW}$ and $y_{EP}$ have the correct sign with a probability of 75%.

When noise is added to an impulse train like that in Fig. 2(a) the NZCs are affected in two ways. Firstly, the bias toward the window center means that $y_*(n)$ is pulled toward zero either side of the NZC and so its gradient will be less steep. It is possible, therefore, to use the gradient of $y_*(n)$ at an NZC to estimate the SNR of the signal. The second effect is that the combination of the bias and the increased variance will add uncertainty to the position of the NZC. Fig. 4 shows, as a function of SNR, how far an impulse must be from the center of a 101 sample window for the upper or lower quartile to lie exactly at the center of the window, i.e., how far the impulse must be from the center for $y_*(n)$ to have a probability of 0.75 of having the correct sign. We can view this as a measure of how accurately the position of the impulse will be located and of how this accuracy degrades with noise.

with noise. The algorithms attain a precision of 5 samples (5%

of the window length) with 75% probability at SNR levels of

11.9,

, and for the , , and

measures, respectively. This indicates that the timing of

the NZCs is least affected by noise when using

and is most

affected when using

C. Response to Multiple Impulses

It is possible for the analysis window to contain multiple impulses either because the window is longer than the pulse period or because, as is often the case with the LPC residual, the signal includes additional pulses or other impulsive features. We consider here the behavior of the measures when the window contains two impulses. From the shift invariance property, (11), we may, without loss of generality, take the impulses to be at positions $0$ and $m$ giving

$$x_n(r) = (1-a)\,\delta(r) + a\,\delta(r-m) \tag{12}$$

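where the factor $a$ lies in the range 0 to 1 and determines the relative amplitude of the two impulses. We can evaluate the four measures analytically (see Appendix) to obtain the following exact results. It is convenient to express them in terms of $b = a/(a-1)$ which ranges from 0 to $-\infty$ and is the negative of the ratio of the impulse magnitudes

$$\begin{aligned}
d_{AV} &= \frac{m\,b^{N/\gcd(m,N)}}{b^{N/\gcd(m,N)} - 1} \\
d_{ZF} &= \frac{m\,b}{b - 1} \\
d_{EW} &= \frac{m\,b^2}{b^2 + 1} \\
d_{EP} &= \frac{N}{2\pi}\,\arg\!\left(1 + b^2 e^{j2\pi m/N}\right)
\end{aligned} \tag{13}$$

where $\gcd(m,N)$ denotes the greatest common divisor and the equation for $d_{EP}$ should be regarded as modulo $N$ with $-\frac{1}{2} \le d_{EP} < N - \frac{1}{2}$. Fig. 5 plots the expressions from (13) versus $a$ for the particular case of $N = 101$ and $m = 40$.

Fig. 5. Values of $d_{AV}$, $d_{ZF}$, $d_{EW}$ and $d_{EP}$ for a signal containing impulses at samples 0 and 40 of amplitudes $(1-a)$ and $a$, respectively. The window length is 101 and $a$ varies between 0 and 1.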

As $a$ varies from 0 to 1 all the measures change from $0$ to $m$. Measure $d_{ZF}$ equals the center of gravity of the pair of impulses and it therefore changes linearly with $a$. Measure $d_{EW}$ on the other hand, which equals the center of gravity of the squared input signal, is biased toward the position of the larger impulse giving rise to the S-shaped curve shown. In the expression for $d_{AV}$, the exponent of $b$ depends on $\gcd(m,N)$ and is, for this case, equal to 101. Because this is so high, $d_{AV}$ makes an extremely abrupt transition at $a = 0.5$ and this measure essentially locates the position of the highest peak in the window. It is possible to obtain a similar behavior for $d_{EW}$ or $d_{EP}$ by increasing the exponent of $x_n(r)$ in (8) or (9) but we have found that this does not improve their performance with real speech and so we do not discuss the resultant measures in detail. The behavior of $d_{EP}$ varies according to the separation of the two impulses. When they are close to each other it is almost the same as $d_{EW}$ but as their separation increases to half the window length its graph approaches that of $d_{AV}$. For separations greater than $N/2$ the graph changes completely and, as $a$ increases from 0, $d_{EP}$ decreases toward $-\frac{1}{2}$, wrapping around abruptly to $N - \frac{1}{2}$ then continuing down to $m$.
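The two-impulse behavior can be explored numerically without relying on closed-form expressions. The sketch below assumes impulses of amplitude $1-a$ at sample 0 and $a$ at sample 40 in a rectangular 101-sample window, and evaluates three of the measures directly from their definitions with no median filtering; the function name `two_impulse_measures` and the sampled values of `a` are illustrative.

```python
import numpy as np

N, M = 101, 40      # window length and impulse separation used in the text

def two_impulse_measures(a):
    """Return (d_av, d_zf, d_ew) for impulses (1 - a) at 0 and a at M."""
    x = np.zeros(N)
    x[0] = 1.0 - a
    x[M] = a
    r = np.arange(N)
    X = np.fft.fft(x)
    X_tilde = np.fft.fft(r * x)
    d_av = float(np.mean(np.real(X_tilde / X)))         # average group delay
    d_zf = float(np.sum(r * x) / np.sum(x))             # center of gravity
    d_ew = float(np.sum(r * x**2) / np.sum(x**2))       # center of energy
    return d_av, d_zf, d_ew

low = two_impulse_measures(0.25)    # first impulse dominates
high = two_impulse_measures(0.75)   # second impulse dominates
```

At `a = 0.25` the center of gravity sits a quarter of the way along, the center of energy is pulled further toward the larger impulse, and the averaged group delay has already snapped to the larger impulse; the pattern mirrors at `a = 0.75`.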

IV. EVALUATION WITH SPEECH SIGNALS

The four measures defined in Section II have been evaluated using the sentence subset of the APLAWD database [30] recorded anechoically at a sample rate of 20 kHz with a lip-to-microphone distance of 15 cm. The database includes a Laryngograph channel which provides a direct measurement of glottal activity [5], [20] and allows the instants of glottal closure to be determined using the HQTx program from the Speech Filing System software suite [31], [32]. The database includes ten repetitions from each of ten British English speakers (five male, five female) of the following sentences:

S1: “George made the girl measure a good blue vase;”

S2: “Why are you early you owl?”

S3: “Cathy hears a voice amongst SPAR’s data;”

S4: “Be sure to fetch a ﬁle and send their’s off to Hove;”

S5: “Six plus three equals nine;”
