A Quantitative Assessment of Group Delay Methods for Identifying Glottal Closures in Voiced Speech



456 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 2, MARCH 2006
Mike Brookes, Member, IEEE, Patrick A. Naylor, Member, IEEE, and Jon Gudnason, Member, IEEE
Abstract—Measures based on the group delay of the LPC
residual have been used by a number of authors to identify the
time instants of glottal closure in voiced speech. In this paper, we
discuss the theoretical properties of three such measures and we
also present a new measure having useful properties. We give a
quantitative assessment of each measure’s ability to detect glottal
closure instants evaluated using a speech database that includes a
direct measurement of glottal activity from a Laryngograph/EGG
signal. We find that when using a fixed-length analysis window, the
best measures can detect the instant of glottal closure in 97% of
larynx cycles with a standard deviation of 0.6 ms and that in 9% of
these cycles an additional excitation instant is found that normally
corresponds to glottal opening. We show that some improvement
in detection rate may be obtained if the analysis window length
is adapted to the speech pitch. If the measures are applied to
the preemphasized speech instead of to the LPC residual, we
find that the timing accuracy worsens but the detection rate
improves slightly. We assess the computational cost of evaluating
the measures and we present new recursive algorithms that give a
substantial reduction in computation in all cases.
Index Terms—Closed phase, glottal closure, group delay, speech
analysis.
I. INTRODUCTION
IN VOICED SPEECH, the primary acoustic excitation normally
occurs at the instant of vocal-fold closure. This marks
the start of the closed-phase interval during which there is little
or no airflow through the glottis. There are several areas of
speech processing in which it is helpful to be able to identify
the glottal closure instants (GCIs) and/or the closed-phase in-
tervals. Recent interest has concentrated on PSOLA-based con-
catenative synthesis and voice-morphing techniques in which
the identification of the GCIs is necessary to preserve coher-
ence across segment boundaries [1], [2]. More generally, accu-
rate identification of the closed phases allows the blind decon-
volution of the vocal tract and glottal source through the use
of closed phase analysis and modeling [3]–[8]. The resultant
characterization of the glottal source gives benefits to speaker
identification systems [9]–[11] and potential benefits to speech
recognition systems and low-bit rate coders. The determination
of glottal closure instants is also important in the clinical diag-
nosis and treatment of voice pathologies.
Manuscript received June 10, 2003; revised February 16, 2005. This work
was supported by EPSRC under Grant GR/N01569. The associate editor coor-
dinating the review of this manuscript and approving it for publication was Dr.
Ramesh A. Gopinath.
The authors are with Imperial College, London SW7 2BT, U.K. (e-mail:
mike.brookes@imperial.ac.uk; p.naylor@imperial.ac.uk; jon.gudnason@im-
perial.ac.uk).
Digital Object Identifier 10.1109/TSA.2005.857810
Fig. 1. (a) A 12.5 ms speech waveform of male voice, phoneme /a/, (b)
laryngograph waveform, (c) estimated glottal volume velocity, and (d)
autocorrelation LPC residual from preemphasised speech.
The accurate identification of GCIs has been an aim of
speech researchers for many years and numerous techniques
have been proposed. The most widely used approach is to
look for discontinuities in a linear model of speech production
[11]–[14]. An alternative is to search for energy peaks in
waveforms derived from the speech signal [8], [15], [16] or
for features in its time-frequency representation [17], [18]. To
obtain good results in closed-phase speech processing, it is
essential to identify the time of glottal excitation at closure to
within a fraction of 1 ms whereas locating the precise glottal
opening instant is normally much less critical [3], [10], [19].
In Fig. 1, waveform (a) shows a 12.5 ms segment of male
speech from the vowel /a/. Waveform (b) shows a simultaneous
Laryngograph recording (also called Electroglottograph or EGG)
which measures the electrical conductance of the larynx at
2 MHz and provides a direct indication of glottal activity
[5], [20]. The positions of the glottal closure and opening
instants are indicated on this waveform as P and Q, respectively,
and the interval PQ is the closed phase of the larynx cycle.
Acoustic theory shows that, for vowel sounds, the vocal tract
acts as an all-pole filter whose input is the volume velocity
(also called volume flow rate) of air through the glottis [21].
The estimate of this volume velocity shown as waveform (c)
was obtained by applying covariance LPC to the closed-phase
speech segment PQ, filtering the speech by the resultant all-zero
inverse filter and then applying a leaky integrator to the result
to compensate for lip radiation [13], [21]. By restricting the
analysis to the closed-phase in this way, we obtain an estimate of
the vocal tract filter that is unperturbed by the glottal excitation.
The low frequency fidelity of the volume velocity waveform
estimate can be improved by correcting for phase distortion in
the recording process [22] but the important features can be
seen in the uncorrected waveform, namely a rapid decrease at
glottal closure (P) and a less abrupt increase at opening (Q).
Waveform (d) is the LPC residual obtained by applying the
LPC inverse filter to a preemphasised speech waveform. The
use of preemphasis and the omission of any compensation for
lip radiation mean that the waveform is approximately equal to
the second derivative of the volume velocity. It can be seen that
this waveform includes an impulsive feature at closure (P) and
a similar but smaller impulse at opening (Q). The use of this
LPC residual waveform for detecting glottal closure instants
using methods such as those proposed in [12]–[14], [23]–[25]
requires the following assumptions: (i) the vocal tract acts as an
all-pole filter, (ii) the filter can be estimated adequately from the
speech waveform alone and (iii) the LPC residual will contain
an identifiable impulse at closure for voiced speech sounds.
Assumptions (i) and (ii) are discussed later in this Section.
The main contributions of this paper are (a) to demonstrate
that assumption (iii) is correct for a large proportion of larynx
cycles, (b) to introduce a new energy-weighted group-delay
measure as a means of locating the impulse, (c) to give a
quantitative assessment of the new measure's performance and
a comparative evaluation of three other measures based on
group-delay, and (d) to provide efficient recursive algorithms
for the computation of all four measures.
The all-pole filter model of the vocal tract is less good
for voiced consonants than for vowel sounds for two reasons.
Firstly, the closed oral cavity in nasal consonants introduces
zeros into the vocal tract filter response. For these phonemes
therefore, the vocal tract is poorly modeled and in some
speakers closure impulses are not apparent in the residual. A
method is proposed in [26] for improving the robustness of
the LPC analysis in these cases by averaging the inverse filters
obtained for different orders but this has not been evaluated
in this study. Secondly, in voiced consonants there are often
additional excitations arising from turbulence at points of vocal
tract constriction. The effect of these on the speech signal is
equivalent to the addition of colored noise onto the glottal
volume velocity waveform. This noise will partially mask the
closure impulses and may also have an adverse effect on the
filter obtained from the LPC analysis. It is our experience
however, that these phonemes nevertheless generate detectable
energy peaks in the LPC residual at closure; this is confirmed
by the results reported in Section IV. Although covariance LPC
is preferred for estimating inverse filtered waveforms such as
Fig. 1(c) [13], we have used autocorrelation LPC to derive the
residual signal that is used for GCI detection because it offers
increased robustness and has less sensitivity to the alignment
between analysis frames and larynx cycles [27].
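The residual computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lpc_residual`, the model order, and the preemphasis coefficient are all assumptions, since the text here does not state them. The sketch applies optional first-order preemphasis, solves the autocorrelation normal equations for the predictor, and applies the resulting all-zero inverse filter.

```python
import numpy as np

def lpc_residual(speech, order=2, preemph=0.0):
    """Autocorrelation-method LPC residual (illustrative sketch).

    `order` and `preemph` are hypothetical parameters; the paper's
    actual settings are not given in this section.
    """
    s = np.asarray(speech, dtype=float)
    # Optional first-order preemphasis: s[n] - preemph * s[n-1]
    s = np.append(s[0], s[1:] - preemph * s[:-1])
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    # All-zero inverse filter A(z) = 1 - sum_k a_k z^-k
    inverse_filter = np.concatenate(([1.0], -a))
    residual = np.convolve(s, inverse_filter)[:len(s)]
    return residual, a

# Synthetic check: an impulse train driving a known all-pole filter
excitation = np.zeros(2000)
excitation[::100] = 1.0
y = np.zeros(2000)
for n in range(2000):
    y[n] = excitation[n]
    if n >= 1:
        y[n] += 1.3 * y[n - 1]
    if n >= 2:
        y[n] -= 0.8 * y[n - 2]

residual, coeffs = lpc_residual(y, order=2)
```

On this synthetic input the residual should be close to the original impulse train, which is the impulsive feature that assumption (iii) relies on for real speech.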
The use of a group delay measure to determine the acoustic
excitation instants was first proposed in [23] and later refined in
[24] and [25]. The method calculates the frequency-averaged
group delay over a sliding window applied to the LPC residual.
It has been found to be an effective way of locating the GCIs
and the authors have demonstrated its robustness to additive
noise. The technique was extended in [28], [29] in order to
capture GCIs that were missed by the original algorithms and,
through the use of dynamic programming, to eliminate spurious
detections so as to identify more reliably those that correspond
to true glottal closures rather than to glottal openings or other
events. In [2], two alternative methods of identifying excitation
instants were proposed, both related to the group delay. These
were applied to the problem of inter-segment coherence in
concatenative speech synthesis.
In Section II we define the four group delay measures to be
evaluated in this paper. Three of these have been described
elsewhere [2], [25] and the fourth is a new energy-weighted measure
which we introduce here. In Section III we examine the
theoretical properties of the measures and illustrate aspects of their
behavior using synthetic signals. In Section IV we provide a
quantitative evaluation of their performance in identifying GCIs in real
speech. Included in our database recordings is a Laryngograph
signal which provides a direct measurement of glottal activity and
allows an objective assessment of accuracy. We examine in
detail the effects of analysis window length on performance and we
identify the tradeoffs that exist between detection rate and timing
accuracy. We also evaluate the use of input signals other than the
LPC residual. In Section V we examine the computational cost of
evaluating the measures and we propose new efficient recursive
procedures that significantly reduce this cost.
II. GROUP DELAY
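Given an input signal $u(r)$, we consider an $N$-sample windowed
segment beginning at sample $n$
$$x_n(r) = w(r)\,u(n+r) \quad \text{for } r = 0, \ldots, N-1. \tag{1}$$
The Fourier transform of $x_n(r)$ at a frequency $\omega = 2\pi k/N$ is
$$X_n(k) = \sum_{r=0}^{N-1} x_n(r)\,e^{-j2\pi rk/N} \tag{2}$$
where $k$ can vary continuously. The group delay of $x_n(r)$ is
given by [24]
$$\tau_n(k) = -\frac{d\,\arg X_n}{d\omega} = \Re\!\left(\frac{\tilde{X}_n(k)}{X_n(k)}\right) \tag{3}$$
where $\tilde{X}_n(k)$ is the Fourier transform of $r\,x_n(r)$.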
The motivation for using the group delay is that it is able to
identify the position of an impulse within the analysis window.
If $x_n(r) = \delta(r - r_0)$, where $\delta(r)$ is the unit impulse function,
then it follows directly from (3) that $\tau_n(k) = r_0$ for all $k$. In the
presence of noise, however, $\tau_n(k)$ will no longer be constant and
we need to form some sort of average over $k$. In Section II-A,
we sample the spectrum by restricting $k$ to integer values and we
describe four measures, $d_{AV}$, $d_{ZF}$, $d_{EW}$, and $d_{EP}$ that perform
this averaging in different ways to generate alternative estimates
of the delay from the start of the window to the impulse.
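The impulse-locating property of the group delay can be checked numerically. The sketch below is an illustration only (NumPy and the function name `group_delay` are assumptions, not part of the paper): it evaluates the group delay at integer frequency indices as the real part of the ratio between the DFT of $r\,x(r)$ and the DFT of $x(r)$.

```python
import numpy as np

def group_delay(x):
    """Group delay at each integer frequency index k of a windowed segment x."""
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    X = np.fft.fft(x)            # transform of x(r)
    X_tilde = np.fft.fft(r * x)  # transform of r * x(r)
    return np.real(X_tilde / X)

# A noise-free unit impulse at position 17 in a 64-sample window
x = np.zeros(64)
x[17] = 1.0
tau = group_delay(x)
```

For this noise-free impulse every element of `tau` equals the impulse position, so any form of averaging over frequency recovers it; the measures that follow differ only in how they behave once noise is present.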
A. Average Group Delay
The frequency-averaged group delay is given by
$$d_{AV}(n) = \frac{1}{N}\sum_{k=0}^{N-1}\tau_n(k) = \frac{1}{N}\sum_{k=0}^{N-1}\frac{\tilde{X}_n(k)}{X_n(k)} \tag{4}$$
where the conjugate symmetry of $X_n(k)$ and $\tilde{X}_n(k)$ ensures that the
latter summation is real. The use of $d_{AV}(n)$ was proposed in [23] as
a way of estimating the GCIs and was later refined in [24] and
[25]. Direct evaluation of (4) requires two Fourier transforms
per output sample but the computation may be reduced by the
recursive formulae given in Section V. A disadvantage of this
measure is that if $X_n(k)$ approaches zero for some $k$, then the
resultant quotient will dominate the summation in (4) and may
result in a very large value for $d_{AV}(n)$. To avoid such extreme
values we have found it essential to follow the recommendation
in [25] that a 3-term median filter be applied to $\tau_n(k)$
along the $k$ axis before performing the summation in (4).
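A minimal sketch of the averaged measure with the 3-term median filter follows. The function name `average_group_delay` and the circular treatment of the first and last frequency bins are assumptions; the text does not say how the filter handles the ends of the frequency axis.

```python
import numpy as np

def average_group_delay(x):
    """Frequency-averaged group delay with a 3-term median filter along k."""
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    X = np.fft.fft(x)
    X_tilde = np.fft.fft(r * x)
    tau = np.real(X_tilde / X)  # group delay at each integer k
    # 3-term median over neighbouring bins; wrapping around the ends of
    # the spectrum is an assumption made for this illustration
    neighbours = np.stack([np.roll(tau, 1), tau, np.roll(tau, -1)])
    return float(np.mean(np.median(neighbours, axis=0)))

rng = np.random.default_rng(0)
x = np.zeros(64)
x[17] = 1.0
clean = average_group_delay(x)
noisy = average_group_delay(x + 0.001 * rng.standard_normal(64))
```

The median step is there to suppress the isolated bins where the spectrum magnitude is close to zero and the quotient would otherwise dominate the mean.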
B. Zero-Frequency Group Delay
The group delay at $k = 0$ was proposed in [2] as a way of
estimating the instant of excitation and is given by
$$d_{ZF}(n) = \tau_n(0) = \frac{\tilde{X}_n(0)}{X_n(0)} = \frac{\sum_{r=0}^{N-1} r\,x_n(r)}{\sum_{r=0}^{N-1} x_n(r)} \tag{5}$$
This measure may be interpreted as the center of gravity of
$x_n(r)$. Although easy to calculate, it is, as we shall see, sensitive
to noise and its value is unbounded if the mean value of $x_n(r)$
approaches zero. Because of this, we have found it necessary to
apply a median filter to $d_{ZF}(n)$ after evaluating (5).
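The unbounded behavior is easy to reproduce. In the sketch below (the helper name `zero_frequency_delay` is hypothetical), the center of gravity is exact for a single impulse but falls far outside the window when two impulses of opposite sign almost cancel in the mean.

```python
import numpy as np

def zero_frequency_delay(x):
    """Center of gravity of the windowed segment x."""
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    return np.dot(r, x) / np.sum(x)

# Single impulse: the measure returns its position
single = np.zeros(64)
single[17] = 1.0
d_single = zero_frequency_delay(single)

# Two impulses of opposite sign whose sum is nearly zero
near_cancel = np.zeros(64)
near_cancel[10] = 1.0
near_cancel[30] = -0.99
d_cancel = zero_frequency_delay(near_cancel)
```

Here `d_cancel` is on the order of minus two thousand samples for a 64-sample window, which illustrates why some form of outlier suppression is needed after evaluating this measure.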
C. Energy-Weighted Group Delay
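The problem of unbounded terms in the summation of (4)
may be circumvented by weighting each term by $|X_n(k)|^2$, the
energy at frequency index $k$. This leads us to propose a new
measure, the energy-weighted group delay, defined by
$$d_{EW}(n) = \frac{\sum_{k=0}^{N-1} |X_n(k)|^2\,\tau_n(k)}{\sum_{k=0}^{N-1} |X_n(k)|^2} \tag{6}$$
This expression may be simplified by noting that
$$\sum_{k=0}^{N-1} |X_n(k)|^2\,\tau_n(k) = \Re\!\left(\sum_{k=0}^{N-1} X_n^*(k)\,\tilde{X}_n(k)\right) = N\sum_{r=0}^{N-1} r\,x_n^2(r) \tag{7}$$
Substituting this into (6) gives
$$d_{EW}(n) = \frac{\sum_{r=0}^{N-1} r\,x_n^2(r)}{\sum_{r=0}^{N-1} x_n^2(r)} \tag{8}$$
which may be viewed as the center of energy of $x_n(r)$. The
new measure, $d_{EW}(n)$, thus has an efficient time-domain
formulation. Unlike the previous measures it is bounded and lies
in the range 0 to $N-1$ provided that $x_n(r)$ is not identically
zero.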
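The equivalence between the frequency-domain definition and the time-domain center of energy follows from Parseval's relation and can be confirmed numerically. The sketch below is illustrative; both function names are hypothetical. The frequency-domain version multiplies by the conjugate spectrum instead of dividing, which avoids the small-denominator problem entirely.

```python
import numpy as np

def energy_weighted_delay_freq(x):
    """Energy-weighted group delay evaluated from the DFT."""
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    X = np.fft.fft(x)
    X_tilde = np.fft.fft(r * x)
    # |X|^2 * Re(X_tilde / X) equals Re(conj(X) * X_tilde)
    numerator = np.sum(np.real(np.conj(X) * X_tilde))
    return numerator / np.sum(np.abs(X) ** 2)

def energy_weighted_delay_time(x):
    """Center of energy of the windowed segment, computed directly."""
    x = np.asarray(x, dtype=float)
    r = np.arange(len(x))
    return np.dot(r, x ** 2) / np.sum(x ** 2)

rng = np.random.default_rng(1)
x = rng.standard_normal(101)
d_freq = energy_weighted_delay_freq(x)
d_time = energy_weighted_delay_time(x)
```

The two values agree to rounding error, and the time-domain form needs no Fourier transform at all.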
D. Energy-Weighted Phase
Equation (8) may be viewed as a weighted average of $r$ using
$x_n^2(r)$ as the weighting factors. An alternative way of averaging
is to associate the sample positions within the window with
complex numbers of the form $e^{j2\pi(r+0.5)/N}$, evenly
spaced around the unit circle on the complex plane. To form
the energy-weighted phase, we take a weighted average of these
complex numbers using $x_n^2(r)$ as the weighting factors and then
multiply its argument by $N/2\pi$ to convert back to a delay. This
gives
$$d_{EP}(n) = \frac{N}{2\pi}\arg\!\left(\sum_{r=0}^{N-1} x_n^2(r)\,e^{j2\pi(r+0.5)/N}\right) - 0.5 \tag{9}$$
where $0 \le \arg(\cdot) < 2\pi$. The discontinuity in $\arg(\cdot)$ has been
chosen to lie midway between the complex numbers associated
with $r = 0$ and $r = N-1$. It is clear from (9) that $d_{EP}(n)$
always lies in the range $-0.5$ to $N-0.5$. A measure similar to $d_{EP}$
was used in [2] for aligning waveform segments in a speech
synthesis system. The relationship to the energy-weighted group
delay as described above and the noise immunity described in
Section III-B provide useful new insights into the properties of
this measure.
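A hedged sketch of the energy-weighted phase follows. The exact form of (9) is not legible in this copy, so the half-sample offset used here is an assumption, chosen so that the discontinuity of the argument falls on the positive real axis, midway between the first and last sample positions, and so that the window center maps to the negative real axis as the text describes. The function name is hypothetical.

```python
import numpy as np

def energy_weighted_phase(x):
    """Energy-weighted phase delay (assumed form, see lead-in)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = np.arange(N)
    # Sample positions mapped to points on the unit circle, offset by half
    # a sample so that angle 0 lies between r = N-1 and r = 0 (assumption)
    z = np.sum(x ** 2 * np.exp(2j * np.pi * (r + 0.5) / N))
    phase = np.angle(z) % (2 * np.pi)  # argument taken in [0, 2*pi)
    return N * phase / (2 * np.pi) - 0.5

early = np.zeros(64)
early[17] = 1.0
late = np.zeros(64)
late[50] = 1.0
d_early = energy_weighted_phase(early)
d_late = energy_weighted_phase(late)
```

Under this assumed form a single impulse is located exactly, and the result always stays within half a sample of the window limits.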
III. PROPERTIES OF GROUP DELAY MEASURES
In Section IV we will use the delay measures defined above
to identify the excitation instants in the LPC residual from real
speech. In this Section however, we gain insight into their prop-
erties by examining their behavior with synthetic signals that
consist of impulses with additive white Gaussian noise. The
properties that we observe are consistent with those reported in
[23], [25] but we extend the study here to include an analysis
of multiple impulses and a quantitative comparison between the
different measures.
A. Effect of Window Length
An idealized version of the LPC residual waveform is shown
as $u(r)$ in Fig. 2(a) and consists of an impulse train with additive
white Gaussian noise at 10 dB SNR. The dominant pulse period
is 100 samples with an additional pulse in the fourth period and
with the amplitude of the third pulse half that of the others.
It is convenient to shift the time-origin of the sliding window,
$n$ in (1), to its central point by defining
$$\bar{d}_X(n) = d_X\!\left(n - \tfrac{N-1}{2}\right) - \tfrac{N-1}{2} \tag{10}$$
where $X$ is one of $\{AV, ZF, EW, EP\}$. Note that if $N$ is even,
$\bar{d}_X(n)$ is defined for values of $n$ midway between the integers
since the argument of $d_X$ must always be an integer.
Fig. 2(b)–(e) shows the waveform of
for four dif-
ferent values of window length,
, where is chosen to
be a symmetric Hamming window of period
. The effect of

Fig. 2. (a) Impulse train with a dominant period of 100 samples and an SNR
of 10 dB. (b)–(e) The waveform of d for different window lengths, N. The
circles mark the negative-going zero crossings (NZCs).
varying the window length is broadly similar for all measures,
so we will discuss it in detail only for
.
All four measures from Section II give the correct result for a
noise-free impulse; i.e., if
then .
All the measures also possess a form of shift invariance so that
if
and then
(11)
and so the graph of
has a gradient of under these cir-
cumstances. Although these conditions do not quite hold in this
example because of the added noise, they are almost true when
an impulse is near the center of the window and
does not
exceed the impulse period. For these cases therefore, we see in
Fig. 2(b) and (c) that
has a negative-going zero crossing
(NZC) with a gradient of approximately
whenever an im-
pulse is present at
. Each NZC is marked with a circle.
In Fig. 2(c), the window size equals the period
resulting in a clearly dened NZC for each impulse without the
introduction of any spurious NZCs. However when the window
size is much less than the period as in Fig. 2(b), there are in-
tervals between each impulse where the window contains only
noise. In these intervals the delay waveform is almost flat and numerous
spurious NZCs are introduced. The local gradient at these spurious
NZCs is close to 0 rather than $-1$
and this provides a possible
way of identifying them.
As the window size is increased, it becomes common for
two or more impulses to lie within the window and individual
impulses may no longer be resolved. Thus in Fig. 2(d) where
, we see that the two impulses that are closest to-
gether (40 samples separation) have resulted in a single NZC
approximately midway between them. As the window length is
increased further in Fig. 2(e), each impulse now contains only a
small fraction of the energy in the window. This means that the
amplitude of the
waveform is low and the timing accu-
racy with which impulse locations can be identied degrades. In
this example, the low amplitude third impulse contains so little
energy compared to other nearby pulses that it fails to generate
an NZC at all.
The example of Fig. 2 therefore illustrates the way in which
the ability of
to detect impulses depends on the ratio of
the window length to the input signal period. As we shall see
in Section IV the choice of window length is a compromise: a
window that is too short will introduce many spurious NZCs
while a window that is too long may result in failure to detect
some of the true GCIs.
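The sliding-window procedure and the search for negative-going zero crossings can be sketched as below. This is an illustration under several assumptions: it uses the center-of-energy measure as the example, restricts the window length to odd values so that the window center falls on a sample, uses NumPy's `hamming` as the window, and omits the noise so that the outcome is deterministic. The function names are hypothetical.

```python
import numpy as np

def centred_energy_delay(u, N):
    """Center-of-energy delay measured from the window center (odd N only)."""
    u = np.asarray(u, dtype=float)
    w = np.hamming(N)
    half = (N - 1) // 2
    offsets = np.arange(N) - half        # sample offsets from the center
    d = np.full(len(u), np.nan)          # NaN where the window does not fit
    for n in range(half, len(u) - half):
        energy = (w * u[n - half:n + half + 1]) ** 2
        total = energy.sum()
        if total > 0:
            d[n] = np.dot(offsets, energy) / total
    return d

def negative_zero_crossings(d):
    """Indices n where d[n] >= 0 and d[n + 1] < 0."""
    return np.flatnonzero((d[:-1] >= 0) & (d[1:] < 0))

# Noise-free impulse train with a period of 100 samples
u = np.zeros(600)
u[100::100] = 1.0
d = centred_energy_delay(u, N=101)
nzc = negative_zero_crossings(d)
```

With a window roughly one period long, each impulse should produce one crossing at its own position, and between crossings the waveform falls by one per sample, matching the gradient of $-1$ expected from shift invariance.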
Fig. 3. Variation of $d_{AV}$, $d_{ZF}$, $d_{EW}$ and $d_{EP}$ as the signal-to-noise ratio
(SNR) varies from $-30$ to $+30$ dB for an input consisting of a single impulse
at $n = 20$ with additive white Gaussian noise in a window length of $N = 101$.
For each measure, the graph shows the median value of $d$ and the upper and
lower quartiles.
B. Robustness to Noise
To assess the effect of noise on the delay measures, we have
applied them to a signal consisting of a single impulse with
additive white Gaussian noise. Fig. 3 shows the behavior of each
measure as the SNR is varied from $-30$ dB to $+30$ dB for an
impulse at sample $n = 20$ within a rectangular window of length
$N = 101$. For each measure, the corresponding graph shows
the median value of $d$ and the upper and lower quartiles. We
use the median rather than the mean because of the unbounded
values sometimes generated by $d_{AV}$ and $d_{ZF}$. At an SNR of
$+30$ dB all measures correctly give $d = 20$ with a very small
inter-quartile range. As the SNR is reduced all measures show
an increasing spread and a progressive bias with the median
values tending to 50, the center of the window. The most robust
measure is $d_{EP}$ whose median value is barely affected by noise
until the SNR falls below
. For this measure, the effect of
the noise is to add onto the summation in (9) a random complex
number of arbitrary phase. It follows that the noise will not af-
fect the median value of
unless the noise amplitude is large
enough to cause the value of the summation to cross the positive
real axis where there is a discontinuity in the
function.
For impulses near the centre of the window, the summation in
(9) lies on or near the negative real axis and so for positive SNR
values, the noise has little effect on the median of
.
The measure whose median is most sensitive to noise is $d_{EW}$
for which the effects are noticeable in Fig. 3 for SNRs as high as
14 dB. Since this measure calculates the center of energy of the
windowed signal, the bias introduced depends directly on the
SNR and at an SNR of 0 dB, for example, $d_{EW}$ will be halfway
between the impulse position and the window center. The median
curves for $d_{AV}$ and $d_{ZF}$ are almost identical to each other and lie
between those of the other two measures with significant bias only
for SNRs worse than 5 dB. Although low levels of noise have little effect
on the median value of
, they have a substantial effect on
its inter-quartile range which is considerably larger than that of
the other measures.
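The halfway figure quoted for the center-of-energy measure at 0 dB can be motivated by a short worked approximation. It assumes, as an interpretation not stated explicitly in the text, that the SNR is the ratio of the impulse energy $E_s$ to the total noise energy $E_n$ within the window, that the noise energy is spread evenly so that its own center of energy is the window center $c$, and that cross terms between impulse and noise average out. Here $r_0$ denotes the impulse position.

```latex
\begin{align*}
d &\approx \frac{E_s\,r_0 + E_n\,c}{E_s + E_n}
   = c + \frac{\mathrm{SNR}}{1 + \mathrm{SNR}}\,(r_0 - c),
   \qquad \mathrm{SNR} = \frac{E_s}{E_n} \\[4pt]
\text{At } 0~\text{dB } (\mathrm{SNR} = 1):\quad
d &\approx \frac{r_0 + c}{2}
   = \frac{20 + 50}{2} = 35
   \quad \text{for } r_0 = 20,\; c = 50 .
\end{align*}
```

Under the same assumptions the bias shrinks to about 4% of the distance $r_0 - c$ at 14 dB, which is in line with the effect being just noticeable at that level.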
When noise is added to an impulse train like that in Fig. 2(a)
the NZCs are affected in two ways. Firstly, the bias toward the
window center means that
is pulled toward zero either side
of the NZC and so its gradient will be less steep. It is possible,

Fig. 4. Graph shows, as a function of SNR, how far an impulse must be from
the center of a 101 sample window to ensure that $d_{AV}$, $d_{ZF}$, $d_{EW}$ and
$d_{EP}$ have the correct sign with a probability of 75%.
therefore, to use the gradient of at an NZC to estimate the
SNR of the signal. The second effect is that the combination of
the bias and the increased variance will add uncertainty to the
position of the NZC. Fig. 4 shows, as a function of SNR, how
far an impulse must be from the center of a 101 sample window
for the upper or lower quartile to lie exactly at the center of the
window, i.e., how far the impulse must be from the center for
to have a probability of 0.75 of having the correct sign.
We can view this as a measure of how accurately the position of
the impulse will be located and of how this accuracy degrades
with noise. The algorithms attain a precision of 5 samples (5%
of the window length) with 75% probability at SNR levels of
11.9,
, and for the , , and
measures, respectively. This indicates that the timing of
the NZCs is least affected by noise when using
and is most
affected when using
.
C. Response to Multiple Impulses
It is possible for the analysis window to contain multiple im-
pulses either because the window is longer than the pulse period
or because, as is often the case with the LPC residual, the signal
includes additional pulses or other impulsive features. We con-
sider here the behavior of the measures when the window con-
tains two impulses. From the shift invariance property, (11), we
may, without loss of generality take the impulses to be at posi-
tions
giving
(12)
where the factor
lies in the range 0 to 1 and determines the
relative amplitude of the two impulses. We can evaluate the four
measures analytically (see Appendix) to obtain the following
exact results. It is convenient to express them in terms of
which ranges from 0 to and is the negative of the
ratio of the impulse magnitudes
(13)
Fig. 5. Values of $d_{AV}$, $d_{ZF}$, $d_{EW}$ and $d_{EP}$ for a signal containing impulses
at samples 0 and 40 of amplitudes $1 - a$ and $a$, respectively. The window length
is 101 and $a$ varies between 0 and 1.
where denotes the greatest common divisor and
the equation for
should be regarded as modulo with
. Fig. 5 plots the expressions from
(13) versus
for the particular case of and .
As
varies from 0 to 1 all the measures change from to
. Measure equals the center of gravity of the
pair of impulses and it therefore changes linearly with
. Mea-
sure
on the other hand, which equals the center of gravity
of the squared input signal, is biassed toward the position of the
larger impulse giving rise to the S-shaped curve shown. In the
expression for
, the exponent of depends on
and is, for this case, equal to 101. Because this is so high,
makes an extremely abrupt transition at and this
measure essentially locates the position of the highest peak in
the window. It is possible to obtain a similar behavior for
or by increasing the exponent of in (8) or (9) but we
have found that this does not improve their performance with
real speech and so we do not discuss the resultant measures in
detail. The behavior of
varies according to the separation
of the two impulses. When they are close to each other it is
almost the same as
but as their separation increases to
half the window length its graph approaches that of
.For
separations greater than
the graph changes completely and
as
increases from 0, decreases toward , wrapping
around abruptly to
then continuing down to .
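The contrast between the linear and the S-shaped responses can be reproduced for the configuration of Fig. 5. The sketch below is illustrative only: it assumes a rectangular window, places impulses of amplitude $1-a$ and $a$ at samples 0 and 40 of a 101-sample window, and evaluates the center of gravity of the signal and of the squared signal. The helper names are hypothetical.

```python
import numpy as np

def two_impulses(a, separation=40, N=101):
    """Impulses of amplitude 1-a at sample 0 and a at sample `separation`."""
    x = np.zeros(N)
    x[0] = 1.0 - a
    x[separation] = a
    return x

def centre_of_gravity(x):
    r = np.arange(len(x))
    return np.dot(r, x) / np.sum(x)

def centre_of_energy(x):
    r = np.arange(len(x))
    return np.dot(r, x ** 2) / np.sum(x ** 2)

amplitudes = [0.0, 0.25, 0.5, 0.75, 1.0]
gravity = [centre_of_gravity(two_impulses(a)) for a in amplitudes]
energy = [centre_of_energy(two_impulses(a)) for a in amplitudes]
```

For these inputs the center of gravity is $40a$, a straight line, while the center of energy is $40a^2/((1-a)^2 + a^2)$, which stays close to the larger impulse and crosses the midpoint only at $a = 0.5$.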
IV. EVALUATION WITH SPEECH SIGNALS
The four measures defined in Section II have been
evaluated using the sentence subset of the APLAWD database [30]
recorded anechoically at a sample rate of 20 kHz with a
lip-to-microphone distance of 15 cm. The database includes a
Laryngograph channel which provides a direct measurement of glottal
activity [5], [20] and allows the instants of glottal closure to be
determined using the HQTx program from the Speech Filing
System software suite [31], [32]. The database includes ten
repetitions from each of ten British English speakers (five male,
five female) of the following sentences:
S1: George made the girl measure a good blue vase;
S2: Why are you early you owl?
S3: Cathy hears a voice amongst SPAR's data;
S4: Be sure to fetch a file and send theirs off to Hove;
S5: Six plus three equals nine;

References
Book

Digital Processing of Speech Signals

TL;DR: This paper presents a meta-modelling framework for digital Speech Processing for Man-Machine Communication by Voice that automates the very labor-intensive and therefore time-heavy and expensive process of encoding and decoding speech.
Journal ArticleDOI

Digital processing of speech signals

Journal ArticleDOI

The sliding DFT

TL;DR: The sliding DFT process for spectrum analysis was presented and shown to be more efficient than the popular Goertzel (1958) algorithm for sample-by-sample DFT bin computations and a modified slide DFT structure is proposed that provides improved computational efficiency.
Journal ArticleDOI

Least squares glottal inverse filtering from the acoustic speech waveform

TL;DR: Based on a linear model of speech production, it is shown that both the moment of glottal closure and opening can be determined from the normalized total squared error with proper choices of analysis window length and filter order.
Journal ArticleDOI

Modeling of the glottal flow derivative waveform with application to speaker identification

TL;DR: An automatic technique for estimating and modeling the glottal flow derivative source waveform from speech, and applying the model parameters to speaker identification, is presented.
Frequently Asked Questions (12)
Q1. What are the contributions in "A quantitative assessment of group delay methods for identifying glottal closures in voiced speech" ?

Measures based on the group delay of the LPC residual have been used by a number of authors to identify the time instants of glottal closure in voiced speech. In this paper, the authors discuss the theoretical properties of three such measures and they also present a new measure having useful properties. The authors show that some improvement in detection rate may be obtained if the analysis window length is adapted to the speech pitch. If the measures are applied to the preemphasized speech instead of to the LPC residual, the authors find that the timing accuracy worsens but the detection rate improves slightly. The authors assess the computational cost of evaluating the measures and they present new recursive algorithms that give a substantial reduction in computation in all cases. The authors find that when using a fixed-length analysis window, the best measures can detect the instant of glottal closure in 97 % of larynx cycles with a standard deviation of 0. 

The authors define the identification rate of a measure to be the fraction of larynx cycles that contain exactly one NZC and the detection rate to be the fraction that contain either one or two NZCs. 

As the window length is increased the accuracy steadily worsens but the identification rate improves and reaches a peak of over 90% at a window length of 10 ms. 

To take a specific example, the measure is identified by circles and the authors see from the first point on the graph that for a 4 ms window, its identification accuracy is 0.34 ms but its identification rate is only 36%. 

For this example, the standard deviation of these “closest” NZCs is 0.97 ms and if the authors combine these with the single-NZC cycles, the authors can detect the GCI in over 97% of larynx cycles with a standard deviation of 0.6 ms. 

The authors have shown how the computational cost of all the measures can be reduced greatly by calculating them recursively provided that a suitable window function is used. 

Of the remaining 12% of larynx cycles, over three quarters contain exactly two NZCs; in most cases these occur at glottal opening and closure, respectively, giving rise to the histogram shown in Fig. 8(b). 

The and measures again show the best performance and reach a detection rate of 97.1% for window lengths of 8 ms and 7 ms, respectively.