scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

A Data-Driven Weighted LP Method for Formant Estimation

TL;DR: In this article, a data-driven approach is proposed to estimate the formant of the vocal tract using an all-pole filter, whose spectral peaks are taken as the formants estimates.
Abstract: In this paper we propose an algorithm for formant estimation using a data-driven approach. The vocal tract is typically modeled as an all-pole filter, whose spectral peaks are taken as the formant estimates. Estimating the filter parameters by minimizing the 12-norm of the residual signal leads to pitchlocking tendency. This is avoided if the 11-norm criterion is used, but the solution is computationally burdensome. Minimizing the weighted 12-norm provides a good compromise between the pitchlocking tendency and computation, but requires the knowledge of the Glottal Closure Instants (GCIs). In our method knowledge of the GCI is not required; instead, we adjust the weighting sequence parameters in a data-driven manner. Our results on both synthetic as well as natural voiced-speech show that our method is superior to LPC; when compared with SWLP, our method gives estimates with lower variance.
References
More filters
Journal ArticleDOI
TL;DR: The Fundamentals of Statistical Signal Processing: Estimation Theory as mentioned in this paper is a seminal work in the field of statistical signal processing, and it has been used extensively in many applications.
Abstract: (1995). Fundamentals of Statistical Signal Processing: Estimation Theory. Technometrics: Vol. 37, No. 4, pp. 465-466.

14,342 citations

Journal ArticleDOI
John Makhoul1
01 Apr 1975
TL;DR: This paper gives an exposition of linear prediction in the analysis of discrete signals as a linear combination of its past values and present and past values of a hypothetical input to a system whose output is the given signal.
Abstract: This paper gives an exposition of linear prediction in the analysis of discrete signals The signal is modeled as a linear combination of its past values and present and past values of a hypothetical input to a system whose output is the given signal In the frequency domain, this is equivalent to modeling the signal spectrum by a pole-zero spectrum The major part of the paper is devoted to all-pole models The model parameters are obtained by a least squares analysis in the time domain Two methods result, depending on whether the signal is assumed to be stationary or nonstationary The same results are then derived in the frequency domain The resulting spectral matching formulation allows for the modeling of selected portions of a spectrum, for arbitrary spectral shaping in the frequency domain, and for the modeling of continuous as well as discrete spectra This also leads to a discussion of the advantages and disadvantages of the least squares error criterion A spectral interpretation is given to the normalized minimum prediction error Applications of the normalized error are given, including the determination of an "optimal" number of poles The use of linear prediction in data compression is reviewed For purposes of transmission, particular attention is given to the quantization and encoding of the reflection (or partial correlation) coefficients Finally, a brief introduction to pole-zero modeling is given

4,206 citations

Dataset
01 Jan 1993
TL;DR: The TIMIT corpus as mentioned in this paper contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, including time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance.
Abstract: The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.

2,096 citations

Journal ArticleDOI
TL;DR: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds, based on the well-known autocorrelation method with a number of modifications that combine to prevent errors.
Abstract: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited for high-pitched voices and music. The algorithm is relatively simple and may be implemented efficiently and with low latency, and it involves few parameters that must be tuned. It is based on a signal model (periodic signal) that may be extended in several ways to handle various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.

1,975 citations

Journal ArticleDOI
TL;DR: Analysis of the formant data shows numerous differences between the present data and those of PB, both in terms of average frequencies of F1 and F2, and the degree of overlap among adjacent vowels.
Abstract: This study was designed as a replication and extension of the classic study of vowel acoustics by Peterson and Barney (PB) [J. Acoust. Soc. Am. 24, 175–184 (1952)]. Recordings were made of 50 men, 50 women, and 50 children producing the vowels /i, i, eh, ae, hooked backward eh, inverted vee), a, open oh, u, u/ in h–V–d syllables. Formant contours for F1–F4 were measured from LPC spectra using a custom interactive editing tool. For comparison with the PB data, formant patterns were sampled at a time that was judged by visual inspection to be maximally steady. Preliminary analysis shows numerous differences between the present data and those of PB, both in terms of average formant frequencies for vowels, and the degree of overlap among adjacent vowels. As with the original study, listening tests showed that the signals were nearly always identified as the vowel intended by the talker.

1,891 citations