Proceedings ArticleDOI

A parametric approach to vocal tract length normalization

07 May 1996 - Vol. 1, pp 346-348
TL;DR: A parametric method of normalisation is described which counteracts the effect of varied vocal tract length and is shown to be effective across a wide range of recognition systems and paradigms, but is particularly helpful in the case of a small amount of training data.
Abstract: Differences in vocal tract size among individual speakers contribute to the variability of speech waveforms. The first-order effect of a difference in vocal tract length is a scaling of the frequency axis; a female speaker, for example, exhibits formants roughly 20% higher than those of a male speaker, with the differences most severe in open vocal tract configurations. We describe a parametric method of normalisation which counteracts the effect of varied vocal tract length. The method is shown to be effective across a wide range of recognition systems and paradigms, but is particularly helpful in the case of a small amount of training data.
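
To make the frequency-axis scaling concrete, here is a minimal illustrative sketch, not the paper's implementation: the helper name, the interpolation choice, and the example warp factor are assumptions. It warps one power-spectrum frame by a factor alpha before any further feature extraction.

```python
# Minimal sketch of linear frequency warping (hypothetical helper, not the
# authors' code): the first-order effect of a shorter vocal tract is a
# stretched frequency axis, so normalization rescales that axis by alpha.
import numpy as np

def warp_spectrum(power_spectrum, alpha):
    """Resample a power spectrum onto a frequency axis scaled by alpha.

    alpha > 1 shifts spectral content downward (e.g. to map a speaker whose
    formants sit roughly 20% higher toward a reference speaker); alpha < 1
    does the opposite.
    """
    n = len(power_spectrum)
    bins = np.arange(n)
    # Evaluate the original spectrum at the warped bin positions alpha * f.
    warped_bins = np.clip(alpha * bins, 0, n - 1)
    return np.interp(warped_bins, bins, power_spectrum)

# Example: normalize one frame with a 12% warp.
frame = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
normalized = warp_spectrum(frame, alpha=1.12)
```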
Citations
Proceedings ArticleDOI
03 Oct 1996
TL;DR: A novel approach to estimating the parameters of continuous density HMMs for speaker-independent (SI) continuous speech recognition that jointly annihilates the inter-speaker variation and estimates the HMM parameters of the SI acoustic models.
Abstract: We formulate a novel approach to estimating the parameters of continuous density HMMs for speaker-independent (SI) continuous speech recognition. It is motivated by the fact that variability in SI acoustic models is attributed to both phonetic variation and variation among the speakers of the training population, which is independent of the information content of the speech signal. These two variation sources are decoupled and the proposed method jointly annihilates the inter-speaker variation and estimates the HMM parameters of the SI acoustic models. We compare the proposed training algorithm to the common SI training paradigm within the context of supervised adaptation. We show that the proposed acoustic models are more efficiently adapted to the test speakers, thus achieving significant overall word error rate reductions of 19% and 25% for 20K and 5K vocabulary tasks, respectively.
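
The abstract describes decoupling inter-speaker variation from the shared acoustic model. The toy sketch below is my own illustration of that general idea, not this paper's algorithm: a single shared Gaussian stands in for the HMM, and training alternates between picking a per-speaker scale factor by maximum likelihood and re-estimating the shared model from the normalized data. All numbers and names are assumptions for the example.

```python
# Toy sketch (illustrative only) of jointly removing inter-speaker variation
# and estimating a shared acoustic model: alternate between (a) choosing, per
# speaker, the scale factor that maximizes the likelihood of that speaker's
# data under the current shared Gaussian, and (b) re-estimating the Gaussian
# from the rescaled data.
import numpy as np

rng = np.random.default_rng(0)
true_scales = [0.9, 1.0, 1.15]                      # hidden per-speaker factors
data = {s: rng.normal(1000 * a, 50, 200) for s, a in enumerate(true_scales)}

grid = np.linspace(0.8, 1.25, 46)                   # candidate warp factors
mean, var = 1000.0, 50.0 ** 2                       # shared model initialization

for _ in range(5):
    # (a) per-speaker ML warp under the current shared model
    scales = {}
    for s, x in data.items():
        ll = [-np.sum((x / a - mean) ** 2) / (2 * var) for a in grid]
        scales[s] = grid[int(np.argmax(ll))]
    # (b) re-estimate the shared model from the normalized (rescaled) data
    pooled = np.concatenate([x / scales[s] for s, x in data.items()])
    mean, var = pooled.mean(), pooled.var()

print({s: round(a, 3) for s, a in scales.items()}, round(mean, 1))
```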

586 citations


Cites methods from "A parametric approach to vocal trac..."

  • ...In [2] a parametric model of vocal tract length normalization reduces the inter-speaker variability of the acoustic space by appropriately warping the frequency axis for each training speaker prior to computing the cepstral coefficients....


Journal ArticleDOI
TL;DR: Current advances in automatic speech recognition (ASR) and spoken language systems are outlined, along with remaining deficiencies in dealing with the variation naturally present in speech.

507 citations


Cites background or methods from "A parametric approach to vocal trac..."

  • ...A direct application of the tube resonator model of the vocal tract led to the different vocal tract length normalization (VTLN) techniques: speaker-dependent formant mapping (Di Benedetto and Liénard, 1992; Wakita, 1977), transformation of the LPC pole modeling (Slifka and Anderson, 1995), frequency warping, either linear (Eide and Gish, 1996; Lee and Rose, 1996; Tuerk and Robinson, 1993; Zhan and Westphal, 1997) or non-linear (Ono et al....


  • ...The estimation of the VTL factor can either be performed by a maximum likelihood approach (Lee and Rose, 1996; Zhan and Waibel, 1997) or from a direct estimation of the formant positions (Eide and Gish, 1996; Lincoln et al., 1997).... (a sketch of the formant-based estimate follows this list)


  • ...…Benedetto and Liénard, 1992; Wakita, 1977), transformation of the LPC pole modeling (Slifka and Anderson, 1995), frequency warping, either linear (Eide and Gish, 1996; Lee and Rose, 1996; Tuerk and Robinson, 1993; Zhan and Westphal, 1997) or non-linear (Ono et al., 1993), all consist of…...


  • ...Note that VTLN is often combined with an adaptation of the acoustic model to the canonical speaker (Eide and Gish, 1996; Lee and Rose, 1996) (cf. Section 4.2.1)....

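As referenced in the list above, the warp factor can be estimated directly from formant positions rather than by a likelihood search. The sketch below is a hedged illustration of that formant-based route; the choice of the third formant, the use of a median, and the reference value are assumptions made for the example, not numbers taken from the paper.

```python
# Hedged sketch of a formant-based warp-factor estimate: a speaker whose
# formants sit above a population reference gets alpha > 1, i.e. the
# frequency axis is compressed before feature extraction.
import numpy as np

REFERENCE_F3_HZ = 2600.0   # assumed population reference for this sketch

def warp_factor_from_formants(f3_measurements_hz):
    """Ratio of a speaker's median F3 to the reference F3."""
    return float(np.median(f3_measurements_hz)) / REFERENCE_F3_HZ

# Example: a speaker whose F3 samples cluster around 2900 Hz.
alpha = warp_factor_from_formants([2850.0, 2910.0, 2930.0, 2880.0])
print(f"estimated warp factor: {alpha:.3f}")
```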

Journal ArticleDOI
TL;DR: An efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis are presented.
Abstract: In an effort to reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, a frequency warping approach to speaker normalization is investigated. A set of low complexity, maximum likelihood based frequency warping procedures have been applied to speaker normalization for a telephone based connected digit recognition task. This paper presents an efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis. An experimental study comparing these techniques to other well-known techniques for reducing variability is described. The results have shown that frequency warping is consistently able to reduce word error rate by 20% even for very short utterances.
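
The filterbank-modification idea described above can be sketched as follows. This is an illustrative reconstruction under my own assumptions (triangular mel filters, edge scaling by the warp factor, and the parameter values shown), not the paper's implementation: the band edges of the mel filters are warped before the filterbank is built, so no resampling of the signal is needed.

```python
# Minimal sketch of frequency warping via the mel filterbank: scale the
# linear-frequency edges of the triangular filters by alpha before placing
# the triangles. Names and values are illustrative assumptions.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_mel_filterbank(n_filters, n_fft, sample_rate, alpha):
    """Triangular mel filterbank whose band edges are warped by alpha."""
    nyquist = sample_rate / 2.0
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(nyquist), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    # Linear warp of the band edges; clip so warped edges stay below Nyquist.
    hz_edges = np.clip(hz_edges * alpha, 0.0, nyquist)
    bins = np.floor((n_fft + 1) * hz_edges / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[i, k] = (hi - k) / max(hi - mid, 1)
    return fbank

# Example: 23 filters for 8 kHz telephone-bandwidth speech, 10% warp.
fb = warped_mel_filterbank(n_filters=23, n_fft=512, sample_rate=8000, alpha=1.10)
```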

338 citations

Journal ArticleDOI
TL;DR: The working group producing this article was charged to elicit from the human language technology community a set of well-considered directions or rich areas for future research that could lead to major paradigm shifts in the field of automatic speech recognition (ASR) and understanding.
Abstract: To advance research, it is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past. The working group producing this article was charged to elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research that could lead to major paradigm shifts in the field of automatic speech recognition (ASR) and understanding. ASR has been an area of great interest and activity to the signal processing and HLT communities over the past several decades. As a first step, this group reviewed major developments in the field and the circumstances that led to their success and then focused on areas it deemed especially fertile for future research. Part 1 of this article will focus on historically significant developments in the ASR area, including several major research efforts that were guided by different funding agencies, and suggest general areas in which to focus research.

244 citations


Cites background from "A parametric approach to vocal trac..."

  • ...As in the past, we expect that further research and development will enable us to create increasingly powerful systems, deployable on a worldwide basis....


01 Jan 1998
TL;DR: It is argued and demonstrated empirically that the articulatory feature approach can improve the robustness of speech recognition systems in adverse acoustic environments by enhancing the accuracy of the bottom-up acoustic modeling component.
Abstract: Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. The goal of this thesis is to investigate the benefits of integrating articulatory information into state-of-the-art speech recognizers, either as a genuine alternative to standard acoustic representations, or as an additional source of information. Articulatory information is represented in terms of abstract articulatory classes or "features", which are extracted from the speech signal by means of statistical classifiers. A higher-level classifier then combines the scores for these features and maps them to standard subword unit probabilities. The main motivation for this approach is to improve the robustness of speech recognition systems in adverse acoustic environments, such as background noise. Typically, recognition systems show a sharp decline of performance under these conditions. We argue and demonstrate empirically that the articulatory feature approach can lead to greater robustness by enhancing the accuracy of the bottom-up acoustic modeling component in a speech recognition system. The second focus point of this thesis is to provide detailed analyses of the different types of information provided by the acoustic and the articulatory representations, respectively, and to develop strategies to optimally combine them. To this effect we investigate combination methods at the levels of feature extraction, subword unit probability estimation, and word recognition. The feasibility of this approach is demonstrated with respect to two different speech recognition tasks. The first of these is an American English corpus of telephone-bandwidth speech; the recognition domain is continuous numbers. The second is a German database of studio-quality speech consisting of spontaneous dialogues. In both cases recognition performance will be tested not only under clean acoustic conditions but also under deteriorated conditions.
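
One of the combination strategies the abstract mentions, merging streams at the subword-unit probability level, can be sketched as below. This is a hedged illustration only: the log-linear combination, the stream weight, and the toy score arrays are assumptions, not details taken from the thesis.

```python
# Hedged sketch of probability-level stream combination: per-frame subword-unit
# scores from an acoustic model and an articulatory-feature model are merged
# with a weighted log-linear combination before decoding.
import numpy as np

def combine_streams(log_p_acoustic, log_p_articulatory, weight=0.5):
    """Weighted log-linear combination of two per-frame score matrices.

    Both inputs have shape (n_frames, n_subword_units); the result can be fed
    to a standard Viterbi decoder in place of single-stream scores.
    """
    return weight * log_p_acoustic + (1.0 - weight) * log_p_articulatory

# Example with random stand-in scores for 10 frames and 40 subword units.
rng = np.random.default_rng(1)
acoustic = np.log(rng.dirichlet(np.ones(40), size=10))
articulatory = np.log(rng.dirichlet(np.ones(40), size=10))
combined = combine_streams(acoustic, articulatory, weight=0.6)
```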

221 citations


Cites background or methods from "A parametric approach to vocal trac..."

  • ...In [36], a parametric approach is suggested which eliminates much of the computational overhead associated with the exhaustive search for the optimal scaling factor....


  • ...In VTN [67, 36] a scaling factor α is applied to the preprocessed speech signal to achieve a linear frequency warping, ω ↦ αω....


References
Book
01 Jan 1973
TL;DR: Speech Sounds and Features is the fourth volume in the series Current Studies in Linguistics and follows the development over the past 15 years of research presented in the author's previous publications on speech analysis, feature theory, and applications to language descriptions.
Abstract: This is a representative collection of the work of one of the world's leading scholars in the area of speech acoustics. It follows the development over the past 15 years of research presented in the author's previous publications on speech analysis, feature theory, and applications to language descriptions. Most of the articles have had very restricted distribution--many appearing only in the Quarterly Progress Reports issued by Dr. Fant's laboratory. The first part of the book covers manifold aspects of speech analysis such as instrumental techniques, spectrum data, formant statistics with an emphasis on Swedish vowels and stops, speaker dependencies, normalization procedures, production theory, and coarticulation. The second part analyzes established feature systems and suggests revisions and general discussions on the concept of distinctive features, perception, and the applicability of feature theory to automatic speech recognition. Articles in this part of the book are especially valuable because they represent Dr. Fant's work on "inherent" features and their phonetic correlates as it has evolved since his collaboration in 1952 with Roman Jakobson and Morris Halle. Speech Sounds and Features is the fourth volume in the series Current Studies in Linguistics.

498 citations

Journal ArticleDOI
TL;DR: In the SWITCHBOARD corpus, an attempt was made to compensate for the systematic variability due to the different vocal tract lengths of various speakers by warping the spectrum of each speaker linearly over a 20% range and finding the maximum a posteriori probability of the data given the warp.
Abstract: The performance of speech recognition systems is often improved by accounting explicitly for sources of variability in the data. In the SWITCHBOARD corpus, studied during the 1994 CAIP workshop [Frontiers in Speech Processing Workshop II, CAIP (August 1994)], an attempt was made to compensate for the systematic variability due to different vocal tract lengths of various speakers. The method found a maximum probability parameter for each speaker which mapped an acoustic model to the mean of the models taken from a homogeneous speaker population. The underlying acoustic model was that of a straight tube, and the parameter estimation was accomplished by warping the spectrum of each speaker linearly over a 20% range (actually accomplished by digitally resampling the data), and finding the maximum a posteriori probability of the data given the warp. The technique produces statistically significant improvements in accuracy on a speech transcription task using each of four different speech recognition systems. The best parametrizations were later found to correlate well with vocal tract estimates computed manually from spectrograms.
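
The abstract notes that the linear warp was realized by digitally resampling the data and searching for the most probable warp. The sketch below illustrates that general recipe under stated assumptions: interpolation-based resampling stands in for a proper resampler, and the scoring function is a placeholder, where the reference uses the probability of the data under the recognizer's acoustic model.

```python
# Minimal sketch of warping by resampling plus a grid search over candidate
# warp factors within roughly a 20% range. The scoring function is a stand-in;
# a real system would score the warped data with its acoustic model.
import numpy as np

def resample_waveform(x, alpha):
    """Interpolation-based resampling; alpha > 1 stretches the signal in time,
    compressing its spectrum (lowering formants)."""
    old_idx = np.arange(len(x))
    new_idx = np.arange(0, len(x) - 1, 1.0 / alpha)
    return np.interp(new_idx, old_idx, x)

def pick_best_warp(x, score_fn, warp_range=0.20, n_candidates=9):
    """Grid-search warp factors within +/- warp_range and keep the best score."""
    candidates = np.linspace(1.0 - warp_range, 1.0 + warp_range, n_candidates)
    scores = [score_fn(resample_waveform(x, a)) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Example with a toy score (illustrative only): prefer warps that reduce
# high-frequency energy in the resampled signal.
rng = np.random.default_rng(2)
signal = rng.standard_normal(16000)
toy_score = lambda y: -np.abs(np.fft.rfft(y))[200:].sum()
best_alpha = pick_best_warp(signal, toy_score)
```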

103 citations