Journal ArticleDOI

Maximum likelihood modelling of pronunciation variation

01 Nov 1999 - Speech Communication (North-Holland) - Vol. 29, Issue 2, pp. 177-191
TL;DR: A maximum likelihood based algorithm for fully automatic, data-driven modelling of pronunciation: given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, pronunciation variants are derived within a consistent framework for optimising automatic speech recognition systems.
About: This article was published in Speech Communication on 1999-11-01 and has received 63 citations to date.
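The TL;DR above describes a likelihood-based selection of pronunciation variants from acoustic data. As a rough, hypothetical sketch of that idea (not the paper's actual implementation), the following picks, for a word, the candidate subword-unit sequence that maximizes the total log-likelihood of the word's acoustic tokens under the given subword HMMs; the scoring function stands in for a Viterbi or forward pass and is assumed here.

```python
# Hypothetical sketch of maximum likelihood pronunciation selection:
# given subword HMMs and several acoustic tokens of a word, pick the
# candidate subword-unit sequence with the highest total log-likelihood.
# `score_sequence` stands in for a Viterbi/forward pass over the
# concatenated subword HMMs; it is assumed here, not taken from the paper.
from typing import Callable, List, Sequence

import numpy as np


def select_pronunciation(
    candidates: Sequence[List[str]],
    tokens: Sequence[np.ndarray],
    score_sequence: Callable[[List[str], np.ndarray], float],
) -> List[str]:
    """Return the candidate variant maximizing the summed log-likelihood over tokens."""
    best_variant: List[str] = []
    best_score = float("-inf")
    for variant in candidates:
        total = sum(score_sequence(variant, token) for token in tokens)
        if total > best_score:
            best_variant, best_score = list(variant), total
    return best_variant
```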
Citations
Journal ArticleDOI
TL;DR: This contribution provides an overview of the publications on pronunciation variation modeling in automatic speech recognition, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'.

259 citations


Cites background from "Maximum likelihood modelling of pronunciation variation"

  • ...…and Waibel, 1997; Fosler-Lussier and Morgan, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Heine et al., 1998; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Lehtinen and Safra, 1998;…...


  • ...…variants leads to the largest gain in performance (Cremelie and Martens, 1995, 1997, 1998, 1999; Fukada et al., 1998, 1999; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lehtinen and Safra, 1998; Mokbel and Jouvet,…...


  • ...Given that almost all ASR systems use a lexicon, within-word variation is modeled in the majority of the methods (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Beulen et al., 1998; Blackburn and Young, 1995, 1996; Bonaventura et al., 1998; Cohen and Mercer, 1975; Cremelie and Martens, 1995, 1997, 1998, 1999; Ferreiros et al., 1998; Finke and Waibel, 1997; Fosler-Lussier and Morgan, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Heine et al., 1998; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Mirghafori et al., 1995; Mokbel and Jouvet, 1998; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Ristad and Yianilos, 1998; Schiel et al., 1998; Sloboda and Waibel, 1996; Svendsen et al., 1995; Torre et al., 1997; Wester et al., 1998a; Williams and Renals, 1998; Zeppenfeld et al., 1997)....


  • ...In knowledge-based studies, information on pronunciation variation is primarily derived from sources that are already available (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Bonaventura et al., 1998; Cohen and Mercer, 1975; Downey and Wiseman, 1997; Ferreiros et al., 1998; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Kipp et al., 1996; Kipp et al., 1997; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Mouria-Beji, 1998; Nock and Young, 1998; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Roach and Arnfield, 1998; Safra et al., 1998; Schiel et al., 1998; Wesenick, 1996; Wester et al., 1998a; Wiseman and Downey, 1998; Zeppenfeld et al., 1997)....


  • ...Another component in which pronunciation variation can be taken into account is the language model (LM) (Cremelie and Martens, 1995, 1997, 1998, 1999; Deshmukh et al., 1996; Finke and Waibel, 1997; Fukada et al., 1998, 1999; Kessens et al., 1999; Lehtinen and Safra, 1998; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Schiel et al., 1998; Wester et al., 1998a; Zeppenfeld et al., 1997)....


01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Maximum likelihood modelling of pronunciation variation"

  • ...The problem of modeling cross-word contextual variations is addressed in [17], and multiple pronunciation dictionary design is covered in [18]....


01 Jan 1999
TL;DR: This dissertation examines how pronunciations vary in spontaneous speech, how speaking rate and word predictability can be used to predict when greater pronunciation variation is to be expected, and suggests that it may be appropriate to build models for syllables and words that dynamically change the pronunciation used in the speech recognizer based on the extended context.
Abstract: As of this writing, the automatic recognition of spontaneous speech by computer is fraught with errors; many systems transcribe one out of every three to five words incorrectly, whereas humans can transcribe spontaneous speech with one error in twenty words or better. This high error rate is due in part to the poor modeling of pronunciations within spontaneous speech. This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected. It includes an investigation of the relationship between speaking rate, word predictability, pronunciations, and errors made by speech recognition systems. The results of these studies suggest that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciations used in the speech recognizer based on the extended context (including surrounding words, phones, speaking rate, etc.). Implementation of new pronunciation models automatically derived from data within the ICSI speech recognition system has shown a 4–5% relative improvement on the Broadcast News recognition task. Roughly two thirds of these gains can be attributed to static baseform improvements; adding the ability to dynamically adjust pronunciations within the recognizer provides the other third of the improvement. The Broadcast News task also allows for comparison of performance on different styles of speech: the new pronunciation models do not help for pre-planned speech, but they provide a significant gain for spontaneous speech. Not only do the automatically learned pronunciation models capture some of the linguistic variation due to the speaking style, but they also represent variation in the acoustic model due to channel effects. The largest improvement was seen in the telephone speech condition, in which 12% of the errors produced by the baseline system were corrected.
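One way to picture the "dynamic pronunciation" idea in this dissertation is a selector that conditions the variant choice on extended context such as speaking rate and word predictability. The toy rule below is purely illustrative; the thresholds, fields, and phone strings are invented and not taken from the thesis.

```python
# Hypothetical illustration only: a toy rule that prefers a reduced
# pronunciation variant when speech is fast and the word is highly
# predictable, mirroring the idea of conditioning variant choice on
# extended context. Thresholds and fields are invented for this sketch.
from dataclasses import dataclass
from typing import List


@dataclass
class Context:
    speaking_rate: float      # e.g. syllables per second
    log_word_prob: float      # language-model log probability of the word


def choose_variant(canonical: List[str], reduced: List[str], ctx: Context) -> List[str]:
    """Pick the reduced variant for fast, predictable words; otherwise the canonical one."""
    if ctx.speaking_rate > 5.0 and ctx.log_word_prob > -2.0:
        return reduced
    return canonical


print(choose_variant(["ae", "n", "d"], ["n"], Context(speaking_rate=6.2, log_word_prob=-0.8)))
```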

87 citations

Journal ArticleDOI
TL;DR: This work constructs a Markov chain Monte Carlo (MCMC) sampling scheme, where sampling from all the posterior probability distributions is very easy and has been tested in extensive computer simulations on finite discrete-valued observed data.
Abstract: Hidden Markov models (HMMs) represent a very important tool for analysis of signals and systems. In the past two decades, HMMs have attracted the attention of various research communities, including the ones in statistics, engineering, and mathematics. Their extensive use in signal processing and, in particular, speech processing is well documented. A major weakness of conventional HMMs is their inflexibility in modeling state durations. This weakness can be avoided by adopting a more complicated class of HMMs known as nonstationary HMMs. We analyze nonstationary HMMs whose state transition probabilities are functions of time that indirectly model state durations by a given probability mass function and whose observation spaces are discrete. The objective of our work is to estimate all the unknowns of a nonstationary HMM, which include its parameters and the state sequence. To that end, we construct a Markov chain Monte Carlo (MCMC) sampling scheme, where sampling from all the posterior probability distributions is very easy. The proposed MCMC sampling scheme has been tested in extensive computer simulations on finite discrete-valued observed data, and some of the simulation results are presented.
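To make the MCMC idea concrete, here is a deliberately simplified sketch: one sweep of single-site Gibbs sampling over the hidden state sequence of an ordinary stationary, discrete-observation HMM with fixed parameters. The paper's scheme is richer (nonstationary transition probabilities modelling state durations, and sampling of the parameters as well), so treat this only as an illustration of the state-sampling step.

```python
# Simplified sketch (standard stationary discrete HMM, not the paper's
# nonstationary model): one sweep of single-site Gibbs sampling for the
# hidden state sequence given fixed parameters. In the full scheme the
# transition, emission, and duration-related parameters would also be
# resampled from their posteriors inside the same MCMC loop.
import numpy as np


def gibbs_sweep_states(states, obs, pi, A, B, rng):
    """Resample each hidden state given its neighbours and its observation."""
    T, K = len(obs), len(pi)
    for t in range(T):
        left = pi if t == 0 else A[states[t - 1]]          # p(s_t | s_{t-1})
        right = np.ones(K) if t == T - 1 else A[:, states[t + 1]]  # p(s_{t+1} | s_t)
        probs = left * right * B[:, obs[t]]                 # times emission likelihood
        probs /= probs.sum()
        states[t] = rng.choice(K, p=probs)
    return states


rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])   # 2 states, 2 discrete symbols
obs = np.array([0, 0, 1, 1, 1, 0])
states = rng.integers(0, 2, size=len(obs))
print(gibbs_sweep_states(states, obs, pi, A, B, rng))
```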

83 citations

Journal ArticleDOI
TL;DR: This work uses an English phoneme recogniser to generate English pronunciations for German words and uses these to train decision trees that are able to predict the respective English-accented variant from the German canonical transcription, and combines this approach with online, incremental weighted MLLR speaker adaptation.
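A toy sketch of the decision-tree component described in this TL;DR: a classifier maps a canonical phone and its immediate context to an accented variant. The phone symbols and training pairs below are invented for illustration; a real system would train on alignments between recogniser output and canonical transcriptions, and the MLLR adaptation part is not shown.

```python
# Toy, hypothetical sketch of the decision-tree idea: predict an accented
# phone variant from the canonical phone and its left/right context.
# Phones and training pairs are invented for this illustration.
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# (left context, canonical phone, right context) -> observed accented phone
pairs = [
    (("sil", "r", "a"), "R"),
    (("a", "r", "t"), "R"),
    (("sil", "v", "a"), "w"),
    (("a", "t", "sil"), "t"),
]

phone_enc = LabelEncoder().fit([p for ctx, _ in pairs for p in ctx])
X = [phone_enc.transform(ctx) for ctx, _ in pairs]
y = [variant for _, variant in pairs]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([phone_enc.transform(("sil", "r", "a"))]))
```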

83 citations

References
Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.
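The core of this codebook design procedure is the generalized Lloyd iteration: partition the training data by nearest codeword, then re-estimate each codeword as the centroid of its cell. The sketch below uses plain squared-error distortion for brevity, whereas the paper allows far more general distortion measures such as the LPC measure mentioned in the abstract.

```python
# Minimal sketch of the generalized Lloyd iteration for VQ codebook
# design, using squared-error distortion on a training sequence.
import numpy as np


def lloyd_vq(train: np.ndarray, codebook: np.ndarray, iters: int = 20) -> np.ndarray:
    """Alternate nearest-neighbour partitioning and centroid updates."""
    for _ in range(iters):
        # Assign each training vector to its nearest codeword.
        d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Replace each codeword by the centroid of its cell (if non-empty).
        for k in range(len(codebook)):
            cell = train[labels == k]
            if len(cell):
                codebook[k] = cell.mean(axis=0)
    return codebook


rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
init = data[rng.choice(len(data), size=4, replace=False)].copy()
print(lloyd_vq(data, init))
```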

7,935 citations

Book
01 Jan 1971
Problem Solving Methods in Artificial Intelligence (textbook; no abstract indexed).

1,431 citations

Journal ArticleDOI
TL;DR: This review traces the early work on the development of speech synthesizers, discovery of minimal acoustic cues for phonetic contrasts, evolution of phonemic rule programs, incorporation of prosodic rules, and formulation of techniques for text analysis.
Abstract: The automatic conversion of English text to synthetic speech is presently being performed, remarkably well, by a number of laboratory systems and commercial devices. Progress in this area has been made possible by advances in linguistic theory, acoustic-phonetic characterization of English sound patterns, perceptual psychology, mathematical modeling of speech production, structured programming, and computer hardware design. This review traces the early work on the development of speech synthesizers, discovery of minimal acoustic cues for phonetic contrasts, evolution of phonemic rule programs, incorporation of prosodic rules, and formulation of techniques for text analysis. Examples of rules are used liberally to illustrate the state of the art. Many of the examples are taken from Klattalk, a text-to-speech system developed by the author. A number of scientific problems are identified that prevent current systems from achieving the goal of completely human-sounding speech. While the emphasis is on rule programs that drive a formant synthesizer, alternatives such as articulatory synthesis and waveform concatenation are also reviewed. An extensive bibliography has been assembled to show both the breadth of synthesis activity and the wealth of phenomena covered by rules in the best of these programs. A recording of selected examples of the historical development of synthetic speech, enclosed as a 33 1/3-rpm record, is described in the Appendix.

843 citations

Proceedings ArticleDOI
23 May 1989
TL;DR: The authors present two simple tests for deciding whether the difference in error rates between two algorithms tested on the same data set is statistically significant.
Abstract: The authors present two simple tests for deciding whether the difference in error rates between two algorithms tested on the same data set is statistically significant. The first (McNemar's test) requires the errors made by an algorithm to be independent events and is found to be most appropriate for isolated-word algorithms. The second (a matched-pairs test) can be used even when errors are not independent events and is more appropriate for connected speech.
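For the isolated-word case, McNemar's test reduces to a binomial test on the discordant pairs (tokens where exactly one of the two systems errs). A small illustrative sketch, with invented error indicators:

```python
# Sketch of McNemar's (exact, binomial) test on paired error indicators
# from two recognisers scored on the same tokens: only discordant pairs
# matter. The data here is invented for illustration.
from scipy.stats import binomtest

# 1 = error, 0 = correct, per test token, for systems A and B.
errors_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
errors_b = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1]

n01 = sum(a == 0 and b == 1 for a, b in zip(errors_a, errors_b))  # only B errs
n10 = sum(a == 1 and b == 0 for a, b in zip(errors_a, errors_b))  # only A errs

# Under H0 (equal error rates) each discordant pair is a fair coin flip.
result = binomtest(n10, n10 + n01, p=0.5)
print(n10, n01, result.pvalue)
```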

715 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition.
Abstract: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.

393 citations