Journal ArticleDOI

Maximum likelihood modelling of pronunciation variation

01 Nov 1999 - Speech Communication (North-Holland) - Vol. 29, Iss. 2, pp. 177-191
TL;DR: A maximum likelihood based algorithm for fully automatic, data-driven modelling of pronunciation is presented, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, creating a consistent framework for the optimisation of automatic speech recognition systems.
Abstract: This paper addresses the problem of generating lexical word representations that properly represent natural pronunciation variations for the purpose of improved speech recognition accuracy. In order to create a consistent framework for optimisation of automatic speech recognition systems, we present a maximum likelihood based algorithm for fully automatic data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word. We also propose an extension of this formulation in order to achieve optimal modelling of pronunciation variations. Since different words will not in general exhibit the same amount of pronunciation variation, the procedure allows words to be represented by a different number of baseforms. The methods improve the subword description of the vocabulary words and have been shown to improve recognition performance on the DARPA Resource Management task.
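As a rough illustration of the idea (not the paper's exact formulation), the sketch below assumes that the log-likelihood of each acoustic token under each candidate baseform's concatenated subword HMMs has already been computed by an HMM decoder, and then selects baseforms by maximising the summed log-likelihood, adding further baseforms greedily while the gain exceeds a threshold. The scoring dictionary and the fixed-gain stopping rule are invented for illustration; the paper chooses the number of baseforms per word via its own optimality criterion.

```python
# Hypothetical sketch of maximum likelihood baseform selection.
# `loglik[b][t]` is assumed to hold the log-likelihood of acoustic token t
# under the concatenated subword HMMs of candidate baseform b; computing it
# would require an HMM toolkit and is not shown here.

def best_single_baseform(loglik):
    """Pick the candidate baseform that maximises the summed log-likelihood
    of all acoustic tokens of the word (single-baseform case)."""
    return max(loglik, key=lambda b: sum(loglik[b]))

def select_baseforms(loglik, min_gain=1.0):
    """Greedy multi-baseform selection: each token is scored by the best
    baseform in the current set; keep adding baseforms while the total
    log-likelihood gain exceeds `min_gain` (an assumed stopping rule,
    standing in for the paper's optimality criterion)."""
    n_tokens = len(next(iter(loglik.values())))

    def total(chosen_set):
        return sum(max(loglik[b][t] for b in chosen_set) for t in range(n_tokens))

    chosen = [best_single_baseform(loglik)]
    current = total(chosen)
    remaining = set(loglik) - set(chosen)
    while remaining:
        cand = max(remaining, key=lambda b: total(chosen + [b]))
        gain = total(chosen + [cand]) - current
        if gain < min_gain:
            break
        chosen.append(cand)
        current += gain
        remaining.remove(cand)
    return chosen

# Toy scores for three candidate baseforms of one word and four tokens.
scores = {
    "d ay r e k t": [-110.0, -95.0, -130.0, -120.0],
    "d e r e k t":  [-105.0, -112.0, -98.0, -125.0],
    "d r e k t":    [-140.0, -135.0, -128.0, -99.0],
}
print(select_baseforms(scores))
```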
Citations
Journal ArticleDOI
TL;DR: This contribution provides an overview of the publications on pronunciation variation modeling in automatic speech recognition, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'.
Abstract: The focus in automatic speech recognition (ASR) research has gradually shifted from isolated words to conversational speech. Consequently, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not well accounted for. This is probably the main reason why research on modeling pronunciation variation for ASR has increased lately. In this contribution, we provide an overview of the publications on this topic, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop' (the ESCA Tutorial and Research Workshop "Modeling pronunciation variation for ASR", held in Rolduc from 4 to 6 May 1998; this special issue of Speech Communication contains a selection of papers presented at that workshop). First, the most important characteristics that distinguish the various studies on pronunciation variation modeling are discussed. Subsequently, the issues of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out.

259 citations


Cites background from "Maximum likelihood modelling of pro..."

  • ...…and Waibel, 1997; Fosler-Lussier and Morgan, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Heine et al., 1998; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Lehtinen and Safra, 1998;…...


  • ...…variants leads to the largest gain in performance (Cremelie and Martens, 1995, 1997, 1998, 1999; Fukada et al., 1998, 1999; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lehtinen and Safra, 1998; Mokbel and Jouvet,…...


  • ...Given that almost all ASR systems use a lexicon, within-word variation is modeled in the majority of the methods (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Beulen et al., 1998; Blackburn and Young, 1995, 1996; Bonaventura et al., 1998; Cohen and Mercer, 1975; Cremelie and Martens, 1995, 1997, 1998, 1999; Ferreiros et al., 1998; Finke and Waibel, 1997; Fosler-Lussier and Morgan, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Heine et al., 1998; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Mirghafori et al., 1995; Mokbel and Jouvet, 1998; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Ristad and Yianilos, 1998; Schiel et al., 1998; Sloboda and Waibel, 1996; Svendsen et al., 1995; Torre et al., 1997; Wester et al., 1998a; Williams and Renals, 1998; Zeppenfeld et al., 1997)....


  • ...In knowledge-based studies, information on pronunciation variation is primarily derived from sources that are already available (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Bonaventura et al., 1998; Cohen and Mercer, 1975; Downey and Wiseman, 1997; Ferreiros et al., 1998; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Kipp et al., 1996; Kipp et al., 1997; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Mouria-Beji, 1998; Nock and Young, 1998; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Roach and Arnfield, 1998; Safra et al., 1998; Schiel et al., 1998; Wesenick, 1996; Wester et al., 1998a; Wiseman and Downey, 1998; Zeppenfeld et al., 1997)....


  • ...Another component in which pronunciation variation can be taken into account is the language model (LM) (Cremelie and Martens, 1995, 1997, 1998, 1999; Deshmukh et al., 1996; Finke and Waibel, 1997; Fukada et al., 1998, 1999; Kessens et al., 1999; Lehtinen and Safra, 1998; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Schiel et al., 1998; Wester et al., 1998a; Zeppenfeld et al., 1997)....


01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.

151 citations


Cites background from "Maximum likelihood modelling of pro..."

  • ...The problem of modeling cross-word contextual variations is addressed in [17], and multiple pronunciation dictionary design is covered in [18]....

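To make the "finer-grained control" argument of the abstract above concrete, the toy sketch below (not from the paper) represents phones as bundles of binary distinctive features, so that common variations can be described as a change to a single feature rather than as the substitution of one opaque phoneme symbol for another. The feature inventory shown is deliberately minimal and invented for illustration.

```python
# Illustrative only: phones as bundles of binary distinctive features
# instead of atomic phoneme symbols.
FEATURES = {
    "t": {"voice": 0, "coronal": 1, "continuant": 0, "nasal": 0},
    "d": {"voice": 1, "coronal": 1, "continuant": 0, "nasal": 0},
    "n": {"voice": 1, "coronal": 1, "continuant": 0, "nasal": 1},
    "s": {"voice": 0, "coronal": 1, "continuant": 1, "nasal": 0},
}

def feature_distance(a, b):
    """Number of features that differ between two phones; with atomic
    phonemes, /t/ vs /d/ and /t/ vs /n/ would both just be 'different'."""
    return sum(FEATURES[a][f] != FEATURES[b][f] for f in FEATURES[a])

print(feature_distance("t", "d"))  # 1: differs only in voicing
print(feature_distance("t", "n"))  # 2: differs in voicing and nasality
```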

01 Jan 1999
TL;DR: This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected, and suggests that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciation used in the speech recognizer based on the extended context.
Abstract: As of this writing, the automatic recognition of spontaneous speech by computer is fraught with errors; many systems transcribe one out of every three to five words incorrectly, whereas humans can transcribe spontaneous speech with one error in twenty words or better. This high error rate is due in part to the poor modeling of pronunciations within spontaneous speech. This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected. It includes an investigation of the relationship between speaking rate, word predictability, pronunciations, and errors made by speech recognition systems. The results of these studies suggest that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciations used in the speech recognizer based on the extended context (including surrounding words, phones, speaking rate, etc.). Implementation of new pronunciation models automatically derived from data within the ICSI speech recognition system has shown a 4–5% relative improvement on the Broadcast News recognition task. Roughly two thirds of these gains can be attributed to static baseform improvements; adding the ability to dynamically adjust pronunciations within the recognizer provides the other third of the improvement. The Broadcast News task also allows for comparison of performance on different styles of speech: the new pronunciation models do not help for pre-planned speech, but they provide a significant gain for spontaneous speech. Not only do the automatically learned pronunciation models capture some of the linguistic variation due to the speaking style, but they also represent variation in the acoustic model due to channel effects. The largest improvement was seen in the telephone speech condition, in which 12% of the errors produced by the baseline system were corrected.

87 citations
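A minimal, purely hypothetical sketch of the dynamic idea described in the abstract above: pick a reduced baseform when the locally measured speaking rate is high. The word list, variants and the syllables-per-second threshold are invented; the dissertation conditions on a much richer extended context (surrounding words, phones, word predictability, etc.).

```python
# Hypothetical illustration of context-dependent pronunciation selection:
# choose a reduced baseform when the measured speaking rate is high.
# Words, variants and the threshold are invented for illustration.
VARIANTS = {
    "because": {"canonical": "b iy k ah z", "reduced": "k ah z"},
    "probably": {"canonical": "p r aa b ax b l iy", "reduced": "p r aa b l iy"},
}

def choose_pronunciation(word, syllables_per_second, fast_threshold=5.0):
    forms = VARIANTS.get(word)
    if forms is None:
        return None
    key = "reduced" if syllables_per_second > fast_threshold else "canonical"
    return forms[key]

print(choose_pronunciation("because", 6.2))   # fast speech -> reduced form
print(choose_pronunciation("probably", 3.8))  # slower speech -> canonical form
```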

Journal ArticleDOI
TL;DR: This work constructs a Markov chain Monte Carlo (MCMC) sampling scheme in which sampling from all the posterior probability distributions is very easy; the scheme has been tested in extensive computer simulations on finite discrete-valued observed data.
Abstract: Hidden Markov models (HMMs) represent a very important tool for analysis of signals and systems. In the past two decades, HMMs have attracted the attention of various research communities, including the ones in statistics, engineering, and mathematics. Their extensive use in signal processing and, in particular, speech processing is well documented. A major weakness of conventional HMMs is their inflexibility in modeling state durations. This weakness can be avoided by adopting a more complicated class of HMMs known as nonstationary HMMs. We analyze nonstationary HMMs whose state transition probabilities are functions of time that indirectly model state durations by a given probability mass function and whose observation spaces are discrete. The objective of our work is to estimate all the unknowns of a nonstationary HMM, which include its parameters and the state sequence. To that end, we construct a Markov chain Monte Carlo (MCMC) sampling scheme, where sampling from all the posterior probability distributions is very easy. The proposed MCMC sampling scheme has been tested in extensive computer simulations on finite discrete-valued observed data, and some of the simulation results are presented.

83 citations
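As a hedged illustration of the MCMC approach, the sketch below implements a Gibbs sampler for the much simpler stationary discrete-observation HMM: it alternates forward-filtering/backward-sampling of the state sequence with Dirichlet draws for the rows of the transition and emission matrices. The nonstationary case treated in the paper, with time-dependent transition probabilities encoding state durations, is not implemented here, and the initial state distribution is kept fixed for brevity.

```python
# Simplified Gibbs sampler for a stationary discrete HMM (the paper's
# nonstationary case would replace the fixed matrix A by A(t)).
import numpy as np

rng = np.random.default_rng(0)

def sample_states(obs, pi, A, B):
    """Forward filtering, backward sampling of one state sequence."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    states = np.zeros(T, dtype=int)
    states[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * A[:, states[t + 1]]
        states[t] = rng.choice(K, p=w / w.sum())
    return states

def gibbs(obs, K, V, iters=200):
    A = rng.dirichlet(np.ones(K), size=K)   # transition matrix
    B = rng.dirichlet(np.ones(V), size=K)   # emission matrix
    pi = np.ones(K) / K                     # kept fixed for brevity
    obs = np.asarray(obs)
    for _ in range(iters):
        s = sample_states(obs, pi, A, B)
        # Conjugate Dirichlet updates from transition / emission counts.
        for k in range(K):
            trans = np.bincount(s[1:][s[:-1] == k], minlength=K)
            emit = np.bincount(obs[s == k], minlength=V)
            A[k] = rng.dirichlet(1 + trans)
            B[k] = rng.dirichlet(1 + emit)
    return A, B, s

obs = [0, 0, 1, 2, 2, 2, 1, 0, 0, 1, 2, 2]
A, B, states = gibbs(obs, K=2, V=3)
print(np.round(A, 2), np.round(B, 2), states)
```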

Journal ArticleDOI
TL;DR: This work uses an English phoneme recogniser to generate English pronunciations for German words and uses these to train decision trees that are able to predict the respective English-accented variant from the German canonical transcription, and combines this approach with online, incremental weighted MLLR speaker adaptation.
Abstract: Handling non-native speech in automatic speech recognition (ASR) systems is an area of increasing interest. The majority of systems are tailored to native speech only and as a consequence performance for non-native speakers often is not satisfactory. One way to approach the problem is to adapt the acoustic models to the new speaker. Another important means to improve performance for non-native speakers is to consider non-native pronunciations in the dictionary. The difficulty here lies in the generation of the non-native variants, especially if various accents are to be considered. Traditional approaches to model pronunciation variation either require phonetic expertise or extensive speech databases. They are too costly, especially if a flexible modelling of several accents is desired. We propose to exclusively use native speech databases to derive non-native pronunciation variants. We use an English phoneme recogniser to generate English pronunciations for German words and use these to train decision trees that are able to predict the respective English-accented variant from the German canonical transcription. Furthermore we combine this approach with online, incremental weighted MLLR speaker adaptation. Using the enhanced dictionary and the speaker adaptation alone improved the word error rate of the baseline system by 5.2% and 16.8%, respectively. When both methods were combined, we achieved an improvement of 18.2%.

83 citations
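The decision-tree step can be pictured with a toy example (the training pairs below are invented, not data from the paper): each phone of the canonical transcription, together with its immediate left and right neighbours, is mapped to the accented realisation predicted for it.

```python
# Toy sketch of the decision-tree idea: predict, for each canonical phone,
# the phone an accented speaker is likely to produce, using the phone and
# its immediate neighbours as features. Data below is invented.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder

# (left context, phone, right context) -> accented realisation
pairs = [
    (("#", "v", "a"), "w"),   # e.g. /v/ realised as an English-like /w/
    (("a", "v", "e"), "w"),
    (("#", "r", "o"), "r"),
    (("o", "r", "#"), "r"),
    (("#", "z", "i"), "s"),   # devoicing of /z/
    (("i", "z", "#"), "s"),
]
X = [list(ctx) for ctx, _ in pairs]
y = [lab for _, lab in pairs]

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tree = DecisionTreeClassifier().fit(enc.fit_transform(X), y)

# Predict the accented variant for an unseen context of /z/.
print(tree.predict(enc.transform([["a", "z", "e"]])))
```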

References
Journal ArticleDOI
TL;DR: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data.
Abstract: An efficient and intuitive algorithm is presented for the design of vector quantizers based either on a known probabilistic model or on a long training sequence of data. The basic properties of the algorithm are discussed and demonstrated by examples. Quite general distortion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of ten-dimensional vectors arising in Linear Predictive Coded (LPC) speech compression with a complicated distortion measure arising in LPC analysis that does not depend only on the error vector.

7,935 citations
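For orientation, here is a minimal sketch of the design loop this reference describes: start from the centroid of the training sequence, repeatedly split the codebook and run Lloyd iterations until the distortion stops improving. Squared-error distortion and the synthetic data are used only to keep the example short; the original algorithm admits far more general distortion measures, such as the one used for LPC parameter vectors.

```python
# Minimal LBG-style vector quantiser design from a training sequence,
# with codebook splitting and squared-error distortion.
import numpy as np

def lbg(train, codebook_size, eps=0.01, tol=1e-4):
    codebook = train.mean(axis=0, keepdims=True)
    while len(codebook) < codebook_size:
        # Split every codevector into a perturbed pair, then run Lloyd iterations.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            distortion = d[np.arange(len(train)), nearest].mean()
            for k in range(len(codebook)):
                members = train[nearest == k]
                if len(members):          # keep empty cells unchanged
                    codebook[k] = members.mean(axis=0)
            if prev - distortion < tol * max(distortion, 1e-12):
                break
            prev = distortion
    return codebook

rng = np.random.default_rng(1)
train = np.vstack([rng.normal(m, 0.3, size=(200, 2)) for m in (-2, 0, 2, 4)])
print(np.round(lbg(train, 4), 2))
```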

Book
01 Jan 1971
Problem-Solving Methods in Artificial Intelligence

1,431 citations

Journal ArticleDOI
TL;DR: This review traces the early work on the development of speech synthesizers, discovery of minimal acoustic cues for phonetic contrasts, evolution of phonemic rule programs, incorporation of prosodic rules, and formulation of techniques for text analysis.
Abstract: The automatic conversion of English text to synthetic speech is presently being performed, remarkably well, by a number of laboratory systems and commercial devices. Progress in this area has been made possible by advances in linguistic theory, acoustic-phonetic characterization of English sound patterns, perceptual psychology, mathematical modeling of speech production, structured programming, and computer hardware design. This review traces the early work on the development of speech synthesizers, discovery of minimal acoustic cues for phonetic contrasts, evolution of phonemic rule programs, incorporation of prosodic rules, and formulation of techniques for text analysis. Examples of rules are used liberally to illustrate the state of the art. Many of the examples are taken from Klattalk, a text-to-speech system developed by the author. A number of scientific problems are identified that prevent current systems from achieving the goal of completely human-sounding speech. While the emphasis is on rule programs that drive a formant synthesizer, alternatives such as articulatory synthesis and waveform concatenation are also reviewed. An extensive bibliography has been assembled to show both the breadth of synthesis activity and the wealth of phenomena covered by rules in the best of these programs. A recording of selected examples of the historical development of synthetic speech, enclosed as a 33 1/3-rpm record, is described in the Appendix.

843 citations

Proceedings ArticleDOI
23 May 1989
TL;DR: The authors present two simple tests for deciding whether the difference in error rates between two algorithms tested on the same data set is statistically significant.
Abstract: The authors present two simple tests for deciding whether the difference in error rates between two algorithms tested on the same data set is statistically significant. The first (McNemar's test) requires the errors made by an algorithm to be independent events and is found to be most appropriate for isolated-word algorithms. The second (a matched-pairs test) can be used even when errors are not independent events and is more appropriate for connected speech.

715 citations
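A small sketch of the first of these tests (McNemar's), applied to per-item correctness of two recognisers on the same test set; the 0/1 result lists below are made up. Only the items on which exactly one system errs are informative, and under the null hypothesis they split evenly between the two systems, so an exact binomial test applies.

```python
# McNemar's test on two recognisers evaluated on the same items,
# using an exact binomial test on the discordant pairs.
from scipy.stats import binomtest

# 1 = item recognised correctly, 0 = error; same items for both systems.
sys_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
sys_b = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

a_only = sum(1 for a, b in zip(sys_a, sys_b) if a == 1 and b == 0)  # A right, B wrong
b_only = sum(1 for a, b in zip(sys_a, sys_b) if a == 0 and b == 1)  # B right, A wrong

# Under the null hypothesis the discordant items split 50/50.
result = binomtest(a_only, a_only + b_only, p=0.5)
print(a_only, b_only, result.pvalue)
```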

Proceedings ArticleDOI
11 Apr 1988
TL;DR: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition.
Abstract: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21,000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.

393 citations